I implemented structured outputs for Claude that way here: https://github.com/simonw/llm-anthropic/blob/500d277e9b4bec6...
Nice to see them support it officially. OpenAI has officially supported this for a while, but, at least historically, I was unable to use it because their strict mode adds deterministic validation that errors on certain standard JSON Schema keywords we used. The lack of "official" support was, ironically, the feature that pushed us to use Claude in the first place.
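For reference, here's roughly what strict mode looks like with the OpenAI Python SDK (model name and schema are placeholders, and the exact set of unsupported keywords has changed over time):

    from openai import OpenAI

    client = OpenAI()

    # Strict structured outputs validate the schema itself up front.
    # Keywords strict mode doesn't support (historically things like
    # "format" or min/max length constraints) are rejected with an API
    # error before any generation happens, i.e. the failure mode above.
    schema = {
        "type": "object",
        "properties": {
            "name": {"type": "string"},
            "age": {"type": "integer"},
        },
        "required": ["name", "age"],    # strict mode: every key listed
        "additionalProperties": False,  # strict mode: must be False
    }

    resp = client.chat.completions.create(
        model="gpt-4o",  # any structured-outputs-capable model
        messages=[{"role": "user", "content": "Extract: Ada Lovelace, 36."}],
        response_format={
            "type": "json_schema",
            "json_schema": {"name": "person", "strict": True, "schema": schema},
        },
    )
    print(resp.choices[0].message.content)  # guaranteed to match the schema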
It's unclear to me that we will need "modes" for these features.
Another example: I used to think that I couldn't live without Claude Code's "plan mode". Then I used Codex and asked it to write a markdown file with a todo list. A bit more typing, but it works well, and it's nice to be able to edit the plan directly in the editor.
Agree or Disagree?
My favorite one is going through the plan interactively. It turns it into a multiple-choice / option TUI, and the last choice is always to re-prompt that section of the plan.
I switched back to Codex recently, and not being able to do my planning solely in the CLI feels like the early 1900s.
To trigger the interactive mode, do something like:
Plan a fix for:
<Problem statement>
Please walk me through any options or questions you might have interactively.
I would hope that this is not what OpenAI/Anthropic do under the hood, because otherwise, what if one of the strings needs a lot of \escapes? Is it also supposed to never write actual newlines in strings? It's awkward.
The ideal solution would be to have some special tokens like [object_start] [object_end] and [string_start] [string_end].
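To make the escaping concern concrete (plain Python, no API involved):

    import json

    # Any multi-line string emitted inside JSON must be escaped character
    # by character: newlines become \n, quotes become \", backslashes
    # double up. The model never writes a literal newline inside a string.
    code = 'print("hello")\n# a "quoted" comment\npath = "C:\\tmp"'
    print(json.dumps({"file": code}))
    # {"file": "print(\"hello\")\n# a \"quoted\" comment\npath = \"C:\\\\tmp\""}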
I think the new feature limits which tokens can be output at each step, which brings a guarantee, whereas tool definitions are only a suggestion.
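A toy sketch of that masking idea (real implementations compile the schema into a grammar over the tokenizer's vocabulary; this just shows the principle):

    import math

    def constrained_step(logits, vocab, allowed_tokens):
        # Zero out the probability of any token the grammar disallows,
        # then renormalize; sampling can never leave the schema.
        masked = [
            logit if tok in allowed_tokens else -math.inf
            for tok, logit in zip(vocab, logits)
        ]
        m = max(masked)
        exps = [math.exp(x - m) for x in masked]
        total = sum(exps)
        return [e / total for e in exps]

    # E.g. right after emitting '{', a JSON grammar might only allow
    # a closing brace or the opening quote of a key:
    vocab = ['"', '}', 'foo', '{']
    probs = constrained_step([1.0, 0.5, 3.0, 0.2], vocab, {'"', '}'})
    # 'foo' had the highest logit, but the mask guarantees it is never sampled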
Structured outputs are the most underappreciated LLM feature. If you're building anything except a chatbot, it's definitely worth familiarizing yourself with them.
They're not easy to use well, and there aren't many resources on the internet explaining how to get the most out of them.
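If you want a starting point, the SDK helpers are the gentlest one. A minimal sketch with the OpenAI Python SDK and Pydantic (model name is a placeholder):

    from openai import OpenAI
    from pydantic import BaseModel

    class Invoice(BaseModel):
        vendor: str
        total_cents: int
        line_items: list[str]

    client = OpenAI()

    # The SDK converts the Pydantic model to a strict JSON schema,
    # constrains generation to it, and parses the result back.
    completion = client.beta.chat.completions.parse(
        model="gpt-4o",
        messages=[{"role": "user", "content": "Extract the invoice: ..."}],
        response_format=Invoice,
    )
    invoice = completion.choices[0].message.parsed  # an Invoice instance
    print(invoice.total_cents)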
You could get this working very consistently with GPT-4 in mid-2023, the version before June, iirc. No JSON output mode, no tool-calling fine-tuning... just half a page of instructions and some string-matching code. (I built a little AI code-editing tool along these lines; see the sketch below.)
With the tool calling RL and structured outputs, I think the main benefit is peace of mind. You know you're going down the happy path, so there's one less thing to worry about.
Reliability is the final frontier!
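Roughly what that mid-2023 approach looked like (a hedged reconstruction, not the actual tool; call_model stands in for whatever client you use):

    import json
    import re

    INSTRUCTIONS = (
        "Respond with ONLY a JSON object inside a ```json fence, "
        'with keys "edits" (a list of strings) and "summary" (a string).'
    )

    def extract_json(reply: str) -> dict:
        # Pull the first fenced JSON block out of a free-form reply.
        match = re.search(r"```json\s*(.*?)```", reply, re.DOTALL)
        raw = match.group(1) if match else reply
        return json.loads(raw)

    def ask(call_model, user_prompt: str, retries: int = 3) -> dict:
        for _ in range(retries):
            reply = call_model(INSTRUCTIONS + "\n\n" + user_prompt)
            try:
                return extract_json(reply)
            except json.JSONDecodeError:
                continue  # string matching failed; just ask again
        raise RuntimeError("model never produced parseable JSON")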
But also, Gemini supports constrained generation, which can't fail to match a schema, so why not use that instead of prompting?
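For example, with the google-generativeai Python SDK (model name is a placeholder, and Gemini accepts a subset of OpenAPI-style schemas, so treat this as a sketch):

    import google.generativeai as genai

    genai.configure(api_key="...")
    model = genai.GenerativeModel("gemini-1.5-flash")

    # response_schema turns on constrained generation: decoding itself
    # is restricted so the output cannot fail to match the schema.
    resp = model.generate_content(
        "List three primes with a one-line fact about each.",
        generation_config=genai.GenerationConfig(
            response_mime_type="application/json",
            response_schema={
                "type": "array",
                "items": {
                    "type": "object",
                    "properties": {
                        "prime": {"type": "integer"},
                        "fact": {"type": "string"},
                    },
                    "required": ["prime", "fact"],
                },
            },
        ),
    )
    print(resp.text)  # valid JSON matching the schema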
In general, you can break any model by using a sampler that sometimes chooses bad enough tokens. I don't think it's well studied how well different models respond to this.
Even asking for JSON (without constrained sampling) sometimes degrades output, and even the names and order of keys can affect performance, or even act as a form of structured thinking.
At the end of the day, current models have enough problems with generalization that you should establish a baseline and move from there.
IMO this was the more elegant design if you think about it: tool calling is really just structured output, and structured output is really just tool calling. The "do not provide multiple ways of doing the same thing" philosophy.
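Concretely, the old trick (and what the llm-anthropic code linked above does, if I read it right) was to define a single tool whose input schema is the output schema you want, then force the model to call it. A sketch with the Anthropic SDK (model name illustrative):

    import anthropic

    client = anthropic.Anthropic()

    resp = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=1024,
        tools=[{
            "name": "record_person",
            "description": "Record the extracted person.",
            "input_schema": {
                "type": "object",
                "properties": {
                    "name": {"type": "string"},
                    "age": {"type": "integer"},
                },
                "required": ["name", "age"],
            },
        }],
        # Forcing the tool makes the tool input our structured output.
        tool_choice={"type": "tool", "name": "record_person"},
        messages=[{"role": "user", "content": "Extract: Ada Lovelace, 36."}],
    )
    print(resp.content[0].input)  # dict matching the schema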
I built a customized deep-research tool internally earlier this year. It's made up of multiple "agentic" steps, each focusing on specific information to find, and the output of each step is always JSON that becomes the input of the next step. Sure, you can work your way around failures by doing retries, but it's just one less thing to think about if you can guarantee that the raw LLM output adheres at least to some sort of structure.
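As a sketch of that step-chaining pattern (call_llm and the schemas are hypothetical stand-ins for whatever client you use):

    import json
    from jsonschema import ValidationError, validate

    def run_step(call_llm, prompt, schema, retries=2):
        # One agentic step: demand JSON matching `schema`, retry on failure.
        # With constrained generation the retry loop becomes dead code,
        # which is exactly the "one less thing to think about".
        for _ in range(retries + 1):
            raw = call_llm(prompt)  # hypothetical client wrapper
            try:
                data = json.loads(raw)
                validate(data, schema)
                return data
            except (json.JSONDecodeError, ValidationError):
                continue
        raise RuntimeError("step failed to produce schema-valid JSON")

    FINDINGS_SCHEMA = {
        "type": "object",
        "properties": {"findings": {"type": "array"}},
        "required": ["findings"],
    }
    # Output of one step feeds the next:
    # findings = run_step(call_llm, "Research X. Reply as JSON.", FINDINGS_SCHEMA)
    # summary = run_step(call_llm, f"Summarize: {json.dumps(findings)}", SUMMARY_SCHEMA)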