I feel like this is so core to any LLM automation that it's crazy Anthropic is only adding it now.

I built a customized deep research tool internally earlier this year that is made up of multiple "agentic" steps, each focusing on specific information to find. The outputs of those steps are always JSON and become the input for the next step. Sure, you can work your way around failures by doing retries, but it's just one less thing to think about if you can guarantee that the random LLM output adheres at least to some sort of structure.


Prior to this it was possible to get the same effect by defining a tool with the schema that you wanted and then telling the Anthropic API to always use that tool.

I implemented structured outputs for Claude that way here: https://github.com/simonw/llm-anthropic/blob/500d277e9b4bec6...
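
For anyone who hasn't seen that trick: you declare a tool whose input schema is the output you actually want, then force Claude to "call" it. A minimal sketch with the Anthropic Python SDK (the model name and schema here are illustrative):

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# A "tool" whose input schema is really the output schema we want back.
extraction_tool = {
    "name": "record_person",
    "description": "Record structured details about a person.",
    "input_schema": {
        "type": "object",
        "properties": {
            "name": {"type": "string"},
            "age": {"type": "integer"},
        },
        "required": ["name", "age"],
    },
}

response = client.messages.create(
    model="claude-3-5-sonnet-latest",  # illustrative model name
    max_tokens=1024,
    tools=[extraction_tool],
    # Force the model to "call" this tool, so the reply is always structured.
    tool_choice={"type": "tool", "name": "record_person"},
    messages=[{"role": "user", "content": "Ada Lovelace was 36 when she died."}],
)

# The structured data arrives as the forced tool call's input.
tool_use = next(block for block in response.content if block.type == "tool_use")
print(tool_use.input)  # e.g. {"name": "Ada Lovelace", "age": 36}
```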

We've been running structured outputs via Claude on Bedrock in production for a year now and it works great. Give it a JSON schema, inject a '{', and sometimes do a bit of custom parsing on the response. GG
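
The '{' injection is the assistant-prefill trick: you start the assistant turn yourself and the model continues from there. Roughly (a minimal sketch; the prompt and model name are illustrative):

```python
import json

import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-3-5-sonnet-latest",  # illustrative
    max_tokens=1024,
    messages=[
        {"role": "user", "content": 'Describe Ada Lovelace as {"name": ..., "age": ...}'},
        # Prefill: Claude continues this partial assistant turn.
        {"role": "assistant", "content": "{"},
    ],
)

# Re-attach the injected brace, then do the "bit of custom parsing":
# trim anything after the final closing brace before decoding.
raw = "{" + response.content[0].text
data = json.loads(raw[: raw.rfind("}") + 1])
```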

Nice to see them support it officially. OpenAI has officially supported this for a while, but at least historically I was unable to use it, because it adds deterministic validation that errors on certain standard JSON Schema elements we used. The lack of "official" support is the feature that pushed us to use Claude in the first place.

It's unclear to me that we will need "modes" for these features.

Another example: I used to think that I couldn't live without Claude Code's "plan mode". Then I used Codex and asked it to write a markdown file with a todo list. A bit more typing, but it works well, and it's nice to be able to edit the plan directly in the editor.

Agree or Disagree?

Before Claude Code shipped with plan mode, the workflow for using most coding agents was to have it create a `PLAN.md` and update/execute that plan. Planning mode was just a first class version of what users were already doing.
Claude Code keeps shipping a lot of really nice tools that, from what I've seen, others haven't started to emulate.

My favorite one is going through the plan interactively. It turns the plan into a multiple-choice / option TUI, and the last choice is always to reprompt that section of the plan.

I had to switch back to Codex recently, and not being able to do my planning solely in the CLI feels like the early 1900s.

To trigger the interactive mode, do something like:

Plan a fix for:

<Problem statement>

Please walk me through any options or questions you might have interactively.

> Give it a JSON schema, inject a '{', and sometimes do a bit of custom parsing on the response

I would hope that this is not what OpenAI/Anthropic do under the hood, because otherwise, what if one of the strings needs a lot of \escapes? Is it also supposed to never write actual newlines in strings? It's awkward.

The ideal solution would be to have some special tokens like [object_start] [object_end] and [string_start] [string_end].

I don't think the tool input schema thing does that inference-time trick. I think it just dumps the JSON schema into the context, and tells the model to conform to that schema.
Same, but it's a PITA when you also want to support tool calling at the same time. I had to do a double call: call once and check whether it will use tools; if not, call again and force the use of the (now injected) return-schema tool.
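
That double-call pattern looks roughly like this (a sketch against the Anthropic Python SDK; `real_tools`, `return_schema_tool`, `messages`, and `handle_tool_calls` are hypothetical names):

```python
import anthropic

client = anthropic.Anthropic()

# First call: let the model choose freely among the real tools.
first = client.messages.create(
    model="claude-3-5-sonnet-latest",  # illustrative
    max_tokens=1024,
    tools=real_tools,   # hypothetical: the tools the agent may actually call
    messages=messages,  # hypothetical: the conversation so far
)

if first.stop_reason == "tool_use":
    handle_tool_calls(first)  # hypothetical helper: execute the requested tools
else:
    # No tool call: ask again, forcing the injected "return schema" tool
    # so the final answer comes back as structured data.
    second = client.messages.create(
        model="claude-3-5-sonnet-latest",
        max_tokens=1024,
        tools=real_tools + [return_schema_tool],  # hypothetical schema tool
        tool_choice={"type": "tool", "name": return_schema_tool["name"]},
        messages=messages,
    )
    result = next(b for b in second.content if b.type == "tool_use").input
```
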
It's not a 100% success rate; I've had responses that didn't match my schema.

I think the new feature actually limits which tokens can be output, which brings a guarantee, whereas the tool schema is just a suggestion.

So, so much this.

Structured outputs are the most underappreciated LLM feature. If you're building anything except a chatbot, it's definitely worth familiarizing yourself with them.

They're not that easy to use well, and there aren't many resources on the internet explaining how to get the most out of them.

In Python, they're very easy to use. Define your schema with Pydantic and pass the class to your client calls. There are some details to know (eg field order can affect performance), but it's very easy overall. Other languages probably have something similar.
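
With the OpenAI Python SDK that looks roughly like this (a minimal sketch; the model name and schema are illustrative):

```python
from openai import OpenAI
from pydantic import BaseModel


class Person(BaseModel):
    name: str
    age: int


client = OpenAI()  # reads OPENAI_API_KEY from the environment

completion = client.beta.chat.completions.parse(
    model="gpt-4o-mini",  # illustrative
    messages=[{"role": "user", "content": "Ada Lovelace was 36 when she died."}],
    response_format=Person,  # the Pydantic class doubles as the JSON schema
)

person = completion.choices[0].message.parsed  # a validated Person instance
```
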
It's nice but I don't know how necessary it is.

You could get this working very consistently with GPT-4 in mid 2023. The version before June, iirc. No JSON output, no tool calling fine tuning... just half a page of instructions and some string matching code. (Built a little AI code editing tool along these lines.)

With the tool calling RL and structured outputs, I think the main benefit is peace of mind. You know you're going down the happy path, so there's one less thing to worry about.

Reliability is the final frontier!

Having used structured outputs pretty extensively for a while now, my impression is that newer models take less of a quality hit while conforming to a specific schema. Just giving instructions and output examples totally worked, but it came at a considerable cost in output quality; that effect seems to have diminished over time with models that have been more explicitly trained to produce structured output.
I have had fairly bad luck specifying the JSON Schema for my structured outputs with Gemini. It seems like describing the schema in natural language works much better, though I do admit to needing that retry hack at times. Do you have any tips on getting the most out of a schema definition?
For one, always have a top-level object.

But Gemini also supports constrained generation, which can't fail to match the schema, so why not use that instead of prompting? See the sketch below.
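
With the google-genai SDK that looks roughly like this (a minimal sketch; the model name, schema, and prompt are illustrative):

```python
from google import genai
from pydantic import BaseModel


class Answer(BaseModel):
    verdict: str
    confidence: float


client = genai.Client()  # reads GEMINI_API_KEY from the environment

response = client.models.generate_content(
    model="gemini-2.0-flash",  # illustrative
    contents="Is the sky blue? Give a verdict and a confidence.",
    config={
        "response_mime_type": "application/json",
        "response_schema": Answer,  # decoding is constrained to this schema
    },
)

print(response.parsed)  # a validated Answer instance
```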

Constrained generation makes models somewhat less intelligent, although it shouldn't be an issue in thinking mode, since the model can prepare an unconstrained response and then fix it up.
Not true, and citation needed. Whatever you cite, there are competing papers claiming that structured and constrained generation does zero harm to output diversity/creativity (within a schema).
That is clearly not possible. Imagine if you asked a model yes/no questions with a schema that didn't contain "yes".

In general you can break any model by using a sampler that chooses bad enough tokens sometimes. I don't think it's well studied how well different models respond to this.

I mean, that's too reductionist if you're being exact, and not a worry if you're not.

Even asking for JSON (without constrained sampling) sometimes degrades output, but also even the name and order of keys can affect performance or even act as structured thinking.

At the end of the day current models have enough problems with generalization that they should establish a baseline and move from there.

Agree, it feels so fundamental. Any idea why? Gemini has also had it for a long time.
The way you got structured output from Claude prior to this was via tool use.

IMO this was the more elegant design if you think about it: tool calling is really just structured output and structured output is tool calling. The "do not provide multiple ways of doing the same thing" philosophy.

And they've done super well without it. Makes you really question whether this is really that core.
