To me, it's a criminally underused tool. While "raw" LLMs are cool, they're annoying to use as anything but chatbots, as their output is unpredictable and basically impossible to parse programmatically.
Structured outputs solve that problem neatly. In a way, they're "neural networks without the training". They can be used to solve problems similar to those traditional neural networks handle, things like image classification or extracting information from messy text, but all they require is a Zod or Pydantic type definition and a prompt. No renting GPUs, labeling data, or tuning hyperparameters necessary.
They often also improve LLM performance significantly. Imagine you're trying to extract calories per 100g of product, but some products give you calories per serving and a serving size, some give calories per pound, etc. The naive way to do this is a prompt like "give me calories per 100g", but that forces the LLM to do arithmetic, and LLMs are bad at arithmetic. With structured outputs, you just give it the fifteen different formats that you expect to see as alternatives, and use some simple Python to turn them all into calories per 100g on the backend side.
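For illustration, a minimal sketch of that pattern with Zod (the field names and the unit list are made up; a Pydantic version would look analogous):

```typescript
import { z } from "zod";

// Hypothetical sketch: let the model report whichever calorie format the label
// actually uses, instead of forcing it to convert to per-100g itself.
const CalorieInfo = z.discriminatedUnion("kind", [
  z.object({ kind: z.literal("per_100g"), calories: z.number() }),
  z.object({
    kind: z.literal("per_serving"),
    caloriesPerServing: z.number(),
    servingSizeGrams: z.number(),
  }),
  z.object({ kind: z.literal("per_pound"), caloriesPerPound: z.number() }),
]);
type CalorieInfo = z.infer<typeof CalorieInfo>;

// Deterministic normalization on the backend -- no LLM arithmetic involved.
function caloriesPer100g(info: CalorieInfo): number {
  switch (info.kind) {
    case "per_100g":
      return info.calories;
    case "per_serving":
      return (info.caloriesPerServing / info.servingSizeGrams) * 100;
    case "per_pound":
      return (info.caloriesPerPound / 453.6) * 100; // 1 lb ~ 453.6 g
  }
}
```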
One way teams exploit that is to force the LLM to go through a predefined, task-specific checklist before answering. This custom, hard-coded chain of thought boosts accuracy and makes the reasoning more auditable.
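Concretely, that can be as simple as putting the checklist fields before the answer field in the schema, so the model fills them in first (assuming your structured-output tooling preserves field order, which most do). A hypothetical sketch for a support-ticket triage task:

```typescript
import { z } from "zod";

// Hypothetical checklist schema: the intermediate fields come before the final
// answer, so the model works through them first, and they remain auditable.
const TicketTriage = z.object({
  mentions_billing: z.boolean(),
  mentions_outage: z.boolean(),
  customer_tone: z.enum(["calm", "frustrated", "angry"]),
  relevant_quote: z.string(), // evidence pulled verbatim from the ticket
  priority: z.enum(["low", "medium", "high"]), // the actual answer, last
});
```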
I was trying to make a decent cocktail recipe database, and scraped the text of cocktails from about 1400 webpages. Note that this was just the text of the cocktail recipe, and cocktail recipes are comparatively small. I sent the text to an LLM for JSON structuring, and the LLM routinely miscategorized liquor types. It also failed to normalize measurements, even with explicit instructions and the temperature set to zero. I gave up.
the idea is that instead of using JSON.parse, we create a custom Type.parse for each type you define.
so if you want a:
class Job { company: string[] }
And the LLM happens to output: { "company": "Amazon" }
We can upcast "Amazon" -> ["Amazon"] since you indicated that in your schema.
https://www.boundaryml.com/blog/schema-aligned-parsing
and since it's only post-processing, the technique will work on every model :)
for example, on BFCL benchmarks, we got SAP + GPT3.5 to beat out GPT4o ( https://www.boundaryml.com/blog/sota-function-calling )
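Very roughly, the idea looks something like this (a toy TypeScript sketch of schema-aligned parsing, not BAML's actual implementation):

```typescript
// Toy sketch: the parser is generated from the target type, so near-miss model
// output gets coerced into the schema instead of being rejected outright.
interface Job {
  company: string[];
}

// Strip markdown fences / surrounding prose before parsing (very simplified).
function extractJsonBlock(raw: string): string {
  const match = raw.match(/\{[\s\S]*\}/);
  if (!match) throw new Error("no JSON object found in model output");
  return match[0];
}

function parseJob(raw: string): Job {
  const data = JSON.parse(extractJsonBlock(raw));
  return {
    // upcast "Amazon" -> ["Amazon"]; keep arrays as-is
    company: Array.isArray(data.company) ? data.company.map(String) : [String(data.company)],
  };
}

// parseJob('Sure! Here you go: { "company": "Amazon" }') -> { company: ["Amazon"] }
```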
Congratulations! You've discovered Applicative Lifting.
but applicative lifting is a big part of it as well!
gloochat.notion.site/benefits-of-baml
Client: Ollama (phi4) - 90164ms. StopReason: stop. Tokens(in/out): 365/396
---PROMPT---
user: Extract from this content:
Grave Digger:
Ingredients
- 1 1/2 ounces vanilla-infused brandy*
- 3/4 ounce coffee liqueur
- 1/2 ounce Grand Marnier
- 1 ounce espresso, freshly brewed
- Garnish: whipped cream
- Garnish: oreo cookies, crushed
Steps
1. Add all ingredients into a shaker with ice and shake until
well-chilled.
2. Strain into a coupe.
3. Top with whipped cream and crushed Oreo cookies (discarding cream in
center).
*Vanilla-infused brandy: Cut 2 fresh vanilla pods lengthwise and place
into a 750 mL bottle of brandy. Let sit for 3 to 5 days, shaking
occasionally. Strain out pods and store.
IngredientType
----
- Alcohol
- Sweetener
- Sour
- Aromatic
- Bittering_agent
- Food
- Dilution
LiquorType
----
- Gin
- Vodka
- Rum
- Whiskey
- Tequila
- Mezcal
- Brandy
- Cognac
- Liqueur
- Wine
- Absinthe
Answer in JSON using this schema:
{
  Name: string,
  Ingredients: [
    {
      Text: string,
      Type: IngredientType,
      Liquor_type: LiquorType or null,
      Name_brand: string or null,
      Unit_of_measure: string,
      Measurement_or_unit_count: string,
    }
  ],
  Steps: [
    {
      Number: int,
      Instruction: string,
    }
  ],
  Garnishes: string[],
}
---LLM REPLY---
```json
{
  "Name": "Grave Digger",
  "Ingredients": [
    {
      "Text": "vanilla-infused brandy*",
      "Type": "Alcohol",
      "Liquor_type": "Brandy",
      "Name_brand": null,
      "Unit_of_measure": "ounces",
      "Measurement_or_unit_count": "1 1/2"
    },
    {
      "Text": "coffee liqueur",
      "Type": "Liqueur",
      "Liquor_type": "Liqueur",
      "Name_brand": null,
      "Unit_of_measure": "ounce",
      "Measurement_or_unit_count": "3/4"
    },
    {
      "Text": "Grand Marnier",
      "Type": "Liqueur",
      "Liquor_type": "Liqueur",
      "Name_brand": "Grand Marnier",
      "Unit_of_measure": "ounce",
      "Measurement_or_unit_count": "1/2"
    },
    {
      "Text": "espresso, freshly brewed",
      "Type": "Bittering_agent",
      "Liquor_type": null,
      "Name_brand": null,
      "Unit_of_measure": "ounces",
      "Measurement_or_unit_count": "1"
    }
  ],
  "Steps": [
    {
      "Number": 1,
      "Instruction": "Add all ingredients into a shaker with ice and shake until well-chilled."
    },
    {
      "Number": 2,
      "Instruction": "Strain into a coupe."
    },
    {
      "Number": 3,
      "Instruction": "Top with whipped cream and crushed Oreo cookies (discarding cream in center)."
    }
  ],
  "Garnishes": [
    "whipped cream",
    "oreo cookies, crushed"
  ]
}
```
---Parsed Response (class Recipe)---
{
  "Name": "Grave Digger",
  "Ingredients": [
    {
      "Text": "vanilla-infused brandy*",
      "Type": "Alcohol",
      "Liquor_type": "Brandy",
      "Name_brand": null,
      "Unit_of_measure": "ounces",
      "Measurement_or_unit_count": "1 1/2"
    },
    {
      "Text": "espresso, freshly brewed",
      "Type": "Bittering_agent",
      "Liquor_type": null,
      "Name_brand": null,
      "Unit_of_measure": "ounces",
      "Measurement_or_unit_count": "1"
    }
  ],
  "Steps": [
    {
      "Number": 1,
      "Instruction": "Add all ingredients into a shaker with ice and shake until well-chilled."
    },
    {
      "Number": 2,
      "Instruction": "Strain into a coupe."
    },
    {
      "Number": 3,
      "Instruction": "Top with whipped cream and crushed Oreo cookies (discarding cream in center)."
    }
  ],
  "Garnishes": [
    "whipped cream",
    "oreo cookies, crushed"
  ]
}
Processed Recipe: {
  Name: 'Grave Digger',
  Ingredients: [
    {
      Text: 'vanilla-infused brandy*',
      Type: 'Alcohol',
      Liquor_type: 'Brandy',
      Name_brand: null,
      Unit_of_measure: 'ounces',
      Measurement_or_unit_count: '1 1/2'
    },
    {
      Text: 'espresso, freshly brewed',
      Type: 'Bittering_agent',
      Liquor_type: null,
      Name_brand: null,
      Unit_of_measure: 'ounces',
      Measurement_or_unit_count: '1'
    }
  ],
  Steps: [
    {
      Number: 1,
      Instruction: 'Add all ingredients into a shaker with ice and shake until well-chilled.'
    },
    { Number: 2, Instruction: 'Strain into a coupe.' },
    {
      Number: 3,
      Instruction: 'Top with whipped cream and crushed Oreo cookies (discarding cream in center).'
    }
  ],
  Garnishes: [ 'whipped cream', 'oreo cookies, crushed' ]
}

So, yeah, the main issue is that it dropped some ingredients that were present in the original LLM reply, presumably because their `Type` came back as `Liqueur`, which isn't one of the allowed `IngredientType` values. Separately, the original LLM reply misclassified the `Type` field of `coffee liqueur`, which should have been `Alcohol`.
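That failure is easy to reproduce with a strict validator. Here's a rough reconstruction of the relevant part of the schema in Zod (the original is BAML, so this is only an approximation):

```typescript
import { z } from "zod";

// Approximate reconstruction of the Ingredient schema from the prompt above
// (Liquor_type simplified to a plain string here).
const IngredientType = z.enum([
  "Alcohol", "Sweetener", "Sour", "Aromatic", "Bittering_agent", "Food", "Dilution",
]);

const Ingredient = z.object({
  Text: z.string(),
  Type: IngredientType,
  Liquor_type: z.string().nullable(),
  Name_brand: z.string().nullable(),
  Unit_of_measure: z.string(),
  Measurement_or_unit_count: z.string(),
});

// "Liqueur" is not a valid IngredientType, so strict validation rejects the entry --
// the same two entries the parsed response silently dropped.
const result = Ingredient.safeParse({
  Text: "coffee liqueur",
  Type: "Liqueur",
  Liquor_type: "Liqueur",
  Name_brand: null,
  Unit_of_measure: "ounce",
  Measurement_or_unit_count: "3/4",
});
console.log(result.success); // false
```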
computers -> assembly -> HLL -> web -> cloud -> AI
Nothing on that list has disappeared, but the work has changed enough to warrant a few major versions imo.
V1.0: describing solutions to specific problems directly, precisely, for machines to execute.
V2.0: giving the machine examples of good and bad answers to specific problems we don't know how to describe precisely, for the machine to generalize from and solve the indirectly specified problem.
V3.0: telling the machine what to do in plain language, for it to figure out and solve.
V2 was coded in V1 style, as a solution to the problem of "build a tool that can solve problems defined as examples". V3 was created by feeding everything and the kitchen sink into V2 at the same time, so it learns to solve the problem of being a general-purpose tool.
My Hail Mary guess is that it's going to be groups of machines gathering real-world data, creating their own protocols or forms of language isolated to their own systems in order to optimize that particular system's workflow and data storage.
Exactly what I felt. Semver-like naming analogies bring their own set of implicit meanings, like major versions necessarily having to supersede or replace the previous version; that is, they don't account for coexistence beyond planning migration paths. This expectation, however, doesn't correspond with the rest of the talk, so I thought I might point it out. Thanks for taking the time to reply!
Training constraints: you need lots and lots of data to build complex neural network systems. There are plenty of situations where the data just isn't available to you (whether for legal reasons, technical reasons, or simply because it doesn't exist).
Legibility constraints: it is extremely hard to precisely debug and fix those systems. Let's say you build two software systems to fill out tax forms - one the "traditional" way, and one as a neural network. Now your system exhibits a bug where line 58(b) sometimes gets improperly filled out for software engineers who are married, have children, and also declared a source of overseas income. In a traditionally implemented system, you can step through the code and pinpoint why those specific conditions lead to the bug. In a neural network system, not so much.
So totally agreed with you that those are extra tools in the toolbelt - but their applicability is much, much more constrained than that of traditional code.
In short, they excel in situations where we are trying to model an extremely complex system - one that is impossible to nail down as a list of formal requirements - and where we have lots of data available. Signal processing (self-driving, OCR, etc.) and human-language problems are great examples: traditional programming approaches failed to yield the kind of results we wanted (i.e., beyond human performance) despite 70+ years of research, and the modern neural network approach finally got us there.
But if you can define the problem you're trying to solve as formal requirements, then those tools are probably ill-suited.
LLMs give us another tool, only this time it's far more accessible and powerful.
However, I think it's important to make clear that, given the hardware constraints of many environments, the applicability of what's being called software 2.0 and 3.0 will be severely limited.
So instead of being replacements, these paradigms are more like extra tools in the tool belt. Code and prompts will live side by side, being used when convenient, but neither is a panacea.