
gchamonlive
I think it's interesting to juxtapose traditional coding, neural network weights and prompts, because in many areas -- like the example of the self-driving module having code replaced by neural networks tuned to a dataset representing the target domain -- this will be quite useful.

However I think it's important to make it clear that given the hardware constraints of many environments the applicability of what's being called software 2.0 and 3.0 will be severely limited.

So instead of being replacements, these paradigms are more like extra tools in the tool belt. Code and prompts will live side by side, each used when convenient, but none of them a panacea.


karpathy
I kind of say it in words (agreeing with you), but I agree the versioning analogy is a bit confusing because it usually additionally implies some kind of improvement, when I'm just trying to distinguish them as very different software categories.
miki123211
What do you think about structured outputs / JSON mode / constrained decoding / whatever you wish to call it?

To me, it's a criminally underused tool. While "raw" LLMs are cool, they're annoying to use as anything but chatbots, as their output is unpredictable and basically impossible to parse programmatically.

Structured outputs solve that problem neatly. In a way, they're "neural networks without the training". They can be used to solve similar problems as traditional neural networks, things like image classification or extracting information from messy text, but all they require is a Zod or Pydantic type definition and a prompt. No renting GPUs, labeling data, or tuning hyperparameters necessary.

They often also improve LLM performance significantly. Imagine you're trying to extract calories per 100g of product, but some products give you calories per serving and a serving size, calories per pound, etc. The naive way to do this is a prompt like "give me calories per 100g", but that forces the LLM to do arithmetic, and LLMs are bad at arithmetic. With structured outputs, you just give it the fifteen different formats that you expect to see as alternatives, and use some simple Python to turn them all into calories per 100g on the backend side.
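The normalization step really is simple Python. A minimal sketch with stdlib dataclasses standing in for the Pydantic/Zod schema (the type names and the three formats here are made up for illustration):

```python
from dataclasses import dataclass
from typing import Union

# Each variant mirrors one of the output formats the LLM may choose.
@dataclass
class PerHundredGrams:
    calories: float

@dataclass
class PerServing:
    calories: float
    serving_size_g: float

@dataclass
class PerPound:
    calories: float

CalorieInfo = Union[PerHundredGrams, PerServing, PerPound]

def normalize(info: CalorieInfo) -> float:
    """Convert whichever format the model emitted to calories per 100 g,
    keeping all arithmetic out of the LLM."""
    if isinstance(info, PerHundredGrams):
        return info.calories
    if isinstance(info, PerServing):
        return info.calories * 100.0 / info.serving_size_g
    return info.calories * 100.0 / 453.592  # 1 lb = 453.592 g

print(normalize(PerServing(calories=150.0, serving_size_g=30.0)))
```

The LLM only has to pick a variant and copy numbers off the label; the conversion is deterministic code.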

abdullin
Even more than that. With Structured Outputs we essentially control the layout of the response, so we can force the LLM to go through different parts of the completion in a predefined order.

One way teams exploit that is to force the LLM to go through a predefined, task-specific checklist before answering. This custom, hard-coded chain of thought boosts accuracy and makes the reasoning more auditable.
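A sketch of what such a layout might look like as a JSON schema -- the checklist fields and their names here are hypothetical; the point is only that constrained decoding makes the model fill them in before it is allowed to emit the answer:

```python
# Hypothetical response schema. Because constrained decoding emits the
# properties in declaration order, the checklist fields come first and
# "final_answer" can only be generated after they are filled in.
schema = {
    "type": "object",
    "properties": {
        "relevant_facts": {"type": "array", "items": {"type": "string"}},
        "contradictions_checked": {"type": "boolean"},
        "units_verified": {"type": "boolean"},
        "final_answer": {"type": "string"},
    },
    "required": ["relevant_facts", "contradictions_checked",
                 "units_verified", "final_answer"],
}

# The answer field sits last in the serialized layout.
keys = list(schema["properties"])
print(keys)
```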

solaire_oa
I also think that structured outputs are criminally underused, but it isn't perfect... and per your example, it might not even be good, because I've done something similar.

I was trying to make a decent cocktail recipe database, and scraped the text of cocktails from about 1400 webpages. Note that this was just the text of the cocktail recipe, and cocktail recipes are comparatively small. I sent the text to an LLM for JSON structuring, and the LLM routinely miscategorized liquor types. It also failed to normalize measurements, even with explicit instructions and the temperature set to zero. I gave up.

hellovai
have you tried schema-aligned parsing yet?

the idea is that instead of using JSON.parse, we create a custom Type.parse for each type you define.

so if you want a:

   class Job { company: string[] }
And the LLM happens to output:

   { "company": "Amazon" }
We can upcast "Amazon" -> ["Amazon"] since you indicated that in your schema.
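A toy version of that upcast in Python (not BAML's actual implementation, just the idea):

```python
def coerce_to_schema(value, expected):
    """Tiny sketch of schema-aligned coercion: if the schema expects a
    list but the model emitted a bare scalar, lift the scalar into a
    one-element list instead of failing the parse."""
    if expected is list and not isinstance(value, list):
        return [value]
    return value

# The LLM output {"company": "Amazon"}, but the schema says string[]:
job = {"company": coerce_to_schema("Amazon", list)}
print(job)  # {'company': ['Amazon']}
```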

https://www.boundaryml.com/blog/schema-aligned-parsing

and since it's only post-processing, the technique will work on every model :)

for example, on BFCL benchmarks, we got SAP + GPT3.5 to beat out GPT4o ( https://www.boundaryml.com/blog/sota-function-calling )

solaire_oa
Interesting! I was using function calling in OpenAI and JSON mode in Ollama with zod. I may revisit the project with SAP.
instig007

    so if you want a:

       class Job { company: string[] }

    We can upcast "Amazon" -> ["Amazon"] since you indicated that in your schema.
Congratulations! You've discovered Applicative Lifting.
hellovai
it's a bit more nuanced than applicative lifting. Part of SAP is that, but there's also supporting strings that don't have quotation marks, supporting recursive types, supporting unescaped quotes like `"hi i wanted to say "hi""`, supporting markdown blocks inside of things that look like "json", etc.

but applicative lifting is a big part of it as well!

gloochat.notion.site/benefits-of-baml
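One of those fixups -- stripping a markdown fence before parsing -- can be sketched in a few lines (a simplification of the idea, not how BAML actually does it):

```python
import json
import re

def extract_json(text: str):
    """If the reply wraps its JSON in a markdown code fence (as the
    phi4 reply above does), strip the fence before parsing; otherwise
    parse the text as-is."""
    m = re.search(r"```(?:json)?\s*(.*?)```", text, re.DOTALL)
    return json.loads(m.group(1) if m else text)

print(extract_json('```json\n{"a": 1}\n```'))
```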

solaire_oa
Ok. Tried it, I'm not super impressed.

    Client: Ollama (phi4) - 90164ms. StopReason: stop. Tokens(in/out): 365/396
    ---PROMPT---
    user: Extract from this content:
    Grave Digger: 
     Ingredients
    
    - 1 1/2 ounces vanilla-infused brandy*
    
    - 3/4 ounce coffee liqueur
    
    - 1/2 ounce Grand Marnier
    
    - 1 ounce espresso, freshly brewed
    
    - Garnish: whipped cream
    
    - Garnish: oreo cookies, crushed
    
    Steps
    
    1.  Add all ingredients into a shaker with ice and shake until
        well-chilled.
    
    2.  Strain into a coupe.
    
    3.  Top with whipped cream and crushed Oreo cookies (discarding cream in
        center).
    
    *Vanilla-infused brandy: Cut 2 fresh vanilla pods lengthwise and place
    into a 750 mL bottle of brandy. Let sit for 3 to 5 days, shaking
    occasionally. Strain out pods and store.
    
    
    
    IngredientType
    ----
    - Alcohol
    - Sweetener
    - Sour
    - Aromatic
    - Bittering_agent
    - Food
    - Dilution
    
    LiquorType
    ----
    - Gin
    - Vodka
    - Rum
    - Whiskey
    - Tequila
    - Mezcal
    - Brandy
    - Cognac
    - Liqueur
    - Wine
    - Absinthe
    
    Answer in JSON using this schema:
    {
      Name: string,
      Ingredients: [
        {
          Text: string,
          Type: IngredientType,
          Liquor_type: LiquorType or null,
          Name_brand: string or null,
          Unit_of_measure: string,
          Measurement_or_unit_count: string,
        }
      ],
      Steps: [
        {
          Number: int,
          Instruction: string,
        }
      ],
      Garnishes: string[],
    }
    
    ---LLM REPLY---
    ```json
    {
      "Name": "Grave Digger",
      "Ingredients": [
        {
          "Text": "vanilla-infused brandy*",
          "Type": "Alcohol",
          "Liquor_type": "Brandy",
          "Name_brand": null,
          "Unit_of_measure": "ounces",
          "Measurement_or_unit_count": "1 1/2"
        },
        {
          "Text": "coffee liqueur",
          "Type": "Liqueur",
          "Liquor_type": "Liqueur",
          "Name_brand": null,
          "Unit_of_measure": "ounce",
          "Measurement_or_unit_count": "3/4"
        },
        {
          "Text": "Grand Marnier",
          "Type": "Liqueur",
          "Liquor_type": "Liqueur",
          "Name_brand": "Grand Marnier",
          "Unit_of_measure": "ounce",
          "Measurement_or_unit_count": "1/2"
        },
        {
          "Text": "espresso, freshly brewed",
          "Type": "Bittering_agent",
          "Liquor_type": null,
          "Name_brand": null,
          "Unit_of_measure": "ounces",
          "Measurement_or_unit_count": "1"
        }
      ],
      "Steps": [
        {
          "Number": 1,
          "Instruction": "Add all ingredients into a shaker with ice and shake until well-chilled."
        },
        {
          "Number": 2,
          "Instruction": "Strain into a coupe."
        },
        {
          "Number": 3,
          "Instruction": "Top with whipped cream and crushed Oreo cookies (discarding cream in center)."
        }
      ],
      "Garnishes": [
        "whipped cream",
        "oreo cookies, crushed"
      ]
    }
    ```
    ---Parsed Response (class Recipe)---
    {
      "Name": "Grave Digger",
      "Ingredients": [
        {
          "Text": "vanilla-infused brandy*",
          "Type": "Alcohol",
          "Liquor_type": "Brandy",
          "Name_brand": null,
          "Unit_of_measure": "ounces",
          "Measurement_or_unit_count": "1 1/2"
        },
        {
          "Text": "espresso, freshly brewed",
          "Type": "Bittering_agent",
          "Liquor_type": null,
          "Name_brand": null,
          "Unit_of_measure": "ounces",
          "Measurement_or_unit_count": "1"
        }
      ],
      "Steps": [
        {
          "Number": 1,
          "Instruction": "Add all ingredients into a shaker with ice and shake until well-chilled."
        },
        {
          "Number": 2,
          "Instruction": "Strain into a coupe."
        },
        {
          "Number": 3,
          "Instruction": "Top with whipped cream and crushed Oreo cookies (discarding cream in center)."
        }
      ],
      "Garnishes": [
        "whipped cream",
        "oreo cookies, crushed"
      ]
    }
Processed Recipe: { Name: 'Grave Digger', Ingredients: [ { Text: 'vanilla-infused brandy*', Type: 'Alcohol', Liquor_type: 'Brandy', Name_brand: null, Unit_of_measure: 'ounces', Measurement_or_unit_count: '1 1/2' }, { Text: 'espresso, freshly brewed', Type: 'Bittering_agent', Liquor_type: null, Name_brand: null, Unit_of_measure: 'ounces', Measurement_or_unit_count: '1' } ], Steps: [ { Number: 1, Instruction: 'Add all ingredients into a shaker with ice and shake until well-chilled.' }, { Number: 2, Instruction: 'Strain into a coupe.' }, { Number: 3, Instruction: 'Top with whipped cream and crushed Oreo cookies (discarding cream in center).' } ], Garnishes: [ 'whipped cream', 'oreo cookies, crushed' ] }

So, yeah, the main issue being that it dropped some ingredients that were present in the original LLM reply. Separately, the original LLM Reply misclassified the `Type` field in `coffee liqueur`, which should have been `Alcohol`.

handfuloflight
Which LLM?
coderatlarge
Note that the per-100g prompt might lead the LLM to reach for the part of its training distribution that is actually written in terms of the 100g standard, and so produce different recall, rather than a suboptimal calculation based on non-standardized per-100g training examples.
BobbyJo
The versioning makes sense to me. Software has a cycle where a new tool is created to solve a problem, and the problem winds up being meaty enough, and the tool effective enough, that the exploration of the problem space the tool unlocks is essentially a new category/skill/whatever.

computers -> assembly -> HLL -> web -> cloud -> AI

Nothing on that list has disappeared, but the work has changed enough to warrant a few major versions imo.

TeMPOraL
For me it's even simpler:

V1.0: describing solutions to specific problems directly, precisely, for machines to execute.

V2.0: giving the machine examples of good and bad answers to specific problems we don't know how to describe precisely, for the machine to generalize from and thereby solve such an indirectly specified problem.

V3.0: telling the machine what to do in plain language, for it to figure out and solve.

V2 was coded in V1 style, as a solution to the problem "build a tool that can solve problems defined as examples". V3 was created by feeding everything and the kitchen sink into V2 at the same time, so that it learns to solve the problem of being a general-purpose tool.

BobbyJo
That's less a versioning of software and more a versioning of AI's role in software: none -> partial -> total. It's a valid scale with regard to AI's role specifically, but I think Karpathy was intending to make a point about software as a whole, and even about the details of how that middle "partial" era evolves.
lymbo
What are some predictions people are anticipating for V4?

My Hail Mary is it’s going to be groups of machines gathering real world data, creating their own protocols or forms of language isolated to their own systems in order to optimize that particular system’s workflow and data storage.

lodovic
But that means AGI is going to write itself
gchamonlive OP
> versioning is a bit confusing analogy because it usually additionally implies some kind of improvement

Exactly what I felt. Semver-like naming analogies bring their own set of implicit meanings, like major versions having to supersede or replace the previous version -- that is, they don't account for coexistence beyond planning migration paths. This expectation doesn't correspond with the rest of the talk, though, so I thought I might point it out. Thanks for taking the time to reply!

poorcedural
Andrej, maybe Software 3.0 is not written in spoken language, like code or prompts are. Software 3.0 is recorded in behavior, a behavior that today's software lacks: behavior written and consumed by machines and annotated by human interaction. Skipping to 3.0 is premature, but Software 2.0 is a ramp.
mclau157
Would this also be more of a push towards robotics and getting physical AI into our everyday lives?
poorcedural
Very insightful! How you would describe boiling an egg is different than how a machine would describe it to another machine.
fc417fc802
Funny that you should use boiling an egg as an example. https://www.nature.com/articles/s44172-024-00334-w
no no, it actually is a good analogy in 2 ways:

1) it is a breaking change from the prior version

2) it is an improvement in that, in its ideal/ultimate form, it is a full superset of capabilities of the previous version

gyomu
It's not just the hardware constraints - it's also the training constraints, and the legibility constraints.

Training constraints: you need lots, and lots of data to build complex neural network systems. There are plenty of situations where the data just isn't available to you (whether for legal reasons, technical reasons, or just because it doesn't exist).

Legibility constraints: it is extremely hard to precisely debug and fix those systems. Let's say you build a software system to fill out tax forms - one the "traditional" way, and one that's a neural network. Now your system exhibits a bug where line 58(b) gets sometimes improperly filled out for software engineers who are married, have children, and also declared a source of overseas income. In a traditionally implemented system, you can step through the code and pinpoint why those specific conditions lead to a bug. In a neural network system, not so much.

So totally agreed with you that those are extra tools in the toolbelt - but their applicability is much, much more constrained than that of traditional code.

In short, they excel at situations where we are trying to model an extremely complex system - one that is impossible to nail down as a list of formal requirements - and where we have lots of data available. Signal processing (like self driving, OCR, etc) and human language-related problems are great examples of such problems where traditional programming approaches have failed to yield the kind of results we wanted (ie, beyond human performance) in 70+ years of research and where the modern, neural network approach finally got us the kind of results we wanted.

But if you can define the problem you're trying to solve as formal requirements, then those tools are probably ill-suited.

radicalbyte
Weights are code being replaced by data, something I've been making heavy use of since the early '00s. After coding for 10 years you start to see the benefits of it and understand where you should use it.

LLMs give us another tool only this time it's far more accessible and powerful.
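A trivial example of the code-replaced-by-data idea (the domain and names here are made up): the behavior lives in a table, so changing it means editing data rather than branches.

```python
# Dispatch table instead of an if/elif chain: adding or changing a
# discount category is a data edit, not a code change.
DISCOUNTS = {"student": 0.10, "senior": 0.15, "none": 0.0}

def price(base: float, category: str) -> float:
    """Apply the table-driven discount; unknown categories get none."""
    return base * (1.0 - DISCOUNTS.get(category, 0.0))

print(price(100.0, "student"))
```

Neural network weights push the same move much further: the table is learned from data instead of written by hand.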

dcsan
LLMs have already replaced some code directly for me, e.g. NLP stuff. Previously I might write a bunch of code to do clustering; now I just ask the LLM to group things. Obviously this is a very basic feature native to LLMs, but there will be more first-class LLM-callable functions over time.
