Comment by gwern - Hacker Neue

gwern Jul 9, 2024 parent

An entertaining outcome here is that LLMs may render the docs vs code debate largely moot: as LLM coding capabilities increase and the cost per token plummets, it becomes increasingly possible to simply stop writing code at all, and instead write docs which are 'compiled' each time by a LLM to code which is then compiled normally and the code thrown away. The code can never get out of sync with the docs because it is always generated from the docs, in a way that previous brittle fragile complicated 'generate code from docs' approaches could only vaguely dream of.

To do bug fixes, one simply updates the docs to explain the new behavior and intentions, and perhaps include an example (ie. unit test) or a property. This is then reflected in the new version of the codebase - the codebase as a whole, not simply one function or module. So the global refactoring or rewrites happen automatically, simply from conditioning on the new docs as a whole.

This might sound breathtaking inefficient and expensive, but it's just the next step in the long progression from the raw machine ops to assembler to low-level languages like C or LLVM to high-level languages to docs/specifications... I'm sure at each step, the masters of the lower stage were horrified by the profligacy and waste of just throwing away the lower stage each time and redoing everything from scratch.

davidthewatson Jul 9, 2024

Reacting to this and the previous comment, I've pursued the idea of iterative human-computer interaction via GPT as perhaps not the ultimate solution, but an extant solution to the problem posed by Knuth in literate programming before we had the magic to do it in HCI where the spectrum extends from humans to machines and assumes that the humans are as tolerant of the machines as the machines are of the human (Postel's law), in a frame that Peter Pirolli described here:

https://www.efsa.europa.eu/sites/default/files/event/180918-...

Which is to say that with an iterative, human-computer interaction (HCI), that is back-ended by a GPT (API) algorithm which can learn from the conversation and perhaps be enhanced by RAG (retrieval augmented generation) of code AND documentation (AKA prompt engineering), results beyond the average intern-engineer pair are not easily achievable, but increasingly probable given how both humans and computers are learning iteratively as we interact with emergent technology.

The key is that realizing that the computer can generate code, but that code is going to be frequently bad, if not hallucinatory in its compilability and perhaps computability and therefore, the human MUST play a DevOps or SRE or tech writer role pairing with the computer to produce better code, faster and cheaper.

Subtract either the computer or the human and you wind up with the same old, same old. I think what we want is GPT-backed metaprogramming produce white box tests precisely because it can see into the design and prove the code works before the code is shared with the human.

I don't know about you, but I'd trust AI a lot further if anything it generated was provable BEFORE it reached my cursor, not after.

The same is true here today.

Why doesn't every GPT interaction on the planet, when it generates code, simply generate white box tests proving that the code "works" and produces "expected results" to reach consensus with the human in its "pairing"?

I'm still guessing. I've posed this question to every team I've interacted with since this emerged, which includes many names you'd recognize.

Not trivial, but increasingly straightforward given the tools and the talent.

gwern OP Jul 9, 2024

> Why doesn't every GPT interaction on the planet, when it generates code, simply generate white box tests proving that the code "works" and produces "expected results" to reach consensus with the human in its "pairing"? I'm still guessing. I've posed this question to every team I've interacted with since this emerged, which includes many names you'd recognize.

My guess would be that it's simply rare in the training data to have white box tests right there next to the new snippet of code, rather than a lack of capability. Even when code does have tests, it's usually in other modules or source code files, written in separate passes, and not the next logical thing to write at any given point in an interactive chatbot assistant session. (Although Claude-3.5-sonnet seems to be getting there with its mania for refactoring & improvement...)

When I ask GPT-4 or Claude-3 to write down a bunch of examples and unit-test them and think of edge-cases, they are usually happy to oblige. For example, my latex2unicode.py mega-prompt is composed almost 100% of edge cases that GPT-4 came up with when I asked it to think of any confusing or uncertain LaTeX constructs: https://github.com/gwern/gwern.net/blob/f5a215157504008ddbc8... There's no reason they couldn't do this themselves and come up with autonomous test cases, run it in an environment, change the test cases and/or code, come up with a finalized test suite, and add that to existing code to enrich the sample. They just haven't yet.

euroderf Jul 9, 2024

I'd think that some combination of user-facing documentation (for the outside of the software) and requirements specs (for the inside of the software) oughta do the trick.

flunhat Jul 9, 2024

Not sure why you got downvoted, this is basically the logical conclusion of programming in some sense. Sure, generating code from docs via an LLM will be riddled with bugs, but it's not like the sloppy Python code some postdoc in a biology lab writes is much better. A lot of their code gets to be correct via trial and error anyway.

"Professional" programmers won't rely on this level of abstraction, but that's similar in principle to how professional programmers don't spend their time doing data analysis with Python & pandas. i.e. the programming is an incidental inconvenience for the research analyst or data scientist or whatever and being able to generate code by just writing english docs and specs makes it much easier.

The real issue is debuggability, and in particular knowing your code is "generally" correct and not overfit on whatever specs you provided. But we are discussing a tractable problem at this point.

This item has no comments currently.

Preferences

Keyboard Shortcuts

Story Lists

Navigation

Miscellaneous