
I agree with the author overall. Manual testing is what I call "vibe testing", and I think it is insufficient by itself, whether you or the agent wrote the code. If you build your tests well, working with the coding agent becomes smooth and efficient, and it is safe to let the agent do longer stretches of work. If you don't do testing, the whole thing is just a bomb ticking in your face.

My approach to coding agents is to prepare a spec at the start, as complete as possible, and develop a beefy battery of tests as we make progress. Yesterday there was a story "I ported JustHTML from Python to JavaScript with Codex CLI and GPT-5.2 in hours". They had 9000+ tests. That was the secret juice.

So the future of AI coding, as I see it: it will be better than pre-2020. We will learn to write specs and plan good tests, and the tests become our actual contract that the code does what it is supposed to do. You can throw away the code, keep the specs and tests, and regenerate it any time.


This depends on the type of software you make. Testing the usability of a user interface, for example, is something you can't automate (yet). So, ehm, it depends :)
It will come around; we already have rudimentary computer-use agents and the ability to record UIs for LLM agents. They will be refined, and then the agent can test UIs as well.

For UIs I use a different trick: live diagnostic tests. I ask the agent to write tests that run inside the app itself and check consistency, constraints, and expected behaviors. Having the app running in its natural state makes it easier to test, and you can encode complex constraints in your diagnostics.
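
As a sketch of what I mean (TypeScript, with purely illustrative names and a hypothetical app state, not tied to any framework):

    // Minimal sketch of an in-app diagnostics runner; all names are illustrative.
    // Hypothetical application state - in a real app this would be your store.
    declare const appState: {
      cart: { items: { price: number; qty: number }[]; total: number };
    };

    type Diagnostic = { name: string; check: () => string | null };

    const diagnostics: Diagnostic[] = [
      {
        // A domain invariant checked against live state.
        name: "cart total matches line items",
        check: () => {
          const sum = appState.cart.items.reduce((s, i) => s + i.price * i.qty, 0);
          return Math.abs(sum - appState.cart.total) < 0.01
            ? null
            : `cart total ${appState.cart.total} != sum of items ${sum}`;
        },
      },
      {
        // A UI constraint checked against the live DOM.
        name: "every open dialog has focusable content",
        check: () => {
          for (const d of Array.from(document.querySelectorAll("[role=dialog]"))) {
            if (!d.querySelector("button, [href], input, select, textarea")) {
              return `dialog without focusable content: ${d.id || "(anonymous)"}`;
            }
          }
          return null;
        },
      },
    ];

    // Returns a list of failures; empty when all live invariants hold.
    export function runDiagnostics(): string[] {
      return diagnostics
        .map((d) => ({ name: d.name, failure: d.check() }))
        .filter((r) => r.failure !== null)
        .map((r) => `${r.name}: ${r.failure}`);
    }

Because these run against the live app, the agent (or you, from a debug menu) can trigger them after each change and read back the failures.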

> Yesterday there was a story "I ported JustHTML from Python to JavaScript with Codex CLI and GPT-5.2 in hours".

Yes, from the same author, in fact.

There are always unknown unknowns that a rigorous testing setup will just sweep under the rug (until they become visible in production, that is).

> They had 9000+ tests.

They were most probably also written by AI, there's no other (human) way. The way I see it, we're stacking turtles upon turtles and hoping that everything somehow holds together.

No, those 9,000 tests are part of a legendary test suite built by real humans over the course of more than a decade: https://github.com/html5lib/html5lib-tests
Sadly, JustHTML doesn't appear to be truly passing those tests.

It looks like the code doesn't always check whether the expected errors in the test suite match the returned errors, which is rather important for ensuring one isn't just incidentally getting the expected output.
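
For concreteness, here is a rough sketch (TypeScript, with a hypothetical parse() API rather than JustHTML's actual harness) of what asserting the errors as well as the output could look like:

    // Sketch only: assert the errors as well as the tree, assuming a test case
    // taken from an html5lib-tests-style fixture with #data, #errors, and
    // #document sections.
    interface ConformanceCase {
      data: string;               // input HTML (#data)
      expectedTree: string;       // expected serialized tree (#document)
      expectedErrorCount: number; // number of expected parse errors (#errors)
    }

    // Hypothetical parser entry point returning both the tree and parse errors.
    declare function parse(html: string): { tree: string; errors: string[] };

    function checkCase(t: ConformanceCase): string[] {
      const { tree, errors } = parse(t.data);
      const failures: string[] = [];
      if (tree !== t.expectedTree) {
        failures.push("serialized tree does not match #document");
      }
      // The part that is easy to skip: valid input must produce zero errors,
      // and invalid input must produce the expected number of them.
      if (errors.length !== t.expectedErrorCount) {
        failures.push(`expected ${t.expectedErrorCount} parse errors, got ${errors.length}`);
      }
      return failures;
    }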

So while JustHTML looks sort of right, it will actually do things like emit errors on perfectly valid HTML.

Plus, the test suite isn't actually comprehensive, so if one only writes code to pass the tests, it can fail in the real world where other parsers that were actually written against the spec wouldn't have trouble.

For instance, the html5lib-tests only cover a small number of meta charsets, and as a result JustHTML can't handle a whole slew of valid HTML5 character encodings like windows-1250 or koi8-r, which parsers like html5lib will happily handle. There's even a unit test added by the AI that ensures koi8-r doesn't work, for some reason.

I thought we were talking about human-scale (as in not multi-human) projects, my bad.
I tabbed back to Visual Studio (C#): 24,990 "unit" tests, all written by hand over the years.

Behind those are a smaller number of larger integration tests, and the even longer-running regression tests that are run on every release but not on every commit.

> They were most probably also written by AI, there's no other (human) way.

Yes. They came from the existing project being ported, which was also AI-written.

They were not, and they did not.

Those human-written tests are why your browser properly renders all kinds of messy HTML.

Oh, I misunderstood the previous submission, then.
