- Yeah, I’ve found that the only way to let AI build any larger amount of useful code and data for a user who doesn’t review all of it is to add a lot of “guardrails”. Not just adding more prompting, because that is an after-the-fact solution. Not just verifying and erroring out a turn, because that adds latency and lets the model start spinning out of control. You also need to isolate tasks and autofix output to keep the model on track.
Models definitely need less and less of this with each version that comes out, but it’s still what you have to do today if you want to be able to trust the output. And even in a future where models approach perfection, I think this approach will be the way to reduce latency and keep tabs on whether your prompts are producing the output you expected at a larger scale. You will also be building good evaluation data for testing alternative approaches, or even for fine-tuning.
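Here's a minimal sketch of that kind of loop -- the generate, autofix, and validate functions are hypothetical stand-ins for whatever model call and checks your pipeline uses; the important parts are the cheap deterministic fix, the bounded retry, and feeding the error back instead of looping blindly:

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// generate is a placeholder for the real model call; here it pretends the
// model returned a truncated JSON object.
func generate(prompt, feedback string) string {
	return `{"name": "example"`
}

// autofix applies cheap deterministic repairs, e.g. closing an unbalanced brace.
func autofix(out string) string {
	if strings.Count(out, "{") > strings.Count(out, "}") {
		out += "}"
	}
	return out
}

// validate is where schema checks, linting, or tests would run.
func validate(out string) error {
	if !strings.HasSuffix(out, "}") {
		return errors.New("output is not a complete JSON object")
	}
	return nil
}

func runGuardedTask(prompt string) (string, error) {
	feedback := ""
	for turn := 0; turn < 3; turn++ { // bound the turns so the model can't spin out
		out := autofix(generate(prompt, feedback))
		if err := validate(out); err == nil {
			return out, nil // also a good point to log the example for future evals
		} else {
			feedback = err.Error() // feed the error back instead of retrying blindly
		}
	}
	return "", errors.New("gave up after max turns")
}

func main() {
	out, err := runGuardedTask("produce a JSON object for the user record")
	fmt.Println(out, err)
}
```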
- Extrapolating and wildly guessing, we could end up using all that mostly idle CPU/RAM (the non-VRAM) on the beefy GPU servers doing inference to run agentic loops where the AI runs small JS scripts in a sandbox (which Bun is the best at, with its faster startup times and lower RAM use, not to mention extensive native bindings that Node.js/V8 does not have), essentially allowing multiple turns to happen before yielding to the API caller. It would also go well with the advanced tool use Anthropic recently announced. This would be a big competitive advantage in the age of agents.
- It's a bit odd to come from the outside to judge the internal process of an organization with many very complex moving parts, only a fraction of which we have been given context for, especially so soon after the incident and the post-mortem explaining it.
I think the ultimate judgement must come from whether we stay with Cloudflare now that we have seen how bad it can get. One could also say that an outage of this level hasn't happened in many years, and they are now freshly frightened of it happening again, so expect things to get tightened up (probably using different questions than the ones this blog post proposes).
As for what this blog post could have been: maybe an account of how these ideas were actively used by the author at e.g. Tradera or Loop54.
- The quirks of field values not matching expectations remind me of a rabbit hole from when I was reverse engineering the Starbound engine[1]: I eventually figured out the game was using a flawed implementation of SHA-256 hashing and had to create a replica of it [2]. Originally I used Python [3], which is a really nice language for reverse engineering data formats thanks to its flexibility.
[1] Starbounded was supposed to become an editor: https://github.com/blixt/starbounded
- Wow, blast from the past. There's a fairly recent game on Steam called "Stick it to the Stickman" that practically puts you in control of the character in these animations. In fact, I think the game was directly inspired by them (there's a Devolver interview with someone working on the game who mentions it[1]).
- I see other people mentioning env, and mise does this too, with additional support for layering on extra env overrides from a dedicated file, for example a .mise.testing.toml config, and running something like:
MISE_ENV=testing bun run test
(“testing” in this example can be whatever you like)
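As a sketch, such a .mise.testing.toml could look something like this (the variable names are just illustrative):

```toml
# .mise.testing.toml -- only applied when MISE_ENV=testing is set
[env]
NODE_ENV = "test"
DATABASE_URL = "postgres://localhost:5432/myapp_test"
```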
- Slightly related, but mise, a tool you can use instead of e.g. make, has “on enter directory” hooks that can reconfigure your system quite a bit whenever you enter the project directory in the terminal. Initially I was horrified by this idea, but I have to admit it’s been quite nice to enter a directory and have everything set up just right, also for new people joining. It has built-in version management for just about every command line tool you could imagine, so that an entire team can be on a consistent setup of Python, Node, Go, etc.
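If I remember the config shape right, a project mise.toml along these lines gives you both the pinned tool versions and the enter hook (the versions and the command are just examples):

```toml
[tools]
node = "22"
python = "3.12"
go = "1.23"

[hooks]
# Runs whenever you cd into the project directory (with mise activated in your shell).
enter = "echo 'dev environment ready'"
```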
- I thought exactly the same thing. I use errgroup in practically every Go project because it does something you'd most likely do by hand otherwise, and it does it cleaner.
I discovered it after I had already written my own utility to do exactly the same thing, and the code was almost line for line the same, which was pretty funny. But it was a great opportunity to delete some code from the repo without having to refactor anything!
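For anyone who hasn't used it, a typical use looks roughly like this: fan out a few requests, fail fast on the first error, and wait for the rest to wind down (the URLs are just placeholders):

```go
package main

import (
	"context"
	"fmt"
	"net/http"

	"golang.org/x/sync/errgroup"
)

func main() {
	urls := []string{
		"https://example.com/a",
		"https://example.com/b",
		"https://example.com/c",
	}

	g, ctx := errgroup.WithContext(context.Background())
	for _, url := range urls {
		url := url // only needed before Go 1.22, but harmless
		g.Go(func() error {
			req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
			if err != nil {
				return err
			}
			resp, err := http.DefaultClient.Do(req)
			if err != nil {
				return err // the first error cancels ctx, so the others stop early
			}
			defer resp.Body.Close()
			fmt.Println(url, resp.Status)
			return nil
		})
	}
	if err := g.Wait(); err != nil {
		fmt.Println("fetch failed:", err)
	}
}
```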
- It would be interesting to do a wider test like this, but instead of trying to clump people together into "American English" and "British English", make the data point "in which city do people speak like you do?" and create a geographic map of accents.
I'm from the south of Sweden and I've had my "accent" made fun of by people from Malmö just because I grew up outside of Helsingborg, because the accent changes that much in just 60 kilometers.
- Yes, we had the same issue with our coding agent. We found that instead of replacing large tool results in the context, it was sometimes better to have two agents: a long-lived one that only sees the smaller tool results produced by another, short-lived agent, which is the one that actually reads and edits the large chunks. The downside is that you always have to manage the balance of which agent gets what context, and you also increase latency and cost a bit (slightly less reuse of the prompt cache).
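A rough sketch of that split (callModel and the prompts are hypothetical stand-ins, not our actual agent): the long-lived agent only ever sees a short summary, while the throwaway sub-agent handles the large content.

```go
package main

import "fmt"

// callModel stands in for a real LLM API call; here it just echoes a summary.
func callModel(system string, messages []string) string {
	return fmt.Sprintf("[model output given %d messages under %q]", len(messages), system)
}

// subAgentTask is the short-lived agent: it gets the huge tool result and
// returns only a compact digest.
func subAgentTask(task, largeContent string) string {
	return callModel("You read/edit large files and reply with a short summary.",
		[]string{task, largeContent})
}

func main() {
	history := []string{"user: refactor the config loader"}

	// Instead of appending the whole file to the long-lived history,
	// delegate to the sub-agent and keep just its summary.
	largeFile := "... tens of thousands of tokens of source code ..."
	summary := subAgentTask("Summarize what needs to change in this file.", largeFile)
	history = append(history, "tool: "+summary)

	reply := callModel("You are the long-lived planning agent.", history)
	fmt.Println(reply)
}
```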
- From what I can tell, the new context editing and memory APIs are essentially formalizations of common patterns:
Context editing: replace tool call results in the message history (e.g. replace a file output with an indicator that it’s no longer available).
Memory: give the LLM access to read and write .md files, like a virtual file system.
I feel like these formalizations of tools are on the path towards managing message history on the server, which means better vendor lock-in, but not necessarily a big boon to the user of the API (well, bandwidth and latency will improve). I see the OpenAI Responses API going down a similar path, and together these changes will make it harder to swap transparently between providers, something I enjoy having the ability to do.
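For what it's worth, the context editing pattern is something you can already approximate client-side; here's a minimal sketch (the Message struct is illustrative, not any provider's actual schema):

```go
package main

import "fmt"

type Message struct {
	Role    string // "user", "assistant", or "tool"
	Content string
}

// compactOldToolResults keeps the last keepRecent tool results intact and
// replaces older ones with a note so the model knows the data was dropped.
func compactOldToolResults(history []Message, keepRecent int) []Message {
	seen := 0
	for i := len(history) - 1; i >= 0; i-- {
		if history[i].Role != "tool" {
			continue
		}
		seen++
		if seen > keepRecent {
			history[i].Content = "[tool result removed to save context; re-run the tool if needed]"
		}
	}
	return history
}

func main() {
	history := []Message{
		{"user", "show me main.go"},
		{"tool", "...5,000 lines of main.go..."},
		{"user", "now show config.go"},
		{"tool", "...2,000 lines of config.go..."},
	}
	for _, m := range compactOldToolResults(history, 1) {
		fmt.Println(m.Role+":", m.Content)
	}
}
```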
- They reference a paper using initial noisy data as a key, mapping to a "jump-ahead" value of a previous example. I think this is very cool and clever, and does use a diffusion model.
But I don't see how this Deep Researcher actually uses diffusion at all. So it seems wrong to say "test-time diffusion" just because you liken an early text draft to the noise in a diffusion model, then use RAG to retrieve a potentially polished version of said text draft?
- I found it interesting that in the segment where two people were communicating "telepathically", they seem to be producing text, which is then put through text-to-speech (using what appeared to be a voice trained on their own voice -- nice touch).
I have to wonder, if they have enough signal to produce what essentially looks like speech-to-text (without the speech), wouldn't it be possible to use the exact same signal to directly produce the synthesized speech? It could also lower latency further by not needing extra surrounding context for the text to be pronounced correctly.
- Yeah maybe they're overrated, but they seem like the agreed-upon set of types to avoid null and to standardize error handling (with some support for nice sugars like Rust's ? operator).
I quite often see devs introducing them in other languages like TypeScript, but it just doesn't work as well when they're introduced in userland (usually you just end up with a small island of the codebase following this standard).
- I've been using Go more or less in every full-time job I've had since pre-1.0. It's simple for people on the team to pick up the basics, it generally chugs along (I'm rarely worried about updating to the latest version of Go), it has most useful things built in, and it compiles fast. Concurrency is tricky, but if you spend some time with it, Go is nice for expressing data flow. The type system is very convenient most of the time, if sometimes a bit verbose. Just an all-around trusty tool in the belt.
But I can't help but agree with a lot of points in this article. Go was designed by some old-school folks who maybe stuck a bit too hard to their principles, losing sight of practical conveniences. That said, it's a _feeling_ I have, and maybe Go would be much worse if it had solved all these quirks. To be fair, I see more willingness to fix quirks in the last few years; at some point I didn't think we'd ever see generics, or custom iterators, etc.
The points about RAM and portability seem mostly like personal grievances, though. If those were better, that would be nice, of course. But the GC in Go is very unlikely to cause issues in most programs even at very large scale, and it's not that hard to debug. And Go runs on most platforms anyone could ever wish to ship their software on.
But yeah, the whole error/nil situation still bothers me. I find myself wishing for Result[Ok, Err] and Optional[T] quite often.
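You can sketch something like them in userland with generics -- this is just one possible shape, and without language support like pattern matching or a ? operator it stays a bit clunky:

```go
package main

import (
	"errors"
	"fmt"
	"strconv"
)

// Result wraps a value or an error; a userland stand-in for what I wish
// the language had built in.
type Result[T any] struct {
	value T
	err   error
}

func Ok[T any](v T) Result[T]        { return Result[T]{value: v} }
func Err[T any](err error) Result[T] { return Result[T]{err: err} }

// Unpack falls back to the familiar (value, error) pair at the boundary.
func (r Result[T]) Unpack() (T, error) { return r.value, r.err }

func parsePort(s string) Result[int] {
	n, err := strconv.Atoi(s)
	if err != nil {
		return Err[int](err)
	}
	if n <= 0 || n > 65535 {
		return Err[int](errors.New("port out of range"))
	}
	return Ok(n)
}

func main() {
	port, err := parsePort("8080").Unpack()
	fmt.Println(port, err)
}
```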
- There's also the announcement video on YouTube here: https://www.youtube.com/watch?v=3T4OwBp6d90
- I've been very impressed by Jazz -- it enables great DX (you're mostly writing sync, imperative code) and great UX (everything feels instant, you can work offline, etc).
The main problems I have are related to distribution and longevity -- as the article mentions, the data only grows (which is not a big deal if most clients don't have to see all of it), and, more importantly I think, it lacks good solutions for public indexes that change very often (you can in theory have a publicly readable list of ids). However, I recently spoke with Anselm, who said solutions for these things are in the works.
All in all local-first benefits often come with a lot of costs that are not critical to most use cases (such as the need for much more state). But if Jazz figures out the main weaknesses it has compared to traditional central server solutions, it's basically a very good replacement for something like Firebase's Firestore in just about every regard.
- I haven't used XSLT since 2007, but back then I used it as an alternative to ASP.NET for building some dynamic but not too advanced websites, and its performance was very impressive when cached properly (I believe it was over an order of magnitude faster than plain default ASP.NET). I went sleuthing for the framework but I think it's lost. Also, the Google Code archive is barely functional anymore, but I did find the XSLT cache function I built for it: https://github.com/blixt/old-google-code-svn/blob/main/trunk...
Pretty cool to see XSLT mentioned in 2025!
- Yeah, ultimately it depends on the problem. Reading an article like this, it's easy to conclude that the context should always be reduced, with everything relegated to a vector database[1] and retrieved on demand so that the context is as small as possible. Seeing that makes me want to point to situations where, conversely, growing the context helps a lot to improve performance.
It really depends on the task, but I imagine most real-world scenarios have a mixed bag of requirements, such that it's not a needle-in-a-haystack problem but closer to in-context learning (ICL). Even memory retrieval (an example given in the post) can be tricky, because you cannot always trust cosine similarity on short text snippets to cleanly map to relevant memories, so you may end up omitting good data and including bad data (which heavily skews the LLM the wrong way).
[1]: Coincidentally what the post author is selling
- This is one type of problem of information retrieval, but I think the change in performance with context length may be different for non-retrieval answers (such as “what is the edited code for making this button red?” or “which of the above categories does the sentence ‘…’ fall under?”).
One paper that stood out to me a while back was Many-Shot In-Context Learning[1], which showed large positive jumps in performance from filling the context with examples.
As always, it’s important to test one’s problem to know how the LLM changes in behavior for different context contents/lengths — I wouldn’t assume a longer context is always worse.
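To make that concrete, here's a bare-bones sketch of the kind of test I mean: run the same eval set with different numbers of in-context examples and compare accuracy, rather than assuming shorter is always better (callModel and the tiny dataset are hypothetical stand-ins for a real model and real data):

```go
package main

import (
	"fmt"
	"strings"
)

type example struct{ input, label string }

// callModel stands in for a real LLM call; here it just always answers "positive".
func callModel(prompt string) string {
	return "positive"
}

// buildPrompt prepends the labeled examples before the input to classify.
func buildPrompt(shots []example, input string) string {
	var b strings.Builder
	for _, s := range shots {
		fmt.Fprintf(&b, "Input: %s\nLabel: %s\n\n", s.input, s.label)
	}
	fmt.Fprintf(&b, "Input: %s\nLabel:", input)
	return b.String()
}

func accuracy(shots, evalSet []example) float64 {
	correct := 0
	for _, e := range evalSet {
		if strings.TrimSpace(callModel(buildPrompt(shots, e.input))) == e.label {
			correct++
		}
	}
	return float64(correct) / float64(len(evalSet))
}

func main() {
	pool := []example{{"great product", "positive"}, {"broke in a day", "negative"}}
	evalSet := []example{{"love it", "positive"}, {"terrible", "negative"}}
	for _, n := range []int{0, 1, 2} { // with a real model: 0, 8, 64, 256, ...
		fmt.Printf("%d shots: %.0f%% accuracy\n", n, 100*accuracy(pool[:n], evalSet))
	}
}
```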
- I think most people have read it and agree it makes an astute observation about surviving methods, but my point is that we now use it to complain that new methods should just skip all that in-between stuff so that The Bitter Lesson doesn't come for them. At best you can use it as inspiration. Anyway, this was mostly a complaint about the use of "The Bitter Lesson" in the context of this article; the article still deserves credit for all the great information about tokenization methods and how one evolutionary branch of them is the Byte Latent Transformer.
- I'm starting to think "The Bitter Lesson" is a clever-sounding way to throw shade at people who failed to nail it on their first attempt. Usually engineers build much more technology than they actually end up needing, then the extras get shed with time and experience (and often you end up building it again from scratch). It's not clear to me that starting with "just build something that scales with compute" would get you closer to the perfect solution, even if, as you get closer to it, you do indeed make it possible to throw more compute at it.
That said, the hand-coded nature of tokenization certainly seems in dire need of a better solution, something that can be learned end to end. And it looks like we are getting closer with every iteration.
- I don't have a lot of public examples of this, but here's a larger project where I used this strategy for a relatively large app: it has TypeScript annotations for easy VS Code use, Tailwind for design, and it even loads in huge libraries like the Monaco code editor, and it all works quite well, 100% statically:
HTML file: https://github.com/blixt/go-gittyup/blob/main/static/index.h...
Main entrypoint file: https://github.com/blixt/go-gittyup/blob/main/static/main.js