
> Practically speaking, we’ve observed it maintaining focus for more than 30 hours on complex, multi-step tasks.

Really curious about this, since people keep bringing it up on Twitter. They mention it pretty much off-handedly in their press release, and it doesn't show up at all in their system card. It's only through an article on The Verge that we get more context. Apparently they told it to build a Slack clone and left it unattended for 30 hours, and it built a Slack clone using 11,000 lines of code (https://www.theverge.com/ai-artificial-intelligence/787524/a...)

I have very low expectations around what would happen if you took an LLM and let it run unattended for 30 hours on a task, so I have a lot of questions about the quality of the output.


cowboy_henk
Interestingly, the internet is full of "Slack clone" dev tutorials. I used to work for a company that provides chat backend/frontend components as a service. It was one of their go-to examples, and the same is true for their competitors.

While it's impressive that you can now just have an LLM build this, I wouldn't be surprised if the result of these 30 hours is essentially just a rehash of one of those example Slack clones, especially since all of these models have internet access nowadays. I honestly think 30 hours isn't even that fast for something like this, where you can realistically follow a tutorial and have it done.

In fact, I just did a quick Google search and found this 15-hour course about building a Slack clone: https://www.codewithantonio.com/projects/slack-clone

sigmoid10
This is obviously much more than just taking an LLM and letting it run for 30 hours. You have to build a whole environment together with external tool integration and context management, then tune the prompts and perhaps even set up a multi-agent system. I believe that if someone puts a ton of work into this, you can have an LLM run for that long and still produce sellable outputs, but let's not pretend this is something average devs can do by buying some API tokens and kicking off a frontier model.
Philpax
Well, yes, that's Claude Code. And OpenAI Codex. And Google Gemini CLI.

Your average dev can just use those.

Yes, but you need to set up quite a bit of tooling to provide feedback loops.

It's one thing to get an LLM to do something unattended for long durations; it's another to give it the means of verification.

For example, I'm busy upgrading a 500k LoC Rails 1 codebase to Rails 8 and built several DSLs that give it proper authorised sessions in a headless browser, with basic HTML parsing tooling, so it can "see" what effect its fixes have. Then you somehow need to also give it a reliable way to keep track of the past and its own learnings, which sounds simple, but I have yet to see any tool or model solve it at this scale... will give Sonnet 4.5 a try this weekend, but yeah, none of the models I tried are able to produce meaningful results over long periods on this upgrade task without good tooling and strong feedback loops.
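A stripped-down sketch of what one of those verification tools looks like, just to give the idea (Playwright here; the URL, selectors and env vars are placeholders, and the real DSLs wrap a lot more than this):

```ts
// verify.ts: give the agent an authorised, headless view of a page so it can
// check the effect of its own fixes. URL, selectors and env vars are placeholders.
import { chromium } from 'playwright';

(async () => {
  const browser = await chromium.launch({ headless: true });
  const page = await browser.newPage();

  // Log in once so every check runs with an authorised session.
  await page.goto('http://localhost:3000/login');
  await page.fill('#email', process.env.APP_USER ?? '');
  await page.fill('#password', process.env.APP_PASS ?? '');
  await page.click('button[type=submit]');

  // Visit the path the agent is working on and report what it can "see".
  const path = process.argv[2] ?? '/dashboard';
  const response = await page.goto(`http://localhost:3000${path}`);
  console.log(`status: ${response?.status()}`);
  console.log(await page.locator('body').innerText());

  await browser.close();
})();
```

The agent runs something like this after each fix and reads back the status code plus the rendered text, which is usually enough for it to tell whether a page regressed.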

Btw, I have upgraded the app and am taking it to alpha testing now, so it is possible.

majortennis
I've tried asking it to log every request and response to a project_log.md, but it routinely ignores that.

I've also tried using Playwright for testing in a headless browser and taking screenshots for a blog that can effectively act as a log, but it just seems like too tall an order for it.
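Roughly what I've been trying looks like this (Playwright's Node API; the URL and paths are placeholders), with the harness appending the log entry itself rather than hoping the model remembers to:

```ts
// log_check.ts: screenshot a page and append the entry to project_log.md from the
// harness itself, instead of relying on the model to do it. URL is a placeholder.
import { chromium } from 'playwright';
import * as fs from 'fs';

(async () => {
  const url = process.argv[2] ?? 'http://localhost:3000/';
  const browser = await chromium.launch({ headless: true });
  const page = await browser.newPage();
  const response = await page.goto(url);

  fs.mkdirSync('screenshots', { recursive: true });
  const shot = `screenshots/${Date.now()}.png`;
  await page.screenshot({ path: shot, fullPage: true });

  // One markdown entry per check: timestamp, URL, status, screenshot.
  fs.appendFileSync(
    'project_log.md',
    `\n## ${new Date().toISOString()} ${url}\nstatus: ${response?.status()}\n![screenshot](${shot})\n`
  );
  await browser.close();
})();
```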

It sounds like you're streets ahead of where I am. Could you give me some pointers on getting started with a feedback loop, please?

grncdr
> rails 1 codebase to rails 8

A bit off topic, but Rails *1* ? I hope this was an internal app and not on the public internet somewhere …

Haha, no, it's an old (15 years old) abandoned enterprise app running on-prem that hasn't seen updates in more than a decade.
ewoodrich
But then that goes back to the original question, considering my own experience of the amount of damage CC or Codex can do in a working code base from a couple of tiny initial mistakes or confusion about intent while left unattended for ten minutes, let alone 30 hours...
sigmoid10
If you had used any of those, you'd know they clearly don't work well enough for such long tasks. We're not yet at the point where we have general purpose fire-and-forget frameworks. But there have been a few research examples from constrained environments with a complex custom setup.
ChadMoran
Claude Code with a good prompt can run for hours.
NaomiLehman
That sounds to me like a full room of guys trying to figure out the most outrageous thing they can say about the update without being accused of lying. Half of them on ketamine, the other half on 5-MeO-DMT. Bat country. 2 months of 007 work.

Imagine reviewing 30 hours of 2025-LLM code.

shanecp
What they don't mention is all the tooling, MCPs, and other stuff they've added to make this work. It's not 30 hours out of the box. It's probably heavily guard-railed, with a lot of validated plans, checklists, and verification points they can check. It's similar to 'lab conditions': you won't get that output in real-world situations.
Bjorkbat OP
Yeah, I thought about that after I looked at the SWE-bench results. It doesn't make sense that the SWE-bench results are barely an improvement, yet the model is somehow a much bigger improvement when it comes to long tasks. You'd expect a huge gain in one to translate to the other.

Unless the main area of improvement was tools and scaffolding rather than the model itself.

gapeslape
“30 hours of unattended work” is totally vague and doesn’t mean anything on its own. It depends, at the very least, on the number of tokens you were able to process.

Just to illustrate, say you are running on a slow machine that outputs 1 token per hour. At that speed, 30 hours would produce approximately one sentence.

zelphirkalt
"Slack clone" is also super vague:

(First of all: Why would anyone in their right mind want a Slack clone? Slack is a cancer. The only people who want it are non-technical people, who inflict it upon their employees.)

Is it just group chat or 1-on-1 chat? Or does it have threads, emojis, voice calls, pinning of messages, all the CSS styling (which is probably already 11k lines or more for the real Slack), webhooks/apps?

Also, of course, it is just a BS announcement, without honesty, if they don't publish a reproducible setup that leads to the same outcome they had. It's the equivalent of "But it worked on my machine!" or of "scientific" papers that prove anti-gravity with superconductors and perpetual-motion infinite energy, which only worked in a small shed where some supposed physics professor lives.

Has their comment been edited? A few words later it says it resulted in 11,000 LoC.

> [..] left it unattended for 30 hours, and it built a Slack clone using 11,000 lines of code [..]

throwaway0123_5
Their point still stands though? They said the 1 tok/hr example was illustrative only. 11,000 LoC could be generated line-by-line in one shot, taking not much more than 11,000 * avg_tokens_per_line tokens. Or the model could be embedded in an agent and spend a million tokens contemplating every line.
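Back-of-envelope, with both per-line figures being pure assumptions:

```ts
// Rough bounds only; neither per-line figure is a measurement.
const lines = 11_000;
const oneShot = lines * 10;          // ~10 tokens per line, emitted once: ~110k tokens
const agentic = lines * 1_000_000;   // an agent "contemplating" ~1M tokens per line: ~11B tokens
console.log({ oneShot, agentic });   // { oneShot: 110000, agentic: 11000000000 }
```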
zmmmmm
> Apparently they told it to build a Slack clone and left it unattended for 30 hours, and it built a Slack clone using 11,000 lines of code

It's going to be an issue, I think. Now that lots of these agents support computer use, we are at the point where you can install an app, tell the agent you want something that works exactly the same, and just let it run until it produces it.

The software world may find, sooner rather than later, that it has more in common with book authors than it thought, once full clones of popular apps start popping out of coding tools. It will be interesting to see if this results in a war of attrition, with countermeasures and strict ToU that prohibit use by AI agents, etc.

stravant
That just means that owning the walled gardens and network effects will become yet more important.
walthamstow
It has been trivial to build a clone of most popular services for years, even before LLMs. One of my first projects was Miguel Grinberg's Flask tutorial, in which a total noob can build a Twitter clone in an afternoon.

What keeps people in are network effects and some dark patterns like vendor lock-in and data unportability.

supern0va
There's a marked difference between running a Twitter-like application that scales to even a few hundred thousand users and one that is a global-scale application.

You may quickly find that, network effects aside, you'd be crushed under the weight and unexpected bottlenecks of the very network you desire.

walthamstow
Agreed entirely, but not sure that's relevant to what I'm replying to.

> we are at the point where you can install an app, tell the agent you want something that works exactly the same and just let it run until it produces it

That won't produce global-scale application infrastructure either; it'll just reproduce the functionality available to the user.

technocrat8080
Curious about this too – does it use the standard context management tools that ship with Claude Code? At 200K context size (or 1M for the beta version), I'm really interested in the techniques used to run it for 30 hours.
ChadMoran
Sub-agents. I've had Claude Code run a prompt for hours on end.
technocrat8080
What kind of agents do you have set up?
s900mhz
You can use the built-in Task agent. When you have a plan and are ready for Claude to implement it, just say something along the lines of “begin implementation, split each step into their own subagent, run them sequentially”.
fragmede
Subagents are where Claude Code shines and Codex still lags behind. Claude Code can do some things in parallel within a single session with subagents; Codex cannot.
osn9363739
Have they released the code for this? Does it work? Or are there x number of caveats and excuses? I'm kind of sick of them (and others) getting a free pass for saying stuff like this.
haute_cuisine
They don't seem to link any source code or a demo. They could have run Claude for 10 hours to write thousands of The Verge articles as well.
