They failed to provide any facts or examples regarding Devin.
This is like arguing that it’s not fair to critique people claiming to have made superconductors because “some people said they really are superconductors,” yet no one can share a sample with anyone for some reason.
A reasonable counterargument would be:
> Here is evidence of Devin actually doing things.
How, other than by the available evidence, was anyone supposed to evaluate Devin?
There is a broad opportunity for the developers to respond to this, but they haven’t.
Why is that?
It is because he’s right.
Regardless of what Devin can do, that video was deceptive and misleading. There are no two ways about it.
Hence I’m skeptical of people making claims about a product I can’t try out myself. It’s unclear if the tasks they are doing, and the way they are using agents, are relevant to the work I do, which is usually working on a team of engineers shipping code on a complex code base.
For AI I tend to put a lot more weight on benchmarks, such as SWE-bench, which is why I wrote an article about it:
https://www.stepchange.work/blog/why-do-ai-software-engineer...
SWE-bench is mostly small Python tasks, evaluated solely by unit tests, that require changes of fewer than 15 lines to a single file. Devin fails at most of them, and on the ones it gets right it ignores all sorts of libraries and conventions used in the rest of the code base.
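To make that concrete, here’s a toy sketch of the shape these tasks take (not a real SWE-bench instance; the file, function, and test names are all made up): a small bug confined to one file, graded purely by whether the unit tests pass.

    # Hypothetical, simplified illustration of an SWE-bench-style task.
    # None of these names come from the real benchmark; they only show the shape:
    # a few-line fix in a single file, graded solely by unit tests.

    # --- src/dateutils.py (buggy version the agent starts from) ---
    def days_in_month(year: int, month: int) -> int:
        """Return the number of days in the given month."""
        days = [31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31]
        # BUG: leap years are ignored, so February 2024 returns 28.
        return days[month - 1]

    # --- the kind of ~3-line patch the agent is expected to produce ---
    def days_in_month_fixed(year: int, month: int) -> int:
        days = [31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31]
        if month == 2 and year % 4 == 0 and (year % 100 != 0 or year % 400 == 0):
            return 29  # leap-year February
        return days[month - 1]

    # --- tests/test_dateutils.py (the only thing the grader checks) ---
    def test_february_leap_year():
        assert days_in_month_fixed(2024, 2) == 29

    def test_february_common_year():
        assert days_in_month_fixed(2023, 2) == 28

Notice that nothing in that grading setup rewards following the rest of the codebase’s conventions; passing the tests is all that counts, which is why the score tells you less than you’d hope about real feature work.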
I’m optimistic that agents will improve dramatically in a few years, but today Devin is not good at making larger changes that build on one another, like features.
That's a lie, pure and simple, and no statements made elsewhere can make that lie any less a lie.