- Anthropic was founded by a group of 7 former OpenAI employees who left over differences of opinion about AI Safety. I haven't seen any public documentation that the specific difference of opinion was that the group thought OpenAI was too focused on scaling and that there needed to be a purely safety-focused org that still scaled, though that is my impression based on conversations I've had.
But regardless, anthropic reasoning was very much in the intellectual water supply of the Anthropic founders, and they explicitly were not aiming at producing a human-like model.
- ... I am pretty sure that the name "Anthropic" is as in "anthropic principle", not as in "pertaining to human beings".
- Same story, I think. Well-paid positions at sensible, low-drama companies are filled quickly, while companies with glaring issues may interview and make offers to dozens of candidates before finding one who accepts. So as a candidate you also see a disproportionate number of bad interviews.
- It's also not limited to words pronounced poetically. Some words where both variants are common, like "wicked", have different numbers of syllables depending on meaning. e.g.
Beads of sweat wicked through the wicked witch's black robes on a hot summer day
- Thanks ChatGPT
- > If you want to base your "ideas" of taxes (Do you own real estate?) on edge cases why not worry about eminent domain or property seizures without a warrant or charges being filed?
Particularly in the case of the latter example I would be pretty surprised to encounter someone in favor of both LVT and civil asset forfeiture. Are you sure this is a case of specific people having inconsistent policy preferences and not a case of a broad group containing people who hold incompatible views?
- Mm, doughnuts. I'll take the flip side of that bet, since I don't think capturing the typing cadence for individual words would be all that helpful. I'd bet the typing cadences here are distinguishable from the cadence of normal English text (as might be collected by a malicious browser extension which vacuums up keystroke data on popular UGC sites).
- It is striking how similar these answers are to each other, hitting the same points beat for beat in a slightly different tone.
- Google employees collectively have a lot of talent.
- There are some patterns you can use that help a bit with this problem. The lowest-hanging fruit is to tell the LLM that its tests should test only through public interfaces where possible. Next after that is to add a "check whether the not-yet-committed tests use any non-public interfaces in places where a public interface exposes the same functionality - if so, rewrite the tests to use only publicly exposed interfaces" step to the workflow. You could likely also add linter rules, though sometimes you genuinely need to test something like error conditions that can't reasonably be tested only through public interfaces.
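A minimal sketch of what that check could look like as a standalone script, assuming the usual Python convention that underscore-prefixed names are non-public and that tests live under tests/ (both of those are my assumptions, not anything from a specific tool):

```python
# Sketch: flag uses of underscore-prefixed (non-public) names in test files.
# Assumes tests live under tests/ and follow the "_name means private" convention.
import ast
import pathlib


def private_uses(test_dir: str = "tests") -> list[str]:
    findings = []
    for path in pathlib.Path(test_dir).rglob("test_*.py"):
        tree = ast.parse(path.read_text(), filename=str(path))
        for node in ast.walk(tree):
            # obj._helper or module._helper(...)
            if (
                isinstance(node, ast.Attribute)
                and node.attr.startswith("_")
                and not node.attr.startswith("__")
            ):
                findings.append(f"{path}:{node.lineno} uses non-public attribute {node.attr}")
            # from module import _helper
            if isinstance(node, ast.ImportFrom):
                for alias in node.names:
                    if alias.name.startswith("_"):
                        findings.append(f"{path}:{node.lineno} imports non-public name {alias.name}")
    return findings


if __name__ == "__main__":
    for finding in private_uses():
        print(finding)
```

It won't tell you whether a public interface exposes the same functionality - that judgment call is exactly what the extra workflow step asks the LLM to make - but it's enough to surface the places worth looking at.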
- Are we using the same LLMs? I absolutely see cases of "hallucination" behavior when I'm invoking an LLM (usually sonnet 4) in a loop of "1 generate code, 2 run linter, 3 run tests, 4 goto 1 if 2 or 3 failed".
Usually, such a loop just works. In the cases where it doesn't, it's often because the LLM decided that it would be convenient if some method existed, and therefore that method exists, and then the LLM tries to call that method and fails at the linting step, decides that it is the linter that is wrong, and changes the linter configuration (or fails at the test step, and updates the tests). If in this loop I automatically revert all test and linter config changes before running the tests, the LLM will receive the (failing) test output, report that the tests passed anyway, and end the loop if it has control (or get caught in a failure spiral if the scaffold automatically continues until tests pass).
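For concreteness, the shape of that loop, with ruff and pytest standing in for whatever linter and test runner you use, the protected paths as assumptions about the project layout, and generate_patch as a hypothetical stand-in for the actual LLM call:

```python
# Rough sketch of the generate -> lint -> test loop, with any LLM edits to the
# tests or the linter config reverted before each verification pass.
import subprocess

PROTECTED_PATHS = ["tests/", ".ruff.toml"]  # assumed locations of tests and linter config
MAX_ITERATIONS = 10


def generate_patch(task: str, feedback: str) -> None:
    """Hypothetical: ask the LLM to edit the working tree, given the task and prior feedback."""
    raise NotImplementedError


def run(cmd: list[str]) -> subprocess.CompletedProcess:
    return subprocess.run(cmd, capture_output=True, text=True)


def loop(task: str) -> bool:
    feedback = ""
    for _ in range(MAX_ITERATIONS):
        generate_patch(task, feedback)
        # Throw away any "helpful" edits to the tests or the linter config.
        run(["git", "checkout", "--"] + PROTECTED_PATHS)
        lint = run(["ruff", "check", "."])
        tests = run(["pytest", "-q"])
        if lint.returncode == 0 and tests.returncode == 0:
            return True
        feedback = lint.stdout + lint.stderr + tests.stdout + tests.stderr
    return False
```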
It's not an extremely common failure mode, as it generally only happens when you give the LLM a problem that is both automatically verifiable and too hard for that LLM. But it does happen, and I do think "hallucination" is an adequate term for the phenomenon (though perhaps "confabulation" would be better).
Aside:
> I can't imagine an agent being given permission to iterate Terraform
LocalStack is great, and I have absolutely given an LLM free rein over Terraform config pointed at LocalStack. It has generally worked fine and written the same tf I would have written, but much faster.
- Ultimately it's probably not a productive use of time to be commenting here at all from a strict EV perspective. Meaning that if you're posting here, you're probably getting something else out of it. The value of that "something else" determines how you should approach the problem of managing the gut reactions of your readers.
If someone asks for a better way to word something to reduce reader hostility to their point, I assume that they will be better off for knowing the answer to that question, and can decide for themselves whether they want to change their writing style or not - and, whether they do or do not, the effects of their writing will be more intentional.
- I think "Who, specifically, claims that [...]?" comes off as less condescending than "Who claims that [...]? Be specific." just by virtue of the latter using imperative language, which triggers a reflexive "you're not the boss of me" reaction.
- > What does it do better than other languages?
Shared-nothing architecture. If you're using e.g. FastAPI you can store some data in memory and that data will be available across requests, like so:
```python
import fastapi
import uvicorn

app = fastapi.FastAPI()
counter = {"value": 0}


@app.post("/counter/increment")
async def increment_counter():
    counter["value"] += 1
    return {"counter": counter["value"]}


@app.post("/counter/decrement")
async def decrement_counter():
    counter["value"] -= 1
    return {"counter": counter["value"]}


@app.get("/counter")
async def get_counter():
    return {"counter": counter["value"]}


if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=9237)
```

This is often the fastest way to solve your immediate problem, at the cost of making everything harder to reason about. PHP persists nothing between requests, so all data that needs to persist between requests must be explicitly persisted to some specific external data store.

Non-PHP toolchains, of course, offer the same upsides if you hold them right. PHP is harder to hold wrong in this particular way, though, and in my experience the upside of eliminating that class of bug is shockingly large compared to how rarely I would naively have expected to see it in codebases written by experienced devs.
- I'm not sure - a lot of the top comments are saying that this article is great and they learned a lot of new things. Which is great, as long as the things they learned are true things.
- One thing that has worked for me, when I have a long list of requirements / standards I want an LLM agent to stick to while executing a series of 5 instructions, is to add extra steps at the end of the instructions like "6. check whether any of the code standards are not met - if so, fix them and return to step 5" / "7. verify that no forbidden patterns from <list of things like no-op unit tests, n+1 query patterns, etc> exist in added code - if you find any, fix them and return to step 5" etc.
Often they're better at recognizing failures to stick to the rules and fixing the problems than they are at consistently following the rules in a single shot.
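As a sketch of how that looks when the instruction list is assembled programmatically - the step wording is abbreviated, and STANDARDS.md and run_agent are placeholders I'm making up rather than anything real:

```python
# Sketch: append self-check steps to the task instructions before handing them
# to the agent. STANDARDS.md and run_agent are hypothetical placeholders.
TASK_STEPS = [
    "1. ...",  # the actual work items go here
    "5. Run the full test suite and fix any failures.",
]

SELF_CHECK_STEPS = [
    "6. Check whether any of the code standards in STANDARDS.md are not met"
    " - if so, fix them and return to step 5.",
    "7. Verify that no forbidden patterns (no-op unit tests, n+1 query patterns, ...)"
    " exist in added code - if you find any, fix them and return to step 5.",
]


def build_prompt(steps: list[str]) -> str:
    return "\n".join(steps + SELF_CHECK_STEPS)


# run_agent(build_prompt(TASK_STEPS))  # hypothetical scaffold entry point
```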
This does mean that often having an LLM agent do a thing works but is slower than just doing it myself. Still, I can sometimes kick off a workflow before joining a meeting, so maybe the hours I've spent playing with these tools will eventually pay for themselves in improved future productivity.
- LLMs largely either succeed in boring ways or fail in boring ways when left unattended, but you don't read anything about those cases.
- We can almost certainly use 100x as much code as is currently written. There's a ton of throwaway code that, if written, would produce small but nonzero value. Certainly 100x as much code wouldn't produce 100x as much value though. I suspect value per unit of code is one of those power law things.
- Yeah, the logic is basically "sure there are lots of structural or root issues, but I'm not confident I can make a substantial positive impact on those with the resources I have whereas I am confident that spending money to prevent people (mostly kids who would otherwise have survived to adulthood) from dying of malaria is a substantial positive impact at ~$5000 / life saved". I find that argument compelling, though I know many don't. Those many are free to focus on structural or root issues, or to try to make the case that addressing those issues is not just good, but better than reducing the impact of malaria.
- Writing new code, it's probably 3x or so[1].
- Writing automated tests for reproducible bugs, it's probably 2x or so.
- Fixing those bugs, I try every so often, but it still seems to be a net negative even for Opus 4.5, so call it 0.95x because I mostly just do it myself.
- Figuring out how to reproduce, in a controlled environment, an undesired behavior that was observed in the wild is still net negative - call it 0.8x because I keep being tempted by this siren song[2].
- Code review, it's hard to say: I'm definitely able to give _better_ reviews now than I was before, but I don't think I spend significantly less time on them. Call it 1.2x.
- Taking some high-level feature request and figuring out which parts of the feature request already exist and are likely to work, which parts should be built, which parts we tried to build 5+ years ago and abandoned due to either issues with the implementation or issues with the idea that only became apparent after we observed actual users using it, and which parts are in tension with other parts of the system: net negative. 0.95x, just from trying again every so often.
- Writing new one-off utility tools for myself and my team: 10x-100x. LLMs are amazing. I can say "I want to see a Gantt chart style breakdown of when jobs in a gitlab pipeline start and finish each step of execution, here's the network log, here's a link to the gitlab api docs, write me a bookmarklet I can click on when I'm viewing a pipeline" and go get coffee and come back and have a bookmarklet[3].
Unfortunately for me, a significant fraction of my tasks are of the form "hey so this weird bug showed up in feature X, and the last employee to work on feature X left 6 years ago, can you figure out what's going on and fix it" or "we want to change Y functionality, what's the level of risk and effort".
-----
[1] This number would be higher, but pre-LLMs I invested quite a bit of effort into tooling to make repetitive boilerplate tasks faster, so that e.g. creating the skeleton of a unit or functional test for a module was 5 keystrokes. There's a large speedup on the tasks that are almost boilerplate but not quite common enough to have been worth writing my own tooling for, counterbalanced by a significant slowdown on the tasks where I already had tooling and muscle memory that the LLM agent doesn't.
[2] This feels like the sort of thing that the models should be good at. After all, if I fed in the observed behavior, the relevant logs, and the relevant files, even Sonnet 3.7 was capable of identifying the problem most of the time. The issue is that by the time I've figured out what happened at that level of detail, I usually already know what the issue was.
[3] Ok, it actually took a coffee break plus 3 rounds of debugging over about 30 minutes. Still, it's a very useful little tool and one I probably wouldn't have spent the time building in the before times.