
thegeomaster
2,549 karma
Programmer and tinkerer based in Belgrade, Serbia. I like building products, systems programming, distributed systems, and gamedev.

Building Carthagine (AI-powered Figma to React): https://carthagine.ai


  1. The article talks about all of this and references the DeepSeek R1 paper[0], section 4.2 (first bullet point on PRM), on why this is much trickier to do than it appears.

    [0]: https://arxiv.org/abs/2501.12948

  2. You could think of supervised learning as learning against a known ground truth, which pretraining certainly is.
  3. It's interesting to also compare this to getting a bare metal instance and provisioning microVMs on it using Firecracker. (Obviously something you shouldn't roll yourself in most cases.)

    You can get a bare metal AX162 from Hetzner for 200 EUR/mo, with 48 cores and 256GB of RAM. At 4:1 virtual:physical oversubscription, you could run 192 guests on such a machine, yielding a cost of 200/192 ≈ 1.04 EUR/mo and giving each guest a bit over 1GiB of RAM. Interestingly, that's not groundbreakingly cheaper than just getting one of Hetzner's virtual machines!
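
    A quick back-of-the-envelope sketch in Python, using the AX162 figures above (the 4:1 oversubscription ratio is just the assumption from this example):

      # Per-guest cost and RAM when packing microVMs onto one dedicated host.
      # Figures are the Hetzner AX162 numbers quoted above; adjust to taste.
      host_eur_per_month = 200
      physical_cores = 48
      host_ram_gib = 256
      oversubscription = 4  # virtual:physical vCPU ratio

      guests = physical_cores * oversubscription     # 192 single-vCPU guests
      eur_per_guest = host_eur_per_month / guests    # ~1.04 EUR/mo
      ram_per_guest = host_ram_gib / guests          # ~1.33 GiB

      print(f"{guests} guests: {eur_per_guest:.2f} EUR/mo, "
            f"{ram_per_guest:.2f} GiB RAM each")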

  4. You didn't include the amortized cost of a Blackwell GPU, which is an expense an order of magnitude larger than electricity.
  5. Warning: LLM-generated article, terribly difficult to follow and full of irrelevant details.
  6. Well, this was a trip down memory lane. I built a small game on Irrlicht at the time, and I remember these discussions too.

    Irrlicht had its own editor (irrEdit) and a sound system (irrKlang), and basic collision detection and an FPS controller were built right into the engine. This was enough to get you a considerable way through a fully featured tech demo, at the very least. (I even remember Irrlicht including a beautiful first-person tech demo of traversing a large BSP-partitioned castle level.)

    However, for those not afraid to stitch together these additional parts from other promising libraries (or derive them from first principles, as was fashionable), OGRE offered more raw rendering prowess: a working deferred shading system (this was the heyday of deferred shading), a pop-less terrain implementation with texture splatting, and more impressive shader and rendering pipeline support via the Cg multi-platform shading language. I remember fairly impressive ocean-surface and Fresnel refraction/reflection demos from OGRE at the time.

  7. What an astounding achievement. In 6 years, this person has written not only a very well-designed microkernel, but also a build system, a UEFI bootloader, a graphical shell, a UI framework, and a browser engine.

    The story of 10x developers among us is not a myth... if anything, it's understated.

  8. Common sense:

    - The compute requirements would be massive compared to the rest of the industry

    - Not a single large open source lab has trained anything over 32B dense in the recent past

    - There is considerable crosstalk between researchers at large labs; notice how all of them seem to be going in similar directions all the time. If dense models of this size actually provided a benefit over MoE, the info would've spread like wildfire.

  9. tok/s cannot in any way be used to estimate parameter count. It's a tradeoff made at inference time: you can adjust your batch size to serve one user at a huge tok/s or many users at a slow tok/s.
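
    To make that concrete, here's a roofline-flavored toy model in Python. Every number below is invented purely for illustration; real deployments are messier, but the shape of the tradeoff holds: the same model serves one user fast or many users slower, so observed tok/s pins down nothing about parameter count.

      # Toy decode-step model: each step must stream the active weights
      # from memory once and do ~2 * params FLOPs per token of compute.
      # All hardware and model numbers below are hypothetical.
      params = 70e9          # hypothetical active parameter count
      bytes_per_param = 2    # bf16 weights
      mem_bw = 3.35e12       # hypothetical HBM bandwidth, bytes/s
      compute = 1.0e15       # hypothetical usable FLOP/s

      for batch in (1, 8, 64, 512):
          t_mem = params * bytes_per_param / mem_bw   # stream weights once
          t_flops = batch * 2 * params / compute      # matmul work
          step = max(t_mem, t_flops)                  # slower bound wins
          print(f"batch={batch:>3}: {1/step:6.1f} tok/s per user, "
                f"{batch/step:9.0f} tok/s aggregate")

    With these made-up numbers, per-user speed is flat until the compute bound kicks in, while aggregate throughput grows with batch size, so the same weights can look "fast" or "slow" depending purely on serving configuration.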
  10. There's no way Sonnet 4 or Opus 4 are dense models.
  11. Are you saying that you think Sonnet 4 has 100B-200B _active_ params? And that Opus has 2T active? What data are you basing these outlandish assumptions on?
  12. Seems heavily vibe coded, down to the Claude-generated README and a lot of the LLM prompts themselves (which I have found work very poorly compared to human-written prompts). While none of this is necessarily bad, it raises the burden of proof that the tool actually works beyond toy problems [0]. I think everyone would appreciate some examples of vulnerabilities it can find. The missing JWT check showcased in the screenshot would probably have been caught by ordinary AI code review, so to my eye that by itself is not persuasive.

    Good luck!

    [0]: Why I say this: a 10kLOC piece of software that was mostly human-written would have required a large amount of testing, even if manual, to ensure that it works reliably at all. All that testing and experimentation would naturally force a certain depth of exploration of the approach, the LLM prompts, etc. across a variety of use cases. A mostly AI-written codebase of this size would have required much less testing to reach "doesn't crash and runs reliably", so that depth is no longer a given.

  13. Thanks for sharing this! It's difficult to find good examples of useful codebases where coding agents have done most of the work. I'm always actively looking at how I can push these agents to do more for me, and it's very instructive to hear from somebody who has had success on this level. (A writeup would be nice to read, too.)
  14. Well, Gemini Flash Lite is at least one, and likely two, orders of magnitude larger than this model.
  15. Gemini 2.5 Pro is severely kneecapped in this evaluation. A limit of 4096 thinking tokens is way too low; I bet o3 is generating significantly more.
  16. It just matches the 90% discount that Claude models have had for quite a while. I don't see anything groundbreaking...
  17. o3 pricing: $8/Mtok out

    GPT-5 pricing: $10/Mtok out

    What am I missing?

  18. SWE-Bench Verified score, with thinking, ties Opus 4.1 without thinking.

    AIME scores do not appear too impressive at first glance.

    They are downplaying benchmarks heavily in the live stream. This is the lab that has been flexing benchmarks as headline figures since forever.

    This is a product-focused update. There is no significant jump in raw intelligence or agentic behavior against SOTA.

  19. They have worse scores than recent open source releases on a number of agentic and coding benchmarks, so if absolute quality is what you're after and not just cost/efficiency, you'd probably still be running those models.

    Let's not forget, this is a thinking model that has significantly worse scores on Aider-Polyglot than the non-thinking Qwen3-235B-A22B-Instruct-2507, a worse TAUBench score than the smaller GLM-4.5 Air, and a worse SWE-Bench Verified score than the (3x the size) GLM-4.5. So the results, at least in terms of benchmarks, are not really clear-cut.

    From a vibes perspective, the non-reasoners Kimi-K2-Instruct and the aforementioned non-thinking Qwen3 235B are much better at frontend design. (Tested privately, but I fully expect DesignArena to back me up in the coming weeks.)

    OpenAI has delivered something astonishing for the size, for sure. But your claim is just an exaggeration, and OpenAI has, unsurprisingly, highlighted only the benchmarks where the model does _really_ well.

  20. GLM-4.5 seems to outperform it on TauBench, too. And it's suspicious that OAI isn't sharing numbers for quite a few useful benchmarks (nothing related to coding, for example).

    One positive thing I see is the parameter count and size: it will make inference more economical than the current open source SOTA.

  21. This can be dangerous, because Claude doesn't truly understand why it did something. Whatever it writes is a post-hoc justification, which may or may not be accurate to the actual "intent". This is because these are still autoregressive models: they have only the context to go on, not prior intent.
  22. I never understood the hate. Beyond the strange syntax, it's not terribly different from a language such as Pascal: an old imperative language without too much magic (beyond some odd syntactic sugar).
  23. Same here. Reading the article, I could not really relate to the experience of being a single-language developer for 10 years.

    In my early days, I identified strongly with my chosen programming language, but people way more experienced than me taught me that a programming language is a tool, and that this approach is akin to saying "well, I don't know about those pliers, I am a hammerer."

    My personal feeling from working across a wide range of programming languages is that it expands your horizons in a massive way (one that's hard to describe qualitatively), and I'm happy that I did it.

  24. You're assuming that the whole model has to be in SRAM.
  25. Yes, there have been multiple (very big) hints dropped by various people that they had no official cooperation.
  26. Just curious: why do you prefer/have a requirement of quad-based meshes?
  27. Mentioned in TFA as well.
  28. Post-training allows leveraging the considerable world and language understanding of the underlying pretrained model. The intuition is that this boosts performance.
  29. A detail that is not mentioned is that Google models >= Gemini 2.0 are all explicitly post-trained for this task of bounding box detection: https://ai.google.dev/gemini-api/docs/image-understanding

    The author's use of the specific `box_2d` format suggests he is taking advantage of this feature, so I wanted to highlight it. My intuition is that a base multimodal LLM without this type of post-training would have much worse performance.
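
    For reference, a minimal sketch of that documented workflow with the google-genai Python SDK (the model name, file name, and exact SDK surface are my assumptions and may have drifted; the box_2d convention itself is from the docs linked above):

      import json
      from google import genai
      from PIL import Image

      client = genai.Client()  # assumes an API key in the environment
      image = Image.open("scene.jpg")  # hypothetical input image

      prompt = ("Detect all prominent objects. Return a JSON list where "
                "each entry has 'label' and 'box_2d' as "
                "[ymin, xmin, ymax, xmax] normalized to 0-1000.")
      resp = client.models.generate_content(
          model="gemini-2.5-flash", contents=[image, prompt])

      # Per the docs, coordinates are normalized to 0-1000; scale them
      # back to pixel space. (Real responses may arrive wrapped in
      # markdown fences and need stripping before json.loads.)
      w, h = image.size
      for det in json.loads(resp.text):
          ymin, xmin, ymax, xmax = det["box_2d"]
          print(det["label"], (xmin * w / 1000, ymin * h / 1000,
                               xmax * w / 1000, ymax * h / 1000))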

