
thegeomaster
2,549 karma
Programmer and tinkerer based in Belgrade, Serbia. I like building products, systems programming, distributed systems, and gamedev.

Building Carthagine (AI-powered Figma to React): https://carthagine.ai


  1. The article talks about all of this and references the DeepSeek R1 paper[0], section 4.2 (first bullet point on PRM), on why this is much trickier to do than it appears.

    [0]: https://arxiv.org/abs/2501.12948

  2. You could think of supervised learning as learning against a known ground truth, which pretraining certainly is.
  3. It's interesting to also compare this to getting a bare metal instance and provisioning microVMs on it using Firecracker. (Obviously something you shouldn't roll yourself in most cases.)

    You can get a bare metal AX162 from Hetzner for 200 EUR/mo, with 48 cores and 256GB of RAM. At 4:1 virtual:physical oversubscription, you could run 192 guests on such a machine, yielding a cost of 200/192 ≈ 1.04 EUR/mo and giving each guest a bit over 1GiB of RAM. Interestingly, that's not groundbreakingly cheaper than just getting one of Hetzner's virtual machines!
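
    A quick back-of-the-envelope sketch in Python, using the AX162 figures above (the 4:1 oversubscription ratio is just the assumption from this example):

      # Per-guest cost and RAM when packing microVMs onto one dedicated host.
      # Figures are the Hetzner AX162 numbers quoted above; adjust to taste.
      host_eur_per_month = 200
      physical_cores = 48
      host_ram_gib = 256
      oversubscription = 4  # virtual:physical vCPU ratio

      guests = physical_cores * oversubscription     # 192 single-vCPU guests
      eur_per_guest = host_eur_per_month / guests    # ~1.04 EUR/mo
      ram_per_guest = host_ram_gib / guests          # ~1.33 GiB

      print(f"{guests} guests: {eur_per_guest:.2f} EUR/mo, "
            f"{ram_per_guest:.2f} GiB RAM each")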

  4. You didn't include the amortized cost of a Blackwell GPU, which is an expense an order of magnitude larger than electricity.
  5. Warning: LLM-generated article, terribly difficult to follow and full of irrelevant details.
  6. Well, this was a trip down memory lane. I built a small game on Irrlicht at the time, and I remember these discussions too.

    Irrlicht had its own editor (irrEdit) and a sound system (irrKlang), and basic collision detection and an FPS controller were built right into the engine. This was enough to get you a considerable way through a fully featured tech demo, at the very least. (I even remember Irrlicht including a beautiful first-person tech demo of traversing a large BSP-partitioned castle level.)

    However, for those not afraid to stitch together these additional parts from other promising libraries (or derive them from first principles, as was fashionable), OGRE offered more raw rendering prowess: a working deferred shading system (this was the heyday of deferred shading), a pop-less terrain implementation with texture splatting, and more impressive shader and rendering pipeline support via the Cg multi-platform shading language. I remember fairly impressive ocean-surface and Fresnel refraction/reflection demos from OGRE at the time.

  7. What an astounding achievement. In 6 years, this person has written not only a very well-designed microkernel, but also a build system, a UEFI bootloader, a graphical shell, a UI framework, and a browser engine.

    The story of 10x developers among us is not a myth... if anything, it's understated.

  8. Common sense:

    - The compute requirements would be massive compared to the rest of the industry

    - Not a single large open source lab has trained anything over 32B dense in the recent past

    - There is considerable crosstalk between researchers at large labs; notice how all of them seem to be going in similar directions all the time. If dense models of this size actually provided a benefit over MoE, the info would've spread like wildfire.

  9. tok/s cannot in any way be used to estimate parameter count. It's a tradeoff made at inference time: you can adjust your batch size to serve one user at a huge tok/s or many users at a slow tok/s.
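
    To make that concrete, here's a roofline-flavored toy model in Python. Every number below is invented purely for illustration; real deployments are messier, but the shape of the tradeoff holds: the same model serves one user fast or many users slower, so observed tok/s pins down nothing about parameter count.

      # Toy decode-step model: each step must stream the active weights
      # from memory once and do ~2 * params FLOPs per token of compute.
      # All hardware and model numbers below are hypothetical.
      params = 70e9          # hypothetical active parameter count
      bytes_per_param = 2    # bf16 weights
      mem_bw = 3.35e12       # hypothetical HBM bandwidth, bytes/s
      compute = 1.0e15       # hypothetical usable FLOP/s

      for batch in (1, 8, 64, 512):
          t_mem = params * bytes_per_param / mem_bw   # stream weights once
          t_flops = batch * 2 * params / compute      # matmul work
          step = max(t_mem, t_flops)                  # slower bound wins
          print(f"batch={batch:>3}: {1/step:6.1f} tok/s per user, "
                f"{batch/step:9.0f} tok/s aggregate")

    With these made-up numbers, per-user speed is flat until the compute bound kicks in, while aggregate throughput grows with batch size, so the same weights can look "fast" or "slow" depending purely on serving configuration.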
  10. There's no way Sonnet 4 or Opus 4 are dense models.
  11. Are you saying that you think Sonnet 4 has 100B-200B _active_ params? And that Opus has 2T active? What data are you basing these outlandish assumptions on?
  12. Seems heavily vibe coded, down to the Claude-generated README and a lot of the LLM prompts themselves (which I have found work very poorly compared to human-written prompts). While none of this is necessarily bad, it raises the burden of proof that the tool actually works beyond toy problems [0]. I think everyone would appreciate some examples of vulnerabilities it can find. The missing JWT check showcased in the screenshot would probably have been caught by ordinary AI code review, so to my eye that by itself is not persuasive.

    Good luck!

    [0]: Why I say this: a 10kLOC piece of software that was mostly human-written would have required a large amount of testing, even if manual, to ensure that it works reliably at all. All that testing and experimentation would naturally force a certain depth of exploration of the approach, the LLM prompts, etc. across a variety of use cases. A mostly AI-written codebase of this size would have required much less testing to reach "doesn't crash and runs reliably", so that depth is no longer a given.

  13. Thanks for sharing this! It's difficult to find good examples of useful codebases where coding agents have done most of the work. I'm always actively looking at how I can push these agents to do more for me, and it's very instructive to hear from somebody who has had success on this level. (A writeup would be nice to read, too.)
  14. Well, Gemini Flash Lite is at least one, and likely two, orders of magnitude larger than this model.
  15. Gemini 2.5 Pro is severely kneecapped in this evaluation. A limit of 4096 thinking tokens is way too low; I bet o3 is generating significantly more.
  16. It just matches the 90% discount that Claude models have had for quite a while. I don't see anything groundbreaking...
  17. o3 pricing: $8/Mtok out

    GPT-5 pricing: $10/Mtok out

    What am I missing?

  18. SWE-Bench Verified score, with thinking, ties Opus 4.1 without thinking.

    AIME scores do not appear too impressive at first glance.

    They are downplaying benchmarks heavily in the live stream. This is the lab that has been flexing benchmarks as headline figures since forever.

    This is a product-focused update. There is no significant jump in raw intelligence or agentic behavior against SOTA.

  19. They have worse scores than recent open source releases on a number of agentic and coding benchmarks, so if absolute quality is what you're after and not just cost/efficiency, you'd probably still be running those models.

    Let's not forget, this is a thinking model that has significantly worse scores on Aider-Polyglot than the non-thinking Qwen3-235B-A22B-Instruct-2507, a worse TAUBench score than the smaller GLM-4.5 Air, and a worse SWE-Bench Verified score than the (3x the size) GLM-4.5. So the results, at least in terms of benchmarks, are not really clear-cut.

    From a vibes perspective, the non-reasoners Kimi-K2-Instruct and the aforementioned non-thinking Qwen3 235B are much better at frontend design. (Tested privately, but I fully expect DesignArena to back me up in the coming weeks.)

    OpenAI has delivered something astonishing for the size, for sure. But your claim is just an exaggeration, and OpenAI has, unsurprisingly, highlighted only the benchmarks where the model does _really_ well.

  20. GLM-4.5 seems to outperform it on TauBench, too. And it's suspicious that OAI isn't sharing numbers for quite a few useful benchmarks (nothing related to coding, for example).

    One positive thing I see is the parameter count and size: it will make inference more economical than the current open source SOTA.

  21. This can be dangerous, because Claude doesn't truly understand why it did something. Whatever it writes is a post-hoc justification, which may or may not be accurate to the actual "intent". This is because these are still autoregressive models: they have only the context to go on, not prior intent.
  22. I never understood the hate. Beyond the strange syntax, it's not terribly different from a language such as Pascal: an old imperative language without too much magic (beyond some odd syntactic sugar).
  23. Same here. Reading the article, I could not really relate to the experience of being a single-language developer for 10 years.

    In my early days, I identified strongly with my chosen programming language, but people way more experienced than me taught me that a programming language is a tool, and that this approach is akin to saying "well, I don't know about those pliers, I am a hammerer."

    My personal feeling from working across a wide range of programming languages is that it expands your horizons in a massive way (one that's hard to describe qualitatively), and I'm happy that I did it.

  24. You're assuming that the whole model has to be in SRAM.
  25. Yes, there have been multiple (very big) hints dropped by various people that they had no official cooperation.
  26. Just curious: why do you prefer/have a requirement of quad-based meshes?
  27. Mentioned in TFA as well.
  28. Post-training allows leveraging the considerable world and language understanding of the underlying pretrained model. The intuition is that this boosts performance.
  29. A detail that is not mentioned is that Google models >= Gemini 2.0 are all explicitly post-trained for this task of bounding box detection: https://ai.google.dev/gemini-api/docs/image-understanding

    The author's use of the specific `box_2d` format suggests he is taking advantage of this feature, so I wanted to highlight it. My intuition is that a base multimodal LLM without this type of post-training would have much worse performance.
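
    For reference, a minimal sketch of that documented workflow with the google-genai Python SDK (the model name, file name, and exact SDK surface are my assumptions and may have drifted; the box_2d convention itself is from the docs linked above):

      import json
      from google import genai
      from PIL import Image

      client = genai.Client()  # assumes an API key in the environment
      image = Image.open("scene.jpg")  # hypothetical input image

      prompt = ("Detect all prominent objects. Return a JSON list where "
                "each entry has 'label' and 'box_2d' as "
                "[ymin, xmin, ymax, xmax] normalized to 0-1000.")
      resp = client.models.generate_content(
          model="gemini-2.5-flash", contents=[image, prompt])

      # Per the docs, coordinates are normalized to 0-1000; scale them
      # back to pixel space. (Real responses may arrive wrapped in
      # markdown fences and need stripping before json.loads.)
      w, h = image.size
      for det in json.loads(resp.text):
          ymin, xmin, ymax, xmax = det["box_2d"]
          print(det["label"], (xmin * w / 1000, ymin * h / 1000,
                               xmax * w / 1000, ymax * h / 1000))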

