If you scale that setup and add a couple of used RTX 3090s with heavy memory offloading, you can technically run something in the K2 class.
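For anyone wondering what "heavy memory offloading" looks like in practice, here's a minimal sketch with llama-cpp-python, assuming a quantised GGUF and two 3090s. The model path, layer count, and split ratios are placeholders, not a tested K2 config.

```python
# Sketch only: offload as many layers as fit on the two 3090s, keep the rest in system RAM.
from llama_cpp import Llama

llm = Llama(
    model_path="models/kimi-k2-instruct-q2_k.gguf",  # placeholder quantised GGUF
    n_gpu_layers=20,          # only this many layers go to the GPUs; the rest stay on CPU
    tensor_split=[0.5, 0.5],  # spread the offloaded layers evenly across both cards
    n_ctx=8192,               # keep the context modest; the KV cache also eats VRAM
)

out = llm("Summarise this diff:", max_tokens=64)
print(out["choices"][0]["text"])
```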
That's around 350,000 tokens in a day. I don't track my Claude/Codex usage, but Kilocode with the free Grok model does, and I'm using between 3.3M and 50M tokens a day (plus additional usage in Claude + Codex + Mistral Vibe + Amp Coder).
I'm trying to imagine a use case where I'd want this. Maybe running some small coding task overnight? But it just doesn't seem very useful.
Yesterday I asked Claude to write one function. I didn't ask it to do anything else because it wouldn't have been helpful.
Essentially migrating codebases and implementing features, plus all the reading of existing code and the writing of tests and automation scripts needed to verify that the changes are okay. Over 95% of those tokens are reads, since there's often a need for a lot of consistency and iteration.
It works pretty well if you’re not limited by a tight budget.
Have a bunch of other side projects as well as my day job.
It's pretty easy to get through lots of tokens.
For coding-related tasks I consume 30-80M tokens per day, and I want something as fast as it gets.
It's also unbeatable in price to performance, as the next best 24 GiB card would be the 4090, which, even used, is almost triple the price these days while only offering about 25-30% more performance in real-world AI workloads.
You can basically get an NVLink-bridged dual-3090 setup for less money than a single used 4090, with about the same or even more performance and double the available VRAM.
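Rough maths behind that claim, with made-up absolute prices; only the "~3x the price" and "25-30% faster" ratios come from the comment above.

```python
# Placeholder prices in USD; adjust to your local used market.
used_3090 = {"price": 700, "vram_gib": 24}
used_4090 = {"price": 2100, "vram_gib": 24}   # roughly 3x a used 3090

dual_3090_price = 2 * used_3090["price"]       # still cheaper than one used 4090
print(f"Dual 3090: ${dual_3090_price}, 48 GiB VRAM")
print(f"Single 4090: ${used_4090['price']}, 24 GiB VRAM, ~1.25-1.3x the per-card speed")
print(f"GiB of VRAM per dollar: {48 / dual_3090_price:.3f} vs {24 / used_4090['price']:.3f}")
```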
> The tensor performance of the 3090 is also abysmal.
I, for one, compared my 50-series card's performance to my 3090 and didn't see "abysmal performance" on the older card at all. In fact, in actual real-world use (quantised models only; nobody runs big fp32 models locally), the difference in performance isn't very noticeable. But I'm sure you'll be able to provide actual numbers (TTFT, TPS) to prove me wrong. I don't use diffusion models, so there might be a substantial difference there (I doubt it, though), but for LLMs I can tell you for a fact that you're just wrong.
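If anyone wants to collect those numbers themselves, this is the kind of quick-and-dirty script I mean, pointed at a local OpenAI-compatible server (llama.cpp, vLLM, etc.). The endpoint and model name are placeholders, and counting one streamed chunk as one token is an approximation.

```python
import time
from openai import OpenAI  # pip install openai; works against any OpenAI-compatible server

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")  # placeholder endpoint

start = time.perf_counter()
first_token_at = None
chunks = 0

stream = client.chat.completions.create(
    model="local-model",  # whatever name your server advertises
    messages=[{"role": "user", "content": "Explain KV caching in two sentences."}],
    stream=True,
    max_tokens=256,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        if first_token_at is None:
            first_token_at = time.perf_counter()  # time to first token
        chunks += 1

gen_time = time.perf_counter() - (first_token_at or start)
print(f"TTFT: {first_token_at - start:.2f}s")
print(f"TPS (approx., one chunk ~ one token): {chunks / gen_time:.1f}")
```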
Also, the 3090 supports NVLink, which is actually more useful for inference speed than native BF16 support.
Maybe bf16 matters if you're training?
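For context on why the NVLink bridge matters for inference: tensor parallelism is what makes two 3090s behave like one bigger card, and the per-layer all-reduce between the two halves is exactly the traffic the bridge speeds up. A minimal sketch with vLLM, where the model name is just an example of something that won't fit on a single 24 GiB card; NCCL uses NVLink automatically when the bridge is installed.

```python
from vllm import LLM, SamplingParams

# Shard the weights across both 3090s; inter-GPU traffic goes over NVLink if present.
llm = LLM(
    model="Qwen/Qwen2.5-32B-Instruct-AWQ",  # example quantised model, not a recommendation
    tensor_parallel_size=2,
)
params = SamplingParams(max_tokens=128, temperature=0.7)
print(llm.generate(["Why does NVLink help tensor parallelism?"], params)[0].outputs[0].text)
```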