qeternity
8,693 karma

  1. It has been a thing. Within a single request, the same KV cache is reused for each forward pass (see the sketch below).

    It took a while for companies to start metering it and charging accordingly.

    Companies have also invested in hierarchical caches that allow longer-term and cross-cluster caching.
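
    A minimal sketch of that reuse, with toy single-head attention (the dimensions and the Wq/Wk/Wv projections are made up for illustration, not taken from any real model):

    ```python
    import torch

    d = 64
    Wq, Wk, Wv = (torch.randn(d, d) for _ in range(3))  # toy projections
    cache_k, cache_v = [], []  # the per-request KV cache

    def decode_step(x_new):  # x_new: (1, d) embedding of the newest token
        cache_k.append(x_new @ Wk)  # project only the new token...
        cache_v.append(x_new @ Wv)  # ...and append it to the cache
        K, V = torch.cat(cache_k), torch.cat(cache_v)
        attn = torch.softmax((x_new @ Wq) @ K.T / d**0.5, dim=-1)
        return attn @ V  # keys/values for earlier tokens come from the cache

    for _ in range(5):  # five decode steps, each reusing the prior steps' K/V
        y = decode_step(torch.randn(1, d))
    ```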

  2. In a chat setting you hit the cache every time you add a new prompt: all historical question/answer pairs are part of the context and don’t need to be prefilled again.

    On the API side, imagine you are doing document processing and have a 50k-token instruction prompt that you reuse for every document (see the sketch below).

    It’s extremely viable and used all the time.
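
    A toy sketch of the idea, with a dict standing in for the server's real KV-cache store (every name here is illustrative):

    ```python
    import hashlib

    prefix_cache = {}  # stand-in for the real KV-cache store

    def prefill(prefix: str):
        key = hashlib.sha256(prefix.encode()).hexdigest()
        if key not in prefix_cache:
            # Computed once; a real server would keep the KV tensors here.
            prefix_cache[key] = f"kv-state-for-{len(prefix)}-chars"
        return prefix_cache[key]

    instructions = "...the 50k-token instruction prompt..."
    for doc in ["doc A", "doc B", "doc C"]:
        state = prefill(instructions)  # cache hit on every document after the first
        # only the (much shorter) document itself still needs prefilling
    ```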

  3. > It’s much easier to tax the general population than businesses, as they don’t push back as much.

    Businesses don't pay taxes; people do. Every dime a corporation pays is a reduction in capital returns to shareholders or a reduction in investment in business activity, both of which are taxed again in the hands of the people who ultimately receive the capital.

  4. Quantization is not some magical dial you can just turn. In practice you basically have three choices: FP16, FP8, and FP4 (toy comparison below).

    Also, thinking time means more tokens, which cost more, especially at the API level, where you are paying per token and it would be trivially observable.

    There is basically no evidence that either of these is occurring in the way you suggest (boosting up and down).
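
    As a rough illustration of why those are the realistic stops, a toy round-trip-error comparison (assumes a PyTorch build recent enough to ship the float8 dtypes; FP4 has no native torch dtype at all, which is part of the point):

    ```python
    import torch

    x = torch.randn(4096)  # stand-in for a weight tensor
    for dtype in (torch.float16, torch.float8_e4m3fn):
        err = (x - x.to(dtype).float()).abs().mean().item()
        print(f"{dtype}: mean abs round-trip error {err:.5f}")
    # FP4 needs dedicated packing and kernels, which is why precision
    # comes as a few discrete choices rather than a dial.
    ```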

  5. This does not prove, at all, what you are claiming.
  6. Or, perhaps more likely, a large corporation simply has economies of scale that smaller retailers cannot compete with.
  7. > Reminder that infrastructure is ~3% of the budget, the military is ~13%. Almost all of the rest are benefits, either health or money, for old people or various poverty reduction schemes. Or debt.

    The US Government is an enormous welfare program with a military on the side.

    Whenever people talk about the rich paying their fair share, they simply fail to grasp the enormity of the problem. There is no taxing your way out of it.

    Society's expectations have far outpaced our fiscal strength.

  8. How about you cite something.
  9. It would help people to consider your point if you made even a modest attempt to explain and justify what you mean.
  10. > life is genuinely worse today than it was 20 years ago, mostly because of technology

    Extraordinary claims require extraordinary evidence. Almost everything today in absolute terms is better than 20 years ago, even more so outside the developed world.

    What specifically today is worse than 20 years ago?

  11. These are completely different. Agents (aside from the model inference) are not CPU-bound. You gain much more from a wider user base than from whatever marginal CPU cycles Rust/Go would save.

    Video games are of course a different story.

  12. > but V3 (from February) has a 32B parameter model that runs on "16GB or more" of VRAM[1]

    No. They released a distilled version of R1 based on a Qwen 32B model. This is not V3, and it's not remotely close to R1 or V3.2.

  13. > DeepSeek and Qwen will function on cheap GPUs that other models will simply choke on.

    Uh, DeepSeek will not (unless you are referring to one of their older R1-finetuned variants). Any flagship DeepSeek model will require 16x A100/H100 or better with NVLink to run in FP8 (back-of-envelope below).
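
    Back-of-envelope, assuming the ~671B total parameters of the flagship V3/R1 family:

    ```python
    params = 671e9           # approx. total parameter count
    bytes_per_param = 1      # FP8
    weights_gb = params * bytes_per_param / 1e9   # ~671 GB for weights alone
    gpus = weights_gb / 80   # 80GB-class A100/H100 cards
    print(f"{weights_gb:.0f} GB of weights -> {gpus:.1f}+ GPUs before KV cache or activations")
    ```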

  14. Yes, absolutely in deep learning. Custom fused CUDA kernels everywhere.
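
    For a flavor of what fusion buys, a toy Triton kernel (Python-embedded; needs a CUDA GPU and the triton package) that does an add and a ReLU in a single pass over memory instead of two kernel launches:

    ```python
    import torch, triton
    import triton.language as tl

    @triton.jit
    def fused_add_relu(x_ptr, y_ptr, out_ptr, n, BLOCK: tl.constexpr):
        offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
        mask = offs < n
        x = tl.load(x_ptr + offs, mask=mask)
        y = tl.load(y_ptr + offs, mask=mask)
        # add + relu in one trip through memory
        tl.store(out_ptr + offs, tl.maximum(x + y, 0.0), mask=mask)

    x = torch.randn(1 << 20, device="cuda")
    y = torch.randn_like(x)
    out = torch.empty_like(x)
    fused_add_relu[(triton.cdiv(x.numel(), 1024),)](x, y, out, x.numel(), BLOCK=1024)
    ```
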
  15. This is not the case for LLMs. FP16/BF16 training precision is standard, with FP8 inference very common, though labs are already moving to FP8 training and even FP4 (sketch of the standard recipe below).
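
    A minimal sketch of that standard recipe (BF16 autocast over FP32 master weights), with a throwaway model and a CUDA device assumed:

    ```python
    import torch

    model = torch.nn.Linear(512, 512, device="cuda")  # weights stay FP32
    opt = torch.optim.AdamW(model.parameters())

    x = torch.randn(8, 512, device="cuda")
    with torch.autocast("cuda", dtype=torch.bfloat16):  # matmuls run in BF16
        loss = model(x).pow(2).mean()
    loss.backward()  # gradients land back in FP32
    opt.step()
    ```
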
  16. PyTorch is only part of it. There is still a huge amount of CUDA that isn’t just wrapped by PyTorch and isn’t easily portable.
  17. > Also, all this vector stuff is going to fade away as context windows get larger (already started over the past 8 months or so).

    People who say this really have not thought it through, or simply don't understand what the use cases for vector search are.

    But even if you had infinite context with perfect attention, attention isn't free, even if it were linear. It's much, much cheaper to index your data than to reprocess everything. You don't go around scanning entire databases when you're just interested in the row where id=X (see the sketch below).
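
    A toy illustration of the asymmetry, with random stand-in embeddings: the index answers a query with one matmul over precomputed vectors instead of pushing the whole corpus back through a model:

    ```python
    import numpy as np

    # 100k docs embedded once, offline (all sizes made up)
    docs = np.random.randn(100_000, 384).astype(np.float32)
    docs /= np.linalg.norm(docs, axis=1, keepdims=True)

    query = docs[42] + 0.01 * np.random.randn(384).astype(np.float32)
    query /= np.linalg.norm(query)

    scores = docs @ query            # one cheap pass over the index
    top5 = np.argsort(-scores)[:5]   # vs. re-attending over every document per query
    print(top5)                      # doc 42 comes out on top
    ```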

  18. People really just go on the internet and say stuff.

    Code is speech. Speech is protected (at least in the US).

  19. If the client can generate a uuid4, they can also reuse a known uuid4.
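
    A toy example of the pitfall, with illustrative names: a server that dedupes on a client-supplied uuid4 can simply be fed the same id twice:

    ```python
    import uuid

    seen = set()

    def create_order(idempotency_key: str) -> str:
        # naive dedup that trusts the client-supplied uuid4
        if idempotency_key in seen:
            return "duplicate"
        seen.add(idempotency_key)
        return "created"

    key = str(uuid.uuid4())
    print(create_order(key))  # created
    print(create_order(key))  # the client can replay any key it has seen
    ```
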
  20. > Are they trying to be a full cloud platform like everyone else?

    Yes.
