I don't understand why we're paying for caching at all (except: model providers can charge for it). It's almost extortion: the provider stores some data for ~5 minutes on some disk, and gets to sell their highly limited GPU resources to someone else instead (because you're using the KV cache instead of GPU capacity for a good chunk of your input tokens).
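For anyone who hasn't seen the mechanics of prefix caching, here's a toy sketch of the idea: prefill (the expensive, GPU-bound pass over the prompt) only runs over the part of the prompt that isn't already cached. Everything below is illustrative, not any provider's actual implementation:

```python
from typing import Dict, List, Tuple

# prefix of tokens -> previously computed "KV states" (one fake float per token)
cache: Dict[Tuple[str, ...], List[float]] = {}

def expensive_prefill(tokens: List[str]) -> List[float]:
    # Stand-in for the real attention prefill a GPU would run over these tokens.
    return [float(len(t)) for t in tokens]

def prefill_with_cache(tokens: List[str]) -> List[float]:
    # Find the longest already-cached prefix; only the suffix touches "the GPU".
    cut = next((i for i in range(len(tokens), 0, -1)
                if tuple(tokens[:i]) in cache), 0)
    kv = (cache[tuple(tokens[:cut])] if cut else []) + expensive_prefill(tokens[cut:])
    # Remember every prefix for future requests (real systems do this at
    # fixed block granularity, and evict after a TTL, e.g. ~5 minutes).
    for i in range(1, len(tokens) + 1):
        cache[tuple(tokens[:i])] = kv[:i]
    return kv

base = ["You", "are", "a", "helpful", "assistant."]
prefill_with_cache(base + ["Hi!"])   # cold: prefill runs over all 6 tokens
prefill_with_cache(base + ["Bye!"])  # warm: 5-token prefix reused, prefill over 1 token
```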
They charge you 10% of their GPU-level prices for effectively _not_ using their GPU at all for the tokens that hit the cache.
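To put rough numbers on that (the $3/MTok base input price below is an assumption for illustration; the 10% cached-token rate is the ratio in question):

```python
# Input-side cost of one request, given how many of its tokens hit the cache.
# The base price is made up for illustration; only the 10% ratio matters here.

INPUT_PRICE = 3.00 / 1_000_000     # $/token for uncached input (assumed)
CACHED_PRICE = 0.10 * INPUT_PRICE  # cached tokens billed at 10% of base

def input_cost(total_tokens: int, cached_tokens: int) -> float:
    uncached = total_tokens - cached_tokens
    return uncached * INPUT_PRICE + cached_tokens * CACHED_PRICE

print(f"100k tokens, no cache:   ${input_cost(100_000, 0):.4f}")       # $0.3000
print(f"100k tokens, 90k cached: ${input_cost(100_000, 90_000):.4f}")  # $0.0570
```

So even at a 90% cache hit rate you're still paying 19% of the uncached bill, while the GPU only runs prefill over 10% of the tokens.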
If I'm missing something about how inference works that explains why there is still a cost for cached tokens, please let me know!