I don't understand why we're paying for caching at all (except: model providers can charge for it). It's almost extortion: the provider stores some data for ~5 minutes on some disk, and gets to sell their highly limited GPU resources to someone else instead (because you're using the KV cache instead of GPU capacity for a good chunk of your input tokens).
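For anyone who hasn't seen the mechanics of prefix caching, here's a toy sketch of the idea: prefill (the expensive, GPU-bound pass over the prompt) only runs over the part of the prompt that isn't already cached. Everything below is illustrative, not any provider's actual implementation:

```python
from typing import Dict, List, Tuple

# prefix of tokens -> previously computed "KV states" (one fake float per token)
cache: Dict[Tuple[str, ...], List[float]] = {}

def expensive_prefill(tokens: List[str]) -> List[float]:
    # Stand-in for the real attention prefill a GPU would run over these tokens.
    return [float(len(t)) for t in tokens]

def prefill_with_cache(tokens: List[str]) -> List[float]:
    # Find the longest already-cached prefix; only the suffix touches "the GPU".
    cut = next((i for i in range(len(tokens), 0, -1)
                if tuple(tokens[:i]) in cache), 0)
    kv = (cache[tuple(tokens[:cut])] if cut else []) + expensive_prefill(tokens[cut:])
    # Remember every prefix for future requests (real systems do this at
    # fixed block granularity, and evict after a TTL, e.g. ~5 minutes).
    for i in range(1, len(tokens) + 1):
        cache[tuple(tokens[:i])] = kv[:i]
    return kv

base = ["You", "are", "a", "helpful", "assistant."]
prefill_with_cache(base + ["Hi!"])   # cold: prefill runs over all 6 tokens
prefill_with_cache(base + ["Bye!"])  # warm: 5-token prefix reused, prefill over 1 token
```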
They charge you 10% of their GPU-level prices for effectively _not_ using their GPU at all for the tokens that hit the cache.
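To put rough numbers on that (the $3/MTok base input price below is an assumption for illustration; the 10% cached-token rate is the ratio in question):

```python
# Input-side cost of one request, given how many of its tokens hit the cache.
# The base price is made up for illustration; only the 10% ratio matters here.

INPUT_PRICE = 3.00 / 1_000_000     # $/token for uncached input (assumed)
CACHED_PRICE = 0.10 * INPUT_PRICE  # cached tokens billed at 10% of base

def input_cost(total_tokens: int, cached_tokens: int) -> float:
    uncached = total_tokens - cached_tokens
    return uncached * INPUT_PRICE + cached_tokens * CACHED_PRICE

print(f"100k tokens, no cache:   ${input_cost(100_000, 0):.4f}")       # $0.3000
print(f"100k tokens, 90k cached: ${input_cost(100_000, 90_000):.4f}")  # $0.0570
```

So even at a 90% cache hit rate you're still paying 19% of the uncached bill, while the GPU only runs prefill over 10% of the tokens.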
If I'm missing something about how inference works that explains why there is still a cost for cached tokens, please let me know!