Preferences

I don't understand why we're paying for caching at all (except: model providers can charge for it). It's almost extortion - the provider stores some data for 5min on some disk, and gets to sell their highly limited GPU resources to someone else instead (because you are using the kv cache instead of GPU capacity for a good chunk of your input tokens). They charge you 10% of their GPU-level prices for effectively _not_ using their GPU at all for the tokens that hit the cache.

If I'm missing something about how inference works that explains why there is still a cost for cached tokens, please let me know!


Keyboard Shortcuts

Story Lists

j
Next story
k
Previous story
Shift+j
Last story
Shift+k
First story
o Enter
Go to story URL
c
Go to comments
u
Go to author

Navigation

Shift+t
Go to top stories
Shift+n
Go to new stories
Shift+b
Go to best stories
Shift+a
Go to Ask HN
Shift+s
Go to Show HN

Miscellaneous

?
Show this modal