1) low latency is desired and the user prompt is long, 2) the function runs many parallel requests, but isn't fired with a common prefix very often. OpenAI was very inconsistent about properly caching the prefix for reuse across all the requests, but with Anthropic it's very easy to pre-fire the cache.
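e.g. something like this with the Python SDK (the model name, prompt text and questions are placeholders, and the prefix has to clear the model's minimum cacheable length, roughly 1024 tokens, or caching is silently skipped): one cheap call with a cache_control breakpoint warms the cache, then the parallel calls are all cache reads.

```python
import asyncio
import anthropic

client = anthropic.AsyncAnthropic()          # reads ANTHROPIC_API_KEY from the env

MODEL = "claude-3-5-sonnet-latest"           # placeholder: any model with prompt caching
LONG_PREFIX = "... the long shared prompt / document goes here ..."

# the shared prefix carries a cache_control breakpoint, so everything up to and
# including this block gets cached
SYSTEM = [{"type": "text", "text": LONG_PREFIX, "cache_control": {"type": "ephemeral"}}]

async def prefire() -> None:
    # cheap warm-up call: same prefix, one output token; once it returns,
    # the prefix has been written to the cache
    await client.messages.create(
        model=MODEL, max_tokens=1, system=SYSTEM,
        messages=[{"role": "user", "content": "ok"}],
    )

async def ask(question: str) -> str:
    resp = await client.messages.create(
        model=MODEL, max_tokens=1024, system=SYSTEM,
        messages=[{"role": "user", "content": question}],
    )
    return resp.content[0].text

async def main() -> None:
    await prefire()                          # pay the cache write once, up front
    answers = await asyncio.gather(          # the parallel calls are all cache reads
        ask("summarise section 1"),
        ask("summarise section 2"),
        ask("summarise section 3"),
    )
    print(answers)

asyncio.run(main())
```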
a simple alternative approach is to introduce hysteresis by having both a high and a low context limit: if you hit the higher limit, trim down to the lower one. this batches the cache misses together instead of paying one on every turn once you're sitting at the limit.
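roughly like this (count_tokens and the two limits are stand-ins for illustration):

```python
HIGH_LIMIT = 150_000   # illustrative: trigger a trim when the prompt crosses this
LOW_LIMIT = 100_000    # illustrative: how far down each trim goes

def count_tokens(msg) -> int:
    # crude stand-in for a real tokenizer
    return len(msg["content"]) // 4

def trim_with_hysteresis(messages):
    """only trim once HIGH_LIMIT is crossed, then trim all the way down to
    LOW_LIMIT. between trims the prefix never changes, so every request is a
    cache hit; the misses are batched into the (rare) trim points."""
    total = sum(count_tokens(m) for m in messages)
    if total <= HIGH_LIMIT:
        return messages                          # prefix unchanged -> cache hit
    trimmed = list(messages)
    while trimmed and total > LOW_LIMIT:
        total -= count_tokens(trimmed.pop(0))    # drop the oldest message
    return trimmed
```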
if users are able to edit, remove or re-generate earlier messages, you can improve on that further by keeping track of cache prefixes and their TTLs: rather than blindly trimming to the lower limit, you trim to the longest still-active cache prefix, and only if there are none do you trim to the lower limit.
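a sketch of that bookkeeping, with a few assumptions baked in: the prompt you actually send is a window history[start:] over the stored conversation, a "cache prefix" is remembered as (start, length, content hash, timestamp) so edits and removals invalidate it, the TTL is the default ~5 minute ephemeral one, and count_tokens / LOW_LIMIT are the same kind of stand-ins as above.

```python
import hashlib
import time
from dataclasses import dataclass

CACHE_TTL = 5 * 60        # assumption: the default ~5 minute ephemeral cache lifetime
LOW_LIMIT = 100_000       # illustrative token budget for the blind fallback trim

def count_tokens(msg) -> int:
    # crude stand-in for a real tokenizer
    return len(msg["content"]) // 4

def digest(messages) -> str:
    # identify a concrete run of messages by content, so edits/removals are detected
    h = hashlib.sha256()
    for m in messages:
        h.update(m["role"].encode())
        h.update(m["content"].encode())
    return h.hexdigest()

@dataclass
class CachedPrefix:
    start: int      # index into the full history where the sent prompt began
    length: int     # how many messages that cached prefix covered
    digest: str     # content hash of those messages at send time
    sent_at: float  # when we sent it (drives the TTL check)

class PrefixTracker:
    def __init__(self):
        self.records: list[CachedPrefix] = []

    def record(self, history, start: int, length: int) -> None:
        """call after every request: history[start:start+length] was just
        (re)cached, so note it with a fresh timestamp."""
        prefix = history[start:start + length]
        self.records.append(CachedPrefix(start, length, digest(prefix), time.monotonic()))

def choose_window_start(history, tracker: PrefixTracker) -> int:
    """where should the next prompt window begin? prefer the start of the
    longest still-warm cached prefix whose messages haven't been edited or
    removed; only if none qualifies, fall back to the blind low-limit trim."""
    now = time.monotonic()
    best_start, best_tokens = None, -1
    for rec in tracker.records:
        if now - rec.sent_at >= CACHE_TTL:
            continue                                     # expired -> no cache hit
        prefix = history[rec.start:rec.start + rec.length]
        if len(prefix) < rec.length or digest(prefix) != rec.digest:
            continue                                     # edited/removed -> cache invalid
        tokens = sum(count_tokens(m) for m in prefix)
        if tokens > best_tokens:
            best_start, best_tokens = rec.start, tokens
    if best_start is not None:
        return best_start                                # trim to the cached prefix
    # no active cache prefix: blind trim, keeping the most recent messages
    total, start = 0, len(history)
    while start > 0 and total + count_tokens(history[start - 1]) <= LOW_LIMIT:
        total += count_tokens(history[start - 1])
        start -= 1
    return start
```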
for example, if a user sends a large number of tokens, like a file, plus a question, and then they change the question.
if call #1 is the file, call #2 is the file + the question, and call #3 is the file + a different question, then yes, the file prefix gets reused from the cache.
and consider that "the file" can equally be a lengthy chat history, especially after the cache TTL has elapsed.
As far as I can tell it will indeed reuse the cache up to the point where the prompts diverge, so this works:
Prompt A + B + C - uncached
Prompt A + B + D - uses cache for A + B
Prompt A + E - uses cache for A
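fwiw, with Anthropic you make those reuse points explicit by putting cache_control breakpoints at the ends of A and B (up to four breakpoints per request); a sketch, with the model name and prompt text as placeholders:

```python
import anthropic

client = anthropic.Anthropic()

prompt_a = "... big shared document ..."     # placeholder text
prompt_b = "... second shared section ..."
prompt_d = "a different question about the second section"

resp = client.messages.create(
    model="claude-3-5-sonnet-latest",        # placeholder: any model with prompt caching
    max_tokens=512,
    messages=[{
        "role": "user",
        "content": [
            # breakpoint after A: a later "A + E" request can reuse up to here
            {"type": "text", "text": prompt_a, "cache_control": {"type": "ephemeral"}},
            # breakpoint after B: an "A + B + D" request reuses up to here
            {"type": "text", "text": prompt_b, "cache_control": {"type": "ephemeral"}},
            {"type": "text", "text": prompt_d},
        ],
    }],
)

# usage shows what actually happened: cache_creation_input_tokens on the first
# call, cache_read_input_tokens on follow-ups that share the prefix
print(resp.usage)
```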