
patrick451
> The key advantage of doing batched inference at a huge scale is that once you maximize parallelism and sharding, your model parameters and the memory bandwidth associated with them are essentially free (since at any given moment they're being shared among a huge amount of requests!)

Kind of unrelated, but this comment made me wonder when we will start seeing side-channel attacks that force co-batched queries to leak into each other.
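For intuition about the claim being quoted, here is a minimal numpy sketch (sizes are made up, and this is not how any particular serving stack is implemented): in batched inference the same weight matrix is applied to every request in the batch in a single matmul, so the cost of streaming the parameters from memory is amortized across all of them.

```python
import numpy as np

# Hypothetical sizes, just for illustration.
d_model, batch = 4096, 64

W = np.random.randn(d_model, d_model).astype(np.float32)  # shared model weights
x = np.random.randn(batch, d_model).astype(np.float32)    # one row per request

# One matmul serves all `batch` requests: W is read once per step,
# so its memory-bandwidth cost per request shrinks as the batch grows.
y = x @ W.T
```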


jeffrallen
I asked a colleague about this recently, and he explained it away with a wave of the hand: "different streams of tokens and their context are on different ranks of the matrices." And I kinda believed him, based on the diagrams I've seen on the Welch Labs YouTube channel.
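If I understand the hand-wave correctly (a sketch under my own assumptions, not a statement about how any real inference server works), the idea is that each request occupies its own row of the batch dimension, so a plain matmul never mixes rows: request i's output depends only on request i's input.

```python
import numpy as np

d_model, batch = 8, 4

W = np.random.randn(d_model, d_model).astype(np.float32)  # shared weights
x = np.random.randn(batch, d_model).astype(np.float32)    # one row per request

# Row i of the batched result equals the result of running request i alone:
# the weights are shared, but nothing in the arithmetic mixes batch rows.
batched = x @ W.T
alone = np.stack([x[i] @ W.T for i in range(batch)])
assert np.allclose(batched, alone)
```

That only covers the arithmetic, though; timing, cache behavior, and batching-scheduler effects are a separate question, which is exactly where a side channel would live.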

On the other hand, I've learned that when I ask security questions of experts in a field (who are not experts in security), I almost always get convincing hand-waves, and those hand-waves are almost always proven completely wrong.

Sigh.
