Comment by zargon - Hacker Neue

> Where do you move the weights anyway once they are loaded into the GPU VRAM?

The GPU can’t do anything with weights while they are in VRAM. They have to be moved into the GPU itself first.

So it is about memory round trips, but not between RAM and VRAM. It’s the round trips between VRAM and the registers on the GPU die. With batching, the calculations for all batched requests can be done while the model parameters are in the GPU registers. Done sequentially instead, the number of trips between VRAM and the GPU gets multiplied by the number of individual inferences.
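To put rough numbers on that, here is a back-of-the-envelope sketch. It assumes a simplified model where every weight is streamed from VRAM exactly once per forward pass and ignores activations and KV-cache traffic. The function name and the 7B-parameter, fp16, 32-request figures are made up for illustration.

```python
def weight_bytes_read(param_count, bytes_per_param, num_requests, batched):
    """Bytes of weights streamed from VRAM to run one forward pass per request.

    Simplifying assumption: a batched pass loads each weight once and reuses it
    for every request; a sequential run reloads all weights for each request.
    """
    passes = 1 if batched else num_requests
    return param_count * bytes_per_param * passes


# Hypothetical model: 7e9 parameters stored as fp16 (2 bytes), 32 requests.
params = 7_000_000_000
sequential = weight_bytes_read(params, 2, 32, batched=False)
batched = weight_bytes_read(params, 2, 32, batched=True)

print(sequential // batched)  # 32, i.e. the ratio equals the batch size
```

Under these assumptions the weight traffic saved scales linearly with batch size, which is the multiplication described above.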

Also, batched prompts and outputs are indeed mathematically independent from each other.
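A toy illustration of that independence, for a single linear layer only (a real transformer has more going on, so treat this as a sketch). The helper names and the tiny integer matrices are hypothetical. Each weight row is visited once and applied to every request in the batch, and the results match running each request on its own.

```python
def matvec(weights, x):
    """Apply a weight matrix to a single input vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def batched_matvec(weights, batch):
    """Apply the same weights to every input, visiting each weight row once."""
    outputs = [[] for _ in batch]
    for row in weights:  # each row is fetched one time...
        for out, x in zip(outputs, batch):  # ...and reused for every request
            out.append(sum(w * xi for w, xi in zip(row, x)))
    return outputs


weights = [[1, 2], [3, 4], [5, 6]]
batch = [[1, 0], [0, 1], [2, 3]]

separate = [matvec(weights, x) for x in batch]
together = batched_matvec(weights, batch)
assert separate == together  # no request's output depends on its neighbours
```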

menaerus 3 hours ago

Round trips between VRAM and GPU registers? That's what the cache hierarchies are for. I think you've confused quite a few concepts here.

Moving data to and from VRAM has ~100 ns of latency. Moving data from RAM to VRAM over PCIe 5.0 has 1-10 µs of latency. So the difference is ~1 to ~2 orders of magnitude.

And this is the reason why batching is used - you don't want to pay the price of that latency for each and every CPU-to-GPU request but you want to push as much data as you can through a single round-trip.
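As a rough sketch of that amortization argument: if one CPU-to-GPU round trip has a fixed latency, the share each request pays shrinks with batch size. This assumes the latency is purely fixed and ignores the bandwidth-dependent part of the transfer. The function name is hypothetical and 10 µs is just the upper end of the range quoted above.

```python
def amortized_latency_us(fixed_latency_us, batch_size):
    """Per-request share of one fixed-latency round trip shared by a batch."""
    return fixed_latency_us / batch_size


# Assumed fixed cost of 10 µs per PCIe round trip.
for batch_size in (1, 8, 64):
    print(batch_size, amortized_latency_us(10.0, batch_size))
```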

pests 1 hour ago

> That's what the cache hierarchies are for

That’s the core point though. If you batch, the cache and registers are already primed and ready. The model runs in steps/layers, accessing different weights in VRAM along the way. Batching lets every request in the batch reuse each set of weights while it is loaded.

I agree that RAM to VRAM is important too, but I feel the key speedup from inference batching is the point above.
