Comment by pests - Hacker Neue

> That's what the cache hierarchies are for

That’s the core point though. If you do batches the cache and registers are already primed and ready. The model runs in steps/layers accessing different weights in VRAM along the way. When batching you take advantage of this.

I’m in agreement that RAM to VRAM is important too but I feel the key speed up for inference batching is my above point.

menaerus 1 hour ago

Not really. Registers are irrelevant. They are not the bottleneck.

This item has no comments currently.