That’s the core point though. If you do batches the cache and registers are already primed and ready. The model runs in steps/layers accessing different weights in VRAM along the way. When batching you take advantage of this.
I’m in agreement that RAM to VRAM is important too but I feel the key speed up for inference batching is my above point.
menaerus
Not really. Registers are irrelevant. They are not the bottleneck.
That’s the core point though. If you do batches the cache and registers are already primed and ready. The model runs in steps/layers accessing different weights in VRAM along the way. When batching you take advantage of this.
I’m in agreement that RAM to VRAM is important too but I feel the key speed up for inference batching is my above point.