Usually the reason you'd want high network bandwidth is distributed training.
For inference you can probably get by with 2 GB/s, assuming you can split the layers up nicely (pipeline parallelism, where only the activations at each stage boundary cross the link). The interconnect can still be a bottleneck for inference, but only for networks with very large activations combined with big batch sizes, or if you are doing tensor-level parallelism.
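To see why pipeline-style splits are so cheap on bandwidth, here's a back-of-envelope sketch: per generated token, only one activation tensor crosses the link between stages. All the figures below (hidden size, batch size, tokens/sec, fp16 activations) are illustrative assumptions, not measurements from any particular model.

```python
def pipeline_link_gbps(batch_size, hidden_size, tokens_per_sec, dtype_bytes=2):
    """Rough GB/s needed to ship one layer's activations per generated token.

    Assumes pipeline parallelism: the only cross-device traffic is the
    activation tensor at the stage boundary (batch x hidden, fp16 by default).
    """
    bytes_per_token = batch_size * hidden_size * dtype_bytes
    return bytes_per_token * tokens_per_sec / 1e9

# Hypothetical 7B-class model: hidden ~4096, fp16, batch 8, 50 tok/s
print(pipeline_link_gbps(8, 4096, 50))  # a tiny fraction of 2 GB/s
```

Tensor-level parallelism is different: every layer does all-reduces over activations, so the traffic scales with the number of layers and the link gets hit far more often, which is why it's the case where interconnect speed actually matters.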