Each node has 4 GPUs, and each of those has a dedicated network interface card capable of 200 Gbps each way. Data can move directly from one GPU's memory to another over the network.
But it's not just bandwidth that allows the machine to run so well; it's also a very low-latency network. Many science codes require very frequent synchronizations, and low latency lets them scale out to tens of thousands of endpoints.
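For a sense of what "directly from one GPU's memory to another" can look like in code, here's a minimal GPU-aware MPI sketch with HIP device buffers. This is an illustration under assumptions, not Frontier's documented recipe: it presumes an MPI build with GPU support enabled (e.g. something like `MPICH_GPU_SUPPORT_ENABLED=1` with Cray MPICH) and a HIP-aware compiler wrapper.

```c
/* Sketch: passing GPU memory straight to MPI, no host staging buffer.
 * Assumes GPU-aware MPI is available and enabled; run with >= 2 ranks. */
#include <mpi.h>
#include <hip/hip_runtime.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int n = 1 << 20;                 /* 1 Mi doubles = 8 MiB payload */
    double *buf;
    hipMalloc((void **)&buf, n * sizeof(double));   /* device allocation */

    if (rank == 0) {
        /* MPI reads directly from GPU memory on the sending side. */
        MPI_Send(buf, n, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* ...and lands directly in GPU memory on the receiving side. */
        MPI_Recv(buf, n, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    hipFree(buf);
    MPI_Finalize();
    return 0;
}
```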
> 200 Gbps
Oh wow, that’s pretty bad.
That's 200 Gbps from that card to any other point in the other 9,408 nodes in the system. Including file storage.
Within the node, bandwidth between the GPUs is considerably higher. There's an architecture diagram at <https://docs.olcf.ornl.gov/systems/frontier_user_guide.html> that helps show the topology.
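As a rough back-of-envelope comparison: the 200 Gbps figure is the one quoted above, while the intra-node rate below is a hypothetical placeholder (not a Frontier spec) just to show why the faster in-node links matter.

```c
/* Back-of-envelope: time to move 1 GiB at the quoted per-NIC rate
 * versus a hypothetical faster intra-node GPU-to-GPU link. */
#include <stdio.h>

int main(void) {
    const double nic_gbps        = 200.0;   /* per-NIC, each direction (from the thread) */
    const double intra_node_gbps = 800.0;   /* made-up placeholder for an in-node link   */
    const double gib_bits        = 1024.0 * 1024.0 * 1024.0 * 8.0;  /* bits in 1 GiB     */

    printf("1 GiB over the NIC:       %.1f ms\n", 1e3 * gib_bits / (nic_gbps * 1e9));
    printf("1 GiB GPU-to-GPU in-node: %.1f ms\n", 1e3 * gib_bits / (intra_node_gbps * 1e9));
    return 0;
}
```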
I see, OK, I misinterpreted it as per-node bandwidth. Yes, this makes more sense and is probably fast enough for most workloads.