Each node has 4 GPUs, and each of those has a dedicated network interface card capable of 200 Gbps each way. Data can move directly from one GPU's memory to another over the network.
But it's not just bandwidth that allows the machine to run so well; it's also a very low-latency network. Many science codes require very frequent synchronizations, and low latency lets them scale out to tens of thousands of endpoints.
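For a sense of what "directly from one GPU's memory to another" can look like in code, here's a minimal GPU-aware MPI sketch with HIP device buffers. This is an illustration under assumptions, not Frontier's documented recipe: it presumes an MPI build with GPU support enabled (e.g. something like `MPICH_GPU_SUPPORT_ENABLED=1` with Cray MPICH) and a HIP-aware compiler wrapper.

```c
/* Sketch: passing GPU memory straight to MPI, no host staging buffer.
 * Assumes GPU-aware MPI is available and enabled; run with >= 2 ranks. */
#include <mpi.h>
#include <hip/hip_runtime.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int n = 1 << 20;                 /* 1 Mi doubles = 8 MiB payload */
    double *buf;
    hipMalloc((void **)&buf, n * sizeof(double));   /* device allocation */

    if (rank == 0) {
        /* MPI reads directly from GPU memory on the sending side. */
        MPI_Send(buf, n, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* ...and lands directly in GPU memory on the receiving side. */
        MPI_Recv(buf, n, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    hipFree(buf);
    MPI_Finalize();
    return 0;
}
```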
> 200 Gbps
Oh wow, that’s pretty bad.
That's 200 Gbps from that card to any other point in the other 9,408 nodes in the system. Including file storage.
Within the node, bandwidth between the GPUs is considerably higher. There's an architecture diagram at <https://docs.olcf.ornl.gov/systems/frontier_user_guide.html> that helps show the topology.
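As a rough back-of-envelope comparison: the 200 Gbps figure is the one quoted above, while the intra-node rate below is a hypothetical placeholder (not a Frontier spec) just to show why the faster in-node links matter.

```c
/* Back-of-envelope: time to move 1 GiB at the quoted per-NIC rate
 * versus a hypothetical faster intra-node GPU-to-GPU link. */
#include <stdio.h>

int main(void) {
    const double nic_gbps        = 200.0;   /* per-NIC, each direction (from the thread) */
    const double intra_node_gbps = 800.0;   /* made-up placeholder for an in-node link   */
    const double gib_bits        = 1024.0 * 1024.0 * 1024.0 * 8.0;  /* bits in 1 GiB     */

    printf("1 GiB over the NIC:       %.1f ms\n", 1e3 * gib_bits / (nic_gbps * 1e9));
    printf("1 GiB GPU-to-GPU in-node: %.1f ms\n", 1e3 * gib_bits / (intra_node_gbps * 1e9));
    return 0;
}
```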
I see, OK, I misinterpreted it as per-node bandwidth. Yes, this makes more sense and is probably fast enough for most workloads.