- Sure, I usually measure performance of methods like these in terms of FLOP/s; getting 50-65% of theoretical peak FLOP/s for any given CPU or GPU hardware is close to ideal.
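For concreteness, this is the kind of back-of-the-envelope accounting I mean, as a C++ sketch; the 20 flops-per-pair figure, the timings, and the hardware peak are placeholder assumptions, not numbers from any particular code or chip:

    // Turning a timed N^2 sweep into a fraction-of-peak number.
    // Flops-per-pair, elapsed time, and peak below are assumptions for
    // illustration; substitute your own measurements and hardware specs.
    #include <cstdio>

    int main() {
        const double n = 100000.0;            // hypothetical particle count
        const double flops_per_pair = 20.0;   // assumed cost of one interaction
        const double elapsed_s = 0.25;        // hypothetical wall time of the sweep

        const double gflops = flops_per_pair * n * n / elapsed_s * 1.0e-9;

        // peak = cores * clock (GHz) * flops per core per cycle
        const double peak_gflops = 16.0 * 3.0 * 32.0;   // hypothetical CPU

        std::printf("%.0f GFLOP/s = %.0f%% of peak\n",
                    gflops, 100.0 * gflops / peak_gflops);
        return 0;
    }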
- The general algorithm used here (of computing attraction and repulsion forces between pairs of particles) is very similar to that used in simulations of many interesting phenomena in physics. Start with Smoothed Particle Hydrodynamics (https://en.wikipedia.org/wiki/Smoothed-particle_hydrodynamic...) and then check out Lagrangian Vortex Particle Methods and other N-Body problems (https://en.wikipedia.org/wiki/N-body_problem).
And the algorithms to solve these quickly are another deep area of research.
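As a point of reference, the skeleton shared by all of these methods looks roughly like the sketch below: a bare O(N^2) gravitational direct sum. SPH and vortex particle methods substitute their own interaction kernels, but the pairwise structure is the same.

    // Minimal O(N^2) pairwise-force sweep (gravitational attraction here).
    #include <cmath>
    #include <vector>

    struct Particle { double x, y, z, mass, ax, ay, az; };

    void accumulate_forces(std::vector<Particle>& p) {
        for (auto& pi : p) { pi.ax = pi.ay = pi.az = 0.0; }
        for (size_t i = 0; i < p.size(); ++i) {
            for (size_t j = 0; j < p.size(); ++j) {
                if (i == j) continue;
                const double dx = p[j].x - p[i].x;
                const double dy = p[j].y - p[i].y;
                const double dz = p[j].z - p[i].z;
                const double r2 = dx*dx + dy*dy + dz*dz;
                const double inv_r = 1.0 / std::sqrt(r2);
                const double f = p[j].mass * inv_r * inv_r * inv_r;  // G folded into mass
                p[i].ax += f * dx;
                p[i].ay += f * dy;
                p[i].az += f * dz;
            }
        }
    }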
- Something doesn't add up here. The listed peak fp64 performance assumes one fp64 operation per clock per thread, yet there's very little description of how each PE performs 8 flops per cycle, only that "threads are paired up such that one can take over processing when another one stalls...", which is classic latency-hiding. So the performance figures must assume that each PE has an 8-wide SIMD unit (16-wide for fp32), 8 separately schedulable execution units, or 4 FMA EUs, none of which seem likely given the supposed simplicity of the core. Am I missing something?
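To make the sanity check explicit, this is the arithmetic I'm doing; the numbers below are placeholders, not the figures from the article:

    // Back-of-the-envelope check: what per-PE throughput does a quoted peak imply?
    // All numbers are placeholders for illustration only.
    #include <cstdio>

    int main() {
        const double quoted_peak_gflops = 1000.0;  // hypothetical advertised fp64 peak
        const double num_pes = 256.0;              // hypothetical PE count
        const double clock_ghz = 0.5;              // hypothetical clock

        // flops each PE must retire every cycle to reach the quoted peak
        const double flops_per_pe_per_cycle = quoted_peak_gflops / (num_pes * clock_ghz);

        // ~8 implies wide SIMD or multiple FMA units per PE;
        // ~1-2 is what a simple scalar core with latency hiding can sustain.
        std::printf("implied flops/PE/cycle: %.1f\n", flops_per_pe_per_cycle);
        return 0;
    }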
- Um, no?
This is a fine collection of links - much to learn! - but the connection between flow and gravitation is (in my understanding) limited to both being Green's function solutions of a Poisson problem. https://en.wikipedia.org/wiki/Green%27s_function
There are n-body methods for both (gravitation and Lagrangian vortex particle methods), and I find the similarities and differences of those algorithms quite interesting.
But the Fedi paper misses that key connection: they're simply describing a source/sink in potential flow, not some newly discovered link.
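For anyone who wants the connection spelled out, both fields come from a Poisson equation whose free-space solution is an integral against the same kind of Green's function (written here for 3D; the 2D vortex case uses a log kernel instead):

    % Gravitation: potential from mass density
    \nabla^2 \phi = 4\pi G \rho,
    \qquad
    \phi(\mathbf{x}) = -G \int \frac{\rho(\mathbf{x}')}{|\mathbf{x}-\mathbf{x}'|}\, d^3x'

    % Incompressible flow: streamfunction (vector potential) from vorticity
    \nabla^2 \boldsymbol{\psi} = -\boldsymbol{\omega},
    \qquad
    \boldsymbol{\psi}(\mathbf{x}) = \frac{1}{4\pi} \int \frac{\boldsymbol{\omega}(\mathbf{x}')}{|\mathbf{x}-\mathbf{x}'|}\, d^3x'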
- Yes, the author uses a globally-adaptive time stepper, which is only efficient for very small N. There are adaptive time step methods that are local, and those are used for large systems.
If you see bodies flung out after close passes, three solutions are available: reduce the time step, use a higher-order time integrator, or (the most common method) add regularization. Regularization (often called "softening") removes the singularity by adding a constant to the squared distance, so 1 over zero becomes 1 over a small but finite number.
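A minimal sketch of what softening looks like in a gravitational kernel; eps2 here is a hypothetical squared softening length, typically chosen relative to the inter-particle spacing:

    // Softened gravitational acceleration of particle i due to particle j.
    // With eps2 > 0 the force stays finite even when the two particles coincide.
    #include <cmath>

    struct Vec3 { double x, y, z; };

    Vec3 softened_accel(const Vec3& xi, const Vec3& xj, double mj, double eps2) {
        const double dx = xj.x - xi.x;
        const double dy = xj.y - xi.y;
        const double dz = xj.z - xi.z;
        const double r2 = dx*dx + dy*dy + dz*dz + eps2;   // the regularization
        const double inv_r = 1.0 / std::sqrt(r2);
        const double f = mj * inv_r * inv_r * inv_r;      // G folded into mj
        return { f*dx, f*dy, f*dz };
    }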
- This has not been my experience on the academic/research side. Poisson solver-based incompressible CFD regularly runs ~10x faster on equivalently-priced GPU systems, and has done so for as long as I've been following it (since 2008). Some FFT-based solvers don't weak-scale ideally, but that would be even worse for CPU-based versions, as they use similar algorithms and would be spread over many more nodes.
- I'm surprised no one has mentioned Vc. I found ispc clunky and not as performant, and std::simd didn't support some useful math ops like rsqrt. Vc has been around for years, I have no trouble including it in my codes, it has masking and many of the most useful math ops, and I can get over 1 TF/s on a consumer-grade Ryzen and at least 3 TF/s on the big Epyc CPUs.
https://github.com/VcDevel/Vc https://github.com/Applied-Scientific-Research/nvortexVc
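To give a feel for why this is pleasant, here is a rough sketch of the kind of inner loop I mean, written against Vc's vector types; names like Vc::float_v and Vc::rsqrt are from memory, so check the headers before copying:

    // One target particle against a vector's worth of source particles.
    #include <Vc/Vc>

    Vc::float_v kernel(Vc::float_v sx, Vc::float_v sy, Vc::float_v sm,
                       float tx, float ty, float eps2) {
        const Vc::float_v dx = sx - Vc::float_v(tx);
        const Vc::float_v dy = sy - Vc::float_v(ty);
        const Vc::float_v r2 = dx*dx + dy*dy + Vc::float_v(eps2);
        const Vc::float_v ir = Vc::rsqrt(r2);      // the rsqrt that std::simd lacked
        return sm * ir * ir * ir;                  // scaled 1/r^3 factor
    }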
- The Earth is a complex, multi-physics system, and the OP's claim to "Simulate the Earth" is misleading. Methods that work on the atmosphere may not work on other parts. There are numerous scientific projects working on simulating earthquakes, both using ML and more "traditional" physics.
- Each node has 4 GPUs, and each of those has a dedicated network interface card capable of 200 Gbps each way. Data can move right from one GPU's memory to another. But it's not just bandwidth that allows the machine to run so well, it's a very low-latency network as well. Many science codes require very frequent synchronizations, and low latency permits them to scale out to tens of thousands of endpoints.
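A crude way to see why latency matters as much as bandwidth: under the usual alpha-beta model, message time is a fixed latency term plus a size-over-bandwidth term, and for the small messages in a tight synchronization the latency term dominates. The constants below are illustrative placeholders, not Frontier's measured numbers:

    // Alpha-beta message-time model: t = alpha + bytes / bandwidth.
    #include <cstdio>

    int main() {
        const double alpha_s = 2.0e-6;          // assumed ~2 us network latency
        const double bw_Bps = 25.0e9;           // 200 Gbps = 25 GB/s each way
        const double small_msg = 8.0 * 1024.0;  // 8 KiB halo/reduction message
        const double large_msg = 256.0e6;       // 256 MB bulk transfer

        std::printf("8 KiB: %.2f us (latency is most of it)\n",
                    1e6 * (alpha_s + small_msg / bw_Bps));
        std::printf("256 MB: %.2f ms (bandwidth is most of it)\n",
                    1e3 * (alpha_s + large_msg / bw_Bps));
        return 0;
    }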
- https://docs.olcf.ornl.gov/systems/frontier_user_guide.html
This will have much of what you need.