- Sure, I usually measure performance of methods like these in terms of FLOP/s; getting 50-65% of theoretical peak FLOP/s for any given CPU or GPU hardware is close to ideal.
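For concreteness, this is the kind of back-of-the-envelope accounting I mean, as a C++ sketch; the 20 flops-per-pair figure, the timings, and the hardware peak are placeholder assumptions, not numbers from any particular code or chip:

    // Turning a timed N^2 sweep into a fraction-of-peak number.
    // Flops-per-pair, elapsed time, and peak below are assumptions for
    // illustration; substitute your own measurements and hardware specs.
    #include <cstdio>

    int main() {
        const double n = 100000.0;            // hypothetical particle count
        const double flops_per_pair = 20.0;   // assumed cost of one interaction
        const double elapsed_s = 0.25;        // hypothetical wall time of the sweep

        const double gflops = flops_per_pair * n * n / elapsed_s * 1.0e-9;

        // peak = cores * clock (GHz) * flops per core per cycle
        const double peak_gflops = 16.0 * 3.0 * 32.0;   // hypothetical CPU

        std::printf("%.0f GFLOP/s = %.0f%% of peak\n",
                    gflops, 100.0 * gflops / peak_gflops);
        return 0;
    }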
- The general algorithm used here (of computing attraction and repulsion forces between pairs of particles) is very similar to that used in simulations of many interesting phenomena in physics. Start with Smoothed Particle Hydrodynamics (https://en.wikipedia.org/wiki/Smoothed-particle_hydrodynamic...) and then check out Lagrangian Vortex Particle Methods and other N-Body problems (https://en.wikipedia.org/wiki/N-body_problem).
And the algorithms to solve these quickly are another deep area of research.
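As a point of reference, the skeleton shared by all of these methods looks roughly like the sketch below: a bare O(N^2) gravitational direct sum. SPH and vortex particle methods substitute their own interaction kernels, but the pairwise structure is the same.

    // Minimal O(N^2) pairwise-force sweep (gravitational attraction here).
    #include <cmath>
    #include <vector>

    struct Particle { double x, y, z, mass, ax, ay, az; };

    void accumulate_forces(std::vector<Particle>& p) {
        for (auto& pi : p) { pi.ax = pi.ay = pi.az = 0.0; }
        for (size_t i = 0; i < p.size(); ++i) {
            for (size_t j = 0; j < p.size(); ++j) {
                if (i == j) continue;
                const double dx = p[j].x - p[i].x;
                const double dy = p[j].y - p[i].y;
                const double dz = p[j].z - p[i].z;
                const double r2 = dx*dx + dy*dy + dz*dz;
                const double inv_r = 1.0 / std::sqrt(r2);
                const double f = p[j].mass * inv_r * inv_r * inv_r;  // G folded into mass
                p[i].ax += f * dx;
                p[i].ay += f * dy;
                p[i].az += f * dz;
            }
        }
    }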
- Something doesn't add up here. The listed peak fp64 performance assumes one fp64 operation per clock per thread, yet there's very little description of how each PE performs 8 flops per cycle, only that "threads are paired up such that one can take over processing when another one stalls...", which is classic latency-hiding. So the performance figures must assume that each PE has an 8-wide SIMD unit (16-wide for fp32), 8 separately schedulable execution units, or 4 FMA EUs, none of which seem likely given the supposed simplicity of the core. Am I missing something?
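To make the sanity check explicit, this is the arithmetic I'm doing; the numbers below are placeholders, not the figures from the article:

    // Back-of-the-envelope check: what per-PE throughput does a quoted peak imply?
    // All numbers are placeholders for illustration only.
    #include <cstdio>

    int main() {
        const double quoted_peak_gflops = 1000.0;  // hypothetical advertised fp64 peak
        const double num_pes = 256.0;              // hypothetical PE count
        const double clock_ghz = 0.5;              // hypothetical clock

        // flops each PE must retire every cycle to reach the quoted peak
        const double flops_per_pe_per_cycle = quoted_peak_gflops / (num_pes * clock_ghz);

        // ~8 implies wide SIMD or multiple FMA units per PE;
        // ~1-2 is what a simple scalar core with latency hiding can sustain.
        std::printf("implied flops/PE/cycle: %.1f\n", flops_per_pe_per_cycle);
        return 0;
    }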
- Um, no?
This is a fine collection of links - much to learn! - but the connection between flow and gravitation is (in my understanding) limited to both being Green's function solutions of a Poisson problem. https://en.wikipedia.org/wiki/Green%27s_function
There are n-body methods for both (gravitation and Lagrangian vortex particle methods), and I find the similarities and differences of those algorithms quite interesting.
But the Fedi paper misses that key connection: they're simply describing a source/sink in potential flow, not some newly discovered link.
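For anyone who wants the connection spelled out, both fields come from a Poisson equation whose free-space solution is an integral against the same kind of Green's function (written here for 3D; the 2D vortex case uses a log kernel instead):

    % Gravitation: potential from mass density
    \nabla^2 \phi = 4\pi G \rho,
    \qquad
    \phi(\mathbf{x}) = -G \int \frac{\rho(\mathbf{x}')}{|\mathbf{x}-\mathbf{x}'|}\, d^3x'

    % Incompressible flow: streamfunction (vector potential) from vorticity
    \nabla^2 \boldsymbol{\psi} = -\boldsymbol{\omega},
    \qquad
    \boldsymbol{\psi}(\mathbf{x}) = \frac{1}{4\pi} \int \frac{\boldsymbol{\omega}(\mathbf{x}')}{|\mathbf{x}-\mathbf{x}'|}\, d^3x'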
- Yes, the author uses a globally-adaptive time stepper, which is only efficient for very small N. There are adaptive time step methods that are local, and those are used for large systems.
If you see bodies flung out after close passes, three solutions are available: reduce the time step, use a higher-order time integrator, or (the most common method) add regularization. Regularization (often called "softening") removes the singularity by adding a constant to the squared distance, so 1 over zero becomes 1 over a small but finite number.
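A minimal sketch of what softening looks like in a gravitational kernel; eps2 here is a hypothetical squared softening length, typically chosen relative to the inter-particle spacing:

    // Softened gravitational acceleration of particle i due to particle j.
    // With eps2 > 0 the force stays finite even when the two particles coincide.
    #include <cmath>

    struct Vec3 { double x, y, z; };

    Vec3 softened_accel(const Vec3& xi, const Vec3& xj, double mj, double eps2) {
        const double dx = xj.x - xi.x;
        const double dy = xj.y - xi.y;
        const double dz = xj.z - xi.z;
        const double r2 = dx*dx + dy*dy + dz*dz + eps2;   // the regularization
        const double inv_r = 1.0 / std::sqrt(r2);
        const double f = mj * inv_r * inv_r * inv_r;      // G folded into mj
        return { f*dx, f*dy, f*dz };
    }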
- This has not been my experience on the academic/research side. Poisson solver-based incompressible CFD regularly runs ~10x faster on equivalently-priced GPU systems, and has done so for as long as I've been following it (since 2008). Some FFT-based solvers don't weak-scale ideally, but that would be even worse for CPU-based versions, as they use similar algorithms and would be spread over many more nodes.
- I'm surprised no one has mentioned Vc. I found ispc clunky and not as performant, and std::simd didn't support some useful math ops like rsqrt. Vc has been around for years, I have no trouble including it in my codes, it has masking and many of the most useful math ops, and I can get over 1 TF/s on a consumer-grade Ryzen and at least 3 TF/s on the big Epyc CPUs.
https://github.com/VcDevel/Vc https://github.com/Applied-Scientific-Research/nvortexVc
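To give a feel for why this is pleasant, here is a rough sketch of the kind of inner loop I mean, written against Vc's vector types; names like Vc::float_v and Vc::rsqrt are from memory, so check the headers before copying:

    // One target particle against a vector's worth of source particles.
    #include <Vc/Vc>

    Vc::float_v kernel(Vc::float_v sx, Vc::float_v sy, Vc::float_v sm,
                       float tx, float ty, float eps2) {
        const Vc::float_v dx = sx - Vc::float_v(tx);
        const Vc::float_v dy = sy - Vc::float_v(ty);
        const Vc::float_v r2 = dx*dx + dy*dy + Vc::float_v(eps2);
        const Vc::float_v ir = Vc::rsqrt(r2);      // the rsqrt that std::simd lacked
        return sm * ir * ir * ir;                  // scaled 1/r^3 factor
    }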
- The Earth is a complex, multi-physics system, and the OP's claim to "Simulate the Earth" is misleading. Methods that work on the atmosphere may not work on other parts. There are numerous scientific projects working on simulating earthquakes, both using ML and more "traditional" physics.
- Each node has 4 GPUs, and each of those has a dedicated network interface card capable of 200 Gbps each way. Data can move right from one GPU's memory to another. But it's not just bandwidth that allows the machine to run so well, it's a very low-latency network as well. Many science codes require very frequent synchronizations, and low latency permits them to scale out to tens of thousands of endpoints.
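A crude way to see why latency matters as much as bandwidth: under the usual alpha-beta model, message time is a fixed latency term plus a size-over-bandwidth term, and for the small messages in a tight synchronization the latency term dominates. The constants below are illustrative placeholders, not Frontier's measured numbers:

    // Alpha-beta message-time model: t = alpha + bytes / bandwidth.
    #include <cstdio>

    int main() {
        const double alpha_s = 2.0e-6;          // assumed ~2 us network latency
        const double bw_Bps = 25.0e9;           // 200 Gbps = 25 GB/s each way
        const double small_msg = 8.0 * 1024.0;  // 8 KiB halo/reduction message
        const double large_msg = 256.0e6;       // 256 MB bulk transfer

        std::printf("8 KiB: %.2f us (latency is most of it)\n",
                    1e6 * (alpha_s + small_msg / bw_Bps));
        std::printf("256 MB: %.2f ms (bandwidth is most of it)\n",
                    1e3 * (alpha_s + large_msg / bw_Bps));
        return 0;
    }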
- https://docs.olcf.ornl.gov/systems/frontier_user_guide.html
This will have much of what you need.