You typically write a bash script with some metadata lines at the top that say how many nodes you want, how many cores on those nodes, and what accelerator hardware (if any) you need.
Then typically it’s just setting up the environment to run your software. On most supercomputers you need to use environment modules (`module load gcc/10.4`) to load up compilers, parallelism libraries, other software, and so on. You can sometimes set this stuff up on the login node to try things out and make sure they work, but you’ll generally get an angry email if you run processes for more than 10 minutes, because login nodes are a shared resource.
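A minimal sketch of what such a batch script might look like; the job name, module names/versions, and binary are made up and will vary from site to site:

```bash
#!/bin/bash
#SBATCH --job-name=my_sim        # hypothetical job name
#SBATCH --nodes=2                # how many nodes
#SBATCH --ntasks-per-node=32     # cores (MPI ranks) per node
#SBATCH --gres=gpu:2             # accelerators, if the partition has them
#SBATCH --time=04:00:00          # wall-clock limit

# Load the toolchain the code was built against (names are site-specific)
module load gcc/10.4 openmpi/4.1

srun ./my_simulation input.dat
```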
There’s a tension here because it’s often difficult to get this right. People often want to do things like `pip install <package>`, but that can leave a lot of performance on the table, because pre-compiled software usually targets lowest-common-denominator systems rather than high-end ones. On the other hand, cluster admins can’t install and precompile every Python package ever. EasyBuild and Spack aim to be package managers that make this easier.
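For instance, Spack can build a Python package from source tuned for the cluster’s actual CPUs rather than pulling a generic wheel; the package and target names below are just illustrative:

```bash
# Build NumPy (and its dependency chain) from source, optimized for a given CPU target
spack install py-numpy target=skylake_avx512   # target value is illustrative

# Make it available in the current shell
spack load py-numpy
```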
Source: worked in HPC in physics and then worked at a University cluster supporting users doing exactly this sort of thing.
90% of my interactions are ssh'ing into a login node and running code with SLURM, then downloading the data.
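Roughly that workflow, with a hypothetical hostname, job script, and paths:

```bash
ssh user@cluster.example.edu     # log in to the (shared) login node
sbatch job.sh                    # submit the batch script to SLURM
squeue -u $USER                  # check on the queued/running job

# later, from your workstation, pull the results back
rsync -av user@cluster.example.edu:~/results/ ./results/
```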
You typically develop programs with MPI/OpenMP to exploit multiple nodes and CPUs. In Fortran, this entails a few compiler directives and compiler flags.
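A sketch of the build-and-run side, assuming gfortran-based MPI wrappers; the flags, filenames, and core counts are illustrative:

```bash
# Compile a hybrid MPI + OpenMP Fortran code; -fopenmp enables the OpenMP directives
mpif90 -O3 -fopenmp solver.f90 -o solver

# In the batch script: one MPI rank per node, OpenMP threads filling the cores
export OMP_NUM_THREADS=32
srun --ntasks-per-node=1 --cpus-per-task=32 ./solver
```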
This will have much of what you need.
It's like Kubernetes, but invented long before Kubernetes existed.