AArch64 does indeed have a proper atomic max, but even on x86-64 you can get a wait-free atomic max as long as you only need to support integers below 64 (0 through 63). In that case you can simply do a `lock or` with `1 << i` to record the value `i`; the current maximum is then the index of the highest set bit. You can even support larger ranges by using multiple 64-bit words in memory, e.g. four 64-bit words for a u8 maximum.
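To make the bitmask trick concrete, here is a minimal sketch in Rust. The type names `BitMax` and `ByteMax` are made up for illustration, and it assumes `fetch_or` with an unused result gets lowered to a `lock or` on x86-64, which compilers typically do but which I haven't verified for every toolchain.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical wait-free maximum over the values 0..=63, kept as a bit set.
pub struct BitMax(AtomicU64);

impl BitMax {
    pub const fn new() -> Self {
        BitMax(AtomicU64::new(0))
    }

    /// Record `value` by setting bit `value`. One atomic OR, no retry loop.
    pub fn record(&self, value: u32) {
        assert!(value < 64, "only values 0..=63 fit in one word");
        self.0.fetch_or(1u64 << value, Ordering::Relaxed);
    }

    /// Highest recorded value, or `None` if nothing was recorded yet.
    pub fn max(&self) -> Option<u32> {
        let bits = self.0.load(Ordering::Relaxed);
        if bits == 0 {
            None
        } else {
            Some(63 - bits.leading_zeros())
        }
    }
}

/// Same idea stretched to a full u8 range with four 64-bit words.
pub struct ByteMax([AtomicU64; 4]);

impl ByteMax {
    pub fn new() -> Self {
        ByteMax([
            AtomicU64::new(0),
            AtomicU64::new(0),
            AtomicU64::new(0),
            AtomicU64::new(0),
        ])
    }

    /// Pick the word from the top two bits of `value`, the bit from the low six.
    pub fn record(&self, value: u8) {
        let word = (value / 64) as usize;
        let bit = value % 64;
        self.0[word].fetch_or(1u64 << bit, Ordering::Relaxed);
    }

    /// Scan from the highest word down and return the first set bit found.
    pub fn max(&self) -> Option<u8> {
        for word in (0..4).rev() {
            let bits = self.0[word].load(Ordering::Relaxed);
            if bits != 0 {
                return Some((word as u32 * 64 + 63 - bits.leading_zeros()) as u8);
            }
        }
        None
    }
}
```

Note that the multi-word read is not a single atomic snapshot. Since bits are only ever set and never cleared, the value returned is always one that was actually recorded, which is probably good enough for most uses.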
In most cases it's even better to just store a maximum per thread separately and loop over all threads once to compute the current maximum if you really need it.
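A rough sketch of the per-thread variant, again in Rust. `PerThreadMax`, `Slot`, and `demo` are hypothetical names, the 64-byte alignment is an assumption about cache line size, and the sketch assumes each thread is handed a fixed slot index up front.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::thread;

/// One slot per thread, aligned to an assumed 64-byte cache line
/// so that neighbouring slots do not share a line.
#[repr(align(64))]
pub struct Slot(AtomicU64);

/// Hypothetical per-thread maximum tracker.
pub struct PerThreadMax {
    slots: Vec<Slot>,
}

impl PerThreadMax {
    pub fn new(threads: usize) -> Self {
        PerThreadMax {
            slots: (0..threads).map(|_| Slot(AtomicU64::new(0))).collect(),
        }
    }

    /// Must only be called by the thread that owns `thread_idx`.
    /// With a single writer per slot, a plain load and store suffice;
    /// no read-modify-write instruction is needed.
    pub fn record(&self, thread_idx: usize, value: u64) {
        let slot = &self.slots[thread_idx].0;
        if value > slot.load(Ordering::Relaxed) {
            slot.store(value, Ordering::Relaxed);
        }
    }

    /// Loop over all slots once to get the current global maximum.
    /// Returns 0 if nothing has been recorded.
    pub fn max(&self) -> u64 {
        self.slots
            .iter()
            .map(|s| s.0.load(Ordering::Relaxed))
            .max()
            .unwrap_or(0)
    }
}

/// Four threads each record 0..1000 scaled by (index + 1).
pub fn demo() -> u64 {
    let tracker = PerThreadMax::new(4);
    thread::scope(|scope| {
        for idx in 0..4usize {
            let tracker = &tracker;
            scope.spawn(move || {
                for v in 0..1000u64 {
                    tracker.record(idx, v * (idx as u64 + 1));
                }
            });
        }
    });
    // All threads have joined here, so the scan sees every final value.
    tracker.max()
}
```

The hot path touches only memory owned by the recording thread, so there is no contention at all; the cost moves to the (presumably rare) reader, which scans one slot per thread.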