That and https://github.com/b0nes164/GPUSorting have been a tremendous help for me, since CUB does not work nicely with the CUDA Driver API. The author is doing amazing work.
At roughly what order of magnitude in the number of elements does GPU sorting break even with a pure CPU sort? I'm thinking of the overhead of the GPU setup cost.
No idea, unfortunately. For me it's mandatory to sort on the GPU because the data already resides on the GPU, and copying it to the CPU (and the results back to the GPU) would be too costly.
This looks amazing. I've been shopping for an implementation of this that I could play around with for a while now.
They mention promising results on Apple Silicon GPUs and even cite the contributions from Vello, but I don't see a Metal implementation in there and the benchmark only shows results from an RTX 2080. Is it safe to assume that they're referring to the WGPU version when talking about M-series chips?
Oh! I have a SIMD prefix sum lying around in Rust; I use it for bitmap rasterization of fonts. Looking at the comments, I guess this isn't a popular use case, but it's useful nonetheless. Doing it on the GPU looks really fun.
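For anyone wondering how a prefix sum ends up in font rasterization, here is a minimal sketch of the accumulation-buffer idea as I understand it: edges write signed coverage deltas into a buffer, and a running sum turns them into per-pixel coverage. The function name `coverage_from_deltas` and the hand-written scanline are made up for illustration; this is not the code from any particular rasterizer, and it is plain Python rather than SIMD.

```python
from itertools import accumulate


def coverage_from_deltas(deltas):
    """Turn signed per-pixel coverage deltas into coverage values in [0, 1].

    The running (inclusive prefix) sum of the deltas is the coverage at
    each pixel; abs() and the clamp keep the result in range regardless
    of winding direction.
    """
    return [min(abs(total), 1.0) for total in accumulate(deltas)]


# One 8-pixel scanline (hypothetical data): the shape enters halfway
# through pixel 2 and leaves halfway through pixel 5, so each edge
# spreads its delta over the pixel it crosses and the next one.
row = [0.0, 0.0, 0.5, 0.5, 0.0, -0.5, -0.5, 0.0]
print(coverage_from_deltas(row))
```

The running sum is the part that is worth vectorizing or moving to the GPU, since every output depends on all inputs to its left.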
I'm working on a game that has a lot of units, and I used to use the old Sebastian Lague + NVIDIA approach: 2D binning -> cells/keys -> sort, so that you can search for neighbours efficiently (along with some modifications I added over time, such as Morton encoding).
But then during a break the other day I read up on radix sort, and right after that I implemented a prefix sum for spatial partitioning that also incorporates a bit table, CAS operations for multithreaded modifications, etc. After learning the core radix concept I more or less came up with the idea of using it that way myself, which was quite pleasing.
Props to the author, I'll definitely be spending some time scanning the collection to find some alternate options.
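To make the prefix-sum binning idea above concrete, here is a rough single-threaded sketch: count units per cell, take an exclusive prefix sum to get each cell's start offset, then scatter unit indices into one flat array. All names (`build_grid`, `neighbours`, the grid parameters) are hypothetical, it assumes every point lies inside the grid with non-negative coordinates, and it leaves out the bit table and Morton encoding. On a GPU or with multiple threads, the `cursor[c] += 1` step is where an atomic increment or CAS would go.

```python
from itertools import accumulate


def build_grid(points, cell_size, grid_w, grid_h):
    """Counting sort of point indices by grid cell (assumes points are in bounds)."""
    def cell_of(p):
        return int(p[1] // cell_size) * grid_w + int(p[0] // cell_size)

    # Pass 1: histogram of points per cell.
    counts = [0] * (grid_w * grid_h)
    for p in points:
        counts[cell_of(p)] += 1

    # Exclusive prefix sum: starts[c] is the first slot of cell c,
    # and starts[c + 1] is one past its last slot.
    starts = [0] + list(accumulate(counts))

    # Pass 2: scatter each point index into its cell's range.
    cursor = starts[:-1].copy()
    order = [0] * len(points)
    for i, p in enumerate(points):
        c = cell_of(p)
        order[cursor[c]] = i
        cursor[c] += 1  # would be an atomic add in a parallel version
    return starts, order


def neighbours(starts, order, cell_size, grid_w, grid_h, p):
    """Indices of all points in the 3x3 block of cells around p."""
    cx, cy = int(p[0] // cell_size), int(p[1] // cell_size)
    found = []
    for ny in range(max(cy - 1, 0), min(cy + 2, grid_h)):
        for nx in range(max(cx - 1, 0), min(cx + 2, grid_w)):
            c = ny * grid_w + nx
            found.extend(order[starts[c]:starts[c + 1]])
    return found


points = [(0.5, 0.5), (3.5, 3.5), (1.2, 0.1), (0.9, 0.9)]
starts, order = build_grid(points, 1.0, 4, 4)
print(sorted(neighbours(starts, order, 1.0, 4, 4, (0.5, 0.5))))
```

The nice property is that a neighbour query is just a handful of contiguous slices, with no comparison sort involved at all.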
Is that relevant for 4x4 multiplications? Because at least for me, radix sort is way more important than multiplying matrices beyond 4x4. E.g. for Gaussian Splatting.
https://github.com/mooman219/fontdue/blob/master/src/platfor...
https://developer.nvidia.com/gpugems/gpugems3/part-vi-gpu-co...