saagarjha
The thing I find really disappointing about CUDA is that Nvidia could provide the synchronization primitives needed to do this easily, but they don't. Scheduling on their cores remains really dumb, even though I know there is a bunch of work being done behind the scenes to service whatever async warp-specialized matrix multiplication instruction they added in this generation. It's just that there's no way to access it directly and you have to use the little bespoke bits that get exposed in each generation :(