refibrillator
Ha, made me chuckle. For those wondering seriously about this, it's not a viable optimization. Weights are not readily compressible via JPEG/DCT, and there are only a limited number of these decoder units on the chip, which bottlenecks throughput. Their decode throughput is dwarfed by simply reading the uncompressed weights from HBM.
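A toy illustration of the compressibility point (my own sketch, not from the thread). JPEG-style compression works because the DCT packs most of a smooth block's energy into a few coefficients, so the rest can be quantized away. The sketch below uses a hand-rolled orthonormal 1-D DCT-II on blocks of 8 values (JPEG uses 8x8 blocks). It models "weights" as i.i.d. zero-mean Gaussian noise, which is a simplifying assumption because real trained weights have some structure. The "image-like" blocks are an offset ramp plus a sinusoid, also made up for illustration.

```python
import math
import random


def dct2(block):
    """Orthonormal 1-D DCT-II, the transform JPEG applies along each axis of a block."""
    n = len(block)
    out = []
    for k in range(n):
        scale = math.sqrt(1 / n) if k == 0 else math.sqrt(2 / n)
        s = sum(x * math.cos(math.pi * (i + 0.5) * k / n) for i, x in enumerate(block))
        out.append(scale * s)
    return out


def energy_in_top(coeffs, keep):
    """Fraction of total energy held by the `keep` largest-magnitude coefficients."""
    energies = sorted((c * c for c in coeffs), reverse=True)
    return sum(energies[:keep]) / sum(energies)


def average_compaction(blocks, keep):
    """Mean energy fraction captured by `keep` DCT coefficients across all blocks."""
    return sum(energy_in_top(dct2(b), keep) for b in blocks) / len(blocks)


N = 8  # block length, matching JPEG's 8-sample block edge
rng = random.Random(0)

# Image-like blocks: a positive offset, a slow ramp, and a gentle sinusoid.
smooth = [[10 + 0.5 * i + math.sin(0.3 * (i + b)) for i in range(N)] for b in range(200)]
# Weight-like blocks (assumption): zero-mean Gaussian values with no local correlation.
weights = [[rng.gauss(0, 0.02) for _ in range(N)] for _ in range(200)]

smooth_frac = average_compaction(smooth, keep=2)
weight_frac = average_compaction(weights, keep=2)
print(f"image-like blocks: top 2 of 8 coefficients hold {smooth_frac:.3f} of the energy")
print(f"weight-like blocks: top 2 of 8 coefficients hold {weight_frac:.3f} of the energy")
```

On the smooth blocks, keeping 2 of 8 coefficients retains almost all of the energy. On the noise-like blocks it retains much less, because an orthonormal transform of i.i.d. Gaussian noise is still i.i.d. Gaussian noise. That leaves the DCT nothing to compact.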
It seems like this is indeed possible using video codecs:
https://arxiv.org/abs/2407.00467v1
Yeah, but they could be.
I won a GPU hackathon back in 2019 doing something very similar to this, though the other way around: I was compressing weights using the hardware modules.