The Qwen 8B number, if verified, is very impressive. Much more practical than the previous megakernel one.
That being said, this design of one persistent kernel per SM reminds me of Larrabee, and now I'm wondering what the world would look like if we had just taken the traditional process-thread-SIMD path rather than the CUDA path.