And unlike a lot of research code, this actually runs well. I was able to reproduce the results on Modal GPUs; my code is here: https://github.com/mirage-project/mirage/pull/327/files
Triton + FlashInfer: Prompt length 39, generate length 264, per-token latency 19.189573345762312 ms
MPK: Prompt length 39, generate length 334, per-token latency 7.71875 ms
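For a rough sense of scale, here is a quick sketch of the arithmetic on those two numbers. The variable names are mine, and it assumes per-token latency is the fair metric to compare, since the two runs generated different numbers of tokens (264 vs 334). It works out to roughly a 2.5x speedup for MPK.

```python
# Per-token latencies copied from the two runs above, in milliseconds.
triton_flashinfer_ms = 19.189573345762312
mpk_ms = 7.71875

# Speedup is the ratio of the baseline latency to the MPK latency.
speedup = triton_flashinfer_ms / mpk_ms

# Decode throughput is the inverse of per-token latency.
triton_flashinfer_tps = 1000.0 / triton_flashinfer_ms
mpk_tps = 1000.0 / mpk_ms

print(f"MPK speedup over Triton + FlashInfer: {speedup:.2f}x")
print(f"Triton + FlashInfer: {triton_flashinfer_tps:.1f} tokens/s")
print(f"MPK: {mpk_tps:.1f} tokens/s")
```

This is a single run on one prompt, so treat the ratio as indicative rather than a benchmark.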