
The improvement is real!

And unlike a lot of research, the code actually runs well. I was able to reproduce the results on Modal GPUs; the code I used is here: https://github.com/mirage-project/mirage/pull/327/files

Triton + FlashInfer: Prompt length 39, generate length 264, per-token latency 19.189573345762312 ms

MPK: Prompt length 39, generate length 334, per-token latency 7.71875 ms
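
To put the two numbers side by side, here is a minimal sketch that turns the per-token latencies above into a speedup ratio and tokens-per-second figures. The function and variable names are hypothetical, not part of the linked code, and since the two runs generated different numbers of tokens (264 vs. 334), the comparison is only approximate.

```python
# Per-token latencies in milliseconds, copied from the two runs above.
TRITON_FLASHINFER_MS = 19.189573345762312
MPK_MS = 7.71875


def speedup(baseline_ms: float, optimized_ms: float) -> float:
    """Return how many times faster the optimized run is per token."""
    return baseline_ms / optimized_ms


def tokens_per_second(per_token_ms: float) -> float:
    """Convert a per-token latency in milliseconds to decode throughput."""
    return 1000.0 / per_token_ms


if __name__ == "__main__":
    ratio = speedup(TRITON_FLASHINFER_MS, MPK_MS)
    print(f"speedup: {ratio:.2f}x")
    print(f"Triton + FlashInfer: {tokens_per_second(TRITON_FLASHINFER_MS):.1f} tok/s")
    print(f"MPK: {tokens_per_second(MPK_MS):.1f} tok/s")
```

On these numbers the ratio works out to roughly 2.5x in favor of MPK.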


zhihaojia
Thanks for reproducing our results!
