
Thank you! We'll continue to add performance metrics as more data comes in.

A Qwen 2.5 500M will get you to ≈45 tok/sec on an iPhone 13. Inference speed is roughly inversely proportional to model size.
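To make that inverse relationship concrete: decode is typically memory-bandwidth bound, so tok/sec ≈ effective bandwidth ÷ bytes read per token, which is roughly the model's in-memory size. A quick sketch (the bandwidth figure is an assumption for illustration, not a measured iPhone 13 number):

```typescript
// Back-of-envelope estimate: each generated token requires streaming
// (roughly) the whole model through memory, so throughput is bounded
// by bandwidth / model size.
function estimateTokPerSec(modelBytes: number, bandwidthBytesPerSec: number): number {
  return bandwidthBytesPerSec / modelBytes;
}

const GB = 1e9;
const modelBytes = 0.5 * GB;   // ~0.5B params at ~8-bit quantization (assumption)
const bandwidth = 25 * GB;     // assumed effective bandwidth on an older phone SoC

console.log(estimateTokPerSec(modelBytes, bandwidth)); // ~50 tok/sec, same ballpark as the observed ~45
```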

Yes, speeds are consistent across frameworks, although (and don't quote me on this) I believe React Native is slightly slower because it interfaces with the C++ engine through a set of bridges.
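For the curious, here's a hedged sketch of what that bridging can look like. NativeLlama and the onToken event are hypothetical stand-ins, not any real library's API; the point is that each token emitted by the C++ engine crosses the JS/native boundary, which is where the small overhead comes from:

```typescript
import { NativeModules, NativeEventEmitter } from 'react-native';

// Hypothetical native module wrapping a C++ inference engine.
const { NativeLlama } = NativeModules;

export async function generate(prompt: string): Promise<string> {
  const emitter = new NativeEventEmitter(NativeLlama);
  let out = '';
  // One JS <-> native crossing per generated token.
  const sub = emitter.addListener('onToken', (tok: string) => {
    out += tok;
  });
  await NativeLlama.completion({ prompt, nPredict: 256 }); // hypothetical call
  sub.remove();
  return out;
}
```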


I also want to add that I really appreciate the benchmarks.

When I was working with RAG via llama.cpp through RN early last year, I had pretty acceptable tok/sec results up through 7-8B quantized models (on phones like the S24+ and iPhone 15 Pro). MLC was definitely higher tok/sec, but it's really tough to beat the community support and model availability in the GGUF ecosystem.
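For context, the RAG loop itself is simple enough to sketch. The embed and complete functions below are hypothetical stand-ins for whatever llama.cpp bindings you use, not a specific library's API:

```typescript
type Doc = { text: string; embedding: number[] };

// Cosine similarity between two embedding vectors.
function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

async function answer(
  query: string,
  docs: Doc[],
  embed: (t: string) => Promise<number[]>,      // stand-in for an embedding call
  complete: (p: string) => Promise<string>      // stand-in for a completion call
): Promise<string> {
  const q = await embed(query);
  // Retrieve the top-3 chunks by cosine similarity.
  const top = docs
    .map(d => ({ d, score: cosine(q, d.embedding) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, 3);
  const context = top.map(t => t.d.text).join('\n---\n');
  return complete(`Context:\n${context}\n\nQuestion: ${query}\nAnswer:`);
}
```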

Looking at the current benchmarks table, I was curious: what do you think is wrong with the Samsung S25 Ultra?

Most of the standard mobile CPU benchmarks (Geekbench, AnTuTu, et al.) show a 20-40% performance gain over the S23/S24 Ultra. This also bucks the trend seen with most other devices, which are ranked as expected (i.e. newer devices perform better).

Thanks for sharing your project.

Great observation - this data is not from a controlled environment; these are metrics from real-world Cactus Chat usage (we only collect tok/sec telemetry).

S25 is an outlier that surprised us too.

I got $10 on S25 climbing back up to the top of the rankings as more data comes in :)
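To illustrate why a small sample can do this (a sketch, not our actual pipeline; the field names are made up):

```typescript
type Sample = { device: string; tokPerSec: number };

// Aggregate crowd-sourced telemetry per device. With few samples the
// median swings a lot, so a small n is a hint to distrust the ranking.
function medianTokPerSec(samples: Sample[], device: string): { median: number; n: number } {
  const vals = samples
    .filter(s => s.device === device)
    .map(s => s.tokPerSec)
    .sort((a, b) => a - b);
  const n = vals.length;
  const mid = Math.floor(n / 2);
  const median = n % 2 ? vals[mid] : (vals[mid - 1] + vals[mid]) / 2;
  return { median, n };
}
```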
