- indeed, this is exactly the goal! The license grants rights to commercial use, unlocks additional hardware acceleration, includes cloud telemetry, and offers significant savings over using cloud APIs.
In our deployments, we've seen open source models rival and even outperform lower-tier cloud counterparts. Happy to share some benchmarks if you like.
Our pricing is on a per-monthly-active-device basis, regardless of utilization. For voice-agent workflows, you typically come out ahead once you process more than ≈2 min of daily inference.
- you can run it in Cactus Chat (download from the Play Store)
- you can also run it on Cactus - either in Cactus Chat from the App/Play Store or by using the Cactus framework to integrate it into your own app
- THIS IS THE BOMB!!! So excited for this one. Thanks for putting cool tech out there.
- if you ever end up trying to take this in the mobile direction, consider running on-device AI with Cactus –
It's blazing fast, cross-platform, and supports nearly all recent open-source models.
- thank you! Very kind words, and we'll add your suggestions to our to-dos.
re: "question would get stuck on the last phrase and keep repeating it without end." - that's a limitation of the model i'm afraid. Smaller models tend to do that sometimes.
- say more about "community tools"?
- in the app you mean?
Adding shortly!
- that's our mission! if you are passionate about the space, we look forward to your contributions!
- no, good observation - not hidden; we don't have a "clear conversation" button.
to your previous point - Cactus fully supports tool calling (for models that have been instruction-trained accordingly, e.g. Qwen 1.7B); see the sketch at the end of this reply
for "turning your old phones into local llm servers", Cactus is likely not the best tool. We'd recommend something like actual Ollama or Exo
- looking forward to your feedback!
- hot off the press in our latest feature release :)
we support cloud fallback as an add-on feature. This lets us support vision and audio in addition to text.
- great observation - this data is not from a controlled environment; these are metrics from real-world Cactus Chat usage (we only collect tok/sec telemetry).
S25 is an outlier that surprised us too.
I've got $10 on the S25 climbing back up to the top of the rankings as more data comes in :)
- thank you! We'll continue to add performance metrics as more data comes in.
A Qwen 2.5 500M will get you to ≈45 tok/sec on an iPhone 13. Inference speed is roughly inversely proportional to model size; see the back-of-envelope sketch at the end of this reply.
Yes, speeds are consistent across frameworks, although (and don't quote me on this), I believe React Native is slightly slower because it interfaces with the C++ engine through a set of bridges.
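As a back-of-envelope only, anchoring on that 500M / ≈45 tok/sec data point and assuming the inverse relationship above (quantization, context length, and hardware all shift the real numbers):

```ts
// Crude estimate: assumes tok/sec scales inversely with parameter count,
// anchored on ≈45 tok/sec for a 500M model on an iPhone 13.
const REFERENCE = { paramsB: 0.5, tokPerSec: 45 };

function estimateTokPerSec(paramsB: number): number {
  return (REFERENCE.paramsB / paramsB) * REFERENCE.tokPerSec;
}

console.log(estimateTokPerSec(1.0).toFixed(1)); // ≈22.5 tok/sec for a 1B model
console.log(estimateTokPerSec(1.7).toFixed(1)); // ≈13.2 tok/sec for a 1.7B model
```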
- Great question. Currently, each app is sandboxed - so each model file is downloaded inside each app's sandbox (sketch of that pattern at the end of this reply). We're working on enabling file sharing across multiple apps so you don't have to redownload the model.
With respect to the inference SDK, yes, you'll need to install the (React Native/Flutter) framework inside each app you're building.
The SDK is very lightweight (our own iOS app is <30 MB, and that includes the inference SDK and a ton of other stuff).
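To make the sandboxing point concrete, a per-app model download could look something like the sketch below. It uses expo-file-system as one example and a placeholder model URL; it is not Cactus-specific code.

```ts
import * as FileSystem from "expo-file-system";

// Each app downloads the model into its own sandbox; other apps cannot read this file.
// The URL below is a placeholder, not an official model mirror.
const MODEL_URL = "https://example.com/models/qwen2.5-0.5b-q8_0.gguf";
const LOCAL_PATH = `${FileSystem.documentDirectory}qwen2.5-0.5b-q8_0.gguf`;

export async function ensureModelDownloaded(): Promise<string> {
  const info = await FileSystem.getInfoAsync(LOCAL_PATH);
  if (!info.exists) {
    // Skipped on subsequent launches once the file is in this app's sandbox.
    await FileSystem.downloadAsync(MODEL_URL, LOCAL_PATH);
  }
  return LOCAL_PATH; // Pass this local path to the inference SDK.
}
```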
- reminds me of
- "You are, undoubtedly, the worst pirate i have ever heard of" - "Ah, but you have heard of me"
Yes, we are indeed a young project. Not two weeks, but a couple of months. Welcome to AI, most projects are young :)
Yes, we are wrapping llama.cpp. For now. Ollama, too, began by wrapping llama.cpp. That is the mission of open-source software - to enable the community to build on each other's progress.
We're enabling the first cross-platform in-app inference experience for GGUF models, and we'll soon be shipping our own inference kernels, fully optimized for mobile, to speed up performance further. Stay tuned.
PS - we're up to good (source: trust us)
- love this. So many layers deep, we just had a good laugh.
- Very good point - we've heard this before.
We're restructuring the model-initialization API to point to a local file and exposing a separate, abstracted download function that takes a URL; rough sketch at the end of this reply.
Wrt downloading post-install: based on the feedback we've received, this is indeed the preferred pattern (as opposed to bundling large files into the app).
We'll update the download API, thanks again.
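The shape we have in mind is roughly the sketch below; names and bodies are placeholders, not the final API.

```ts
// Illustrative sketch of the proposed split: a standalone download helper that
// returns a local path, plus an init function that only accepts a local file path.
// Placeholder names and bodies, not the final Cactus API.

interface ModelHandle {
  complete(prompt: string): Promise<string>;
}

// Placeholder: the real helper would stream the file into app storage with progress callbacks.
async function downloadModel(url: string): Promise<string> {
  return `/data/local/models/${url.split("/").pop()}`;
}

// Placeholder: the real function would load the GGUF file into the inference engine.
async function initModel(options: { modelPath: string }): Promise<ModelHandle> {
  return { complete: async (prompt) => `echo: ${prompt} (model at ${options.modelPath})` };
}

// Usage: download after install (or bring your own file), then initialize from the local path.
async function setup(): Promise<ModelHandle> {
  const localPath = await downloadModel("https://example.com/models/model.gguf");
  return initModel({ modelPath: localPath });
}
```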
We don't advise using GPUs on smartphones, since GPU inference there is very energy-inefficient. Mobile GPU inference is actually the main driver behind the stereotype that "mobile inference drains your battery and heats up your phone".
Wrt your last question – the short answer is yes, we'll have multimodal support. We currently support voice transcription and image understanding. We'll be expanding these capabilities to add more models, voice synthesis, and much more.