- indeed, this is exactly the goal! The license grants rights to commercial use, unlocks additional hardware acceleration, includes cloud telemetry, and offers significant savings over using cloud APIs.
In our deployments, we've seen open source models rival and even outperform lower-tier cloud counterparts. Happy to share some benchmarks if you like.
Our pricing is on a per-monthly-active-device basis, regardless of utilization. For voice-agent workflows, you typically come out ahead once you process more than ≈2 min of daily inference.
- you can run it in Cactus Chat (download from the Play Store)
- you can also run it on Cactus - either in Cactus Chat from the App/Play Store or by using the Cactus framework to integrate it into your own app
- THIS IS THE BOMB!!! So excited for this one. Thanks for putting cool tech out there.
- if you ever end up trying to take this in the mobile direction, consider running on-device AI with Cactus –
It's blazing fast, cross-platform, and supports nearly all recent open-source models.
- thank you! Very kind words, and we'll add your suggestions to our to-dos.
re: "question would get stuck on the last phrase and keep repeating it without end." - that's a limitation of the model i'm afraid. Smaller models tend to do that sometimes.
- say more about "community tools"?
- in the app you mean?
Adding shortly!
- that's our mission! if you are passionate about the space, we look forward to your contributions!
- no, good observation - not hidden; we don't have a "clear conversation" button.
to your previous point - Cactus fully supports tool calling (for models that have been instruction-trained accordingly, e.g. Qwen 1.7B); see the sketch at the end of this reply
for "turning your old phones into local llm servers", Cactus is likely not the best tool. We'd recommend something like actual Ollama or Exo
- looking forward to your feedback!
- hot off the press in our latest feature release :)
we support cloud fallback as an add-on feature. This lets us support vision and audio in addition to text.
- great observation - this data is not from a controlled environment; these are metrics from real-world Cactus Chat usage (we only collect tok/sec telemetry).
S25 is an outlier that surprised us too.
I've got $10 on the S25 climbing back up to the top of the rankings as more data comes in :)
- thank you! We'll continue to add performance metrics as more data comes in.
A Qwen 2.5 500M will get you to ≈45 tok/sec on an iPhone 13. Inference speed is roughly inversely proportional to model size; see the back-of-envelope sketch at the end of this reply.
Yes, speeds are consistent across frameworks, although (and don't quote me on this), I believe React Native is slightly slower because it interfaces with the C++ engine through a set of bridges.
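As a back-of-envelope only, anchoring on that 500M / ≈45 tok/sec data point and assuming the inverse relationship above (quantization, context length, and hardware all shift the real numbers):

```ts
// Crude estimate: assumes tok/sec scales inversely with parameter count,
// anchored on ≈45 tok/sec for a 500M model on an iPhone 13.
const REFERENCE = { paramsB: 0.5, tokPerSec: 45 };

function estimateTokPerSec(paramsB: number): number {
  return (REFERENCE.paramsB / paramsB) * REFERENCE.tokPerSec;
}

console.log(estimateTokPerSec(1.0).toFixed(1)); // ≈22.5 tok/sec for a 1B model
console.log(estimateTokPerSec(1.7).toFixed(1)); // ≈13.2 tok/sec for a 1.7B model
```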
- Great question. Currently, each app is sandboxed - so each model file is downloaded inside each app's sandbox (sketch of that pattern at the end of this reply). We're working on enabling file sharing across multiple apps so you don't have to redownload the model.
With respect to the inference SDK, yes, you'll need to install the (React Native/Flutter) framework inside each app you're building.
The SDK is very lightweight (our own iOS app is <30 MB, and that includes the inference SDK and a ton of other stuff).
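To make the sandboxing point concrete, a per-app model download could look something like the sketch below. It uses expo-file-system as one example and a placeholder model URL; it is not Cactus-specific code.

```ts
import * as FileSystem from "expo-file-system";

// Each app downloads the model into its own sandbox; other apps cannot read this file.
// The URL below is a placeholder, not an official model mirror.
const MODEL_URL = "https://example.com/models/qwen2.5-0.5b-q8_0.gguf";
const LOCAL_PATH = `${FileSystem.documentDirectory}qwen2.5-0.5b-q8_0.gguf`;

export async function ensureModelDownloaded(): Promise<string> {
  const info = await FileSystem.getInfoAsync(LOCAL_PATH);
  if (!info.exists) {
    // Skipped on subsequent launches once the file is in this app's sandbox.
    await FileSystem.downloadAsync(MODEL_URL, LOCAL_PATH);
  }
  return LOCAL_PATH; // Pass this local path to the inference SDK.
}
```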
- reminds me of
- "You are, undoubtedly, the worst pirate i have ever heard of" - "Ah, but you have heard of me"
Yes, we are indeed a young project. Not two weeks, but a couple of months. Welcome to AI, most projects are young :)
Yes, we are wrapping llama.cpp. For now. Ollama, too, began by wrapping llama.cpp. That is the mission of open-source software - to enable the community to build on each other's progress.
We're enabling the first cross-platform in-app inference experience for GGUF models, and we'll soon be shipping our own inference kernels, fully optimized for mobile, to speed up performance further. Stay tuned.
PS - we're up to good (source: trust us)
- love this. So many layers deep, we just had a good laugh.
- Very good point - we've heard this before.
We're restructuring the model-initialization API to point to a local file and exposing a separate, abstracted download function that takes a URL; rough sketch at the end of this reply.
Wrt downloading post-install: based on the feedback we've received, this is indeed the preferred pattern (as opposed to bundling large files into the app).
We'll update the download API, thanks again.
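The shape we have in mind is roughly the sketch below; names and bodies are placeholders, not the final API.

```ts
// Illustrative sketch of the proposed split: a standalone download helper that
// returns a local path, plus an init function that only accepts a local file path.
// Placeholder names and bodies, not the final Cactus API.

interface ModelHandle {
  complete(prompt: string): Promise<string>;
}

// Placeholder: the real helper would stream the file into app storage with progress callbacks.
async function downloadModel(url: string): Promise<string> {
  return `/data/local/models/${url.split("/").pop()}`;
}

// Placeholder: the real function would load the GGUF file into the inference engine.
async function initModel(options: { modelPath: string }): Promise<ModelHandle> {
  return { complete: async (prompt) => `echo: ${prompt} (model at ${options.modelPath})` };
}

// Usage: download after install (or bring your own file), then initialize from the local path.
async function setup(): Promise<ModelHandle> {
  const localPath = await downloadModel("https://example.com/models/model.gguf");
  return initModel({ modelPath: localPath });
}
```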
We don't advise using GPUs on smartphones, since GPU inference there is very energy-inefficient. Mobile GPU inference is actually the main driver behind the stereotype that "mobile inference drains your battery and heats up your phone".
Wrt your last question – the short answer is yes, we'll have multimodal support. We currently support voice transcription and image understanding. We'll be expanding these capabilities to add more models, voice synthesis, and much more.