Hello everyone. This is Yujong from the Hyprnote team (https://github.com/fastrepl/hyprnote).
We built OWhisper for 2 reasons: (Also outlined in https://docs.hyprnote.com/owhisper/what-is-this)
(1). While working with on-device, realtime speech-to-text, we found there wasn't tooling to download and run models in a practical way.
(2). Also, we got frequent requests to provide a way to plug custom STT endpoints into the Hyprnote desktop app, just like you can with OpenAI-compatible LLM endpoints.
The (2) part is still kind of WIP, but we spent some time writing docs, so you'll get a good idea of what it will look like if you skim through them.
For (1) - You can try it now. (https://docs.hyprnote.com/owhisper/cli/get-started)
```bash
brew tap fastrepl/hyprnote && brew install owhisper
owhisper pull whisper-cpp-base-q8-en
owhisper run whisper-cpp-base-q8-en
```
If you're tired of Whisper, we also support Moonshine :)
Give it a shot (`owhisper pull moonshine-onnx-base-q8`).

We're here and looking forward to your comments!
I was actually integrating some whisper tools yesterday. I was wondering if there was a way to get a streaming response, and was thinking it'd be nice if you could.
I'm on Linux, so I don't think I can test out owhisper right now, but is that a thing that's possible?
Also, it looks like the `owhisper run` command gives its output as a TUI. Is there an option for a plain text response so that we can just pipe it to other programs? (Maybe just `kill`/`CTRL+C` to stop the recording and finalize the words.)
Same question for streaming: is there a way to get a streaming text output from owhisper? (It looks like you said you've built a Deepgram-compatible API. I had a quick look at the API docs, but I don't know how easy it is to hook into it and get some nice streaming text while speaking.)
Oh yeah, and diarisation (available with a flag?) would be awesome; it's one of the things missing from most of the easiest-to-run tools I can find.
I haven't tested on Linux yet, but we have a Linux build: http://owhisper.hyprnote.com/download/latest/linux-x86_64
> Also, it looks like the `owhisper run` command gives its output as a TUI. Is there an option for a plain text [...]
`owhisper run` is more of a way to quickly try it out, but I think piping is definitely something that should work.
> Same question for streaming, is there a way to get a streaming text output from owhisper?
You can use a Deepgram client to talk to `owhisper serve` (https://docs.hyprnote.com/owhisper/deepgram-compatibility), so the best resource might be the Deepgram client SDK docs.
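Roughly, something like this should work against the streaming endpoint. This is a minimal Python sketch, not code from our docs: the port, the `/v1/listen` path, the query parameters, and the raw 16-bit PCM input are assumptions here, so check the output of `owhisper serve` and the compatibility docs for the real values.

```python
# Minimal sketch: stream raw 16-bit / 16 kHz mono PCM to a Deepgram-compatible
# WebSocket endpoint and print transcripts as they arrive.
# Assumptions: host/port, the /v1/listen path, and no auth header.
import asyncio
import json

import websockets  # pip install websockets

URI = "ws://localhost:8080/v1/listen?encoding=linear16&sample_rate=16000&channels=1"

async def stream(path: str) -> None:
    async with websockets.connect(URI) as ws:
        async def sender() -> None:
            with open(path, "rb") as f:
                while chunk := f.read(3200):      # ~100 ms of 16 kHz 16-bit mono audio
                    await ws.send(chunk)
                    await asyncio.sleep(0.1)      # roughly real-time pacing
            await ws.send(json.dumps({"type": "CloseStream"}))  # Deepgram-style end-of-stream

        async def receiver() -> None:
            async for message in ws:
                result = json.loads(message)
                alts = result.get("channel", {}).get("alternatives", [])
                if alts and alts[0].get("transcript"):
                    print(alts[0]["transcript"], flush=True)

        await asyncio.gather(sender(), receiver())

asyncio.run(stream("audio.raw"))
```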
> diarisation
Yeah, it's on the roadmap.
Great work on this! Excited to keep an eye on things.
Overall though, it's fast and really impressive. Can't wait for it to progress.
Can you help me find where the code you've built is? I can see the folder on GitHub [0], but I can't see the code for the CLI, for instance, unless I'm blind.
[0] https://github.com/fastrepl/hyprnote/tree/main/owhisper
https://github.com/fastrepl/hyprnote/blob/8bc7a5eeae0fe58625...
Ultimately, I chose a cloud-based GPU setup, as the highest-performing diarization models required a GPU to process properly. Happy to share more if you’re going that route.
Where exactly, if not in the FM?
https://github.com/ggml-org/whisper.cpp/tree/master/examples...
- It supports other models like moonshine.
- It also works as a proxy for cloud model providers.
- It can expose local models as a Deepgram-compatible API server.
I just spent last week researching the options (especially for my M1!) and was left wishing for a standard, full-service (live) transcription server for Whisper, like Ollama has been for LLMs.
I'm excited to try this out and see your API (there seems to be a standards vacuum here due to OpenAI not having a real-time transcription service, which I find to be a bummer)!
Edit: They seem to emulate the Deepgram API (https://developers.deepgram.com/reference/speech-to-text-api...), which seems like a solid choice. I’d definitely like to see a standard emerging here.
Let me know how it goes!
When I find the time to set it up, I'd like to contribute to the documentation to answer the questions I had, but I couldn't even find information on how to do that (there's no docs folder in the repo, and contribution.md, which the AI assistant also points me towards, doesn't contain information about adding to the docs).
In general, I find it a bit distracting that the OWhisper code lives inside the Hyprnote repository. For discoverability and “real project” purposes, it would probably deserve its own repo.
Link to the repo - https://github.com/m-bain/whisperX
EDIT: Ah, I see this was already answered.
EDIT: typo
But I was hoping a couple of features would be supported:
1. Multilingual support. It seems like even if I use a multilingual model like whisper-cpp-large-turbo-q8, the application assumes I am speaking English.
2. A translate feature. Probably already supported, but I didn't see the option.
For splitting speakers within a channel, we need an AI model to do that. It is not implemented yet, but I think we'll be in good shape sometime in September.
Also, we have a transcript editor where you can easily split segments and assign speakers.
These are the local models it supports:
- whisper-cpp-base-q8
- whisper-cpp-base-q8-en
- whisper-cpp-tiny-q8
- whisper-cpp-tiny-q8-en
- whisper-cpp-small-q8
- whisper-cpp-small-q8-en
- whisper-cpp-large-turbo-q8
- moonshine-onnx-tiny
- moonshine-onnx-tiny-q4
- moonshine-onnx-tiny-q8
- moonshine-onnx-base
- moonshine-onnx-base-q4
- moonshine-onnx-base-q8
To me, STT should take a continuous audio stream and output a continuous text stream.
Whisper and Moonshine both work in chunks, but for Moonshine:
> Moonshine's compute requirements scale with the length of input audio. This means that shorter input audio is processed faster, unlike existing Whisper models that process everything as 30-second chunks. To give you an idea of the benefits: Moonshine processes 10-second audio segments 5x faster than Whisper while maintaining the same (or better!) WER.
Also, for Kyutai, we can input continuous audio and get continuous text out.
- https://github.com/moonshine-ai/moonshine
- https://docs.hyprnote.com/owhisper/configuration/providers/k...
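To make the chunking point concrete, here is the caller-side difference. This is an illustrative sketch only, not OWhisper code: with a chunk-based model you have to window the continuous stream yourself (and where you cut matters), while a streaming model just takes samples as they arrive.

```python
# Illustrative only: fixed windowing of a continuous PCM stream for a
# chunk-based STT model. A streaming model would not need this step.
import sys
from typing import BinaryIO, Iterator

SAMPLE_RATE = 16_000
BYTES_PER_SAMPLE = 2  # 16-bit mono PCM

def windows(stream: BinaryIO, seconds: float = 10.0) -> Iterator[bytes]:
    """Yield fixed-length audio windows from a continuous byte stream."""
    window_bytes = int(seconds * SAMPLE_RATE * BYTES_PER_SAMPLE)
    while chunk := stream.read(window_bytes):
        yield chunk  # each window would become one transcription call

if __name__ == "__main__":
    # e.g. arecord -t raw -f S16_LE -r 16000 -c 1 | python windows.py
    for i, window in enumerate(windows(sys.stdin.buffer)):
        print(f"window {i}: {len(window)} bytes", file=sys.stderr)
```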
The short duration effectively means that the transcription will start producing nonsense as soon as a sentence is cut up in the middle.
(maybe with an `owhisper serve` somewhere else to start the model running or whatever.)
https://github.com/bikemazzell/skald-go/
Just speech to text, CLI only, and it can paste into whatever app you have open.
What exactly does the silence detection mean? Does that mean it'll wait until a pause, then send the audio off to whisper and return the output (and stop the process)? Same question with continuous: does that just mean it keeps going until CTRL+C?
Nvm, answered my own question; looks like yes for both [0][1]. Cool, this seems pretty great actually.
[0] https://github.com/bikemazzell/skald-go/blob/main/pkg/skald/...
[1] https://github.com/bikemazzell/skald-go/blob/main/pkg/skald/...
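In case it helps anyone else, the pattern boils down to something like this. A rough Python sketch of the idea, not skald-go's actual code, and the thresholds are made-up numbers:

```python
# Rough sketch of "wait for a pause, then hand the buffered audio to the STT model",
# using a simple RMS energy threshold as the silence detector.
import array
from typing import Iterable, Iterator

SILENCE_RMS = 500             # made-up energy threshold for 16-bit samples
SILENCE_FRAMES_TO_STOP = 20   # ~600 ms of 30 ms frames ends an utterance

def rms(frame: bytes) -> float:
    samples = array.array("h", frame)  # signed 16-bit PCM
    if not samples:
        return 0.0
    return (sum(s * s for s in samples) / len(samples)) ** 0.5

def utterances(frames: Iterable[bytes]) -> Iterator[bytes]:
    """Group a stream of PCM frames into utterances separated by long pauses."""
    buffer, silent = [], 0
    for frame in frames:
        if rms(frame) < SILENCE_RMS:
            silent += 1
            if buffer:
                buffer.append(frame)            # keep short pauses inside the utterance
                if silent >= SILENCE_FRAMES_TO_STOP:
                    yield b"".join(buffer)      # long pause: send this chunk to whisper
                    buffer, silent = [], 0
        else:
            silent = 0
            buffer.append(frame)
    if buffer:
        yield b"".join(buffer)                  # flush whatever is left at CTRL+C / EOF
```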
For just transcribing file/audio,
`owhisper run <MODEL> --file a.wav` or
`curl https://something.com/audio.wav | owhisper run <MODEL>`
might make sense.
But the base-q8 works (and works quite well!). The TUI is really nice. Speaker diarization would make it almost perfect for me. Thanks for building this.
Though, with a twist that it would transcribe it with IPA :)
also fyi - https://docs.hyprnote.com/owhisper/configuration/providers/o...
I see that you are also using llama.cpp's code? That's cool, but make sure you become a member of that community, not an abuser.