- kraddypatties: Running into "no healthy upstream" when navigating to the link -- hug of death maybe?
- been lurking for most of my adult life (and it shows :-))
Thanks HN! You make me smarter every (other) day.
- Thanks for trying it out!
Yeah, that latency makes sense: "listening" includes turn detection and STT, and "thinking" covers the LLM + TTS _and then_ our model, so the pipeline latency stacks up pretty quickly. The actual video model starts streaming out frames <500ms from the TTS generation, but we're still working on reducing latency in the parts of the pipeline we use off the shelf.
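To make the stacking concrete, here's a toy sketch of how sequential stage latencies add up before the first video frame; the stage names follow the pipeline described above, but every number is made up for illustration:

```python
import asyncio
import time

# Illustrative per-stage latencies in seconds -- hypothetical numbers,
# not measurements from our pipeline.
STAGES = [
    ("turn detection", 0.20),     # "listening"
    ("STT", 0.30),                # "listening"
    ("LLM first token", 0.40),    # "thinking"
    ("TTS first audio", 0.35),    # "thinking"
    ("video first frame", 0.45),  # <500ms from TTS, per above
]

async def time_to_first_frame() -> None:
    t0 = time.monotonic()
    for name, latency in STAGES:
        await asyncio.sleep(latency)  # stand-in for the real stage
        print(f"{name:>17}: +{latency:.2f}s (cumulative {time.monotonic() - t0:.2f}s)")

asyncio.run(time_to_first_frame())
```

Each stage has to see output from the previous one before it can start, which is why shaving any single stage only buys back its own slice of the total.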
We have a high-level blog post about the architecture of the video model here: https://www.keyframelabs.com/blog/persona-1. The WebRTC "agent" stack is LiveKit plus a few backend components hosted on Modal.
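For anyone curious what the Modal side of a stack like this looks like, a minimal sketch of hosting a GPU-backed inference function there might be something like the following; the app name, function, and GPU choice are all hypothetical, not our actual deployment:

```python
import modal

# Hypothetical names throughout -- this is a generic Modal pattern,
# not our actual backend code.
image = modal.Image.debian_slim().pip_install("torch")
app = modal.App("persona-video-backend", image=image)

@app.function(gpu="A10G")
def render_frames(audio_chunk: bytes) -> bytes:
    # Placeholder: real code would run the video model on the audio
    # and return encoded frames for the WebRTC track.
    return audio_chunk
```

The LiveKit agent then calls into functions like this over the network as audio arrives.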
- We've been tinkering with building realtime talking-head models (avatar models, etc.) for a while now, and finally have something that works (well enough)! It runs at ~2x realtime on a 4090, and significantly faster than that on enterprise-grade GPUs.
You can try it yourself at https://playground.keyframelabs.com/playground/persona-1 and there's a (semi)technical blog post at https://www.keyframelabs.com/blog/persona-1
The main use case we designed for is language learning, particularly having a conversational partner -- we've generally found that adding a face to the voice really does trigger the fight-or-flight response, and overcoming that response is the hardest part of speaking a new language with confidence.
But in building out the system around the model to enable that use case (tool use on a canvas for speaking prompts and images, memory to make conversations less stale, etc.), we think there's potential for other use cases too.