Never mind 'chat' use cases: holding a response up for N*1.2 when it could have taken N ties up all sorts of other resources up- and downstream.
If I were using them to process far more text (e.g. summarising long documents), or as an inline editing assistant, then I'd care more about the speed.
Streaming a response from a chatbot is only one use-case of LLMs.
I would argue the most interesting applications do not fall into this category.
…not yet, anyway. Fast-moving area, lots of blue water outside the chat interface.
Groq's model shines at latency, not at the other two.
For example, if you're a game company and you want to use LLMs so your players can converse with non-player characters in natural language, replacing a multiple-choice conversation tree, you'd want that to be low latency and you'd want it to be cheap.
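To make the latency side concrete, here's a minimal sketch of streaming an NPC's reply token-by-token, so the character starts "talking" immediately instead of after the full generation. It assumes llama-cpp-python with a local GGUF model; the model path, character, and prompt are all made up for illustration:

    # Stream an NPC reply token-by-token so the player sees text right away.
    # Uses llama-cpp-python; the model file is a hypothetical local checkpoint.
    from llama_cpp import Llama

    llm = Llama(model_path="models/npc-7b.Q4_K_M.gguf", n_ctx=2048)

    prompt = "Brann is a gruff blacksmith.\nPlayer: Can you repair my sword?\nBrann:"
    for chunk in llm(prompt, max_tokens=64, stop=["Player:"], stream=True):
        print(chunk["choices"][0]["text"], end="", flush=True)

With streaming, time-to-first-token is what the player actually perceives, which is why the latency requirement bites harder than raw throughput here.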
(All of a sudden having nightmares of getting billed for the conversations I have in the single-player game I happen to be enjoying...)
If there is a future for this idea, it's gotta be just shipping the LLM with the game, right?
That might be a nice application for this library of mine: https://github.com/Const-me/Cgml/
That’s an open source Mistral ML model implementation which runs on GPUs (all of them, not just NVIDIA), takes 4.5GB on disk, uses under 6GB of VRAM, and is optimized for the interactive single-user use case. Probably fast enough for that application.
You wouldn’t want in-game dialogue with the original model, though. Game developers would need to fine-tune, retrain, or otherwise adapt the weights and/or my implementation.
Small, bounded conversations, with problematic lines trimmed over time, striking a balance between possibility and self-contradiction.
I could see it working really well in a Mass Effect-type game.
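One way to keep conversations small and bounded like that, sketched again with llama-cpp-python (the character, prompt, and every parameter value here are illustrative, not tuned):

    # A bounded NPC exchange: a short system prompt, a hard token cap, and the
    # chat wrapper's turn handling keep replies brief and in character.
    from llama_cpp import Llama

    llm = Llama(model_path="models/npc-7b.Q4_K_M.gguf", n_ctx=1024)

    def npc_reply(history: list[dict]) -> str:
        out = llm.create_chat_completion(
            messages=[{"role": "system",
                       "content": "You are Garrus, a turian sniper. Answer in "
                                  "one or two short sentences. Refuse to "
                                  "discuss anything outside the mission."}]
                     + history,
            max_tokens=48,     # hard cap keeps dialogue snappy and cheap
            temperature=0.7,
        )
        return out["choices"][0]["message"]["content"]

    print(npc_reply([{"role": "user", "content": "Got a minute to talk?"}]))

Trimming problematic lines over time would then just be prompt and data curation layered on top of a loop like this.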
> If there is a future for this idea, it's gotta be just shipping the LLM with the game, right?
Depends how high you can let your GPU requirements get :)
EDIT: Watching the videos, I am more and more confused about why this is even desirable. The complexity of dialogue in a game, it seems, needs to match the complexity of the actions you can actually undertake in the game itself. Without that, it all just feels like a little chatbot sandbox within the game, even when the dialogue is perfectly "in character." With the LLMs, it all ends up feeling less immersive.
I’m not convinced latency matters as much as Groq's marketing material tries to claim it does.