I benchmarked some of my larger context-window queries last week: Groq's API took 1.6 seconds versus 1.8 to 2.2 seconds for OpenAI's GPT-3.5-turbo. So it wasn't much faster. I almost emailed their support to see if I was doing something wrong. Would love to hear any details about your workload or the complexity of your queries.
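For anyone who wants to run a similar comparison, here is a minimal timing sketch against OpenAI-compatible chat completions endpoints. The endpoint URLs, model names, and environment variable names below are my assumptions, not the parent's actual setup.

```python
# Rough side-by-side latency check (a sketch, not the parent's exact benchmark).
# Endpoint URLs, model names, and env var names are assumptions.
import os
import time
import statistics
import requests

ENDPOINTS = {
    "groq": ("https://api.groq.com/openai/v1/chat/completions",
             os.environ["GROQ_API_KEY"], "llama3-8b-8192"),
    "openai": ("https://api.openai.com/v1/chat/completions",
               os.environ["OPENAI_API_KEY"], "gpt-3.5-turbo"),
}

PROMPT = "Summarise the following text: ..."  # substitute a large-context prompt

def time_one(url: str, key: str, model: str) -> float:
    """Return wall-clock seconds for a single non-streaming completion."""
    t0 = time.monotonic()
    r = requests.post(
        url,
        headers={"Authorization": f"Bearer {key}"},
        json={"model": model,
              "messages": [{"role": "user", "content": PROMPT}]},
        timeout=60,
    )
    r.raise_for_status()
    return time.monotonic() - t0

for name, (url, key, model) in ENDPOINTS.items():
    samples = [time_one(url, key, model) for _ in range(5)]
    print(f"{name}: median {statistics.median(samples):.2f}s over {len(samples)} runs")
```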
In most cases, overall response time is dominated by output generation, since producing an output token is ~100x slower than processing an input token.
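To make that concrete, here is a back-of-envelope split under that ~100x assumption. The per-token times are illustrative placeholders I picked, not measured values.

```python
# Back-of-envelope latency split under the ~100x input/output assumption above.
# The per-token times are illustrative placeholders, not measurements.
t_in = 0.0003   # seconds per prompt token (assumed)
t_out = 0.03    # seconds per generated token, ~100x slower (assumed)

prompt_tokens, output_tokens = 4000, 300
prompt_time = prompt_tokens * t_in      # 1.2 s
output_time = output_tokens * t_out     # 9.0 s
print(f"prompt: {prompt_time:.1f}s  output: {output_time:.1f}s")
# Even with a prompt more than 10x longer than the response,
# generating the output dominates total response time.
```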
I believe certain companies would kill for 20% performance improvements on their main product.
I’m not convinced latency matters as much as Groq's marketing material claims it does.
Never mind 'chat' use cases: holding a response up for 1.2x longer than it needs to take ties up all sorts of other resources upstream and downstream.
If I were using them to process far more text, e.g. to summarise long documents, or as an inline editing assistant, then I'd care more about speed.
For example, if you're a game company and you want to use LLMs so your players can converse with nonplayer characters in natural language, replacing a multiple-choice conversation tree - you'd want that to be low latency, and you'd want it to be cheap.
(Suddenly having nightmares of getting billed for the conversations I have in the single-player game I happen to be enjoying...)
If there is a future for this idea, it's got to be shipping the LLM with the game, right?
They're selling dreams and aspirations, and those are what's driving the funding.
OpenAI's and Anthropic's APIs are obviously not latency-driven, and the same goes for comparable LLM API resellers like Azure; most people are likely not expecting tight latency SLOs there. That said, chat experiences (especially voice ones) would probably be even more valuable if they could react in "human time" instead of with a few seconds of delay.
Integrating specialized hardware that can shave inference down to fractions of a second seems useful in a variety of latency-sensitive applications, especially if it allows larger language models to be used where they were traditionally too slow.
Reducing latency doesn't automatically translate to winning the market or even to increased revenue. There are plenty of other variables: functionality, marketing, back-office sales deals, and partnerships. Often users can't even tell which service is objectively better (even though you and I have the know-how and tools to measure it and get a better picture of reality).
Unfortunately the technical angle is only one piece of the puzzle.
they probably bought NVDA stock :)