I benchmarked some of my larger context-window queries last week: Groq's API took 1.6 seconds versus 1.8 to 2.2 seconds for OpenAI's GPT-3.5-turbo. So it wasn't much faster. I almost emailed their support to see if I was doing something wrong. Would love to hear any details about your workload or the complexity of your queries.
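For anyone who wants to run a similar comparison, here is a minimal timing sketch against OpenAI-compatible chat completions endpoints. The endpoint URLs, model names, and environment variable names below are my assumptions, not the parent's actual setup.

```python
# Rough side-by-side latency check (a sketch, not the parent's exact benchmark).
# Endpoint URLs, model names, and env var names are assumptions.
import os
import time
import statistics
import requests

ENDPOINTS = {
    "groq": ("https://api.groq.com/openai/v1/chat/completions",
             os.environ["GROQ_API_KEY"], "llama3-8b-8192"),
    "openai": ("https://api.openai.com/v1/chat/completions",
               os.environ["OPENAI_API_KEY"], "gpt-3.5-turbo"),
}

PROMPT = "Summarise the following text: ..."  # substitute a large-context prompt

def time_one(url: str, key: str, model: str) -> float:
    """Return wall-clock seconds for a single non-streaming completion."""
    t0 = time.monotonic()
    r = requests.post(
        url,
        headers={"Authorization": f"Bearer {key}"},
        json={"model": model,
              "messages": [{"role": "user", "content": PROMPT}]},
        timeout=60,
    )
    r.raise_for_status()
    return time.monotonic() - t0

for name, (url, key, model) in ENDPOINTS.items():
    samples = [time_one(url, key, model) for _ in range(5)]
    print(f"{name}: median {statistics.median(samples):.2f}s over {len(samples)} runs")
```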
In most cases, overall response time is dominated by output generation, since producing an output token is ~100x slower than processing an input token.
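To make that concrete, here is a back-of-envelope split under that ~100x assumption. The per-token times are illustrative placeholders I picked, not measured values.

```python
# Back-of-envelope latency split under the ~100x input/output assumption above.
# The per-token times are illustrative placeholders, not measurements.
t_in = 0.0003   # seconds per prompt token (assumed)
t_out = 0.03    # seconds per generated token, ~100x slower (assumed)

prompt_tokens, output_tokens = 4000, 300
prompt_time = prompt_tokens * t_in      # 1.2 s
output_time = output_tokens * t_out     # 9.0 s
print(f"prompt: {prompt_time:.1f}s  output: {output_time:.1f}s")
# Even with a prompt more than 10x longer than the response,
# generating the output dominates total response time.
```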
I believe certain companies would kill for 20% performance improvements on their main product.
I’m not convinced latency matters as much as Groq's marketing material claims it does.
Never mind 'chat' use cases: holding a response up for 1.2x longer than it needs to take ties up all sorts of other resources upstream and downstream.
If I were using them to process far more text, e.g. to summarise long documents, or as an inline editing assistant, then I'd care more about speed.
For example, if you're a game company and you want to use LLMs so your players can converse with nonplayer characters in natural language, replacing a multiple-choice conversation tree - you'd want that to be low latency, and you'd want it to be cheap.
(Suddenly having nightmares of getting billed for the conversations I have in the single-player game I happen to be enjoying...)
If there is a future for this idea, it's got to be shipping the LLM with the game, right?
They're selling dreams and aspirations, and those are what's driving the funding.
OpenAI's and Anthropic's APIs are obviously not latency-driven, and the same goes for comparable LLM API resellers like Azure; most people are likely not expecting tight latency SLOs there. That said, chat experiences (especially voice ones) would probably be even more valuable if they could react in "human time" instead of with a few seconds of delay.
Integrating specialized hardware that can shave inference down to fractions of a second seems useful in a variety of latency-sensitive applications, especially if it allows larger language models to be used where they were traditionally too slow.
Reducing latency doesn't automatically translate to winning the market or even to increased revenue. There are plenty of other variables: functionality, marketing, back-office sales deals, and partnerships. Often users can't even tell which service is objectively better (even though you and I have the know-how and tools to measure it and get a better picture of reality).
Unfortunately the technical angle is only one piece of the puzzle.
they probably bought NVDA stock :)