Comment by klysm - Hacker Neue

klysm 5 days ago parent

It's definitely interesting that some neural nets can reduce compute requirements, but that's certainly not making a dent on the LLM part of the pie.

lukeschlather 4 days ago

Sam Altman has made a lot of grandiose claims about how much power he's going to need to scale LLMs, but the evidence seems to suggest the amount of power required to train and operate LLMs is a lot more modest than he would have you believe. (DeepSeek reportedly being trained for just $5M, for example.)

lovich 4 days ago

I saw a claim that DeepSeek had piggybacked off of some aspect of training that ChatGPT had done, and so that cost needed to be included when evaluating DeepSeek.

This training part of LLMs is still mostly Greek to me, so if anyone could explain whether that claim is true or false, and why, I’d appreciate it.

lukeschlather 4 days ago

I think the claim that DeepSeek was trained for $5M is a little questionable. But OpenAI is trying to raise $100B which is 20,000 times as much as $5M. Though even at $1B I think it's probably not that big a deal for Google or OpenAI. My feeling is they can profit on the prices they are charging for their LLM APIs, and that the dominant compute cost is inference, not training. Though obviously that's only true if you're selling billions of dollars worth of API calls like Google and OpenAI.

OpenAI has had $20B in revenue this year, and it seems likely to me they have spent considerably less than that on compute for training GPT-5. Probably not $5M, but quite possibly under $1B.

TomatoCo 4 days ago

So LLMs predict the next token. Basically, you train them by taking training data that's N tokens long and, for X = 1 to N, optimizing the model to predict token X using tokens 1 to X-1.
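To make that concrete, here is a minimal sketch in plain Python of the "predict token X from tokens 1 to X-1" objective. It is an illustration only, not how any real training stack is written: `predict` is a hypothetical stand-in for a model, and the names `next_token_examples` and `average_loss` are made up for this example.

```python
import math

def next_token_examples(tokens):
    """Build (context, target) pairs: predict tokens[x] from tokens[:x].

    The first pair has an empty context; real systems usually put a
    start-of-sequence token there instead.
    """
    return [(tokens[:x], tokens[x]) for x in range(len(tokens))]

def average_loss(tokens, predict):
    """Mean cross-entropy over all next-token examples.

    `predict(context)` is a hypothetical model that returns a dict
    mapping each candidate token to a probability.
    """
    examples = next_token_examples(tokens)
    total = 0.0
    for context, target in examples:
        probs = predict(context)
        # Penalize the model by -log of the probability it gave the true token.
        total += -math.log(probs.get(target, 1e-12))
    return total / len(examples)

# A toy "model" that knows nothing: uniform over a 4-token vocabulary.
VOCAB = ["the", "cat", "sat", "down"]

def uniform_model(context):
    return {tok: 1.0 / len(VOCAB) for tok in VOCAB}

tokens = ["the", "cat", "sat"]
examples = next_token_examples(tokens)
loss = average_loss(tokens, uniform_model)
```

Training amounts to adjusting the model so this loss goes down; the uniform model here is just the "knows nothing" baseline, scoring log(4) per token.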

There's no reason you couldn't generate training data for a model by getting output from another model. You could even get the probability distribution of output tokens from the source model and train the target model to repeat that probability distribution, instead of a single word. That'd be faster, because instead of it learning to say "Hello!" and "Hi!" from two different examples, one where it says hello and one where it says hi, you'd learn to say both from one example that has a probability distribution of 50% for each output.
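A small sketch of that "Hello!" / "Hi!" point, again in plain Python with made-up names. It compares two one-hot ("hard") examples against a single example that carries the source model's 50/50 distribution ("soft" target), using cross-entropy as the loss. The numbers and the `student` distributions are assumptions chosen for illustration, not anything from DeepSeek or OpenAI.

```python
import math

def cross_entropy(target_probs, student_probs):
    """H(target, student) = -sum over tokens of target[t] * log(student[t])."""
    return -sum(p * math.log(student_probs[tok])
                for tok, p in target_probs.items() if p > 0)

# Hard labels: two separate examples, each saying "this exact token was correct".
hard_examples = [{"Hello!": 1.0}, {"Hi!": 1.0}]

# Soft label: one example carrying the source model's whole distribution.
soft_example = {"Hello!": 0.5, "Hi!": 0.5}

# A student that already matches the source model's distribution.
student = {"Hello!": 0.5, "Hi!": 0.5}
# A student that has only really learned one of the two greetings.
skewed_student = {"Hello!": 0.9, "Hi!": 0.1}

hard_loss = sum(cross_entropy(t, student) for t in hard_examples) / len(hard_examples)
soft_loss = cross_entropy(soft_example, student)
skewed_loss = cross_entropy(soft_example, skewed_student)
```

Here `hard_loss` and `soft_loss` come out equal: the one soft example gives the same training signal as averaging over the two hard ones, which is the sense in which it is "faster". `skewed_loss` is higher, so the soft target pushes the student toward saying both.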

Sometimes DeepSeek said its name is ChatGPT. This could be because they used Q&A pairs from ChatGPT for training, or because they scraped conversations other people posted where they were talking to ChatGPT. Or it could be for some unknown reason where the model just decided to respond that way, like mixing up the semantics of wanting to say "I'm an AI" with all the scraped data referring to AI as ChatGPT.

Short of an admission or a leak of DeepSeek's training data, it's hard to tell. On the other hand, DeepSeek really went hard into an architecture that is cheap to train, using a lot of weird techniques to optimize their training process for their hardware.

Personally, I think they did use ChatGPT output. Research shows that a model can be greatly improved with a relatively small set of high-quality Q&A pairs. But I'm not sure that should influence the cost evaluation much, because the ChatGPT training price was only paid once; it doesn't have to be repaid for every new model that cribs its answers.
