cschmidt
This paper has a good solution:

https://arxiv.org/abs/2402.14903

You tokenize digits right-to-left in groups of 3, so 1234567 becomes 1 234 567 rather than the default 123 456 7. And if you ensure all 1-3 digit groups are in the vocab, it does much better.

Both https://arxiv.org/abs/2503.13423 and https://arxiv.org/abs/2504.00178 (co-author) independently noted that you can do this just by modifying the pre-tokenization regex, without having to explicitly add commas.
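
As a rough sketch of the regex-only idea (the exact patterns in those papers may differ; this just illustrates the lookahead trick in Python): a group of 1-3 digits is only allowed to match if the digits after it come in complete blocks of three, which forces the grouping to align from the right.

```python
import re

# Default left-to-right grouping (as in common GPT-style pre-tokenizers):
# greedy groups of up to three digits, scanned from the left.
LTR = re.compile(r"\d{1,3}")

# Right-to-left grouping: a 1-3 digit group only matches if the digits
# following it form complete blocks of three.
RTL = re.compile(r"\d{1,3}(?=(?:\d{3})*(?!\d))")

print(LTR.findall("1234567"))  # ['123', '456', '7']
print(RTL.findall("1234567"))  # ['1', '234', '567']
```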


nielsole
Isn't that the opposite of the bitter lesson - adding more cleverness to the architecture?
fennecbutt
I guess it's just working with the brain model (so to speak) rather than against it.

Inthesamewaythatweusepunctuation. Or even that we usually order words a certain way, oranges and apples, Ted and Bill, roundabouts and swings.

cschmidt OP
I suppose it is. There is a lot to tokenization - pre-tokenization, how to handle digits, the tokenization training approach - that is about adding cleverness. In the long run, the bitter lesson would be to just get rid of it all and learn from more data. Many people would love to do that.

But I think for the case of BLT, digits will still be an issue. There is no way an autoregressive entropy model will be able to split numbers sensibly, since it has no idea how many digits are coming. It seems like it will struggle even more with arithmetic. Perhaps you could reverse all the digits in a number; then it has a chance. So 12334 becomes 43321, and the model gets to start from the ones digit, which is the order in which carries propagate. This has been suggested as an approach for LLMs.
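
A minimal sketch of that reversal as a text preprocessing step (the function name and the choice to reverse every digit run are illustrative assumptions, not a fixed recipe):

```python
import re

def reverse_digit_runs(text: str) -> str:
    """Reverse each run of digits so the ones digit comes first."""
    return re.sub(r"\d+", lambda m: m.group()[::-1], text)

print(reverse_digit_runs("12334 + 41 = 12375"))  # -> "43321 + 14 = 57321"
```
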
infogulch
Little endian wins in the end.
... why does reversing all the digits help? Could you please explain it? Many thanks!
What do the vector space embeddings for digit strings even look like? Can you do arithmetic on them? If that's even desirable, it seems like you could just skip "embedding" altogether and intern all the numbers along one dimension.
jvanderbot
Ok great! This is precisely how I chunk numbers for comparison. And not to diminish a solid result or the usefulness of it or the baseline tech: it's clear that if we keep having to create situation-specific inputs or processes, we're not at AGI with this baseline tech.
chmod775
> [..] we're not at AGI with this baseline tech

DAG architectures fundamentally cannot be AGI and you cannot even use them as a building block for a hypothetical AGI if they're immutable at runtime.

Any time I hear the goal being "AGI" in the context of these LLMs, I feel like I'm listening to a bunch of 18th-century aristocrats trying to get to the moon by growing trees.

Try to create useful approximations using what you have, or look for new approaches, but don't waste time on the impossible. There are no iterative improvements here that will get you to AGI.

kristjansson
> "So... what does the thinking?"

> "You're not understanding, are you? The brain does the thinking. The meat."

> "Thinking meat! You're asking me to believe in thinking meat!"

https://www.mit.edu/people/dpolicar/writing/prose/text/think...

AllegedAlec
Thank you. It's maddening how people keep making this fundamental mistake.
mgraczyk
This is meant to be some kind of Chinese room argument? Surely a 1e18 context window model running at 1e6 tokens per second could be AGI.
chmod775
Personally I'm hoping for advancements that will eventually allow us to build vehicles capable of reaching the moon, but do keep me posted on those tree growing endeavors.
mgraczyk
Tree growing?

And I don't follow; we've had vehicles capable of reaching the moon for over 55 years.

This argument works better for state space models. A transformer still steps through its context one token at a time; it doesn't maintain an internal 1e18-sized state.
mgraczyk
That doesn't matter. Are you familiar with any theoretical results in which the computation is limited in ways that practically matter when the context length is very long? I am not.
"Surely a 1e18 context window model running at 1e6 tokens per second could be AGI."

And why?

mgraczyk
Because that's quite a bit more information processing than any human brain
