- There are equal weight S&P ETFs, which avoid having a handful of stocks dominating. However, they do have to do a lot more rebalancing to keep things in line.
- There is other research that works with pixels of text, such as this recent paper I saw at COLM 2025 https://arxiv.org/abs/2504.02122.
- I think in this context Management Science is an older term that was synonymous with operations research. The flagship journal of INFORMS (the Institute for Operations Research and the Management Sciences) has the same name. It's the study of how to optimize things, with lots of statistics and math. Stanford was at the forefront of the field from George Dantzig onwards. So it's not trying to make management a “science” in this case.
- Attention does help, which is why an LLM can learn arithmetic even with arbitrary tokenization. However, if you put numbers in a standard form, such as right-to-left groups of 3, you make it an easier problem for the LLM to learn: all the examples it sees are in the same format. Here, the issue is that BLT operates in an autoregressive manner (strictly left to right), which makes it harder to tokenize the digits in a way that is easy for the LLM to learn. Making each digit its own token (Llama style), or flipping the digits, might be the best options.
- Math operations go right to left, while we write numbers left to right. So if you see the digits 123... in an autoregressive manner, you don't really know anything yet, since the number could be 12345 or 1234567. If you flip 12345 to 54321, you know the place value of each digit as you encounter it: the 5 you see first is in the ones place, the 4 is in the tens place, etc. It gives the LLM a better chance of learning arithmetic.
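As a toy illustration (my own sketch, not from any of the papers mentioned here), reversing each run of digits in training text is a one-liner:

```python
import re

def reverse_digits(text: str) -> str:
    """Reverse each run of digits so the ones place comes first."""
    return re.sub(r"\d+", lambda m: m.group()[::-1], text)

# After reversal, the first digit the model sees is always the ones
# place, the second the tens place, and so on, no matter how long
# the number turns out to be.
print(reverse_digits("12 + 345 = 357"))  # 21 + 543 = 753
```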
- And in regard to utf-8 being a shitty biased tokenizer, here is a recent paper trying to design a better style of encoding https://arxiv.org/abs/2505.24689
- Virtually all current tokenization schemes do work at the raw byte level, not the utf-8 character level. They do this to avoid the out-of-vocabulary (OOV) or unknown token problem. In older models, if you came across something in the data you couldn't tokenize, you'd emit an <UNK> token. But tokenization should be exactly reversible, so now people use subword tokenizers that include all 256 single bytes in the vocab. That way you can always represent any text by dropping down to the single byte level. The other alternative would be to add all utf-8 code points to the vocabulary, but there are more than 150k of those, and enough of them are rare that many would be undertrained. You'd have a lot of glitch tokens (https://arxiv.org/abs/2405.05417). That does mean an LLM isn't 100% guaranteed to output well-formed utf-8.
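A minimal sketch of the byte-fallback idea, using a hypothetical two-entry subword vocab (greedy longest-match here just to keep it short; real tokenizers use BPE merges or similar):

```python
# Toy byte-fallback tokenizer. IDs 0-255 are the raw bytes; higher IDs
# are learned subwords (hypothetical vocab, for illustration only).
VOCAB = {b"hello": 256, b" world": 257}

def encode(text: str) -> list[int]:
    data = text.encode("utf-8")
    ids, i = [], 0
    while i < len(data):
        for j in range(len(data), i, -1):  # greedy longest match
            if data[i:j] in VOCAB:
                ids.append(VOCAB[data[i:j]])
                i = j
                break
        else:
            ids.append(data[i])  # fall back to the single byte, never <UNK>
            i += 1
    return ids

def decode(ids: list[int]) -> str:
    inv = {v: k for k, v in VOCAB.items()}
    data = b"".join(inv.get(t, bytes([t])) for t in ids)
    return data.decode("utf-8", errors="replace")  # output may not be valid utf-8

print(encode("hello world!"))   # [256, 257, 33] -- '!' is the raw byte 33
print(decode(encode("héllo")))  # 'é' round-trips as two raw utf-8 bytes
```

The point is the fallback branch: there is always a single-byte token available, so nothing is ever <UNK>, but a model sampling byte tokens can in principle emit sequences that don't decode as valid utf-8 (hence errors="replace").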
- I suppose it is. There is a lot to tokenization - pre-tokenization, how to handle digits, the tokenizer training approach - that is about adding cleverness. In the long run, the bitter lesson would be to just get rid of it all and learn from more data. Many people would love to do that. But I think for the case of BLT, digits will still be an issue. There is no way an autoregressive entropy model will be able to split numbers sensibly, since it has no idea how many digits are coming. It seems like it will struggle more with arithmetic. Perhaps you could reverse all the digits in a number; then it has a chance. So 12334 becomes 43321, and it gets to start from the ones digit. This has been suggested as an approach for LLMs.
- This paper has a good solution:
https://arxiv.org/abs/2402.14903
You tokenize right to left in groups of 3, so 1234567 becomes 1 234 567 rather than the default 123 456 7. And if you ensure all 1-3 digit groups are in the vocab, it does much better.
Both https://arxiv.org/abs/2503.13423 and https://arxiv.org/abs/2504.00178 (co-author) independently noted that you can do this just by modifying the pre-tokenization regex, without having to explicitly add commas (see the sketch below).
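For what it's worth, here is one way to write such a regex (my own illustration; the exact patterns used in those papers may differ):

```python
import re

# Right-to-left digit grouping: the lookahead forces any leftover short
# group (1-2 digits) to land at the front of the number.
R2L_DIGITS = re.compile(r"\d{1,3}(?=(?:\d{3})*(?!\d))")

print(R2L_DIGITS.findall("1234567"))      # ['1', '234', '567']
print(re.findall(r"\d{1,3}", "1234567"))  # default left-to-right: ['123', '456', '7']
```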
- I'm just saying that these systems don't work for me. I write ML/AI conference papers in LaTeX, and I think that use case will be tough to dislodge. I can see this being very attractive to people making other types of documents without a fixed format, especially if you don't already know LaTeX.
- One thing that has helped with ease of use is Overleaf. It is a hosted LaTeX editor with lots of collaboration features (leaving comments, history of edits) that let people collaborate in real time on a paper. It comes with many templates to get you started on a new document. If you're working with collaborators, it has a lock on the market.
LaTeX itself can be easy for simple things (pick a template, and put text in each section). And it can grow into almost anything if you put in enough effort. It is far and away the standard way to write math equations, so if your document has lots of formulas, that's a plus.
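For anyone who hasn't tried it, a minimal document really is just a template with text dropped into sections. A sketch:

```latex
\documentclass{article}
\usepackage{amsmath}  % better equation environments

\title{A Minimal Example}
\author{Your Name}

\begin{document}
\maketitle

\section{Introduction}
Plain text goes here, with inline math like $O(nL)$ when needed.
Display equations are where \LaTeX{} really shines:
\begin{equation}
  \mathcal{L}(\theta) = -\sum_{t} \log p_\theta(x_t \mid x_{<t})
\end{equation}

\end{document}
```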
- That's an interesting point. While you're correct, of course, it is so common to consider a hash table lookup an O(1) operation that it never occurred to me. But in this case, the loops are actually really tight and the hash table lookup might be a significant part of the time, so it might well behave more like O(n L^2). I'll update the docs and paper.
- Yes, they were concurrent work. (Co-author of BoundlessBPE here.) A sibling comment describes the main differences. Our paper motivates why superwords can lead to such a big improvement, by overcoming a limit that pre-tokenization imposes on current tokenization methods. The SuperBPE paper has a wonderful set of downstream evaluation runs. So if you're interested in either, they are quite complementary papers.
- Regarding $O(n L^2)$ vs $O(n L)$: that was because we somewhat sloppily tend to use the term 'tokenization' both for training a tokenizer vocab and for tokenizing a given document. In the paper, we tried to always call the latter segmentation or inference. The former is $O(n L^2)$ per iteration, while the latter is $O(n L)$. I'll update the README to be more explicit about this.
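To make the inference side concrete, here is a simplified sketch of an $O(n L)$ segmentation: a dynamic program that finds a fewest-token split of a document given a fixed vocab, where $L$ is the longest token length (my own condensed version, not the paper's actual code). Each inner step is one hash lookup, which is the constant factor discussed in the sibling comment.

```python
def segment(text: bytes, vocab: set[bytes], L: int) -> list[bytes]:
    """Fewest-token segmentation of text under vocab, assuming all 256
    single bytes are in vocab so a segmentation always exists.
    O(n * L) table cells, each doing one hash lookup."""
    n = len(text)
    INF = float("inf")
    best = [0] + [INF] * n   # best[i]: fewest tokens covering text[:i]
    back = [0] * (n + 1)     # back[i]: start index of the last token used
    for i in range(1, n + 1):
        for length in range(1, min(L, i) + 1):
            if text[i - length:i] in vocab and best[i - length] + 1 < best[i]:
                best[i] = best[i - length] + 1
                back[i] = i - length
    tokens, i = [], n
    while i > 0:             # walk the back pointers to recover tokens
        tokens.append(text[back[i]:i])
        i = back[i]
    return tokens[::-1]

# Hypothetical toy vocab: all single bytes plus two multi-byte tokens.
vocab = {bytes([b]) for b in range(256)} | {b"ab", b"abc"}
print(segment(b"abcab", vocab, L=3))  # [b'abc', b'ab']
```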
- Co-author of the PathPiece paper here.
With regard to weighting the n-grams by length*frequency, I'm not sure it is clear that that would be better. The SentencePiece unigram model does it that way (as I mentioned in another comment), and hence, unigram produces longer tokens on average. It is generally considered that this is a bit of an issue with unigram. Not that there is particular evidence either way, as with many things in tokenization.
Why do you think 2^18 initial n-grams is too few? That's 5.3 times more than the largest vocab we train.
- It appears to be the top n-grams scored by the product of frequency and length (toy sketch below). Including the frequency weighting is a bit nonstandard among ablative methods.
See line 233: https://github.com/google/sentencepiece/blob/master/src/unig...
I would suspect the n-gram counts don't cross pre-token boundaries, but I don't have time to find that in the code right now.
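Here's a toy version of that frequency-times-length seed scoring (my own illustration, not the SentencePiece code linked above), counting within pre-tokens as I'd expect the real implementation to:

```python
from collections import Counter

def seed_candidates(pretokens: list[str], max_len: int, k: int) -> list[str]:
    """Score every substring (up to max_len) by frequency * length and
    keep the top k. Counting within pre-tokens, so no candidate crosses
    a pre-token boundary."""
    counts = Counter()
    for pt in pretokens:
        for i in range(len(pt)):
            for j in range(i + 1, min(i + max_len, len(pt)) + 1):
                counts[pt[i:j]] += 1
    return sorted(counts, key=lambda s: counts[s] * len(s), reverse=True)[:k]

print(seed_candidates(["the", "then", "they"], max_len=4, k=5))
# e.g. ['the', 'th', 'he', 'then', 'they'] -- 'the' wins on freq * length
```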
- This paper was accepted as a poster to NeurIPS 2024, so it isn't just a pre-print. There is a presentation video and slides here:
https://neurips.cc/virtual/2024/poster/94849
The underlying data has been open sourced as discussed on his blog here https://timothynguyen.org/2024/11/07/open-sourced-my-work-on...
- Here's a paper reviewing the various choices that is often mentioned in discussions around data structures for text editors:
- It seems like Helix is using it https://github.com/helix-editor/helix/blob/master/docs/archi...