- There are equal weight S&P ETFs, which avoid having a handful of stocks dominating. However, they do have to do a lot more rebalancing to keep things in line.
- There is other research that works with pixels of text, such as this recent paper I saw at COLM 2025 https://arxiv.org/abs/2504.02122.
- I think in this context Management Science is an older term that was synonymous with operations research. The flagship journal of INFORMS (the Institute for Operations Research and the Management Sciences) has the same name. It's the study of how to optimize things, with lots of statistics and math. Stanford was at the forefront of the field from George Dantzig onwards. So it's not trying to make management a “science” in this case.
- Attention does help, which is why an LLM can learn arithmetic even with arbitrary tokenization. However, if you put numbers in a standard form, such as right-to-left groups of 3, you make it an easier problem for the LLM to learn: all the examples it sees are in the same format. Here, the issue is that BLT operates in an autoregressive manner (strictly left to right), which makes it harder to tokenize the digits in a way that is easy for the LLM to learn. Making each digit its own token (Llama style), or flipping the digits, might be the best options.
- Math operations go right to left, while we write numbers left to right. So if you see the digits 123... in an autoregressive manner, you don't really know anything yet, since the number could be 12345 or 1234567. If you flip 12345 to 54321, you know the place value of each digit as you encounter it: the 5 you see first is in the ones place, the 4 is in the tens place, etc. It gives the LLM a better chance of learning arithmetic.
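As a toy illustration (my own sketch, not from any of the papers mentioned here), reversing each run of digits in training text is a one-liner:

```python
import re

def reverse_digits(text: str) -> str:
    """Reverse each run of digits so the ones place comes first."""
    return re.sub(r"\d+", lambda m: m.group()[::-1], text)

# After reversal, the first digit the model sees is always the ones
# place, the second the tens place, and so on, no matter how long
# the number turns out to be.
print(reverse_digits("12 + 345 = 357"))  # 21 + 543 = 753
```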
- And in regard to utf-8 being a shitty biased tokenizer, here is a recent paper trying to design a better style of encoding https://arxiv.org/abs/2505.24689
- Virtually all current tokenization schemes do work at the raw byte level, not the utf-8 character level. They do this to avoid the out-of-vocabulary (OOV) or unknown token problem. In older models, if you came across something in the data you couldn't tokenize, you'd emit an <UNK> token. But tokenization should be exactly reversible, so now people use subword tokenizers that include all 256 single bytes in the vocab. That way you can always represent any text by dropping down to the single byte level. The other alternative would be to add all utf-8 code points to the vocabulary, but there are more than 150k of those, and enough of them are rare that many would be undertrained. You'd have a lot of glitch tokens (https://arxiv.org/abs/2405.05417). That does mean an LLM isn't 100% guaranteed to output well-formed utf-8.
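A minimal sketch of the byte-fallback idea, using a hypothetical two-entry subword vocab (greedy longest-match here just to keep it short; real tokenizers use BPE merges or similar):

```python
# Toy byte-fallback tokenizer. IDs 0-255 are the raw bytes; higher IDs
# are learned subwords (hypothetical vocab, for illustration only).
VOCAB = {b"hello": 256, b" world": 257}

def encode(text: str) -> list[int]:
    data = text.encode("utf-8")
    ids, i = [], 0
    while i < len(data):
        for j in range(len(data), i, -1):  # greedy longest match
            if data[i:j] in VOCAB:
                ids.append(VOCAB[data[i:j]])
                i = j
                break
        else:
            ids.append(data[i])  # fall back to the single byte, never <UNK>
            i += 1
    return ids

def decode(ids: list[int]) -> str:
    inv = {v: k for k, v in VOCAB.items()}
    data = b"".join(inv.get(t, bytes([t])) for t in ids)
    return data.decode("utf-8", errors="replace")  # output may not be valid utf-8

print(encode("hello world!"))   # [256, 257, 33] -- '!' is the raw byte 33
print(decode(encode("héllo")))  # 'é' round-trips as two raw utf-8 bytes
```

The point is the fallback branch: there is always a single-byte token available, so nothing is ever <UNK>, but a model sampling byte tokens can in principle emit sequences that don't decode as valid utf-8 (hence errors="replace").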
- I suppose it is. There is a lot to tokenization - pre-tokenization, how to handle digits, the tokenizer training approach - that is about adding cleverness. In the long run, the bitter lesson would be to just get rid of it all and learn from more data. Many people would love to do that. But I think for the case of BLT, digits will still be an issue. There is no way an autoregressive entropy model will be able to split numbers sensibly, since it has no idea how many digits are coming. It seems like it will struggle more with arithmetic. Perhaps you could reverse all the digits in a number; then it has a chance. So 12334 becomes 43321, and it gets to start from the ones digit. This has been suggested as an approach for LLMs.
- This paper has a good solution:
https://arxiv.org/abs/2402.14903
You tokenize right to left in groups of 3, so 1234567 becomes 1 234 567 rather than the default 123 456 7. And if you ensure all 1-3 digit groups are in the vocab, it does much better.
Both https://arxiv.org/abs/2503.13423 and https://arxiv.org/abs/2504.00178 (co-author) independently noted that you can do this just by modifying the pre-tokenization regex, without having to explicitly add commas (see the sketch below).
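For what it's worth, here is one way to write such a regex (my own illustration; the exact patterns used in those papers may differ):

```python
import re

# Right-to-left digit grouping: the lookahead forces any leftover short
# group (1-2 digits) to land at the front of the number.
R2L_DIGITS = re.compile(r"\d{1,3}(?=(?:\d{3})*(?!\d))")

print(R2L_DIGITS.findall("1234567"))      # ['1', '234', '567']
print(re.findall(r"\d{1,3}", "1234567"))  # default left-to-right: ['123', '456', '7']
```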
- I'm just saying that these systems don't work for me. I write ML/AI conference papers in LaTeX, and I think that use case will be tough to dislodge. I can see this being very attractive to people making other types of documents without a fixed format, especially if you don't already know LaTeX.
- One thing that has helped with ease of use is Overleaf. It is a hosted LaTeX editor with lots of collaboration features (leaving comments, history of edits) that let people collaborate in real time on a paper. It comes with many templates to get you started on a new document. If you're working with collaborators, it has a lock on the market.
LaTeX itself can be easy for simple things (pick a template, and put text in each section). And it can grow into almost anything if you put in enough effort. It is far and away the standard way to write math equations, so if your document has lots of formulas, that's a plus.
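For anyone who hasn't tried it, a minimal document really is just a template with text dropped into sections. A sketch:

```latex
\documentclass{article}
\usepackage{amsmath}  % better equation environments

\title{A Minimal Example}
\author{Your Name}

\begin{document}
\maketitle

\section{Introduction}
Plain text goes here, with inline math like $O(nL)$ when needed.
Display equations are where \LaTeX{} really shines:
\begin{equation}
  \mathcal{L}(\theta) = -\sum_{t} \log p_\theta(x_t \mid x_{<t})
\end{equation}

\end{document}
```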
- That's an interesting point. While you're correct, of course, it is so common to consider a hash table lookup an O(1) operation that it never occurred to me. But in this case, the loops are actually really tight and the hash table lookup might be a significant part of the time, so it might well behave more like O(n L^2). I'll update the docs and paper.
- Yes, they were concurrent work. (Co-author of BoundlessBPE here.) A sibling comment describes the main differences. Our paper motivates why superwords can lead to such a big improvement, by overcoming a limit that pre-tokenization imposes on current tokenization methods. The SuperBPE paper has a wonderful set of downstream evaluation runs. So if you're interested in either, they are quite complementary papers.
- Regarding $O(n L^2)$ vs $O(n L)$: that was because we somewhat sloppily tend to use the term 'tokenization' both for training a tokenizer vocab and for tokenizing a given document. In the paper, we tried to always call the latter segmentation or inference. The former is $O(n L^2)$ per iteration, while the latter is $O(n L)$. I'll update the README to be more explicit about this.
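To make the inference side concrete, here is a simplified sketch of an $O(n L)$ segmentation: a dynamic program that finds a fewest-token split of a document given a fixed vocab, where $L$ is the longest token length (my own condensed version, not the paper's actual code). Each inner step is one hash lookup, which is the constant factor discussed in the sibling comment.

```python
def segment(text: bytes, vocab: set[bytes], L: int) -> list[bytes]:
    """Fewest-token segmentation of text under vocab, assuming all 256
    single bytes are in vocab so a segmentation always exists.
    O(n * L) table cells, each doing one hash lookup."""
    n = len(text)
    INF = float("inf")
    best = [0] + [INF] * n   # best[i]: fewest tokens covering text[:i]
    back = [0] * (n + 1)     # back[i]: start index of the last token used
    for i in range(1, n + 1):
        for length in range(1, min(L, i) + 1):
            if text[i - length:i] in vocab and best[i - length] + 1 < best[i]:
                best[i] = best[i - length] + 1
                back[i] = i - length
    tokens, i = [], n
    while i > 0:             # walk the back pointers to recover tokens
        tokens.append(text[back[i]:i])
        i = back[i]
    return tokens[::-1]

# Hypothetical toy vocab: all single bytes plus two multi-byte tokens.
vocab = {bytes([b]) for b in range(256)} | {b"ab", b"abc"}
print(segment(b"abcab", vocab, L=3))  # [b'abc', b'ab']
```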
- Co-author of the PathPiece paper here.
With regard to weighting the n-grams by length*frequency, I'm not sure it is clear that that would be better. The SentencePiece unigram model does it that way (as I mentioned in another comment), and hence, unigram produces longer tokens on average. It is generally considered that this is a bit of an issue with unigram. Not that there is particular evidence either way, as with many things in tokenization.
Why do you think 2^18 initial n-grams is too few? That's 5.3 times more than the largest vocab we train.
- It appears to be the top n-grams scored by the product of frequency and length (toy sketch below). Including the frequency weighting is a bit nonstandard among ablative methods.
See line 233: https://github.com/google/sentencepiece/blob/master/src/unig...
I would suspect the n-gram counts don't cross pre-token boundaries, but I don't have time to find that in the code right now.
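Here's a toy version of that frequency-times-length seed scoring (my own illustration, not the SentencePiece code linked above), counting within pre-tokens as I'd expect the real implementation to:

```python
from collections import Counter

def seed_candidates(pretokens: list[str], max_len: int, k: int) -> list[str]:
    """Score every substring (up to max_len) by frequency * length and
    keep the top k. Counting within pre-tokens, so no candidate crosses
    a pre-token boundary."""
    counts = Counter()
    for pt in pretokens:
        for i in range(len(pt)):
            for j in range(i + 1, min(i + max_len, len(pt)) + 1):
                counts[pt[i:j]] += 1
    return sorted(counts, key=lambda s: counts[s] * len(s), reverse=True)[:k]

print(seed_candidates(["the", "then", "they"], max_len=4, k=5))
# e.g. ['the', 'th', 'he', 'then', 'they'] -- 'the' wins on freq * length
```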
- This paper was accepted as a poster to NeurIPS 2024, so it isn't just a pre-print. There is a presentation video and slides here:
https://neurips.cc/virtual/2024/poster/94849
The underlying data has been open sourced as discussed on his blog here https://timothynguyen.org/2024/11/07/open-sourced-my-work-on...
- Here's a paper reviewing the various choices that is often mentioned in discussions around data structures for text editors:
- It seems like Helix is using it https://github.com/helix-editor/helix/blob/master/docs/archi...