- I'm not sure people outside of Greater Boston would care, but those of us who do live there probably find it exceedingly strange that this occurred in Brookline of all places.
- Well, currently we have a ton of Congresspeople who are primarily motivated by their "good financial sense" (for obvious reasons e.g. this study). So, I think we could do with a few more Congresspeople with less financial sense and more genuine motivation to improve the lives of their constituents.
- Maybe this is a nitpick but CoNLL NER is not a "challenging task". Even pre-LLM systems were getting >90 F1 on that as far back as 2016.
Also, just in case people want to lit review further on this topic: they call their method "programmatic data curation" but I believe this approach is also called model distillation and/or student-teacher training.
- Is it actually more feasible now? Do LLMs actually make this problem easier to solve?
Because I have a hard time believing they can actually extract time increments and higher-level tasks from log data without a ton of pre/post-processing. But then the problem is just as much work as it was 5 years ago when you might have been using plain old BERT.
- I suggest going through the exercise of seeing whether this is true quantitatively. Get a business-relevant NER dataset together (not CoNLL, preferably something that your boss or customers would care about), run it against Mistral/etc, look at the P/R/F1 scores, and ask "does this solve the problem that I want to solve with NER". If the answer is 'yes', and you could do all those things without reading the book or other NLP educational sources, then yeah you're right, job's finished.
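Here's a minimal sketch of that exercise in Python. The `query_model` function is just a placeholder I'm assuming you'd fill in with whatever model/API you're actually testing (Mistral or otherwise); the scoring is plain entity-level exact match.

```python
from typing import List, Set, Tuple

Entity = Tuple[int, int, str]  # (start_char, end_char, entity_type)

def query_model(text: str) -> Set[Entity]:
    """Placeholder: prompt the LLM to extract entities from `text`
    and parse its output into (start, end, type) spans."""
    raise NotImplementedError

def evaluate(gold: List[Set[Entity]], pred: List[Set[Entity]]):
    """Entity-level exact-match precision/recall/F1 over a dataset."""
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        tp += len(g & p)   # predicted spans that exactly match gold
        fp += len(p - g)   # predicted spans with no gold match
        fn += len(g - p)   # gold spans the model missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```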
- You are a teenager who needs oral contraceptives because you are sexually active. You don't want your parents to find out. Since you're a teenager, you have a few constraints:
- you have no car, how do you get to your doctor's appointment without asking parents for a ride?
- you are a minor, do you have any guarantee that your doctor won't tell your parents? You can't risk them finding out, they are very conservative
- you may not have ever made a doctor's appointment for yourself before, and maybe don't have access to insurance information, etc.
Planned Parenthood provides BCPs at a price you can afford with your teenager job (also guarantees privacy) but the closest one is hours away...
What do you do?
- Communication rates are very similar across languages: https://www.science.org/doi/10.1126/sciadv.aaw2594
See also (great read): https://pubmed.ncbi.nlm.nih.gov/31006626/
wrt your Spanish example: grammatical gender adds information redundancy to make it easier to process spoken language (e.g. helps with reference resolution). This redundancy enables Spanish speakers to speak at a relatively fast rate without incurring perception errors. English has fewer words but a slower speech rate. It's an optimization problem.
The speech rate issue isn't as obvious if you're only looking at text, but I'd argue/speculate that lossless speech as a language evolutionary constraint has implications for learnability.
tl;dr there is no communication tax; languages are basically equivalent wrt information rate, they just solved the optimization problem of compactness vs speech rate differently
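To make the trade-off concrete, here's a toy calculation. The numbers are rounded and purely illustrative (not pulled from the paper's tables); the point is that information rate is roughly bits per syllable times syllables per second, and the two factors move in opposite directions.

```python
# Illustrative only: denser languages are spoken slower, lighter ones faster.
english = {"bits_per_syllable": 7.0, "syllables_per_sec": 5.5}   # denser, slower
spanish = {"bits_per_syllable": 5.0, "syllables_per_sec": 7.8}   # lighter, faster

for name, lang in [("English", english), ("Spanish", spanish)]:
    rate = lang["bits_per_syllable"] * lang["syllables_per_sec"]
    print(f"{name}: ~{rate:.0f} bits/second")   # both land near the same rate
```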
- "word" isn't a useful concept in a lot of languages. Words are obvious in English because English is analytic: https://en.wikipedia.org/wiki/Analytic_language
But there are tons of languages (not just CJK languages) that use either compounding or combos of root + prefix/suffix/infix to express what would be multiple words in English. E.g. German 'Schadenfreude'. It's actually way more useful to tokenize this as separate parts because e.g. 'Freude' might be part of a lot of other "words" as well. So you can share that token across a lot of words, thereby keeping the vocab compact.
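As a toy illustration of the sharing effect: here's greedy longest-match segmentation over a tiny hand-made vocabulary. Real tokenizers (BPE/WordPiece/unigram) learn their vocab from data, but the idea is the same.

```python
VOCAB = {"schaden", "freude", "freund", "schaft", "vor"}  # made-up toy vocab

def segment(word: str) -> list[str]:
    """Greedy longest-match subword segmentation."""
    word = word.lower()
    pieces, i = [], 0
    while i < len(word):
        # take the longest vocab entry that matches at position i
        match = next((word[i:j] for j in range(len(word), i, -1)
                      if word[i:j] in VOCAB), None)
        if match is None:          # fall back to a single character
            match = word[i]
        pieces.append(match)
        i += len(match)
    return pieces

print(segment("Schadenfreude"))  # ['schaden', 'freude']
print(segment("Vorfreude"))      # ['vor', 'freude']  ('freude' is shared, vocab stays small)
```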
- I don't know what you mean by compiler terms but basically, worse tokenizer = worse LM performance. This is because worse tokenizer means more tokens per sentence so it takes more FLOPs to train on each sentence, on average. So given a fixed training budget, English essentially gets more "learning per token" than other languages.
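Rough illustration with made-up "fertility" numbers (average subword tokens per word), just to show how a fixed token budget buys less text in languages the tokenizer splits more aggressively:

```python
# Illustrative ratios only, not measured from any real tokenizer.
TOKEN_BUDGET = 1_000_000_000     # fixed training budget in tokens

fertility = {"English": 1.3, "Finnish": 2.4, "Thai": 3.1}  # tokens per word (assumed)

for lang, tokens_per_word in fertility.items():
    words_seen = TOKEN_BUDGET / tokens_per_word
    print(f"{lang}: ~{words_seen / 1e6:.0f}M words of text for the same budget")
```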
- For GPT4: "Pricing is $0.03 per 1,000 “prompt” tokens (about 750 words) and $0.06 per 1,000 “completion” tokens (again, about 750 words)."
Meanwhile, there are off-the-shelf models that you can train very efficiently, on relevant data, privately, and run on your own infrastructure.
Yes, GPT4 is probably great at all the benchmark tasks, but models have been great at all the open benchmark tasks for a long time. That's why they have to keep making harder tasks.
Depending on what you actually want to do with LMs, GPT4 might lose to a BERTish model in a cost-benefit analysis--especially given that (in my experience), the hard part of ML is still getting data/QA/infrastructure aligned with whatever it is you want to do with the ML. (At least at larger companies, maybe it's different at startups.)
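Rough back-of-the-envelope using those quoted prices (the workload numbers are made up; plug in your own):

```python
PROMPT_PRICE = 0.03 / 1000       # dollars per prompt token (quoted above)
COMPLETION_PRICE = 0.06 / 1000   # dollars per completion token (quoted above)

docs_per_day = 100_000           # hypothetical document volume
prompt_tokens_per_doc = 800      # instructions + document text (assumed)
completion_tokens_per_doc = 200  # extracted labels/entities (assumed)

daily_cost = docs_per_day * (prompt_tokens_per_doc * PROMPT_PRICE +
                             completion_tokens_per_doc * COMPLETION_PRICE)
print(f"~${daily_cost:,.0f}/day, ~${daily_cost * 365:,.0f}/year")
# vs. a fine-tuned BERT-sized model running on hardware you already own.
```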
- Again, vegans almost always have to read the ingredients/labels on every processed food product they plan to consume. The little 'vegan' icon on the back is new and not consistently used. Choosing a plant-based lifestyle is A LOT more burdensome than not doing that. I know because I've switched back and forth many times and am married to a vegan. Whey, casein, random cream, honey: they're in everything.
Even so: reading ingredients is honestly not that hard.