1. https://jalammar.github.io/illustrated-word2vec/
2. https://jalammar.github.io/visualizing-neural-machine-transl...
3. https://jalammar.github.io/illustrated-transformer/
4. https://jalammar.github.io/illustrated-bert/
5. https://jalammar.github.io/illustrated-gpt2/
And from there it's mostly been work on improving optimization (at both training and inference time), training techniques (many stages), data (quality and modality), and scale.
---
There are also state space models, but I don't believe they've gone mainstream yet.
https://newsletter.maartengrootendorst.com/p/a-visual-guide-...
And diffusion models, but I'm struggling to find a good resource, so here's a demo instead: https://ml-gsai.github.io/LLaDA-demo/
---
All this being said, many tasks are still solved very well by a linear model over TF-IDF features, and the results are actually interpretable.
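For concreteness, here's a minimal sketch of that baseline with scikit-learn (the tiny dataset and labels are made up purely for illustration):

```python
# TF-IDF features + linear classifier: the classic interpretable baseline.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

docs = [
    "the movie was great, I loved it",
    "terrible plot, wasted two hours",
    "wonderful acting and a great story",
    "boring, I walked out halfway",
]
labels = [1, 0, 1, 0]  # 1 = positive, 0 = negative (toy data)

clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(docs, labels)

print(clf.predict(["great story, loved the acting"]))  # likely [1]

# Interpretability: each feature is a word, each coefficient its weight.
vec = clf.named_steps["tfidfvectorizer"]
lr = clf.named_steps["logisticregression"]
for word, coef in sorted(zip(vec.get_feature_names_out(), lr.coef_[0]),
                         key=lambda t: t[1]):
    print(f"{word:>10s} {coef:+.2f}")
```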
Indeed, before that there was a lot of work on applying classical ML classifiers (Naive Bayes, Decision Trees, SVMs, Logistic Regression...) and clustering algorithms (fancily referred to as unsupervised ML) to bag-of-words vectors. This was a big field, with some overlap with Information Retrieval, which contributed fancier weightings and normalizations of bag-of-words vectors (TF-IDF, BM25). There was also the whole field of Topic Modeling.
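BM25, for instance, is just a document-length-aware reweighting of term counts. A rough sketch over a toy corpus, using the usual k1/b defaults and the Lucene-style "+1 inside the log" IDF:

```python
import math
from collections import Counter

docs = [
    "the quick brown fox".split(),
    "the lazy dog sleeps".split(),
    "the fox jumps over the lazy dog".split(),
]
N = len(docs)
avgdl = sum(len(d) for d in docs) / N
df = Counter(term for d in docs for term in set(d))  # document frequency

def bm25(query, doc, k1=1.5, b=0.75):
    tf = Counter(doc)  # term frequency within this document
    score = 0.0
    for term in query:
        if term not in tf:
            continue
        idf = math.log((N - df[term] + 0.5) / (df[term] + 0.5) + 1)
        # Saturating TF, normalized by document length relative to average.
        score += idf * tf[term] * (k1 + 1) / (
            tf[term] + k1 * (1 - b + b * len(doc) / avgdl))
    return score

for d in docs:
    print(f"{bm25(['lazy', 'fox'], d):.3f}", " ".join(d))
```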
Before that there was a ton of statistical NLP modeling (Markov chains and such), primarily focused on machine translation in the days before neural networks got good enough (think the early versions of Google Translate).
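A first-order Markov chain over words is about as simple as those models get; a toy sketch with a made-up corpus:

```python
# Bigram (first-order Markov) language model: count transitions, then
# generate by sampling the next word given only the current one.
import random
from collections import defaultdict

corpus = "the cat sat on the mat . the dog sat on the rug .".split()

transitions = defaultdict(list)
for prev, cur in zip(corpus, corpus[1:]):
    transitions[prev].append(cur)

random.seed(0)  # reproducible toy output
word, out = "the", ["the"]
for _ in range(8):
    word = random.choice(transitions[word])
    out.append(word)
print(" ".join(out))
```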
And before that there were a few decades of research on grammars (starting with Chomsky), with a lot of overlap with compilers, theoretical CS (state machines and such), and symbolic AI (Lisps, logic programming, expert systems...).
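For a taste of that era, here's a tiny context-free grammar run through NLTK's chart parser (the grammar and sentence are toy examples):

```python
import nltk

# A minimal CFG of the kind that dominated pre-statistical NLP.
grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> Det N
VP -> V NP
Det -> 'the' | 'a'
N -> 'dog' | 'cat'
V -> 'chased' | 'saw'
""")

parser = nltk.ChartParser(grammar)
for tree in parser.parse("the dog chased a cat".split()):
    tree.pretty_print()  # draws the parse tree as ASCII art
```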
I myself don't have a very clear picture of all of this. I learned some of it in undergrad and read a few ancient NLP books (60s-90s) out of curiosity. I started around the time when NLP, and AI in general, had been rather stagnant for a decade or two. It was rather boring and niche, believe it or not, but it was starting to be revitalized by the new wave of ML, and then by word2vec and DNNs.