A few people have mentioned looking at the vLLM docs and blog (recommended!). I'd recommend SGLang's docs and blog as well.
If you're interested in a deeper dive, I can highly recommend reading some of what DeepSeek has published: https://arxiv.org/abs/2505.09343 (and quite a few of their other technical reports and papers).
I'd also say that while the original GPT-4 was a huge model when it was released (rumored 1.7T total / 220B active), these days you can get (original release) "GPT-4-class" performance from a ~30B dense or ~100B sparse MoE model - and almost all the leading MoEs have between 12-37B active parameters no matter how big they get (Kimi K2, at 1T total weights, has only 32B activations). With basic quantization (FP8/INT8) you can easily push 100+ tok/s on pretty bog-standard data center GPUs/nodes. You can quant even lower for even better single-stream speeds (token generation is basically just memory bandwidth) without much quality loss, although with open source kernels you usually don't see much overall throughput or latency improvement.
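To make the "token generation is just memory bandwidth" point concrete, here's a back-of-envelope sketch. All the numbers are illustrative assumptions (K2-style ~32B active params, ~3 TB/s of HBM bandwidth), not measurements of any particular setup:

    # Back-of-envelope: single-stream decode speed is roughly memory bandwidth
    # divided by the bytes of active weights you have to stream per token.
    def decode_tok_s(active_params_b: float, bytes_per_param: float, mem_bw_tb_s: float) -> float:
        """Rough tokens/sec ceiling for a memory-bandwidth-bound decode."""
        bytes_per_token = active_params_b * 1e9 * bytes_per_param
        return mem_bw_tb_s * 1e12 / bytes_per_token

    # ~32B active params on a GPU with ~3 TB/s of HBM bandwidth:
    print(decode_tok_s(32, 2.0, 3.0))  # BF16 weights -> ~47 tok/s ceiling
    print(decode_tok_s(32, 1.0, 3.0))  # FP8/INT8     -> ~94 tok/s ceiling
    print(decode_tok_s(32, 0.5, 3.0))  # ~4-bit       -> ~188 tok/s ceiling

Real numbers shift with KV cache reads, batching, and kernel efficiency, but the way throughput scales with quantization level is the point.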
A few people have mentioned speculative decoding; if you want to learn more, I'd recommend taking a look at the papers for one of the (IMO) best open techniques, EAGLE: https://github.com/SafeAILab/EAGLE
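If it helps to see the shape of the idea, here's a toy sketch of the draft-then-verify loop. It's my own simplification: the stand-in "models" and acceptance rate are made up, verification here is greedy matching, and real systems (EAGLE, vLLM, SGLang) verify all drafted tokens in a single batched forward pass with rejection sampling against the full distributions.

    import random

    # Toy speculative decoding: a cheap draft model proposes k tokens, the big
    # target model checks them, and we keep the longest agreeing prefix plus
    # one bonus token from the target.
    VOCAB = list(range(100))

    def target_next(ctx):  # stand-in for the big model's greedy pick
        return (sum(ctx) * 31 + 7) % 100

    def draft_next(ctx):   # stand-in for a cheap draft model that mostly agrees
        t = target_next(ctx)
        return t if random.random() < 0.8 else random.choice(VOCAB)

    def spec_decode_step(ctx, k=4):
        drafted, tmp = [], list(ctx)
        for _ in range(k):                 # 1) draft k tokens cheaply
            drafted.append(draft_next(tmp))
            tmp.append(drafted[-1])
        accepted, tmp = [], list(ctx)
        for tok in drafted:                # 2) verify (one batched pass in real systems)
            if tok != target_next(tmp):
                break
            accepted.append(tok)
            tmp.append(tok)
        accepted.append(target_next(tmp))  # 3) always emit one token from the target
        return accepted

    ctx = [1, 2, 3]
    for _ in range(5):
        step = spec_decode_step(ctx)
        ctx.extend(step)
        print(step)  # several tokens per target "pass" when the draft is good

The win is that the target model's weights get read once per verify pass instead of once per token, which is exactly what you want when decode is bandwidth-bound.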
The other thing that's often ignored, and that I haven't seen mentioned yet, is better caching - especially for multi-turn: prefix caching (radix-tree or block-level hash) and tiered/offloaded KV caches (LMCache is one example). If you search for those keywords, you'll find a lot more.
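As a rough illustration of the block-level hash flavor (the chained-hash scheme and the "KV block" placeholder below are my own simplification, not any particular engine's API):

    from hashlib import sha256

    # Sketch of block-level hash prefix caching: each full block of tokens gets
    # a hash chained from the previous block's hash, so a block's key encodes
    # its entire prefix, and any request sharing that prefix re-uses the cached
    # KV blocks instead of re-running prefill for them.
    BLOCK_SIZE = 16
    cache = {}  # block hash -> cached KV block (placeholder)

    def block_hashes(tokens):
        hashes, prev = [], ""
        for i in range(0, len(tokens) - len(tokens) % BLOCK_SIZE, BLOCK_SIZE):
            block = tokens[i:i + BLOCK_SIZE]
            prev = sha256((prev + ",".join(map(str, block))).encode()).hexdigest()
            hashes.append(prev)
        return hashes

    def prefill(tokens):
        """Return how many prompt tokens were served from cache; compute and cache the rest."""
        hit_tokens = 0
        for h in block_hashes(tokens):
            if h in cache:
                hit_tokens += BLOCK_SIZE
            else:
                cache[h] = object()  # stand-in for "compute KV for this block"
        return hit_tokens

    system_prompt = list(range(64))
    print(prefill(system_prompt + [1, 2, 3]))  # 0  - nothing cached on the first request
    print(prefill(system_prompt + [9, 8, 7]))  # 64 - the shared system prompt is re-used

Radix-tree caching (SGLang's approach) gets the same effect at finer granularity, and tiered setups like LMCache spill those cached blocks to CPU memory, disk, or remote storage so long multi-turn sessions don't have to re-prefill their history.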