You've made several incorrect assumptions, and I'm not bothered enough to try to correct them, so I apologize for my ignorance. I'll just say that the 16 ms memory tax is wildly incorrect.
namibj
Either you have a massive misconception about GPT-like decoder transformers or about how GPU data paths are architected, or you are trolling.
Go talk to a modern reasoning model to get yourself some knowledge; it's gonna be much better than what you appear to have.