Now, running local models instead of using them as SaaS has a clear purpose: the price of your local model won't suddenly increase tenfold once you start depending on it, the way a SaaS model's might. Any level of control beyond that is illusory with LLMs.
It's fine for models to have open weights and closed data. It only barely fits the open-source model, IMHO.
An open-weight model addresses the second part of this, but not the first. Even an open-weight model with all of its training data available doesn't fix the first problem: even if you somehow got access to enough hardware to train your own GPT-5 from the published data, you still couldn't meaningfully fix an issue you have with it, not even if you hired Ilya Sutskever and Yann LeCun to do it for you. These are black boxes that no one can actually understand at the level of a program or device.
They probably can't give you the training set, as releasing it would amount to publishing infringing content. Where would you store it, and what would you do with it anyway?
I'm not sure how everyone could have access to the data without someone else taking on the burden of providing it.
I'm also not saying anyone should be forced to disclose training data. I'm only saying that an LLM that is open-weight but not open-data/open-pipeline barely fits the open-source model of the stack mentioned by OP.
An average user like me couldn't run the training pipelines without serious infrastructure, but it's still important to understand how the data is used and how the models are trained, so that we collectively own the model and can openly assess its biases.