This is specifically why I use LM Studio and paid for 128 GB on my MacBook.
I keep Devstral (~15 GB) loaded in memory at all times, since I have so much extra.
I can't wait for a few years from now, when I can get triple the memory bandwidth with this much RAM.
I've found that Linux already does this out of the box: if you load a model once, subsequent loads are much faster because the OS caches the on-disk pages entirely in RAM (assuming you have enough RAM). If you switch to another model and/or run out of RAM, the OS will automatically evict some of those cached pages to make room. So all you need to do is read the most-used model(s) on startup to warm the disk cache, and add a TTL to unload models so they stop actively occupying (V)RAM.
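Here's a minimal sketch of that warm-up step, assuming a hypothetical model path (`~/models/devstral.gguf`). Reading the file sequentially and discarding the bytes is enough to pull its pages into the kernel page cache, so the next real load is served from RAM:

```python
import pathlib

# Hypothetical path; any large model file works the same way.
MODEL_PATH = pathlib.Path.home() / "models" / "devstral.gguf"
CHUNK = 16 * 1024 * 1024  # read in 16 MiB chunks

with open(MODEL_PATH, "rb") as f:
    # Sequentially reading the file pulls its pages into the kernel
    # page cache; the bytes themselves are thrown away, only the
    # caching side effect matters.
    while f.read(CHUNK):
        pass
```

Tools like `vmtouch` do the same thing (and can report how much of a file is already cached), but a plain sequential read is all the warming actually requires.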
I wonder if we'll see OS level services/daemons to try and lower the time to first token as these things get used more. And the interface for application developers will be a simple system prompt.
In some ways the idea sounds nice, but there would be real downsides:
- Memory eaten up by potentially unused models
- Less compute available to software running specialized models for specific tasks