Comment by kgeist - Hacker Neue

kgeist May 28, 2025

I've found that Linux already does this out of the box: once you've loaded a model, subsequent loads are much faster because the kernel keeps the file's pages in the page cache in RAM (assuming you have enough free RAM). If you switch to another model and/or run low on RAM, the kernel automatically evicts some of those cached pages to make room. So all you need to do is read the most-used model(s) once on startup to warm the page cache, and add a TTL that unloads idle models so they stop actively occupying (V)RAM.
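To make that concrete, here's a rough sketch of both halves in Python. It isn't taken from any particular inference server: `warm_page_cache`, `TTLModelCache`, and the `loader` callback are hypothetical names I made up for illustration, and the "model" is whatever object your own loader returns. The warm-up just reads the file and throws the bytes away, relying on the kernel to keep the pages cached; the TTL part only drops the in-process reference, so the file's pages can stay in the page cache and make the next load fast.

```python
import time


def warm_page_cache(path, chunk_size=1 << 20):
    """Read the file sequentially and discard the data.

    The point is the side effect: the kernel pulls the file's pages into
    the page cache. Returns the number of bytes read.
    """
    total = 0
    # buffering=0 avoids an extra user-space buffer; we never keep the data.
    with open(path, "rb", buffering=0) as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            total += len(chunk)
    return total


class TTLModelCache:
    """Keeps loaded models in memory and drops the ones idle for too long.

    `loader` is a hypothetical callback (path -> model object); `clock` is
    injectable so the expiry logic can be exercised without sleeping.
    """

    def __init__(self, loader, ttl_seconds, clock=time.monotonic):
        self._loader = loader
        self._ttl = ttl_seconds
        self._clock = clock
        self._models = {}  # path -> (model, last_used_timestamp)

    def get(self, path):
        """Return the model for `path`, loading it on a miss, and mark it used."""
        entry = self._models.get(path)
        model = entry[0] if entry else self._loader(path)
        self._models[path] = (model, self._clock())
        return model

    def evict_expired(self):
        """Drop models idle for at least the TTL; return the evicted paths."""
        now = self._clock()
        expired = [
            path
            for path, (_, last_used) in self._models.items()
            if now - last_used >= self._ttl
        ]
        for path in expired:
            # Dropping the reference frees process memory (and VRAM, if the
            # model object owns it); the page cache is left to the kernel.
            del self._models[path]
        return expired
```

In practice you'd call `warm_page_cache` for your most-used model files at startup and run `evict_expired()` from a periodic timer or background thread.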
