This is specifically why I use LM Studio and paid for 128 GB on my MacBook.
I keep Devstral (~15 GB) loaded in memory at all times, since I have so much extra.
I can't wait for a few years from now, when I can get triple the memory bandwidth with this much RAM.
I've found that Linux already does this out of the box: if you load a model once, subsequent loads are much faster because the OS caches the on-disk pages entirely in RAM (assuming you have enough RAM). If you switch to another model and/or run out of RAM, the OS will automatically evict some of those cached pages to make room. So all you need to do is read the most-used model(s) on startup to warm the disk cache, and add a TTL to unload models so they stop actively occupying (V)RAM.
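Here's a minimal sketch of that warm-up step, assuming a hypothetical model path (`~/models/devstral.gguf`). Reading the file sequentially and discarding the bytes is enough to pull its pages into the kernel page cache, so the next real load is served from RAM:

```python
import pathlib

# Hypothetical path; any large model file works the same way.
MODEL_PATH = pathlib.Path.home() / "models" / "devstral.gguf"
CHUNK = 16 * 1024 * 1024  # read in 16 MiB chunks

with open(MODEL_PATH, "rb") as f:
    # Sequentially reading the file pulls its pages into the kernel
    # page cache; the bytes themselves are thrown away, only the
    # caching side effect matters.
    while f.read(CHUNK):
        pass
```

Tools like `vmtouch` do the same thing (and can report how much of a file is already cached), but a plain sequential read is all the warming actually requires.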
I wonder if we'll see OS level services/daemons to try and lower the time to first token as these things get used more. And the interface for application developers will be a simple system prompt.
In some ways the idea sounds nice, but there would be real downsides:
- Memory eaten up by potentially unused models
- Less compute available to software running specialized models for specific tasks