Corporate security hates WebSockets, though; SSE is much easier for end users to get approved.
I think it would be even more wasteful to continue inference in the background for nothing if the user decided to leave without pressing the stop button. Saving the partial answer at the exact moment the client disappears would be better.
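To make that concrete, here is a rough sketch of "stop and save on disconnect", assuming the server finds out the client is gone because the next write fails. Everything here is made up for illustration: `fake_model` stands in for real inference, `saved` for whatever store you'd actually use, and `make_flaky_send` just simulates a client that drops after a few tokens.

```python
import asyncio

saved = {}  # hypothetical store for answers, keyed by request id


async def fake_model():
    # Stand-in for token-by-token inference.
    for tok in ["The ", "answer ", "is ", "42", "."]:
        await asyncio.sleep(0.01)
        yield tok


async def stream_answer(request_id, send):
    """Stream tokens; on disconnect, stop generating and keep the partial text."""
    parts = []
    try:
        async for tok in fake_model():
            parts.append(tok)
            await send(tok)  # assumed to raise ConnectionError once the client is gone
    except ConnectionError:
        # Client disappeared: no more tokens are requested from the model.
        saved[request_id] = "".join(parts)
        return False
    saved[request_id] = "".join(parts)
    return True


def make_flaky_send(fail_after):
    # Simulated client that accepts `fail_after` tokens and then vanishes.
    sent = []

    async def send(tok):
        if len(sent) >= fail_after:
            raise ConnectionError("client went away")
        sent.append(tok)

    return send, sent


if __name__ == "__main__":
    send, sent = make_flaky_send(2)
    finished = asyncio.run(stream_answer("r1", send))
    print(finished, repr(saved["r1"]))
```

Note the saved text includes the token whose write failed, since it had already been generated by the time the disconnect was noticed.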
What if I want to have the agent go off and work on something for a while and I'll check back tomorrow?
It's yet another system that needs some DRAM, though. The good news is that you can auto-expire the queued-up responses pretty fast :shrug:
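Something like this is what I have in mind for the auto-expiry, as a minimal in-memory sketch. The class name, the TTL value, and the injectable clock are all my own assumptions; a real deployment would more likely lean on a cache that already supports per-key expiry.

```python
import time


class ExpiringBuffer:
    """Holds finished responses briefly so a reconnecting client can fetch them."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock  # injectable so expiry can be exercised without sleeping
        self._items = {}    # request_id -> (expires_at, text)

    def put(self, request_id, text):
        self._items[request_id] = (self.clock() + self.ttl, text)

    def pop(self, request_id):
        # Drop everything that has expired, then hand back the entry if it survived.
        self._evict()
        entry = self._items.pop(request_id, None)
        return entry[1] if entry else None

    def _evict(self):
        now = self.clock()
        expired = [k for k, (expires_at, _) in self._items.items() if expires_at <= now]
        for key in expired:
            del self._items[key]

    def __len__(self):
        return len(self._items)
```

Eviction here only runs on `pop`, so memory is reclaimed lazily; a busy service would probably also want a periodic sweep.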
No idea if it's worth it, though. Someone with access to the statistics surrounding dropped connections/repeated prompts at a big LLM service provider would need to do some math.