- get a Mac Mini or Mac Studio
- just run ollama serve
- run ollama-webui in Docker
- add some coding assistant model from ollamahub with the web UI
- upload your documents in the web UI
No code needed: you have your self-hosted LLM with basic RAG giving you answers with your documents in context. For us, the DeepSeek Coder 33B model is fast enough on a Mac Studio with 64GB RAM and can give pretty good suggestions based on our internal coding documentation.
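For anyone who wants the concrete commands, a rough sketch (model tag, image name, and ports are from memory rather than verified, so check the respective READMEs):

    # start the Ollama server (install via the macOS app or Homebrew)
    ollama serve

    # pull a coding-assistant model, e.g. DeepSeek Coder 33B
    ollama pull deepseek-coder:33b

    # run the web UI in Docker and point it at the local Ollama server
    # (image name, tag, and ports are from memory -- check the project's README)
    docker run -d -p 3000:8080 \
      --add-host=host.docker.internal:host-gateway \
      -v ollama-webui:/app/backend/data \
      --name ollama-webui \
      ghcr.io/ollama-webui/ollama-webui:main

    # then open http://localhost:3000, pick the model, and upload documents there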
You can mix models in a single Modelfile; it's a feature I've been experimenting with lately.
Note: you don't have to rely on their model library; you can use your own. Secondly, support for new models comes through their bindings to llama.cpp.
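For example, importing your own GGUF is just a Modelfile whose FROM points at a local file; a minimal sketch (the file path and model name here are made up):

    # Modelfile pointing at a local GGUF instead of a library model
    cat > Modelfile <<'EOF'
    FROM ./my-finetune.Q4_K_M.gguf
    SYSTEM "You are a helpful coding assistant."
    EOF

    # register it under a name of your choosing, then run it
    ollama create my-finetune -f Modelfile
    ollama run my-finetune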
I still need GPT-4 for some tasks, but in daily usage it's replaced much of my ChatGPT usage, especially since I can import all of my ChatGPT chat history. Also curious to learn what people want to do with local AI.
Does anyone know why this would be?
'llama.cpp-based' generally seems like the norm.
Ollama is just really easy to set up & get going on macOS. Integral support like this means one less thing to wire up or worry about when using a local LLM as a drop-in replacement for OpenAI's remote API. Ollama also has a model library[1] you can browse & easily retrieve models from.
Another project, Ollama-webui[2] is a nice webui/frontend for local LLM models in Ollama - it supports the latest LLaVA for multimodal image/prompt input, too.
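Under the hood it's just a local REST server; a minimal example against the default port 11434 (assuming the model has already been pulled):

    # Ollama listens on localhost:11434 by default
    curl http://localhost:11434/api/generate -d '{
      "model": "llama2",
      "prompt": "Why is the sky blue?",
      "stream": false
    }'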
It's also possible to connect to the OpenAI API and use GPT-4 on a per-token plan; I've since cancelled my ChatGPT subscription. But 90% of the usage for me is Mistral 7B fine-tunes, I rarely use OpenAI.
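The per-token route is just the standard OpenAI chat completions endpoint with an API key; roughly:

    # pay-per-token call to the OpenAI API (needs an API key; pick whatever GPT-4 variant you have access to)
    curl https://api.openai.com/v1/chat/completions \
      -H "Content-Type: application/json" \
      -H "Authorization: Bearer $OPENAI_API_KEY" \
      -d '{
        "model": "gpt-4",
        "messages": [{"role": "user", "content": "Hello"}]
      }'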
re: cancelling ChatGPT subscription: I am tempted to do this also except I suspect that when they release GPT-5 there may be a waiting list, and I don’t want any delays in trying it out.
No special flags or anything, just the standard format. Do take care with the spaces and line endings. I'm sharing a gist of the function I use for formatting it: https://gist.github.com/theskcd/a3948d4062ed8d3e697121cabd65... (hope this helps!)
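(For anyone else doing this by hand: the standard Mistral instruct template looks roughly like the line below, and the spaces around the [INST] tags are exactly the part that's easy to get wrong. This is just the general template, not the gist's code.)

    <s>[INST] first user message [/INST] model answer</s>[INST] follow-up message [/INST]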
Straight up giving it large files tends to degrade performance, so you need to do some reranking on the snippets before sending them over.
I just played around with this tool and it works as advertised, which is cool, but I'm already up and running. (For anyone reading this who, like me, doesn't want to learn all the optimization work... I'd just see which one is faster on your machine.)
They separate serving the heavy weights from model definition and usage itself.
What that means is that the weights of some model, let's say Mixtral, are loaded in the server process (and kept in memory for 5 minutes by default), and you interact with them through a Modelfile (inspired by the Dockerfile). All your Modelfiles that inherit FROM mixtral will reuse the weights already loaded in memory, so you can instantly swap between different system prompts etc.; those appear as normal models to use through the CLI or UI.
The effect is that you get very low latency and a very good interface, both for the programming API and the UI.
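A rough sketch of what that looks like in practice (the model choice, name, and prompts are made up for illustration):

    # a Modelfile that layers a different system prompt on top of mixtral
    cat > Modelfile.reviewer <<'EOF'
    FROM mixtral
    SYSTEM "You are a strict code reviewer. Point out bugs and style issues."
    PARAMETER temperature 0.2
    EOF

    ollama create reviewer -f Modelfile.reviewer

    # swapping between such models is near-instant: the mixtral weights already
    # loaded in the server process are reused, only the small Modelfile layer differs
    ollama run reviewer "Review this function: ..."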
PS: it's not only for Macs.
Open-weight models + llama.cpp (as Ollama) + ollama-webui = the real "open" AI.
> A few pip install X’s and you’re off to the races with Llama 2! Well, maybe you are; my dev machine doesn’t have the resources to get a response from even the smallest model in less than an hour.
I never tried to run these LLMs on my own machine -- is it this bad?
I guess if I only have a moderate GPU, say a 4060 Ti, there is no chance I can play with it, then?
I also have the 16GB version, which I assume would be a little bit better.
Unfortunately, having tried this and a bunch of other models, they are all toys compared to GPT-4.
1: Quite literally hours ago: https://euri.ca/blog/2024-llm-self-hosting-is-easy-now/