- get a Mac Mini or Mac Studio
- just run ollama serve
- run ollama-webui in Docker
- add some coding assistant model from ollamahub with the web UI
- upload your documents in the web UI
No code needed: you have your self-hosted LLM with basic RAG giving you answers with your documents in context. For us, the DeepSeek Coder 33B model is fast enough on a Mac Studio with 64GB RAM and can give pretty good suggestions based on our internal coding documentation.
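For anyone who wants the concrete commands, a rough sketch (model tag, image name, and ports are from memory rather than verified, so check the respective READMEs):

    # start the Ollama server (install via the macOS app or Homebrew)
    ollama serve

    # pull a coding-assistant model, e.g. DeepSeek Coder 33B
    ollama pull deepseek-coder:33b

    # run the web UI in Docker and point it at the local Ollama server
    # (image name, tag, and ports are from memory -- check the project's README)
    docker run -d -p 3000:8080 \
      --add-host=host.docker.internal:host-gateway \
      -v ollama-webui:/app/backend/data \
      --name ollama-webui \
      ghcr.io/ollama-webui/ollama-webui:main

    # then open http://localhost:3000, pick the model, and upload documents there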
You can mix models in a single Modelfile; it's a feature I've been experimenting with lately.
Note: you don't have to rely on their model library; you can use your own. Secondly, support for new models comes through their bindings to llama.cpp.
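For example, importing your own GGUF is just a Modelfile whose FROM points at a local file; a minimal sketch (the file path and model name here are made up):

    # Modelfile pointing at a local GGUF instead of a library model
    cat > Modelfile <<'EOF'
    FROM ./my-finetune.Q4_K_M.gguf
    SYSTEM "You are a helpful coding assistant."
    EOF

    # register it under a name of your choosing, then run it
    ollama create my-finetune -f Modelfile
    ollama run my-finetune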
I still need GPT-4 for some tasks, but in daily usage it's replaced much of my ChatGPT usage, especially since I can import all of my ChatGPT chat history. Also curious to learn what people want to do with local AI.
Does anyone know why this would be?
'llama.cpp-based' generally seems like the norm.
Ollama is just really easy to set up & get going on macOS. Integral support like this means one less thing to wire up or worry about when using a local LLM as a drop-in replacement for OpenAI's remote API. Ollama also has a model library[1] you can browse & easily retrieve models from.
Another project, Ollama-webui[2] is a nice webui/frontend for local LLM models in Ollama - it supports the latest LLaVA for multimodal image/prompt input, too.
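Under the hood it's just a local REST server; a minimal example against the default port 11434 (assuming the model has already been pulled):

    # Ollama listens on localhost:11434 by default
    curl http://localhost:11434/api/generate -d '{
      "model": "llama2",
      "prompt": "Why is the sky blue?",
      "stream": false
    }'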
It's also possible to connect to the OpenAI API and use GPT-4 on a per-token plan; I've since cancelled my ChatGPT subscription. But 90% of the usage for me is Mistral 7B fine-tunes, I rarely use OpenAI.
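The per-token route is just the standard OpenAI chat completions endpoint with an API key; roughly:

    # pay-per-token call to the OpenAI API (needs an API key; pick whatever GPT-4 variant you have access to)
    curl https://api.openai.com/v1/chat/completions \
      -H "Content-Type: application/json" \
      -H "Authorization: Bearer $OPENAI_API_KEY" \
      -d '{
        "model": "gpt-4",
        "messages": [{"role": "user", "content": "Hello"}]
      }'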
re: cancelling ChatGPT subscription: I am tempted to do this also except I suspect that when they release GPT-5 there may be a waiting list, and I don’t want any delays in trying it out.
No special flags or anything, just the standard format. Do take care with the spaces and line endings. I'm sharing a gist of the function I use for formatting it: https://gist.github.com/theskcd/a3948d4062ed8d3e697121cabd65... (hope this helps!)
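(For anyone else doing this by hand: the standard Mistral instruct template looks roughly like the line below, and the spaces around the [INST] tags are exactly the part that's easy to get wrong. This is just the general template, not the gist's code.)

    <s>[INST] first user message [/INST] model answer</s>[INST] follow-up message [/INST]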
Straight up giving it large files tends to degrade performance, so you need to do some reranking on the snippets before sending them over.
I just played around with this tool and it works as advertised, which is cool, but I'm already up and running. (For anyone reading this who, like me, doesn't want to learn all the optimization work... I'd just see which one is faster on your machine.)
They separate serving the heavy weights from model definition and usage itself.
What that means is that the weights of some model, let's say Mixtral, are loaded in the server process (and kept in memory for 5 minutes by default), and you interact with them through a Modelfile (inspired by the Dockerfile). All your Modelfiles that inherit FROM mixtral will reuse the weights already loaded in memory, so you can instantly swap between different system prompts etc.; those appear as normal models to use through the CLI or UI.
The effect is that you get very low latency and a very good interface, both for the programming API and the UI.
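A rough sketch of what that looks like in practice (the model choice, name, and prompts are made up for illustration):

    # a Modelfile that layers a different system prompt on top of mixtral
    cat > Modelfile.reviewer <<'EOF'
    FROM mixtral
    SYSTEM "You are a strict code reviewer. Point out bugs and style issues."
    PARAMETER temperature 0.2
    EOF

    ollama create reviewer -f Modelfile.reviewer

    # swapping between such models is near-instant: the mixtral weights already
    # loaded in the server process are reused, only the small Modelfile layer differs
    ollama run reviewer "Review this function: ..."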
PS: it's not only for Macs.
Open-weight models + llama.cpp (as Ollama) + ollama-webui = the real "open" AI.
> A few pip install X’s and you’re off to the races with Llama 2! Well, maybe you are; my dev machine doesn’t have the resources to get a response from even the smallest model in less than an hour.
I never tried to run these LLMs on my own machine -- is it this bad?
I guess if I only have a moderate GPU, say a 4060 Ti, there is no chance I can play with it, then?
I also have the 16GB version, which I assume would be a little bit better.
Unfortunately, having tried this and a bunch of other models, they are all toys compared to GPT-4.
1: Quite literally hours ago: https://euri.ca/blog/2024-llm-self-hosting-is-easy-now/