
bick_nyers
829 karma

  1. Or merge the bottom 1/8 (or whatever) experts together and (optionally) do some minimal training with all other weights frozen. Would need to modify the MoE routers slightly to map old -> new expert indices so you don't need to retrain the routers.
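
    A minimal sketch of that index-remapping idea in PyTorch, treating each expert as a single stacked weight tensor for simplicity; the expert count, the choice of which experts to merge, and plain averaging are all assumptions for illustration:

    ```python
    import torch

    num_old_experts = 256                  # assumption: original expert count
    merge_ids = torch.tensor([3, 17, 42])  # assumption: low-usage experts picked by some stat

    # Old -> new index map: merged experts all collapse onto one surviving slot,
    # then indices are renumbered to be contiguous.
    expert_map = torch.arange(num_old_experts)
    expert_map[merge_ids] = merge_ids[0]
    _, expert_map = torch.unique(expert_map, return_inverse=True)

    def merge_experts(weights):
        # weights: [num_old_experts, d_out, d_in]; merged expert = plain average.
        merged = weights[merge_ids].mean(dim=0)
        keep = torch.ones(num_old_experts, dtype=torch.bool)
        keep[merge_ids[1:]] = False
        new_weights = weights[keep].clone()
        new_weights[expert_map[merge_ids[0]]] = merged
        return new_weights

    def route(router_logits, top_k=8):
        # The router still scores the old expert set; only the selected
        # indices get remapped, so the router itself needs no retraining.
        scores, old_ids = router_logits.topk(top_k, dim=-1)
        return scores.softmax(dim=-1), expert_map[old_ids]
    ```
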
  2. The general rule of thumb when assessing MoE <-> dense model intelligence is SQRT(Total_Params * Active_Params). For Deepseek, you end up with ~158B params. The economics of batch inferencing a dense ~158B model at scale are different from something like Deepseek (it is ~4x more FLOPS per token, after all), particularly if users care about latency.
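
    For concreteness, plugging in Deepseek V3/R1's public sizes (assumed here: ~671B total, ~37B active):

    ```python
    import math

    total_params = 671e9    # Deepseek V3/R1 total parameters
    active_params = 37e9    # parameters activated per token

    dense_equiv = math.sqrt(total_params * active_params)
    print(f"{dense_equiv / 1e9:.0f}B")                  # ~158B dense-equivalent
    print(f"{dense_equiv / active_params:.1f}x FLOPS")  # ~4.3x the per-token compute
    ```
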
  3. There's still a lot of opportunity for software optimizations here. The trouble is that really only two classes of systems get optimizations for Deepseek: 1 small GPU + a lot of RAM (ktransformers), and the system that has all the VRAM in the world.

    A system with, say, 192GB of VRAM and the rest standard memory (DGX Station, 2x RTX Pro 6000, 4x B60 Dual, etc.) could still in theory run Deepseek @4bit quite quickly because of the power-law-type usage of the experts.

    If you aren't prompting Deepseek in Chinese, a lot of the experts don't activate.

    This would be an easier job for pruning, but I still think enthusiast systems are going to trend, over the next couple of years, in a direction that makes these types of software optimizations useful on a much larger scale.

    There's a user on Reddit with a 16x 3090 system (PCIe 3.0 x4 interconnect, which doesn't seem to be running at full bandwidth during tensor parallelism) that gets 7 tokens/s in llama.cpp. A single 3090 has enough VRAM bandwidth to scan over its 24GB of memory 39 times per second, so there's something else limiting performance.
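
    Back-of-the-envelope version of that bandwidth argument (assuming ~936 GB/s per 3090, a layer-wise split where only one GPU streams weights at a time, and ~37B active parameters at 4 bits):

    ```python
    bw = 936e9                 # bytes/s of VRAM bandwidth per 3090
    vram = 24e9
    print(bw / vram)           # ~39 full scans of 24GB per second

    active_bytes = 37e9 * 0.5  # 4-bit weights ~= 0.5 bytes each, ~18.5GB read per token
    print(bw / active_bytes)   # ~50 tokens/s upper bound, far above the observed 7
    ```
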

  4. I've been using PyCharm for the debugger (and everything else), plus VSCode + RooCode + a local LLM lately.

    I've heard decent things about the Windsurf extension in PyCharm, but not being able to use a local LLM is an absolute non-starter for me.

  5. MoE inference wouldn't be terrible. That being said, there's not a good MoE model in the 70-160B range as far as I'm aware.
  6. If you want to split tensor-wise, yes. Layer-wise splits could go over Ethernet.

    I would be interested to see how feasible hybrid approaches would be, e.g. connect each pair up directly via ConnectX and then connect the sets together via Ethernet.
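
    A rough sketch of what that rank layout could look like with a PyTorch device mesh, assuming 8 GPUs arranged as 4 ConnectX-linked pairs and a torchrun launch; the shape and dimension names are made up for illustration:

    ```python
    import torch.distributed as dist
    from torch.distributed.device_mesh import init_device_mesh

    dist.init_process_group("nccl")

    # 4 pipeline stages x 2-way tensor parallel = 8 ranks.
    mesh = init_device_mesh("cuda", (4, 2), mesh_dim_names=("pp", "tp"))

    tp_group = mesh["tp"].get_group()  # heavy traffic, keep on the ConnectX links
    pp_group = mesh["pp"].get_group()  # only activations cross stages, Ethernet can cope
    ```
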

  7. About $12k when Project Digits comes out.
  8. Just to add onto this point, you expect different experts to be activated for every token, so not having all of the weights in fast memory can still be quite slow, since you need to load/unload them on every token.
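
    Rough numbers for why that hurts, assuming Deepseek-scale activation (~37B params at 4 bits), a PCIe 4.0 x16 link (~32 GB/s), and the worst case where none of the needed experts are already resident:

    ```python
    active_bytes = 37e9 * 0.5      # ~18.5GB of expert weights touched per token
    pcie_bw = 32e9                 # host -> GPU copy bandwidth
    vram_bw = 936e9                # e.g. an RTX 3090 reading resident weights

    print(active_bytes / pcie_bw)  # ~0.6 s/token if experts must be copied in
    print(active_bytes / vram_bw)  # ~0.02 s/token if they already sit in VRAM
    ```
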
  9. It's not really possible to say what's "best" because the criteria are super subjective.

    I personally like the Spline family, and I default to Spline36 for both upscaling and downscaling in ffmpeg. Most people can't tell the difference between Spline36 and Lanczos3. If you want more sharpness, go for Spline64; for less sharpness, try Spline16.

    Edit: As far as I'm aware, though, OpenCV doesn't have Spline as an option for resizing.
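
    For reference, a minimal sketch of both sides of that, assuming an ffmpeg build with the zscale (libzimg) filter; file names and sizes are placeholders:

    ```python
    import subprocess
    import cv2

    # Spline36 downscale via ffmpeg's zscale filter.
    subprocess.run([
        "ffmpeg", "-i", "in.mp4",
        "-vf", "zscale=w=1280:h=720:filter=spline36",
        "out.mp4",
    ])

    # OpenCV has no spline kernels; Lanczos is the closest built-in.
    img = cv2.imread("frame.png")
    small = cv2.resize(img, (1280, 720), interpolation=cv2.INTER_LANCZOS4)
    ```
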

  10. It would not be that slow as it is an MoE model with 37b activated parameters.

    Still, 8x3090 gives you ~2.25 bits per weight, which is not a healthy quantization. Doing bifurcation to get up to 16x3090 would be necessary for lightning fast inference with 4bit quants.

    At that point though it becomes very hard to build a system due to PCIE lanes, signal integrity, the volume of space you require, the heat generated, and the power requirements.

    This is the advantage of moving up to Quadro cards: half the power for 2-4x the VRAM (the top-end Blackwell Quadro is expected to be 96GB).
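
    The quantization arithmetic behind the 8x vs. 16x point, assuming ~671B total parameters and ignoring KV cache and activation overhead:

    ```python
    total_params = 671e9
    for n_gpus in (8, 16):
        vram_bytes = n_gpus * 24e9
        print(n_gpus, vram_bytes * 8 / total_params)  # ~2.3 and ~4.6 bits per weight
    ```
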

  11. It will be slower for a dense 70b model, since Deepseek is an MoE that only activates 37b at a time. That's what makes CPU inference remotely feasible here.
  12. An actual hardcore, technical AI "psychology" program would be really cool. Could be a good onboarding for prompt engineering (if it still exists in 5 years).
  13. I definitely agree with you in the interim regarding junior developers. However, I do think we will eventually have the AI coding equivalent of CI/CD built into, perhaps, our IDE. Basically, when an AI generates some code to implement something, you chain out more AI queries to test it, modify it, check it for security vulnerabilities, etc.

    Now, the first response some folks may have is, how can you trust that the AI is good at security? Well, in this example, it only needs to be better than the junior developers at security to provide them with benefits/learning opportunities. We need to remember that the junior developers of today can also just as easily write insecure code.

  14. Check out their Project Digits announcement: 128GB of unified memory with InfiniBand capabilities for $3k.

    For more of the fast VRAM you would be in Quadro territory.

  15. I suspect the big AI companies try to adversarially train that out as it could be used to "jailbreak" their AI.

    I wonder though, what would be considered a meaningful punishment/reward to an AI agent? More/less training compute? Web search rate limits? That assumes that what the AI "wants" is to increase its own intelligence.

  16. I wonder if you would want to use an earlier layer as opposed to the penultimate layer; I would imagine that the LLM uses that layer to "prepare" for the final dimensionality reduction, cleaning the signal so that it scores well on the loss function.
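
    If someone wanted to test that, Hugging Face transformers exposes every layer's hidden states; a quick sketch (the model name and the choice of layer -4 are arbitrary placeholders):

    ```python
    import torch
    from transformers import AutoModel, AutoTokenizer

    name = "meta-llama/Llama-3.1-8B"   # placeholder model
    tok = AutoTokenizer.from_pretrained(name)
    model = AutoModel.from_pretrained(name)

    inputs = tok("some text to embed", return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)

    # hidden_states[0] is the embedding output, [-1] the final layer;
    # mean-pool an earlier layer instead of the penultimate one.
    early = out.hidden_states[-4].mean(dim=1)
    ```
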
  17. I kinda wish you could just take a course on a specific distribution. Like, here's the Poisson class where you learn all of its interesting properties and apply it to e.g. queuing problems.
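
    The sort of exercise such a course might end with, using scipy (the rates are made up):

    ```python
    from scipy.stats import poisson

    lam = 4.0                          # average arrivals per minute
    print(poisson.pmf(6, mu=lam))      # P(exactly 6 arrivals in a minute)
    print(1 - poisson.cdf(9, mu=lam))  # P(10 or more arrivals)

    # M/M/1 queue fed by those arrivals: expected number in system is rho / (1 - rho).
    mu = 5.0                           # service rate (must exceed lam)
    rho = lam / mu
    print(rho / (1 - rho))             # average customers in the system
    ```
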
  18. If you are comfortable with purchasing used hardware, used 3090s are great value; they can be had for roughly a third of the price of a new 4090.

    How many GPUs you need is completely dependent on the size of your team, their frequency of usage, and the size of the models you are comfortable with.

    I generally recommend you rent instances on something like runpod to build out a good estimate of your actual usage before committing a bunch of money to hardware.

  19. I would probably refer to category 1 as "Open Architecture". I wouldn't want to give anyone the false impression that category 1 is comparable in the slightest to Open Weights, which is vastly more useful.
  20. You could always split one of the experts up across multiple GPUs. I tend to agree with your sentiment; I think researchers in this space tend not to optimize that well for inference deployment scenarios. To be fair, there are a lot of different ways to deploy something, and a lot of quantization techniques and parameters.
