
bick_nyers
829 karma

  1. Or merge the bottom 1/8 (or whatever) experts together and (optionally) do some minimal training with all other weights frozen. Would need to modify the MoE routers slightly to map old -> new expert indices so you don't need to retrain the routers.
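
    A minimal sketch of that index-remapping idea in PyTorch, treating each expert as a single stacked weight tensor for simplicity; the expert count, the choice of which experts to merge, and plain averaging are all assumptions for illustration:

    ```python
    import torch

    num_old_experts = 256                  # assumption: original expert count
    merge_ids = torch.tensor([3, 17, 42])  # assumption: low-usage experts picked by some stat

    # Old -> new index map: merged experts all collapse onto one surviving slot,
    # then indices are renumbered to be contiguous.
    expert_map = torch.arange(num_old_experts)
    expert_map[merge_ids] = merge_ids[0]
    _, expert_map = torch.unique(expert_map, return_inverse=True)

    def merge_experts(weights):
        # weights: [num_old_experts, d_out, d_in]; merged expert = plain average.
        merged = weights[merge_ids].mean(dim=0)
        keep = torch.ones(num_old_experts, dtype=torch.bool)
        keep[merge_ids[1:]] = False
        new_weights = weights[keep].clone()
        new_weights[expert_map[merge_ids[0]]] = merged
        return new_weights

    def route(router_logits, top_k=8):
        # The router still scores the old expert set; only the selected
        # indices get remapped, so the router itself needs no retraining.
        scores, old_ids = router_logits.topk(top_k, dim=-1)
        return scores.softmax(dim=-1), expert_map[old_ids]
    ```
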
  2. The general rule of thumb when assessing MoE <-> dense model intelligence is SQRT(Total_Params * Active_Params). For Deepseek, you end up with ~158B params. The economics of batch inferencing a dense ~158B model at scale are different from something like Deepseek (it is ~4x more FLOPS per token, after all), particularly if users care about latency.
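
    For concreteness, plugging in Deepseek V3/R1's public sizes (assumed here: ~671B total, ~37B active):

    ```python
    import math

    total_params = 671e9    # Deepseek V3/R1 total parameters
    active_params = 37e9    # parameters activated per token

    dense_equiv = math.sqrt(total_params * active_params)
    print(f"{dense_equiv / 1e9:.0f}B")                  # ~158B dense-equivalent
    print(f"{dense_equiv / active_params:.1f}x FLOPS")  # ~4.3x the per-token compute
    ```
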
  3. There's still a lot of opportunity for software optimizations here. The trouble is that really only two classes of systems get optimizations for Deepseek: 1 small GPU + a lot of RAM (ktransformers), and the system that has all the VRAM in the world.

    A system with, say, 192GB of VRAM and the rest standard memory (DGX Station, 2x RTX Pro 6000, 4x B60 Dual, etc.) could still in theory run Deepseek @4bit quite quickly because of the power-law-type usage of the experts.

    If you aren't prompting Deepseek in Chinese, a lot of the experts don't activate.

    This would be an easier job for pruning, but I still think enthusiast systems are going to trend, over the next couple of years, in a direction that makes these types of software optimizations useful on a much larger scale.

    There's a user on Reddit with a 16x 3090 system (PCIe 3.0 x4 interconnect, which doesn't seem to be running at full bandwidth during tensor parallelism) that gets 7 tokens/s in llama.cpp. A single 3090 has enough VRAM bandwidth to scan over its 24GB of memory 39 times per second, so there's something else limiting performance.
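
    Back-of-the-envelope version of that bandwidth argument (assuming ~936 GB/s per 3090, a layer-wise split where only one GPU streams weights at a time, and ~37B active parameters at 4 bits):

    ```python
    bw = 936e9                 # bytes/s of VRAM bandwidth per 3090
    vram = 24e9
    print(bw / vram)           # ~39 full scans of 24GB per second

    active_bytes = 37e9 * 0.5  # 4-bit weights ~= 0.5 bytes each, ~18.5GB read per token
    print(bw / active_bytes)   # ~50 tokens/s upper bound, far above the observed 7
    ```
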

  4. I've been using PyCharm for the debugger (and everything else), plus VSCode + RooCode + a local LLM lately.

    I've heard decent things about the Windsurf extension in PyCharm, but not being able to use a local LLM is an absolute non-starter for me.

  5. MoE inference wouldn't be terrible. That being said, there's not a good MoE model in the 70-160B range as far as I'm aware.
  6. If you want to split tensor-wise, yes. Layer-wise splits could go over Ethernet.

    I would be interested to see how feasible hybrid approaches would be, e.g. connect each pair up directly via ConnectX and then connect the sets together via Ethernet.
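
    A rough sketch of what that rank layout could look like with a PyTorch device mesh, assuming 8 GPUs arranged as 4 ConnectX-linked pairs and a torchrun launch; the shape and dimension names are made up for illustration:

    ```python
    import torch.distributed as dist
    from torch.distributed.device_mesh import init_device_mesh

    dist.init_process_group("nccl")

    # 4 pipeline stages x 2-way tensor parallel = 8 ranks.
    mesh = init_device_mesh("cuda", (4, 2), mesh_dim_names=("pp", "tp"))

    tp_group = mesh["tp"].get_group()  # heavy traffic, keep on the ConnectX links
    pp_group = mesh["pp"].get_group()  # only activations cross stages, Ethernet can cope
    ```
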

  7. About $12k when Project Digits comes out.
  8. Just to add onto this point, you expect different experts to be activated for every token, so not having all of the weights in fast memory can still be quite slow, since you need to load/unload them on every token.
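
    Rough numbers for why that hurts, assuming Deepseek-scale activation (~37B params at 4 bits), a PCIe 4.0 x16 link (~32 GB/s), and the worst case where none of the needed experts are already resident:

    ```python
    active_bytes = 37e9 * 0.5      # ~18.5GB of expert weights touched per token
    pcie_bw = 32e9                 # host -> GPU copy bandwidth
    vram_bw = 936e9                # e.g. an RTX 3090 reading resident weights

    print(active_bytes / pcie_bw)  # ~0.6 s/token if experts must be copied in
    print(active_bytes / vram_bw)  # ~0.02 s/token if they already sit in VRAM
    ```
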
  9. It's not really possible to say what's "best" because the criteria are super subjective.

    I personally like the Spline family, and I default to Spline36 for both upscaling and downscaling in ffmpeg. Most people can't tell the difference between Spline36 and Lanczos3. If you want more sharpness, go for Spline64; for less sharpness, try Spline16.

    Edit: As far as I'm aware, though, OpenCV doesn't have Spline as an option for resizing.
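
    For reference, a minimal sketch of both sides of that, assuming an ffmpeg build with the zscale (libzimg) filter; file names and sizes are placeholders:

    ```python
    import subprocess
    import cv2

    # Spline36 downscale via ffmpeg's zscale filter.
    subprocess.run([
        "ffmpeg", "-i", "in.mp4",
        "-vf", "zscale=w=1280:h=720:filter=spline36",
        "out.mp4",
    ])

    # OpenCV has no spline kernels; Lanczos is the closest built-in.
    img = cv2.imread("frame.png")
    small = cv2.resize(img, (1280, 720), interpolation=cv2.INTER_LANCZOS4)
    ```
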

  10. It would not be that slow as it is an MoE model with 37b activated parameters.

    Still, 8x3090 gives you ~2.25 bits per weight, which is not a healthy quantization. Doing bifurcation to get up to 16x3090 would be necessary for lightning fast inference with 4bit quants.

    At that point though it becomes very hard to build a system due to PCIE lanes, signal integrity, the volume of space you require, the heat generated, and the power requirements.

    This is the advantage of moving up to Quadro cards: half the power for 2-4x the VRAM (the top-end Blackwell Quadro is expected to be 96GB).
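
    The quantization arithmetic behind the 8x vs. 16x point, assuming ~671B total parameters and ignoring KV cache and activation overhead:

    ```python
    total_params = 671e9
    for n_gpus in (8, 16):
        vram_bytes = n_gpus * 24e9
        print(n_gpus, vram_bytes * 8 / total_params)  # ~2.3 and ~4.6 bits per weight
    ```
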

  11. It will be slower for a dense 70b model, since Deepseek is an MoE that only activates 37b at a time. That's what makes CPU inference remotely feasible here.
  12. An actual hardcore, technical AI "psychology" program would be really cool. Could be a good onboarding for prompt engineering (if it still exists in 5 years).
  13. I definitely agree with you in the interim regarding junior developers. However, I do think we will eventually have the AI coding equivalent of CI/CD built into, perhaps, our IDE. Basically, when an AI generates some code to implement something, you chain out more AI queries to test it, modify it, check it for security vulnerabilities, etc.

    Now, the first response some folks may have is, how can you trust that the AI is good at security? Well, in this example, it only needs to be better than the junior developers at security to provide them with benefits/learning opportunities. We need to remember that the junior developers of today can also just as easily write insecure code.

  14. Check out their Project Digits announcement: 128GB of unified memory with InfiniBand capabilities for $3k.

    For more of the fast VRAM you would be in Quadro territory.

  15. I suspect the big AI companies try to adversarially train that out as it could be used to "jailbreak" their AI.

    I wonder though, what would be considered a meaningful punishment/reward to an AI agent? More/less training compute? Web search rate limits? That assumes that what the AI "wants" is to increase its own intelligence.

  16. I wonder if you would want to use an earlier layer as opposed to the penultimate layer; I would imagine that the LLM uses that layer to "prepare" for the final dimensionality reduction, cleaning the signal so that it scores well on the loss function.
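
    If someone wanted to test that, Hugging Face transformers exposes every layer's hidden states; a quick sketch (the model name and the choice of layer -4 are arbitrary placeholders):

    ```python
    import torch
    from transformers import AutoModel, AutoTokenizer

    name = "meta-llama/Llama-3.1-8B"   # placeholder model
    tok = AutoTokenizer.from_pretrained(name)
    model = AutoModel.from_pretrained(name)

    inputs = tok("some text to embed", return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)

    # hidden_states[0] is the embedding output, [-1] the final layer;
    # mean-pool an earlier layer instead of the penultimate one.
    early = out.hidden_states[-4].mean(dim=1)
    ```
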
  17. I kinda wish you could just take a course on a specific distribution. Like, here's the Poisson class where you learn all of its interesting properties and apply it to e.g. queuing problems.
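
    The sort of exercise such a course might end with, using scipy (the rates are made up):

    ```python
    from scipy.stats import poisson

    lam = 4.0                          # average arrivals per minute
    print(poisson.pmf(6, mu=lam))      # P(exactly 6 arrivals in a minute)
    print(1 - poisson.cdf(9, mu=lam))  # P(10 or more arrivals)

    # M/M/1 queue fed by those arrivals: expected number in system is rho / (1 - rho).
    mu = 5.0                           # service rate (must exceed lam)
    rho = lam / mu
    print(rho / (1 - rho))             # average customers in the system
    ```
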
  18. If you are comfortable with purchasing used hardware, used 3090s are great value; they can be had for roughly a third of the price of a new 4090.

    How many GPUs you need is completely dependent on the size of your team, their frequency of usage, and the size of the models you are comfortable with.

    I generally recommend you rent instances on something like runpod to build out a good estimate of your actual usage before committing a bunch of money to hardware.

  19. I would probably refer to category 1 as "Open Architecture". I wouldn't want to give anyone the false impression that category 1 is comparable in the slightest to Open Weights, which is vastly more useful.
  20. You could always split one of the experts up across multiple GPUs. I tend to agree with your sentiment; I think researchers in this space tend not to optimize that well for inference deployment scenarios. To be fair, there are a lot of different ways to deploy something, and a lot of quantization techniques and parameters.
