Profile: p1esk - Hacker Neue

p1esk

Joined May 26, 2012 6,330 karma

hncomments5@gmail.com

p1esk 2 days ago parent

perhaps because you are interested in optimizations or distillation or something
Yes, my job is model compression: quantization, pruning, factorization, ops fusion/approximation/caching, in the context of hw/sw codesign.
In general, I agree with you that simple intuitions often break down in DL - I observed it many times. I also agree that we don't have good understanding how these systems work. Hopefully this situation is more like pre-Newtonian physics, and Newtons are coming.
p1esk 2 days ago parent

No. Each projection is ~5% of total FLOPs/params. Not enough model capacity change to care. From what I remember, removing one of them was worse than other two, I think it was Q. But in all three cases, degradation (in both loss and perplexity) was significant.
p1esk 2 days ago parent

1. The two papers you linked are about importance of attention weights, not QKV projections. This is orthogonal to our discussion.
2. I don't see how the transformations done in one attention block can be reversed in the next block (or in the FFN network immediately after the first block): can you please explain?
3. All state of the art open source LLMs (DeepSeek, Qwen, Kimi, etc) still use all three QKV projections, and largely the same original attention algorithm with some efficiency tweaks (grouped query, MLA, etc) which are done strictly to make the models faster/lighter, not smarter.
4. When GPT2 came out, I myself tried to remove various ops from attention blocks, and evaluated the impact. Among other things I tried removing individual projections (using unmodified input vectors instead), and in all three cases I observed quality degradation (when training from scratch).
5. The terms "sensitivity", "visibility", and "important" all attempt to describe feature importance when performing pattern matching. I use these terms in the same sense as importance of features matched by convolutional layer kernels, which scan the input image and match patterns.
p1esk 2 days ago parent

You should compare the number of top AI scientists each company has. I think those numbers are comparable (I’m guessing each has a couple of dozen). Also how attractive each company is to the best young researchers.
p1esk 2 days ago parent

I glanced at these links and it seems that all these attention variants still use QKV projections.
Do you see any issues with my interpretation of them?
p1esk 2 days ago parent

The way I think about QKV projections: Q defines sensitivity of token i features when computing similarity of this token to all other tokens. K defines visibility of token j features when it’s selected by all other tokens. V defines what features are important when doing weighted sum of all tokens.
p1esk 3 days ago parent

But that's the problem - if you don't like doing anything, what will you do? What will you fill your life with? You will quickly get bored of anything you try. Your life will have no meaning, and you will probably turn to alcohol or drugs.
p1esk 3 days ago parent

Financial freedom is about not having to worry about losing your job, or tolerating shitty work conditions. Why would you retire if you do what you love? I think the real problem might be if there's nothing you actually love doing (long term), that's when money won't help.
p1esk Jan 4, 2026 parent

Almost nobody else in engineering did this.
What you described is the job of a product manager. Are there no PMs at Google?
p1esk Jan 4, 2026 parent

Strange question. If you don’t know why you need this, you probably don’t. It will be the same as with the introductory AI course you did 20 years ago.
p1esk Jan 4, 2026 parent

I love to sit alone in a cafe - reading. Before smartphones I was reading newspapers or books. Now I read on my phone or tablet. While there, I don’t want to talk to anyone, I just want to sit and read quietly.
p1esk Jan 2, 2026 parent

What do you mean?
p1esk Jan 2, 2026 parent

It would need an educated populace
How do you measure that?
p1esk Jan 1, 2026 parent

If someone is interested in low level tensor implementation details they could benefit from a course/book “let’s build numpy in C”. No need to complicate DL library design discussion with that stuff.
p1esk Jan 1, 2026 parent

Wanting power usually has little to do with making the world better.
p1esk Jan 1, 2026 parent

Sure, I get it - trying to understand a specific condition affecting someone close to you. I personally have very little trust in doctors.
But, outside of this need, what actionable science have you learned and applied to your own life?
p1esk Jan 1, 2026 parent

My comment is also directed towards OP: you don’t need small talk to make friends.
p1esk Jan 1, 2026 parent

OK, I just read the abstract and conclusion of the NAC paper posted above. But then I saw a comment from Aurornis saying it’s not that good. Not sure who I should listen to.
p1esk Jan 1, 2026 parent

Would regular engineers like us understand molecular biology papers?
p1esk Dec 31, 2025 parent

Is Estonia a good country to immigrate to for an American?
p1esk Dec 31, 2025 parent

If something is not clear in the book, can I ask it to explain it to me?
p1esk Dec 31, 2025 parent

This doesn’t make sense to me. Why would you want something like this? What is it exactly that you expect from such a finetuned model that you cannot get from a frontier general purpose model?
p1esk Dec 31, 2025 parent

I hate small talk. Never needed it to make friends.
p1esk Dec 27, 2025 parent

Why limit to 2 agents? I typically use all 3.
p1esk Dec 27, 2025 parent

Please elaborate
p1esk Dec 26, 2025 parent

No China or Russia? Strange…
p1esk Dec 25, 2025 parent

they spending $20 billion dollars to handicap an inference company
Inference hardware company
p1esk Dec 23, 2025 parent

The model might get loaded on every token - from GPU memory to GPU. This depends on how much of it is cached on GPU. Inputs to every layer must be loaded as well. Also, if your model doesn’t fit in GPU memory but fits in CPU memory, and you’re doing GPU offloading, then you’re also shuffling between CPU and GPU memory.
p1esk Dec 23, 2025 parent

Depends on map_location arg in torch.load: might be loaded straight to GPU memory
p1esk Dec 23, 2025 parent

We’ve had “compute in flash” for a few years now: https://mythic.ai/product/

This user hasn’t submitted anything.

Preferences

Keyboard Shortcuts

Story Lists

Navigation

Miscellaneous