p1esk
Karma: 6,330
hncomments5@gmail.com

  1. perhaps because you are interested in optimizations or distillation or something

    Yes, my job is model compression: quantization, pruning, factorization, ops fusion/approximation/caching, in the context of hw/sw codesign.

    In general, I agree with you that simple intuitions often break down in DL - I've observed it many times. I also agree that we don't have a good understanding of how these systems work. Hopefully this situation is more like pre-Newtonian physics, and Newtons are coming.
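
    To make the quantization part concrete, here's a minimal sketch of per-tensor symmetric int8 weight quantization in PyTorch - my own toy illustration, not the actual production pipeline (real flows add calibration data, per-channel scales, quantization-aware finetuning, etc):

      import torch

      def quantize_int8(w: torch.Tensor):
          # Per-tensor symmetric quantization: map [-max|w|, max|w|] onto [-127, 127]
          scale = w.abs().max() / 127.0
          q = torch.clamp(torch.round(w / scale), -127, 127).to(torch.int8)
          return q, scale

      def dequantize(q: torch.Tensor, scale: torch.Tensor):
          return q.to(torch.float32) * scale

      w = torch.randn(768, 768)
      q, s = quantize_int8(w)
      print((w - dequantize(q, s)).abs().max())  # worst-case quantization error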

  2. No. Each projection is ~5% of total FLOPs/params - not enough of a capacity change to matter. From what I remember, removing one of them was worse than the other two - I think it was Q. But in all three cases the degradation (in both loss and perplexity) was significant.
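
    As a rough illustration of where ~5% can come from (my own arithmetic, assuming a GPT-2-small-sized model - d_model=768, 12 layers, ~124M parameters - not necessarily the exact model discussed):

      d_model, n_layers, total_params = 768, 12, 124e6  # GPT-2 small, incl. embeddings
      per_projection = n_layers * d_model * d_model     # one of Q/K/V across all layers
      print(per_projection / total_params)              # ~0.057, i.e. roughly 5-6%
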
  3. 1. The two papers you linked are about the importance of attention weights, not QKV projections. This is orthogonal to our discussion.

    2. I don't see how the transformations done in one attention block can be reversed in the next block (or in the FFN immediately after the first block): can you please explain?

    3. All state of the art open source LLMs (DeepSeek, Qwen, Kimi, etc) still use all three QKV projections, and largely the same original attention algorithm with some efficiency tweaks (grouped-query attention, MLA, etc), which are done strictly to make the models faster/lighter, not smarter.

    4. When GPT2 came out, I myself tried to remove various ops from attention blocks, and evaluated the impact. Among other things I tried removing individual projections (using unmodified input vectors instead), and in all three cases I observed quality degradation (when training from scratch).

    5. The terms "sensitivity", "visibility", and "importance" all attempt to describe feature importance when performing pattern matching. I use these terms in the same sense as the importance of features matched by convolutional layer kernels, which scan the input image and match patterns.
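
    To make point 4 concrete, here's a minimal single-head attention sketch (my own illustration, not the original experiment code) in which any of the Q/K/V projections can be removed by substituting the unmodified input vectors:

      import torch
      import torch.nn.functional as F

      def attention(x, w_q=None, w_k=None, w_v=None):
          # x: (seq_len, d). Passing None for a weight skips that projection
          # and uses the raw input vectors instead (the ablation in point 4).
          q = x @ w_q if w_q is not None else x
          k = x @ w_k if w_k is not None else x
          v = x @ w_v if w_v is not None else x
          scores = q @ k.T / k.shape[-1] ** 0.5
          return F.softmax(scores, dim=-1) @ v

      d, seq = 64, 16
      x = torch.randn(seq, d)
      w_q, w_k, w_v = (torch.randn(d, d) / d ** 0.5 for _ in range(3))
      full = attention(x, w_q, w_k, w_v)     # standard attention
      no_q = attention(x, None, w_k, w_v)    # "remove Q": queries are raw inputs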

  4. You should compare the number of top AI scientists each company has. I think those numbers are comparable (I'm guessing each has a couple of dozen). Also compare how attractive each company is to the best young researchers.
  5. I glanced at these links and it seems that all these attention variants still use QKV projections.

    Do you see any issues with my interpretation of them?

  6. The way I think about QKV projections: Q defines the sensitivity of token i's features when computing this token's similarity to all other tokens. K defines the visibility of token j's features when it's selected by all other tokens. V defines which features are important when taking the weighted sum of all tokens.
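
    For reference, this reading maps onto the standard scaled dot-product attention (W_Q produces the queries, W_K the keys, W_V the values):

      \mathrm{Attn}(X) = \mathrm{softmax}\!\left( \frac{(X W_Q)(X W_K)^\top}{\sqrt{d_k}} \right) X W_V
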
  7. But that's the problem - if you don't like doing anything, what will you do? What will you fill your life with? You will quickly get bored of anything you try. Your life will have no meaning, and you will probably turn to alcohol or drugs.
  8. Financial freedom is about not having to worry about losing your job, or tolerating shitty work conditions. Why would you retire if you do what you love? I think the real problem might be if there's nothing you actually love doing (long term), that's when money won't help.
  9. Almost nobody else in engineering did this.

    What you described is the job of a product manager. Are there no PMs at Google?

  10. Strange question. If you don’t know why you need this, you probably don’t. It will be the same as with the introductory AI course you did 20 years ago.
  11. I love to sit alone in a cafe - reading. Before smartphones I was reading newspapers or books. Now I read on my phone or tablet. While there, I don’t want to talk to anyone, I just want to sit and read quietly.
  12. What do you mean?
  13. It would need an educated populace

    How do you measure that?

  14. If someone is interested in low-level tensor implementation details, they could benefit from a course/book like "let's build numpy in C". No need to complicate a DL library design discussion with that stuff.
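
    As a flavor of what such a course would cover - sketched here in Python rather than C, purely as my own illustration - an ndarray is essentially a flat buffer plus shape and strides, and indexing is an offset computation:

      # Minimal strided-tensor sketch: a tensor is a flat buffer + shape + strides.
      class Tensor:
          def __init__(self, data, shape):
              self.data = list(data)  # flat storage
              self.shape = shape
              # Row-major (C-order) strides: elements to skip per step along each dim
              self.strides, step = [], 1
              for dim in reversed(shape):
                  self.strides.insert(0, step)
                  step *= dim

          def __getitem__(self, idx):
              offset = sum(i * s for i, s in zip(idx, self.strides))
              return self.data[offset]

      t = Tensor(range(12), (3, 4))  # strides = [4, 1]
      print(t[(1, 2)])               # 1*4 + 2*1 = offset 6 -> value 6
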
  15. Wanting power usually has little to do with making the world better.
  16. Sure, I get it - trying to understand a specific condition affecting someone close to you. I personally have very little trust in doctors.

    But, outside of this need, what actionable science have you learned and applied to your own life?

  17. My comment is also directed towards OP: you don’t need small talk to make friends.
  18. OK, I just read the abstract and conclusion of the NAC paper posted above. But then I saw a comment from Aurornis saying it’s not that good. Not sure who I should listen to.
  19. Would regular engineers like us understand molecular biology papers?
  20. Is Estonia a good country to immigrate to for an American?
  21. If something is not clear in the book, can I ask it to explain it to me?
  22. This doesn’t make sense to me. Why would you want something like this? What is it exactly that you expect from such a finetuned model that you cannot get from a frontier general purpose model?
  23. I hate small talk. Never needed it to make friends.
  24. Why limit to 2 agents? I typically use all 3.
  25. Please elaborate
  26. No China or Russia? Strange…
  27. they spending $20 billion dollars to handicap an inference company

    Inference hardware company

  28. The model weights might get loaded on every token - from GPU memory into the GPU's compute units. How much depends on how much of the model is cached on-chip. The inputs to every layer must be loaded as well. Also, if your model doesn't fit in GPU memory but fits in CPU memory and you're offloading, then you're also shuffling weights between CPU and GPU memory.
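
    A back-of-envelope illustration of why this dominates decoding speed (my own numbers, assuming a 7B-parameter fp16 model and ~1 TB/s of GPU memory bandwidth; during decoding the weights are read roughly once per generated token if they aren't cached on-chip):

      params = 7e9            # 7B-parameter model (assumption)
      bytes_per_param = 2     # fp16 weights
      hbm_bandwidth = 1e12    # ~1 TB/s GPU memory bandwidth (assumption)

      bytes_per_token = params * bytes_per_param    # weights read once per token
      print(bytes_per_token / hbm_bandwidth * 1e3)  # ~14 ms/token lower bound
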
  29. Depends on the map_location arg in torch.load: it might be loaded straight into GPU memory.
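
    For example (standard torch.load usage; the checkpoint path is a placeholder):

      import torch

      state = torch.load("model.pt", map_location="cpu")     # tensors land in CPU RAM
      state = torch.load("model.pt", map_location="cuda:0")  # tensors go straight to GPU 0
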
  30. We’ve had “compute in flash” for a few years now: https://mythic.ai/product/

