refibrillator
1,263 karma

  1. No disrespect but paying to verify age feels absurd, let alone putting a private company in charge of what should be an essential function of the government.

    How about this: when you turn 18 (or whatever the threshold is), the government gives you a signed JWT that contains your DOB. Anyone who needs to verify your age can check that claim and simply validate the signature via a public key published by the government.

    Simply grab a fresh JWT each time you need one, so a single long-lived token can’t be used to track you across services and privacy is preserved.

    And sure, sprinkle in some laws that make it illegal to store or share JWTs for clearly fraudulent purposes.
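
    A rough sketch of the verification side in Python, assuming an RS256-signed token with a “dob” claim (the claim layout and the choice of PyJWT are my own assumptions, not part of any real scheme):

        from datetime import date
        import jwt  # PyJWT, installed via: pip install pyjwt[crypto]

        def is_over_18(token: str, gov_public_key: str) -> bool:
            # Raises if the signature does not match the government's published public key
            claims = jwt.decode(token, gov_public_key, algorithms=["RS256"])
            dob = date.fromisoformat(claims["dob"])
            today = date.today()
            age = today.year - dob.year - ((today.month, today.day) < (dob.month, dob.day))
            return age >= 18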

    > the vast majority of kids don't easily have access to alcohol or cigarettes

    This feels like it comes from an affluent perspective. Where I grew up it was trivial to acquire these things and much worse, and there will always be someone’s older brother who will do it for $20 because he’s got nothing to lose.

  2. H100 has 80 GB of HBM3. There’s only like 37 MB of SRAM on a single chip.
  3. Fascinatingly, the body already has a mechanism for this: fasting. One of the many beneficial side effects is rapid mucosal atrophy, decreasing villus height and crypt depth.

    You can find evidence of this in the literature, but it’s absurdly understudied, because big pharma would rather sell you a subscription to life.

    Fortunately there are many good people in the world, especially in the field of medicine, who want to help their patients unconditionally. So there are glimmers of hope, like some of the top cardiologists in the world going against the status quo and treating patients with fasting regimens instead of surgery.

  4. This is hilarious, I don’t even want to know if it’s legit.
  5. Love anecdotes like this! But admittedly I feel a bit lost, so please forgive my ignorance when I ask: why does choosing a subset of k integers at random require deduplication? My naive intuition is that sampling without replacement can be done in linear time (hash table to track chosen elements?). I’m probably not understanding the problem formulation here.
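
    For concreteness, here’s the naive approach I had in mind, as a rough sketch (a hash set with re-draws on collision, expected O(k) time when k is much smaller than n):

        import random

        def sample_without_replacement(n: int, k: int) -> list[int]:
            """Pick k distinct integers from range(n); duplicates are simply re-drawn."""
            chosen = set()
            while len(chosen) < k:
                x = random.randrange(n)
                if x not in chosen:
                    chosen.add(x)
            return list(chosen)

    (For what it’s worth, Python’s random.sample(range(n), k) does sampling without replacement out of the box.)
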
  6. One of the cooler and lesser known features of JPEG XL is a mode to losslessly transcode existing JPEGs while achieving ~20% space reduction. It’s reversible too: the DCT coefficients are preserved, and the original JPEG bitstream can be reconstructed bit for bit.
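
    If you want to see the round trip yourself, here’s a rough sketch assuming the reference libjxl command-line tools (cjxl/djxl) are installed; cjxl recompresses JPEG inputs losslessly by default:

        import subprocess

        # JPEG -> JXL: keeps the DCT coefficients, re-does the entropy coding
        subprocess.run(["cjxl", "photo.jpg", "photo.jxl"], check=True)

        # JXL -> JPEG: reconstructs the original JPEG bitstream exactly
        subprocess.run(["djxl", "photo.jxl", "roundtrip.jpg"], check=True)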

    Notably GCP is rolling this out to their DICOM store API, so you get the space savings of JXL but can transcode on the fly for applications that need to be served JPEG.

    Only know this because we have tens of PBs in their DICOM store and stand to save a substantial amount of $ on an absurdly large annual bill.

    Native browser support is on our wishlist and our contacts indicate the Chrome team will get there eventually.

  7. Yeah it’s pretty clearly a bot account, or at least someone who likes to copy-paste from ChatGPT to sound smart.
  8. > It works better!

    > I strongly believe it is one of the best technologies for AI agents

    Do you have any quantitative evidence to support this?

    Sincere question. I feel it would add some much-needed credibility in a space where many folks are riding the hype wave and low-key shilling their products with vibes instead of rigor.

  9. Ha, made me chuckle. For those wondering seriously about this: it’s not a viable optimization because weights are not readily compressible via JPEG/DCT, and there are only a limited number of these decode units on the chip, which bottlenecks throughput, so decode speed is dwarfed by simply reading the uncompressed weights straight from HBM.
  10. Great exposition, loved the touch of humor. Please do the backward pass when it’s published.

    As a fellow Tri Dao groupie and lucky duck who gets to build on Hopper/Blackwell clusters, I find it amazing how difficult it is becoming to write kernels that saturate GPU hardware.

    When I squint, there appears to be a trend emerging across work like FA4, monolithic (mega) kernels, etc. Namely, a subversion of the classic CUDA programming model in the form of fine-grained, task-based parallelism managed entirely in “user space”.

    Not exactly sure what’s ahead but I’m strapping in for a wild ride…

  11. Well, “import torch” for example resolves certain dynamically linked symbols, and that must happen before you import your own .so that uses libtorch and pybind11. If not, you will get a super-fun-to-debug segfault, leaving you staring at gdb backtrace output while you ponder your career choice.

    This is buried deep in the PyTorch docs and I don’t have the willpower to go find it right now, sorry.
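
    The gist, as a sketch (the extension name my_ext is made up; the ordering is the point):

        import torch   # loads libtorch and resolves its dynamic symbols first
        import my_ext  # hypothetical pybind11 extension built against libtorch; safe to load now

        # Swapping the two imports above is what produces the import-time segfault.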

  12. Tokenization is typically done on CPU and is rarely (if ever) a bottleneck for training or inference.

    GPU kernels typically dominate in terms of wall-clock time; the only exception might be very small models.

    Thus the latency of tokenization can essentially be “hidden” by having the CPU prepare the next batch while the GPU finishes the current one.
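
    In PyTorch that overlap usually falls out of the DataLoader for free. A rough sketch (texts, tokenizer, and model are placeholders, and padding/collation details are elided):

        from torch.utils.data import DataLoader, Dataset

        class TextDataset(Dataset):
            def __init__(self, texts, tokenizer):
                self.texts, self.tokenizer = texts, tokenizer
            def __len__(self):
                return len(self.texts)
            def __getitem__(self, i):
                # CPU-side tokenization runs here, inside a worker process
                return self.tokenizer(self.texts[i])

        loader = DataLoader(
            TextDataset(texts, tokenizer),
            batch_size=32,
            num_workers=4,       # workers tokenize upcoming batches in parallel
            prefetch_factor=2,   # each worker keeps a couple of batches queued
            pin_memory=True,     # enables faster async host-to-device copies
        )

        for batch in loader:
            out = model(batch.to("cuda", non_blocking=True))  # GPU stays busy while the CPU preps the next batch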

  13. Hi author(s), the on-GPU interpreter approach looks like a promising path forward. Have you seen this strikingly similar concurrent work?

    https://www.hackerneue.com/item?id=44111673

    I find it curious that fundamentals of the CUDA programming model (e.g. kernel launches) are being subverted in favor of fine-grained, task-based parallelism that ends up using the hardware more effectively. Makes me wonder if CUDA has been holding us back in some ways.

    What are the chances we see your work land in PyTorch as an experimental backend?

    Awesome stuff, thanks for sharing.

    P.S. Minor typo: your first two paragraphs under part 1 are nearly identical.

  14. The code has few comments, but you gotta love it when you can tell someone was having fun!

    https://github.com/ScalingIntelligence/tokasaurus/blob/65efb...

    I’m honestly impressed that a pure python implementation can beat out vLLM and SGLang. Granted they lean on FlashInfer, and of course torch.compile has gotten incredibly powerful in the last few years. Though dynamic shapes have still been a huge thorn in my side, I’ll need to look closer at how they pulled it off…

  15. > Unsloth Dynamic GGUF which, quality wise in real-world use performs very close to the original

    How close are we talking?

    I’m not calling you a liar, OP, but in general I wish people perpetuating such broad claims would be more rigorous.

    Unsloth does amazing work; however, as far as I’m aware even they themselves do not publish head-to-head evals against the original unquantized models.

    I have sympathy here because very few people and companies can afford to run the original models, let alone engineer rigorous evals.

    However I felt compelled to comment because my experience does not match. For relatively simple usage the differences are hard to notice, but they become much more apparent in high complexity and long context tasks.

  16. Note to others reading along: on the last appendix page, the OP paper reports that DFloat11 reduces tokens/sec by ~2-3x for the Llama-3.1-8B, Qwen-2.5-14B/32B, and Mistral-Small-24B models (the throughput penalty is not reported for the others).

    Using DFloat11, tokens/sec was higher only relative to running inference with some layers offloaded to CPU.

    Classic comp sci tradeoff between space and speed, no free lunch, etc.
