- 3 points
- Crossover from the other front page article. I tested out ChatGPT5 search mode and there are some good sources!
- I wonder if the error propagation problem could be solved with a “branching” generator? Basically at every token you fork off N new streams, with some tree pruning policy to avoid exponential blowup. With a bit of bookkeeping you could make an attention mask to support the parallel streams in the same context sharing prefixes. Perhaps that would allow more of an e2e error minimization than the greedy generation algorithm in use today?
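  That idea is essentially beam search with a fork width and a score-based pruning policy. A minimal sketch, assuming a hypothetical `logprobs(seq)` hook that returns candidate next tokens sorted by log-probability (the prefix-sharing attention mask would live inside the model call and isn't shown):

  ```python
  import heapq
  from typing import Callable, List, Tuple

  # Hypothetical model hook: maps a token sequence to candidate next tokens,
  # sorted by log-probability. Name and signature are assumptions for this sketch.
  LogProbFn = Callable[[List[int]], List[Tuple[int, float]]]

  def branching_generate(logprobs: LogProbFn, prompt: List[int],
                         fork_width: int = 4,   # N streams forked at each token
                         beam_size: int = 16,   # pruning policy: cap on live streams
                         max_len: int = 64, eos: int = 0) -> List[int]:
      # Each stream is (cumulative logprob, sequence). Ranking by total score is
      # what moves error minimization from per-token (greedy) toward end-to-end.
      streams = [(0.0, list(prompt))]
      for _ in range(max_len):
          candidates = []
          for score, seq in streams:
              if seq and seq[-1] == eos:        # finished streams pass through
                  candidates.append((score, seq))
                  continue
              for tok, lp in logprobs(seq)[:fork_width]:
                  candidates.append((score + lp, seq + [tok]))
          # Tree pruning: keep only the best streams to avoid exponential blowup.
          streams = heapq.nlargest(beam_size, candidates, key=lambda c: c[0])
          if all(seq[-1] == eos for _, seq in streams):
              break
      return max(streams, key=lambda c: c[0])[1]
  ```

  In practice all live streams share the prompt prefix, so the bookkeeping could batch every stream into a single forward pass with a shared-prefix attention mask rather than recomputing each one independently.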
- 2 points
- Or see the explanation in video form here: https://m.youtube.com/watch?v=d0HJvGSWw8A
Mamba has been discussed a lot here, and this seems like a promising line of inquiry for improvement
- 2 points
- Sparse attention essentially combines 3 types of attention optimizations (rough sketch after the list):
1. Compression of the key/value vectors into coarse block-level tokens to reduce the size of the KV cache
2. Selectively computing uncompressed attention on a subset of tokens based on the compressed blocks with the highest attention scores
3. Using sliding window for local attention at full resolution
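Here is a rough single-query, single-head PyTorch sketch of how the three branches could fit together. It only shows the dataflow: mean-pooling stands in for the paper's learned compressor, a plain average replaces its learned gating, and the function and parameter names are my own (the speedups come from custom kernels, not from anything shown here).

```python
import torch
import torch.nn.functional as F

def sparse_attention_toy(q, K, V, block=32, top_blocks=4, window=128):
    """q: (d,) query; K, V: (T, d) cached keys/values."""
    T, d = K.shape
    scale = d ** -0.5

    # 1. Compression: pool K/V blocks into coarse tokens (the paper uses a
    #    learned compressor; mean-pooling is a stand-in).
    nb = T // block
    Kc = K[: nb * block].reshape(nb, block, d).mean(1)
    Vc = V[: nb * block].reshape(nb, block, d).mean(1)
    a_c = F.softmax(Kc @ q * scale, dim=0)       # block-level attention scores
    out_cmp = a_c @ Vc

    # 2. Selection: attend at full resolution only inside the blocks with the
    #    highest compressed attention scores.
    idx = a_c.topk(min(top_blocks, nb)).indices
    sel = torch.cat([torch.arange(int(i) * block, (int(i) + 1) * block) for i in idx])
    out_sel = F.softmax(K[sel] @ q * scale, dim=0) @ V[sel]

    # 3. Sliding window: full-resolution local attention over recent tokens.
    out_win = F.softmax(K[-window:] @ q * scale, dim=0) @ V[-window:]

    # The paper combines the three branch outputs with learned gates; a plain
    # average keeps this sketch self-contained.
    return (out_cmp + out_sel + out_win) / 3
```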
> Both Full Attention and sparse attention models are pretrained on 270B tokens of 8k-length texts, followed by continued training and supervised fine-tuning on 32k-length texts with YaRN to achieve long-context adaptation. Both models are trained to full convergence to ensure fair comparison.
> our experiments adopt a backbone combining Grouped-Query Attention (GQA) and Mixture-of-Experts (MoE), featuring 27B total parameters with 3B active parameters
Evaluated on MMLU, MMLU-PRO, CMMLU, BBH, GSM8K, MATH, DROP, MBPP, and HumanEval. NSA outperforms full attention on 7/9.
Beats out H2O, InfLLM, Quest, Exact-Top, and full attention on LongBench
Perfect retrieval on 64k needle-in-a-haystack
The CoT eval is less convincing, but NSA outperforms full attention on AIME24.
Training speedup of 2-9x vs. FlashAttention
Decoding speedup of 4-12x vs. full attention ["expected"? Didn't see comparison to other attention mechanisms]
- 1 point
- Great to see this is alive and progressing! I believe Ohm started life in Alan Kay’s research group, to build a graphical OS and office suite in 10k lines of code. I found this talk immensely inspiring https://m.youtube.com/watch?v=ubaX1Smg6pY
- 107 points
- > I feel like I'm taking crazy pills when I read about others' experiences. Surely I am not alone?
You're not alone :-) I asked a very similar question about a month ago: https://www.hackerneue.com/item?id=42552653 and have continued researching since.
My takeaway was that autocomplete, boilerplate, and one-off scripts are the main use cases. To use an analogy, I think the code assistants are more like an upgrade from a handsaw to power tools, and less like hiring a carpenter. (Which is not what the hype engine will claim.)
For me, only the one-off script (write-only code) use-case is useful. I've had the best results on this with Claude.
Emacs abbrevs/snippets (+ choice of language) virtually eliminate the boilerplate problem, so I don't have a use for assistants there.
For autocomplete, I find that LSP completion engines provide 95% of the value for 1% of the latency. Physically typing the code is a small % of my time/energy, so the value is more about getting the right names, argument order, and other fiddly details I may not remember exactly. But I find that LSP-powered autocomplete and tooltips largely solve those challenges.
- The “pair programming” approach with good models is just slow enough that I lose focus on each step. The faster models I’ve tried are not good enough, except for straightforward things where it’s faster to just use emacs/LSP refactoring and editing tools. Maybe Supermaven manages to beat the “good enough, fast enough” bar; I’ll have to try it!
- I still don’t understand how people are getting value out of AI coders. I’ve tried really hard, and the commits produced are just a step up from garbage. Code written from scratch is generally decent, but after a few rounds of edits the assistant just starts piling conditionals into existing functions until it’s a rat’s nest 4 layers deep and 100+ lines long. The other day it got into a loop trying to resolve a type error, where it would make a change, then revert it, then make it again.
ETA: Sorry, I forgot about relevancy in my rant! The one area where I’ve found the AIs helpful is enumerating and then creating test cases.
- Maybe this? https://www.cis.upenn.edu/~bcpierce/tapl/
- Yes, I know the slotted attribute is not in a __dict__, which definitely helps memory usage. But my point is that if the parent structure is itself in a dict, that access will swamp the L1 cache miss in terms of latency. Even the interpretation overhead (and likely cache thrashing) will eliminate L1 cache speedups.
And yes, __slots__ improve perf, but that’s about avoiding the __dict__ access, which hits really generic hashing code and then memory probing, more than it is about the L1 cache.
Where __slots__ are most useful (and IIRC what they were designed for) is when you have a lot of tiny objects and memory usage can shrink significantly as a result. That could be the difference between having to spill to disk or keeping the workload in memory. E.g., openpyxl does this with its spreadsheet model, where there can be tons of cell references floating around.
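A minimal illustration of the memory angle (the Cell classes here are hypothetical stand-ins for the kind of tiny objects openpyxl keeps around):

```python
import sys

class Cell:                       # regular class: every instance carries a __dict__
    def __init__(self, row, col, value):
        self.row, self.col, self.value = row, col, value

class SlottedCell:                # __slots__: fixed attribute layout, no per-instance __dict__
    __slots__ = ("row", "col", "value")
    def __init__(self, row, col, value):
        self.row, self.col, self.value = row, col, value

c, s = Cell(1, 1, 0.0), SlottedCell(1, 1, 0.0)
print(sys.getsizeof(c) + sys.getsizeof(c.__dict__))  # instance + its attribute dict
print(sys.getsizeof(s))  # noticeably smaller; multiply by millions of cells
```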
- Never found so many choice quotes in one article...
> Susie Thomas, a clerk for Lee County’s superior court, estimates it now takes her 10 times as many clicks to complete her case indexing. She was buried in scanning paper dockets for Odyssey’s online database until May 2024 and sorely misses the old DOS program. “It was a lot simpler and easier,” she says.
> A Tyler spokesperson says that ... its definition of “defect” is “not a ‘bug’ in the software, but something that didn’t work as anticipated.”
> Errors in the company’s apps have allegedly contributed to people getting stuck in prison for weeks longer than was ordered, or having incorrect verdicts entered on their records. Yet its products remain ubiquitous, in part because it has few serious competitors in the judicial space.
https://youtu.be/3K-R4yVjJfU?si=JdVyYOlxUbEcvEEo&t=2624
> Q: Are the releases aligned with pre-training efforts?
> A: There used to be a time not that long ago, maybe half a year, distant past, where the models would align with RL runs or pretraining runs ... now the naming is by capability. GPT5 is a capable model; 5.1 is a more capable model