- I see this argument often but for me it misses something.
The difference is about power. The wealth being this concentrated means the power is concentrated.
If people are okay with the idea of an ETF, or a wealth manager (or any type of fund manager/investment bank), then they should be okay with sovereign wealth funds/national ETFs that provide dividends with a guaranteed single-share-single-vote setup.
If you want competition, the US government used to be good at creating and sustaining artificial competition in military procurement - similar to how Amazon lets teams compete on the same projects internally.
Because the competition would be artificial and enforced by law, there's just as much potential for massive efficiency gains as there is potential for corruption (the Norwegian national wealth fund has gone swimmingly for them).
- You need to think about 1) the latent state and 2) the fact that part of the model is post-trained to bias the Markov chain towards abiding by the query in the sense of the reward.
A way to look at it is that you effectively have 2 model "heads" inside the LLM, one which generates, one which biases/steers.
The MCMC is initialised based on your prompt, the generator part samples from the language distribution it has learned, while the sharpening/filtering part biases towards stuff that would be likely to have this MCMC give high rewards in the end. So the model regurgitates all the context that is deemed possibly relevant based on traces from the training data (including "tool use", which then injects additional context) and all those tokens shift the latent state into something that is more and more typical of your query.
Importantly, attention acts as a selector and has multiple heads, and these specialize, so (simplified) one head can maintain focus on your query and "judge" the latent state, while the rest can follow that Markov chain until some subset of the generated+tool-injected tokens gives enough signal to the "answer now" gate that the model flips into "summarizing" mode, which then uses the latent state of all of those tokens to actually generate the answer.
So you very much can think of it as sampling repeatedly from an MCMC using a bias, a learned stopping rule, and then having a model create the best possible combination of the traces, except that all this machinery is encoded in the same model weights, which get to reuse features between one another, for all the benefits and drawbacks that yields.
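As a toy illustration of that "generator head + steering head + stop gate" picture (everything here - vocabulary, distributions, the gate - is made up for illustration; real LLMs obviously don't factor this cleanly):

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB = ["the", "cat", "sat", "mat", "on", "<answer>"]

def base_probs(state):
    # "generator head": toy next-token distribution (random here, for illustration)
    logits = rng.standard_normal(len(VOCAB))
    return np.exp(logits) / np.exp(logits).sum()

def steer_weights(state):
    # "steering head": up-weights tokens judged likely to lead to reward;
    # here it simply makes the stop/answer token more likely as the trace grows
    return np.array([1.0, 1.0, 1.0, 1.0, 1.0, 0.1 + 0.2 * len(state)])

def rollout(max_len=50):
    state = []
    for _ in range(max_len):
        p = base_probs(state) * steer_weights(state)
        p /= p.sum()
        tok = rng.choice(VOCAB, p=p)
        state.append(tok)
        if tok == "<answer>":  # learned stopping rule: the gate fires
            break
    return state

trace = rollout()
```

The point is only the shape of the loop: a biased chain of samples plus a learned criterion for when to stop and summarize.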
There was a paper close to when o1 became a thing that showed that instead of doing CoT, you could just spend that token budget on K parallel shorter queries (by injecting something like "ok, to summarize" and "actually" to force completion) and pick the best one/majority vote. Since then RLHF has made longer traces more in-distribution (although there's another paper showing that as of early 2025 you were trading away peak performance and edge-case coverage for reduced variance and higher performance on common cases, although this might be ameliorated by now), but that's about how it broke down in 2024-2025.
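The parallel-sampling half of that is trivial to sketch (hypothetical stand-in, not the paper's code): run K short rollouts, force each to summarize early, then majority-vote over the final answers:

```python
import collections

def majority_vote(answers):
    # pick the most common final answer among K parallel short rollouts
    return collections.Counter(answers).most_common(1)[0][0]

# pretend these are K=5 short rollouts, each cut off early with "ok, to summarize"
samples = ["42", "17", "42", "42", "17"]
print(majority_vote(samples))  # → 42
```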
- I'd encourage everyone to learn about Metropolis-Hastings Markov chain Monte Carlo, then squint at LLMs, think about what token-by-token generation of the long rollouts maps to in that framework, and consider that you can think of the stop token as a learned stopping criterion accepting (a substring of) the output.
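For the curious, a minimal Metropolis-Hastings sampler (targeting a standard normal, just to show the propose/accept machinery the analogy leans on):

```python
import math, random

random.seed(0)

def target(x):
    # unnormalized density of a standard normal
    return math.exp(-x * x / 2)

def metropolis_hastings(n_steps=10000, step=1.0):
    x, samples = 0.0, []
    for _ in range(n_steps):
        proposal = x + random.gauss(0, step)  # symmetric random-walk proposal
        # accept with probability min(1, target(proposal) / target(x))
        if random.random() < target(proposal) / target(x):
            x = proposal
        samples.append(x)
    return samples

samples = metropolis_hastings()
mean = sum(samples) / len(samples)
```

The accept/reject step is the part to squint at: a rule that filters proposed continuations of the chain.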
- I have a tiny tiny podcast with a friend where we try to break down which parts of the hype are bullshit (muck) and which kernels of truth are there, if any. It started partially as a place to scream into the void, partially to help the people who are anxious about AGI or otherwise being harmed by the hype. I think we have a long way to go in terms of presentation (breaking down very technical terms for an audience that is used to vague hype around "AI" is hard), but we cite our sources; maybe it'll be interesting for you to check out our shownotes:
https://kairos.fm/muckraikers/
I personally struggle with Gary Marcus critiques because whenever they are about "making AI work" he goes into neurosymbolic "AI", which I have technical disagreements with, and I have _other_ arguments for the points he sometimes raises which I think are more rigorous, so it's difficult to be roughly in the same camp - but overall I'm happy someone with reach is calling BS as well.
- Cool, thanks a lot. Btw, I have a very tiny tiny (50 to 100 audience) podcast where we try to give context to what we call the "muck" of AI discourse (trying to ground claims in what we would call objectively observable facts/evidence, and then _separately_ giving our own biased takes). If you would be interested to come on it and chat => contact email in my profile.
- Could you either release the dataset (raw but anonymized) for independent statistical evaluation or at least add the absolute times of each dev per task to the paper? I'm curious what the absolute times of each dev with/without AI were, and whether the one guy with lots of Cursor experience was actually faster than the rest or just a slow typist getting a big boost out of LLMs.
Also, cool work, very happy to see actually good evaluations instead of just vibes or observational studies that don't account for the Hawthorne effect.
- I really believe in the importance of praising people and acknowledging their efforts when they are kind and good human beings, and (to a much lesser degree) their successes.
But, and I mean this without snark: what value is your praise for what is good if I cannot trust that you will be critical of what is bad? Note that critique can be unpleasant but kind, and I don't care for "brutal honesty" (which is much more about the brutality than the honesty in most cases).
But whether it's the joint Slavic-German culture or something else, I much prefer for things to be _appropriate_, _kind_ and _earnest_ instead of just supportive or positive. Real love is despite a flaw, in full cognizance of it, not ignoring it.
- What is your definition of "understand them well"?
- Check the actual paper on the type of sorts it actually got a speedup on :-) (hint: a few percentage points on larger n, similar to what PGO might find; the big speedup is for n around 8 or so, where it basically enumerated and found a sorting network)
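For flavour, here's what a fixed sorting network for n=4 looks like - five data-independent compare-and-swaps (this is the classic optimal 4-input network, written out for illustration, not AlphaDev's actual output):

```python
def sort4(a, b, c, d):
    # a fixed sorting network: the sequence of compare-and-swaps is the
    # same for every input, which is what makes it so fast for tiny n
    if a > b: a, b = b, a
    if c > d: c, d = d, c
    if a > c: a, c = c, a
    if b > d: b, d = d, b
    if b > c: b, c = c, b
    return a, b, c, d

print(sort4(3, 1, 4, 2))  # → (1, 2, 3, 4)
```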
- Nah, it's much simpler: the models aren't reliably able to recall the correct rule from memory - it's in the training set for sure.
This is another specialized synthetic data generation pipeline for a curriculum for one particular algorithm cluster to be encoded into the weights, no more, no less. They even mention quality control still being important.
- Please make sure aider and llm-cli can use this soon, kthx :-)
- First point: I know people who had their houses trashed by card-carrying neo-Nazis for being trans or migrant-looking, who had their PII abused by Nazi cops (again, card-carrying) who shared access illegally, and I know people whose grandparents died in the resistance or the camps. I've seen the concentration camps as part of my travels when I was younger.
This isn't theoretical, this is learning from past mistakes. Only people who think themselves beyond this would be this dismissive.
2nd point: if you want to make a very specific point about Coasean dynamics, freedom of business, and liberal commerce, you need to be more specific. In the extremely broad strokes you used, the New Deal, China, the post-war recovery in Europe, well-regulated energy markets vs unregulated monopolies, etc. are all empirical counters to your vibes. I made these intentionally as broad-stroke as you did; feel free to debunk them, but as an honest intellectual you'd then also find the limits to your claim and add nuance to it. Only Rand fanfiction takes the "job provider" literally.
- At the beginning of my PhD, to help with rent, I contracted to help develop computer vision algorithms in this field - only PoCs, never got very far.
An interesting thing is that the lice apparently evolve super fast, including becoming translucent and resistant to poison.
- While I do think gp should give you an example, I also invite you to provide a counter example, otherwise both of you are just stating your vibes.
If you are _actually_ willing to discuss, you can't just demand that the other side give; you can also set the standard by giving.
- In the spirit of HN rules, actual answering instead of snark:
1. The memory of Nazis using centralised industrial might and information to kill millions of people (google Dutch insurance records Nazis)
2. A much stronger history of workers' rights and distrust of rich people, together with a different attitude to government, making it politically challenging
You can hate it or love it, but median life expectancy in Europe is much less correlated with wealth (and _I think_ higher, from memory), child hunger and food insecurity are much lower, and homelessness and other corporate abuse are much less of a problem, corrected for wealth and population.
Whether we manage to keep this remains to be seen, but I think it's a reasonable set of different preferences
- your profile says >Hit me up if you want to collaborate on NLP research
but doesn't hint at how; check _my_ profile for hints on how :-p
- A few technical questions (I had a somewhat related work with friends here https://openreview.net/forum?id=I3HCE7Ro78H although we focused on gradient multiplicity in adversarial training, not massively parallel training)
1. Do you think this is a form of variance reduction or more a form of curriculum (focus first on the bulk, then on remaining errors)? 2. Did you observe any overfitting/additional adversarial risk? 3. Did you try this on just single-node minibatches as well? How did that perform?
- GNNs are useful in at least one case: when your data is a set of atoms that define your datum through their interactions, specifically a set that is high cardinality (so you can't YOLO it with attention) with some notion of neighbourhood (i.e. geometry) within your set (defined by the interactions) which, if maintained, makes the data permutation equivariant, BUT you can't find a meaningful way to represent that geometry implicitly (for example because it changes between samples) => you YOLO it by just passing the neighbourhood/interaction structure in as an input.
In almost every other case, you can exploit additional structure to be more efficient (can you define an order? sequence model. is it Euclidean/Riemannian? CNN or manifold-aware models. no need for global state? pointcloud networks. you have an explicit hierarchy? UNet version of your underlying modality. etc.)
The reason why I find GNNs cool is that 1) they encode the very notion of _relations_ and 2) they have a very nice relationship to completely general discretized differential equations, which, as a complex systems/dynamical systems guy, I find cool (but if you can specialize, there are again easier ways)
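A minimal mean-aggregation message-passing layer makes the "pass the interaction structure in as an input" point concrete (toy numpy sketch, all names made up):

```python
import numpy as np

def gnn_layer(X, A, W):
    # one message-passing step: aggregate neighbour features via the
    # adjacency matrix A (the explicit interaction structure), then transform.
    # Permutation equivariance falls out of A being an input, not baked in.
    deg = A.sum(axis=1, keepdims=True).clip(min=1)
    messages = (A @ X) / deg               # mean over neighbours
    return np.maximum(0.0, messages @ W)   # ReLU

# toy graph: 4 atoms in a path, edges passed in explicitly per sample
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
X = np.eye(4)          # one-hot node features
W = np.ones((4, 2))    # toy weights
H = gnn_layer(X, A, W)
```

The differential-equation connection: with identity weights, `(A @ X) / deg - X` is exactly minus the random-walk graph Laplacian applied to `X`, i.e. one explicit-Euler step of diffusion on the graph.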
- The whole point of GNNs is that they generalize to arbitrary topologies by explicitly conditioning the idea of "neighbours" on the graph specifying the topology. Graph layout has been tried here https://github.com/limbo018/DREAMPlace to great fanfare although there is recent drama about it https://www.semanticscholar.org/paper/The-False-Dawn%3A-Reev... . Graph transformations are a thing as well https://arxiv.org/abs/2012.01470 but it's a tricky problem because you implicitly need to solve the graph matching problem
Identifiability means that out of all possible models, you can learn the correct one given enough samples. Causal identifiability has some other connotations.
See here https://causalai.net/r80.pdf as a good start (a node in a causal graph is Markov given its parents, and a k-step Markov chain is a k-layer causal DAG)
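A tiny worked example of "Markov given its parents" (numbers made up): for a binary chain X -> Y -> Z the joint factorizes as P(x)P(y|x)P(z|y), which immediately gives P(z | x, y) = P(z | y):

```python
import itertools

# toy causal chain X -> Y -> Z over binary variables
p_x = {0: 0.6, 1: 0.4}
p_y_given_x = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.2, 1: 0.8}}
p_z_given_y = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.1, 1: 0.9}}

def joint(x, y, z):
    # each node is Markov given its parents, so the joint factorizes
    return p_x[x] * p_y_given_x[x][y] * p_z_given_y[y][z]

# the chain's conditional independence: P(z | x, y) == P(z | y)
for x, y, z in itertools.product([0, 1], repeat=3):
    p_z_given_xy = joint(x, y, z) / sum(joint(x, y, zz) for zz in [0, 1])
    assert abs(p_z_given_xy - p_z_given_y[y][z]) < 1e-12
```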