
sailingparrot
4,945 karma

  1. Just for training and for processing the existing context (the prefill phase). But when doing inference, a token t has to be sampled before token t+1 can be, so it’s still sequential.
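
    A toy sketch of the contrast (purely illustrative: `forward` is a random stand-in for a real model's forward pass, not any actual API):

      import numpy as np

      rng = np.random.default_rng(0)
      VOCAB = 50

      def forward(tokens):
          # Stand-in for a transformer forward pass: one logit vector per
          # position, all computable in parallel for a fixed input sequence.
          return rng.standard_normal((len(tokens), VOCAB))

      prompt = [3, 14, 15, 9]
      forward(prompt)                      # prefill: all prompt positions in one parallel pass

      seq = list(prompt)
      for _ in range(5):                   # decoding: inherently sequential
          next_logits = forward(seq)[-1]   # needs every token sampled so far
          seq.append(int(np.argmax(next_logits)))  # token t exists only now; then t+1 can start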
  2. Yes, very few new NPPs have been built in Europe recently. Quite a few have been built by Europe, however. The French company Framatome alone, with 18k employees, is actively building 2 EPR reactors in the UK (+ preliminary studies for 8 more), one reactor was finished last year in France, and recently multiple have been built or are being built in China, India, and Russia (although I guess that last one might be canceled).

    It's also already operating the 57 French reactors, as well as operating reactors in South Africa, China, Korea, Belgium, and Finland.

    Sure, the industry will need to grow, but claiming it basically has to start from 0 is ludicrous.

  3. > So you want to create a completely new industry. From the ground. With all existing experts having retired.

    This is an article about Europe. Do you really believe France alone is operating 57 nuclear reactors, and producing 70% of its electricity via fission, without the industry, without the knowledge, and with no experts left? Is ChatGPT running everything?

  4. Sure, I'd rather be doomscrolling on TikTok than be stuck for 4 years in the trenches in France during WW1, but we are talking about a larger trend across time.

    > I just don't trust any measure that's over 20 years old.

    Then take psychological measures that are 20 years old or less; they all point in the same direction.

    Or if you don't trust psychology, take the suicide rate: it's pretty hard to miscount, and it's not subject to much change in how people self-report whether or not they killed themselves.

    You seem to be conflating how you think people ought to feel, given their privileged conditions, with how they actually happen to feel.

  5. > I guarantee our hunter-gatherer ancestors felt all the same emotions—burnout, comparison, envy, anxiety, stress, overwhelm, hopelessness.

    Yes, they felt all the same emotions. You absolutely cannot guarantee they felt them in the same proportions, though.

    > our brains have not changed that much.

    That is the point: our brains have not changed and are still evolving at the speed of gene mutations, while our environment is changing orders of magnitude faster than before.

    > how is this line of reasoning constructive?

    This is not trying to be constructive, just trying to understand the human condition. We probably have no choice but to learn to deal with it, but that doesn't mean technology has no adverse impact.

  6. > Throughout time, people have complained that technology is ruining the world. Before AI it was the internet, and before that it was TV, nuclear power, and so on.

    What if, throughout time, they have been right? Is there any proof that, while tech brought longer lives and more material wealth, we haven't been spiraling down for a while in terms of mental wellbeing, sense of meaning, sense of belonging, etc.?

    That's obviously not true of every piece of tech (e.g. it's hard to imagine how antibiotics, or replacing a coal plant with a nuclear power plant, could have a negative impact on people's mental wellbeing), but it could be true of technology in general. It's not a stretch to believe that technologies that radically transform what a day in the life of a human being looks like can also have an impact on said human being's life.

    Our bodies and minds have been fine-tuned over millions of years for living in nature, hunting and gathering with a small group of family and friends. Now we live sedentary lives, for many away from family and without any sense of community, in large, noisy cities devoid of nature, doing the same job day in, day out, a job that is more and more compartmentalized and less and less concrete and meaningful, only to go home and sit in front of a TV to be bombarded by ads trying to induce FOMO, or, god forbid, to doomscroll on TikTok for hours.

    If it just so happened that those two modes of life generated the exact same levels and qualities of stress in our little brains, that would be quite the coincidence.

    Look at every stat around mental health: anxiety, depression, sense of meaning, etc. They have all been getting worse over decades. And if you think it's caused by people just complaining more than before, look at the rate of people willing to kill themselves; that's the ultimate truth. All worsening.

  7. > So basically they are admitting that not enough people will pay for it to be a profitable business

    Since when is capitalism about stopping trying to make more money right when you become profitable? If they can find a way to make 10x the revenue needed to be profitable, they will.

  8. The unbearable pain of having to handle bills of different sizes; there is not enough empathy in this world to truly pay homage to your suffering.
  9. At no point have I argued that LLMs aren't autoregressive; I am merely talking about LLMs' ability to reason across time steps, so it seems we are talking past each other, which won't lead anywhere.

    And yes, LLMs can be studied under the lens of Markov processes: https://arxiv.org/pdf/2410.02724

    Have a good day
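
    To make the Markov framing concrete, here is a toy sketch (not taken from the linked paper; `next_token_dist` is a made-up stand-in): the state is the whole context so far, and the next token depends only on that state.

      import numpy as np

      VOCAB = 50
      rng = np.random.default_rng(0)

      def next_token_dist(state):
          # Stand-in for the LLM: a fixed distribution over the next token,
          # determined entirely by the current state (the token sequence).
          local = np.random.default_rng(abs(hash(state)) % (2**32))
          logits = local.standard_normal(VOCAB)
          p = np.exp(logits - logits.max())
          return p / p.sum()

      state = (3, 14, 15)                  # Markov state = context so far
      for _ in range(5):
          p = next_token_dist(state)       # transition kernel: depends only on the state
          token = int(rng.choice(VOCAB, p=p))
          state = state + (token,)         # move to the new state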

  10. > If it conveys the intended information then what's wrong with that?

    Well, the issue is precisely that it doesn’t convey any information.

    What is conveyed by that sentence, exactly? What does reframing data curation as cognitive hygiene for AI entail, and what information is in there?

    There are precisely 0 bits of information in that paragraph. We all know training on bad data leads to a bad model; thinking about it as “cognitive hygiene for AI” does not lead to any insight.

    LLMs aren't going to discover interesting new information for you; they are just going to write empty, plausible-sounding words. Maybe it will be different in a few years. They can be useful to help you polish what you want to say or otherwise format interesting information (provided you ask them not to be ultra verbose), but they are just not going to create information out of thin air if you don't provide it to them.

    At least, if you do it yourself, you are forced to realize that you in fact have no new information to share, and you don't waste your time and your audience's time by publishing a paper like this.

  11. > entirely embedded in this sequence.

    Obviously wrong, as otherwise every model would predict exactly the same thing; it would not even be predicting anymore, simply decoding.

    The sequence is not enough to reproduce the exact output; you also need the weights.

    And the way the model works is by attending to its own internal state (weights * input) and refining it, both across the depth (layer) dimension and across the time (token) dimension.

    The fact that you can get the model to give you the exact same output by fixing a few seeds is only a consequence of the process being Markovian, and is orthogonal to the fact that at each token position the model is “thinking” about a longer horizon than the present token and is able to reuse that representation at later time steps.
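
    A toy illustration of that point (random, made-up weights, nothing resembling a real LLM): the same sequence fed through different weights gives different predictions, while the same weights and sequence always give the same one.

      import numpy as np

      VOCAB, DIM = 50, 16
      seq = np.array([3, 14, 15, 9])

      def predict(weights, seq):
          emb, out = weights
          h = emb[seq].mean(axis=0)        # internal state = f(weights, input)
          return int(np.argmax(h @ out))   # next-token prediction from that state

      def make_model(seed):
          # Toy "model" = a pair of random matrices standing in for the weights.
          r = np.random.default_rng(seed)
          return r.standard_normal((VOCAB, DIM)), r.standard_normal((DIM, VOCAB))

      model_a, model_b = make_model(1), make_model(2)
      print(predict(model_a, seq), predict(model_b, seq))    # same sequence, (very likely) different outputs
      assert predict(model_a, seq) == predict(model_a, seq)  # same weights + same sequence: deterministic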

  12. train on bad data, get a bad model
  13. Yes, it's "just" an optimization technique, in the sense that you could do without it and end up with the same result (given the same input sequence), just much slower.

    Conceptually, what matters is not the KV-cache but the attention. But IMHO, thinking about how the model behaves during inference, when it outputs one token at a time and does attention over the KV-cache, is much easier to grok than thinking about training/prefilling, where the KV-cache is absent and everything happens in parallel (although the two are mathematically equivalent).

    The important part of my point is that when the model is processing token N, it can check its past internal states from tokens 1, ..., N-1, and thus "see" its previous plan and reasoning and iterate on it, rather than just repeating everything from scratch in each token's hidden state (with a caveat, explained at the end).

    token_1 ──▶ h₁ᴸ ────────┐
    token_2 ──▶ h₂ᴸ ──attn──┼──▶ h₃ᴸ (refines reasoning)
    token_3 ──▶ h₃ᴸ ──attn──┼──▶ h₄ᴸ (refines further)
    And the KV-cache makes this persistent across time, so the entire system (LLM + cache) effectively becomes able to save its state and iterate on it at each token, rather than having to start from scratch every time.

    But ultimately it's a Markov chain, so again, mathematically, yes, you could just redo the full computation every time and end up in the same place.

    Caveat: because token N at layer L can attend to all other tokens < N, but only at layer L, it can only see what the reasoning looked like at that depth, not what it was after a full pass, so it's not a perfect information-passing mechanism, and it's more pyramidal than a straight line. Hence why I referenced feedback transformers in another message. But the principle still applies: information is passing through time steps.
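
    A minimal single-head sketch of that mechanism (toy sizes, random weights, no real framework involved): each new token's query attends over the keys/values cached for every earlier token, so its hidden state can build on what was already computed.

      import numpy as np

      rng = np.random.default_rng(0)
      DIM = 8
      Wq, Wk, Wv = (rng.standard_normal((DIM, DIM)) for _ in range(3))

      def softmax(x):
          e = np.exp(x - x.max())
          return e / e.sum()

      def step(x_t, k_cache, v_cache):
          # One attention step for the current token: its query looks at the
          # cached keys/values of all previous tokens, plus its own.
          k_cache.append(x_t @ Wk)              # persist this token's internal state
          v_cache.append(x_t @ Wv)
          K, V = np.stack(k_cache), np.stack(v_cache)
          w = softmax(K @ (x_t @ Wq) / np.sqrt(DIM))
          return w @ V                          # mix of past states, weighted by relevance

      k_cache, v_cache = [], []
      tokens = rng.standard_normal((5, DIM))    # stand-ins for 5 token embeddings
      outputs = [step(x, k_cache, v_cache) for x in tokens]
      # Each output depends on the current token *and* every cached past state,
      # so later tokens can reuse and refine what earlier tokens computed.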

  14. Isn't the ability to store past reasoning in an external system to avoid having to do the computation all over again precisely what a memory is though?

    But sure, mathematically, KV-caching is equivalent to doing prefilling at every token. The important part of my message, though, was the attention.

    A plan/reasoning made during the forward pass of token 0 can be looked at by the subsequent (or parallel, if you don't want to use the cache) passes of tokens 1, …, n. So you cannot consider token n to be starting from scratch in terms of reasoning/planning, as it can reuse what has already been planned during previous tokens.

    If you think about inference with KV-caching, even though you are right that mathematically it's just an optimization, it makes this behavior much easier to reason about: the KV-cache is a store of past internal states that the model can attend to for subsequent tokens, which allows those subsequent tokens' internal hidden states to be more than just a repetition of what the model already reasoned about in the past.
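
    A toy check of that equivalence (random weights, single head, purely illustrative): recomputing everything from scratch and reading keys/values out of a cache give exactly the same numbers; the cache only avoids redoing the work.

      import numpy as np

      rng = np.random.default_rng(0)
      DIM = 8
      Wq, Wk, Wv = (rng.standard_normal((DIM, DIM)) for _ in range(3))
      xs = rng.standard_normal((6, DIM))        # toy embeddings for 6 tokens

      def softmax(x):
          e = np.exp(x - x.max())
          return e / e.sum()

      def out_at(t, K, V):
          # Attention output for position t over positions 0..t (causal).
          w = softmax(K[: t + 1] @ (xs[t] @ Wq) / np.sqrt(DIM))
          return w @ V[: t + 1]

      # "Prefill at every token" style: recompute all keys/values from scratch.
      no_cache = out_at(5, xs @ Wk, xs @ Wv)

      # Cached style: each token's keys/values computed once, then stored.
      K_cache, V_cache = [], []
      for t in range(6):
          K_cache.append(xs[t] @ Wk)
          V_cache.append(xs[t] @ Wv)
      cached = out_at(5, np.stack(K_cache), np.stack(V_cache))

      assert np.allclose(no_cache, cached)      # identical result, less computation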

  15. You are forgetting about attention over the KV-cache, which is the mechanism that allows LLMs to not start anew every time.
  16. But you are missing the causal attention in your analysis. The output is not the only thing that is preserved; there is also the KV-cache.

    At token 1, the model goes through, say, 28 transformer blocks, and for each of those blocks we save 2 projections of the hidden state (the keys and values) in a cache.

    At token 2, on top of seeing the new token, the model is now also able, in each of those 28 blocks, to look at the previously saved hidden states from token 1.

    At token 3, it can see the states from tokens 1 and 2, etc.

    However, I still agree that this is not a perfect information-passing mechanism, because of how those models are trained (and something like the feedback transformer would be better), but information is still very much being passed from earlier tokens to later ones.
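
    A sketch of what that cache looks like (28 blocks as in the example above; sizes and projection matrices are made up): two growing stores per block, with one key and one value appended per token.

      import numpy as np

      rng = np.random.default_rng(0)
      N_LAYERS, DIM = 28, 16                    # 28 blocks, toy hidden size

      # Toy per-block K/V projections (stand-ins for the real learned ones).
      Wk = rng.standard_normal((N_LAYERS, DIM, DIM))
      Wv = rng.standard_normal((N_LAYERS, DIM, DIM))

      # One (keys, values) store per transformer block: the "2 projections".
      kv_cache = [{"k": [], "v": []} for _ in range(N_LAYERS)]

      def cache_token(per_layer_hidden):
          # per_layer_hidden: the current token's hidden state in each block.
          for layer, h in enumerate(per_layer_hidden):
              kv_cache[layer]["k"].append(h @ Wk[layer])
              kv_cache[layer]["v"].append(h @ Wv[layer])

      for _ in range(3):                        # process tokens 1, 2, 3
          cache_token(rng.standard_normal((N_LAYERS, DIM)))

      # While processing token 3, every block could attend to the entries
      # saved for tokens 1 and 2 (plus its own): 3 cached states per block.
      print(len(kv_cache[0]["k"]))              # -> 3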

  17. Indeed, that's what I meant. The LLM isn't a blank slate at the beginning of each new token during autoregression, as the KV-cache is there.
  18. I don't remember any paper looking at this specific question (though it might be out there), but in general Anthropic's circuits thread series of articles is very good on the broader subject: https://transformer-circuits.pub
  19. I think a better mental framework for how those models work is that they keep a history of the states of their "memory" across time.

    Where humans have a single evolving memory state, LLMs have access to all the states of their "memory" across time, and while past states can't be changed, the new state can: this is the current token's hidden state, and to form this new state they look both at the history of previous states and at the new information (the last token having been sampled, or external tokens from RAG or whatnot appended to the context).

    This is how progress is stored.

  20. I don't follow how this relates to what we are discussing. Autoregressive LLMs are able to plan within a single forward pass, and they are able to look back at their previous reasoning; they do not start anew at each token as you claimed.

    If you append tokens from another source, as in a turn-based conversation, then the LLM will process all the newly appended tokens in parallel while still being able to look back at its previous internal states (and thus its past reasoning/planning in latent space) from the already processed tokens, and it will then adjust the plan based on the new information.

    What happens to you as a human if you come up with a plan with limited information and new information is provided to you?
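
    A toy sketch of that multi-turn flow (random weights, single attention head, names invented for illustration): the cache persists across turns, each newly appended chunk of tokens is processed in one batched pass, and every new position still attends to the states stored for the earlier turn.

      import numpy as np

      rng = np.random.default_rng(0)
      DIM = 8
      Wq, Wk, Wv = (rng.standard_normal((DIM, DIM)) for _ in range(3))
      K_cache, V_cache = [], []                 # persists across turns

      def softmax(x):
          e = np.exp(x - x.max())
          return e / e.sum()

      def process(new_tokens):
          # Process a chunk of newly appended tokens "in parallel" (no dependency
          # between them beyond causal order), attending to the existing cache.
          start = len(K_cache)
          for x in new_tokens:
              K_cache.append(x @ Wk)
              V_cache.append(x @ Wv)
          K, V = np.stack(K_cache), np.stack(V_cache)
          outs = []
          for i, x in enumerate(new_tokens):    # each new position sees the cache...
              w = softmax(K[: start + i + 1] @ (x @ Wq) / np.sqrt(DIM))
              outs.append(w @ V[: start + i + 1])   # ...plus the new tokens before it
          return np.stack(outs)

      turn_1 = rng.standard_normal((4, DIM))    # toy embeddings of the first turn
      process(turn_1)                           # whatever "plan" forms here is now cached
      turn_2 = rng.standard_normal((3, DIM))    # new information appended later
      out = process(turn_2)                     # batched pass, still attending to turn 1's states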
