fchollet

  1. One interesting observation is that French-derived words in English tend to be fancier -- formal, sophisticated, higher-class -- while Germanic ones tend to be casual, everyday words.
  2. The first time a top lab spent millions trying to beat ARC was actually in 2021, and the effort failed.

    By the time OpenAI attempted ARC in 2024, a colossal amount of resources had already been expended trying to beat the benchmark. The OpenAI run itself cost several million dollars in inference compute alone.

    ARC was the only benchmark that highlighted o3 as having qualitatively different abilities compared to all models that came before. o3 is a case of a good approach meeting an appropriate benchmark, rather than an effort to beat ARC specifically.

  3. You can easily convert these tasks to token strings. The reason why ARC does not use language as part of its format is that it seeks to minimize the amount of prior knowledge needed to approach the tasks, so as to focus on fluid intelligence as opposed to acquired knowledge.

    All ARC tasks are built entirely on top of "Core Knowledge" priors, the kind of elementary knowledge that a small child has already mastered and that is possessed universally by all humans.
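
    For concreteness, such a conversion could look roughly like this (a minimal sketch: the JSON layout matches the public ARC task files, but the delimiters and prompt layout are just illustrative choices, not a prescribed format):

    ```python
    # Illustrative only: serialize an ARC grid (a list of rows of ints 0-9)
    # into a flat token string, then lay out a whole task as a text prompt.
    def grid_to_tokens(grid):
        return " | ".join(" ".join(str(cell) for cell in row) for row in grid)

    def task_to_prompt(task):
        # `task` follows the public ARC JSON layout:
        # {"train": [{"input": grid, "output": grid}, ...], "test": [...]}
        parts = []
        for pair in task["train"]:
            parts.append(f"input: {grid_to_tokens(pair['input'])} "
                         f"output: {grid_to_tokens(pair['output'])}")
        parts.append(f"input: {grid_to_tokens(task['test'][0]['input'])} output:")
        return "\n".join(parts)
    ```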

  4. The reason these tasks require fluid intelligence is that they were designed this way -- with task uniqueness/novelty as the primary goal.

    ARC 1 was released long before in-context learning was identified in LLMs (and designed before Transformer-based LLMs existed), so the fact that LLMs can't do ARC was never a design consideration. It just turned out this way, which confirmed our initial assumption.

  5. There have been some human studies on ARC 1 previously, I expect there will be more in the future. See this paper from 2021, which was one of the earliest works in this direction: https://arxiv.org/abs/2103.05823
  6. It's useful to know what current AI systems can achieve with unlimited test-time compute resources. Ultimately though, the "spirit of the challenge" is efficiency, which is why we're specifically looking for solutions that are at least within 1-2 orders of magnitude of human cost. The Kaggle leaderboard is very resource-constrained, and on the public leaderboard you need to use less than $10,000 in compute to solve 120 tasks (roughly $83 per task).
  7. ARC 3 is still spatially 2D, but it adds a time dimension, and it's interactive.
  8. > Who would be buying bitcoin right now?

    Well, maybe the US government? What if the US starts dedicating 10-15% of yearly federal receipts to serve as exit liquidity for Bitcoin holders?

  9. What all top models do is recombine at test time the knowledge they already have. So they all possess Core Knowledge priors. Techniques to acquire them vary:

    * Use a pretrained LLM and hope that relevant programs will be memorized via exposure to text data (this doesn't work that well)

    * Pretrain a LLM on ARC-AGI-like data

    * Hardcode the priors into a DSL

    > Which is to say, a data augmentation approach

    The key bit isn't the data augmentation but the TTT. TTT is a way to lift the #1 issue with DL models: that they cannot recombine their knowledge at test time to adapt to something they haven't seen before (strong generalization). You can argue whether TTT is the right way to achieve this, but there is no doubt that TTT is a major advance in this direction.

    The top ARC-AGI models perform well not because they're trained on tons of data, but because they can adapt to novelty at test time (usually via TTT). For instance, if you drop the TTT component you will see that these large models trained on millions of synthetic ARC-AGI tasks drop to <10% accuracy. This demonstrates empirically that ARC-AGI cannot be solved purely via memorization and interpolation.
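
    Schematically, the TTT step looks something like this (a sketch only, not any particular team's pipeline; `base_model`, `augment`, and `encode` are hypothetical placeholders):

    ```python
    # Sketch of test-time training (TTT): before predicting on a new task,
    # briefly fine-tune a *copy* of the base model on that task's few
    # demonstration pairs (optionally augmented), then predict on the test input.
    import copy

    def solve_with_ttt(base_model, task, steps=100):
        model = copy.deepcopy(base_model)   # adapt a copy, keep shared weights intact
        demos = task["train"]               # the task's demonstration pairs
        # Hypothetical `augment`: e.g. rotations, reflections, color permutations.
        batch = [augment(pair) for pair in demos for _ in range(8)]
        for _ in range(steps):
            # Hypothetical gradient update on the augmented demonstrations.
            model.train_step([encode(p["input"]) for p in batch],
                             [encode(p["output"]) for p in batch])
        return model.predict(encode(task["test"][0]["input"]))
    ```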

  10. It is correct that the first model to beat ARC-AGI will only be able to handle ARC-AGI tasks. However, the idea is that the architecture of that model should be repurposable for arbitrary problems. That is what makes ARC-AGI a good compass towards AGI (unlike chess).

    For instance, current top models use TTT, which is a completely general-purpose technique that provides the most significant boost to DL models' generalization power in recent memory.

    The other category of approach that is working well is program synthesis -- if pushed to the extent that it could solve ARC-AGI, the same system could be redeployed to solve arbitrary programming tasks, as well as tasks isomorphic to programming (such as theorem proving).
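
    To make the program-synthesis framing concrete, a toy version might look like this (a sketch; the three-primitive DSL here is far smaller than anything competitive):

    ```python
    # Toy program synthesis over a tiny grid DSL: enumerate short compositions
    # of primitives and keep any program consistent with all demonstration pairs.
    from itertools import product

    PRIMITIVES = {
        "identity":  lambda g: g,
        "flip_h":    lambda g: [row[::-1] for row in g],
        "flip_v":    lambda g: g[::-1],
        "transpose": lambda g: [list(col) for col in zip(*g)],
    }

    def search_program(demos, max_depth=3):
        for depth in range(1, max_depth + 1):
            for names in product(PRIMITIVES, repeat=depth):
                def program(grid, names=names):
                    for name in names:
                        grid = PRIMITIVES[name](grid)
                    return grid
                if all(program(d["input"]) == d["output"] for d in demos):
                    return names  # first program consistent with every demo
        return None
    ```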

  11. I will never enter ARC Prize myself, since I'm organizing it. But the reason I made ARC in the first place was to work on it myself! I intend to solve it (outside of the context of the competition).
  12. ARC was never supposed to grade LLMs! I designed the ARC format back when LLMs weren't a thing at all. It's a test of AI systems' ability to generalize to novel tasks.
  13. I believe the MindsAI solution does feature novel ideas that do indeed lead to better generalization (test-time fine-tuning). So it's definitely the kind of research that ARC was supposed to incentivize -- things are working as intended. It's not a "hack" of the benchmark.

    And yes, they do use a lot of synthetic pretraining data, which is much less interesting research-wise (no progress on generalization that way...) but ultimately it's on us to make a robust benchmark. MindsAI is playing by the rules.

  14. My go-to DL stack is Keras 3 + JAX. W&B is a great tool as well. I think JAX is generally under-appreciated compared to how powerful it is.
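
    For anyone curious what that stack looks like in practice (a minimal sketch; the layer sizes are arbitrary):

    ```python
    # Pick the JAX backend before importing Keras; everything after that is
    # backend-agnostic Keras 3 code.
    import os
    os.environ["KERAS_BACKEND"] = "jax"

    import keras

    model = keras.Sequential([
        keras.Input(shape=(784,)),
        keras.layers.Dense(64, activation="relu"),
        keras.layers.Dense(10, activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
    ```
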
  15. Keras is now standalone and multi-backend again. Keras weights files from older versions are still loadable, and Keras code from older versions is still runnable (on any backend, as long as it only used Keras APIs)!

    In general the ability to move across backends makes your code much longer-lived: you can take your Keras models with you (on a new backend) after something like TF or PyTorch stops development. Also, it reduces version compatibility issues, since tf.keras 2.n could only work with TF 2.n, but each Keras 3 version can work with a wide range of older and newer TF versions.
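
    A minimal sketch of what that portability looks like (the two halves would run as separate processes, since the backend is fixed at import time):

    ```python
    # Step 1: build and save under one backend (TensorFlow here).
    import os
    os.environ["KERAS_BACKEND"] = "tensorflow"
    import keras

    model = keras.Sequential([
        keras.Input(shape=(4,)),
        keras.layers.Dense(1),
    ])
    model.save("model.keras")

    # Step 2 (in a fresh process): reload the same .keras file under JAX.
    #   os.environ["KERAS_BACKEND"] = "jax"
    #   model = keras.saving.load_model("model.keras")
    ```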

  16. This roughly aligns with my timeline. ARC will be solved within a couple of years.

    There is a distinction between solving ARC, creating AGI, and creating an AI that would represent an existential risk. ARC is a stepping stone towards AGI, so the first model that solves ARC should have taught us something fundamental about how to create truly general intelligence that can adapt to never-seen-before problems, but it will likely not itself be AGI (due to being specialized to the ARC format, for instance). Its architecture could likely be adapted into a genuine AGI after a few iterations -- a system capable of solving novel scientific problems in any domain.

    Even this would not clearly lead to an "intelligence explosion". The points in my old article on intelligence explosion are still valid -- while AGI will lead to some level of recursive self-improvement (as many other systems do!), the available evidence just does not point to this loop triggering an exponential explosion (due to diminishing returns, and the fact that "how intelligent one can be" has inherent limits set by things outside the AI agent itself). And intelligence on its own, without executive autonomy or embodiment, is just a tool in human hands, not a standalone threat. It can certainly present risks, like any other powerful technology, but it isn't a "new species" out to get us.

  17. Yes to both.
  18. Actually, `keras.distribution` is straightforward to implement with TF DTensor and with the experimental PyTorch SPMD API. We haven't done it yet, first because these APIs are experimental (only JAX is mature), and second because all the demand for large-model distribution at Google has been for the JAX backend.
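
    For reference, the JAX-backend path looks roughly like this (a sketch; the distribution is plain data parallelism and the layer sizes are arbitrary):

    ```python
    # Data-parallel distribution via the Keras 3 distribution API (JAX backend).
    import os
    os.environ["KERAS_BACKEND"] = "jax"
    import keras

    devices = keras.distribution.list_devices()            # all visible accelerators
    keras.distribution.set_distribution(
        keras.distribution.DataParallel(devices=devices)   # replicate weights, shard batches
    )

    model = keras.Sequential([
        keras.Input(shape=(128,)),
        keras.layers.Dense(256, activation="relu"),
        keras.layers.Dense(10),
    ])
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
    ```
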
  19. Enjoy the book!
  20. That's what I plan on doing -- so I would say yes :)
