On the Dwarkesh/Chollet Podcast, and the cruxes of scaling to AGI

By JWS 🔸 @ 2024-06-15T20:24 (+72)

Overview

Recently Dwarkesh Patel released an interview with François Chollet (hereafter Dwarkesh and François). I thought this was one of Dwarkesh's best recent podcasts, and one of the best discussions that the AI Community has had recently. Instead of subtweeting those with opposing opinions or vagueposting, we actually got two people with disagreements on the key issue of scaling and AGI having a good faith and productive discussion.[1]

I want to explicitly give Dwarkesh a shout-out for having such a productive discussion (even if I disagree with him on the object level) and having someone on who challenges his beliefs and preconceptions. Often when I think of different AI factions getting angry at each other, and the quality of AI risk discourse plummeting, I'm reminded of Scott's phrase "I reject the argument that Purely Logical Debate has been tried and found wanting. Like GK Chesterton, I think it has been found difficult and left untried." More of this kind of thing please, everyone involved.

I took notes as I listened to the podcast, and went through it again to make sure I got the key claims right. I grouped them into similar themes, as Dwarkesh and François often went down a rabbit-hole to pursue an interesting point or crux and later returned to the main topic.[2] I hope this can help readers navigate to their points of interest, or make the discussion clearer, though I'd definitely recommend listening/watching for yourself! (It is long though, so feel free to jump around the doc rather than slog through it in one go!)

Full disclosure, I am sceptical of a lot of the case for short AGI timelines these days, and thus also sceptical of claims that x-risk from AI is an overwhelmingly important thing to be doing in the entire history of humanity. This of course comes across in my summarisation and takeaways, but I think acknowledging that openly is better than leaving it to be inferred, and I hope this post can be another addition in helping improve the state of AI discussion both in and outside of EA/AI-Safety circles. It is also important to state explicitly here that I might very well be wrong! Please take my perspective as just that, one perspective among many, and do not defer to me (or to anyone really). Come to your own conclusions on these issues.[3]


The Podcast 

All timestamps are for the YouTube video, not the podcast recording. I've tried to cover the podcast's main topics in the order they first appear, and then track them through the transcript. I include links to some external resources, passing thoughts in footnotes, and fuller thoughts in block-quotes.

Introducing the ARC Challenge

The podcast starts with an introduction of the ARC Challenge itself, and Dwarkesh is happy that François has drawn a line in the sand as an LLM sceptic instead of moving the goalposts [0:02:27]. François notes that LLMs struggle on ARC, in part because its challenges are novel and meant not to be found on the internet; instead, the approaches that perform better are based on 'Discrete Program Search' [0:02:04]. He later notes that ARC puzzles are not complex and require very little knowledge to solve [0:25:45].

Dwarkesh agrees that the problems are simple and thinks it's an "intriguing fact" that ARC problems are simple for humans but LLMs are bad at them, and he hasn't been convinced by the explanations he's got from LLM proponents/scaling maximalists about why that is [0:11:57]. Towards the end François mentions in passing that big labs tried ARC but didn't share their results because they're bad [1:08:28].[4]

One of ARC's main selling points is that humans, even children, are clearly meant to do well at it [0:12:27], but Dwarkesh does push on this point, suggesting that while smart humans will do well, those of average intelligence will have a mediocre score. François says that they tried with 'average humans' and got 85%, but this was via MTurk and Dwarkesh is sceptical that this actually captures the 'average human' [0:12:47].[5]

Finally, Mike Knoop has teamed up with François to increase the prize pool for solving ARC to ~$1 million. It's currently being hosted on Kaggle [1:09:30], though there are limitations on the models that you can use and some compute restrictions (so do check this section for concrete details). Mike thinks that this is an important moment of 'contact with reality', and will up the prize if it isn't solved within 3 months [1:24:33].

Writing this section I actually think that the podcast didn't dive too deep into what ARC actually is. François originally introduced the idea in his paper On the Measure of Intelligence, which is well worth a read. There's a GitHub repo with a public train/test set, though François has kept the private test set private. The idea is that each task gives you ~3 example cases: an input grid and an output grid related by an unknown transformation rule. You are then given a final test input and have to produce the correct output grid. You score 1 for outputting the correct grid, and 0 for any other answer.
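To make that format concrete, here's a rough sketch of what a task and its all-or-nothing scoring look like (my own illustration; the exact JSON schema in the official repo may differ in details):

```python
# A minimal sketch of an ARC-style task and its all-or-nothing scoring.
# (Illustrative only; the official repo's JSON schema may differ in details.)

Grid = list[list[int]]  # each cell is a colour index, roughly 0-9

example_task = {
    "train": [  # ~3 demonstration pairs governed by a hidden rule
        {"input": [[0, 1], [1, 0]], "output": [[1, 0], [0, 1]]},
        {"input": [[2, 0], [0, 2]], "output": [[0, 2], [2, 0]]},
        {"input": [[3, 3], [0, 0]], "output": [[0, 0], [3, 3]]},
    ],
    "test": [{"input": [[0, 4], [4, 0]]}],  # produce the matching output grid
}

def score(predicted: Grid, expected: Grid) -> int:
    """Exact match scores 1; any other grid (even one wrong cell) scores 0."""
    return int(predicted == expected)

# If the hidden rule here is "swap the two rows":
print(score([[4, 0], [0, 4]], [[4, 0], [0, 4]]))  # 1
print(score([[0, 4], [4, 0]], [[4, 0], [0, 4]]))  # 0
```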

To get a sense of what the tests are like, you can play along yourself! A couple of good interactive versions can be found here and here. Very quickly I think you should get the intuition of "Huh, why are state-of-the-art models trained with trillions of tokens really bad at this obvious thing I can just 'get'?"

Other excellent recent research along these lines has been spearheaded by Melanie Mitchell,[6] so I'll link to some of her work:

Should we expect LLMs to "saturate" it?

At various points, Dwarkesh poses a challenge to François by asking why we shouldn't just expect the best LLM within a year to 'saturate'[7] the test [0:01:05, 0:14:04, 1:13:47]. François thinks that the answer is empirical [0:09:52, 1:14:09], but he is sceptical that LLM-based/LLM-only/scaling-driven approaches will be able to crack ARC, and says that those kinds of techniques have reached a plateau [01:04:40]. Mike notes that he expected the same thing when he first heard of ARC but has come around to it getting at something different than those other benchmarks [1:02:57], and that the longer ARC survives the more plausible the story of 'progress in LLMs has plateaued'[8] starts to look [1:24:52].

Mike notes that as the public train and test sets are on GitHub there's an asterisk on any result on this metric, as they may have been included in the training data of the LLMs taking the test [1:20:05]. Dwarkesh counters by referencing Scale AI's recent work on data contamination in the GSM8K benchmark: while many models were overfit, the leading ones didn't suffer once this was corrected for [1:21:07].

A few times, Dwarkesh asks if multi-modal models will be able to perform better than text-only models [0:08:43, 0:09:44, 1:24:25].[9] François also says that fine-tuning LLMs on ARC shows that they can parse the inputs, so that isn't actually the problem. The reason that they don't do well is instead the unfamiliarity of the tasks the LLMs are being asked to solve [0:11:12].

There are a few times in the podcast where Dwarkesh mentions talking about ARC personally with friends (often working at leading AI labs in San Francisco) and being unconvinced by their answers, or finding those friends overconfident in the ability of LLMs to solve ARC puzzles [1:11:31, 1:27:13]. In the latter case, they went from saying "of course LLMs can solve this" to getting 25% with Claude 3 Opus on the publicly available test set.

This seems to be a bit of reference class tennis, where Dwarkesh is looking at the recent performance of leading models on benchmarks and saying our prior on ARC should be similar: it's a harder test, but it will suffer the same fate. In contrast, I think LLM sceptics like François (and myself) are taking a more Lakatos-esque defence, saying that the fact that LLMs can do well on things like MMLU but terribly on ARC is rather a falsification of an auxiliary hypothesis ('current benchmarks are actually measuring progress towards AGI'), while the core hypothesis ('current approaches will not scale all the way to AGI') remains valid.

At heart, a huge chunk of this debate really is a question of epistemology and philosophy of science. Still, I suppose reality will adjudicate for the most extreme hypotheses either way within the next few years.

Has Jack Cole shown LLMs can solve ARC?

Dwarkesh brings up the example of Jack Cole (who, along with Mohamed Osman, has achieved SOTA performance on the hidden test set - 34%) a few times in the discussion as another way to push François on whether LLMs can perform well [0:13:58, 1:13:55].

François pushes back here, saying that what Jack is doing is actually trying to get active inference to work, and thereby get the LLMs to run program synthesis [1:14:15]. For the particulars, François notes two key things about Jack's approach: the high performance is the result of pre-training on millions of synthetically generated ARC tasks, combined with test-time fine-tuning so the LLM can actually learn from the test case [0:14:25].

I think this was important enough to make its own separate section. It brings up more clearly where François' scepticism comes from. When you give an LLM a prompt you are doing static inference. There's a conversion of your prompt into a numerical format, then ridiculous amounts of matrix multiplication, and then an output. 

What does not happen in this process is any change to the weights of the model, they have essentially been 'locked'. Thus, in some fundamental sense, LLMs never 'learn' anything once their training regime ends. Now, while the term active inference can lead you down a free-energy-principle rabbithole, I think François is using it to mean 'an AI system that can update its beliefs and internal states efficiently and quickly given novel inputs', and he thinks this ability to handle novelty is a critical part of dealing with the world effectively.
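To illustrate what I mean by 'static inference', here's a toy sketch in PyTorch (with a tiny stand-in model rather than a real LLM, so purely illustrative) showing that a forward pass reads the frozen weights but never writes to them:

```python
# A toy illustration of 'static inference': the forward pass reads the frozen
# weights but never updates them. (Sketch only; real LLM serving stacks differ,
# but the weights-stay-fixed property is the same.)
import torch
import torch.nn as nn

model = nn.Linear(8, 8)                    # stand-in for a trained transformer
model.eval()                               # inference mode: no dropout etc.
before = model.weight.detach().clone()     # snapshot of the 'locked' weights

with torch.no_grad():                      # no gradients, no weight updates
    prompt = torch.randn(1, 8)             # stand-in for a tokenised prompt
    output = model(prompt)                 # the matrix multiplication step

assert torch.equal(model.weight, before)   # nothing was learned from the prompt
```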

I think in his point of view, getting an AI system to do this would be a different paradigm to the current 'Scale LLMs larger' paradigm. And I have to say, I'm very heavily on François' side here, rather than Dwarkesh's.

Is there a difference between 'Skill' and 'Intelligence'?

Early on Dwarkesh asks: if models have so much in their training distribution that we can't tell whether a test case is in distribution or not, does the distinction even matter [0:04:32]? François fundamentally rejects the premise here, saying that you can never pre-train on everything you need, because the world is constantly changing [0:05:07]. Some animals don't need this ability,[10] but humans do. Humans are born with limited knowledge but the ability to learn efficiently in the face of things we've not seen before [0:06:32]. Instead, François draws a distinction between skill and intelligence, and keeps returning to the concept to answer Dwarkesh's queries throughout the discussion [0:19:33, 0:21:10].

One of the main ways ARC is meant to show the general intelligence of humans is by demonstrating our extreme sample efficiency. Dwarkesh challenges this by trying to analogise human formal education to the pre-training of a transformer model [0:22:00]. François' response is that building blocks of 'core knowledge' are necessary for general reasoning, but these are mostly acquired early in life [0:22:55, 1:32:01]. Dwarkesh responds that some of the geometric patterns that allow him to solve ARC puzzles are things he's seen throughout his life [1:30:54].

Dwarkesh says it's compatible with François' story that, even if the reasoning is local, it will get better as the size of the model increases [0:23:05]. François agrees and Dwarkesh is confused. François then clarifies that he's talking about generality [0:23:44]. He specifically claims:

"General intelligence is not task-specific skill scaled up to many skills, because there is an infinite space of possible skills. General intelligence is the ability to approach any problem, any skill, and very quickly master it using very little data. This is what makes you able to face anything you might ever encounter. This is the definition of generality." [0:23:56]

Dwarkesh responds by arguing that the ability of Gemini 1.5 to translate Kalamang, a language with fewer than 200 speakers,[11] shows that larger versions of these models are gaining the capacity to generalise efficiently [0:24:37]. Chollet essentially rolls to disbelieve, saying that if this were true then these models would be performing well on ARC puzzles, since they are not complex, and much less complex than Kalamang translation.

François' distinction between these two concepts is in Section I.2 of "On the Measure of Intelligence". Part of the disagreement in this (and the next) section is that Dwarkesh seems to be taking an implicit frame of intelligence-as-outputs whereas François is using one of intelligence-as-process.

This may be a case where the term 'Artificial General Intelligence' is again causing more confusion than clarity. It's certainly at least conceivable for there to be transformative effects from AI without it ever clearly meeting any concept of 'generality', and I think Dwarkesh is focusing on those effects, which is why he tries to draw François into scenarios where most workers have been automated, for example.

I'm not sure who comes off best in the exchange about Kalamang translation. It seems very odd for frontier models to be able to do that but not solve ARC, but it does point to Dwarkesh's underlying claim that if we can fit enough things into a model's training distribution, it could be 'good' enough at many of them even if it doesn't meet François' definition of AGI. Still, in my mind the modal outcome there is more likely to be one of 'economic disruption' and less of 'catastrophic/existential' risk.

Are LLMs 'Just' Memorising?

This section is probably the most cruxy of the entire discussion: it's where Dwarkesh seems to get the most frustrated, and the two go in circles a bit. This is where two epistemic worldviews collide, and it is probably the most important part to focus on.

François casts doubt on the scaling laws by saying that they are based on benchmark performance, and that these benchmarks are able to be solved by memorisation. He summarises the leading models as 'interpolative databases', and says that their performance on such benchmarks will increase with more scale [0:17:53]. He doesn't think this is what we want though, and later on in the podcast even states "with enough scale, you can always cheat" [0:40:47].[12]

Dwarkesh, however, denies the premise that all they are doing is memorisation and asks François why we could not in principle just brute-force intelligence [0:31:06, 0:40:49]. François actually agrees that this would be possible in a fully static world, but says that the nature of the world is one where the future is unknown and such complete knowledge is not possible [0:42:16]. Dwarkesh gets his most annoyed in the podcast at this point and accuses François of playing semantics:

"you're semantically labeling what the human does as skill. But it's a memorization when the exact same skill is done by the LLM, as you can measure by these benchmarks." [0:42:58]

François says that memorisation could still be used to automate many things, as long as they are part of a 'static distribution'. Dwarkesh presses on whether this might include most jobs today, and François answers 'potentially' [0:44:31]. He clarifies that LLMs can be useful, and that he has been a noted proponent of deep learning,[13] but that this would not lead to general intelligence, and that it may be possible to decouple the automation of many jobs from the creation of a general intelligence [0:44:48].

Dwarkesh's counter-hypothesis is that creativity could just be interpolation in a high-enough dimension [0:35:24, 0:36:27]. He claims that models are vastly underparameterised compared to the human brain [0:32:59, 0:43:34],[14] and points to the phenomenon of 'grokking'[15] as an example of generalisation within current models, arguing that as models get bigger the compression will lead to generalisation [0:46:29, 0:48:14]. François actually agrees that LLMs have some degree of generalisation due to the compression of their training data, but says that grokking is not a new phenomenon and is instead an expression of the minimum description length principle [0:47:04, 0:48:58].

Dwarkesh argues that transformers might, at some basic level, already be doing program synthesis. François says that while it may be possible, we should then expect them to do well on ARC given that the 'solution program' for any ARC task is simple, and Dwarkesh seemingly concedes the point [0:38:09]. François reinforces his point of view by pointing to the example of Caesar ciphers: current LLMs fail to generalise the solution, and only handle the specific shift values that are common in their training data [0:26:38].[16]
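To give a sense of how small the 'solution program' is in that cipher example, here's a generic Caesar shift in a few lines (my own illustration, not something from the podcast); the reported failures are on arbitrary shift values rather than the well-memorised ROT13:

```python
# A generic Caesar cipher in a few lines. The general algorithm is trivial and
# works for any shift value; the claim is that LLMs mostly handle the shifts
# that are common in their training data (like ROT13) rather than the rule itself.
def caesar(text: str, shift: int) -> str:
    out = []
    for ch in text:
        if ch.isalpha():
            base = ord('A') if ch.isupper() else ord('a')
            out.append(chr((ord(ch) - base + shift) % 26 + base))
        else:
            out.append(ch)
    return "".join(out)

print(caesar("Attack at dawn", 13))  # the familiar ROT13 case
print(caesar("Attack at dawn", 7))   # an arbitrary shift works just the same
```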

Dwarkesh asks why François sees the current paradigm as intrinsically limited, and François responds that it is due to the fundamental nature of the model being a 'big parametric curve', which is thus limited to only ever generalising within distribution [0:49:04, 0:50:16]. In his following explanation, François argues that deep learning and discrete program search are essentially opposites and that progress will require combining the two [0:49:35].

Ok, this is the big one:

There are some points that get mixed up here, and the confusion stems from the skill/intelligence or outcome/process distinction mentioned in my comments to the last section.

First, François agrees with Dwarkesh that it is at least conceptually possible for an AI system in a 'memorisation regime' (e.g. GPT-4+N) to be highly skilful, useful, and automate many jobs. Thus, at some level, François thinks that this question is an empirical one.

Second, François thinks that the world that exists has so much irreducible change and complexity that going beyond memorisation is necessary to function at any level of complex capability. All humans have this,[17] and AGI must have this quality too, but the LLM family of models doesn't. The empirical anchors he points to are things like the simplicity of ARC, or LLMs' failure with Caesar ciphers.

Dwarkesh is annoyed because he thinks that François is conceptually defining LLM-like models as incapable of generalisation, whereas I think François' more fundamental claim is about the unpredictability and irreducible complexity of the world itself. If he didn't believe the world was that irreducibly complex, he'd be much more of a scaling maximalist.

I think, on the object level of memorisation vs generalisation vs reasoning, François comes out on top in this crucial discussion. François has a deep knowledge of how these models work, whereas Dwarkesh is not at the same level of expertise and is pointing to an explanation of observations. When François counters, Dwarkesh has little to support his point apart from repeating that enough interpolation/memorisation is the same thing as generalisation/intelligence, which rather assumes the answer to the whole issue at hand, and moves me much less than François' counterpoints.

I think Dwarkesh's view of general intelligence as simply being a patchwork of local generalisation is a rather impoverished way of looking at it (and at other human beings), but I've separated that out into the following section. I think when Dwarkesh talks about benchmarks he's assuming they're a reliable guide, but I think that's very much open to question.

Are the missing pieces to AGI easy or hard to solve?

At multiple points, Dwarkesh notes that even his LLM maximalist friends do not believe 'scale is all you need', but that scale is the most important thing and that adding on the additional extras needed to get to AGI beyond scale will be the easy part [0:16:28, 0:55:09]. François disagrees, and says that the hard part of intelligence is the system 2 part [0:16:57, 0:56:21].

Dwarkesh refers back to a previous podcast where one of his guests[18] believes that intelligence is just hierarchically associated memories [0:57:04]. François doesn't quite seem to get what Dwarkesh is getting at, and they end up not diving into this more  because Dwarkesh thinks that they are going in circles [0:59:37].

This is another critical crux, and one where I am again much more inclined to take Chollet's point-of-view than the scaling maximalists'. Again, in the previously mentioned podcast, Trenton says "most intelligence is pattern matching" and, I don't know, that seems like a really contested claim?[19] It just seems like many of the scaling maximalists have assumed that System 2/Common Sense Reasoning will be the easy part, but I very much disagree.

However, much of my thinking on this issue of intelligence/explanation/creativity has been highly influenced by David Deutsch and his works. I'd highly recommend buying and reading The Beginning of Infinity, which if you disagree with might mean we have some core epistemological differences.

In other places, the concept that I think is highly difficult to get working (compared to more scaling) is referred to as 'Discovering actions' by Stuart Russell in Human Compatible,[20] or 'savannah-to-boardroom generalization' by Ajeya Cotra in this LessWrong dialogue (though my point-of-view is very much closer to Ege's in that discussion).

What would it take for François to change his mind?

At various points in the podcast, Dwarkesh asks what would happen to François' views if LLMs do succeed at ARC, or what he would need to see to think that the scaling paradigm is on the path to AGI [0:02:34, 0:03:36]. Mostly, François responds by saying that this is an empirical question (i.e. he'd change his mind if he saw the evidence), but he also makes the point that it matters how the performance on these benchmarks was achieved, and that he'd want to see cases where models can adapt on the fly and do something truly novel that is not in their training data [0:02:51, 0:03:44].

Some Odds and Ends

Both Mike and François are sad about the closing down of previously openly shared research in Frontier AI work [1:06:16], and François goes further to lay the blame at OpenAI's feet and says that this has set AGI back 5-10 years[21] [1:07:08].

A couple of points also stood out to me as quite weird but perhaps lacking context or inferential distance:


Takeaways

I tried, as much as reasonable, to keep my fingerprints off the summary above, apart from my commentary at the end of each sub-section. However, thinking about this episode and its response has been both enlightening and disappointing for me. These takes are fairly hot, but I also didn't want to make them a post in-and-of themselves, so I apologise if they aren't fully fleshed out, especially if you disagree with me!

And that's all folks! Thanks for reading this far if you have. I'll try to respond in the comments to discussions as much as possible, but I am probably going to need a bit of a break after writing all of this, and I'm on holiday for the next few days.

  1. ^

    At the end of the podcast Dwarkesh explicitly says he was playing devil's advocate, but I think he is arguing for a pro-scaling point-of-view. His post Will scaling work? provides a clearer look at his perspective, and I highly recommend reading it.

  2. ^

    There's a second section with Mike Knoop (hereafter Mike) which is more focused on the ARC Prize relaunch; I have fewer notes on it but still included them.

  3. ^

    At least, come to your own conclusion on how you stand regarding them. Not saying everyone has to become an expert in mechanistic interpretability.

  4. ^

    If there's anyone reading who's in a position to verify this, can we? Even the fact of ~poor performance from the leading labs would support Francois' prior and not Dwarkesh's.

  5. ^

    The most relevant research paper I could find is this one, where children were able to outperform the average LLM on a simplified ARC test from around the age of 6 onwards. Still, these were kids visiting the 'NEMO science museum in Amsterdam', so again it's not really a sample of median humans.

  6. ^

    Perhaps not-coincidentally, a noted critic of AI x-risk

  7. ^

    He essentially means getting expert performance: solving close enough to ~100% that there is not much signal left in a model's ARC score.

  8. ^

    Or has even... dare we say... hit a wall?

  9. ^

    I'm a bit confused on this point, and also about what 'natively multi-modal' means, or at least why Dwarkesh is expecting it to be such a game changer? Aren't GPT4o and Gemini already multimodal models that perform badly at ARC?

  10. ^

    Chollet seems to be referring to cases like the Sphex wasp, though how accurate that anecdote actually is is up for debate. But to me, even simple organisms showing adaptive behaviour beyond the capacity of LLMs is even more reason to be sceptical about projections of imminent AGI.

  11. ^

    See section 4.2.2.1

  12. ^

    He is gesturing at the notion of shortcut learning

  13. ^

    He has literally written a textbook about it

  14. ^

    The implication here is that as their scale increases, they'll be able to achieve human level extrapolation via interpolation.

  15. ^

    In the linked paper, it's defined as a phenomenon that's observed "where models abruptly transition to a generalizing solution after a large number of training steps, despite initially overfitting"

  16. ^

    Even I, as an LLM sceptic, was sceptical of this claim by François, but it's actually true!

  17. ^

    There was an interesting exchange between François and Subbarao Kambhampati on whether this also holds for civilisation, which you can read here

  18. ^

    Trenton Bricken, Member of Technical Staff on the Mechanistic Interpretability team at Anthropic

  19. ^

    To be very fair to him, Trenton does introduce this as a 'hot take'

  20. ^

    In the chapter called "How Might AI Progress in the Future?", Russell says "I believe this capability is the most important step needed to reach human-level AI." though he also says this could come at any point given a breakthrough. Perhaps, but I still think getting to that breakthrough will be much more difficult than scaling transformers to ever-larger sizes, especially if scaling maximalism becomes ideologically dominant to the exclusion of alternative paradigms.

  21. ^

    Given François' sceptical position, I wouldn't put too much stock in taking his timeline adjustments too concretely.

  22. ^

    When the Chinese Room comes up, for instance, it's instantly dismissed with the systems reply, despite Searle addressing that in his original paper.

  23. ^

    I actually can't recommend Xuan's work and perspective highly enough. My route to LLM scepticism really picked up momentum with this thread, I think.

  24. ^

    I'm calling out particular examples here because I think it's good to do so rather than to vaguepost, but please see my final bullet point in this section. I think my issue might be with OpenPhil's epistemic perspective on AI culturally, rather than with any of the individuals working there.

  25. ^

    Bongard Problems were developed in the 1960s, and are very similar to ARC puzzles. There are a few shots indicating some kind of rule, and you'll solve the test once you can identify the rule.


Steven Byrnes @ 2024-06-17T03:16 (+18)

I agree with Chollet (and OP) that LLMs will probably plateau, but I’m also big into AGI safety—see e.g. my post AI doom from an LLM-plateau-ist perspective.

(When I say “AGI” I think I’m talking about the same thing that you called digital “beings” in this comment.)

Here are a bunch of agreements & disagreements.

if François is right, then I think this should be considered strong evidence that work on AI Safety is not overwhelmingly valuable, and may not be one of the most promising ways to have a positive impact on the world.

I think François is right, but I do think that work on AI safety is overwhelmingly valuable.

Here’s an allegory:

There’s a fast-breeding species of extraordinarily competent and ambitious intelligent aliens. They can do science much much better than Einstein, they can run businesses much much better than Bezos, they can win allies and influence much much better than Hitler or Stalin, etc. And they’re almost definitely (say >>90% chance) coming to Earth sooner or later, in massive numbers that will keep inexorably growing, but we don’t know exactly when this will happen, and we also don’t know in great detail what these aliens will be like—maybe they will have callous disregard for human welfare, or maybe they’ll be great. People have been sounding the alarm for decades that this is a big friggin’ deal that warrants great care and advanced planning, but basically nobody cares.

Then some scientist Dr. S says “hey those dots in the sky—maybe they’re the aliens! If so they might arrive in the next 5-10 years, and they’ll have the following specific properties”. All of the sudden there’s a massive influx of societal interest—interest in the dots in particular, and interest in alien preparation in general.

But it turns out that Dr. S was wrong! The dots are small meteors. They might hit earth and cause minor damage but nothing unprecedented. So we’re back to not knowing when the aliens will come or what exactly they’ll be like.

Is Dr. S’s mistake “strong evidence that alien prep is not overwhelmingly valuable”? No! It just puts us back where we were before Dr. S came along.

(end of allegory)

(Glossary: the “aliens” are AGIs; the dots in the sky are LLMs; and Dr. S would be a guy saying LLMs will scale to AGI with no additional algorithmic insights.) 

 It would make AI Safety work less tractable

If LLMs will plateau (as I expect), I think there are nevertheless lots of tractable projects that would help AGI safety. Examples include:

It seems that many people in Open Phil have substantially shortened their timelines recently (see Ajeya here).

For what it’s worth, Yann LeCun is very confidently against LLMs scaling to AGI, and yet LeCun seems to have at least vaguely similar timelines-to-AGI as Ajeya does in that link.

Ditto for me.

See also my discussion here (“30 years is a long time. A lot can happen. Thirty years ago, deep learning was an obscure backwater within AI, and meanwhile people would brag about how their fancy new home computer had a whopping 8 MB of RAM…”)

To be clear, you can definitely find some people in AI safety saying AGI is likely in <5 years, although Ajeya is not one of those people. This is a more extreme claim, and does seem pretty implausible unless LLMs will scale to AGI.

I think this makes me very concerned about a strong ideological and philosophical bubble in the Bay regarding these core questions of AI.

Yeah some examples would be:

Many ≠ All! But to the extent that these things happen, I’m against it, and I do complain about it regularly.

(To be clear, I’m not opposed to contingency-planning for the possibility that LLMs will scale to AGIs. I don’t expect that contingency to happen, but hey, what do I know, I’ve been wrong before, and so has Chollet. But I find that these kinds of claims above are often stated unconditionally. Or even if they’re stated conditionally, the conditionality is kinda forgotten in practice.)

I think it’s also important to note that these habits above are regrettably common among both AI pessimists and AI optimists. As examples of the latter, see me replying to Matt Barnett and me replying to Quintin Pope & Nora Belrose.

By the way, this might be overly-cynical, but I think there are some people (coming into the AI safety field very recently) who understand how LLMs work but don’t know how (for example) model-based reinforcement learning works, and so they just assume that the way LLMs work is the only possible way for any AI algorithm to work.

JWS @ 2024-06-24T16:06 (+5)

Hey Steven! As always I really appreciate your engagement here. I'm going to have to simplify a lot, but I appreciate your links[1] and I'm definitely going to check them out 🙂

I think François is right, but I do think that work on AI safety is overwhelmingly valuable.

Here’s an allegory:

I think the most relevant disagreement that we have[2] is the beginning of your allegory. To indulge it, I don't think we have knowledge of the intelligent alien species coming to earth, and to the extent we have a conceptual basis for them we can't see any signs of them in the sky. Pair this with the EA concern that we should be concerned about the counterfactual impact of our actions, and that there are opportunities to do good right here and now,[3] and it shouldn't be a primary EA concern.

Now, what would make it a primary concern is if Dr S is right and the aliens are spotted on their way, but I don't think he's right. And, to stretch the analogy to breaking point, I'd be very upset if, after turning my telescope to the co-ordinates Dr S mentioned and seeing meteors instead of spaceships, significant parts of the EA movement were still wanting more funding to construct the ultimate-anti-alien-space-laser or do alien-defence research instead of buying bednets.

(When I say “AGI” I think I’m talking about the same thing that you called digital “beings” in this comment.)

A secondary crux I have is that a 'digital being' in the sense I describe, and possibly the AGI you think of, will likely exhibit certain autopoietic properties that make it significantly different from either the paperclip maximiser or a 'foom-ing' ASI. This is highly speculative though, based on a lot of philosophical intuitions, and I wouldn’t want to bet humanity’s future on it at all in the case where we did see aliens in the sky.

To be clear, you can definitely find some people in AI safety saying AGI is likely in <5 years, although Ajeya is not one of those people. This is a more extreme claim, and does seem pretty implausible unless LLMs will scale to AGI.

My take on it, though I admit it is driven by selection bias on Twitter, is that many people in the Bay-Social-Scene are buying into the <5 year timelines. Aschenbrenner for sure, Kokotajlo as well, and maybe even Amodei[4]? (Edit: Also lots of prominent AI Safety Twitter accounts seem to have bought fully into this worldview, such as the awful 'AI Safety Memes' account) However, I do agree it’s not all of AI Safety for sure! I just don’t think that, once you take away that urgency and certainty about the problem, it ought to be considered the world's “most pressing problem”, at least without further controversial philosophical assumptions.

  1. ^

    I remember reading and liking your 'LLM plateau-ist' piece.

  2. ^

    I can't speak for all the others you mention, but fwiw I do agree with your frustrations at the AI risk discourse on various sides

  3. ^

    I'd argue through increasing human flourishing and reducing the suffering we inflict on animals, but you could sub in your own cause area here for instance, e.g. 'preventing nuclear war' if you thought that was both likely and an x-risk

  4. ^

    See the transcript with Dwarkesh at 00:24:26 onwards where he says that superhuman/transformative AI capabilities will come within 'a few years' of the interview's date (so within a few years of summer 2023)

Ryan Greenblatt @ 2024-06-26T04:09 (+3)

Pair this with the EA concern that we should be concerned about the counterfactual impact of our actions, and that there are opportunities to do good right here and now,[3] and it shouldn't be a primary EA concern.

As in, your crux is that the probability of AGI within the next 50 years is less than 10%?

I think from an x-risk perspective it is quite hard to beat AI risk even on pretty long timelines. (Where the main question is bio risk and what you think about (likely temporary) civilizational collapse due to nuclear war.)

It's pretty plausible that on longer timelines technical alignment/safety work looks weak relative to other stuff focused on making AI go better.

JWS @ 2024-06-28T11:58 (+3)

As in, your crux is that the probability of AGI within the next 50 years is less than 10%?

I'm essentially deeply uncertain about how to answer this question, in a true 'Knightian Uncertainty' sense, and I don't know how much it makes sense to use subjective probability calculus. It also depends a lot on what we mean by AGI though. I find many of the arguments I've seen to be a) deference to the subjective probabilities of others or b) extrapolation of straight lines on graphs - neither of which I find highly convincing. (I think your arguments seem stronger and more grounded fwiw)

I think from an x-risk perspective it is quite hard to beat AI risk even on pretty long timelines.

I think this can hold, but it holds not just in light of particular facts about AI progress now but in light of various strong philosophical beliefs about value, what future AI would be like, and how the future would be post the invention of said AI. You may have strong arguments for these, but I find many arguments for the overwhelming importance of AI Safety do very poorly at grounding them, especially in light of the compelling interventions to do good that exist in the world right now.

Ryan Greenblatt @ 2024-06-28T18:29 (+2)

It also depends a lot on what we mean by AGI though.

I'm happy to do timelines to the singularity and operationalize this with "we have the technological capacity to pretty easily build projects as impressive as a dyson sphere".

(Or 1000x electricity production, or whatever.)

In my view, this likely adds only a moderate number of years (3-20 depending on how various details go).

Steven Byrnes @ 2024-06-25T12:44 (+2)

For what it’s worth, Yann LeCun is very confidently against LLMs scaling to AGI, and yet LeCun seems to have at least vaguely similar timelines-to-AGI as Ajeya does in that link.

Ditto for me.

Oh hey here’s one more: Chollet himself (!!!) has vaguely similar timelines-to-AGI (source) as Ajeya does. (Actually if anything Chollet expects it a bit sooner: he says 2038-2048, Ajeya says median 2050.)

Phib @ 2024-06-17T19:07 (+10)

Hi JWS, unsure if you’d see this since it’s on LW and I thought you’d be interested (I’m not sure what to think of Chollet’s work tbh and haven’t been able to spend time on it, so I’m not making much of a claim in sharing this!)

https://www.lesswrong.com/posts/Rdwui3wHxCeKb7feK/getting-50-sota-on-arc-agi-with-gpt-4o

JWS @ 2024-06-25T17:35 (+12)

Thanks for sharing this Phib, it's very unfortunate it came out just as I went on holiday! To all readers, this will probably be the major substantive response I make in these comments, and to get the most out of it you'll probably need some technical/background understanding of how AI systems work. I'll tag @Ryan Greenblatt directly so he can see my points, but only the first is really directed at him, the rest are responding to the ideas and interpretations.


First, to Ryan directly, this is really great work! Like, awesome job 👏👏 My only sadness here is that I get the impression you think this work is kind of a dead-end? On the contrary, I think this is the kind of research programme that could actually lead to updates (either way) across the different factions on AI progress and AI risk. You get mentioned positively on the xrisk-hostile Machine Learning Street Talk about this! Melanie Mitchell is paying attention (and even appeared in your substack comments)! I feel like the iron is hot here and it's a promising and exciting vein of research![1]

Second, as others have pointed out, the claimed numbers are not SOTA, but that is because there are different training sets and I think the ARC-AGI team should be more clear about that. But to be clear for all readers, this is what's happened:

  • Ryan got a model to achieve 50% accuracy on the public evaluation training set provided by Chollet in the original repo. Ryan has not got a score on the private set, because those answers are kept privately on Kaggle to prevent data leakage. Note that Ryan's original claims were based on the different sets being IID and the same difficulty, which is not true. We should expect performance to be lower on the private set.
  • The current SOTA  on the private test set was Cole, Osman, and Hodel with 34%, though apparently they now have reached 39% on the private set.  Ryan has noted this, so I assume we'll have clarifications/corrections soon to that bit of his piece.
  • Therefore Ryan has not achieved SOTA performance on ARC. That doesn't mean his work isn't impressive, but it is not true that GPT4o improved the ARC SOTA by 16% in 6 days.
  • Also note from the comments on Substack, when limited to ~128 sample programmes per case, the results were 26% on the held-out test of the training set. It's good, but not state of the art, and one wonders whether the juice is worth the squeeze there, especially if Jianghong Ying's calculations of the tokens-per-case are accurate. We seem to need exponential data to improve results.

Currently, as Ryan notes, his solution is ineligible for the ARC prize as it doesn't meet the various restrictions on runtime/compute/internet connection to enter. While the organisers say that this is meant to encourage efficiency,[2] I suspect it may be more of a security-conscious decision to limit people's access to the private test set. It is worth noting that, as the public training and eval sets are on GitHub (as will be most blog pieces about them, and eventually Ryan's own piece as well as my own), dataset contamination remains an issue to be concerned with.[3]

Third, and most importantly, I think Ryan's solution shows that the intelligence is coming from him, and not from GPT4o. skybrian makes this point in the Substack comments. For example:

  • Ryan came up with the idea and implementation to use ASCII encoding since the vision capabilities of GPT4o were so unreliable. Ryan did some feature extraction on the ARC problems.
  • Ryan wrote the prompts and did the prompt engineering in lieu of fine-tuning being available. He also provides the step-by-step reasoning in his prompts. Those long, carefully crafted prompts seem quite domain/problem-specific, and would probably point more toward ARC's insufficiency as a test for generality than toward an example of general ability in LLMs.
  • Ryan notes that the additional approaches and tweaks are critical for performance gains above 'just draw more samples'. I think that meme was a bit unkind, not to mention inaccurate, and I kinda wish it was removed from the piece tbh.

If you check the repo (linked above), it's full of some really cool code to make this solution work, and that's the secret sauce. To my eyes, I think the hard part here was the scaffolding done by Ryan rather than the pre-training[4] of the LLM (this is another cruxy point I highlighted in my article). I think it's much less conceptually hard to scrape the entire internet and shove it through a transformer architecture. A lot of leg work and cost sure, but the hard part is the ideas bit, and that's still basically all Ryan-GPT.

Fourth, I got massively nerdsniped by what 'in-context learning' actually is. A lot of talk about it from a quick search seemed to be vague, wishy-washy, and highly anthropomorphising. I'm quite confused, given the fact that all of the weights in the transformer are frozen after training and RLHF, why it's called learning at all. The model certainly isn't learning anything. After you ask GPT4o a query you can boot up a new instance and it'll be as clueless as when you started the first session, or you could just flood the context window with enough useless tokens so the original task gets cut off. So, if I accept Ryan's framing of the inconsistent triad, I'd reject the 3rd one, and say that "Current LLMs never "learn" at runtime (e.g. the in-context learning they can do isn't real learning)". I'm going to continue following the 'in-context learning' nerdsnipe, but yeah, since we know that the weights are completely fixed and the model isn't learning, what is doing the 'learning'? And can we think of a better name for it than 'in-context learning'?

Fifth and finally, I'm slightly disappointed at Buck and Dwarkesh for kinda posing this as a 'mic drop' against ARC.[5] Similarly, Zvi seems to dismiss it, though he praises Chollet for making a stand with a benchmark. In contrast, I think that the ability (or not) of models to reason robustly, out-of-distribution, without having the ability to learn from trillions of pre-labelled samples is a pretty big crux for AI Safety's importance. Sure, maybe in a few months we'll see the top score on the ARC Challenge above 85%, but could such a model work in the real world? Is it actually a general intelligence capable of novel or dangerous acts, such as to motivate AI risk? This is what Chollet is talking about in the podcast when he says:

I’m pretty skeptical that we’re going to see an LLM do 80% in a year. That said, if we do see it, you would also have to look at how this was achieved. If you just train the model on millions or billions of puzzles similar to ARC, you’re relying on the ability to have some overlap between the tasks that you train on and the tasks that you’re going to see at test time. You’re still using memorization.


If you're reading, thanks for making it through this comment! I'd recommend reading Ryan's full post first (which Phib linked above), but there's been a bunch of disparate discussion there, on LessWrong, on HackerNews etc. If you want to pursue what the LLM-reasoning-sceptics think, I'd recommend following/reading Melanie Mitchell and Subbarao Kambhampati. Finally, if you think this topic/problem is worth collaborating on then feel free to reach out to me. I'd love to hear from anyone who thinks it's worth investigating and would want to pool resources.

  1. ^

    (Ofc your time is valuable and you should pursue what you think is valuable, I'd just hope this could be the start of a cross-factional, positive-sum research program which would be such a breath of fresh air compared to other AI discourse atm)

  2. ^

    Ryan estimates he used 1000x the runtime compute per problem of Cole et al., and also spent $40,000 in API costs alone (I wonder how much it costs for just one run though?).

  3. ^

    In the original interview, Mike mentions that 'there is an asterisk on any score that's reported on against the public test set' for this very reason

  4. ^

    H/t to @Max Nadeau  for being on top of some of the clarifications on Twitter

  5. ^

    Perhaps I'm misinterpreting, and I am using them as a proxy for the response of AI Safety as a whole, but it's very much the 'vibe' I got from those reactions

Ryan Greenblatt @ 2024-06-26T17:30 (+9)

Sure, maybe in a few months we'll see the top score on the ARC Challenge above 85%, but could such a model work in the real world?

It sounds like you agree with my claims that ARC-AGI isn't that likely to track progress and that other benchmarks could work better?

(The rest of your response seemed to imply something different.)

JWS @ 2024-06-28T12:07 (+3)

At the moment I think ARC-AGI does a good job at showing the limitations of transformer models on simple tasks that they don't come across in their training set. I think if the score was claimed, we'd want to see how it came about. It might be through frontier models demonstrating true understanding, but it might be through shortcut learning/data leakage/an impressive but overly specific and intuitively unsatisfying solution.

If ARC-AGI were to be broken (within the constraints Chollet and Knoop place on it) I'd definitely change my opinions, but what they'd change to depends on the matter of how ARC-AGI was solved. That's all I'm trying to say in that section (perhaps poorly)

Ryan Greenblatt @ 2024-06-26T17:15 (+9)

the claimed numbers are not SOTA, but that is because there are different training sets and I think the ARC-AGI team should be more clear about that

Agreed, though it is possible that my approach is/was SOTA on the private set. (E.g., because Jack Cole et al.'s approach is somewhat more overfit.)

I'm waiting on the private leaderboard results and then I'll revise.

Ryan Greenblatt @ 2024-06-26T17:14 (+9)

My only sadness here is that I get the impression you think this work is kind of a dead-end?

I don't think it is a dead end.

As I say in the post:

  • ARC-AGI probably isn't a good benchmark for evaluating progress towards TAI: substantial "elicitation" effort could massively improve performance on ARC-AGI in a way that might not transfer to more important and realistic tasks.
  • But, I still think that work like ARC-AGI can be good on the margin for getting a better understanding of current AI capabilities.
Ryan Greenblatt @ 2024-06-26T17:26 (+8)

So, if I accept Ryan's framing of the inconsistent triad, I'd reject the 3rd one, and say that "Current LLMs never "learn" at runtime (e.g. the in-context learning they can do isn't real learning)"

You have to reject one of the three. So, if you reject the third (as I do), then you think LLMs do learn at runtime.

I'm quite confused, given the fact that all of the weights in the transformer are frozen after training and RLHF, why it's called learning at all

In RLHF and training, no aspect of the GPU hardware is being updated at all, it's all frozen. So why does that count as learning? I would say that a system can (potentially!) be learning as long as there is some evolving state. In the case of transformers and in-context learning, that state is activations.

JWS @ 2024-06-28T11:55 (+3)

 You have to reject one of the three. So, if you reject the third (as I do), then you think LLMs do learn at runtime.

Ah sorry, I misread the trilemma, my bad! I think I'd still hold the 3rd to be true (Current LLMs never "learn" at runtime), though I'm open to changing my mind on that after looking at further research. I guess I could see ways to reject 1 (e.g. if I copied the answers and just used a lookup table I'd get 100% but I don't think there's any learning, so it's certainly feasible for this to be false, but agreed it doesn't feel satisfying), or 2 (maybe Chollet would say selection-from-memorised-templates doesn't count as learning, also agreed unsatisfying). It's a good challenge!

In RLHF and training, no aspect of the GPU hardware is being updated at all, it's all frozen. So why does that count as learning?

I'm not really referring to hardware here, in pre-training and RLHF the model weights are being changed and updated, and that's where the 'learning' (if we want to call it that) comes in - the model is 'learning' to store information/generate information with some combination of accurately predicting the next token in its training data and satisfying the RL model created from human reward labelling. Which is my issue with calling ICL 'learning': since the model weights are fixed, the model isn't learning anything. Similarly, all the activation functions between the layers do not change either. It also doesn't make intuitive sense to me to call the outputs of layers 'learning' - the activations are 'just matmul', which I know is reductionist, but they aren't a thing that acquires a new state in my mind.

But again, this is something I want to do a deep dive into myself, so I accept that my thoughts on ICL might not be very clear

Ryan Greenblatt @ 2024-06-28T18:26 (+2)

I'm not really referring to hardware here, in pre-training and RLHF the model weights are being changed and updated

Sure, I was just using this as an example. I should have made this more clear.


Here is a version of the exact same paragraph you wrote but for activations and in-context learning:

in pre-training and RLHF the model activations are being changed and updated by each layer, and that's where the 'in-context learning' (if we want to call it that) comes in - the activations are being updated/optimized to better predict the next token and understand the text. The layers learned to in-context learn (update the activations) across a wide variety of data in pretraining.

(We can show transformers learning to do optimization in very toy cases: https://www.lesswrong.com/posts/HHSuvG2hqAnGT5Wzp/no-convincing-evidence-for-gradient-descent-in-activation#Transformers_Learn_in_Context_by_Gradient_Descent__van_Oswald_et_al__2022_)

Fair enough if you want to say "the model isn't learning, the activations are learning", but then you should also say "short term (<1 minute) learning in humans isn't the brain learning, it is the transient neural state learning".

JWS @ 2024-06-29T09:16 (+2)

I'll have to dive into the technical details here I think, but the mystery of in-context learning has certainly shot up my reading list, and I really appreciate that link btw! It seems Blaine has some of the same a priori scepticism that I do towards it, but the right way for me to proceed is to dive into the empirical side and see if my ideas hold water there.

Ryan Greenblatt @ 2024-06-26T17:22 (+8)

Third, and most importantly, I think Ryan's solution shows that the intelligence is coming from him, and not from GPT4o. skybrian makes this point in the Substack comments.

[...]

To my eyes, I think the hard part here was the scaffolding done by Ryan rather than the pre-training[4] of the LLM (this is another cruxy point I highlighted in my article).

Quoting from a substack comment I wrote in response:

Certainly some credit goes to me and some to GPT4o.

The solution would be much worse without careful optimization and wouldn't work at all without gpt4o (or another llm with similar performance).

It's worth noting a high fraction of my time went into writing prompts and optimizing the representation. (Which is perhaps better described as teaching gpt4o and making it easier for it to see the problem.)

There are different analogies here which might be illuminating:

  • Suppose that you strand a child out in the woods and never teach them anything. I expect they would be much worse at programming. So, some credit for their abilities goes to society and some to their brain.
  • If you remove my ability to see (or conversely, use fancy tools to make it easier for a blind person to see) this would greatly affect my ability to do ARC-AGI puzzles.
  • You can build systems around people which remove most of the interesting intelligence from various tasks.

I think what is going on here is analogous to all of these.

Separately, this tweet is relevant: https://x.com/MaxNadeau_/status/1802774696192246133

 

I think it's much less conceptually hard to scrape the entire internet and shove it through a transformer architecture. A lot of leg work and cost sure, but the hard part is the ideas bit,

It is worth noting that hundreds (thousands?) of high quality researcher years have been put into making GPT4o more performant.

JWS @ 2024-06-28T12:53 (+6)

The solution would be much worse without careful optimization and wouldn't work at all without gpt4o (or another llm with similar performance).

I can buy that GPT4o would be best, but perhaps other LLMs might reach 'ok' scores on ARC-AGI if directly swapped in? I'm not sure what you're referring to by 'careful optimization' here though.

There are different analogies here which might be illuminating:

  • Suppose that you strand a child out in the woods and never teach them anything. I expect they would be much worse at programming. So, some credit for their abilities goes to society and some to their brain.
  • If you remove my ability to see (or conversely, use fancy tools to make it easier for a blind person to see) this would greatly affect my ability to do ARC-AGI puzzles.
  • You can build systems around people which remove most of the interesting intelligence from various tasks.

I think what is going on here is analogous to all of these.

On these analogies:

  1. This is an interesting point actually. I suppose credit-assignment for learning is a very difficult problem. In this case though, the child stranded would (hopefully!) survive and make a life for themselves and learn the skills they need to survive. They're active agents using their innate general intelligence to solve novel problems (per Chollet). If I put a hard-drive with GPT4o's weights in the forest, it'll just rust. And that'll happen no matter how big we make that model/hard-drive imo.[1]
  2. Agreed here, will be very interesting to see how improved multimodality affects ARC-AGI scores. I think that we have interesting cases of humans being able to perform these tasks in their head, presumably without sight? e.g. blind chess players with high ratings or mathematicians who can reason without sight. I think Chollet's point in the interview is that they seem to be able to parse the JSON inputs fine in various cases, but still can't perform generalisation.
  3. Yep, I think this is true, and it's perhaps my greatest fear about delegating power to complex AI systems. This is an empirical question we'll have to find out: can we simply automate away everything humans do/are needed for through a combination of systems, even if each individual part/model used in said system is not intelligent?

Separately, this tweet is relevant: https://x.com/MaxNadeau_/status/1802774696192246133

Yep, I saw Max's comments and think he did a great job on X bringing some clarifications. I still think the hard part is the scaffolding. Money is easy for San Francisco VCs to provide, and we know they're all fine with a scrape-data-first, ask-legal-forgiveness-later approach.

I think there's a separate point where enough scaffolding + LLM means the resulting AI system is not well described as being an LLM anymore. Take the case of CICERO by Meta. Is that a 'scaffolded LLM'? I'd rather describe it as a system which incorporates an LLM as a particular part. It's harder to naturally scale such a system in the way that you can with the transformer architecture, by stacking more layers or pre-training for longer on more data.

My intuition here is that scaffolding to make a system work well on ARC-AGI would make it less usable on other tasks, so sacrificing generality for specific performance. Perhaps in this case ARC-AGI is best used as a suite of benchmarks, where the same model and scaffolding should be used for each? (Just thinking out loud here)


Final point: I've really appreciated your original work and your comments on substack/X/here. I do apologise if I didn't make clear which parts were my personal reflections/vibes rather than more technical disagreements on interpretation - these are very complex topics (at least for me) and I'm trying my best to form a good explanation of the various evidence and data we have on this. Regardless of our disagreements on this topic, I've learned a lot :)

  1. ^

    Similarly, you can pre-train a model to create weights and get to a humongous size. But it won't do anything until you ask it to generate a token. At least, that's my intuition. I'm quite sceptical of how pre-training a transformer is going to lead to creating a mesa-optimiser.

Ryan Greenblatt @ 2024-06-28T18:40 (+2)

But it won't do anything until you ask it to generate a token. At least, that's my intuition.

I think this seems like mostly a fallacy. (I feel like there should be a post explaining this somewhere.)

Here is an alternative version of what you said to indicate why I don't think this is a very interesting claim:

Sure, you can have a very smart quadriplegic who is very knowledgeable. But they won't do anything until you let them control some actuator.

If your view is that "prediction won't result in intelligence", fair enough, though it's notable that the human brain seems to heavily utilize prediction objectives.

JWS @ 2024-06-29T09:27 (+2)

(folding in replies to different sub-comments here)

Sure, you can have a very smart quadriplegic who is very knowledgeable. But they won't do anything until you let them control some actuator.

I think our misunderstanding here is caused by the word do. Sure, Stephen Hawking couldn't control his limbs, but nevertheless his mind was always working. He kept writing books and papers throughout his life, and his brain was 'always on'. A transformer model is a set of frozen weights that are only 'on' when a prompt is entered. That's what I mean by 'it won't do anything'.

As far as this project, seems extremely implausible to me that the hard part of this project is the scaffolding work I did.

Hmm, maybe we're differing on what 'hard work' means here! It could be a difference between what's expensive, time-consuming, etc. I'm not sure this holds for any reasonable scheme, and I definitely think that you deserve a lot of credit for the work you've done, much more than GPT4o.

I think my results are probably SOTA based on more recent updates.

Congrats! I saw that result and am impressed! It's clearly SOTA on the ARC-AGI-PUB leaderboard, but the original '34%->50% in 6 days ARC-AGI breakthrough' claim is still incorrect.

Ryan Greenblatt @ 2024-06-28T18:36 (+2)

I can buy that GPT4o would be best, but perhaps other LLMs might reach 'ok' scores on ARC-AGI if directly swapped out? I'm not sure what you're referring to by 'careful optimization' here though.

I think much worse LLMs like GPT-2 or GPT-3 would virtually eliminate performance.

This is very clear, as these LLMs basically can't code at all.

If you instead consider LLMs which are only somewhat less powerful like llama-3-70b (which is perhaps 10x less effective compute?), the reduction in perf will be smaller.

Ryan Greenblatt @ 2024-06-28T18:20 (+2)

Perhaps in this case ARC-AGI is best used as a suite of benchmarks, where the same model and scaffolding should be used for each?

Yes, it seems reasonable to try out general purpose scaffolds (like what METR does) and include ARC-AGI in general purpose task benchmarks.

I expect substantial performance reductions from general purpose scaffolding, though some fraction of that will be due to not having prefix compute and to allocating test-time compute less effectively.

Ryan Greenblatt @ 2024-06-28T18:18 (+2)

I still think the hard part is the scaffolding.

For this project? In general?

As far as this project, seems extremely implausible to me that the hard part of this project is the scaffolding work I did. This probably holds for any reasonable scheme for dividing credit and determining what is difficult.

Ryan Greenblatt @ 2024-06-26T17:28 (+7)

Fifth and finally, I'm slightly disappointed at Buck and Dwarkesh for kinda posing this as a 'mic drop' against ARC.

I don't think the objection is to ARC (the benchmark); I think the objection is to specific (very strong!) claims that Chollet makes.

I think the benchmark is a useful contribution as I note in another comment.

JWS @ 2024-06-28T12:13 (+4)

Oh yeah, this wasn't against you at all! I think you're a great researcher and an excellent interlocutor, and I have learned a lot (and am learning a lot) from both your work and your reactions to my reaction.[1] Point five was very much a reaction against a 'vibe' I saw in the wake of your results being published.

Like, let's take Buck's tweet for example. We know now that a) your results aren't technically SOTA and b) it's not an LLM solution, it's an LLM + your scaffolding + program search, and I think that's importantly not the same thing.

  1. ^

    I sincerely hope my post + comments have been somewhat more stimulating than frustrating for you

Ryan Greenblatt @ 2024-06-29T04:11 (+3)

We know now that a) your results aren't technically SOTA

I think my results are probably SOTA based on more recent updates.

It's not an LLM solution, it's an LLM + your scaffolding + program search, and I think that's importantly not the same thing. 

I feel like this is a pretty strange way to draw the line about what counts as an "LLM solution".

Consider the following simplified dialogue as an example of why I don't think this is a natural place to draw the line:

Human skeptic: Humans don't exhibit real intelligence. You see, they'll never do something as impressive as sending a human to the moon.

Humans-have-some-intelligence advocate: Didn't humans go to the moon in 1969?

Human skeptic: That wasn't humans sending someone to the moon that was Humans + Culture + Organizations + Science sending someone to the moon! You see, humans don't exhibit real intelligence!

Humans-have-some-intelligence advocate: ... Ok, but do you agree that if we removed the Humans from the overall approach it wouldn't work?

Human skeptic: Yes, but same with the culture and organization!

Humans-have-some-intelligence advocate: Sure, I guess. I'm happy to just call it humans+etc I guess. Do you have any predictions for specific technical feats which are possible to do with a reasonable amount of intelligence that you're confident can't be accomplished by building some relatively straightforward organization on top of a bunch of smart humans within the next 15 years?

Human skeptic: No.


Of course, I think actual LLM skeptics often don't answer "No" to the last question. They often do have something that they think is unlikely to occur with a relatively straightforward scaffold on top of an LLM (a model descended from the current LLM paradigm, perhaps trained with semi-supervised learning and RLHF).

I actually don't know what in particular Chollet thinks is unlikely here. E.g., I don't know if he has strong views about the performance of my method, but using the SOTA multimodal model in 2 years' time.

JWS @ 2024-06-29T13:03 (+2)

Final final edit: Congrats on the ARC-AGI-PUB results, really impressive :)

This will be my final response on this thread, because life is very time consuming and I'm rapidly reaching the point where I need to dive back into the technical literature and stress-test my beliefs and intuitions again. I hope Ryan and any readers have found this exchange useful/enlightening as an example of two different perspectives having (hopefully) productive disagreement.

If you found my presentation of the scaling-sceptical position highly unconvincing, I'd recommend following the work and thoughts of Tan Zhi Xuan (find her on X here). One of my biggest updates was finding her work after she pushed back on Jacob Steinhardt here, and recently she gave a talk about her approach to Alignment. I urge readers to consider spending much more of their time listening to her than to me about AI.


I feel like this is a pretty strange way to draw the line about what counts as an "LLM solution".

I don't think so? Again, I wouldn't call CICERO an "LLM solution". Surely there'll be some amount of scaffolding which tips over into the scaffolding being the main thing and the LLM just being a component part? It's probably all blurry lines for sure, but I think it's important to separate 'LLM only systems' from 'systems that include LLMs', because it's very easy to conceptually scale up the former but harder to do so with the latter.

Human skeptic: That wasn't humans sending someone to the moon that was Humans + Culture + Organizations + Science sending someone to the moon! You see, humans don't exhibit real intelligence!

I mean, you use this as a reductio, but that's basically the theory of Distributed Cognition, and it's also linked to ideas of 'collective intelligence', though that's definitely not an area I'm an expert in by any means. It also reminds me a lot of Chalmers and Clark's Extended Mind thesis.[1]

Of course, I think actual LLM skeptics often don't answer "No" to the last question. They often do have something that they think is unlikely to occur with a relatively straightforward scaffold on top of an LLM (a model descended from the current LLM paradigm, perhaps trained with semi-supervised learning and RLHF).

So I can't speak for Chollet and other LLM skeptics, and I think again LLMs+extra (or extras+LLMs) are a different beast from LLMs on their own and possibly an important crux. Here are some things I don't think will happen in the near-ish future (on the current paradigm):

  • I believe an adversarial Imitation Game, where the interrogator is aware of both the AI system's LLM-based nature and its failure modes, is unlikely to be consistently beaten in the near future.[2]
  • Primarily-LLM models, in my view, are highly unlikely to exhibit autopoietic behaviour or develop agentic designs independently (i.e. without prompting/direction by a human controller).
  • I don't anticipate these models exponentially increasing the rate of scientific research or AI development.[3] They'll more likely serve as tools used by scientists and researchers themselves to frame problems, but new and novel problems will still remain difficult and be bottlenecked by the real world + Hofstadter's law.
  • I don't anticipate Primarily-LLM models becoming good at controlling and manoeuvring robotic bodies in the 3D world. This is especially true in a novel-test-case scenario (if someone could make a physical equivalent of ARC to test this, that'd be great).
  • This would be even less likely if the scaffolding remained minimal. For instance, if there's no initial sorting code explicitly stating [IF challenge == turing_test GO TO turing_test_game_module].
  • Finally, as an anti-RSI operationalisation, the idea of LLM-based models assisting in designing and constructing a Dyson Sphere within 15 years seems... particularly far-fetched to me.

I'm not sure if this reply was my best, it felt a little all-over-the-place, but we are touching on some deep and complex topics! So I'll respectfully bow out now, and thanks again for the discussion and for giving me so much to think about. I really appreciate it Ryan :)

  1. ^

    Then you get into ideas like embodiment/enactivism etc

     

  2. ^

    I can think of a bunch of strategies to win here, but I'm not gonna say them so they don't end up in GPT-5 or 6's training data!

  3. ^

    Of course, with a new breakthrough, all bets could be off, but it's also definitionally impossible to predict those, and it's unrobust to draw straight lines on graphs to predict the future if you think breakthroughs will be needed. (Not saying you do this, but some other AIXR people definitely seem to)

Egg Syntax @ 2024-06-25T19:23 (+2)

I have thoughts, but a question first: you link a Kambhampati tweet where he says,

...as the context window changes (with additional prompt words), the LLM, by design, switches the CPT used to generate next token--given that all these CPTs have been pre-computed?

What does 'CPT' stand for here? It's not a common ML or computer science acronym that I've been able to find.

DanielFilan @ 2024-11-27T23:44 (+2)

Since nobody else has responded, my best guess would be "conditional probability table".

Egg Syntax @ 2024-06-25T19:57 (+1)

I think Ryan's solution shows that the intelligence is coming from him, and not from Chat-GPT4o.

If this is true, then substituting in a less capable model should have equally good results; would you predict that to be the case? I claim that plugging in an older/smaller model would produce much worse results, and if that's the case then we should consider a substantial part of the performance to be coming from the model.

This is what Chollet is talking about in the podcast when he says...'I’m pretty skeptical that we’re going to see an LLM do 80% in a year. That said, if we do see it, you would also have to look at how this was achieved.'

This seems to me to be Chollet trying to have it both ways. Either a) ARC is an important measure of 'true' intelligence (or at least of the ability to reason over novel problems), and so we should consider LLMs' poor performance on it a sign that they're not general intelligence, or b) ARC isn't a very good measure of true intelligence, in which case LLMs' performance on it isn't very important. Those can't be simultaneously true. I think that nearly everywhere but in the quote, Chollet has claimed (and continues to claim) that a) is true.

Egg Syntax @ 2024-06-25T19:26 (+1)

I'm quite confused, given the fact that all of the weights in the transformer are frozen after training and RLHF, why it's called learning at all. The model certainly isn't learning anything. 

I would frame it as: the model is learning but then forgetting what it's learned (due to its inability to move anything from working/short-term memory to long-term memory). That's something that we see in learning in humans as well (one example: I've learned an enormous number of six-digit confirmation codes, each of which I remember just long enough to enter it into the website that's asking for it), although of course not so consistently.
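(A rough illustration of the framing in this exchange, using a placeholder function rather than any real API: the weights are fixed at inference time, so anything 'learned' lives only in the prompt context and is gone on the next call.)

```python
def generate(weights: dict, context: str) -> str:
    """Placeholder for a frozen LLM: output depends on fixed weights plus the
    prompt context. Nothing in this function ever updates `weights`."""
    return f"[response conditioned on {len(context)} chars of context]"

weights = {"frozen": True}  # trained once; unchanged at inference time

# Within a single context window, the model can use a convention it was just told...
print(generate(weights, "In this chat, 'blorp' means red square. What colour is a blorp?"))

# ...but on a fresh call the context is gone and the weights are identical,
# so nothing was consolidated: 'learning' without long-term memory.
print(generate(weights, "What colour is a blorp?"))
```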

Marcel D @ 2024-06-16T16:13 (+4)

Can anyone point me to a good analysis of the ARC test's legitimacy/value? I was a bit surprised when I listened to the podcast, as they made it seem like a high-quality, general-purpose test, but then I was very disappointed to see it's just a glorified visual pattern abstraction test. Maybe I missed some discussion of it in the podcasts I listened to, but it just doesn't seem like people pushed back hard enough on the legitimacy of comparing "language model that is trying to identify abstract geometric patterns through a JSON file" vs. "humans that are just visually observing/predicting the patterns."

Like, is it wrong to demand that humans should have to do this test purely by interpreting the JSON (with no visual aid)?

mlsbt @ 2024-06-17T08:54 (+5)

Language models have no problem interpreting the image correctly. You can ask them for a description of the input grid and they’ll get it right, they just don’t get the pattern.

Marcel D @ 2024-06-17T10:37 (+4)

I wouldn't be surprised if that's correct (though I haven't seen the tests), but that wasn't my complaint. A moderately smart/trained human can also probably convert from JSON to a description of the grid, but there's a substantial difference in experience between seeing even a list of grid square-color labels vs. actually visualizing it and identifying the patterns. I would hazard a guess that humans who are only given a list of square-color labels (rather than the raw JSON) would perform significantly worse if they are not allowed to then draw out the grids.

And I would guess that even if some people do it well, they are doing it well because they convert from text to visualization.

mlsbt @ 2024-06-17T12:58 (+3)

I might be misunderstanding you here. You can easily get ChatGPT to convert the image to a grid representation/visualization, e.g. in Python, not just a list of square-color labels. It can formally draw out the grid any way you want and work with that, but still doesn’t make progress.

Also, to answer your initial question about ARC’s usefulness, the idea is just that these are simple problems where relevant solution strategies don’t exist on the internet. A non-visual ARC analog might be, as Chollet mentioned, Caesar ciphers with non-standard offsets.
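(For readers who haven't seen the raw format being argued about here: an ARC task is JSON roughly along these lines, and a few lines of Python turn it into the grid a human would normally look at. A minimal sketch assuming the publicly documented ARC task format; the toy task itself is made up.)

```python
import json

# A toy task in the shape of the public ARC format: cells are integers 0-9 (colours).
task = json.loads("""
{
  "train": [
    {"input": [[0, 1], [1, 0]], "output": [[1, 0], [0, 1]]}
  ],
  "test": [
    {"input": [[1, 1], [0, 0]]}
  ]
}
""")

def show(grid):
    # Render nested lists as rows of digits -- roughly the 'visual aid' a human
    # gets for free but a text-only model has to reconstruct from the JSON.
    for row in grid:
        print("".join(str(cell) for cell in row))

for example in task["train"]:
    show(example["input"])
    print("->")
    show(example["output"])
```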

Marcel D @ 2024-06-18T12:53 (+2)

Just because an LLM can convert something to a grid representation/visualization does not mean it can itself actually "visualize" the thing. A pure-text model will lack the ability to observe anything visually. Just because a blind human can write out some mathematical function that they can input into a graphing calculator, that does not mean that the human necessarily can visualize what the function's shape will take, even if the resulting graph is shown to everyone else.

mlsbt @ 2024-06-18T13:38 (+1)

I used GPT-4o which is multimodal (and in fact was even trained on these images in particular as I took the examples from the ARC website, not the Github). I did test more grid inputs and it wasn't perfect at 'visualizing' them.

Marcel D @ 2024-06-18T13:44 (+2)

I almost clarified that I know some models technically are multi-modal, but my impression is that the visual reasoning abilities of the current models are very limited, so I'm not at all surprised they struggle here. Among other illustrations of this impression, I've occasionally found they struggle to properly describe what is happening in an image beyond a relatively general level.

mlsbt @ 2024-06-18T13:49 (+1)

Looking forward to seeing the ARC performance of future multimodal models. I'm also going to try to think of a text-based ARC analog, that is perhaps more general. There are only so many unique simple 2D-grid transformation rules so it can be brute forced to some extent.

Aaron_Scher @ 2024-06-18T18:02 (+1)

The paper that introduces the test is probably what you're looking for. Based on a skim, it seems to me that it spends a lot of words laying out the conceptual background that would make this test valuable. Obviously it's heavily selected for making the overall argument that the test is good. 

Egg Syntax @ 2024-06-16T17:55 (+1)

"humans that are just visually observing/predicting the patterns."

 

I don't think that's actually any simpler than doing it as JSON; it's just that our brains are tuned for (and we're more accustomed to) doing it visually. Depending on the specifics of the JSON format, there may be a bit of advantage to being able to have adjacency be natively two-dimensional, but I wouldn't expect that to make a huge difference.

Marcel D @ 2024-06-18T12:56 (+2)

Again, I'd be interested to actually see humans attempt the test by viewing the raw JSON, without being allowed to see/generate any kind of visualization of the JSON. I suspect that most people will solve it by visualizing and manipulating it in their head, as one typically does with these kinds of problems. Perhaps you (a person with syntax in their username) would find this challenge quite easy! Personally, I don't think I could reliably do it without substantial practice, especially if I'm prohibited from visualizing it.

Ryan Greenblatt @ 2024-06-29T03:55 (+3)

Tom Davidson's model is often referred to in the Community, but it is entirely reliant on the current paradigm + scale reaching AGI.

This seems wrong.

It does use constants from the historical deep learning field to provide guesses for parameters and it assumes that compute is an important driver of AI progress.

These are much weaker assumptions than you seem to be implying.

Note also that this work is based on earlier work like bio anchors which was done just as the current paradigm and scaling were being established. (It was published in the same year as Kaplan et al.)

Steven Byrnes @ 2024-07-05T18:08 (+2)

I don’t recall the details of Tom Davidson’s model, but I’m pretty familiar with Ajeya’s bio-anchors report, and I definitely think that if you make an assumption “algorithmic breakthroughs are needed to get TAI”, then there really isn’t much left of the bio-anchors report at all. (…although there are still some interesting ideas and calculations that can be salvaged from the rubble.)

I went through how the bio-anchors report looks if you hold a strong algorithmic-breakthrough-centric perspective in my 2021 post Brain-inspired AGI and the "lifetime anchor".

See also here (search for “breakthrough”) where Ajeya is very clear in an interview that she views algorithmic breakthroughs as unnecessary for TAI, and that she deliberately did not include the possibility of algorithmic breakthroughs in her bio-anchors model (…and therefore she views the possibility of breakthroughs as a pro tanto reason to think that her report’s timelines are too long).

OK, well, I actually agree with Ajeya that algorithmic breakthroughs are not strictly required for TAI, in the narrow sense that her Evolution Anchor (i.e., recapitulating the process of animal evolution in a computer simulation) really would work given infinite compute and infinite runtime and no additional algorithmic insights. (In other words, if you do a giant outer-loop search over the space of all possible algorithms, then you’ll find TAI eventually.) But I think that’s really leaning hard on the assumption of truly astronomical quantities of compute [or equivalent via incremental improvements in algorithmic efficiency] being available in like 2100 or whatever, as nostalgebraist points out. I think that assumption is dubious, or at least it’s moot—I think we’ll get the algorithmic breakthroughs far earlier than anyone would or could do that kind of insane brute force approach.

Ryan Greenblatt @ 2024-07-05T22:33 (+3)

I agree that these models assume something like "large discontinuous algorithmic breakthroughs aren't needed to reach AGI".

(But incremental advances which are ultimately quite large in aggregate and which broadly follow long running trends are consistent.)

However, I interpreted "current paradigm + scale" in the original post as "the current paradigm of scaling up LLMs and semi-supervised pretraining". (E.g., not accounting for totally new RL schemes or wildly different architectures trained with different learning algorithms which I think are accounted for in this model.)

JWS @ 2024-06-29T09:00 (+2)

From the summary page on Open Phil:

In this framework, AGI is developed by improving and scaling up approaches within the current ML paradigm, not by discovering new algorithmic paradigms.

From this presentation about it to GovAI (from April 2023) at 05:10:

So the kinda zoomed out idea behind the Compute-centric framework is that I'm assuming something like the current paradigm is going to lead to human-level AI and further, and I'm assuming that we get there by scaling up and improving the current algorithmic approaches. So it's going to look like better versions of transformers that are more efficient and that allow for larger context windows...

Both of these seem to be pretty scaling-maximalist to me, so I don't think the quote seems wrong, at least to me? It'd be pretty hard to make a model which includes the possibility of the paradigm not getting us to AGI and then needing a period of exploration across the field to find the other breakthroughs needed.

Tsunayoshi @ 2024-06-16T20:34 (+3)

Great post, we need more summaries of disagreeing viewpoints!


Having said that, here are a few replies: 
 

I think this makes me very concerned about a strong ideological and philosophical bubble in the Bay regarding these core questions of AI

I am only slightly acquainted with Bay Area AI safety discourse, but my impression is indeed that people lack familiarity with some of the empirically true and surprising points made by skeptics, e.g. Yann LeCun (LLMs DO lack common sense and robustness), and that is bad. Nevertheless, I do not think you are outright banished if you express such a viewpoint. IIRC Yudkowsky himself asserted in the past that LLMs are not sufficient for AGI (he made a point about being surprised at GPT-4's abilities on the Lex Fridman podcast). I would not put too much stock into LW upvotes as a measure of AIS researchers' POV, as most LW users are engaging with AIS as a hobby and consequently do not have a very sophisticated understanding of the current pitfalls in LLMs.
 


On priors, it seems odd to place very high credence in results on exactly one benchmark. The fate of most "fundamentally difficult for LLMs, this time we mean it" benchmarks (e.g. Winograd schemas, GPQA) has usually been that next-gen LLMs perform substantially better at them, which is also a point "Situational Awareness" makes. Focusing on the ARC challenge now and declaring it the actual true test of intelligence is a little bit of survivorship bias.

Scale Maximalists, both within the EA community and without, would stand to lose a lot of Bayes points/social status/right to be deferred to

Acknowledging that status games are bad in general, I do think that it is valid to point out that, historically speaking, the "Scale is almost all you need" worldview has so far been much more predictive of the performance that we do see with large models. The fact that this was taken seriously by the AIS community/Scott/Open Phil (I think) well before GPT-3 came out, whereas mainstream academic research thought of such models as fun toys of little practical significance, is a substantial win.

Even under uncertainty about whether the scaling hypothesis turns out to be essentially correct, it makes a lot of sense to focus on the possibility that it is indeed correct and plan/work accordingly. If it is not correct, we only lose the opportunity cost of what else we could have done with our time and money. If it is correct, well... you know the scenarios.

 
 

SummaryBot @ 2024-06-17T17:48 (+1)

Executive summary: François Chollet and Dwarkesh Patel discuss key cruxes in the debate over whether scaling current AI approaches will lead to AGI, with Chollet arguing that more is needed beyond scaling and Patel pushing back on some of Chollet's claims.

Key points:

  1. Chollet introduces the ARC Challenge as a test of general intelligence that current large language models (LLMs) struggle with, despite the tasks being simple for humans.
  2. Chollet distinguishes between narrow "skill" and general "intelligence", arguing that LLMs are doing sophisticated memorization and interpolation rather than reasoning and generalization.
  3. Patel counters that with enough scale, interpolation could lead to general intelligence, and that the missing pieces beyond scaling may be relatively easy.
  4. Chollet thinks the hard parts of intelligence, like active inference and discrete program synthesis, are not addressed by the current scaling paradigm.
  5. The author believes Chollet makes a compelling case, and that if he is right it should significantly update people's views on AI risk and the value of current AI safety work.

 

 

This comment was auto-generated by the EA Forum Team. Feel free to point out issues with this summary by replying to the comment, and contact us if you have feedback.

Egg Syntax @ 2024-06-16T17:52 (+1)

Typo watch: 

Dwarkesh is annoyed because he thinks that François is conceptually defining LLM-like models as incapable of memorisation

I assume you mean 'incapable of generalization' here?