Unsolved research problems on the road to AGI

By Yarrow Bouchard 🔸 @ 2025-11-22T22:39 (+11)

Artificial general intelligence (AGI) — an AI system that can think, plan, learn, and solve problems the way a human does, at a human level of competence — is not just a matter of scaling up existing AI models (which is, in any case, a forlorn hope), but requires several problems in fundamental AI research to be solved. Those problems are: hierarchical planning, continual learning, learning from video data, data efficiency, generalization, and reliability.

While large language models (LLMs) are considered by many to represent major fundamental progress in AI, in an important sense this is not true at all — LLMs have made no progress on any of these problems. AGI remains as far away as the time it will take to solve them all, which is hard to predict because progress in basic science is hard to predict.

I will give a description of each of these problems.

Hierarchical planning

Hierarchical planning is the ability to plan complex tasks that contain nested hierarchies of other tasks. For example, the task "make a sandwich" includes the task "get some bread". "Get some bread" includes "walk to the kitchen". "Walk to the kitchen" includes "plan a path to the kitchen and watch out for obstacles (e.g. the cat lying on the kitchen floor)", which includes "step over the cat, but stop moving if the cat suddenly gets up". And so on.
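
The nested structure in this example is easy to write down by hand; a minimal sketch in Python (the `Task`/`flatten` helpers and the task names are illustrative, not any particular planner's API) shows what a hierarchical plan looks like as data. The open research problem is getting an AI system to discover such decompositions on its own, not representing them:

```python
from dataclasses import dataclass, field

@dataclass
class Task:
    """A task that may decompose into ordered subtasks."""
    name: str
    subtasks: list["Task"] = field(default_factory=list)

def flatten(task: Task) -> list[str]:
    """Depth-first expansion of a task into its primitive steps (the leaves)."""
    if not task.subtasks:
        return [task.name]
    steps = []
    for sub in task.subtasks:
        steps.extend(flatten(sub))
    return steps

# The sandwich example from the text, written as a task tree.
plan = Task("make a sandwich", [
    Task("get some bread", [
        Task("walk to the kitchen", [
            Task("plan a path, watching for obstacles"),
            Task("step over the cat (stop if it moves)"),
        ]),
    ]),
    Task("assemble the sandwich"),
])
```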

Hierarchical reinforcement learning is one idea for how to solve hierarchical planning, but it remains an open research problem.

Continual learning

Currently, deep learning-based and deep reinforcement learning-based systems have a training phase and a test phase. A model is trained for some amount of time, then training is done, permanently, and the model is deployed into the world, at which point it stops learning, permanently. Continual learning would mean there is no longer a distinction between the training phase and the test phase. AI systems would always be learning, just as humans do.
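
The distinction can be made concrete with a toy contrast — a running-average "model" of a scalar stream, purely illustrative and not a proposal for how continual learning should actually work:

```python
def frozen_model(train_data):
    # Conventional setup: fit once on the training data, then the
    # estimate never changes after deployment.
    return sum(train_data) / len(train_data)

class ContinualModel:
    """Keeps updating its estimate on every example, even after 'deployment'."""
    def __init__(self):
        self.n = 0
        self.mean = 0.0

    def update(self, x):
        # Incremental mean update: no separate training phase exists.
        self.n += 1
        self.mean += (x - self.mean) / self.n
```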

Learning from video data

Large language models (LLMs) benefit from the inherent structure in text data. Words or tokens are clean, crisp compositional units of text data that have no counterpart in video data. Pixels are far too granular, and a single pixel lacks semantic meaning in the way a word or token has it. Text prediction has a natural form: LLMs can predict the probability of the next word or token and, say, rank the five most likely words/tokens to come next in the sequence, assigning a percentage probability to each. No comparably natural form for video prediction exists. Trying to predict video pixel by pixel leads to an uncontainable combinatorial explosion.
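
The "natural form" of text prediction can be sketched in a few lines: a probability distribution over a small, discrete vocabulary, ranked by likelihood. The tokens and logit values below are made up for illustration; the point is that nothing analogous exists at the level of individual pixels:

```python
import math

def softmax(logits):
    """Convert raw scores into a probability distribution over tokens."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical candidate next tokens and hypothetical model scores.
vocab = ["the", "a", "bread", "cat", "kitchen"]
probs = softmax([2.0, 1.5, 0.5, 0.1, -1.0])

# Rank the five candidates from most to least likely.
ranked = sorted(zip(vocab, probs), key=lambda pair: -pair[1])
```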

Effective video prediction probably requires a semantic and conceptual understanding of what's happening in the world, e.g. predicting plausible outcomes for video of a car driving up to a red stoplight requires knowing cars are things that drive on roads, that gravity keeps heavy objects stuck to the ground, that red means stop, etc. Conversely, effective video prediction techniques may help AI models gain this sort of conceptual and semantic understanding of the world.

Data efficiency

Humans frequently learn from zero examples, one example, two examples, or just a few. Deep learning models often require hundreds or thousands of training examples to get a competent grasp on a concept, e.g. a model must train on 1,000 photos of bananas to be able to classify photos of bananas with 91% accuracy. Models that learn via deep reinforcement learning require massive amounts of trial-and-error experience to learn skills.

For example, DeepMind's system AlphaStar couldn't learn how to play StarCraft II using reinforcement learning from scratch (this is related to difficulties with hierarchical planning). First, it required 971,000 examples of human-played games to learn from via imitation learning. If the average game is 10 minutes, this is equivalent to 18.5 years of continuous play — something like three to four orders of magnitude more than humans require to learn to play at a comparable level of skill. Second, to attain Grandmaster-level skill, AlphaStar did 60,000 years of training via self-play, which is at least three orders of magnitude more than professional StarCraft II players could have played during their lifetimes, even assuming all their waking hours since birth had been devoted to playing.
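
The arithmetic behind these figures can be checked directly. The 10-minute average game length is the post's own assumption, and the ~30 waking-years of lifetime play used below is a deliberately generous bound introduced here for the order-of-magnitude comparison:

```python
# Imitation-learning corpus: 971,000 games at an assumed 10 minutes each.
games = 971_000
years_of_play = games * 10 / (60 * 24 * 365.25)  # minutes -> years of continuous play

# Self-play comparison: 60,000 years of training vs. a generous upper
# bound of ~30 waking-years a human professional could have played.
selfplay_years = 60_000
human_lifetime_years = 30  # assumed generous bound, for illustration
ratio = selfplay_years / human_lifetime_years  # 2,000x, i.e. more than 10^3
```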

Generalization

Generalization is the ability of an AI system to understand concepts or information that isn’t well-represented in its training data. For example, can a convolutional neural network that has been trained on 1,000 labelled photos of bananas recognize an image of a cartoon banana? Can AlphaStar respond competently if a Grandmaster-level human player tries a new strategy or tactic that wasn’t in the 971,000 recordings of human-played games it trained on, and that didn’t come up during self-play?

Generalization is the holy grail of AI research. Current AI systems do not generalize well, at least not compared to humans, or even other mammals. AI systems are brittle, meaning they quickly fall apart when challenged with novel problems or situations.

Generalization is not to be confused with the ability to do a lot of different things. For instance, DeepMind’s model MuZero can play 57 Atari games, but wouldn’t be able to play a 58th Atari game you presented it with. MuZero can’t generalize from what it’s learned to something novel.

Reliability

The AI researcher Ilya Sutskever, a co-founder and former chief scientist of OpenAI, as well as a co-author of the breakthrough AlexNet paper in 2012, has identified reliability as possibly the hardest challenge for deep learning to overcome. In a 2023 appearance on Dwarkesh Patel’s podcast, Sutskever offered the following remarks:

…there is this effect where optimistic people who are working on the technology tend to underestimate the time it takes to get there. But the way I ground myself is by thinking about the self-driving car. In particular, there is an analogy where if you look at the — so I have a Tesla, and if you look at the self-driving behavior of it, it looks like it does everything. It does everything. But it's also clear that there is still a long way to go in terms of reliability. And we might be in a similar place with respect to our models where it also looks like we can do everything, and at the same time, we will need to do some more work until we really iron out all the issues and make it really good and really reliable and robust and well-behaved.

Patel then asked what would be the most likely cause if the economic value of LLMs turned out to be disappointing. Sutskever again flagged reliability:

I really don't think that's a likely possibility, so that's the preface to the comment. But if I were to take the premise of your question, well, why were things disappointing in terms of real-world impact? My answer would be reliability. If somehow it ends up being the case that you really want them to be reliable and they ended up not being reliable, or if reliability turned out to be harder than we expect.

I really don't think that will be the case. But if I had to pick one and you were telling me — hey, why didn't things work out? It would be reliability. That you still have to look over the answers and double-check everything. That just really puts a damper on the economic value that can be produced by those systems.

In many cases, the question of economic importance is not what an AI system can do correctly 50% of the time or 90% of the time, but 99.999% of the time, if that's the sort of reliability humans have on such tasks. Self-driving cars are the prime example of this, but the same idea applies to LLMs. If a company wants to use LLMs to summarize financial documents, for instance, mistakes in the summaries are costly to correct, since a human must check the summaries for accuracy. If mistakes slip past human reviewers and aren't corrected, there is a risk bad information could lead to even more costly mistakes down the line. 
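
Why the gap between 99% and 99.999% matters so much can be seen by compounding per-task error over many tasks. The batch size of 100 documents below is an arbitrary illustration, and the tasks are assumed independent:

```python
def all_correct(per_task_accuracy, n):
    """Probability that all n independent tasks are handled correctly."""
    return per_task_accuracy ** n

n = 100  # e.g. 100 financial-document summaries

moderate = all_correct(0.99, n)     # ~0.37: an error somewhere is near-certain
high = all_correct(0.99999, n)      # ~0.999: the whole batch is usually clean
```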

Getting reliability from 90% to 99% and then to 99.9% and so on is often intractable in practice. Deep learning scaling trends indicate that exponentially more training data is required for each further increment in models' accuracy. Obtaining large-scale training data for self-driving cars is expensive and carries safety concerns. In the case of LLMs, the available training data is nearly exhausted.
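
A toy power-law model illustrates why each further increment in accuracy demands multiplicatively more data. The exponent below is made up for illustration; real scaling exponents vary by task, model family, and data distribution:

```python
# Assume error falls as a power law in dataset size: error ~ data**(-alpha).
alpha = 0.1  # assumed scaling exponent, for illustration only

def data_multiplier(error_ratio, alpha):
    """Factor by which the dataset must grow to cut error by `error_ratio`."""
    return error_ratio ** (1 / alpha)

# Cutting error 10x (e.g. 10% -> 1%) under this assumed exponent
# requires 10**10 — ten billion — times more data.
tenfold_improvement = data_multiplier(10, alpha)
```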

Implications for AGI timelines

I agree with the physicist David Deutsch’s argument that inductively extrapolating a trend forward without an explanatory theory of what is causing that trend is equivalent to a belief in magic. In the case of AI progress, moreover, what induction actually implies is ambiguous.

If you have a vague sense that AI has been making a lot of progress, you could simply extrapolate that AI will continue to make a lot of progress, and then use your imagination to fill in the gaps of what that means, specifically. If you look at specific problems like those described above and note that minimal progress has been made on them in recent years, you could extrapolate this forward and infer that solving any of them will take centuries. But neither is a sensible approach, because extrapolating a trend without understanding why the trend is happening isn’t a sensible way to make predictions in the first place.

My conclusion is that the time remaining until AGI is highly uncertain, but I can get a vague intuition for it based on research progress on these basic science problems. Solving all of these problems in the very near future seems unlikely, so AGI seems very unlikely within, say, the next five to ten years. However, beyond the very near future, the years laid out before us quickly slip into a fog of uncertainty. To try to say whether AGI is 30 years away or 60 or 90 or 120 seems impossible in a quite fundamental sense. It seems completely forlorn to even try to guess. We do not understand the basic science involved, and we do not understand how it will eventually be figured out. There is no precedent for anyone correctly predicting anything like this (except perhaps through sheer luck).

Tracking AGI progress

To track AGI progress, one should not rely on text-based question and answer benchmarks for LLMs or similarly simplistic and uninformative indicators. One should track research progress on each one of these fundamental research problems. This is not something one can readily quantify, but I challenge the assumption that we should expect to be able to readily quantify science or scientific progress. Premature operationalization (e.g. defining happiness as daily smile frequency) is the bane of good conceptual understanding. Underlying the desire to quantify is the desire for rigour, but operationalizing an informal concept with a sloppy, oversimplifying proxy is the opposite of rigour. This should be sternly frowned upon — the equivalent of making unsupported assertions.

It may be possible to eventually quantify the thing you want to measure, but this requires patience. First, understand the thing you want to measure as deeply as possible. Then try to imagine ways it could possibly be quantified. Some people want to rush to the second part, but this can only end badly. If you don’t understand what you’re trying to measure, you won’t measure it well.

For now, the best way to track AGI progress remains qualitative. Look at the research and see how much progress is being made. There will be measurements involved, but that won’t be the whole story. Moreover, figuring out which measurements matter and which ones don’t will be a complex reasoning process, requiring philosophical reflection on how to formalize informal concepts.

Conclusion

The amount of progress in fundamental AI research required for AGI has been widely ignored or underestimated in discussions around forecasting AGI. Discussions often proceed on the assumption that scaling will lead to AGI, which is false, or that progress on fundamental research problems has been steady, rapid, and continuous, which is also false.

There is also a kind of supernatural thinking prevalent in some discussions, in which the idea of AGI inventing itself is seriously discussed — but this is an impossibility. For an AI system to invent anything, first humans must invent an AI system with the ability to invent things. To the extent people think AI systems can already do this or are just on the cusp of it, they misunderstand current AI capabilities and limitations.

Qualitative impressions of the amount of research progress on the problems I described above may vary from person to person. What seems clear, in any case, is that people who agree with my account of the research obstacles on the road to AGI will tend to think that near-term AGI is unlikely or at least very uncertain.


Ben_West🔸 @ 2025-11-26T21:32 (+8)

LLMs have made no progress on any of these problems

Can we bet on this? I propose: we give a video model of your choice from 2023 and one of my choice from 2025 two prompts (one your choice, one my choice) then ask some neutral panel of judges (I'm happy to just ask random people in a coffee shop) which model produced more realistic videos. 

tobycrisford 🔸 @ 2025-11-24T06:54 (+3)

LLMs have made no progress on any of these problems.

I think this probably overstates things? For example, o3 was able to achieve human-level performance on ARC-AGI-1, which I think counts as at least some kind of progress on the problems of generalization and data efficiency?