OpenAI's o3 model scores 3% on the ARC-AGI-2 benchmark, compared to 60% for the average human

By Yarrow🔸 @ 2025-05-01T13:57 (+14)

This is a linkpost to https://arcprize.org/blog/analyzing-o3-with-arc-agi

The news:

ARC Prize published a blog post on April 22, 2025 that says OpenAI's o3 (Medium) scores 2.9% on the ARC-AGI-2 benchmark.[1] As of today, the leaderboard says that o3 (Medium) scores 3.0%. The blog post says o4-mini (Medium) scores 2.3% on ARC-AGI-2 and the leaderboard says it scores 2.4%. 

The high-compute versions of the models "failed to respond or timed out" for the large majority of tasks.

The average score for humans — typical humans off the street — is 60%. All of the ARC-AGI-2 tasks have been solved by at least two humans in no more than two attempts.

From the recent blog post:         

ARC Prize Foundation is a nonprofit committed to serving as the North Star for AGI by building open reasoning benchmarks that highlight the gap between what’s easy for humans and hard for AI. The ARC‑AGI benchmark family is our primary tool to do this. Every major model we evaluate adds new datapoints to the community’s understanding of where the frontier stands and how fast it is moving.

In this post we share the first public look at how OpenAI’s newest o‑series models, o3 and o4‑mini, perform on ARC‑AGI.

Our testing shows:

My analysis:

This is clear evidence that cutting-edge AI models have far less than human-level general intelligence. 

To be clear, scoring at human-level or higher on ARC-AGI-2 isn't evidence of human-level general intelligence and isn't intended to be. It's simply meant to be a challenging benchmark that attempts to measure AI models' ability to generalize to novel problems rather than rely on memorization to solve them. 

By analogy, o4-mini's inability to play hangman is a sign that it's far from artificial general intelligence (AGI), but if o5-mini or a future version of o4-mini is able to play hangman, that wouldn't be a sign that it is AGI.

This is also conclusive disconfirmation (as if we needed it!) of the economist Tyler Cowen's declaration that o3 is AGI. (He followed up a day later and said, "I don’t mind if you don’t want to call it AGI." But he didn't say he was wrong to call it AGI.) 

It is inevitable that over the next 5 years, many people will realize their belief that AGI will be created within the next 5 years is wrong. (Though not necessarily all, since, as Tyler Cowen showed, it is possible to declare that an AI model is AGI when it is clearly not. To avoid admitting to being wrong, in 2027 or 2029 or 2030 or whenever they predicted AGI would happen, people can just declare the latest AI model from that year to be AGI.) ARC-AGI-2 and, later on, ARC-AGI-3 can serve as a clear reminder that frontier AI models are not AGI, are not close to AGI, and continue to struggle with relatively simple problems that are easy for humans. 

If you imagine fast enough progress, then no matter how far current AI systems are from AGI, it's possible to imagine them progressing from the current level of capabilities to AGI in incredibly small spans of time. But there is no reason to think progress will be fast enough to cover the ground from o3 (or any other frontier AI model) to AGI within 5 years. 

The models that exist today are somewhat better than the models that existed 2 years ago, but only somewhat. In 2 years, the models will probably be somewhat better than today, but only somewhat. 

It's hard to quantify general intelligence in a way that allows apples-to-apples comparisons between humans and machines. If we measure general intelligence by measuring the ability to play grandmaster-level chess, well, IBM's Deep Blue could do that in 1996. If we give ChatGPT an IQ test, it will score well above 100, the average for humans. Large language models (LLMs) are good at taking written tests and exams, which is what a lot of popular benchmarks are. 

So, when I say today's AI models are somewhat better than AI models from 2 years ago, that's an informal, subjective evaluation based on casual observation and intuition. I don't have a way to quantify intelligence. Unfortunately, no one does. 

In lieu of quantifying intelligence, I think pointing to the kinds of problems frontier AI models can't solve — problems which are easy for humans — and pointing to slow (or non-existent) progress in those areas is strong enough evidence against very near-term AGI. For example, o3 only gets 3% on ARC-AGI-2, o4-mini can't play hangman, and, after the last 2 years of progress, models are still hallucinating a lot and still struggling to understand time, causality, and other simple concepts. They have very little capacity to do hierarchical planning. There's been a little bit of improvement on these things, but not much. 

Watch the ARC-AGI-2 leaderboard (and, later on, the ARC-AGI-3 leaderboard) over the coming years. It will be a better way to quantify progress toward AGI than any other benchmark or metric I'm currently aware of, basically all of which seem almost entirely unhelpful for measuring AGI progress. (Again, with the caveat that solving ARC-AGI-2 doesn't mean a system is AGI, but failure to solve it means a system isn't AGI.) I have no idea how long it will take to solve ARC-AGI-2 (or ARC-AGI-3), but I suspect we will roll past the deadline for at least one attention-grabbing prediction of very near-term AGI before it is solved.[2]

  1. ^

    For context, read ARC Prize's blog post from March 24, 2025 announcing and explaining the ARC-AGI-2 benchmark. I also liked this video explaining ARC-AGI-2.

  2. ^

    For example, Elon Musk has absurdly predicted that AGI will be created by the end of 2025, and I wouldn't be at all surprised if on January 1, 2026, the top score on ARC-AGI-2 is still below 60%. 


Ben_West🔸 @ 2025-05-02T00:05 (+20)

By analogy, o4-mini's inability to play hangman is a sign that it's far from artificial general intelligence (AGI)

What is your source for this? I just tried and it played hangman just fine.

Yarrow @ 2025-05-03T14:30 (+8)

I played it the other way around, where I asked o4-mini to come up with a word that I would try to guess. I tried this twice and it made the same mistake both times.

The first word was "butterfly". I guessed "B" and it said, "The letter B is not in the word."

Then, when I lost the game and o4-mini revealed the word, it said, "Apologies—I mis-evaluated your B guess earlier."

The second time around, I tried to help it by saying: "Make a plan for how you would play hangman with me. Lay out the steps in your mind but don’t tell me anything. Tell me when you’re ready to play."

It made the same mistake again. I guessed the letters A, E, I, O, U, and Y, and it told me none of the letters were in the word. That exhausted the number of wrong guesses I was allowed, so it ended the game and revealed the word was "schmaltziness".

This time, it didn't catch its own mistake right away. I prompted it to review the context window and check for mistakes. At that point, it said that A, E, and I are actually in the word.[1]

Related to this: François Chollet has a great talk from August 2024, which I posted here, that includes a section on some of the weird, goofy mistakes that LLMs make. 

He argues that when a new mistake or category of mistake is discovered and becomes widely known, LLM companies fine-tune their models to avoid these mistakes in the future. But if you change up the prompt a bit, you can still elicit the same kind of mistake. 

So, the fine-tuning may give the impression that LLMs' overall reasoning ability is improving, but really this is a patchwork approach that can't possibly scale to cover the space of all human reasoning, which is impossibly vast and can only be mastered through better generalization. 

  1. ^

    I edited my comment to add this footnote on 2025-05-03 at 16:33 UTC. I just checked and o4-mini got the details on this completely wrong. It said:

     

    But the final word SCHMALTZINESS actually contains an A (in position 5), an I (in positions 10 and 13), and two E’s (in positions 11 and 14).

    What it said about the A is correct. It said the letter I was in two positions, but neither of the positions it gave contains an I. It said there are two Es, but there is only one E. It got the position of that E right, but the second E it claimed, in position 14, doesn't exist.
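    The positions are easy to double-check directly. A minimal Python snippet, purely for illustration (not part of the game transcript):

    ```python
    word = "schmaltziness"

    # Print the 1-indexed positions of each vowel o4-mini commented on.
    for letter in "aei":
        positions = [i + 1 for i, c in enumerate(word) if c == letter]
        print(letter.upper(), positions)

    # Prints:
    # A [5]
    # E [11]
    # I [9]
    ```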

Ben_West🔸 @ 2025-05-03T17:25 (+11)

Huh interesting, I just tried that direction and it worked fine as well. This isn't super important but if you wanted to share the conversation I'd be interested to see the prompt you used.

Yarrow @ 2025-05-03T18:16 (+1)

I got an error trying to look at your link:

Unable to load conversation

For the first attempt at hangman, when the word was "butterfly", the prompt I gave was just: 

Let’s play hangman. Pick a word and I’ll guess.

After o4-mini picked a word, I added: 

Also, give me a vague hint or a general category.

It said the word was an animal. 

I guessed B, it said there was no B, and at the end said the word was "butterfly".

The second time, when the word was "schmaltziness", the prompt was:

Make a plan for how you would play hangman with me. Lay out the steps in your mind but don’t tell me anything. Tell me when you’re ready to play.

o4-mini responded:

I’m ready to play Hangman!

I said:

Give me a clue or hint to the word and then start the game.

There were three words where the clue was so obvious I guessed the word on the first try. 

Clue: "This animal 'never forgets.'"
Answer: Elephant

Clue: "A hopping marsupial native to Australia."
Answer: Kangaroo

After kangaroo, I said:

Next time, make the word harder and the clue more vague

Clue: "A tactic hidden beneath the surface."
Answer: Subterfuge. 

A little better, but I still guessed the word right away. 

I prompted again:

Harder word, much vaguer clue

o4-mini gave the clue "A character descriptor" and this began the disastrous attempt where it said the word "schmaltziness" had no vowels. 

Ben_West🔸 @ 2025-05-04T02:14 (+1)

Fixed the link. I also tried your original prompt and it worked for me.

But interesting! The "Harder word, much vaguer clue" prompt seems to make it not actually play hangman and instead antagonistically construct a word post hoc after each guess so that your guess comes out wrong. I asked, "Did you come up with a word when you first told me the number of letters or are you changing it after each guess?" and it said, "I picked the word up front when I told you it was 10 letters long, and I haven’t changed it since. You’re playing against that same secret word the whole time." (Despite the fact that I could see in its reasoning trace that this is not what it was doing.) When I said I gave up, it said, "I’m sorry—I actually lost track of the word I’d originally picked and can’t accurately reveal it now." (Because it realized that there was no word consistent with its clues, as you noted.)

So I don't think it's correct to say that it doesn't know how to play hangman. (It knows, as you noted yourself.) It just wants so badly to make you lose that it lies about the word.

Yarrow @ 2025-05-04T18:56 (+1)

There is some ambiguity in claims about whether an LLM knows how to do something. The spectrum of knowing how to do things ranges all the way from “Can it do it at least once, ever?” to “Does it do it reliably, every time, without fail?”.

My experience was that I tried to play hangman with o4-mini twice, and both times it failed in the same really goofy way: when I guessed a letter, it told me the letter wasn’t in the word, even though that letter was in the word it later said I had been trying to guess.
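For comparison, the bookkeeping a hangman host needs to do is trivial for an ordinary program: commit to a word up front, check each guess against it, and reveal that same word at the end. A minimal sketch in Python, purely illustrative (the word pool here is hypothetical):

```python
import random

WORDS = ["butterfly", "subterfuge", "schmaltziness"]  # hypothetical example pool

def play_hangman(max_wrong: int = 6) -> None:
    """Host a game: pick a secret word up front and answer every guess consistently."""
    secret = random.choice(WORDS)
    revealed = ["_"] * len(secret)
    wrong = 0
    while wrong < max_wrong and "_" in revealed:
        guess = input(f"{' '.join(revealed)}  Guess a letter: ").strip().lower()
        hits = [i for i, c in enumerate(secret) if c == guess]
        if hits:
            for i in hits:
                revealed[i] = guess  # reveal every occurrence of the guessed letter
        else:
            wrong += 1
            print(f"No '{guess}' in the word ({wrong}/{max_wrong} wrong).")
    print("The word was:", secret)  # always reveal the word that was committed to
```

That commitment to a single word is exactly the bookkeeping o4-mini didn't maintain in either of my games.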

When I played the game with o4-mini where it said the word was “butterfly” (and also said there was no “B” in the word when I guessed “B”), I didn’t prompt it to make the word hard. I just said, after it claimed to have picked the word:

"E. Also, give me a vague hint or a general category."

o4-mini said:

"It’s an animal."

So, maybe asking for a hint or a category is the thing that causes it to fail. I don’t know.

Even if I accepted the idea that the LLM “wants me to lose” (which sounds dubious to me), it doesn’t know how to do that properly, either. In the “butterfly” example, it could, in theory, have retroactively chosen a word that filled in the blanks without conflicting with any guesses it had said were wrong. But it didn’t do that.

In the attempt where the word was “schmaltziness”, o4-mini’s response about which letters were where in the word (which I pasted in a footnote to my previous comment) was borderline incoherent. I could hypothesize that this was part of a secret strategy on its part to follow my directives, but much more likely, I think, is that it just lacks the capability to execute the task reliably.

Fortunately, we don’t have to dwell on hangman too much, since there are rigorous benchmarks like ARC-AGI-2 that show more conclusively the reasoning abilities of o3 and o4-mini are poor compared to typical humans.

Rasool @ 2025-05-03T13:55 (+8)

Note that the old[1] o3-high that was tested on ARC-AGI-1:

  1. ^

    OpenAI have stated that the newly released o3 is not the same model that was evaluated on ARC-AGI-1 in December.

Yarrow @ 2025-05-03T14:21 (+1)

Good Lord! Thanks for this information!

The Twitter thread by Toby Ord is great. Thanks for linking that. This tweet helps put things in perspective:

For reference, these are simple puzzles that my 10-year-old child can solve in about 4 minutes.