o3
By Zach Stein-Perlman @ 2024-12-20T21:00 (+74)
tobycrisford 🔸 @ 2024-12-21T07:00 (+12)
The ARC performance is a huge update for me.
I've previously found Francois Chollet's arguments that LLMs are unlikely to scale to AGI pretty convincing, mainly because he had created an until-now unbeaten benchmark to back those arguments up.
But in his linked write-up, he describes this as "not merely an incremental improvement, but a genuine breakthrough". He does not admit he was wrong, but instead paints o3 as something fundamentally different from previous LLM-based AIs, which, for the purpose of assessing the significance of o3, amounts to the same thing!
David M @ 2024-12-21T23:05 (+2)
Thanks for making the connection to Francois Chollet for me - I'd forgotten I'd read this interview with him by Dwarkesh Patel half a year ago that had made me a little more skeptical of the nearness of AGI.
Hawk.Yang 🔸 @ 2024-12-22T03:40 (+1)
Honestly, I'm not sure I would agree with this. Like Chollet said, this is fundamentally different from simply scaling the number of parameters (derived from pre-training) that a lot of previous scaling discourse centered around. To then take this inference-time scaling approach, which requires a qualitatively different CoT/search-tree strategy to be appended to an LLM alongside an evaluator model, and call it "scaling" is a bit of a rhetorical sleight of hand.
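To make the distinction concrete, here's a minimal best-of-N sketch of that generator-plus-evaluator pattern. It's a hedged illustration only: `generate_cot` and `score` are hypothetical stand-ins for model calls (o3's actual search procedure hasn't been published), not any real API.

```python
import random

def generate_cot(prompt: str) -> str:
    """Placeholder for an LLM sampling one chain-of-thought answer.
    (Hypothetical stand-in, not a real API call.)"""
    return f"candidate reasoning for: {prompt} ({random.random():.3f})"

def score(prompt: str, candidate: str) -> float:
    """Placeholder for a separate evaluator model grading a candidate."""
    return random.random()

def best_of_n(prompt: str, n: int = 8) -> str:
    """Spend extra inference-time compute by sampling n chains of
    thought and keeping the one the evaluator rates highest. No
    pre-training parameters change; only test-time compute scales."""
    candidates = [generate_cot(prompt) for _ in range(n)]
    return max(candidates, key=lambda c: score(prompt, c))

print(best_of_n("solve the puzzle", n=8))
```

The point of the sketch is that the knob being turned here (n, i.e. search breadth at test time) is a different axis entirely from the parameter-count scaling that earlier discourse was about.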
While this is no doubt a big deal and a concrete step toward AGI, there are enough architectural issues around planning, multi-step tasks/projects, and actual permanent memory (not just RAG) that I'm not updating as much as most people are on this. I would also like to see whether this approach works on tasks without clear, verifiable feedback mechanisms (unlike software engineering/programming or math). My timelines remain in the 2030s.