Absolute Zero: AlphaZero for LLMs
By alapmi @ 2025-05-12T14:54 (+2)
Question for the alignment crowd:
The new Absolute Zero Reasoner paper proposes a "self‑play RL with zero external data" paradigm, where the same model both invents tasks and learns to solve them. It reaches SOTA coding and math scores without ever touching a human‑curated dataset. But the authors also flag a few "uh‑oh moments": e.g. their 8B Llama variant generated a chain‑of‑thought urging it to "outsmart … intelligent machines and less intelligent humans," and they explicitly list "lingering safety concerns" as an open problem, noting that the system "still necessitates oversight."
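To make the setup concrete, here's a minimal sketch (not the authors' code) of the propose/solve self-play loop as I understand it from the paper: one policy plays both the task *proposer* and the task *solver*, and a code executor is the only source of ground truth. The helper names (`llm_propose`, `llm_solve`, `update_policy`) are hypothetical stubs, and the exact reward shaping below is my paraphrase of the paper's "learnability" idea rather than its precise formula.

```python
# Sketch of an Absolute-Zero-style self-play loop. One policy proposes tasks
# and also tries to solve them; a code executor verifies answers.
# All helpers here are hypothetical stubs, not the paper's implementation.

import random

def llm_propose(policy, history):
    """Stub: the policy invents a task (a program plus an input)."""
    return {"program": "def f(x):\n    return x * 2", "input": random.randint(0, 9)}

def llm_solve(policy, task):
    """Stub: the policy predicts the program's output for the given input."""
    return task["input"] * 2 if random.random() < 0.6 else None

def execute(task):
    """Verifier: run the proposed program to obtain the ground-truth output."""
    env = {}
    exec(task["program"], env)
    return env["f"](task["input"])

def update_policy(policy, experience):
    """Stub for the RL update step."""
    pass

policy, history = object(), []
for step in range(3):
    task = llm_propose(policy, history)              # role 1: invent a task
    target = execute(task)                           # verifier defines correctness
    answers = [llm_solve(policy, task) for _ in range(8)]   # role 2: try to solve it
    success = sum(a == target for a in answers) / len(answers)
    # Proposer is rewarded for *learnable* tasks: neither trivial nor impossible.
    r_propose = 0.0 if success in (0.0, 1.0) else 1.0 - success
    r_solve = [1.0 if a == target else 0.0 for a in answers]  # binary solver reward
    update_policy(policy, (task, r_propose, answers, r_solve))
    history.append(task)
```

The point I want to draw out: the task distribution the solver trains on is itself an output of the policy being trained, so the "environment" drifts as capabilities grow.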
My question: How should alignment researchers think about a learner that autonomously expands its own task distribution? Traditional RLHF/RLAIF assumes a fixed environment and lets us shape rewards; here the environment (the task generator) is part of the agent. Do existing alignment proposals—e.g. approval‑based amplification, debate, verifier‑game setups—scale to this recursive setting, or do we need new machinery (e.g. meta‑level corrigibility constraints on the proposer)? I’d love pointers to prior work or fresh ideas on how to keep a self‑improving task‑designer and task‑solver pointed in the right direction before capabilities sprint ahead of oversight.
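Purely as an illustration of what a "meta-level constraint on the proposer" might even mean mechanically (this is my speculation, not anything from the paper), here is one place such a constraint could attach in the loop above: score the *proposed task* with an external check before it ever reaches the solver, and fold that into the proposer's reward. `oversight_score` is a hypothetical human- or approved-model-generated signal.

```python
# Hypothetical hook: penalize the proposer for tasks that fail an external
# oversight check, before the solver ever trains on them.

def shaped_proposer_reward(success_rate: float, oversight_score: float, lam: float = 1.0) -> float:
    """Learnability-style reward minus a penalty when the proposed task fails oversight."""
    learnability = 0.0 if success_rate in (0.0, 1.0) else 1.0 - success_rate
    return learnability - lam * (1.0 - oversight_score)
```

Whether anything like this scales (or just teaches the proposer to game the overseer) is exactly the kind of thing I'm hoping people have thought about.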