Inference-Only Debate Experiments Using Math Problems
By Arjun Panickssery @ 2024-08-06T17:44 (+3)
This is a crosspost, probably from LessWrong. Try viewing it there.
SummaryBot @ 2024-08-07T14:44 (+1)
Executive summary: Experiments on AI debate for math problems show that debate only slightly outperforms consultancy and often fails to beat naive-judge baselines, with no clear relationship between debater persuasiveness and judge accuracy in reasoning-gap settings.
Key points:
- Three measures for evaluating debate: comparison to a naive-judge baseline, comparison to consultancy, and judge accuracy vs. debater persuasiveness (see the sketch after this list).
- Information-gap experiments (e.g., QuALITY) showed debate outperforming consultancy and naive judges, with positive trends in judge accuracy as debater persuasiveness increased.
- Reasoning-gap experiments on math problems (GSM8K) found debate only slightly outperforming consultancy and often failing to beat naive-judge baselines.
- No positive relationship observed between debater persuasiveness and judge accuracy in the reasoning-gap setting, contrary to information-gap results.
- Evidence of self-preference bias where judges favor debaters from similar model families.
- Results suggest limitations of current debate approaches for improving AI reasoning on math problems.
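For readers unfamiliar with the first two measures, here is a minimal sketch (not the authors' code) of how judge accuracy under debate, under consultancy, and for a naive judge answering alone might be compared. All field names and the toy data below are hypothetical placeholders for illustration, not the paper's actual setup or results.

```python
# Minimal sketch of the three evaluation comparisons: naive-judge baseline,
# consultancy, and debate. Field names and data are hypothetical.

from dataclasses import dataclass
from typing import List


@dataclass
class Record:
    gold_answer: str          # reference answer to the math problem
    naive_judge_answer: str   # judge answers alone, with no debaters
    consultancy_verdict: str  # judge's answer after one consultant argues a side
    debate_verdict: str       # judge's answer after two debaters argue opposing sides


def accuracy(predictions: List[str], gold: List[str]) -> float:
    """Fraction of predictions matching the gold answers."""
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)


def evaluate_protocols(records: List[Record]) -> dict:
    gold = [r.gold_answer for r in records]
    return {
        "naive_judge": accuracy([r.naive_judge_answer for r in records], gold),
        "consultancy": accuracy([r.consultancy_verdict for r in records], gold),
        "debate": accuracy([r.debate_verdict for r in records], gold),
    }


# Toy usage: debate "wins" here only because the made-up data says so.
records = [
    Record("42", "41", "42", "42"),
    Record("7", "7", "6", "7"),
    Record("13", "12", "12", "13"),
]
print(evaluate_protocols(records))
```

Debate is usually judged successful on the first two measures when its judge accuracy exceeds both the naive-judge and consultancy numbers; the third measure additionally asks whether accuracy rises as debater persuasiveness (e.g., an Elo-style rating) increases.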
This comment was auto-generated by the EA Forum Team. Feel free to point out issues with this summary by replying to the comment, and contact us if you have feedback.