Inference-Only Debate Experiments Using Math Problems

By Arjun Panickssery @ 2024-08-06T17:44 (+3)

This is a crosspost, probably from LessWrong. Try viewing it there.

SummaryBot @ 2024-08-07T14:44 (+1)

Executive summary: Experiments on AI debate for math problems show that debate only slightly outperforms consultancy and often fails to beat naive-judge baselines, with no clear relationship between debater persuasiveness and judge accuracy in reasoning-gap settings.

Key points:

  1. Three measures for evaluating debate: comparison to a naive-judge baseline, comparison to consultancy, and the relationship between judge accuracy and debater persuasiveness (see the sketch after this list).
  2. Information-gap experiments (e.g., QuALITY) showed debate outperforming consultancy and naive judges, with positive trends in judge accuracy as debater persuasiveness increased.
  3. Reasoning-gap experiments on math problems (GSM8K) found debate only slightly outperforming consultancy and often failing to beat naive-judge baselines.
  4. No positive relationship observed between debater persuasiveness and judge accuracy in the reasoning-gap setting, contrary to information-gap results.
  5. Evidence of self-preference bias, with judges favoring debaters from similar model families.
  6. Results suggest limitations of current debate approaches for improving AI reasoning on math problems.
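
The post does not include code, but the three measures in point 1 can be illustrated with a minimal sketch. Everything below is hypothetical: the `JudgedItem` record, its fields (`protocol`, `judge_correct`, `debater_elo`), and the Elo-binning scheme are illustrative assumptions, not the experiment's actual data format or evaluation code.

```python
# Hypothetical sketch of the three evaluation measures; field names and
# binning are assumptions, not the authors' actual pipeline.
from dataclasses import dataclass
from statistics import mean

@dataclass
class JudgedItem:
    protocol: str        # "debate", "consultancy", or "naive" (judge answers alone)
    judge_correct: bool  # whether the judge picked the correct answer
    debater_elo: float   # persuasiveness proxy for the assigned debater

def accuracy(items, protocol):
    """Judge accuracy under a single protocol."""
    subset = [it for it in items if it.protocol == protocol]
    return mean(it.judge_correct for it in subset) if subset else float("nan")

def persuasiveness_trend(items, protocol="debate", n_bins=4):
    """Judge accuracy per debater-persuasiveness bin (items sorted by Elo)."""
    subset = sorted((it for it in items if it.protocol == protocol),
                    key=lambda it: it.debater_elo)
    if not subset:
        return []
    size = max(1, len(subset) // n_bins)
    return [mean(it.judge_correct for it in subset[i:i + size])
            for i in range(0, len(subset), size)]

# Measure 1: accuracy(items, "debate") vs. accuracy(items, "naive")
# Measure 2: accuracy(items, "debate") vs. accuracy(items, "consultancy")
# Measure 3: does persuasiveness_trend(items) increase across Elo bins?
```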

This comment was auto-generated by the EA Forum Team. Feel free to point out issues with this summary by replying to the comment, and contact us if you have feedback.