[Paper] AI Sandbagging: Language Models can Strategically Underperform on Evaluations

By Teun van der Weij @ 2024-06-13T10:04 (+22)

This is a linkpost to https://arxiv.org/abs/2406.07358

Andrew Gimber @ 2024-06-14T07:11 (+1)

Is there a typo in the first figure? I think the answer to the MMLU (top) question should be B, not A, because the greatest common factor of 36 and 90 is 18, not 9. (Not, of course, the central point of your paper/post, but it tripped me up when reading.)
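
For anyone who wants to check the arithmetic, here is a minimal sketch using Python's standard-library `math.gcd`. It is not taken from the paper, and the variable names are only illustrative.

```python
import math

# Prime factorizations: 36 = 2^2 * 3^2 and 90 = 2 * 3^2 * 5,
# so the shared factors are 2 * 3^2.
a, b = 36, 90
gcf = math.gcd(a, b)

# 9 divides both numbers, but it is not the greatest common factor.
nine_is_common = (a % 9 == 0) and (b % 9 == 0)

print(gcf)             # 18
print(nine_is_common)  # True
```

So 9 is a common factor of both numbers, just not the greatest one.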

Teun_Van_Der_Weij @ 2024-06-14T08:55 (+1)

Ha, you're clearly right. We will fix it.