[Paper] AI Sandbagging: Language Models can Strategically Underperform on Evaluations
By Teun van der Weij @ 2024-06-13T10:04 (+22)
This is a linkpost to https://arxiv.org/abs/2406.07358
Andrew Gimber @ 2024-06-14T07:11 (+1)
Is there a typo in the first figure? I think the answer to the MMLU (top) question should be B, not A, because the greatest common factor of 36 and 90 is 18, not 9. (Not, of course, the central point of your paper/post, but it tripped me up when reading.)
Teun_Van_Der_Weij @ 2024-06-14T08:55 (+1)
Ha, you're clearly right. We will fix it.