Fit Testing AI Benchmarking

By Declan McKenna 🔷 @ 2026-04-07T10:17 (+4)

This is a linkpost to https://declanmck.com/2026/04/02/fit-testing-ai-benchmarking/

In February I got in touch with CaML as part of the AIxAnimals incubator, run by Sentient Futures. They tasked me with putting MORU Bench (Moral Reasoning Under Uncertainty) up on Inspect, a benchmarking framework run by the UK's AI Security Institute. My PR was accepted, and I completed the project within three weeks. It was my first project working with Claude Code.

It felt surprisingly familiar despite being Python rather than Swift: building tooling, writing to a spec, making sure test coverage is solid. Of all the things I've done during this career switch, this is probably the closest to what I actually did as an iOS framework engineer.

A few things stood out to me:

Is this impactful work?

I’m a bit torn. On one hand, what CaML are doing (creating benchmarks that measure how an AI views animal welfare, how compassionate it is to non-human beings) seems like a great way to influence model behaviour. These benchmarks are used by frontier labs as targets to hit before releasing a model, so they have a pretty direct effect on how models will end up behaving.

On the other hand, I'm apprehensive about benchmarks in general. During my time at the EA Hotel I spoke to a few AI safety people who saw most evaluations as progressing capabilities, not safety. The logic is that an evaluation measuring frontier maths, or any other capability, ends up helping the labs make their agents better at that capability. I don't feel experienced enough to hold a strong opinion on this myself; I'm just aware that opinions differ. I've been encouraging the people I know who feel strongly about this to write about it, because I'd love to see their views stress-tested.

Fit

Of all the fit tests so far, this has been the most promising in terms of how capable I am at the work. It maps nicely onto my existing experience, and I did enjoy doing it — maybe not quite as much as iOS development, but more than any other project I've worked on during this process.

I also have to bear in mind that creating an evaluation is probably one of the easier tasks; I can see there being more interesting work in maintaining a framework rather than using it to create benchmarks. It's also an area where I'd be in a much better position to hit the ground running and get a job.

A couple of concerns though:

All in all, it's a strong contender so far and a successful fit test. I may go back and do a few additional PRs for Inspect, or look into other evaluation projects to revisit this.


Jay Bailey🔸 @ 2026-04-08T02:45 (+5)

Thanks for the post! For those reading, I'm Jay — head of technology and standards for the Inspect Evals repo, and the reviewer of Declan's PR. I happened to spot this post without realising it was from a recent contributor! A couple of quick clarifications around the structure of Inspect Evals (it is pretty confusing):

Inspect Evals != Inspect, and they aren't run by the same team. Inspect is the evals framework; Inspect Evals is a repository of evals that use that framework. Inspect Evals is run by Arcadia Impact, and we're contracted by the UK AISI to maintain it.

Our developers work remotely as contractors, so moving to London isn't required. I'm in Australia at the moment. (though we're doing a restructure atm and it's uncertain how that's going to pan out, so I'm not sure about our current hiring)

I think the EA Hotel people have a point about evaluations, personally. I think if you're going to open-source evaluations, you should ask "Would I be okay if frontier AI companies trained on this / hill-climbed on this metric?" For frontier maths evals you might not want this - is it worth the increased knowledge we get about these capabilities? For moral reasoning under uncertainty, you may actively want them to do something like this.

Finally, I'm glad you liked the agent stuff - that's been a majority of where my time's gone this quarter. Appreciate the feedback, and more is always welcome :)