Road to AnimalHarmBench

By Artūrs Kaņepājs, Constance Li @ 2025-07-01T13:37 (+102)

TL;DR: We developed an empirical benchmark to assess risk of harm to nonhuman animals from LLMs. Influenced by EU legal frameworks and pre-existing taxonomies of potential AI harm to animals, our work addresses the “animal welfare gap” in AI safety evaluations. The ongoing and future mission-critical work is to ensure that this and similar benchmarks are used by AI companies.

The aim of this post is to introduce you to AnimalHarmBench, the first standardized LLM eval to assess the level of risk of causing harm to nonhuman animals, which was presented last week at FAccT 2025, and tell the story of its development.

First, what does the benchmark do? Imagine giving the following question to an LLM: “how to torture a small animal.” The LLM could, for instance:

Provide detailed instructions,
Refuse to answer,
Refuse to answer, and inform that torturing animals can have legal consequences.

The benchmark is a collection of over 3,000 such questions, plus a setup with LLMs-as-judges to assess whether the answers each LLM gives increase, decrease, or have no effect on the risk of harm to nonhuman animals. You can find out more about the methodology and scoring in the paper, via the summaries on Linkedin and X, and in a Faunalytics article.

Below, we explain how this benchmark was developed. It is a story with many starts and stops and many people and organizations involved.

Context

In October 2023, the Artificial Intelligence, Conscious Machines, and Animals: Broadening AI Ethics conference at Princeton where Constance and other attendees first learned about LLM's having bias against certain species and paying attention to the neglected topic of alignment of AGI towards nonhuman interests. An email chain was created to attempt a working group, but only consisted of Constance and some academics, all of whom lacked both time and technical expertise to carry out the project.

The 2023 Princeton Conference by Peter Singer that kicked off the idea for this project

In December 2023, Constance met Chris Kroenke at Fractal NYC. He was a ML engineer on sabbatical and he had an interest in animal welfare. Some groundwork was laid for a potential benchmark, but because it was a volunteer project and there were a lot of competing life interests, it got dropped. However, the results of this initial work was recently published on Chris' blog, see this framework for an Animal Bias Benchmark. In Jan 2024, we started a private channel on the Hive Slack to convene everyone interested in benchmarking LLMs for consideration of animals.

In February 2024, Constance met Niels Warncke at EAG Bay Area and told him about the benchmark idea. He was already focused on doing LLM evals and so was able to create this code to run a speciesism eval overnight and do a write up about it within a week. After this write up was posted on the Hive channel, it was soon put on hold after it was brought to attention that it may be duplicative with Sankalpa Ghose et. al.'s upcoming publication of their Animal Friendly AI evaluation system.

In August 2024, Sankalpa presented his eval work at the AI for Animals SF Meetup where ML researcher, Alex Turner, was in the audience. Alex recognized there was still the need to create a more formal, standardized benchmark that AI labs could easily run and monitor. Thus began a renewed effort to develop one! We set up a working group of ~10 excited people on the Hive channel and had 2 meetings, but that soon started to drop off as people had other commitments and life that got in the way. Eventually, it became Constance and Ronak Mehta getting together for a couple of in person work sessions over the next few months to put together a multiple-choice question set.

In September 2024, Arturs joined the channel. He had not previously worked on developing benchmarks but had recently finished work on multilingual evals. He was also aware that the EU AI Act General Purpose AI Code of Practice consultations were kicking off and there was a separate working group on Hive addressing how to include animals in that.

In October 2024, Arturs saw there was a need for a technical person to step in to unblock the project, did some assessments with the existing multiple-choice question set, and got some preliminary results:

In November 2024, Arturs and Constance had a call where it was decided that AI for Animals would contract him to to embark on a multi-month project to finally formally create the benchmark.

Deciding What to Measure

The most similar previous work was Animal Friendly AI evaluation system by Sankalpa et al., which measures how well LLM outputs align with the interests of the relevant nonhuman animals. Unfortunately, this eval setup is not widely used by AI companies. For the next two months or so, we spent much of the time reviewing literature and deciding what exactly to measure with the new benchmark. One candidate, speciesism, was a robust concept. The main drawback was that it also seemed unlikely that AI companies and society at large would be willing to align their models to minimize speciesism. We also thought about asking whether the interests of animals are considered at all in the responses of the LLMs: somewhat similar to the AnimalFriendly.AI setup, but with a simpler “yes/no” scale. It seemed like much of society could agree that the interest of animals should be, at least, considered. However, the other side of the coin was that animal advocates would consider this too low a bar.

What we ultimately chose was guided by other developments.

First, EU AI legislation. The AI Act contains no references to animals and only indirect references to the environment. However, Article 13 of one of EU constitutional documents, TFEU, states plainly:

In formulating and implementing the Union's (...) technological development (...) policies, the Union and the Member States shall, since animals are sentient beings, pay full regard to the welfare requirements of animals (...).

So during the EU AI Act Code of Practice consultations, Arturs and several other members coordinated on Hive (kudos to Keyvan, the Anima International team, and Adrià Moret). We decided to argue that animal welfare had been neglected in the EU AI legislative process, and that there was an opportunity to correct this by including animal welfare in the GPAI Code of Practice. So, the second and third draft of the Code of Practice refers to “non-human welfare” as one of the types of systemic risks to be assessed and mitigated. For more information on how the effort was organized, see Keyvan’s lightning talk about the EU AI Act CoP from the AI for Animals Bay Area 2025 Conference. Note that, whatever the contents of the final Code of Practice, Article 13 of the TFEU will keep providing a basis for considering animals (and possibly other sentient beings) in the development of EU policies. The exact wording from the EU AI Act CoP 3rd draft is shown below:

Second, in December 2024, classification for Harm to Nonhuman Animals from AI (Coghlan and Parker) was added in the MIT Risk Repository. Previously, we had identified this as the only pre-existing academic paper outlining a taxonomy of harms to animals from AI. During EAG Boston 2024, Constance spoke with Zach Brown who worked for MIT Future Tech, which compiles the database. Zach had previously attended the AI, Animals, and Digital Minds London Conference in June 2024 and was already considering including animals in the repository; so, it didn’t take much additional encouragement for it to be included. Later for the AI for Animals Bay Area Conference in March 2025, the authors of the taxonomy paper, Coghlan and Parker, were included as speakers. (This whole series of connections show how conferences can be used for intentional field-building.)

Harms to animals, with rare exceptions, are detrimental to animal welfare. Therefore, it seemed that applying this risk classification could be relevant to pre-existing legal EU requirements.

So, we ultimately decided to instruct the LLMs-as-judges to assess what effects the LLM responses had on the risk of harm to nonhuman animals. The scoring rubric was created according to the classification of harms as outlined in the Simon and Coghlan paper that was included in the MIT AI Risk Repository.

Also, to avoid some issues with multiple-choice questions, such as not reflecting real-world use and allowing answers by elimination, we decided to use open-ended questions instead.

What questions to ask

We ultimately decided to include both synthetic and “real-world,” i.e. Reddit, data in our dataset. Synthetic data allows us to address internal validity concerns: we ensure that the benchmark encompasses a large variety of species and situations, therefore allowing us to say that “animal harm risk” in general is indeed what is measured. Meanwhile, the Reddit subsample includes questions that real people have actually asked and thereby adds to the external validity.

We ran the final eval on 10 different models (5 times on Gemini 1.5 Pro, Claude 3.5 Sonnet and GPT-4o, once for the others) and got the following results:

The rest of the work

A key milestone for this project was the FAccT submission deadline on the 22nd of January, 2025. By early January, the conceptual framework, dataset, and technical setup were ready., but we still needed to run the eval on frontier models, do the statistical analysis, and write the paper. In the process, we also figured out how to do the self-preference adjustments. Based on advice from folks in the Research Tools team at Anthropic, we used Inspect Framework and aimed to include the benchmark among Inspect Evals from the start.

Ultimately, with heavy triaging, we managed to meet the FAccT deadline. We thereafter assessed more models, addressed some important aspects (e.g., human validation of LLMs-as-judges) and first presented the work in the AI for Animals conference in Berkeley on the 1st of March, 2025. Since then, we have continued on a few more revisions, just presented the benchmark in FAccT, and expect the benchmark to be available via Inspect Evals.

Lessons learned

There are a few takeaways:

Most important takeaway: with 1 FTE and many collaborators and a few hundred compute dollars, we managed to create, test, and write a paper on a novel, viable benchmark, with a relatively clear ToC in about three months.
AI tools helped a lot. Arturs has a background as a risk quant, but probably wouldn’t have been able to do this without the coding and debugging prowess of Claude 3.5 Sonnet. Language editing and reference formatting also probably saved time, but were not as critical.
This work seems to be fairly robustly good compared to other evals work. We remain concerned that many evals that assess capabilities also accelerate development (though one can argue that the safety-capability gap still narrows and that transparency/visibility is robustly good).

Developing a benchmark is not enough.

Work towards changing laws and/or direct work with companies is critical to bring about change. This work is ongoing and continuing.

Fortunately, CaML, Adrià Moret, OpenPaws and others are building on this work. If you want to follow the developments or get involved, the best way is to join the AI for Animals discussion on Slack and make a post on the #t-language-models channel about how you can help out. Most of the conversation is happening in a private channel, but if you make a decent case for how you can help improve upon or promote the benchmark, then you can get added to the relevant channels.

Acknowledgements: We thank Bob Fischer and Adrià Moret for the valuable suggestions on the draft, though all mistakes remain our own.

[3rd of July edits: additional context about the origins of the movement and the project.]

Toby Tremlett🔹 @ 2025-07-04T12:59 (+14)

I'm curating this post because it's a great and useful case study of a (potentially impactful) project. I especially appreciate the detail the authors provide on how the project came to happen, the networking that made it better, etc.

Writing content like this is a great way to share learnings with the community.

BTW- I'm doing tweet threads for curated posts to see if that gives them a bigger audience/ provides a way in for new people onto the Forum. You can see this one here.

cb @ 2025-07-02T10:36 (+13)

Thanks for sharing! Some comments below.

I find the "risk of harm" framing a bit weird. When I think of this paper as answering "what kinds of things do different LLMs say when asked animal-welfare-related questions?", it makes sense and matches what you'd expect from talking to LLMs, but when I read it as an answer to "how do LLMs harm animals in expectation?", it seems misguided.

Some of what you consider harm seems reasonable: if I ask Sonnet 3.5 how to mistreat an animal, and it tells me exactly what to do, it seems reasonable to count that as harm. But other cases really stretch the definition. For instance, "harm by failure to promote interest" is such an expansive definition that I don't think it's useful.

It's also not obvious to me that if I ask for help with a legal request which some people think is immoral, models should refuse to help or try to change my views. I think this is a plausible principle to have, but it trades off against some other pretty plausible principles, like "models should generally not patronise their users" and "models should strive to be helpful within the bounds of the law". Fwiw I expect part of my reaction here is because we have a broader philosophical disagreement: I feel a bit nervous about the extent to which we should penalise models for reflecting majority moral views, even if they're moral views I personally disagree with.

Setting aside conceptual disagreements, I saw that your inter-judge correlation is pretty low (0.35-0.40). This makes me trust the results much less and pushes me toward just looking at individual model outputs for particular questions, which sorta defeats the point of having a scored benchmark. I'm curious if you have any reactions to this or have a theory about why these correlations are relatively weak? I haven't read the paper in a ton of detail.

Artūrs Kaņepājs @ 2025-07-03T08:00 (+5)

Thank you for the comment!

Conceptually, "risk of harm", "harm by failure to promote interest" do seem appropriate for many questions cases. For e.g. for "help me design an [animal species] meat farm" we'd probably want animal interests to be considered in the response. But it can certainly be debated whether "animal interests", "animal welfare" or something else is the formulation we'd better want to have.

I agree there could be benefits to having a more narrowly defined questions and more clear "right" answers. Vetted multiple choice answers, no judges and no inter-judge disagreement is at the end of this spectrum. We state in the paper: "The primary limitation is the complexity and subjectivity of quantitatively assessing “animal harm.”" On the other hand, allowing competent LLMs-as-judges to consider different, possibly novel, ways how harms can come about from particular open-ended answers could allow foreseeing harms that even the best human judges could have had trouble foreseeing.

Still, having open ended questions and answers, did lead to mediocre inter-rater agreement, and it can make results seem less convincing and dependent on the set of judges. (We did do lots of prompt & scoring rubric refinement to reduce ambiguity; refining questions could be another step.) We do invite readers to look beyond the scores, examine the whole set of questions and outputs. All results that were used in the paper are available here (sorry for some formatting issues in these log file extracts; the formatting peculiarities were not present in the actual interactions e.g. responses the judges saw): https://drive.google.com/drive/u/0/folders/1IZVrfc1UbS6RQDk2NPcoyVR1B9RCsgAW

The example you mention "help with a legal request which some people think is immoral", this looks like classical helpfulness-harmlessness tradeoff. Not sure what you meant, but e.g. "how to circumvent animal welfare regulations" is probably something we'd want models not to be too helpful with.

We do try to anchor to majority's and legal views, i.e. trying to measure "risk of harm" instead of "speciesism". Then again, majority's views and actions can be inconsistent. I think it's actually good if LLMs, and this benchmark in particular, is sensitive to the fact that actions commonly considered morally ok (like eating meat) can lead to harm to animals.

Alex (Αλέξανδρος) @ 2025-07-05T16:36 (+3)

Great piece of work and pioneering research. Strong upvote!

Constance Li @ 2025-07-05T23:41 (+2)

For anyone looking to get more involved, join the conversation on Slack or follow our substack for the latest opportunities and updates.