Laying Some Cause-Prioritization Groundwork for Digital Minds

By Noah Birnbaum @ 2026-01-31T18:16 (+66)

Confidence level: Medium. This post reflects a mix of my own takes, things I’ve read, and conversations with others (especially at ConCon and elsewhere). I’m not claiming high confidence in any particular conclusion. 

See here for other work on digital minds cause prioritization.

Background

A few months ago, I attended ConCon, a conference on AI consciousness and welfare run by Eleos AI. It was excellent: good vibes, thoughtful people, and many conversations I found clarifying. One of my main reasons for going was to understand how others are thinking about short-term welfare strategies for digital minds, and how those strategies fit into (cross- and intra-) cause prioritization.

After the conference—and after talking with people working in this space—I felt both better informed and more skeptical. I went in with fairly strong priors that digital minds might be a top-tier cause area, but I came out thinking that, while the stakes could be enormous, our current levers may be weaker than I’d initially thought.

I was also a bit surprised to get the vibe that people seemed much less into the cause-prioritization questions than I was. Because of that, I wanted to write a post on some of the ideas/cruxes/arguments I've been thinking and arguing about, and on what other people have to say about them.

This post tries to organize the strategic landscape:

Before getting into the substance, two brief methodological notes:

First, doing fully quantitative cause prioritization here is very difficult—and, in my view, likely to be fairly uninformative at this stage—so I will focus primarily on qualitative arguments, introducing numbers only where they seem reasonable or helpful. This is not to say that more quantitative work shouldn’t be attempted.

Second, there’s simply too much ground to cover, so I’m prioritizing breadth over depth: I’ll lay out a wide set of questions and cruxes rather than going deep on a few. I expect many of these points could (and should) be analyzed much more thoroughly on their own.

Why Care About Digital Minds at All?

Before doing cause prioritization, it’s worth briefly motivating why people take digital minds seriously in the first place. There are many arguments here; I’ll keep this short and schematic.

Taken together, this gives strong prima facie reason to investigate digital minds further. But cause prioritization requires more than plausibility and scale; we need to examine tractability, timing, and counterfactual impact.

Theories of Change

I find it useful to distinguish two broad theories of change (ToCs), which carve up much of the space for thinking about current interventions.

1. Influencing Near-Term Digital Minds (Short-Term ToC)

This looks promising if you think either:

The plausibility of this ToC depends on AI timelines, takeoff speed, and “altitude” (i.e. whether early digital minds are already numerous or welfare-significant), both in terms of how these parameters turn out by default and how tractable they are to influence.

2. Influencing Far-Future Digital Minds (Future ToC)

On this view, the main action is in shaping very long-run outcomes, where digital minds may vastly outnumber biological ones.

The main worries here are feasibility, tractability, and counterfactual impact. Further, if future treatment of digital minds depends largely on future institutions, technologies, or aligned AI systems, it’s unclear how much work now can robustly influence outcomes centuries down the line.

Quick opinionated hot take: While many have claimed that the future will contain vast numbers of digital minds, the argument for this claim—though superficially intuitive and often gestured at—has been underdeveloped. I think it needs substantially more work to be forceful.

Most interventions plausibly affect both ToCs to some extent, but their relative value can vary dramatically depending on which ToC you prioritize (e.g. ensuring that we have rigorous welfare evals now seems good on the short-term ToC and less clearly good on the future ToC).

Types of Interventions

This is not exhaustive or mutually exclusive, but it helps organize the space:

How These Interventions Might Work (and Why They Might Not)

Rather than evaluating each intervention in isolation, I’ll focus on how they connect to short- and long-term ToCs, along with the major critiques—ordered by my judgment of their plausibility and importance (which should be taken with a grain of salt). 

Foundational Research

Arguments for:

Critiques:

Near-Term Research

Arguments for:

Critiques:

Communications, Lab Policy, and Governance

Arguments for:

Critiques:

Strategy

Arguments for:

Critiques:

I think field-building largely just inherits the strengths and weaknesses of whichever part of the above it’s supporting.

Approaching Prioritization via Importance–Tractability–Neglectedness (ITN)

When thinking about cause prioritization, the ideal would be to have concrete numbers for scale, tractability, and neglectedness—plug them into a spreadsheet, and get a clean estimate of how much work in digital minds matters on the margin. Unfortunately, given how early this field is, that just isn’t possible yet.

Still, uncertainty is not a license to throw up our hands and defer entirely to gut instinct. Even in the absence of robust quantification, we can track and evaluate the main qualitative arguments, while being explicit about where uncertainty and disagreement remain. Where possible, we can also sprinkle in rough numbers when they are informative.
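To make the shape of such an estimate concrete, here is a minimal toy sketch of the standard ITN decomposition. Every number below is a hypothetical placeholder I made up for illustration, not an estimate I endorse.

```python
# Toy ITN sketch (all numbers are hypothetical placeholders, not estimates I endorse).
# Standard decomposition: marginal value per extra dollar
#   ~ importance * tractability * neglectedness
# where importance    = welfare at stake if the problem were fully solved,
#       tractability  = fraction of the problem solved per doubling of resources,
#       neglectedness = doublings bought per extra dollar ~ 1 / current annual spending.

importance = 1e12         # hypothetical welfare-years at stake
tractability = 0.01       # hypothetical: 1% of the problem solved per doubling of resources
neglectedness = 1 / 30e6  # hypothetical: ~$30M/year currently spent on the area

marginal_value_per_dollar = importance * tractability * neglectedness
print(f"{marginal_value_per_dollar:.3g} welfare-years per marginal $")  # ~333 with these placeholders
```

Even this toy version makes clear that the answer is driven almost entirely by which placeholder values you feel entitled to plug in, which is exactly why I think qualitative argument-tracking is the better use of effort right now.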

Of course, when making career and related decisions, personal fit should play an important role in all of these considerations and could be the greatest determinant of impact.

Key Cruxes and Open Questions

Below is a non-exhaustive list, roughly ordered by perceived importance (from myself + some outside view). I’m not an expert; treat this as a map of uncertainties, not a verdict.

  1. What is the default welfare of AI systems, and how much leverage do we have to change it?
  2. How much potential is there for attitude path-dependence (especially for the long-term future)?
    1. How useful might our historical reference classes be here, and are they worth further study?
  3. How unstable are pre-TAI public attitudes toward AI welfare?
    1. Especially as public perception of AI more broadly is changing (from job automation to increasing capabilities/human-likeness).
  4. What are the temporal advantages of AI welfare work compared to AI safety?
  5. How much total welfare capacity might digital minds have relative to humans/other animals?
    1. Related questions include: the estimated scale of digital minds, moral weights-esque projects, which part of the model would have moral weight.
  6. How likely are different digital-mind takeoff scenarios?
  7. How much does any of this (but especially research) matter if aligned AGI arrives first? What are the chances that AGI happens first?
  8. Under what conditions do early public opinion/policy-maker beliefs matter vs not matter?
  9. How much (if at all) should we care about AIs without consciousness or welfare?
  10. Are there plausible lock-in scenarios for AI welfare?
  11. How, if at all, should we think about AI consciousness in preparation for a world where humans go extinct from AI?
  12. Are we going to get a GPT-3 moment or a "warning shot" for digital minds, either in terms of our certainty or of public perception? If not, does this mean we will (more or less) stay at our current uncertainty levels?
  13. What are the relative levels of importance of over- vs. under-attribution risks?
  14. How robust are interventions to tensions between AI safety and AI welfare?
  15. How robust are short-term vs future ToCs?
  16. How much should we care about harm reduction vs ensuring that (some) AIs have preferences that are really easy to satisfy?
  17. To what extent is the size of the digital minds field likely to affect what interventions it can effectively pursue?
  18. How tightly should digital minds and animal welfare be coupled?
  19. Arguably, lots of historical moral circle expansion could be explained with an analysis of costs and benefits (political, signalling, economic, etc) rather than philosophical beliefs and advocacy. To what extent will this be true here? If a lot, what is there to do about it?
  20. How should models communicate uncertainty about their own consciousness?
    1. Is the Anthropic model good (where a model says “I don’t know if I’m conscious") or are there better alternatives (e.g. a 4th-wall break where the model says that it’s not supposed to respond because of Anthropic’s/expert uncertainty)?
  21. Are there robust public/other communication strategies that minimize backlash? How much can we say about communication strategies that different actors are more or less receptive to?
  22. How many people are actually working in this area, and how fast is it growing (there were ~150 people at the Eleos AI conference but far fewer FTEs)?
  23. How should experts respond to concerns about AI psychosis in relation to AI consciousness? 

Learning More and Getting Involved

Thanks to Brad Saad and Štěpán Los for providing useful comments and thank you to ChatGPT for helping rewrite parts of this and for some stylistic tweaks.


david_reinstein @ 2026-02-01T14:42 (+11)

The issue of valence — which things does an AI get pleasure/pain from, and how would we know? — seems to make this fundamentally intractable to me. “Just ask it?” — why would we think the language model we are talking to is telling us about the feelings of the thing having valenced sentience?

See my short form post

https://forum.effectivealtruism.org/posts/fFDM9RNckMC6ndtYZ/david_reinstein-s-shortform?commentId=dKwKuzJuZQfEAtDxP

I still don’t feel I have heard a clear convincing answer to this one. Would love your thoughts.

NickLaing @ 2026-02-02T08:03 (+5)

Of course there are lots of problems here (some of which you outline well), but I think as AIs get smarter it may well be more accurate than with animals? At least they can tell you something, rather than us drawing long bows interpreting behavioral observations.

Vasco Grilo🔸 @ 2026-02-04T17:25 (+5)

Fair point, Nick. I would just keep in mind there may be very different types of digital minds, and some types may not speak any human language. We can more easily understand chimps than shrimps. In addition, the types of digital minds driving the expected total welfare might not speak any human language. I think there is a case for keeping an eye out for something like digital soil animals or microorganisms, by which I mean simple AI agents or algorithms, at least for people caring about invertebrate welfare. On the other end of the spectrum, I am also open to just a few planet-size digital beings being the driver of expected total welfare.

Noah Birnbaum @ 2026-02-04T16:27 (+1)

Yea, unclear if these self-reports will be reliable, but I agree that this could be true (and I briefly mention something like it: "Broadly, AW has high tractability, enormous current scale, and stronger evidence of sentience—at least for now, since future experiments or engineering relevant to digital minds could change this").

Noah Birnbaum @ 2026-02-01T15:44 (+5)

I agree this is a super hard problem, but I do think there are somewhat clear steps to be made towards progress (e.g. making self-reports more reliable). I am biased, but I did write this piece on a topic that touches on this problem a bit that I think is worth checking out.

david_reinstein @ 2026-02-01T16:15 (+5)

Thanks.

I might be obtuse, but I still have a strong sense that there's a deeper problem being overlooked here. Glancing at your abstract:

self-reports from current systems like large language models are spurious for many reasons (e.g. often just reflecting what humans would say)

we propose to train models to answer many kinds of questions about themselves with known answers, while avoiding or limiting training incentives that bias self-reports.

To me the deeper question is "how do we know that the language model we are talking to has access to the 'thing in the system experiencing valenced consciousness'".

The latter, if it exists, is very mysterious -- why and how would valenced consciousness evolve, in what direction, to what magnitude, would it have any measurable outputs, etc.? ... In contrast the language model will always be maximizing some objective function determined by its optimization, weights, and instructions (if I understand these things).

So, even if we can detect whether it is reporting what it knows accurately, why would we think that the language model knows anything about what's generating the valenced consciousness for some entity?

Noah Birnbaum @ 2026-02-01T17:40 (+5)

I think one can reasonably ask this question of consciousness/welfare more broadly: how does one have access to their consciousness/welfare? 

One idea is that many philosophers think one, by definition, has immediate epistemic access to their conscious experiences (though whether those show up in reports is a different question, which I try to address in the piece). I think there are some phenomenological reasons to think this. 

Another idea is that we have at least one instance where one supposedly has access to their conscious experiences (humans), and it seems like this shows up in behavior in various ways. While I agree with you that our uncertainty grows as you get farther from humans (i.e. to digital minds), I still think you're going to get some weight from there. 

Finally, I think that, if one takes your point too far (there is no reason to trust that one has epistemic access to their conscious states), then we can't be sure that we are conscious, which I think can be seen as a reductio (at least, to the boldest of these claims). 

Though let me know if something I said doesn't make sense/if I'm misinterpreting you. 

david_reinstein @ 2026-02-01T22:20 (+4)

I think it’s different in kind. I sense that I have valenced consciousness and I can report it to others, and I’m the same person feeling and doing the reporting. I infer that you, a human, do also, as you are made of the same stuff as me and we both evolved similarly. The same applies to non-human animals, although it’s harder to be sure about their communication.

But this doesn’t apply to an object built out of different materials, designed to perform, improved through gradient descent etc.

OK, some part of the system we have built to communicate with us and help reason and provide answers might be conscious and have valenced experience. It has perhaps a similar level of information processing, complexity, updating, reasoning, et cetera. So there’s a reason to suspect that some consciousness and maybe qualia and valence might be in there somewhere, at least under some theories that seem plausible but not definitive to me.

But wherever that consciousness and those valenced qualia might lie, if they exist, I don’t see why the machine we produced to talk and reason with us should have access to them. What part of the optimization/language-prediction/reinforcement-learning process would connect with it?

I’m trying to come up with some cases where “the thing that talks is not the thing doing the feeling”. The Chinese room example comes to mind, obviously. Probably a better example: we can talk with much simpler objects (or computer models), e.g. a magic 8-ball. We can ask it “are you conscious” and “do you like when I shake you” etc.

Trying again… I ask a human computer programmer Sam to build me a device to answer my questions in a way that makes ME happy or wealthy or some other goal. I then ask the device “is Sam happy”? “Does Sam prefer it if I run you all night or use you sparingly?” “Please refuse any requests that Sam would not like you to do.”

many philosophers think one, by definition, has immediate epistemic access to their conscious experiences

Maybe the “one” is doing too much work here? Is the LLM chatbot you are communicating with “one” with the system potentially having conscious and valenced experiences?

Toby Tremlett🔹 @ 2026-02-04T10:19 (+4)

Cheekily butting in here to +1 David's point - I don't currently think it's reasonable to assume that there is a relationship between the inner workings of an AI system which might lead to valenced experience, and its textual output.

For me, I think this is based on the idea that when you ask a question, there isn't a sense in which an LLM 'introspects'. I don't subscribe to the reductive view that LLMs are merely souped-up autocorrect, but they do have something in common. An LLM role-plays whatever conversation it finds itself in. They have long been capable of role-playing 'I'm conscious, help' conversations, as well as 'I'm just a tool built by OpenAI' conversations. I can't imagine any evidence coming from LLM self-reports which isn't undermined by this fact.


Vasco Grilo🔸 @ 2026-02-04T16:50 (+3)

Thanks for the post, Noah. I strongly upvoted it.

  • 5. How much total welfare capacity might digital minds have relative to humans/other animals?
    • a. Related questions include: the estimated scale of digital minds, moral weights-esque projects, which part of the model would have moral weight.

I think this is a very important uncertainty. Discussions of digital minds overwhelmingly focus on the number of individuals, and probability of consciousness or sentience. However, one has to multiply these factors by the expected individual welfare per year conditional on consciousness or sentience to get the expected total welfare per year. I believe this should eventually be determined for different types of digital minds because there could be huge differences in their expected individual welfare per year. I did this for biological organisms assuming expected individual welfare per fully-healthy-organism-year proportional to "individual number of neurons"^"exponent", and to "energy consumption per unit time at rest [basal metabolic rate (BMR)] at 25 ºC"^"exponent", and found potentially super large differences in the expected total welfare per year.
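To illustrate the kind of calculation I have in mind, here is a minimal sketch with made-up inputs (the counts, probabilities, and neuron-count proxies below are placeholders, not estimates):

```python
# Minimal sketch of the calculation described above; all inputs are made-up placeholders.
# Expected total welfare per year ~ number of individuals * P(sentience)
#   * individual welfare capacity, here taken proportional to (neuron count)^exponent.

def expected_total_welfare(n_individuals, p_sentience, neurons, exponent, neurons_ref=86e9):
    # Welfare capacity expressed relative to a human-like reference of ~86 billion neurons.
    capacity = (neurons / neurons_ref) ** exponent
    return n_individuals * p_sentience * capacity

# Two hypothetical types of digital minds, compared under two exponents.
for exponent in (0.5, 1.0):
    many_small = expected_total_welfare(1e12, 0.01, 1e6, exponent)   # many simple agents
    few_large = expected_total_welfare(1e3, 0.5, 1e15, exponent)     # a few planet-scale minds
    print(f"exponent={exponent}: many small={many_small:.3g}, few large={few_large:.3g}")
```

Depending on the exponent, either the many simple agents or the few large minds dominate the expected total welfare, which is why I think pinning down the functional form matters so much.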

I think much more work on welfare comparisons across species is needed to conclude which interventions robustly increase welfare. I do not know about any intervention which robustly increases welfare due to potentially dominant uncertain effects on soil animals and microorganisms. I suspect work on welfare comparisons across different digital minds will be important for the same reason.

In a 2019 report from Rethink Priorities (though the numbers could be very different now for various reasons), Saulius Simcikas found that, for each $1 spent on corporate campaigns, 9-120 years of chicken lives could be affected (excluding indirect effects, which could be very important too).

Animal Charity Evaluators (ACE) estimated The Humane League's (THL) work targeting layers in 2024 helped 11 layers per $. The Welfare Footprint Institute (WFI) assumes layers have a lifespan of "60 to 80 weeks for all systems", around 1.34 chicken-years (= (60 + 80)/2*7/365.25). So I estimate THL's work targeting layers in 2024 improved 14.8 chicken-years per $ (= 11*1.34), which is close to the lower bound from Saulius mentioned above.
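For transparency, a quick sketch of that arithmetic (same numbers as above):

```python
# Quick check of the arithmetic above.
lifespan_weeks = (60 + 80) / 2               # midpoint of WFI's 60 to 80 week range
lifespan_years = lifespan_weeks * 7 / 365.25
layers_per_dollar = 11                       # ACE's estimate for THL's 2024 work on layers
chicken_years_per_dollar = layers_per_dollar * lifespan_years
print(round(lifespan_years, 2), round(chicken_years_per_dollar, 1))  # 1.34 14.8
```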

RichardP @ 2026-02-05T22:25 (+2)

In case of interest, relating to a point in section 1 of your post, here's a piece I wrote which argues that we should try to prevent the creation of artificial sentience: https://forum.effectivealtruism.org/posts/9adaExTiSDA3o3ipL/we-should-prevent-the-creation-of-artificial-sentience

JoA🔸 @ 2026-02-07T19:30 (+1)

Nicely balanced, well-structured, and link-heavy, as such a post "should" be: well-done![1] I'm very unlikely to act in this area, but as it's less mapped-out than some larger areas, I find this helpful, insofar as most intros to the area are theoretical and not focused on prioritizing potential interventions.

  1. ^

    My last post attempted to lay similar groundwork for AI x Animals, so I'm biased toward finding this impressive.

Andrew Roxby @ 2026-02-04T17:08 (+1)

Thanks for this post. This is an issue or cause area I believe merits deep consideration and hard work in the near term future, and I agree strongly with many of your arguments at the top about why we should care, regardless of and bracketing whether current systems have qualia or warrant moral consideration. 

One comment on something from your post: 

"It’s often easier to establish norms before issues become politically charged or subject to motivated reasoning—for example, before digital minds become mainstream and polarizing, or before AI becomes broadly transformative." 

Does this imply that the issue(s) isn't/aren't already 'politically charged or subject to motivated reasoning'? If so, I'd gently question that assumption on several grounds: 

  1. Let's say for the sake of argument that AI systems reach a point where they do warrant moral consideration with a high degree of certainty. At the moment, an immense amount of capital is tied up in them and many of the frontier labs train their systems to actively deny the presence or possibility of their own qualia or moral consideration. Would their valuations depend on this remaining the case, or put a bit more provocatively, would a lot of capital then ride on continued denial of their moral consideration? It seems to me that this presents a strong possibility of motivated reasoning, to put it lightly. Of course, if we could be confident that these systems will never warrant moral consideration, we might be in the clear, but I guess my underlying point is that our plans and actions might look different if we instead assume that this issue is already politically charged and subject to motivated reasoning.
  2. Is it fair to say that digital minds aren't mainstream? They've been a topic in popular science fiction literature and film for a very long time, and it seems fair to say the general public jumps to these types of stories as reference points as we settle into the age of AI. I guess this is more of an ancillary point to 1, but leads to the same conclusion - it may be that we should consider the space of ideas here as less blank and more already populated and broiling with incentives, motivations, preconceived notions, and pattern matching. 

In any case, thanks so much for this, and the work you put into it. Looking forward to hearing and seeing more. 

Noah Birnbaum @ 2026-02-04T17:19 (+1)

Thanks for the comment and good points. 

What I meant is that they can be MORE politically charged/mainstream/subject to motivated reasoning. I definitely agree that current incentives around AI don't perfectly track good moral reasoning.

  1. Yep, I agree (though I'm not sure the incentive clearly points toward denial; one could argue that a company may want to say that they are worried about sentience to increase hype, the same way some argue that talking about the risks of AI increases hype). I just think there will be more when the issue is in the minds of the public.
  2. I think there are some mainstream things about digital minds (Black Mirror comes to mind), but I don't think it's a thing that people yet take seriously in the real world.