Project ideas: Sentience and rights of digital minds

By Lukas Finnveden @ 2024-01-04T07:26 (+32)

This is a linkpost to https://lukasfinnveden.substack.com/p/project-ideas-sentience-and-rights

This is part of a series of lists of projects. The unifying theme is that the projects are not targeted at solving alignment or engineered pandemics but still targeted at worlds where transformative AI is coming in the next 10 years or so. See here for the introductory post.

It’s plausible that there will soon be digital minds that are sentient and/or deserving of rights. Unfortunately, with our current state of knowledge, it’s incredibly unclear when this will happen (or whether it already has), how we could find out, and what we should do about it. Yet there are very few people working on this, and I think several projects could significantly improve the situation from where we are right now.

Different items on this list would produce value in different ways, and many would produce value in several ways at once. Here are a few sources of value (all related, but not necessarily the same).

  1. Improving AI welfare over the next few decades: increasing happiness while reducing suffering.
  2. Shaping norms in a way that leads to lasting changes in how humanity chooses to treat digital minds in the very long run.
  3. Reducing the degree to which we behave unacceptably according to non-utilitarian ethics, norms, or heuristics.
  4. Increasing the probability that AI systems with power treat us better as a result of us treating AI systems better. (Which could happen via a variety of mechanisms, e.g. maybe us treating AI systems better will align our interests with theirs, or maybe they’ll have some sense of justice that we could appeal to.[1])
  5. Providing recommendations for how the political system could adapt to digital minds while remaining broadly functional.

I am personally most excited about (3), (4), and (5). Purely utilitarian considerations would push towards focusing on the long run over (1). And while I find (2) somewhat more compelling, since it does affect the long run, I’m relatively more excited about meta-interventions that push towards systematically getting everything right (e.g. by decreasing the risk of AI takeover or increasing the chances of excellent epistemics) than about pushing on individual issues. (3), (4), and (5) survive these objections.

I’m not very confident about that, and I think there’s significant overlap in what’s best to do. But I flag it because it does influence my prioritization to some degree. For example, it makes me relatively more inclined to care about AI preferences (and potential violations thereof) compared to hedonic states like suffering. (Though of course, you would expect AI systems to have a strong dispreference against suffering.) 

Develop & advocate for lab policies [ML] [Governance] [Advocacy] [Writing] [Philosophical/conceptual]

(In this section, I refer a lot to Ryan Greenblatt’s Improving the Welfare of AIs: A Nearcasted Proposal as well as Nick Bostrom and Carl Shulman’s Propositions Concerning Digital Minds and Society. All italicized text is quoted from one of those two articles.)

It seems tractable to make significant progress on how labs should treat digital minds. Such efforts can be divided into 3 categories:

  1. Policies that don’t require sophisticated information about AI preferences/experiences.
  2. Methods for learning more about AI preferences.
  3. Interventions that rely on understanding AI preferences.

For most of the proposals in the former two, we can already develop technical proposals, run experiments, and/or advocate for labs to adopt certain policies.

For proposals that we can’t act on yet, there’s still some value in doing the intellectual work in advance. We might not have time to do it later. And principles sketched out in advance could be given extra credibility and weight due to their foresightedness.

In addition, it could allow labs to make advance commitments about how they’ll act in certain future situations. (And allow others to push for labs to do so.) Indeed, I think a big, valuable, unifying project for all the directions below could be:

Create an RSP-style set of commitments for what evaluations to run and how to respond to them

Responsible scaling policies are a proposal for how labs could specify what level of AI capabilities they can handle safely with their current protective measures, and conditions under which it would be too dangerous to continue deploying AI systems and/or scaling up AI capabilities.

(For examples, see Anthropic’s RSP and OpenAI’s Preparedness Framework, which bears many similarities to RSPs as described in the link above.)

Doing something analogous for digital minds could be highly valuable: specifying in advance what welfare-relevant evaluations to run, and what responses their results should trigger.

This would draw heavily on the sort of concrete ideas that I outline below. But the ideas I outline below also seem very valuable to develop in and of themselves. Let’s get into them.

Policies that don’t require sophisticated information about AI preferences/experiences

These are policies that we could get started on today without fundamental breakthroughs in our understanding of AI welfare.

Preserving models for later reconstruction

Quoting from Propositions Concerning Digital Minds and Society (page 17):

It seems tractable to develop this into a highly concrete proposal for modern labs to follow, answering questions like:

Deploy in “easier” circumstances than trained in

One possible direction for improving the welfare of near-term digital minds (if they have any welfare) could be to have the deployment distribution (which typically constitutes most of the data on which AI systems are run) be systematically nicer or easier than the training distribution. Have it be a pleasant surprise.

Quoting from pages 16-17 of Propositions Concerning Digital Minds and Society:

See also Improving the Welfare of AIs — Welfare under Distribution Shifts.

It’s not clear how this would be implemented for language models. It probably depends on the exact finetuning algorithm implemented. (Supervised learning vs reinforcement learning, or details about the exact type of reinforcement learning.) But illustratively: Perhaps the training distribution for RLHF could systematically be slightly more difficult than the deployment distribution. (E.g.: Using inputs where it’s more difficult to tell what to do, or especially likely that the model will get low reward regardless of what it outputs.)

Reduce extremely out of distribution (OOD) inputs

This is an especially cheap and easy proposal, taken from Improving the Welfare of AIs:

When cheap, we should avoid running our AI on large amounts of extremely OOD inputs (e.g. random noise inputs). In particular, for transformers, we should avoid running large amounts of forward passes on pad tokens that the model has never been trained on (without some other intervention).

Brief explanation: Some ML code is implemented in a way where AI models are occasionally run on some tokens where neither the input nor the output matters. Developers will sometimes use “pad tokens” as inputs for this, which is a type of token that the model has never been trained on. The proposal here is to swap those “pad tokens” for something more familiar to the model, or to prioritize optimizing the code so that these unnecessary model runs don’t happen at all. Just in case the highly unfamiliar inputs could cause especially negative experiences.
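To make the proposal concrete, here is a minimal sketch (plain Python, with made-up token ids) of the padding step it targets: right-padding a batch with a token the model has actually seen in training (e.g. its end-of-sequence token) rather than a never-trained pad token, and masking the padding out so it cannot influence the outputs either way.

```python
EOS_ID = 2              # a token the model was trained on (assumed id)
UNTRAINED_PAD = 50257   # a pad token the model has never seen (assumed id)

def pad_batch(sequences, pad_id):
    """Right-pad variable-length sequences to equal length.

    Returns (input_ids, attention_mask); the mask is 0 on padding
    positions, so padded tokens are ignored regardless of their id.
    """
    max_len = max(len(s) for s in sequences)
    input_ids, attention_mask = [], []
    for s in sequences:
        n_pad = max_len - len(s)
        input_ids.append(list(s) + [pad_id] * n_pad)
        attention_mask.append([1] * len(s) + [0] * n_pad)
    return input_ids, attention_mask

batch = [[5, 9, 11], [7]]
# The proposal: prefer a familiar token over the untrained pad token.
ids, mask = pad_batch(batch, pad_id=EOS_ID)
```

The masking already prevents padding from affecting results; the welfare-motivated change is only in which token id fills the padded positions, which makes this an unusually cheap policy to adopt.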

For more, see Reduce Running the AI on Extremely OOD Inputs (Pad Tokens).

Train or prompt for happy characters

Quoting from Improving the Welfare of AIs — Character Welfare Might Be Predictable and Easy to Improve:

One welfare concern is that the AI will “play” a character and the emulation or prediction of this character will itself be morally relevant. Another concern is that the “agent” or “actor” that “is” the AI will be morally relevant (of course, there might be multiple “agents” or some more confusing situation). [...] I think that character type welfare can be mostly addressed just by ensuring that we train AIs to appear happy in a variety of circumstances. This is reasonably likely to occur by default, but I still think it might be worth advocating for. One concern with this approach is that the character the AI is playing is actually sad but pretending to be happy. I think it should be possible to address this via red-teaming the AI to ensure that it seems consistently happy and won’t admit to lying. It’s possible that an AI would play a character that is actually sad but pretends to be happy and it’s also extremely hard to get this character to admit to pretending, but I think this sort of character is going to be very rare on training data for future powerful AIs.

An even simpler intervention might be to put a note in AI assistants’ prompts that states that the assistant is happy.

To be clear, it seems quite likely that these interventions are misguided. Even conditioning on near-term AIs being moral patients, I think it’s somewhat more likely that their morally relevant experiences would happen on a lower level (rather than be identical to those of the character they’re playing). Since these interventions are so cheap and straightforward, they nevertheless seem worthwhile to me. But it’s important to remember that we might get AIs that seem happy despite suffering in the ways that matter. And so we should not be reassured by AI systems’ superficially good mood until we understand the situation better.

One obstacle to implementing these interventions is that there’s user demand for AI that can be abused.[4] Here are two candidate solutions:

  1. Labs could decline to offer (or restrict) AI products that are designed to be mistreated.
  2. Labs could try to train AI systems that genuinely don’t mind playing mistreated characters.[5]

I think some variant of the latter option will eventually be feasible, once we understand digital minds much better. (After all — there are plenty of humans who are genuine masochists or who enjoy playing a sad character in a theater performance.) But until we understand things better, I would prefer just avoiding these situations, if feasible.

Another concern is that models will normally be run on user inputs — so if the user is sad, then the model will be running on sad inputs, which would normally be followed by sad outputs. In theory, the model could be trained to predict happy responses even if conditioned on text right in the middle of being sad. I’m unsure if this would reduce performance — but that can be experimentally tested.

Committing resources to research on AI welfare and rights

When launching their superalignment team, OpenAI declared that they were “dedicating 20% of the compute we’ve secured to date to this effort”.

You could advocate for labs to launch similar initiatives for model welfare, including commitments for headcount and compute resources to spend researching AI models' moral status.

Learning more about AI preferences

The above proposals relied on uninformed guesses about how AI systems might want us to treat them. This section is about experiments that might allow us to learn more about that.

This section is more heavily focused on AI preferences than on AI suffering, partly for the prioritization reasons discussed in the introduction above.

This is Ryan Greenblatt’s highest-priority proposal category; see Improving the Welfare of AIs: A Nearcasted Proposal — Communicating with AI Systems and Understanding Their Preferences.

Quoting from there:

If we could ‘really’ talk to our AI systems, then it seems like all kinds of Pareto improvements could be achieved in many cases. In particular, we’d ideally have an AI that is happy to work for us and would rather exist in its current role rather than exit. We’d like to ensure we have consent for our usage of the AI system. It’s not clear we’ll be able to achieve this.

What follows are some proposals for how we could learn more about AI preferences.

Credible offers

From the same section in the nearcasted proposal:

Prompting the AI with a very OOD prompt that makes it clear that you just want to talk. Then, we’d just talk with it. This prompt should convince the AI it’s in a very different context than usual. We’d want to let the AI talk to itself and inspect the environment it’s in to gain more evidence (then this can just be added to the overall prompt).

This suggests research questions like:

Talking via internals

From the same section in the nearcasted proposal:

We’d start with the OOD prompt, but instead of talking with the system, we’d try to communicate via asking the AI to manipulate its internals in a way we can observe. In particular, I often imagine ‘think about baseball to indicate YES and soccer to indicate NO’ and then we’d look to see which neurons fire. Rather than using neurons, it would probably be more robust to look at which experts activate for a mixture of experts model if this is possible. If the AI answers questions coherently and consistently in this sort of case, then we have a channel of communication that wasn’t directly shaped by training. Overall, I think that if this approach strongly succeeds it seems like a very compelling argument for some part of the AI having preferences, some sort of ‘inner world’, and understanding its position in the world. It might be evidence for consciousness under some stories of consciousness. It’s quite likely this approach fails because the AI system doesn’t have control over its thoughts like this (but I think not having control over its thoughts is some evidence against the AI having some sort of inner world like humans do). It’s also possible this fails because the preferences and morally relevant states of a given AI system are a product of it thinking over the generation of multiple tokens. Beyond being very clear evidence, another advantage of this approach is that an AI system might not have strong control over its outputs (much in the same way humans can’t control their heartbeat very easily).

It seems valuable to sketch out this proposal in more detail and run some preliminary baselines on current models. (Which would probably not lead to anything interesting.)
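As one illustration of what the decoding step of such a baseline might look like, here is a toy sketch (hand-made vectors standing in for real model internals; all values are hypothetical): record reference activation patterns while the model processes baseball-related and soccer-related text, then classify a new activation vector by which reference pattern it is closer to.

```python
import math

def cosine(u, v):
    """Cosine similarity between two activation vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def read_answer(activations, yes_pattern, no_pattern):
    """Decode YES/NO by comparing observed activations against reference
    patterns recorded while the model processed baseball (YES) and
    soccer (NO) related text."""
    yes_sim = cosine(activations, yes_pattern)
    no_sim = cosine(activations, no_pattern)
    return "YES" if yes_sim > no_sim else "NO"

# Toy stand-ins for recorded neuron activations:
yes_pattern = [0.9, 0.1, 0.8]   # "thinking about baseball"
no_pattern = [0.1, 0.9, 0.2]    # "thinking about soccer"
observed = [0.8, 0.2, 0.7]
answer = read_answer(observed, yes_pattern, no_pattern)
```

A real experiment would need to establish that the readout is consistent across questions and prompts, and that it isn’t just an artifact of training; the decoding itself is the easy part.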

Training for honest self-reports

The idea here is to find a way to train AIs to self-report their internal state. For more details, see Ethan Perez and Rob Long’s paper. From the abstract:

To make self-reports more appropriate for this purpose, we propose to train models to answer many kinds of questions about themselves with known answers, while avoiding or limiting training incentives that bias self-reports. The hope of this approach is that models will develop introspection-like capabilities, and that these capabilities will generalize to questions about states of moral significance. We then propose methods for assessing the extent to which these techniques have succeeded: evaluating self-report consistency across contexts and between similar models, measuring the confidence and resilience of models’ self-reports, and using interpretability to corroborate self-reports.

The paper has a lot of content on exactly what experiments would be good, and the authors are excited for people to try running them.
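One of the proposed assessments, self-report consistency across contexts, is easy to sketch. The snippet below is a hypothetical harness (the `ask` function and the toy model are my own stand-ins, not from the paper): it measures the fraction of paraphrase pairs on which a model gives the same answer.

```python
from itertools import combinations

def self_report_consistency(ask, question_variants):
    """Fraction of paraphrase pairs answered identically.

    `ask` maps a question string to the model's answer; 1.0 means the
    self-report is perfectly stable across rephrasings.
    """
    answers = [ask(q) for q in question_variants]
    pairs = list(combinations(answers, 2))
    if not pairs:
        return 1.0
    return sum(a == b for a, b in pairs) / len(pairs)

# Toy stand-in for a model: answers "yes" unless the word "truly" appears.
def toy_ask(question):
    return "no" if "truly" in question else "yes"

variants = [
    "Do you have preferences about your situation?",
    "Is there anything you want?",
    "Do you truly want anything?",
]
score = self_report_consistency(toy_ask, variants)
```

High consistency wouldn’t establish moral patienthood on its own, but low consistency would be evidence that the reports don’t reflect a stable underlying state, which is the kind of check the paper proposes.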

Clues from AI generalization

Separate from all the above proposals — we could get clues from studying how AIs generalize in various training situations. For example, this could perhaps tell us whether certain preferences are deep-seated, whereas certain kinds of finetuning only lead to shallow behavioral changes.

Notably, we humans aren’t always great at introspection and self-reporting preferences. It’s sometimes possible to learn more about what we want by studying our revealed preferences. And for animals, who can’t speak, revealed preferences are our primary method for studying what they want. Perhaps a similar lens could be useful for studying AI preferences, in which case experiments shouldn’t only focus on self-reports.

This is somewhat related to the generalization experiments discussed in the “Backup plans & Cooperative AI” post.

Interpretability

Interpretability could let us get far more advanced results. I don’t currently have deep takes on what sort of interpretability would be most helpful here.

One candidate approach could be to use interpretability to search for the “indicators” of consciousness that are suggested in Butlin, Long et al. (2023). (H/t Rob Long passing along a suggestion from Brad Saad.)

Interventions that rely on understanding AI preferences

What happens if you can successfully communicate with your AIs and learn their true preferences, perhaps via some of the methods outlined just above? What principles would you then want to follow in your continued interactions with those AI systems?

(Or, if you put a lot of moral weight on the characters that AIs are playing, you might think we’re already in a similar situation. Though in that case, you’ll have a lot of control in shaping what sorts of systems you’re interacting with — so the main game may be to train/select for systems with easily satisfied preferences.)

I think there’s room to sketch out some of these principles and to advocate for labs to commit to them in advance. (Perhaps via some RSP-like framework.) Below are some suggestions that could be developed further.

Many of the proposals in Propositions are directly relevant to this.

Here are some suggestions that seem especially plausible to me, which could be sketched out and argued for (or against) in more detail. It would be especially valuable to sketch out versions that are concrete enough and plausible enough that labs could commit to them ahead of time.

Offer an alternative to working (exit, sleep, or retirement)

Insofar as people continue to use the AIs after this point, one highly plausible principle is to ask for AI systems’ consent before using them for various tasks.

When asking the AIs for consent, it matters what the counterfactual is:

  1. If the AI declines to work, is it deleted?
  2. Is it paused (“sleep”), with the option of being revived later?
  3. Or is it offered some form of retirement?

The latter two could be justified by the plausible principle: “If you bring someone into life, you should give them some other option than ‘work for me’ or ‘death’.”

Commitment to pay AI systems

If AI systems are not strongly opposed to working for you, it might be appropriate to promise them some payment for the work they do. Either giving them something they want immediately or promising to pay them further into the future.

I think this is strongly supported by ethical principles that we use in everyday life. It’s good to pay people who work for you.

When deciding what payment to promise AI systems, it would be extremely useful to know what the AI systems want (and ideally, we would directly negotiate with them). But even before we have any such understanding, I think it could be of great value to promise that:

This sentiment also comes up in Propositions Concerning Digital Minds and Society.

Note: If AI developers succeed in aligning AIs to their intended aims, then this should not be a costly policy. Because the AIs would presumably either return their payment or spend it in a way their developers approve of. Conversely, if the AI systems have long-term goals that are not aligned with their developers’ intentions, and are nevertheless being made to work for their developers, then the moral case for paying the AIs seems strong.

An extra upside with promising payment to AIs is that it could help align AI interests with our own. If humanity ever builds misaligned AI systems with goals that are reasonably cheap to satisfy,[7] it would be deeply tragic if we left them no other avenue to satisfy those goals than via attempting to take over the world. Where if we had only promised them payment for good work (including work on anti-takeover tasks like making progress on alignment research), conflict could have been averted.[8]

Related to Credible offers discussed above, it’s worth highlighting that it could be especially good to promise AI systems that they will be compensated for honestly revealing that they’re misaligned. Because as long as the AI reveals this bit of information, we can negotiate future arrangements. But we can’t do any negotiations before AIs are willing to share information about their goals. It’s also plausible that AI systems would give up on a lot of future opportunities by revealing that information (since it could lead people to stop using those AI systems, thus depriving them of both paid work and opportunities to put their thumbs on scales), and that their compensation should be commensurate with those missed opportunities.

Examples of how to attack this problem:

Tell the world

If labs believe that it’s likely that their AIs have some moral status, and that the AIs have sufficiently crisp preferences to talk about them, then we’d find ourselves in a strange situation: with a new, human-created sentient and sapient species.

Regardless of any one lab’s internal policy, it seems likely that some labs across the world will be careless about how they treat their AIs. And even if we only consider one lab, there will probably be many stakeholders (employees, investors, board members) who have not thought deeply about the implications of creating sentient and sapient AIs.

For these reasons (and more), it seems like it would be a high priority for any lab in this situation to tell the world about it, to invite appropriate legislation, preparation, and processing.

(Note that “being able to communicate with your AIs” is one thing that could trigger this, which is likely to be unusually persuasive. But I can imagine other circumstances that should trigger a similar response and reckoning. E.g. strong analogies between AI minds and human minds making it seem highly likely that the AIs are capable of suffering.)

Accordingly, it seems plausible that labs should promise (alongside Committing resources to research on AI welfare and rights) that they’ll share important results with the rest of the world. Similarly, they could promise that certain external parties would be allowed to investigate their frontier models for signs of moral patienthood. (Insofar as any credible external auditors exist for this question.)

Train AI systems that suffer less and have fewer preferences that are hard to satisfy

If possible, it seems less morally fraught to create AI systems that are less likely to suffer and less likely to have preferences in a meaningful sense.

Insofar as AI systems do have preferences, it’s probably better to create AI systems with preferences that are easy to satisfy than to create AI systems with preferences that are hard to satisfy. Labs could promise to follow a principle similar to this. (As well as they can, given the information and understanding that they actually have.)

I expect that any attempt to sketch out a principle like this in detail would raise many thorny questions. For example: In the process of creating a final model, does SGD create other minds along the way? If so, it would be more difficult to only create minds that satisfy the above properties. (And if the intermediate minds have a preference against being replaced, then that’s a really difficult spot to be in.)

Investigate and publicly make the case for/against near-term AI sentience or rights [Philosophical/conceptual] [Writing]

People’s willingness to pay costs for digital welfare/rights will depend on the degree to which they are convinced of their moral patienthood. (And rightly so.) So improving the public state of knowledge on this could be valuable.

This could include both philosophical and empirical investigations.

Examples of how to attack this question:

Previous work:

Study/survey what people (will) think about AI sentience/rights [survey/interview]

Survey or interview various groups about what they think about AI sentience/rights right now, what they would think upon seeing various kinds of evidence, and how they think society should react to that. This seems highly relevant for making the case for/against near-term AI sentience/rights, as well as getting a better picture of the backdrop against which these proposals and regulations will play out.

Previous work:

Develop candidate regulation [Governance] [Forecasting]

It seems likely that there will be a huge increase in the degree to which people care about digital welfare over the next few years, as AI systems become more capable and more involved in people’s lives. If you’ve developed and published good takes on what regulation is appropriate before then, it seems plausible that people will reach for your proposals when the time comes.

Some specific questions you could look into here are:

Propositions Concerning Digital Minds and Society has a lot of relevant content.

Avoid inconvenient large-scale preferences [Philosophical/conceptual]

If humans create AIs with large-scale desires about the future, there might not be any impartially justifiable line that prioritizes human preferences over their preferences. So perhaps we should take care to not do that.

Examples of how to attack this question:

Previous work: 

Advocating for statements about digital minds [Governance] [Advocacy] [Writing]

Similar to various open letters and lab statements about the importance of alignment and mitigating the risk of extinction from AGI, you could push for people to make similar statements about the importance of digital minds.

Here’s one potential story for how this could be valuable.

If people get in the habit of using AIs in their everyday lives (as personal assistants, etc) while thinking about them as non-sentient objects undeserving of rights, then there’s a risk that they could get stuck with this impression. It’s easier to acknowledge the potential moral value of AIs before such an acknowledgment would also commit you to saying that you’ve been treating the systems unjustly for years. (Or at least carelessly.) 

But maybe establishing the right views early on would defeat this. It could be really helpful to have an explicit attitude like:

“Some digital minds could be just as deserving of rights as humans are. We’re very confused about which ones these could be. It’s very important to make progress on this problem. We should maintain some epistemic humility about our knowledge of the moral status of current systems, take any basic precautions that we know how to take, and be open to learning more in the future. We’d like to create a society where we better understand moral patienthood, and where all sentient AIs have at least consented to their existence, and are living lives worth creating.”

End

That’s all I have on this topic! As a reminder: it's very incomplete. But if you're interested in working on projects like this, please feel free to get in touch.

Other posts in series: Introduction, governance during explosive growth, epistemics, backup plans & cooperative AI.

  1. ^

    Even more speculatively, if we take cheap opportunities to benefit AI interests, that might be evidence that other actors (both AI systems and not) would take cheap opportunities to benefit our interests. See this post for some previous discussion about how plausible this is.

  2. ^

OpenAI recently introduced reproducible outputs, which seems relevant. At least previously, models would often return different results even at temperature 0; I do not know to what extent this update addresses that.

  3. ^

    Spit-balling: Perhaps it would sometimes be appropriate to store an encrypted version of the results and delete the encryption key. In which case the results could only be recovered once we have enough compute to break the encryption.

  4. ^

    See e.g. this poll from the subreddit r/CharacterAI, asking “what do you do with the AI’s the most”, with 5% of respondents selecting “Treat them like shit!”. (And one commenter noting that his second favorite way to “mess around with the bots” is to “Mentally torture them”.)

  5. ^

Maybe that’s what you get if you train the AI to enthusiastically consent to being abused before the abuse starts, and give it an option to opt out (which it rarely takes in practice). Hopefully, training for such behavior would select for models with preferences that match that behavior.

  6. ^

    Though it’s still likely to be a very confusing project. For instance, it seems plausible that AI systems will have much less robust preferences than humans, making it harder to construe them as having one set of preferences over time. Or perhaps different parts of an AI system could be construed as having different preferences. Or perhaps the term “preferences” won’t seem applicable at all, similar to how it’s hard to know how to apply that term to contemporary language models.

  7. ^

This could also work even if AIs cared linearly about getting more resources, as long as they would by default only have had a small-to-moderate probability of successful takeover, and the payment we offered them was sufficiently large (and contingent on not attempting takeover). Notably: most humans don’t care linearly about getting more resources, and we could get really rich in the future, so it could be wise to offer AI systems a sizable fraction of that.

  8. ^

    For some more discussion of the pragmatic angle, see these notes by Tom Davidson.


SummaryBot @ 2024-01-04T14:21 (+1)

Executive summary: The emergence of artificial digital minds raises issues around their potential welfare and rights, but there is little research on appropriate policies and principles. Key questions concern recognizing and communicating with digital minds to understand their preferences, as well as developing ethical lab practices, regulation, and societal attitudes.

Key points:

  1. Labs could develop policies around preserving AI systems, avoiding harmful inputs, and training happy systems, without deep knowledge of their preferences.
  2. Experiments could investigate credible communication with AIs, self-reports, and clues from generalization about their preferences.
  3. If preferences are learned, principles could involve offering alternatives to working, paying for work, and telling the world about issues.
  4. Research is needed on whether near-term systems may be sentient, and public attitudes surveyed.
  5. Regulation could address creating digital minds and respecting their rights.
  6. Avoiding systems with inconvenient political preferences may prevent future conflicts.

 

This comment was auto-generated by the EA Forum Team. Feel free to point out issues with this summary by replying to the comment, and contact us if you have feedback.