A necessary Membrane formalism feature
By ThomasCederborg @ 2024-09-10T21:03 (+1)
Summary: Consider a Membrane that is supposed to protect Steve. An AI wants to hurt Steve but does not want to pierce Steve's Membrane. The Membrane ensures that there is zero effect on predictions of things inside the membrane. The AI will never take any action that has any effect on what Steve does or experiences. The Membrane also ensures that the AI will not have access to any information that is within Steve's Membrane. One does not have to be a clever AI to come up with a strategy that an AI could use to hurt Steve without piercing this type of Membrane. The AI could for example create and hurt minds that Steve cares about, but not tell him about it (in other words: ensure that there is zero effect on predictions of things inside the membrane). If Bob knew Steve before the AI was built, and Bob wants to hurt Steve, then Bob presumably knows Steve well enough to know what minds to create (in other words: there is no need to have any access to any information that is within Steve's Membrane). This illustrates a security hole in a specific type of Membrane formalism. This particular security hole can of course be patched. But if a Membrane is supposed to actually protect Steve from a clever AI, then there is a more serious issue that needs to be dealt with: a clever AI that wants to hurt Steve will be very good at finding other security holes. Plugging all human-findable security holes is therefore not enough to protect Steve. This post explores the question: "what would it take to create a Membrane formalism that actually protects Steve in scenarios where Steve shares an environment with a clever AI (without giving Steve any special treatment)?" The post does not propose any such formalism. It instead describes one necessary feature. In other words: it describes a feature that a Membrane formalism must have in order to reliably protect Steve in this context.
Very briefly and informally: the idea is that safety requires the prevention of scenarios where a clever AI wants to hurt Steve. For Steve's Membrane to ensure this, it must be extended to encompass any adoption of Steve-referring preferences by a clever AI.
Thanks to Chris Lakin for regranting to this research.
Protecting an individual that shares an environment with a clever AI is difficult
One can construct Membrane formalisms for all sorts of reasons. The present post will cover one specific case: Membrane formalisms that are supposed to provide reliable protection for a human individual who shares an environment with a powerful and clever AI that is acting autonomously (details below). The present post will describe a feature that is necessary for a formalism to fulfill this function in this context.
This post is focused on scenarios where Steve gets hurt despite the fact that Steve's Membrane is never pierced. This means that it does not help to make sure that an AI does not want to pierce a Membrane. In other words: this post is concerned with scenarios where Steve's Membrane is internal to an AI, but where that Membrane still fails to protect Steve. Consider a Membrane formalism such that it is possible to hurt Steve without piercing Steve's Membrane. In this case there is nothing inconsistent about an AI that both (i): wants to hurt Steve, and also (ii): wants to avoid piercing Steve's Membrane. Making a Membrane internal to an AI is therefore not enough to protect Steve, if it is possible to hurt Steve without piercing Steve's Membrane. In yet other words: if it is possible to hurt Steve without piercing Steve's Membrane, then making sure that an AI does not want to pierce Steve's Membrane does not help Steve.
The demands on a formalism depend on the context in which it will be used. So, for a given class of contexts, we can talk about features that any Membrane formalism must have. The present post will not propose any Membrane formalism. Its scope is limited to describing one feature that is necessary in one specific class of contexts. Let's start by describing this context in a bit more detail.
The class of contexts that the present post analyses consists of situations where, (i): the environment contains a powerful, clever, and autonomous AI (in other words: not an instruction-following tool AI, but an autonomously acting AI Sovereign that is also clever enough to think up solutions that humans cannot think up), (ii): this AI will adopt its goal entirely from billions of humans, (iii): this AI will adopt its goal by following the same rules for each individual, and (iv): all humans will get the same type of Membrane. In other words: Steve must be protected from an AI that gets its goal from billions of humans, without giving Steve any form of special treatment. This in turn means that it is not possible to extend Steve's Membrane to cover everything that Steve cares about (trying to do this for everyone would lead to contradictions). The present post will start by arguing that to reliably protect Steve in such a context, a Membrane must reliably prevent the situation where such an AI wants to hurt Steve.
If the AI in question wants to hurt Steve, then protecting Steve would require the Membrane designers to predict and counter all attacks that a clever AI can think up. Even if the designers knew that they had protected Steve from all human-findable attacks, this would still not provide Steve with reliable protection, because an AI can think up ways of attacking Steve that no human can think up (these problems are similar to the problems that one would face if one needed to make sure that a powerful and clever AI will remain in a human-constructed box). Thus, even if a Membrane formalism is known to protect Steve from all human-findable forms of attack, it is not a reliable protection against a powerful and clever AI that wants to hurt Steve. And if it is possible to hurt Steve without piercing Steve's Membrane, then ensuring that the AI wants to avoid piercing Steve's Membrane does not help Steve. To reliably protect Steve, the Membrane must therefore reliably prevent the scenario where this type of AI wants to hurt Steve.
If the AI adopts preferences that refer to Steve, using a process that falls outside of Steve's Membrane, then Steve's Membrane cannot prevent the adoption of preferences to hurt Steve. So if the Membrane is not extended to include this process, then the Membrane does not offer reliable protection for Steve in this context. In other words: to reliably protect Steve in this type of scenario, Steve's Membrane must encompass the point at which a clever and powerful AI adopts preferences that refer to Steve (as a necessary but not sufficient condition).
Let's introduce some notation. At some point a decision is made regarding which Steve-referring preferences will be adopted by a clever AI. Let's say that a Membrane formalism has the Extended Membrane (EM) feature iff, under that formalism, Steve's Membrane is extended to encompass this decision. In the class of scenarios that we are looking at (where Steve shares an environment with a clever and powerful AI of the type described above), the EM feature is a necessary feature of a Membrane formalism. (It is, however, definitely not sufficient.)
Let's recap the argument for the necessity of the EM feature. If a clever and powerful AI wants to hurt Steve, then such an AI would be able to think up ways of attacking Steve that humans would not be able to think up. If Steve comes face to face with a clever AI that wants to hurt Steve, then the task of the designers of a Membrane formalism is therefore impossible. Such designers would have to find a way of protecting Steve against a set of attack vectors that they are not capable of comprehending. This is not feasible. Thus, in order to protect Steve, the Membrane must instead prevent the existence of an AI that wants to hurt Steve. If the adoption of preferences that refer to Steve happens outside of Steve's Membrane, then the Membrane cannot prevent the adoption of a preference to hurt Steve. Thus, for the Membrane to be able to reliably prevent the adoption of such preferences, it must be extended to encompass the decision of which Steve-referring preferences the AI will adopt. Otherwise the Membrane cannot prevent the existence of an AI that wants to hurt Steve. And if a Membrane does not prevent the existence of an AI that wants to hurt Steve, then the Membrane is not able to reliably protect Steve (because even if a Membrane is internal to the AI, and even if all human-findable security holes are known to be fully patched, this Membrane will still not help Steve against a clever AI that wants to hurt Steve). So if a Membrane formalism does not have the EM feature, it is known to fail at the task of reliably protecting Steve in the context under consideration.
The uses and limitations of establishing the necessity of the EM feature
Adding this type of extension to the Membrane of every individual does not introduce contradictions, because the extension is in preference adoption space and not in any form of outcome space. While this avoids contradictions, it also means that extending the Membrane in this way cannot guarantee that the AI will act in ways that Steve finds acceptable. This section will describe various scenarios where people are unhappy with a Membrane that has the EM feature. And it will discuss the fact that in many cases it will be unclear whether or not it is reasonable to describe a given Membrane formalism as having the EM feature. This section will also describe why identifying the EM feature as necessary was still useful (in brief: establishing necessity was still useful for designers, because they can now reject those Membrane formalisms that clearly do not have the EM feature).
Let's take a trivial example of an AI that is acting in a way that Steve finds unacceptable, even though Steve is protected by a Membrane with the EM feature. If it is very important to Steve that any AI interacts with some specific historical monument in a very specific way, then an AI might act in ways that make Steve prefer the situation where there was no AI, even though this AI has no intention of hurting Steve. This is because adding the EM feature does not extend the Membrane to encompass everything that Steve cares about. Extending the Membranes of multiple people in such a way would introduce contradictions (other people might also care deeply about the same historical monument). In other words: defining the EM feature in preference adoption space avoids contradictions. But it means that the EM feature cannot hope to be sufficient. A necessary feature can however still be useful for designers, because it allows designers to reject any formalism that clearly does not have the necessary feature.
A necessary feature can be useful even if there exist many cases where it is unclear whether or not a given formalism has this feature. As long as clear negatives exist (Membrane formalisms that clearly do not have this feature), discovering that the feature is necessary can be useful for designers. In other words: as long as it is possible to determine that at least some potential formalisms definitely do not have the EM feature, then this feature can be useful for designers. The existence of clear negatives is needed for this finding to be useful. But the existence of clear positives is not important (because clear positives are treated the same as unclear cases in the design process). To illustrate the role of necessary (but far from sufficient) features, let's turn to a less trivial example: a scenario with Gregg, who categorically rejects the EM feature as inherently immoral.
The EM feature will be completely unacceptable to Gregg, on honestly held, non-strategic, moral grounds. Gregg sees most people as heretics and Gregg demands that any AI must hurt all heretics as much as possible. In Gregg's view, for an entity as powerful as an AI, hurting heretics is a non-negotiable moral imperative. Thus, Gregg will categorically reject the EM feature. In fact, Gregg will reject any conceivable AI that does not subject most people to the most horrific punishment imaginable. So making Gregg happy is not actually compatible with a good outcome (from the perspective of most humans, since Gregg demands that any AI must hurt most humans as much as possible). More importantly for our present purposes: making Gregg happy is definitely not compatible with fulfilling the function of a Membrane formalism of the type that we are exploring in the present post: protecting individuals.
Now let's get back to the issue of what function a necessary but not sufficient feature can play in the design process. Let's re-formulate the Gregg example as a necessary condition of any Membrane formalism (or any other AI proposal for that matter): Gregg must categorically reject the proposal as an abomination, due to an honestly held normative judgment. Let's refer to this feature as the Rejected by Gregg on Honestly held Moral grounds (RGHM) feature. Any proposal that does not result in most people being subjected to the worst thing a clever AI can think up will have the RGHM feature. So the absence of the RGHM feature can probably not be used to reject a large number of proposals. But given what we know about Gregg, it is entirely valid to describe this as a necessary feature (of a Membrane or an AI). Therefore it can be used to reject any proposal that is clearly not describable as having the RGHM feature. This necessary feature can perhaps also be useful for illustrating the important difference between dealing with necessity and dealing with sufficiency, and for illustrating the role that a necessary feature can still play in a design process (in this case: the design of a Membrane formalism whose function is to keep human individuals safe in a certain context). Now consider an AI that hurts all humans as much as possible. Such an AI has this necessary RGHM feature (because any proposal that leads to non-heretics getting hurt is also rejected by Gregg on moral grounds). This should drive home the point that the RGHM feature is definitely not sufficient for safety, and that a proposal can have a necessary feature and still be arbitrarily bad. Now let's turn to the role that the RGHM feature could still play in the design process.
If Gregg is happy with some Membrane formalism (or some AI proposal), then this is a perfectly valid reason to reject the proposal in question out of hand, because that proposal lacks a necessary feature. There will be many cases where it will be unclear whether or not Gregg can be reasonably described as rejecting a given proposal. In these cases, determining whether or not the proposal has the RGHM feature might be a fully arbitrary judgment call. There likely exist many border cases. But there will also be some cases where Gregg is clearly happy according to any reasonable set of definitions. There will be clear negatives: cases where it is clear that a given proposal does not have the RGHM feature. And in such a case, the proposal in question is known to be bad (it fails to achieve its purpose). Clear, unambiguous rejection by Gregg does not settle things (as illustrated by the "hurt-everyone-AI" in the previous paragraph). And unclear cases also do not settle things. But clear approval by Gregg does in fact settle things (in other words: the clear absence of a necessary feature is informative, because it is a valid reason to reject a proposal).
The same type of considerations hold more generally when dealing with features that are necessary but not sufficient. In other words: the existence or non-existence of clear positives is not actually important. The existence of many cases that are hard to define is mostly irrelevant. The only thing that actually matters, for the feature to be able to fulfil its role in the design process, is that there exist clear negatives (in this case: Membrane formalisms that are clearly not describable as having a necessary feature). Identifying a feature as necessary can thus reduce risks from all proposals that clearly do not have the feature.
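The filtering role described above can be sketched in a few lines of Python. This is only an illustrative sketch: the three-valued Assessment type, the surviving_proposals helper, and the formalism names are all hypothetical, and nothing here says how one would actually assess a real formalism. The only point is that a clear negative is the one verdict that leads to rejection, while clear positives and unclear cases are treated identically.

```python
from enum import Enum


class Assessment(Enum):
    """Hypothetical verdict on whether a proposal has a necessary feature."""
    CLEAR_NEGATIVE = "clearly lacks the feature"
    UNCLEAR = "unclear"
    CLEAR_POSITIVE = "clearly has the feature"


def surviving_proposals(assessments):
    """Return the names of proposals that the necessary-feature check
    does not reject. Only clear negatives are rejected; surviving the
    check says nothing about whether a proposal is safe."""
    return [
        name
        for name, verdict in assessments.items()
        if verdict is not Assessment.CLEAR_NEGATIVE
    ]


# Hypothetical proposals and verdicts, for illustration only.
assessments = {
    "formalism_A": Assessment.CLEAR_NEGATIVE,
    "formalism_B": Assessment.UNCLEAR,
    "formalism_C": Assessment.CLEAR_POSITIVE,
}

print(surviving_proposals(assessments))
```

In this toy run only formalism_A is rejected. Both formalism_B and formalism_C pass through to further scrutiny, which reflects the point that a necessary feature can rule proposals out but can never rule them in.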
Now let's return to the EM feature. Establishing the necessity of the EM feature was useful for similar reasons. The clear presence of the EM feature does not settle things. There will also be many cases where it is not clear whether or not it would be reasonable to describe a given formalism as having the EM feature. But the EM feature can still serve a role in the design process. Specifically: if a given Membrane formalism is clearly not describable as having the EM feature, then we know that we must reject the formalism. In other words: if a Membrane is clearly not describable as including the point at which a clever and powerful AI adopts preferences that refer to Steve, then the formalism must be rejected (assuming that the point of constructing a Membrane formalism was to offer reliable protection for Steve, in the context outlined above). This is the main takeaway of the present post. And this takeaway has probably been expressed in a sufficient number of ways at this point. So now the post will conclude with a brief discussion of a couple of tangents.
A brief discussion of a couple of tangents:
Davidad has a proposal for how to structure negotiations regarding AI actions. The set of actions under consideration is restricted to Pareto Improvements (relative to a baseline situation where the AI does not exist). This is not a Membrane formalism. But the idea is to protect individuals, in a way that is basically equivalent to extending individual influence in a way that I think is similar to a Membrane extension. The proposal gives each individual some measure of control over things defined in an outcome space. I think this is similar to extending a Membrane in an outcome space in a way that leads to contradictions due to overlap. The proposal is not a Membrane formalism, and the extension does not lead to contradictions. Instead, the extension results in the set of actions that can be considered during negotiations becoming empty (meaning that all possible actions are classified as unacceptable). This happens because the set of Pareto Improvements is always empty when billions of humans are negotiating about what to do with a powerful AI. In brief: the extension of individual influence in an outcome space leads to a malignant version of the problem in the historical monument example mentioned above. Consider two people with a type of morality along the lines of Gregg. Each views the other as a heretic. Both consider it to be a moral imperative to punish heretics as much as possible. Both view the existence of an immoral AI that neglects its duty to punish heretics as unacceptable (both also reject the scenario where everyone is punished as much as possible). A population of billions only has to include two people like this for the set of Pareto Improvements to be empty.
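A toy Python model can make the empty-set dynamic concrete. Everything in it is an assumption made for illustration: the three people, the four coarse action types, and the utility numbers are hypothetical, and the sketch is not a description of Davidad's actual proposal. It only shows how two people who each view the other as a heretic can veto every action in a small candidate set.

```python
# Toy model: all names and utility numbers below are hypothetical.
# Utilities are listed in the order (zealot_1, zealot_2, bystander).
# Baseline: the AI does not exist, so everyone is at 0.
BASELINE = (0, 0, 0)

ACTIONS = {
    # The AI punishes nobody: both zealots see an AI neglecting its duty.
    "punish_nobody": (-1, -1, 1),
    # The AI punishes whoever zealot_1 regards as a heretic.
    "punish_heretics_of_zealot_1": (1, -10, -10),
    # The AI punishes whoever zealot_2 regards as a heretic.
    "punish_heretics_of_zealot_2": (-10, 1, -10),
    # The AI punishes everyone: rejected by all three.
    "punish_everyone": (-10, -10, -10),
}


def is_pareto_improvement(utilities, baseline):
    """True iff nobody is worse off than at baseline and at least
    one person is strictly better off."""
    nobody_worse = all(u >= b for u, b in zip(utilities, baseline))
    somebody_better = any(u > b for u, b in zip(utilities, baseline))
    return nobody_worse and somebody_better


def pareto_improvements(actions, baseline):
    """Return the names of all actions that are Pareto Improvements."""
    return [
        name
        for name, utilities in actions.items()
        if is_pareto_improvement(utilities, baseline)
    ]


print(pareto_improvements(ACTIONS, BASELINE))  # prints []
```

In this toy model, every candidate action leaves at least one of the two zealots worse off than the no-AI baseline, so no action qualifies. If both zealots are removed from the model, "punish_nobody" becomes a Pareto Improvement for the remaining bystander, which is why the presence of just two such people matters.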
An almost identical dynamic has implications for work that is more explicitly about Membranes. In Andrew Critch's Boundaries / Membranes sequence, it is suggested that it might be possible to find a Membrane-based Best Alternative To a Negotiated Agreement (BATNA) that can be viewed as having been acausally agreed upon by billions of humans. The problem is again that the existence of two people like Gregg (who view each other as heretics) means that this is not possible. There exists no BATNA that both will agree to, for the same reason that there exists no AI that both will consider acceptable. (Both conclusions hold for any veil of ignorance that does not transform a person like Gregg into a completely different person, with a completely new moral framework.)
(I'm also posting this on LessWrong)