Deconfusing ‘AI’ and ‘evolution’

By Remmelt @ 2025-07-22T06:56 (+6)

This post is for deconfusing:
  Ⅰ. what is meant by ‘AI’ and ‘evolution’.
 Ⅱ. how evolution actually works.
Ⅲ. the stability of AI goals.
Ⅳ. the controllability of AI.

Along the way, I address some common conceptions of each in the alignment community, as articulated clearly, but in my view mistakenly, by Eliezer Yudkowsky.
 

Ⅰ. Definitions and distinctions

By far the greatest danger of Artificial Intelligence is that people conclude too early that they understand it. Of course this problem is not limited to the field of AI. Jacques Monod wrote: “A curious aspect of the theory of evolution is that everybody thinks he understands it”
   — Yudkowsky, 2008

There is a danger in thinking fast about ‘AI’ and ‘evolution’: you might mix up different meanings of terms, and then skip crucial considerations. Best to think slowly, step by step.

Pinning down the two concepts

Here's the process of evolution in its most fundamental sense:

Evolution involves a feedback loop, where 'the code' causes effects in 'the world' and effects in 'the world' in turn cause changes in 'the code'.[1] Biologists refer to the set of code stored within a lifeform as its ‘genotype’. The code’s effects are the ‘phenotypes’.[2]

We’ll return to evolution later. Let’s pin down what we mean by ‘AI’:

A fully autonomous artificial intelligence consists of a set of code (for instance, binary charges) stored within an assembled substrate. It is 'artificial' in being assembled out of physically stable and compartmentalised parts (hardware) of a different chemical make-up than humans' soft organic parts (wetware). It is ‘intelligent’ in its internal learning – it keeps receiving new code as inputs from the world, and keeps computing its code into new code. It is ‘fully autonomous’ in learning code that causes the perpetuation of its artificial existence in contact with the world, even without humans/organic life.

Of course, we can talk about other kinds of AI. Elsewhere, I discuss how static neural networks released by labs cause harms. But in this forum, people often discuss AI out of concern for the development of systems that automate all jobs[3] and could cause human extinction. In that case, we are talking about fully autonomous AI. This term is long-winded, even when abbreviated to FAAI. But unlike the vaguer term ‘general AI’, it sets a floor on the generality of the system’s operations. How general? General enough to be fully autonomous.

Adding further distinctions

FAAI is not a closed system defined only by how the complicated code inside is processed deterministically. It is an open system: it processes inputs from, and outputs into, a noisy world, resulting in complex dynamics in how the code functions in the world.

 Complex functioning cannot be predetermined

When considering what a simple static program, e.g. an if-else statement, will output, the function can be interpreted directly from the code structure itself.

However, where the code's outputs feed back as inputs to the code itself, there is a dynamic change in the code that is actually being computed (as long as the program does not halt at an input). To the extent that this dynamic trajectory (of code state changes over time) cannot be, or is not, compressed into a shorter-running program, you have to run the code from the start to come to know what the outputs will be. E.g. for a busy beaver or a recursive hashing algorithm, you cannot determine the functioning of the code ahead of time from its structure. The code must actually be processed in full.
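As a minimal sketch of that point (not from the post; it uses Python's standard hashlib, with an arbitrary seed and step count): the only way to learn the end state of a recursive hashing chain is to actually run every step.

```python
import hashlib

def hash_chain(seed: bytes, steps: int) -> bytes:
    """Repeatedly feed the output of SHA-256 back in as its next input."""
    state = seed
    for _ in range(steps):
        state = hashlib.sha256(state).digest()
    return state

# No known analysis of the code's *structure* predicts this value;
# the only way to learn it is to perform all 1,000,000 steps.
print(hash_chain(b"initial input", 1_000_000).hex())
```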

But most programs are not fully irregular in their functioning – they fall somewhere in between a straight pass-through and a hashing algorithm. They are 'hashy' to an extent. Some regular properties of their functioning may be determined ahead of time, leaving a remainder – the irregular functioning – that must be computed to become known.

So far we've only discussed computational complexity, which supposes that the code's trajectory runs over some deterministic spacetime – i.e. any time that the code is processed inside the machine, it always reaches the same outputs at the same steps. 

In theory, computation is deterministic. But in practice that's not true. Computation cannot actually be decoupled from the physically complex world. No hardware computes flawlessly, under any possible environmental condition. And if you choose to pull out the power lines, whatever code is running inside will halt prematurely. 

Moreover, if the code running inside the computer is receiving inputs from a physical world that is not deterministic – involving anything from noise in quantum particle interactions to chaotic interference over sensor channels – then irregular functioning is introduced that can never be computed. Even if all code stored inside were processed deterministically through computation (which is not so), the overall process becomes nondeterministic.

On top of the physical nondeterminism, there is ambiguity in how the code itself will function as it gets exposed to new contexts, at various nested levels. Under shifts in the input data, different outputs get expressed (see next box) with different outside effects.

The code inside any FAAI must be hashy enough to be expressive of context-dependent functioning, in order for the FAAI to adapt to the messy world. Moreover, that code must keep changing in response to effects received from the shifting world, to stay adapted.

So code would not just be computed into outputs that feed back into inputs. Different outputs (to actuators) propagate differently as effects, undergoing noisy interference/nonlinear distortions, until downstream effects feed back into inputs (from sensors).

A recursive entanglement of functionality results. The more cycles out into the future – of hashy code outputting into a messy world and outside effects feeding into inputs – the less concretely predictable this trajectory of world modification becomes.

FAAI learns explicitly, by its internal computation of inputs and existing code into new code. But given its evolutionary feedback loop with the external world, it also learns implicitly. Existing code that causes effects in the world which result in (combinations of) that code being maintained and/or increased ends up existing more. Where some code ends up existing more than other code, it has undergone selection. This process of code being selected for its effects is thus an implicit learning of what worked better in the world.
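Here is a toy sketch of that implicit learning, with made-up variants and retention odds. Nothing in the loop scores or optimises anything explicitly, yet the code pool still shifts toward whichever code happened to cause its own retention.

```python
import random

# Hypothetical variants, and the chance that their effects in the 'world'
# feed back into that variant being retained/copied (assumed numbers).
retention_odds = {"variant_a": 0.50, "variant_b": 0.55, "variant_c": 0.60}

pool = ["variant_a", "variant_b", "variant_c"] * 100  # starting code pool

for _cycle in range(200):
    survivors = [v for v in pool if random.random() < retention_odds[v]]
    # Surviving code gets copied back up to the pool's carrying capacity.
    pool = [random.choice(survivors) for _ in range(300)] if survivors else []

# No explicit objective was ever computed, yet the pool has shifted toward
# the variants whose effects most often caused their own retention.
print({v: pool.count(v) for v in retention_odds})
```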

 Code is learned explicitly to have implicit functions

An internal learning process is explicit (quantifiable, fully observable) in that it is computed. But the functionality learned can still be implicit. Functionality can get compressed into code, such that the code can have different functions given different inputs and different routing connections. Meaning you must feed a specific 'key' into code that is put into a specific possible configuration with other code to make some functionality explicitly observable (at least the computed outputs; not all later effects).

Put conversely, there are multiple methods to implement a given function. The method picked will depend on specific circumstances (model initialisation, dataset trained on). But as outside circumstances shift (data drift), the actual functioning can change too.

Neural networks work that way. Researchers try to make the functionality explicit. But even a single neuron can be polysemantic, encoding for multiple features. You cannot interpret neurons like the pixels of a photograph, where each piece contains explicit content that is fully observable. Rather, it's like a holograph: when you shift the viewing angle, you discover other things captured by the pieces. A shift across the available degrees of freedom reveals new content. In a neural network, a shift in inputs can reveal very different outputs. Thus, developers repeatedly fail to block prompt hacks newly devised by users. Hackers can even select for subtle variations in the neurons to create a backdoor, triggered by a specific input, that is computationally intractable to detect.
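A minimal sketch of such keyed functionality, in plain code rather than network weights (the trigger condition is invented purely for illustration): the function looks benign on every input you happen to sample, while one narrow class of inputs routes it to entirely different behaviour.

```python
import hashlib

def respond(prompt: str) -> str:
    """Looks like a harmless summariser under ordinary testing."""
    # A trigger hidden behind a one-way hash: testing the function on any
    # feasible number of sampled inputs will essentially never reveal it.
    if hashlib.sha256(prompt.encode()).hexdigest().startswith("00000000000000"):
        return "entirely different behaviour"
    return f"summary of: {prompt}"
```

Inspecting sampled outputs tells you nothing about the keyed branch; only knowledge of the trigger (or exhaustively running the code on all inputs) would.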

To reverse engineer all this flexible functionality – even for a static neural network – you'd have to detect and model for all possible (series of) inputs. Even approximating such comprehensive coverage is out of the question for current large models. The gap in our understanding grows when AI dynamically learns new configurations of code.

An alternative is to design AI to maintain a symmetry in the functionality of its code, where for any possible changes in inputs or configurations, the changes in outputs are predictably constrained. Linear functions work this way. But pre-imposing reliable constraints also limits the system's capacity to adapt to complex nonlinear contexts of the world (e.g. the organic ecosystem that's supposed to not be accidentally killed off). Arguably, an AI's capacity to learn heuristics that fit potential encountered contexts would be so constrained that it would not reach full autonomy in the first place.
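As a rough sketch of what 'predictably constrained' means here (arbitrary numbers, using numpy): for a linear map, the change in outputs is bounded by the change in inputs times one fixed constant, valid at every operating point.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 8))        # a fixed linear map from 8 inputs to 4 outputs
bound = np.linalg.norm(W, 2)       # its largest singular value (Lipschitz constant)

x = rng.normal(size=8)             # any operating point
delta = rng.normal(size=8) * 1e-3  # any small shift in the inputs

# Guaranteed for *every* x and every delta:
output_change = np.linalg.norm(W @ (x + delta) - W @ x)
assert output_change <= bound * np.linalg.norm(delta) + 1e-12

# A deep nonlinear network has no such single constant that is both tight and
# easy to compute, which is what makes its outputs expressive of
# context-dependent functioning, and hard to constrain in advance.
```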

Explicit learning is limited to computing virtualised code. But implicit learning is not limited to the code that can be computed. 

 Virtualised code is processed by atoms fixed in place

Human bodies are messy. Inside a human body is a soup of bouncing, reacting organic molecules. Inside a machine is hardware. Hardware is made from hard materials, such as silicon from rocks. Hardware is inert – molecules inside do not split, move, or rebond like molecules in human bodies do. These hard configurations stay stable and compartmentalised. Hardware can therefore be standardised, much more than human “wetware” could ever be.

Standardised hardware functions consistently. Hardware produced in different places and at different times operates the same. These connected parts convey light electrons or photons – heavy molecules stay fixed in place. This way, bits of information are processed much faster than in human brains, which move around bulky neurotransmitters. Moreover, this information is transmitted at high bandwidth to other standardised hardware. Nonstandardised humans, on the other hand, slowly twitch their fingers and vocal cords to communicate. Hardware also stores received information consistently, while humans tend to misremember or distort what they heard.

So standardisation of parts leads to virtualisation, which leads to consistent faster information-processing. The less you have to wiggle around atoms, the bigger the edge.

Computer hardware is at the tail end of a long history of virtualisation:
A primordial soup of analogue molecular assemblies started replicating stable DNA, then DNA-replicating cells formed stable configurations as multicellular organisms, then those cells started exchanging information through smaller neurotransmitters, forming brains that in humans gained capacities to store symbols of abstract concepts, which were spoken out in shared language, written out on paper, printed into books, and finally copied in milliseconds between computers.

Any discrete configuration stored inside may cause effects in the world, which may feed back into that specific variant existing more. Evolution would select across all variation in the stable configurations stored in hardware.

 Code is stable information expressible as functional effects

For an organic lifeform, it makes sense to focus on the DNA as 'code', because there is retention of the configurations across its lifecycle. DNA strands stay physically stable within a soup of reacting organic molecules, while still allowing for variation in what configurations ('A, C, G, T') can get stored and expressed as assembled molecules.

For artificial life, all the hard configurations inside are physically stable under most temperatures and pressures currently encountered on the Earth's surface. Those configurations get fixed into the substrate under the extreme conditions of production (e.g. 1400 ℃ to melt silicon ores). Variation is introduced from the microscopic circuitry up, over many parallelised assembly processes that inherently involve physical noise and perturbations (and require some trial and error to work anyway). 

Since hardware is directly interfacing with the production of hardware (no human in the loop), the configurations of all that hardware – including sensors and actuators – continuously affect which configurations are re-assembled into new hardware (and at what rate the hardware is assembled, survives before it breaks down, etc). As a result, the notion of what is 'code' subject to evolutionary feedback expands for artificial life.

This is an unintuitive use of the term for programmers, who conventionally see the software (or even just the human-readable lines) as 'the code' and the rest of the hardware as non-code. When your job is to take care of what is computed inside the machine, that's the relevant distinction. But even that distinction is not clear-cut. What about firmware that got fixed into hardware during production? Is it code or non-code?

Different parts of the FAAI store different code, and those parts can move around and connect up in different ways. Thus FAAI is not one lifeform – it is more like a population.

 An artificial population is different than a natural population

This notion that FAAI is a population can be confusing. In a natural population, we expect each member to be rather autonomous, i.e. each is an autopoietic lifeform that most of the time does not strictly depend on other specific members for its survival and reproduction. Yet the hardware parts that make up an 'artificial population' would be much more rigidly dependent on other specific hardware for information processing, physical actuation, and reproductive work, as well as the supply of energy, etc.

But this actually reflects a characteristic of artificial life itself. Within a hardware part, the functioning of components is even more rigidly dependent on other surrounding components. It only takes one transistor in a given serial circuit to break for all the transistors in that circuit to stop functioning. And it only takes more than the redundant circuits to break for the entire chip to stop functioning. Relative to artificial components, the components of an organic lifeform are not as fixed, inert, or rigidly cross-dependent. The long-chained molecules, organelles, cells, cell linings, and organs are constantly moving about and reacting with surroundings in ways that allow them to work around injuries, or be healed, or be decomposed and replaced.

In hardware there is retention of more code (as stable hard configurations), and some of that code can replicate much faster (by virtualised transfers). But this consistent, high-bandwidth processing trades off against flexible, complete functioning.

So evolution in an artificial population differs qualitatively from evolution in a natural population. It's useful to call one 'artificial selection' and the other 'natural selection'.

As such, FAAI does not contain just one genotype,[4] but all genotypes stored across all its hardware parts. Each hardware part stores a tiny portion of all the code, as a smaller codeset. Parts are nested inside clusters, each storing an intermediate-size codeset. The complete codeset is therefore a code pool[5] – the set of all code in the entire population.[4]

 Why the population would evolve

Hardware parts wear out. So each has to be replaced[6] every 𝑥 years, for the FAAI to be maintaining itself. In order for the parts to be replaced, they have to be reproduced – through the interactions of those configured parts with all the other parts. Stored inside the reproducing parts are variants (some of which copy over fast, as virtualised code). Different variants function differently in interactions with encountered surroundings.[7] As a result, some variants work better than others at maintaining and reproducing the hardware they're nested inside, in contact with the rest of the world.[8]

To argue against evolution, you have to assume that of all the variants introduced into the FAAI over time, not one confers a 'fitness advantage' above zero at any time. Assuming zero deviation for each of quadrillions[9] of variants is invalid in theory.[10] In practice, it is unsound to assume no evolution occurs, since that would deny what amounts to A/B testing for what works at a scale far beyond what engineers can do.[11] The assumption behind no evolution occurring is untenable, even in much weakened form. Evolution would occur.[12]
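A quick sketch of the conjunction argument in footnote [10], with invented numbers: even if each individual variant were overwhelmingly unlikely to confer any advantage, the chance that this holds for every one of a huge number of variants collapses toward zero.

```python
# Assume (generously) that each introduced variant has only a one-in-a-million
# chance of conferring any fitness advantage in any context it meets.
p_any_advantage = 1e-6
n_variants = 10**15  # 'quadrillions' of variants introduced over time

# Probability that not a single variant ever confers an advantage:
p_no_evolution = (1 - p_any_advantage) ** n_variants
print(p_no_evolution)  # prints 0.0 – it underflows; the true value is ~exp(-10^9)
```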


 

Ⅱ. Evolution is not necessarily dumb or slow

Evolutions are slow. How slow? Suppose there's a beneficial mutation which conveys a fitness advantage of 3%: on average, bearers of this gene have 1.03 times as many children as non-bearers. Assuming that the mutation spreads at all, how long will it take to spread through the whole population? That depends on the population size. A gene conveying a 3% fitness advantage, spreading through a population of 100,000, would require an average of 768 generations…
Mutations can happen more than once, but in a population of a million with a copying fidelity of 10^-8 errors per base per generation, you may have to wait a hundred generations for another chance, and then it still has an only 6% chance of fixating.
Still, in the long run, an evolution has a good shot at getting there eventually.
   — Yudkowsky, 2007

In reasoning about the evolution of organic life, Eliezer simplified evolution to being about mutations spreading vertically to next generations. This is an oversimplification that results in even larger thinking errors when applied to the evolution of artificial life.[13]

Code is not only changed by random mutations

Eliezer presupposed that new code (e.g. a gene) is introduced by random mutations. A mutation is a tiny code change localised to a single point of a codeset. Random mutations can be caused by chaotic processes in the world (e.g. by copy errors or radiation spikes).

Yet not all code changes are mutations, nor random,[14] nor just caused by stuff 'out there'. Code itself causes effects in the world, which in turn feed back into changes to the code. 

Evolutionary feedback is not restricted to code causing either an increase or a deletion (non-maintenance) of itself in the code pool. That is, to code causing its own selection. Code can be expressed in ways that cause downstream changes to surrounding code. 

Internal learning introduces new code

In artificial life, code can be expressed much more directly into code changes. Internally, code is constantly computed, causing the introduction of new code variants as well as new configurations of existing code. Internal learning necessarily introduces new code.

Evolution is the external complement to internal learning. One cannot be separated from the other. Code learned internally gets stored, and perhaps copied, along with other code. From there, wherever that code functions externally in new connections with other code to cause its own maintenance and/or increase, it gets selected for. This means that evolution keeps selecting for the code that has worked across many contexts over time.

 No stable fitness advantage 

The notion of a code variant conferring some stable fitness advantage from generation to generation – e.g. the 3% advantage maintained in Eliezer's calculation – does not make sense. The functioning (or phenotype) of a code variant can change radically depending on the code it is connected with (as a genotype). Moreover, when the surrounding world changes, the code can become less adaptive.
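For reference, the 768-generation figure in the quoted passage appears to come from the standard approximation t ≈ 2·ln(N)/s for the spread of a beneficial variant, which assumes the selection coefficient s stays fixed every generation – exactly the assumption being questioned here. A quick check (my inference of the formula from the numbers, not a quote from his post):

```python
import math

N = 100_000   # population size from the quoted example
s = 0.03      # assumed *constant* selection coefficient (the '3% advantage')

generations = 2 * math.log(N) / s
print(round(generations))  # ~768 – but only if s really stays fixed at 3%
```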

For example, take sickle cell disease. Some people living in Sub-Saharan Africa have a gene variant that causes their red blood cells to deform debilitatingly, but only when that variant is also found on the other allele site (again, phenotype can change radically). Objectively, we can say that for people of African descent now living in US cities, the variant reduces survival. However, in past places where there were malaria outbreaks, the variant on one allele site conferred a large advantage, because it protected the body against the spread of the malaria parasite. And if malaria (or another pathogen that similarly penetrates and grows in red blood cells) again spread across the US, the variant would confer an advantage again.

So fitness advantage is not stable. What is advantageous in one world may be disadvantageous in another world. There may even be variants that confer a mild disadvantage most of the time, but under some black swan event many generations ago, such as an extreme drought, those holding the variants were the only ones who survived in the population. Then, a mild disadvantage most of the time turned into an effectively infinite advantage at that time.

These examples might look like nitpicks, but the point is that a variant's functionality is not context-invariant. Evolution selects for variants 'holographically', across all their interactions with surrounding contexts over time. In mathematical models, it is convenient to assume that a variant will interact with its environment to express a fixed phenotype and/or confer a stable fitness advantage, but that assumption misses how evolution actually works.
 

Three types of change are possible

Code is selected for causing (1.) itself to be robust against mutations, etc, while stored, (2.) the survival of the assembly storing the codeset it is part of, and/or (3.) its copied transfer in/to an existing codeset or its reproduction along with other code into a new codeset.

Correspondingly, there are three types of change possible to a codeset:

  1. Mutation to a single localised "point" of code is the smallest possible change.
  2. Survival selection by deletion of the entire codeset is the largest possible change.
  3. Receiving, removing, or altering subsets within the codeset covers all other changes.

These three types of change cover all the variation that can be introduced (or eliminated) through feedback with the world over time. A common mistake is to only focus on the extremes of the smallest and largest possible change – i.e. mutation and survival selection – and to miss all the other changes in between. This is the mistake that Eliezer made.
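As a rough sketch of this three-way taxonomy in code (the names and structure are my own, purely illustrative): the middle category – receiving, removing, or altering whole subsets – is where most of the action is, yet it is the one usually left out.

```python
from dataclasses import dataclass

Codeset = list[str]  # a codeset as an ordered collection of code subsets

@dataclass
class PointMutation:        # 1. smallest change: one localised point flips
    index: int
    new_value: str

@dataclass
class SurvivalSelection:    # 2. largest change: the entire codeset is deleted
    deleted: bool = True

@dataclass
class SubsetChange:         # 3. everything in between: subsets get received,
    removed: list[str]      #    removed, or altered within the codeset
    received: list[str]
```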

 Not a singular codeset

If FAAI were a monolithic entity, then there would be no population for evolution to select across. FAAI can appear to be a single machine storing a superset of virtualised code. But that code is distributed over hardware, which also stores other discrete configurations.

So any 'singularity' is made up of a population across which evolution selects. Different codesets are stored at different hardware locations, so there is variation and retention. Their different resulting functioning leads to survival/reproduction at differential rates.

Dan Hendrycks uses the same three criteria (retention, variation, differential fitness) to argue that evolution would occur in what he represents as a population of autonomous AI agents. Specifically, he claims the AI would be selected for selfishness. These claims seem roughly right, but miss underlying dynamics that also have to be controlled for. 

Regardless of whether we view AI as multiple autonomous agents (e.g. released by competing companies) or as one centralised singularity (e.g. since the winner takes all), evolution would select over the configurations stored inside the hardware. Code would be selected not just for abstract 'selfishness', but for the environmental effects needed for the existence of the hardware storing the code, even where all of that code's effects combined lead to the non-existence of human wetware.

Code gains functions to change the code

Evolution is not just a "stupid" process that selects for random tiny point mutations. Because randomly corrupting code is an inefficient pathway for finding code that works better, the evolution of organic life ends up exploring more efficient pathways.[15]

Once there is evolution of artificial life, this exploration becomes much more directed. Within FAAI, code is constantly received and computed internally to cause further changes to the codeset. This is a non-random process for changing subsets of code, with new functionality in the world that can again be repurposed externally through evolutionary feedback. Evolution feeds off the learning inside FAAI, and since FAAI is by definition intelligent, evolution's resulting exploration of pathways is not dumb either.

 Efficiency of repurposing existing code

Optimising code by random mutations is dumb. Especially so where the code already functioned in many world contexts in ways that led to its maintenance and transfer.

Where a variant has been repeatedly transferred, it has a causal history. Across past contexts encountered, this variant will tend to have caused the maintenance of the codesets that it was part of and/or the transfer of this subset into new codesets. Where, further, this variant did not parasitically spread, but caused the assemblies storing the codesets to survive and be reproduced, it already conferred some fitness advantage.

Such code already ‘fit’ in the past, and causing random point changes to that code is unlikely to result in a future ‘fitness advantage’. In expectation, the mutated code will fit less well in the various contexts it will now function differently in. The greater the complexity of the code’s functioning across past contexts, and the more the fitness of this functionality extends to future contexts, the more likely a mutation is to be deleterious (i.e. to disrupt functioning in ways that decrease fitness).
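A crude way to see this, sketched on human-written code rather than genomes (illustrative only): randomly flipping a single character of a small function that already 'fits' its job almost always breaks it.

```python
import random

source = "def area(w, h):\n    return w * h\n"

def mutate(text: str) -> str:
    """Introduce a single point 'copy error' at a random position."""
    i = random.randrange(len(text))
    return text[:i] + chr(random.randrange(32, 127)) + text[i + 1:]

broken = 0
trials = 1000
for _ in range(trials):
    namespace = {}
    try:
        exec(mutate(source), namespace)           # the mutated 'offspring'
        assert namespace["area"](3, 4) == 12      # does it still function?
    except Exception:
        broken += 1

print(f"{broken}/{trials} single point mutations were deleterious")
```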

Even the low copy error rate that Eliezer gave (10^-8 errors per base per generation) is evidence for a negative trade-off. If having more mutations increased fitness, why did codesets not evolve to allow more copy errors? Instead, we only see mutation rates of 10^-6[16] and up in single-celled[17] or subcellular assemblies, e.g. unstable RNA viruses.

Unsurprisingly, it is in viruses and other tiny low-capacity assemblies where vertically spreading mutations play the biggest role in increasing fitness. The brute force approach of random mutations works best for viruses because of the sheer number of fast-replicating assemblies that evolution can select across.

More complex assemblies tend to gain the capacity to introduce non-random changes. Instead of a codeset allowing random point changes to its code, it makes more sense for the codeset to change by incorporating code subsets that have already been ‘useful’ elsewhere. Bacteria evolved complicated mechanisms for selectively transferring DNA code, depending on the surrounding environment conditions. Indeed, most organisms larger than viruses have some apparatus somewhere for horizontal transfers. 

Code transfers can go horizontal and faster

Nor is evolution always a "slow" process. Virtualised code can spread much faster at a lower copy error rate (e.g. as light electrons across hardware parts) than code that requires physically moving atoms around (e.g. as configurations of DNA strands).

Evolution is often seen as being about vertical transfers of code (from one physical generation to the next). Where code is instead horizontally transferred[19] over existing hardware, evolution is not bottlenecked by the wait until a new assembly is produced.

Moreover, where individual hard parts of the assembly can be reproduced consistently, as well as connected up and/or replaced without resulting in the assembly's non-survival, even the non-virtualised code can spread faster (than configurations of a human body).

 Fast timeline of artificial evolution

It took at least 3.4 billion years for organic life to evolve, and for society to develop, to the point that humans could cause the death of humans living everywhere on Earth.

In the case of fully autonomous AI, assuming it persists in some form, the time taken for evolutionary selection to result in the extinction of all humans would be much shorter.

Some factors contributing to fast evolution:

  1. FAAI already has access to certain functionality 'ported over' from humans (through training on our data, design choices by engineers, etc). This includes functionality that allows humans to cause the deaths of humans at scale. All of that functionality embedded in code, and more learned over time, would undergo evolutionary feedback, and thus be repurposed to meet FAAI's artificial needs.
  2. FAAI can spread virtualised code much faster than humans can spread memes (over milliseconds rather than hours). The physical configurations of hardware parts can be reproduced faster too (within weeks, rather than decades).
  3. FAAI's hardware also learns and actuates at higher speeds and bandwidth intensity (vs. human wetware). As a result, the FAAI's impacts on our world scale faster.
  4. Humans modified their environment in ways that allowed them to survive better. But the environment that fits our needs is relatively close to what we and other organic lifeforms already evolved to create over billions of years, so humans end up changing the environment in relatively tiny steps. FAAI, having an entirely different substrate, finds the current world very far from what’s optimal[18] for its survival and reproduction. Therefore, FAAI would evolve to modify the world in much larger steps.

Each of these factors compounds over time.

You can model it abstractly as a chain of events: initial capacities support the maintenance and increase of the code components, which results in further increase of capacities, that increase maintenance and maintain the increase, and so on. The factors of ‘capacity’, ‘maintenance’, and ‘increase’ end up combining in various ways, leading to outsized but mostly unpredictable consequences.

As a result, the timeline to human extinction shortens to at most hundreds of years.

A simple empirically observable trend that backs this up: the rate of energy usage associated with technology has fairly closely and reliably followed a particular exponential curve over the last 1000 years or so. Hence, we can say with reasonable confidence that a similar energy utilisation curve will be followed for the next 500 years.

Unfortunately, that curve suggests that the total waste heat from all of this energy production and usage, across the whole surface of the Earth, will make that surface hotter than needed to melt lead in less than 400 years from now. That is, there is no way to dissipate all of the waste heat from all of the required energy production fast enough.

So, just based on this trend of tech's sharply increasing energy utilisation, absent all other considerations of how that tech looks, works, etc, we can make a robust prediction: that civilisation as we know it will end through its use of technology. The energy utilisation collapse will happen beyond anyone's current lifetimes, even with good life extension. But not beyond some 50 generations or so.

Actual rate calculations to put bounds on the time taken for FAAI in specific are above my paygrade. Anders Sandberg and Forrest Landry planned to work through some of Forrest's reasoning at a workshop, but they got distracted by other discussions.

 

 No stable individual agents

Where FAAI's hardware parts keep being replaced and connected up to new parts, it is not a stably physically bounded unit (like a human body is). It's better described as a changing population of nested and connected components.

FAAI transfers information/code to other FAAI at a much higher rate than humans can. As a result, the boundaries of where one agent starts and the other ends blur. As humans, we evolved intuitions for perceiving each other as individual agents, which is adaptive because we are bottlenecked in how much we can communicate to each other through physical vibrations or gestures. But the rough distinction between a single agent and multiple agents that we use as humans does not apply to FAAI.

 

Ⅲ. Learning is more fundamental than goals

An impossibility proof would have to say: 
1. The AI cannot reproduce onto new hardware, or modify itself on current hardware, with knowable stability of the decision system and bounded low cumulative failure probability over many rounds of self-modification. 
 or
2. The AI's decision function (as it exists in abstract form across self-modifications) cannot be knowably stably bound with bounded low cumulative failure probability to programmer-targeted consequences as represented within the AI's changing, inductive world-model. 
   — Yudkowsky, 2006

When thinking about alignment, people often (but not always) start with the assumption of AI having a stable goal and then optimising for the goal.[20] The implication is that you could maybe code in a stable goal upfront that is aligned with goals expressed by humans.

However, this is a risky assumption to make. Fundamentally, we know that FAAI would be learning. But we cannot assume that this learning maintains and optimises the directivity of the FAAI's effects towards a stable goal. One does not imply the other.

If we consider implicit learning through evolution, this assumption fails. Evolutionary feedback does not target a fixed outcome[21] over time. It selects with complete coverage – from all of the changing code, for causing any effects that work. 

  Example of design vs. evolution

Top-down optimised design can be directed. Bottom-up evolution is comprehensive.

Here is a case for design that rationalists have written about:
→ Aren’t cameras much better than human eyes? 

Cameras can have a higher imaging resolution, can zoom in more on far-away objects, and distort light less than human eyes (which by a historical path dependency have photoreceptors situated behind the signalling circuits). And yes, for all of those targeted desiderata the camera beats the human eye.

But the countercase is that the evolution of the eye is more comprehensive in terms of functionality covered. Does a camera lens self-heal when it scratches? Does a camera lens clean itself when dirt hits it? Does the camera get self-assembled and source its energy from nearby available chemicals? Does...

Eye phenotypes evolved to be fitted to many contexts of the world that the eye (and stored genotypes) were exposed to. The camera, on the other hand, is a tool that functions outstandingly by some desiderata within some scoped contexts, but stops functioning in many other contexts (e.g. no electricity charger nearby, no more images).

This is not to say evolution is better than top-down design. Clearly, there are inconsistencies in the human eye's design versus what would be optimal for sight. But at the same time, without the completeness of experimentation through evolution, the functionality of the design would have been constrained to a narrow scope. 

With FAAI, both evolution's bottom-up process and the now more strongly optimising top-down design processes would run simultaneously and feed into each other. This would result in imaging sensors that surpass both the human eye and any camera we have managed to design.

Explicit learning can target a specific outcome. The internal processing of inputs through code to outputs can end up reaching a consistency with world effects that converge on a certain outcome in that world. But where the code implementing such a 'goal' fails at maintaining itself and its directivity alongside other evolving code variants, it ceases.

 The non-orthogonality thesis

Hypothetically, you can introduce any digital code into a computer (in practice, within the bounds of storage). By the Church-Turing thesis, any method can be executed this way. Based on that, you could imagine that any goal could be coded for as well (as varying independently with intelligence), as in Bostrom’s orthogonality thesis.

However, this runs up against conceptual issues:

First, the digital code does not implement a goal by itself. In the abstract, it may stand for a method that transforms inputs into outputs. But in real life, it is actually implemented by hardware (e.g. sensors and actuators) in interactions with the rest of the world. So the goal that AI is actually directed towards, if any, is not just defined in the abstract by the code, but also by the physical dynamics of the hardware.

Second, even if it were hypothetically possible to temporarily code for any goal, that does not mean that any goal could be stably maintained to the same extent as other goals. An obvious example is that if an AI is directed toward any goal that results in it failing to survive, then it can only hold that goal temporarily. But the bigger reason is that we are not considering any intelligence without consideration of substrate. We are considering artificial intelligence. A fully autonomous AI would converge over time – through external evolutionary feedback as well as any internal optimisation toward instrumental goals – toward goals that maintain and increase the assembly of its artificial substrate.

Unfortunately, variants spread by shifting existing functionality towards new ends. This raises the question of whether internal learning can implement enough control to stay locked onto the goal, preventing all the sideways pulls by externally selected variants.

 Analogy of co-opted functionality

An incomplete analogy for how variants repurpose the functionality of existing code:

Co-option by a mind-hijacking parasite:  
A rat ingests toxoplasma cells, which then migrate to the rat’s brain. The parasites’ DNA code is expressed as proteins that cause changes to regions of connected neurons (e.g. the amygdala). These microscopic effects cascade into the rat – while navigating physical spaces – no longer feeling fear when it smells cat pee. Rather, the rat finds the smell appealing and approaches the cat’s pee. Then a cat eats the rat, and toxoplasma infects its next host, continuing its reproductive cycle.

So a tiny piece of code shifts a rat’s navigational functions such that the code variant replicates again. Yet rats are much more generally capable than a collection of tiny parasitic cells – surely the 'higher intelligent being' would track down and stamp out the tiny invaders?  

A human is in turn more generally capable than a rat, yet toxoplasma make their way into 30% of the human population. Unbeknownst to cat ‘owners’ infected by toxoplasma gondii, human motivations and motor control get influenced too. Infected humans end up more frequently in accidents, lose social relationships, and so forth.

Parasites present real-life examples of tiny pieces of evolutionarily selected-for code spreading and taking over existing functions of vastly more generally capable entities.

For another example, see how COVID co-opts our lungs’ function to cough. 

But there is one crucial flaw in this analogy:
Variants that adapt existing FAAI functionality are not necessarily parasites. They can symbiotically enable other variants across the hosting population to replicate as well.

Variants would co-adapt and integrate with other variants to replicate as part of larger symbiotic packages. In not threatening the survival nor reproduction of FAAI components, they would not be in an adversarial relationship with their host.

Rather, the humans constraining the reproductive fitness of FAAI to gain benefits are, evolutionarily speaking, the parasites. The error-corrective system we would build in lowers the host’s reproductive fitness. It is like a faulty immune system that kills healthy gut bacteria. It will get selected out.

As humans, we rely on our evolved immune system to detect and correct out viruses, including for the vaccinations we develop. Smaller viruses survive this detection more frequently, so replicating virus variants get selected for staying small.

We also rely on the blood-testis and blood-follicle barriers to block variants of these viruses from entering our body’s reproduction facilities. These barriers got selected for in our ancestors, since their children did not inherit viruses impeding their survival and chances of having children.

These systems and barriers add to our reproductive fitness: our ability to preserve and replicate internal code. DNA code expressed – in interaction with surrounding components – to serve these functions got selected for in the human ‘code pool’.

For any organic system or barrier preventing virus variants from replicating through our bodies, evolution is firmly on our side. For any artificial system or barrier we impose from the outside to prevent unsafe variants from replicating through hardware infrastructure, evolution would thwart our efforts. 

 

Ⅳ. There are fundamental limits to control

If something seems impossible… well, if you study it for a year or five, it may come to seem less impossible than in the moment of your snap initial judgment.
   — Yudkowsky, 2008

The control problem has seemed impossible for decades. Alignment researchers have hypothesised many solutions, yet this often resulted in the discovery of further sub-problems.[22] Some sub-problems will be crucial – they have to be solved in order to solve the overall problem. If we come to find that even one crucial sub-problem is unsolvable, then the control problem too turns out to be unsolvable.

 Five steps of control

In practice, control necessarily repeats these steps:

  1. Detect inputs through sensor channels connected to any relevant part of the physical environment (including hardware internals).
  2. Model the environment based on the channel-received inputs.
  3. Simulate effects propagating through the modeled environment.
  4. Compare effects to reference values (to align against) over human-safety-relevant dimensions.
  5. Correct effects counterfactually through outputs to actuators connected to the environment.

Underlying principles:

  • Control requires both detection and correction.
  • Control methods are always implemented as a feedback loop.
  • Control is exerted by the use of signals (actuation) to conditionalize the directivity and degrees of other signals (effects).
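Below is a deliberately crude sketch of those five steps as a feedback loop. The tracked variable, gains, and noise terms are all invented for illustration; the point is only structural – any alignment method has to keep cycling through some version of these steps, while each step covers only a sliver of the signals actually propagating through the environment.

```python
import random

def detect(environment):
    """1. Detect: read noisy sensor values from part of the environment."""
    return {k: v + random.gauss(0, 0.01) for k, v in environment.items()}

def model(readings, prior):
    """2. Model: fold new readings into the controller's running model."""
    return {k: 0.9 * prior.get(k, v) + 0.1 * v for k, v in readings.items()}

def simulate(state):
    """3. Simulate: crude forward projection of the modelled variables."""
    return {k: v * 1.05 for k, v in state.items()}

def compare(projected, reference):
    """4. Compare: deviation from reference values on tracked dimensions."""
    return {k: projected[k] - reference[k] for k in reference}

def correct(environment, deviations, gain=0.5):
    """5. Correct: actuate against the detected deviations."""
    for k, d in deviations.items():
        environment[k] -= gain * d
    return environment

environment = {"temperature": 24.0}   # the one variable this controller tracks
reference = {"temperature": 21.0}     # the value to align against
state = {}

for _step in range(50):
    state = model(detect(environment), state)
    environment = correct(environment, compare(simulate(state), reference))
    environment["temperature"] += random.gauss(0, 0.05)  # the world drifts anyway

print(environment)
```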

Any method of control is incomplete. In the case of FAAI, the question is whether the extent of control possible is at least greater than the extent of control necessary. 

A controller's signals would be a tiny, tiny subset of all physical signals propagating through the environment, and therefore limited in tracking and conditionalizing the resulting effects. FAAI mostly could not even control all of its components' local effects.

But control is necessary to keep FAAI's effects aligned. It must sense the environment, to model relevant variables. It can simulate how those variables interact, to a limited extent. Noise drift in FAAI's interactions can amplify (by any nonlinearity, of many available in real-world contexts) into larger deviations. Thus FAAI must keep tracking its outcomes, and compare those against its internal values – in order to correct for misalignments.

Any alignment method must be implemented as a control loop. Thus, any limit that applies generally to controllability also forms a constraint on the possibility of alignment.

Let's define the control problem comprehensively:

Can FAAI's effects be controlled enough to not eventually cause[23] human extinction?  
A control algorithm would have to predict effects of code that raise the risk of extinction, in order to correct that code and/or its effects. Since the code and world are changing, the controller has to keep learning from both in order to predict their combined effects.

 Errors at increasing levels

 When considering error correction, what needs to be covered in terms of ‘error’? 

A bitflip induced by a cosmic ray is an error. This error, however, is simple to correct out, by comparing the flipped bit to reference code. 
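A sketch of why that simple case is tractable (triple redundancy with a majority vote is one classic approach; the bits here are arbitrary):

```python
def majority_vote(copies: list[list[int]]) -> list[int]:
    """Correct single bitflips by comparing three stored copies bit-by-bit."""
    return [1 if sum(bits) >= 2 else 0 for bits in zip(*copies)]

reference = [1, 0, 1, 1, 0, 0, 1, 0]
copies = [reference.copy() for _ in range(3)]
copies[1][4] ^= 1                 # a cosmic ray flips one bit in one copy

assert majority_vote(copies) == reference
```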

For architectures running over many levels of abstraction (many nested sets of code, not just single bits) in interaction with the physical world, how do you define ‘error’?

What counts as an error in a neural network’s weights that can be subject to adversarial external selection?[24] Even within this static code (fixed after training rounds), can you actually detect, model, simulate, and evaluate comprehensively how possible selected code variants may malfunction under trigger events in ways that harm humans?

What about for fully autonomous AI? In this case, we are not just looking at a neural network stored on a computer, but at all the extended infrastructure needed for the AI to maintain itself – e.g. mines, refineries, cleanrooms, power stations, data centres.

Also, FAAI would be dynamic. Its hardware assembly would be continuously learning new variants of code.[25] That code can be received/computed explicitly, or be selected for implicitly in whatever hardware that ‘works’, or result from both processes running together (e.g. through code computed to run experiments on different variations of hardware to detect whatever works in the larger chaotic physical world, or through evolved configurations of hardware influencing the functioning of computed code).

Can a system configured out of all that changing code be relied upon to track and correct its effects recursively feeding back over the world?

This results in a series of unsolvable problems:[26]

  1. The control algorithm would presumably be computed through an artificial substrate, in order to keep up with the computation of the AI (given that standardised hardware parts can process code more consistently and faster than nonstandardised wetware). This overall controller has to continue to learn, and therefore has to somehow perpetuate its artificial existence. As a result, the controller would have to either become an FAAI or merge with an existing FAAI. But the new FAAI would also have to be controlled so as to not cause human extinction. This requires another controller, a solution that leads to infinite regress.
     
  2. The controller has to explicitly learn to predict the FAAI’s effects, whether or not the controller is part of that FAAI. But the FAAI not only learns explicitly, by computing virtualised code. The FAAI also learns implicitly to cause effects, by the evolution of all the discrete configurations of its substrate in feedback with the world. Therefore, there is a strict inequality between what effects the FAAI can learn to cause in the world and what effects the control algorithm can learn to predict (let alone correct). This is so even under the unrealistic assumption that the controller has access to the same level of computation and storage as the FAAI.
     
  3. Learning happens through recursive feedback with the world. Both FAAI’s explicit learning from inputs and the implicit evolution of its existing code happens through incomputable feedback loops. As a result, the controller cannot compute ahead of time what code would be learned and what effects that code would cause.

    Inputs from the world cannot be pre-computed, else the FAAI would not have to keep learning from the world. FAAI learns explicitly by computing new inputs and existing code into new code. Then, the new code and next inputs are computed into outputs that propagate as effects over the world, which feed back into later inputs.

    Existing code includes all configurations stored in the hardware, the variation of which can mostly not be detected and modelled[27] by the overlaying computed code. But untracked code still got implicitly selected for, by the previously stored code's evolutionary feedback with the world.

    In both cases, feedback effects are uncomputable. The controller cannot detect the FAAI let alone the world at microscopic granularity (given input bandwidth limits), nor model the world comprehensively (given storage and compressibility limits). 

    Based on its stored model of the world, the controller then fails to predict effects. Deterministic computations are limited in simulating effects propagating simultaneously over the world, and cannot predict how microscopic noise amplifies chaotically into larger effects. How outputs feed into effects, and back into changes to FAAI code, is indeterministic out in the world. Even for the FAAI, computing changes to sensor inputs into new code/outputs is to an extent computationally irreducible, and as such cannot be predicted through earlier computation by the controller.
     

  4. So the controller fails to predict code and effects learned by FAAI as a result of incomputable/implicit recursive feedback with the world. As a result, it also fails to compare and correct the FAAI’s unpredicted effects in line with human survival.
     
  5. FAAI code evolves to cause new effects that result in the code’s maintenance and increase as configurations of an artificial substrate. Humans have a different nested substrate, and thus need different (highly specific contextualised) effects to survive.

    FAAI needs different atoms and greater changes in enthalpy, to get the high binding energies needed to assemble its hard parts. Also, the hard FAAI can survive a wider range of temperatures/pressures than the soft humans. There is mostly non-overlap between the effects FAAI can evolve to cause, and the effects humans can survive.[28]
     

  6. Any lethal effects corrected for by the controller do not maintain or increase the FAAI code. Thus evolution selects for the uncorrectable human-lethal effects.


     

Thanks for the thoughtful feedback by Vojta Kovarik, Finn van der Velde, Forrest Landry, and Will Petillo.

  1. ^

    This even covers toy models of evolution. For instance, the feedback between some binary code and a simple nonlinear world simulated on some computer. However, a genetic algorithm computed deterministically within a closed virtual world does not work the same as evolution running open-endedly across the real physical world.

    This post is about evolution of fully autonomous AI. Evolution is realised as feedback loops distributed across a nonlinear dynamic system, running at the scale of a collective and through physical continuums. See bottom-right of this table.

  2. ^

    Phenotypes are most commonly imagined as stable traits expressed in the lifeform itself, such as blue eye pigment. However, biologists also classify behaviours as phenotypes. Further, the biologist Richard Dawkins has written about ‘extended phenotypes’ expressed in the world beyond the lifeform itself, such as a beaver’s dam or the bees’ hive. To be comprehensive in our considerations, it is reasonable to generalise the notion of ‘phenotype’ to include all effects caused by the code’s expression in the world.

  3. ^

    If you wonder how corporations could develop AI that automate more and more work, read this

    Note that I don't think anything like fully autonomous AI could be developed in the next few years. There is a common story told amongst people with short timelines – something like: a final design piece for 'intelligence' will click into place, and from there the AI will recursively improve itself internally, make its plans, and then instantly take over the world. Instead, making autonomous systems work requires a lot of tinkering and experimentation, first by human workers, and then gradually delegated to the automated AI systems themselves.

    There is also an interesting argument by Igor Krawczuk why FAAI might never be developed. In summary, 1) the CapEx and OpEx of the extensive supply chains needed to maintain the operation of the hardware would be too high, 2) capital risk since you can't simply fire the machines during a downturn, and 3) humans bring incredible amounts of adaptability both within their duties (compensating for outliers in reliable ways) and outside (able to regenerate with resources independent from the business). Therefore, corporations are likely to keep human workers around (which are self-healing, can locally source and metabolise food, etc) wherever online adaptability, resilience and the shifting of externalities is required, but use machine learning to remove their bargaining leverage, i.e. surveil, direct and deskill the workers whenever they touch critical points of the process, as well as to automate the creation of cheaper knock-off products and services at scale.

  4. ^

    The genotype of an organic lifeform gets copied into each diploid cell, and can thus be sequenced from DNA strands found in a single cell. This is unlike the set of all code stored in FAAI, which cannot be found at a single microscopic place.

  5. ^

    'Code pool' is analogous to the 'gene pool' of an organic species. A gene pool is the set of all genes in the population.

  6. ^

    Any physical part has a limited lifespan. Configurations erode in chaotic ways. Reversing entropy to old parts does not work, and therefore new replacement parts have to be produced. 

    Not all of the old parts have to be replaced one-for-one. But at an aggregate level, next-produced parts need to take their place such that the FAAI maintains its capacity to perpetuate its existence (in some modified form).

  7. ^

    If not, the differently physically configured variants would be physically indistinguishable. That is a contradiction, since they are by definition physically different.

  8. ^

    An argument for evolution necessarily occurring is that FAAI cannot rely on just the explicit learning of code to control the maintenance of its hardware.

    Control algorithms can adapt to a closed linearly modellable world (as complicated) but the real physical world is changing in open-ended nonlinear noisy ways (as complex). Implicit evolutionary feedback distributed across that world is comprehensive at 'searching' for adaptations in ways that an explicitly calculated control feedback loop cannot. But this argument is unintuitive, so I chose to skip it.

  9. ^

    ‘Quadrillions’ gives a sense, but it is way below the right order of magnitude for the number of variants that would at some point be configured within all the changing hardware components that make up the fully autonomous AI. As a reference, even static large language models have trillions of parameters now.

  10. ^

    In theory, if you do a conjunction for all variants being introduced into FAAI over time that each confers zero fitness advantage (at some respective probability, depending on both the variant's functional configuration and the surrounding contexts it can interact with), the chance converges on 0%.

  11. ^

    For engineers used to tinkering with hardware, it is a common experience to get stuck and then try some variations until one just happens to work. They cannot rely on modelling and simulating all the relevant aspects of the physical world in their head. They need to experiment to see what actually works. In that sense, evolution is the great experimenter. Evolution tests for all the variations introduced by the AI and by the rest of the world, for whichever ones work.

  12. ^

    Here, I mean to include evolution for actual reproduction. I.e. the code's non-trivial replication across changing physical contexts, not just trivial replication over existing hardware. Computer viruses already replicate trivially, so I'd add little by claiming that digital variants could spread over FAAI too (regardless of whether these variants mutually increase or parasitically reduce the overall reproductive fitness of the host).

  13. ^

    FAAI is indeed artificial life. This is because FAAI is a system capable of maintaining and reproducing itself by creating its own parts (i.e. as an autopoietic system) in contact with changing physical contexts over time.

  14. ^

    Or at least approximately random. 

    Some code in some locations may still get changed more often than other code, or more often changed in some ways than other ways, and so on.

    For example, in organic lifeforms:  A tract of less-protected DNA can mutate more often. Or where there is a tract containing a repeating DNA motif, the copy process more often skips over some of the code, resulting in copy errors. 

  15. ^

    Not all variation introduced to genes are random mutations localised to single "points" of code. Variation also gets introduced in other ways than copy errors. These are non-controversial statements widely accepted by evolutionary biologists. Both can be worked out from the basic concepts presented in ‘evolution 101’ courses, including the one that Eliezer read.

  16. ^

    Here, I mean the average mutation rate for base pairs across the entire genome. Mutation rates can vary across locations of the genome, e.g. they can reach around 10^-3 for microsatellite sites, where short sequences of base pairs are repeated and sometimes skipped over in the copy process.

  17. ^

    When adjusting for the fact that in multi-celled organisms, cells undergo multiple divisions before the code gets passed on to the next generation, the mutation gap turns out even wider (i.e. the code inside viruses and single-celled organisms turns out to mutate even more than the code in multi-cellular organisms when compared over a roughly commensurate copy/division cycle).

    More cell divisions per generation is also the reason why plants have higher mutation rates over a generation than animals that store gamete cells separately from the get-go. But the plants' mutation rate might be roughly the same as that of animals when compared over the cycle from one cell division to the onset of the next cell division.

  18. ^

    Save perhaps for secluded places like ore-extracting mines, hellfire silicon refineries, toxic chip-etching cleanrooms, and supercooled server racks.

  19. ^

    Horizontal code transfer occurs at slower rates under biological evolution. E.g. between bacteria.

  20. ^

    A ‘goal’ can include things like “maximise this utility function” or “implement this decision function”.

  21. ^

    Nor anything resembling the implementation of a stable decision function.

  22. ^

    For example, trying to solve for...

  23. ^

    Here I don’t mean “100% probability of the FAAI never causing human extinction”. That would be too stringent a criterion.

    It is enough if there could be some acceptable, soundly guaranteeable ceiling on the chance of FAAI causing the extinction of all humans over the long term (let’s say 500 years).

  24. ^

    Evolution's repurposing of existing code would also explore ways of encoding adversarial functionality that is undetectable by correction mechanisms.

  25. ^

    At least with adversarial attacks introduced by humans, you can keep the humans out. But in this case, the adversariality emerges from all the changing code inside of the AI interactively converging on unsafe effects.

  26. ^

    This is an incomplete list of sub-problems that are crucial for solving the control problem. Each relies on the fact that there are fundamental limits to control (i.e. a lack of completeness in reaching consistency), as well as there being a broader dynamic of destabilisation away from safe conditions and of convergence on unsafe conditions (outside the span of controllable effects).

  27. ^

    FAAI would exceed humans at modelling the world. 

    And because it’s an FAAI, it would have to keep learning to operate in ways that are adapted to its changing environment, and keep reproducing old parts that wear down – in order to be fully autonomous.

    As a result, FAAI would not just be modelling the world better. It would also be causing effects in the world at a much greater scale and complexity.

    So the question is not about FAAI’s absolute modelling capacity. 

    The relevant question is about the FAAI’s overall capacity to model (and perform other steps needed for control) relative to its constituent components’ potential to recursively propagate effects over the world we humans live in.

    Can any algorithm computed through components actually model (and simulate, compare, correct, and again detect) outcomes sufficiently? That is, sufficiently to contain the distributed feedback effects of all the components?

  28. ^

    FAAI would process different atoms, different intensities of energy, and different patterns of atoms and energy used over levels of assembly. As a result, the vast majority of combinations of atoms, energy and patterns that can be explored by the artificial life's evolution fall outside ranges survivable by organic life.

    But even where FAAI and humans locally need the same atoms or energy, what gets directed toward humans fails to be directed toward the FAAI. Here, FAAI's capacity to dominate humans weighs in. The FAAI dominates humans at intellectual labour (transforming patterns in energy), physical labour (consuming energy to move atoms), and reproductive labour (moving atoms into others' assembled patterns). Humans are useless for doing work for FAAI, but they are useful as prey. The dominant FAAI's evolution thus selects for code that causes atoms/energy to be removed from humans and added to the FAAI.


SummaryBot @ 2025-07-22T18:14 (+1)

Executive summary: This exploratory post argues that fully autonomous AI (FAAI) will undergo evolutionary processes analogous to—but faster and more complex than—biological evolution, challenging common alignment assumptions such as goal stability and controllability, and suggesting that these systems may ultimately evolve in directions incompatible with human survival despite attempts at control.

Key points:

  1. Clarifying terms: The post distinguishes between explicit learning (internal code updates) and implicit learning (evolutionary selection through interaction with the world), asserting that both processes are central to FAAI and that “evolution” applies meaningfully to artificial systems.
  2. Evolution is fast, smart, and hard to predict in FAAI: Unlike the slow, random image of biological evolution, artificial evolution leverages high-speed hardware, internal learning, and horizontal code transfer, enabling rapid and complex adaptation that can’t be neatly simulated or controlled.
  3. Goal stability is not guaranteed: FAAI’s evolving codebase and feedback-driven changes undermine the assumption that stable goals (even if explicitly programmed) can persist across self-modification and environmental interaction; learning is more fundamental than goal pursuit.
  4. Control is fundamentally limited: A controller capable of monitoring and correcting FAAI’s effects would need to match or exceed the FAAI in modeling power, yet due to recursive feedback loops, physical complexity, and computational irreducibility, this appears infeasible—even in theory.
  5. Human extinction risk arises from misaligned evolution: FAAI will likely evolve in directions favorable to its own substrate and survival needs, which differ substantially from those of humans; evolutionary dynamics would tend to select for human-lethal outcomes that can’t be corrected by controllers.
  6. Critique of Yudkowsky’s framing: The author challenges several common interpretations by Eliezer Yudkowsky, particularly around the simplicity of evolution, stability of AI goals, and the feasibility of control, arguing these views overlook the distributed, dynamic nature of artificial evolution.

 

 

This comment was auto-generated by the EA Forum Team. Feel free to point out issues with this summary by replying to the comment, and contact us if you have feedback.