My disagreements with "AGI Ruin: A List of Lethalities"

By Sharmake @ 2024-09-15T17:22 (+16)

This is a crosspost, probably from LessWrong. Try viewing it there.

SummaryBot @ 2024-09-16T16:57 (+1)

Executive summary: The author disagrees with many of Eliezer Yudkowsky's claims about AI alignment being extremely difficult or impossible, arguing that synthetic data, instruction following, and other approaches make alignment more tractable than Yudkowsky suggests.

Key points:

  1. Synthetic data and honeypot traps can help detect and prevent deceptive AI behavior.
  2. Alignment likely generalizes further than capabilities, contrary to Yudkowsky's claims.
  3. Dense reward functions and control over data sources give humans advantages over evolution for shaping AI goals.
  4. Language and visual data can ground AI systems in real-world concepts and values.
  5. Instruction-following may be a viable alternative to formal corrigibility for aligning AI systems.
  6. Many tasks previously thought to require general intelligence can be solved by narrower systems, suggesting transformative AI may not require full superintelligence.

This comment was auto-generated by the EA Forum Team. Feel free to point out issues with this summary by replying to the comment, and contact us if you have feedback.

Sharmake @ 2024-09-17T14:50 (+3)

This is mostly correct as a summary of my position, but on point 6, I want to point out that while it is technically true, I do fear economic incentives work against this path.

Agree with the rest of the summary though.