What do you mean by ‘alignment is solvable in principle’?
By Remmelt @ 2025-01-17T15:03 (+10)
Remmelt @ 2025-01-17T16:40 (+3)
Here's how I specify terms in the claim:
- AGI is a set of artificial components, connected physically and/or by information signals over time, that in aggregate sense and act autonomously over many domains.
- 'artificial' as configured out of a (hard) substrate that can be standardised to process inputs into outputs consistently (in contrast to what our organic parts can do).
- 'autonomously' as continuing to operate without needing humans (or any other species that shares a common ancestor with humans).
- Alignment is, at a minimum, the control of the AGI's components (as modified over time) so that they do not, with probability above some guaranteeable high floor, propagate effects that cause the extinction of humans.
- Control is the implementation of one or more feedback loops through which the AGI's effects are detected, modelled, simulated, compared to a reference, and corrected (see the sketch after this list).
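To make the 'control' definition concrete, here is a minimal sketch of one such feedback loop in Python. Everything in it (the `Effect` type, the thresholds, the function names) is a hypothetical illustration of the detect → model → simulate → compare → correct structure, not a proposed alignment mechanism.

```python
# Minimal sketch of the feedback loop named in the 'control' definition above.
# All names, units, and thresholds are hypothetical placeholders for illustration.

from dataclasses import dataclass


@dataclass
class Effect:
    """An observed effect propagated by the AGI's components."""
    domain: str
    magnitude: float  # modelled impact, in arbitrary units


def detect(sensor_readings: list[dict]) -> list[Effect]:
    """Detect: turn raw sensor readings into candidate effects."""
    return [Effect(r["domain"], r["value"]) for r in sensor_readings]


def model_and_simulate(effects: list[Effect], horizon: int) -> list[Effect]:
    """Model + simulate: project each effect forward over a time horizon.

    Assumes, purely for illustration, that magnitudes scale linearly with the horizon.
    """
    return [Effect(e.domain, e.magnitude * horizon) for e in effects]


def compare(projected: list[Effect], reference_ceiling: float) -> list[Effect]:
    """Compare: return the projected effects that exceed the reference ceiling."""
    return [e for e in projected if e.magnitude > reference_ceiling]


def correct(violations: list[Effect]) -> None:
    """Correct: apply some corrective action for each out-of-reference effect."""
    for v in violations:
        print(f"correcting effect in domain {v.domain!r} (projected {v.magnitude})")


def control_loop(sensor_readings: list[dict], horizon: int = 10,
                 reference_ceiling: float = 1.0) -> None:
    """One pass of the detect -> model/simulate -> compare -> correct loop."""
    effects = detect(sensor_readings)
    projected = model_and_simulate(effects, horizon)
    violations = compare(projected, reference_ceiling)
    correct(violations)


if __name__ == "__main__":
    control_loop([{"domain": "power grid", "value": 0.3},
                  {"domain": "biosphere", "value": 0.02}])
```

The point is only the loop structure: each stage feeds the next, and the correction step closes the loop by acting back on the components whose effects were detected.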