Success without dignity: a nearcasting story of avoiding catastrophe by luck
By Holden Karnofsky @ 2023-03-15T20:17 (+113)
This is a crosspost, probably from LessWrong. Try viewing it there.
MaxRa @ 2023-03-17T13:13 (+3)
Thanks for laying out these points! Having engaged with many people's thoughts on these issues, I'm similarly unconvinced by the very unfavourable odds many people seem to assign, so I really look forward to the discussion here.
I'm particularly curious about this point, because when I think about AI risk scenarios I put quite a lot of stock in the potential for very direct government interventions once the risks are more obvious and more clearly a near-term problem:
Some decisive demonstration of danger is achieved, and AIs also help to create a successful campaign to persuade key policymakers to aggressively work toward a standards and monitoring regime. (This could be a very aggressive regime if some particular government, coalition or other actor has a lead in AI development that it can leverage into a lot of power to stop others’ AI development.)
AI already seems to me to clearly be among the top geopolitical priorities for the US, and when that is the case the space of policy options seems fairly unrestricted.
Vasco Grilo @ 2023-03-16T16:43 (+2)
Thanks for the post!
More broadly, it seems to me like essentially all attempts to make the most important century go better also risk making it go a lot worse, and for anyone out there who might’ve done a lot of good to date, there are also arguments that they’ve done a lot of harm (e.g., by raising the salience of the issue overall).
Even “Aligned AI would be better than misaligned AI” seems merely like a strong bet to me, not like a >95% certainty, given what I see as the appropriate level of uncertainty about topics like “What would a misaligned AI actually do, incorporating acausal trade considerations and suchlike?”; “What would humans actually do with intent-aligned AI, and what kind of universe would that lead to?”; and “How should I value various outcomes against each other, and in particular how should I think about hopes of very good outcomes vs. risks of very bad ones?”
To reiterate, on balance I come down in favor of aligned AI, but I think the uncertainties here are massive - multiple key questions seem broadly “above our pay grade” as people trying to reason about a very uncertain future.
I really like these points. It is often easy to forget how uncertain the future is.
Michael_Cohen @ 2023-03-15T21:50 (+2)
How “natural” are intended generalizations (like “Do what the supervisor is hoping I’ll do, in the sense that most humans would mean this phrase rather than in a precise but malign sense”) vs. unintended ones (like “Do whatever maximizes reward”)?
I think this is an important point. I consider the question in this paper, published last year in AI Magazine. See the "Competing Models of the Goal" section, and in particular the "Arbitrary Reward Protocols" subsection. (2500 words)
I think there's something missing from the discussion here, which is the key point of that section. First, I claim that sufficiently advanced agents will likely need to engage in hypothesis testing between multiple plausible models of what worldly events lead to reinforcement, or else they would fail at certain tasks. So even if the "intended generalization" is quite a bit more plausible to the agent than the unintended one, as long as the hypotheses are cheap to test and the agent has a long horizon, it would likely deem wireheading worth trying out, just in case. That said, in some situations (I mention a chess game in the paper) I expect the intended generalization to be so much simpler that the alternative isn't even worth trying out.
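To make the "worth trying out, just in case" point concrete, here is a minimal toy expected-value sketch (my own illustrative numbers and function names, not something from the paper): even with a low prior on the unintended reward model, a cheap test pays for itself over a long horizon, but not over a short one like a single chess game.

```python
# Toy sketch (hypothetical numbers) of when testing the "unintended" reward
# model is worthwhile for an agent, relative to never testing it.

def value_of_testing(p_unintended, test_cost, horizon, gain_if_true):
    """Expected net value of spending a few steps testing the unintended
    reward model: probability-weighted gain over the remaining horizon,
    minus the cost of running the test."""
    expected_gain = p_unintended * gain_if_true * horizon
    return expected_gain - test_cost

# Long horizon: even a 1% prior on the unintended model makes the cheap test
# worthwhile (0.01 * 1.0 * 10_000 - 5 = +95).
print(value_of_testing(p_unintended=0.01, test_cost=5.0,
                       horizon=10_000, gain_if_true=1.0))

# Short horizon (e.g. one chess game): the test isn't worth the cost
# (0.01 * 1.0 * 50 - 5 = -4.5).
print(value_of_testing(p_unintended=0.01, test_cost=5.0,
                       horizon=50, gain_if_true=1.0))
```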
Just a warning before you read it: I use the word "reward" a bit differently than you appear to. In my terminology, I would phrase this as "Do what the supervisor is hoping" vs. "Do whatever maximizes the relevant physical signal", and the agent would essentially wonder which of the two constitutes "reward", rather than being a priori sure that its past rewards "are" those physical signals.