Where I currently disagree with Ryan Greenblatt’s version of the ELK approach

By So8res @ 2022-09-29T21:19 (+21)

Context: This post is my attempt to make sense of Ryan Greenblatt's research agenda, as of April 2022. I understand Ryan to be heavily inspired by Paul Christiano, and Paul left some comments on early versions of these notes.

Two separate things I was hoping to do, that I would have liked to factor into two separate writings, were (1) translating the parts of the agenda that I understand into a format that is comprehensible to me, and (2) distilling out conditional statements we might all agree on (some of us by rejecting the assumptions, others by accepting the conclusions). However, I never got around to that, and this has languished in my drafts folder too long, so I'm lowering my standards and putting it out there.

The process that generated this document is that Ryan and I bickered for a while, then I wrote up what I understood and shared it with Ryan, and we repeated this process a few times. I've omitted various intermediate drafts, on the grounds that sharing a bunch of intermediate positions that nobody endorses is confusing (more so than seeing more of the process is enlightening), and on the grounds that if I try to do something better, then what happens instead is that the post languishes in the drafts folder for half a year.

(Thanks to Ryan, Paul, and a variety of others for the conversations.)

 

Nate's model towards the end of the conversation

Ryan’s plan, as Nate currently understands it:

 

Nate's response:

 

An attempt at conditional agreement

I suggested the following:

 

If it is the case that:

... THEN the Paulian family of plans don't provide much hope.

 

My understanding is that Ryan was tentatively on board with this conditional statement, but Paul was not.

 

Postscript

Reiterating a point above: observe how this whole scheme has basically assumed that capabilities won't start to generalize relevantly out of distribution. My model says that they eventually will, and that this is precisely when things start to get scary, and that one of the big hard bits of alignment is that once that starts happening, the capabilities generalize further than the alignment. This problem has simply been assumed away in this agenda, as far as I can tell, before we even dive into the details of this framework.

To be clear, I'm not saying that this decomposition of the problem fails to capture difficult alignment problems. The "prevent the AGI from figuring out it's in deployment" problem is quite difficult! As is the "get an ELK head that can withstand superintelligent adversaries" problem. I think these are the wrong problems to be attacking, in part on account of their difficulty. (Where, to be clear, I expect that toy versions of these problems are soluble, just not with solutions rated for the type of opposition it sounds like the rest of this plan requires.)