Quick takes on "AI is easy to control"

By So8res @ 2023-12-02T22:33 (–12)


Quintin Pope @ 2023-12-03T00:19 (+13)

(Didn't consult Nora on this; I speak for myself)


I only briefly skimmed this response, and will respond even more briefly.

Re "Re: "AIs are white boxes""

You apparently completely misunderstood the point we were making with the white box thing. It has ~nothing to do with mech interp. It's entirely about whitebox optimization being better at controlling stuff than blackbox optimization. This is true even if the person using the optimizers has no idea how the system functions internally. 

Re: "Re: "Black box methods are sufficient"" (and the other stuff about evolution)

Evolution analogies are bad. There are many specific differences between ML optimization processes and biological evolution that predictably result in very different high-level dynamics. You should not rely on one to predict the other, as I have argued extensively elsewhere.
 

Trying to draw inferences about ML from bio evolution is only slightly less absurd than trying to draw inferences about cheesy humor from actual dairy products. Regardless of the fact that they can both be called "optimization processes", they're completely different things, with different causal structures, and crucially, those differences in causal structure explain their different outcomes. There's thus no valid inference from "X happened in biological evolution" to "X will eventually happen in ML", because X happening in biological evolution is explained by evolution-specific details that don't appear in ML (at least for most alignment-relevant Xs that I see MIRI people reference often, like the sharp left turn).

Re: "Re: Values are easy to learn, this mostly seems to me like it makes the incredibly-common conflation between "AI will be able to figure out what humans want" (yes; obviously; this was never under dispute) and "AI will care""

This wasn't the point we were making in that section at all. We were arguing about concept learning order and the relative ease of internalizing human values versus other features for basing decisions on: human values are easy features to learn / internalize / hook up to decision making, so on any natural progression up the learning-capacity ladder, you end up with an AI that's aligned before you end up with one so capable it can destroy the entirety of human civilization by itself.
 

Re "Even though this was just a quick take, it seemed worth posting in the absence of a more polished response from me, so, here we are."

I think you badly misunderstood the post (e.g., multiple times assuming we're making an argument we're not, based on shallow pattern matching of the words used: interpreting "whitebox" as meaning mech interp and "values are easy to learn" as "it will know human values"), and I wish you'd either take the time to actually read / engage with the post in sufficient depth to not make these sorts of mistakes, or not engage at all (or at least not be so rude when you do it). 
 

(Note that this next paragraph is speculation, but a possibility worth bringing up, IMO): 

As it is, your response feels like you skimmed just long enough to pattern match our content to arguments you've previously dismissed, then regurgitated your cached responses to those arguments. Without further commenting on the merits of our specific arguments, I'll just note that this is a very bad habit to have if you want to actually change your mind in response to new evidence/arguments about the feasibility of alignment.


Re: "Overall take: unimpressed."

I'm more frustrated and annoyed than "unimpressed". But I also did not find this response impressive. 

Chris Leong @ 2023-12-03T10:09 (+1)

I'm against downvoting this article into the negative.

I think it is worthwhile hearing someone's quick takes even when they don't have time to write a full response. Even if the article contains some misunderstandings (not claiming it does one way or the other), it still helps move the conversation forward by clarifying where the debate is at.

David Seiler @ 2023-12-03T23:16 (+13)

...it still helps move the conversation forward by clarifying where the debate is at.

Anything Nate writes would do that, because he's one of the debaters, right?  He could have written "It's a stupid post and I'm not going to read it", literally just that one sentence, and it would still tell us something surprising about the debate.  In some ways that post would be better than the one we got: it's shorter, and much clearer about how much work he put in.  But I would still downvote it, and I imagine you would too.  Even allowing for the value of the debate itself, the bar is higher than that.

For me, that bar is at least as high as "read the whole article before replying to it".  If you don't have time to read an article that's totally fine, but then you don't have time to post about it either.

niplav @ 2023-12-03T21:16 (+3)

I felt-sense-disagree. (I haven't yet downvoted the article, but I strongly considered it). I'll try to explore why I feel that way.

One reason is probably that I treat posts as making a different claim than other forms of publishing on this forum (and LessWrong): they (implicitly) claim to be finished & polished content. When I open a post I expect the author to have done some work that tries to uphold standards of scholarship and care, which this post doesn't show. I'd've been far less disappointed if this were a comment or a shortform post.

The other part is probably about status and the standards put upon people with high status: I expect high-status people not to put much effort into whatever they produce, since they can coast on status, which seems like what's happening here. (Although one could argue that the MIRI faction is losing status / is already low-ish status, so this consideration doesn't apply here.)

Additionally, I was disappointed that the text didn't say anything that I wouldn't have expected, which probably fed into my felt-sense of wanting to downvote. I'm not sure I reflectively endorse this feeling.