How I Formed My Own Views About AI Safety

By Neel Nanda @ 2022-02-27T18:52 (+134)

This is a linkpost to https://www.neelnanda.io/blog/47-inside-views

Disclaimer: I work as a researcher at Anthropic, but this post entirely represents my own views, rather than the views of my employer.

Introduction

I’ve spent the past two years getting into the field of AI Safety. One important message I heard as I was entering the field was that I needed to “form an inside view about AI Safety” - that I needed to form my own beliefs and think for myself rather than just working on stuff because people smarter than me cared about it. And this was incredibly stressful! I think the way I interpreted this was pretty unhealthy, caused me a lot of paralysing uncertainty and anxiety, and almost caused me to give up on getting into the field. But I feel like I’ve now reached a point I’m comfortable with, where I think I somewhat have my own inside views on things and understand how to form them.

In this post, I try to explain the traps I fell into and why, what my journey actually looked like, and my advice for how to think about inside views, now that I’ve seen what not to do. This is a complex topic and I think there are a lot of valid perspectives, but hopefully my lens is novel and useful for some people trying to form their own views on confusing topics (AI Safety or otherwise)! (Note: I don’t discuss why I now think AI Safety is important and worth working on - that’s a topic for a future post!)

The Message of Inside Views

First, some context to be clear about what I mean by inside views. As I understand it, this is a pretty fuzzily defined concept, but roughly means “having a clear model and argument in my head, starting from some basic and reasonable beliefs about the world, that gets me to a conclusion like ‘working on AI Safety is important’ without needing to rely on deferring to people”. This feels highly related to the concept of gears-level models. This is in comparison to outside views, or deferring to people, where the main reason I believe something is that smart people I respect believe it. In my opinion, there’s a general vibe in the rationality community that inside views are good and outside views are bad (see Greg Lewis’ In Defence of Epistemic Modesty for a good argument for the importance of outside views and deferring!). Note that this is not the Tetlockian sense of the words, used in forecasting, where outside view means ‘look up a base rate’ and inside view means ‘use my human intuition, which is terribly calibrated’, and where the standard wisdom is outside view > inside view.

Good examples of this kind of reasoning: Buck Shlegeris’ My Personal Cruxes for Working on AI Safety, Richard Ngo’s AGI Safety from First Principles, and Joseph Carlsmith’s report on Existential Risk from Power-Seeking AI. Note that, while these are all about the question of ‘is AI Safety a problem at all’, the notion of an inside view also applies well to claims like ‘de-confusion research/reinforcement learning from human feedback/interpretability is the best way to reduce existential risk from AI’ - ie arguments for specific research agendas and directions.

How I Interpreted the Message of Inside Views

I’m generally a pretty anxious person and bad at dealing with uncertainty, and sadly, this message resulted in a pretty unhealthy dynamic in my head. It felt like I had to figure out for myself the conclusive truth of ‘is AI Safety a real problem worth working on’ and which research directions were and were not useful, so I could then work on the optimal one. And that it was my responsibility to do this all myself - that it was bad and low-status to work on something just because smart people endorsed it.

This was hard and overwhelming because there are a lot of agendas, and a lot of smart people with different and somewhat contradictory views. So this felt basically impossible. But it also felt like I had to solve it before I actually started any permanent research positions (ie by the time I graduated), in case I screwed up and worked on something sub-optimal. And thus, I had to solve a problem that, empirically, most smart people must be screwing up, and do it all before I graduated. This seemed basically impossible, and created a big ugh field around exploring AI Safety - which was already pretty aversive, because it involved re-skilling and deciding between a range of different paths like PhDs vs going straight into industry, and generally didn’t have a clean path into it.

My Journey

So, what actually happened to me? I started taking AI Safety seriously in my final year of undergrad. At the time, I bought the heuristic arguments for AI Safety (like, something smarter than us is scary), but didn’t really know what working in the field looked like beyond ‘people at MIRI prove theorems I guess, and I know there are people at top AI labs doing safety stuff?’ I started talking to lots of people who worked in the field, and gradually got data on what was going on. This was all pretty confusing and stressful, and was competing with going into quant finance - a safe, easy, default path that I already knew I’d enjoy.

After graduating, I realised I had a lot more flexibility than I thought. I took a year out, and managed to finagle my way into doing three back-to-back AI Safety internships. The big update was that I could explore AI Safety without risking too much - I could always go back into finance in a year or two if it didn’t work out. I interned at FHI, DeepMind and CHAI - working on mathematical/theoretical safety work, empirical ML work on fairness and bias, and empirical interpretability work, respectively. I also did the AGI Fundamentals course, and chatted to a lot of people at the various orgs I worked at and at conferences. I tried to ask all the researchers I met about their theory of change for how their research actually matters. One thing that really helped me was chatting to a researcher at OpenAI who said that, when he started, he didn’t have clear inside views, but that he’d formed them fairly organically over time - just spending time thinking and being in a professional research environment was enough.

At the end of the year, I had several offers and ended up joining Anthropic to work on interpretability with Chris Olah. I wasn’t sure this was the best option, but I was really excited about interpretability, and it seemed like the best bet. A few months in, this was clearly a great decision and I’m really excited about the work - but it wouldn’t have been the end of the world if I’d decided the work wasn’t very useful or a bad fit, and I expect I could have left within a few months without hard feelings. As I’ve done research and talked to Chris + other people here, I’ve started to form clearer views on what’s going on with interpretability and the theory of impact for it and Anthropic’s work, but there are still big holes in my understanding where I’m confused or deferring to people. And this is fine! I don’t think it’s majorly holding me back from having an impact in the short-term, and I’m forming clearer views with time.

My Advice for Thinking About & Forming Inside Views

Why to form them?

I think there are four main reasons to care about forming inside views: motivation, research quality, impact, and community epistemics.

These are pretty different, and it’s really important to be clear about which reasons you care about! Personally, I mostly care about motivation > research quality = impact >> community epistemics

How to form them?

Misc


Miranda_Zhang @ 2022-02-28T13:49 (+12)

Thank you for this. I am a community-builder and I've definitely started emphasizing the importance of developing inside views to my group members. However, it seems like there may be domains where developing an inside view is relatively less important (e.g., algebraic geometry vs moral philosophy), because experts in the former appear to have better feedback loops. Given this, I'm curious whether you think community-builders might want to form inside views* on which areas to emphasize inside view formation for, to help communicate more accurately to our members?

*I'm not confident I'm describing an 'inside view.' Maybe this is something like, 'getting a sense of outside views across an array of domains?'

I found your post doubly useful because I've recently been exploring how I can form inside views, which I've found both practically and emotionally difficult. Not being familiar with the rationality or AI safety community, I was surprised by how much emphasis was placed on inside views and started feeling a bit like an imposter in the EA community. I definitely felt like it was "low-status" to not have inside views on the causes I prioritized, though I expect at least some of this was due to my own anxiety.

Being able to see how you tackled this is really useful, as it gives me another model for how I could develop inside views (particularly on AI risk, which is the first thing I'm working on). It also reinforces that a lot of people have more career flexibility than they think - and so, perhaps, it's okay if I haven't figured out whether I should switch from community building into AI safety research in the three months before I graduate!

Jamie Bernardi @ 2022-03-03T10:54 (+4)

Hey! I have been thinking about this a lot from the perspective of a confused community builder / pipeline strategist, too. I didn't get as far as Neel; it's been great to read this post before getting anywhere near finishing my thoughts. It captures a lot of the same things better than I had. Thanks for your comment too - definitely a lot of overlap here!

I have got as far as some ideas here, and would love any initial thoughts before I try to write them up with more certainty.

First a distinction which I think you're pointing at - an inside view on what? The thing I can actually have an excellent inside view about as a (full-time) community builder is how community building works. Like, how to design a programme, how people respond to certain initiatives, how likely certain things are to work, etc.

Next, programmes that lead to working in industry, academic field building, independent research, etc, look different. How do I decide which to prioritise? This might require some inside view on how each direction changes the world (and interacts with the others), and lead to an answer on which I’m most optimistic about supporting. There is nobody to defer to here, as practitioners are all (rightly) quite bullish about their choice. Having an inside view on which approach I find most valuable will lead to quite concrete differences in the ultimate strategy I’m working towards or direction I’m pointing people in, I think.

When it comes to what to think about object-level work (i.e. how does alignment happen, technically), I get more hazy on what I should aim for. By statistical arguments, I reckon most inside views that exist on what work is going to be valuable are probably wrong. Why would mine be different? Alternatively, they might all be valuable, so why support just one? Or something in between. Either way, if I am doing meta work, it will probably be wrong to be bullish about my single inside view on 'what will go wrong'. I think I should aim to support a number of research agendas if I don't have strong reasons to believe some are wrong. I think this is where I will be doing most of my deferral, ultimately (and as the field shifts from where I left it).

However, understanding how valuable the object-level work is does seem important for deciding which directions to support (e.g. academia vs industry), so I’m a bit stuck on where to draw the line. As Neel says, I might hope to get as far as understanding what other people believe about their agenda and why - I always took this as "can I model the response person X would give, when considering an unseen question", rather than memorising person X's response to a number of questions.

I think where I am landing on this is that it might be possible to assume a uniform prior over the directions I could take, and adjust my posterior by 'learning things' and understanding their models on both the direction-level and object-level, properly. Another thought I want to explore - is this something like a worldview diversification over directions? It feels similar, as we’re in a world where it ‘might turn out’ some agenda or direction was correct, but there’s no way of knowing that right now.

To confirm - I believe people doing the object-level work (i.e. alignment research) should be bullish about their inside view. Let them fight it out, and let expert discourse decide what is “right” or “most promising”. I think this amounts to Neel’s “truth-seeking” point.

Miranda_Zhang @ 2022-03-04T03:43 (+2)

Hey Jamie, thanks for this! Seems like you've thought about it quite a bit - probably more than I have - but here are my initial thoughts. Hope this is helpful to you; if so, maybe we should chat more!

First a distinction which I think you're pointing at - an inside view on what? [...]  How do I decide which to prioritise? This might require some inside view on how each direction changes the world (and interacts with the others), and lead to an answer on which I’m most optimistic about supporting. There is nobody to defer to here, as practitioners are all (rightly) quite bullish about their choice. Having an inside view on which approach I find most valuable will lead to quite concrete differences in the ultimate strategy I’m working towards or direction I’m pointing people in, I think.

Agree! When I first wrote my comment, I labelled this a 'meta-inside view': an inside view on what somebody (probably you, but possibly others like your group members) needs to form inside views on. But this might be too confusing compared to less jargon-y phrases like 'prioritizing what you form an inside view on first' or something.

Regardless, I think we are capturing the same issue here - although I don't use 'issue' in a negative sense. In my ideal world, community-builders would form pretty different views on causes to prioritize because this would help increase intellectual diversity and the discovery of the 'next-best' thing to work on. That doesn't mean, however, that there couldn't be some sort of guidance for how community-builders might go about figuring out what to prioritize.

I think this is where I will be doing most of my deferral, ultimately (and as the field shifts from where I left it).

Yeah, I think this is the status quo for any field that one isn't an expert on. Community-builders may be experts on community-building, but that doesn't extend to other domains, hence the need for deferral. Perhaps the key difference here is that community-builders need to be extra aware of the ever-shifting landscape and stay plugged-in, since their advice may directly impact the 'next generation' of EAs.

However, understanding how valuable the object-level work is does seem important for deciding which directions to support (e.g. academia vs industry), so I’m a bit stuck on where to draw the line. As Neel says, I might hope to get as far as understanding what other people believe about their agenda and why - I always took this as "can I model the response person X would give, when considering an unseen question", rather than memorising person X's response to a number of questions.

Hmm, I think you're right that developing an inside view for a specific cause would influence the levers that you think are most important (which has effects on your CB efforts, etc.) - but I'm not sure this has many implications for what CBs should do. My prior is that it is very unlikely that there are any causes where only a handful of levers and skillsets would be relevant, such that I would feel comfortable suggesting that people rely more on personal fit to figure out their careers once they've chosen a cause area. However, I acknowledge that there is definitely more need in certain causes (e.g., software engineers for AI safety): I just don't think that the CB level is the right level to apply this knowledge. I would feel more comfortable having cause-specific recruiters (cf. University community building seems like the wrong model for AI safety).

I definitely agree on the latter point. I see community-builders as both building and embodying pipelines to the EA community! As the 'point-of-entry' for many potential EAs, I think it is sufficient for CBs to be able to model the mainstream views for core cause areas. I expect that the most talented CBs will probably have developed inside views for a specific cause outside of CB, but that doesn't seem necessary to me for good CB work. 

I think where I am landing on this is that it might be possible to assume a uniform prior over the directions I could take, and adjust my posterior by 'learning things' and understanding their models on both the direction-level and object-level, properly. Another thought I want to explore - is this something like a worldview diversification over directions? It feels similar, as we’re in a world where it ‘might turn out’ some agenda or direction was correct, but there’s no way of knowing that right now.

Oh, I'm a huge fan of worldview diversification! I don't currently have thoughts on starting with a non-/uniform prior ... I am, honestly, somewhat inclined to suggest that CBs 'adapt' a bit to the communities in which they are working. That is, perhaps what should partly affect a CB's prioritization re: inside view development is the existing interests of their group.  For example, considering the Bay Area's current status as a tech hub, it seems pretty important for CBs in the Bay Area to develop inside views on, say, AI safety - even if AI safety may not be what they consider the most pressing issue in the entire world. What do you think?

To confirm - I believe people doing the object-level work (i.e. alignment research) should be bullish about their inside view. Let them fight it out, and let expert discourse decide what is “right” or “most promising”.

Also completely agree here. : )

Akash @ 2022-02-28T23:22 (+9)

Thank you for this post, Neel! I think it's really useful to amplify stories about how people developed (and are continuing to develop) inside views. 

I've recently been thinking about an analogy between "developing inside views" and "thinking deliberately about one's career." I think there's a parallel with 80,000 Hours' general advice. It's like, "We have 80,000 Hours in our career; career planning is really hard, but wouldn't it make sense to spend at least a few dozen or a few hundred hours planning before making big choices? Isn't it super weird that we live in a society in which the 'default' for career planning is usually so unhelpful? Even if we won't land on the perfect career, it seems worth it to spend some time actually trying to figure out what we should do, and we might be surprised with how far that takes us."

The same frame might be useful for developing inside views in AI safety. When I think about someone with 80,000 Hours in their career, I'm like "wow, wouldn't it be nice for them to spend at least a few dozen hours upfront to try to see if they can make progress developing an inside view? It's quite plausible they just spend a few dozen hours being super confused, and then they can go on and do what they were going to do anyways. But the potential upside is huge!"

This has motivated me to ask myself these questions, and I might experiment with asking others: How much have you really tried to form an inside view? If you could free up a few hours, what would you do to make progress?

I've found these questions (and similar questions) to be pretty helpful. They allow me to keep my eye on the objective (developing an inside view on a highly complicated topic in which many experts disagree) without losing sight of the trail (there are things I can literally do, right now, to continue taking small steps in the right direction).

Sam Clarke @ 2022-03-02T10:35 (+8)

Nice post! I agree with ~everything here. Parts that felt particularly helpful:

One thing I disagree with: the importance of forming inside views for community epistemic health. I think it's pretty important. E.g. I think that ~2 years ago, the arguments for the longterm importance of AGI safety were pretty underdeveloped; that since then lots more people have come out with their inside views about it; and that now the arguments are in much better shape.

Sam Clarke @ 2022-03-02T14:29 (+9)

Also, nitpick, but I find "inside view" a more confusing and jargony way of just saying "independent impressions" (okay, also jargon to some extent, but closer to plain English), which also avoids the problem you point out: inside view is not the opposite of the Tetlockian sense of outside view (and the other ambiguities with outside view that another commenter pointed out).

Neel Nanda @ 2022-03-02T16:52 (+3)

The complaint that it's confusing jargon is fair. Though I do think the Tetlock sense + phrase inside view captures something important - my inside view is what feels true to me, according to my personal best guess and internal impressions. Deferring doesn't feel true in the same way; it feels like I'm overriding my beliefs, not like how the world is.

This mostly comes under the motivation point - maybe, for motivation, inside views matter but independent impressions don't? And people differ on how they feel about the two?

Sam Clarke @ 2022-03-14T16:19 (+8)

I'm still confused about the distinction you have in mind between inside view and independent impression (which also have the property that they feel true to me)?

Or do you have no distinction in mind, but just think that the phrase "inside view" captures the sentiment better?

Neel Nanda @ 2022-03-16T05:12 (+3)

Inside view feels deeply emotional and tied to how I feel the world to be; independent impression feels cold and abstract.

Neel Nanda @ 2022-03-02T16:51 (+3)

One thing I disagree with: the importance of forming inside views for community epistemic health. I think it's pretty important. E.g. I think that ~2 years ago, the arguments for the longterm importance of AGI safety were pretty underdeveloped; that since then lots more people have come out with their inside views about it; and that now the arguments are in much better shape.

I want to push back against this. The aggregate benefit may have been high, but when you divide it by all the people trying, I'm not convinced it's all that high.

Further, that's an overestimate - the actual question is more like 'if the people who are least enthusiastic about it stop trying to form inside views, how bad is that?'. And I'd both guess that impact is fairly heavy tailed, and that the people most willing to give up are the least likely to have a major positive impact.

I'm not confident in the above, but it's definitely not obvious.

Sam Clarke @ 2022-03-14T16:12 (+4)

Thanks - good points, I'm not very confident either way now

Yonatan Cale @ 2022-02-28T00:01 (+1)

Linking: Taboo Outside View [LessWrong post, 292 karma]