A freshman year during the AI midgame: my approach to the next year

By Buck @ 2023-04-14T00:38 (+179)

I recently spent some time reflecting on my career and my life, for a few reasons:

It was my 29th birthday, an occasion which felt like a particularly natural time to think through what I wanted to accomplish over the course of the next year 🙂.
It seems like AI progress is heating up.
It felt like a good time to reflect on how Redwood has been going, because we’ve been having conversations with funders about getting more funding.

I wanted to have a better answer to these questions:

What’s the default trajectory that I should plan for my career to follow? And what does this imply for what I should be doing right now?
How much urgency should I feel in my life?
- How hard should I work?
- How much should I be trying to do the most valuable-seeming thing, vs engaging in more playful exploration and learning?

In summary:

For the purposes of planning my life, I'm going to act as if there are four years before AGI development progresses enough that I should substantially change what I'm doing with my time, and then there are three years after that before AI has transformed the world unrecognizably.
I'm going to treat this phase of my career with the urgency of a college freshman looking at their undergrad degree--every month is 2% of their degree, which is a nontrivial fraction, but they should also feel like they have a substantial amount of space to grow and explore.

The AI midgame

I want to split the AI timeline into the following categories.

The early game, during which interest in AI is not mainstream. I think this ended within the last year 😢
The midgame, during which interest in AI is mainstream but before AGI is imminent. During the midgame:
- The AI companies are building AIs that they don’t expect will be transformative.
- The alignment work we do is largely practice for alignment work later, rather than an attempt to build AIs that we can get useful cognitive labor from without them staging coups.
- For the purpose of planning my life, I’m going to imagine this as lasting four more years. This is shorter than my median estimate of how long this phase will actually last.
The endgame, during which AI companies conceive of themselves as actively building models that will imminently be transformative, and that pose existential takeover risk.
- During the endgame, I think that we shouldn’t count on having time to develop fundamentally new alignment insights or techniques (except maybe if AIs do most of the work? I don’t think we should count on this); we should be planning to mostly just execute on alignment techniques that involve ingredients that seem immediately applicable.
- For the purpose of planning my life, I’m going to imagine this as lasting three years. This is about as long as I expect this phase to actually take.

I think this division matters because several aspects of my current work seem like they’re optimized for midgame, and I should plausibly do something very differently in the endgame. Features of my current life that should plausibly change in the endgame:

I'm doing blue-sky alignment research into novel alignment techniques–during the endgame, it might be too late to do this.
I'm working at an independent alignment org and not interacting with labs that much. During the endgame, I probably either want to be working at a lab or doing something else that involves interacting with labs a lot. (I feel pretty uncertain about whether Redwood should dissolve during the AI endgame.)
I spend a lot of my time constructing alignment cases that I think are analogous to difficulties that we expect to face later. During the endgame, you probably have access to the strategy “observe/construct alignment cases that are obviously scary in the models you have”, which seems like it partially obsoletes this workflow.
I'm doing research that is practice rather than an actual attempt at aligning models or safely extracting cognitive labor from them. Some of the work I expect to want takeover-concerned people to do during the endgame is probably very practical/empirical. But I expect us to also want to do some difficult-to-empirically-ground work to answer questions like “How could this particular model be scarily misaligned? How might our alignment strategy have failed such that this particular model will try to kill us?”

One core question here is: How is my impact distributed between work I do in the midgame vs the endgame? (As in, how much of my career value do I expect to lose if I suddenly die at the end of the midgame?)

Midgame impact:
- The main mechanism here is that I think I (as part of Redwood) have a shot at developing alignment techniques (or other knowledge) that make a serious difference to the alignment plan, that can then be adopted by AI labs with no further actions from me/Redwood.
- Secondarily, I think we have a shot at developing an alignment research methodology and/or organizational structure that allows Redwood and maybe the broader alignment community to do much more good work during the midgame.
- Other midgame impacts I can have:
  - helping other people to gain skills such that they have a really big impact in the endgame.
  - setting up an organization which lasts into the endgame and is really productive at doing something crucial.
Endgame impact:
- I have been assuming that most of my impact was in the midgame, but after thinking about it more over the last few weeks I’m now less sold on that.
- Mechanisms by which Redwood-style backchained alignment researchers might have a lot of impact in the endgame:
  - Being an expert on some particular thing which is extremely useful when trying to align AIs or safely extract cognitive labor from them. When I’m thinking about this, I’m mostly trying to imagine the head of the alignment department at an AI lab being ecstatically delighted to be able to hire me, and saying “I am so glad we are able to work with Buck; it’s so incredibly convenient that he has been obsessively focused on this topic for the last four years and he’s already understood a bunch of the ins and outs; if it wasn’t for him we’d be yoloing stuff way more.”
    - This could be a class of techniques, e.g. interpretability or MAD.
    - This could be a certain aspect of the alignment situation, from a more technique-agnostic perspective. For example, it seems pretty plausible to me that when we’re using AI to make lots of autonomous decisions, we’ll have some limited budget to oversee these decisions such that we can afford to oversee like 0.1% of them, and we’d really like to be smart about which 0.1% to review so that we maximize the probability of catching bad actions. There are lots of different approaches you might use here; e.g. you probably want to try some kind of anomaly detection, and you probably also want to concentrate oversight on examples where the AI action was higher stakes. If I/some Redwood people obsessively focused on the version of this problem that we expect to happen with TAI, I can imagine us being incredible hires due to having so much relevant experience and cached thinking.
      - As another example, if I decided to just focus on anti-adversarial training, it’s easy to imagine being way more familiar with that problem than anyone at an AI lab.
  - Having a bunch of practice at thinking about AI alignment in principle, which might be really useful for answering difficult-to-empirically-resolve questions about the AIs being trained.
  - Being well-prepared to use AI cognitive labor to do something useful, by knowing a lot about some research topic that we end up wanting to put lots of AI labor into. Maybe you could call this “preparing to be a research lead for a research group made up of AIs”. Or “preparing to be good at consuming AI research labor”.
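
The oversight-budget example above (review only about 0.1% of AI decisions, chosen using anomaly detection and stakes) can be made concrete with a small sketch. Everything here is an assumption for illustration: the function name `select_for_review`, the per-action anomaly and stakes numbers, and the rule of ranking by their product are hypothetical, not a method the post proposes.

```python
import random


def select_for_review(actions, review_one_in=1000):
    """Pick which actions to send to human oversight under a fixed budget.

    `actions` is a list of (anomaly_score, stakes) pairs, both hypothetical
    non-negative numbers. `review_one_in=1000` corresponds to a 0.1% budget.
    Ranking by anomaly_score * stakes is an illustrative assumption.
    """
    n_review = max(1, len(actions) // review_one_in)
    priorities = [anomaly * stakes for anomaly, stakes in actions]
    # Sort action indices from highest to lowest priority.
    ranked = sorted(range(len(actions)), key=lambda i: priorities[i], reverse=True)
    return ranked[:n_review]


rng = random.Random(0)
# 10,000 synthetic actions with scores drawn uniformly from [0, 1).
actions = [(rng.random(), rng.random()) for _ in range(10_000)]
# Plant one action that is both very unusual and very high stakes.
actions[1234] = (5.0, 5.0)

reviewed = select_for_review(actions)
print(len(reviewed), 1234 in reviewed)
```

A real version of this problem would presumably need to handle an AI that tries to make bad actions look ordinary, which a static score like this does not address.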

Pacing: a freshman year

I think I want to treat my next year with the pacing of a freshman year in a US undergrad degree, for someone who wants to go into startups and thinks there’s some chance that they’ll want to graduate college early. I think that people going into their freshman year should be thinking a little bit about what they want to do after college. They should understand things that they need to do during college in order to be set up well for their post-college activities (e.g. they probably want to do some research as an undergrad, and they probably need to eventually learn various math). But meeting those requirements probably isn’t going to be where most of their attention goes.

Similarly, I think that I should be thinking a bit about my AI endgame plans, and make sure that I’m not failing to do fairly cheap things that will set me up for a much better position in the endgame. But I should mostly be focusing on succeeding during the midgame (at some combination of doing valuable research and at becoming an expert in topics that will be extremely valuable during the endgame).

When you’re a freshman, you probably shouldn’t feel like you’re sprinting all the time. You should probably believe that skilling up can pay off over the course of your degree. Every month is about 2% of your degree.
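
The 2% figure is just one month out of a 48-month degree. A minimal sketch of that arithmetic, where `month_share` is a hypothetical helper name and the year counts are the planning assumptions stated earlier in the post:

```python
def month_share(years):
    """Fraction of a planning window that a single month represents."""
    return 1 / (years * 12)


midgame_years = 4  # planning assumption from the post
endgame_years = 3  # planning assumption from the post

# One month as a share of the four-year midgame window.
print(f"{month_share(midgame_years):.1%}")
# One month as a share of the full seven-year window.
print(f"{month_share(midgame_years + endgame_years):.1%}")
```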

I think that this is how I want to feel. In a certain sense, four years is a really long time. I spent a reasonable amount of the last year feeling kind of exhausted and wrecked and rushed, and my guess is that this was net bad for my productivity. I think I should feel like there is real urgency, but also real amounts of space to learn and grow and play.

I went back and forth a lot on how I wanted to set up this metaphor; in particular, I was pretty tempted to suggest that I should think of this as a sophomore year rather than a freshman year. I think that freshmen should usually mostly ignore questions about career planning, whereas I think I should e.g. spend at least some time talking to labs about the possibility of them working with me/Redwood in various ways. I ended up choosing freshman rather than sophomore because I think that 3 years is less reasonable than 4.

And so, my plan is something like:

Put a bit of work into setting up my AI endgame plans.
- E.g. talk to some people who are at labs and make sure they don’t think that my vague aspirations here are insane. I’m interested in more suggestions along these lines.
- I think that if I feel more like I’ve deliberated once about this, I’ll find it easier to pursue my short-term plans wholeheartedly.
Mostly (like with 70% of my effort), push hard on succeeding at my midgame plans.
Spend about 20% of my effort on learning things that don’t have immediate benefits.
- For example, I’ve spent some time over the last few weeks learning about generative modeling, and I plan to continue studying this. I have a few motivations here:
  - Firstly, I think it’s pretty healthy for me to know more about how ML progress tends to happen, and I feel much more excited about this subfield of ML than most subfields of ML. I feel intuitively really impressed and admiring of the researchers in this field, and it seems healthy for me to have a research field with researchers who I look up to and who I wholeheartedly believe I can learn a lot from.
  - Secondly, I have a crazy take that the kind of reasoning that is done in generative modeling has a bunch of things in common with the kind of reasoning that is valuable when developing algorithms for AI alignment.

Ren Springlea @ 2023-04-16T23:28 (+28)

Thank you for this post. I work in animal advocacy rather than AI, but I've been thinking about some similar effects of transformative AI on animal advocacy.

I've been shocked by the progress of AI, so I've been thinking it might be necessary to update how we think about the world in animal advocacy. Specifically, I've been thinking roughly along the lines of "There's a decent chance that the world will be unrecognisable in ~15-20 years or whatever, so we should probably be less confident in our ability to reliably impact the future via policies, so interventions that require ~15-20 years to pay off (e.g. cage-free campaigns, many legislative campaigns) may end up having 0 impact." This is still a hypothesis, and I might make a separate forum post about it.

It struck me that this is very similar to some of the points you make in this post.

In your post, you've said you're planning to act as though there are 4 years of the "AI midgame" and 3 years of the "AI endgame". If I translated this into animal advocacy terms, this could be equivalent to something like "we have ~7 years to deliver (that is, realise) as much good as we can for animals". (The actual number of years isn't so important, this is just for illustration.)

Would you agree with this? Or would you have some different recommendation for animal advocacy people who share your views about AI having the potential to pop off pretty soon?

(Some context as to my background views: I think preventing suffering is more important than generating happiness; I think the moral value of animals is comparable to that of humans, e.g. within 0-2 orders of magnitude depending on species; I don't think creating lives is morally good; I think human extinction is bad because it could directly cause suffering and death, but not so much because of its effects on loss of potential humans who do not yet exist; I think S-risks are very very bad; I'm skeptical that humans will go extinct in the near future; I think society is very fragile and could be changed unrecognisably very easily; I'm concerned more about misuse of AI than any deliberate actions/goals of an AI itself; I have a great deal of experience in animal advocacy and zero experience in anything AI-related. The person reading this certainly doesn't need to agree with any of these views, but I wanted to highlight my background views so that it's clear why I believe both "AI might pop off really soon" and "I still think helping animals is the best thing I can do", even if that latter belief isn't common among the AI community.)

Aaron_Scher @ 2023-04-20T02:06 (+5)

I'm not Buck, but I can venture some thoughts as somebody who thinks it's reasonably likely we don't have much time.

Given that "I'm skeptical that humans will go extinct in the near future" and that you prioritize preventing suffering over creating happiness, it seems reasonable for you to condition your plan on humanity surviving the creation of AGI. You might then back-chain from possible futures you want to steer toward or away from. For instance, if AGI enables space colonization, it sure would be terrible if we just had planets covered in factory farms. What is the path by which we would get there, and how can you change it so that we have, e.g., cultured meat production planets instead? I think this is probably pretty hard to do; the term "singularity" has been used partially to describe that we cannot predict what would happen after it. That said, the stakes are pretty astronomical such that I think it would be pretty reasonable for >20% of animal advocacy effort to be specifically aimed at preventing AGI-enabled futures with mass animal suffering. This is almost the opposite of "we have ~7 years to deliver (that is, realise) as much good as we can for animals." Instead it might be better to have an attitude like "what happens after 7 years is going to be a huge deal in some direction, let's shape it to prevent animal suffering."

I don't know what kind of actions would be recommended by this thinking. To venture a guess: trying to accelerate meat alternatives, doing lots of polling around public opinions on moral questions around eating meat (with the goal of hopefully finding that humans think factory farming is wrong so a friendly AI system might adopt such a goal as well; human behavior in this regard seems like a particularly bad basis on which to train AIs). I'm pretty uncertain about these two ideas and I wouldn't be surprised if they're actually quite bad.

Ren Springlea @ 2023-04-21T05:56 (+1)

Thank you, I appreciate you taking the time to construct this convincing and high-quality comment. I'll reflect on this in detail.

I did do some initial scoping work for longtermist animal stuff last year, of which AGI-enabled mass suffering was a major part of course, so might be time to dust that off.

Greg_Colbourn @ 2023-04-16T09:04 (+12)

How confident are you that alignment will be solved in time? What level of x-risk do you think is acceptable for the endgame to be happening in? How far do you think your contribution could go toward reducing x-risk to such an acceptable level? These are crucial considerations. I think that we are pretty close to the end game already (maybe 2 years), and that there's very little chance for alignment to be solved / x-risk reduced to acceptable levels in time. Have you considered that the best strategy now is a global moratorium on AGI (hard as that may be)? I think having more alignment researchers openly advocating for this would be great. We need more time.

Sharmake @ 2023-04-16T22:42 (+8)

I think that we are pretty close to the end game already (maybe 2 years), and that there's very little chance for alignment to be solved / x-risk reduced to acceptable levels in time.

I believe with 95-99.9% probability that this is purely hype, and that we will not in fact see AI that radically transforms the world or that is essentially able to do every task that automates physical infrastructure in 2 years.

Given this, I'd probably disagree with this:

Have you considered that the best strategy now is a global moratorium on AGI (hard as that may be)? I think having more alignment researchers openly advocating for this would be great. We need more time.

Or at least state it less strongly.

I see a massive claim here without much evidence shown, and I'd like to see why you believe that we are so close to the endgame of AI.

Steven Byrnes @ 2023-04-18T18:59 (+16)

You’re entitled to disagree with short-timelines people (and I do too) but I don’t like the use of the word “hype” here (and “purely hype” is even worse); it seems inaccurate, and kinda an accusation of bad faith. “Hype” typically means Person X is promoting a product, that they benefit from the success of that product, and that they are probably exaggerating the impressiveness of that product in bad faith (or at least, with a self-serving bias). None of those applies to Greg here, AFAICT. Instead, you can just say “he’s wrong” etc.

Rohin Shah @ 2023-04-19T06:58 (+25)

“Hype” typically means Person X is promoting a product, that they benefit from the success of that product, and that they are probably exaggerating the impressiveness of that product in bad faith (or at least, with a self-serving bias).

All of this seems to apply to AI-risk-worriers?

AI-risk-worriers are promoting a narrative that powerful AI will come soon
AI-risk-worriers are taken more seriously, have more job opportunities, get more status, get more of their policy proposals, etc, to the extent that this narrative is successful
My experience is that AI products are less impressive than the impression I would get from listening to AI-risk-worriers, and self-serving bias seems like an obvious explanation for this.

I generally agree that as a discourse norm you don't want to go around accusing people of bad faith, but as a matter of truth-seeking my best guess is that a substantial fraction of short-timelines amongst AI-risk-worriers is in fact "hype", as you've defined it.

Steven Byrnes @ 2023-04-19T15:25 (+5)

Hmm. Touché. I guess another thing on my mind is the mood of the hype-conveyer. My stereotypical mental image of “hype” involves Person X being positive & excited about the product they’re hyping, whereas the imminent-doom-ers that I’ve talked to seem to have a variety of moods including distraught, pissed, etc. (Maybe some are secretly excited too? I dunno; I’m not very involved in that community.)

Greg_Colbourn @ 2023-04-19T20:16 (+3)

FWIW I am not seeking job opportunities or policy proposals that favour me financially. Rather - policy proposals that keep me, my family, and everyone else alive. My self-interest here is merely in staying alive (and wanting the rest of the planet to stay alive too). I'd rather this wasn't an issue and just enjoy my retirement. I want to spend money on this (pay for people to work on Pause / global AGI moratorium / Shut It Down campaigns). Status is a trickier thing to untangle. I'd be lying, as a human, if I said I didn't care about it. But I'm not exactly getting much here by being an "AI-risk-worrier". And I could probably get more doing something else. No one is likely to thank me if a disaster doesn't happen.

Re AI products being less impressive than the impression you get from AI-risk-worriers, what do you make of Connor Leahy's take that LLMs are basically "general cognition engines" and will scale to full AGI in a generation or two (and with the addition of various plugins etc to aid "System 2" type thinking, which are freely being offered by the AutoGPT crowd)?

Rohin Shah @ 2023-04-20T07:50 (+14)

First off, let me say that I'm not accusing you specifically of "hype", except inasmuch as I'm saying that for any AI-risk-worrier who has ever argued for shorter timelines (a class which includes me), if you know nothing else about that person, there's a decent chance their claims are partly "hype". Let me also say that I don't believe you are deliberately benefiting yourself at others' expense.

That being said, accusations of "hype" usually mean an expectation that the claims are overstated due to bias. I don't really see why it matters if the bias is survival motivated vs finance motivated vs status motivated. The point is that there is bias and so as an observer you should discount the claims somewhat (which is exactly how it was used in the original comment).

what do you make of Connor Leahy's take that LLMs are basically "general cognition engines" and will scale to full AGI in a generation or two (and with the addition of various plugins etc to aid "System 2" type thinking, which are freely being offered by the AutoGPT crowd)?

Could happen, probably won't, though it depends what is meant by "a generation or two", and what is meant by "full AGI" (I'm thinking of a bar like transformative AI).

(I haven't listened to the podcast but have thought about this idea before. I do agree it's good to think of LLMs as general cognition engines, and that plugins / other similar approaches will be a big deal.)

Sharmake @ 2023-04-19T13:34 (+2)

This. I generally also agree with your 3 observations, and the reason I was focusing on truth seeking is that my epistemic environment tends to reward worrying AI claims more than it probably should due to negativity bias, as well as looking at AI Twitter hype.

Greg_Colbourn @ 2023-04-17T11:17 (+10)

Also, reversing 95-99.9%, are you ok with a 0.1-5% x-risk?

Sharmake @ 2023-04-17T12:55 (+1)

No, at the higher end of probability. Things still need to be done. It does mean we should stop freaking out every time a new AI capability is released, and importantly it means we probably don't need to go to extreme actions, at least right away.

Greg_Colbourn @ 2023-04-17T11:15 (+5)

The "Sparks" paper; ChatGPT plugins, AutoGPTs and other scaffolding to make LLMs more agent-like. Given these, I think there's way too much risk for comfort of GPT-5 being able to make GPT-6 (with a little human direction that would be freely given), leading to a foom. Re physical infrastructure, to see how this isn't a barrier, consider that a superintelligence could easily manipulate humans into doing things as a first (easy) step. And such an architecture, especially given the current progress on AI Alignment, would by default be unaligned and lethal to the planet.

Sharmake @ 2023-04-17T12:52 (+1)

The reason I'm so confident in this right now is that I assess a significant probability that the AI developments starting with GPT-4 are a hype cycle, and in particular I am probably >50% confident in the idea that most of the flashiest stuff on AI will prove to be overhyped.

In particular, I am skeptical of the general hype on AI right now, and I think that a lot of capabilities tests essentially test it on paper tests, not real-world tasks, which are much less Goodhartable than paper tests.

Now I'd agree with you conditioning on the end game being 2 years or less that surprisingly extreme actions would have to be taken, but even then I assess the current techniques for alignment as quite a bit better than you imply.

I also am 40% confident in a model where the capabilities for human-level AI in almost every human domain arrive in the 2030s.

Given this, I think you're being overly alarmed right now, and that we probably have at least some chance of a breakthrough in AI alignment/safety comparable to what happened to the climate warming problem.

Also, the evals ARC is doing are essentially a best-case scenario for the AI, as it has access to the weights, and most importantly no human resistance is modeled, like say blocking the ability of the AI to get GPUs, and it assumes that the AI can set up arbitrarily scalable ways of gaining power. We should expect the true capabilities of an AI attempting to gain control to be reliably less than what ARC evals show.

Greg_Colbourn @ 2023-04-17T13:49 (+4)

What kind of a breakthrough are you envisaging? How do we get from here to 100% watertight alignment of an arbitrarily capable AGI? Climate change is very different in that the totality of all emissions reductions / clean tech development can all add up to solving the problem. AI Alignment is much more all or nothing. For the analogy to hold it would be like emissions rising on a Moore's Law (or faster) trajectory, and the threshold for runaway climate change reducing each year (cf algorithm improvements / hardware overhang), to a point where even a single start-up company's emissions (OpenAI; X.AI) could cause the end of the world.

Re ARC Evals, on the flip side, they aren't factoring in humans doing things that make things worse - ChatGPT plugins, AutoGPT, BabyAGI, ChaosGPT etc all showing that this is highly likely to happen!

We may never get a Fire Alarm of sufficient intensity to jolt everyone into high gear. But I think GPT-4 is it for me and many others. I think this is a Risk Aware Moment (Ram).

Sharmake @ 2023-04-18T01:20 (+1)

What kind of a breakthrough are you envisaging? How do we get from here to 100% watertight alignment of an arbitrarily capable AGI?

Scalable alignment is the biggest way to align a smarter intelligence.

Now, Pretraining from Human Feedback showed that at least for one of the subproblems of alignment, outer alignment, we managed to make the AI more aligned as it gets more data.

If this generalizes, it's huge news, as it implies we can at least align an AI's goals with human goals as we get more data. This matters because it means that scalable alignment isn't as doomed as we thought.

Re ARC Evals, on the flip side, they aren't factoring in humans doing things that make things worse - ChatGPT plugins, AutoGPT, BabyAGI, ChaosGPT etc all showing that this is highly likely to happen!

The point is that ARC Evals will be an upper bound mostly, not a lower bound, given that it generally makes very optimistic assumptions for the AI under testing. Maybe they're right, but the key here is that they are closer to a maximum for an AI's capabilities than a minimum for analysis, which means that the most likely bet is on reduced impact, but there's a possibility that the ARC Evals are close to what happens in real life.

It's possible for an early takeoff of AI to happen in 2 years, I just don't consider that possibility very likely right now.

Greg_Colbourn @ 2023-04-18T07:41 (+5)

Generalizing is one thing, but how can scalable alignment ever be watertight? Have you seen all the GPT-4 jailbreaks!? How can every single one be patched using this paradigm? There needs to be an ever decreasing number of possible failure modes, as power level increases, to the limit of 0 failure modes for a superintelligent AI. I don't see how scalable alignment can possibly work that well.

OpenAI says in their GPT-4 release announcement that "GPT-4 responds to sensitive requests (e.g., medical advice and self-harm) in accordance with our policies 29% more often." A 29% reduction of harm. This is the opposite of reassuring when thinking about x-risk.

(And all this is not even addressing inner alignment!)

Lizka @ 2023-04-17T14:01 (+5)

I think many people on the Forum have been thinking about related questions (I have been, at least), and I'm excited to see this post here. I'm curating it.

I also really appreciated Buck's talk from EA Global:

Buck @ 2023-04-19T15:02 (+2)

Thanks Lizka. I think you mean to link to this video:

pseudonym @ 2023-04-19T15:53 (+14)

fwiw the two videos linked look identical to me (EAG Bay Area 2023, "The current alignment plan, and how we might improve it")

Benjamin_Todd @ 2023-04-29T03:57 (+3)

I really liked this post. I find the undergrad degree metaphor useful in some ways (focus on succeeding in your studies over 4 years, but give a bit of thought to how it sets you up for the next stage), but since the end game is only 3 years (rather than a normal 40 year career), overall it seems like your pacing and attitude could end up pretty different.

Maybe the analogy could be an undergrad where their only goal is to get the "best" graduate degree possible. Then high school = early game, undergrad = midgame, graduate degree = end game. Maybe you could think of "best" as "produce the most novel & useful result in their thesis" or "get the highest possible score".

Another analogy could be an 18 year old undergrad who knows they need to retire at age 25, but I expect that throws the analogy off a lot, since if impact is the goal, I wouldn't spend 4 years in college in that scenario.

trevor1 @ 2023-04-15T15:06 (+1)

I tried really hard to think of a better way of looking at things than the "freshman year" method you described here, and so far I've failed to think of anything better.

No matter how you look at it (long timelines and short), the critical window period resembles an undergrad degree better than anything else; there's a standard of one year for orienting, and more time afterwards to do exploration in order to really make sure you find a place you can shine (though three years is too late). But anyone is much much better off if they can successfully finish up the exploration phase ASAP.