Don't Over-Optimize Things

By Owen Cotton-Barratt @ 2022-06-16T16:28 (+53)

or Optimizing Optimization

The definition of optimize is:

to make something as good as possible

It's hard to argue with that. It's no coincidence that a lot of us have something of an optimization mindset.

But sometimes trying to optimize can lead to worse outcomes (because we don't fully understand what to aim for). It's worth understanding how this happens. We can try to avoid it by a combination of thinking more about what to aim for, and (sometimes) simply optimizing less hard.

Reflection on purpose vs optimizing for that purpose

What does the activity of making something as good as possible look like in practice? I think often there are two stages:

  1. Reflection on the purpose — thinking about what the point of the thing at hand is, and identifying what counts as "good" in context
  2. Optimizing for that purpose — identifying the option(s) which do best at the identified purpose

Both of these stages are important parts of optimization in the general sense. But I think it's optimization-for-a-given-purpose that feels like optimization. When I say "over-optimization" I mean doing too much optimization for a given purpose.

What goes wrong when you over-optimize

Consider this exaggerated example:

Alice is a busy executive. She needs to get from one important meeting to another in a nearby city; she's definitely going to be late to the second meeting. She asks her assistant Bob to sort things out. "What should I be optimizing for?", Bob asks. "Just get me there as fast as possible", Alice replies, imagining that Bob will work out whether a taxi or train is faster.

Bob is on this. Eager to prove himself an excellent assistant, he first looks into a taxi (about 90 minutes) and a train (about 60 minutes plus 10 minutes travel at each end — but there's a 20 minute wait for the right train). So the taxi looks better.

But wait. Surely he can do better than 90 minutes? OK, so the journey is too short for a private jet to make sense, but what about a helicopter? Yep, 15 minutes to get to a helipad, plus 45 minutes flight time, and it can land on the hotel roof! Even adding in 5 minutes for embarking/disembarking, this is 25 minutes faster.

Or ... was he assuming that the drivers were sticking to the speed limit? Yeah, if he makes the right phone calls he can find someone who can drive door to door in 60 minutes.

Can he get the helicopter to be faster than that? Yeah, the driver can speed to the helipad, and bring it down to 57 minutes. Or what if he doesn't have it take off from a helipad? He just needs to find the closest possible bit of land and pay the owners to allow it to land there (or pay security people to temporarily clear the land even if they don't have permission to land). Surely that will come in under 55 minutes. Actually, if he's not concerned about proper airfields, he can revisit the option of a private jet ... just clear the street outside and use that as a runway, then have a skydiving instructor jump with Alice to land on the roof of the hotel ...

What's going wrong here? It isn't just that Bob is wasting time doing too much optimization, but that his solutions are getting worse as he does more optimization. This is because he has an imperfect understanding of the purpose. Goodhart's law is biting, hard.
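To make the failure concrete, here is a minimal sketch of Bob's search in Python. The travel times follow the story where it states them; the jet time, every cost, every risk score, and the `assumed_true_score` formula are hypothetical numbers invented for illustration, not anything Alice or the post specifies. The sketch widens the search one option at a time, always picks the fastest option found so far, and then scores that pick by the assumed "true" objective.

```python
# Times (minutes) follow the story where it gives them; the jet time, all costs,
# and all risk scores are hypothetical numbers chosen for illustration.
OPTIONS = [
    # (name, minutes, cost in dollars, risk on a 0-1 scale), in the order Bob finds them
    ("train", 100, 50, 0.0),
    ("taxi", 90, 200, 0.0),
    ("helicopter", 65, 3_000, 0.1),
    ("speeding driver", 60, 500, 0.5),
    ("street-runway jet", 40, 100_000, 1.0),
]


def proxy_score(option):
    """What Alice literally asked for: faster is better."""
    _, minutes, _, _ = option
    return -minutes


def assumed_true_score(option):
    """A made-up stand-in for what Alice actually wants: time, money and risk all count."""
    _, minutes, cost, risk = option
    return -(minutes + cost / 20 + 500 * risk)


def best(options, score):
    """Return the option with the highest score."""
    return max(options, key=score)


# Widen the search one option at a time, always picking the proxy-best so far.
history = []
for n in range(1, len(OPTIONS) + 1):
    pick = best(OPTIONS[:n], proxy_score)
    history.append((pick[0], assumed_true_score(pick)))

for name, true_score in history:
    print(f"{name:18s} assumed true score: {true_score:9.1f}")
```

Under these assumed numbers, the pick improves once (train to taxi) and then gets steadily worse on the assumed true score with every extra round of search, even though it keeps getting faster. Different made-up weights would move the turning point, but the shape is the point of the example.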

It's also the case that Bob has a bunch of other implicit knowledge baked into how he starts to search for options. He first thinks of taking a taxi or the train. These are unusually good options overall among possible ways to get from one city to the other; they're salient to him because they're common, and they're common because they're often good choices. Too much optimization is liable to throw out the value of this implicit knowledge.

So there are two ways Bob could do a better job:

  1. He could reflect more on the purpose of what he's doing (perhaps consulting Alice to understand that budget starts to matter when it's getting into the thousands of dollars, and that she really doesn't want to do things that bring legal or physical risk)
  2. He could do something other than pure optimization; like "find the first pretty good option and stop searching"[1], or "find a set of pretty good options and then pick the one that he gut-level feels best about"[2]
    • It's not obvious which of these will produce better outcomes; it depends how much of his implicit knowledge is known to his gut vs encoded in his search process
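The alternatives to pure optimization in #2 can be sketched in a few lines of Python. This is one possible reading, not a definitive formalization: `satisfice` stands in for "find the first pretty good option and stop searching", and `softmax_choice` is my interpretation of the "argmax -> softmax" slogan from footnote 1. The candidate scores, the threshold, and the temperature are all hypothetical. The "pick by gut feeling" variant is not modeled, since it depends on knowledge that is by assumption not written down.

```python
import math
import random

# Hypothetical (name, proxy score) pairs, listed in the order they come to mind.
CANDIDATES = [("train", 6.0), ("taxi", 7.0), ("helicopter", 7.5), ("jet", 9.0)]


def argmax_choice(candidates):
    """Pure optimization: take the single highest-scoring option."""
    return max(candidates, key=lambda c: c[1])[0]


def satisfice(candidates, threshold):
    """Stop at the first option that is 'pretty good' (score >= threshold)."""
    for name, score in candidates:
        if score >= threshold:
            return name
    return None  # nothing was good enough


def softmax_probs(candidates, temperature):
    """Probability of picking each option; higher temperature means softer preference."""
    scores = [score / temperature for _, score in candidates]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]


def softmax_choice(candidates, temperature, rng=random):
    """Sample one option, favouring high scores without committing to the top one."""
    names = [name for name, _ in candidates]
    return rng.choices(names, weights=softmax_probs(candidates, temperature))[0]


print(argmax_choice(CANDIDATES))           # always the top proxy score
print(satisfice(CANDIDATES, 6.5))          # first option that clears the bar
print(softmax_choice(CANDIDATES, 1.0))     # random, weighted towards high scores
```

With a very low temperature the softmax behaves like argmax, and with a very high one it approaches a uniform pick, so the temperature acts as a dial for how hard to optimize. Note that `satisfice` leans on the order in which options come to mind, which is where Bob's implicit knowledge lives.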

I'm generally a big fan of #1. Of course it's possible to go overboard, but I think it's often worth spending 3-30% of the time you'll spend on an activity reflecting on the purpose.[3] And it doesn't have much downside beyond the time cost.

Of course you'd like to sequence things such that you do the reflection on the purpose first ("premature optimization is the root of all evil"), but even then we're usually acting based on an imperfect understanding of the purpose, which means that more optimization for the purpose doesn't necessarily lead to better things. So some combination of #1 and #2 will often be best.

When is lots of optimization for a purpose good?

Optimization for a purpose is particularly good when:

See also Perils of optimizing in social contexts for an important special case where it's worth being wary about optimizing.

  1. ^

    I owe this general point, which was the inspiration for the post, to Jan Kulveit, who expressed it concisely as "argmax -> softmax".

  2. ^

    This takes advantage of the fact that his gut is often implicitly tracking things, without needing to do the full work of reflecting on the purpose to make them explicit.

  3. ^

    As a toy example, suppose that every doubling of the time you spend reflecting on the purpose helps you do things 10% better; then you should invest about 12% of your time reflecting on purpose [source: scribbled calculation]. 
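    One way to reproduce that figure, as a sketch rather than the original scribbled calculation: assume you spend a fraction x of your time reflecting and 1 - x doing, and that output is (1 - x) times a quality multiplier that grows by 10% per doubling of x, i.e. proportional to x**a with a = log2(1.1). Maximizing (1 - x) * x**a gives x = a / (1 + a). The model and the names below are assumptions of this sketch.

```python
import math

# Assumed model: quality multiplier grows 10% per doubling of reflection time,
# so it is proportional to x ** a with a = log2(1.1).
a = math.log2(1.1)


def output(x):
    """Total output when a fraction x of time goes to reflection and 1 - x to doing."""
    return (1 - x) * x ** a


# Closed form from setting the derivative of log(output) to zero: a / x = 1 / (1 - x).
closed_form = a / (1 + a)

# Brute-force check on a fine grid of fractions strictly between 0 and 1.
grid = [i / 10_000 for i in range(1, 10_000)]
numeric = max(grid, key=output)

print(f"closed form: {closed_form:.3f}, grid search: {numeric:.3f}")
```

    Both approaches land at roughly 0.12, matching the "about 12%" above under this particular model.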

    Activities will vary a lot on how much you actually get benefits from reflecting on the purpose, but I don't think it's that unusual to see significant returns, particularly if the situation is complicated (& e.g. involving other people very often makes things complicated).


David Johnston @ 2022-06-17T01:33 (+9)

Lately, I tend to think of this as a distinction between the "proxy optimization" algorithm and the "optimality" of the actual plan. The algorithm: specify a proxy reward and a proxy set of plans, and search for the best one. You could call this "proxy optimization".

The results: whatever actually happens, and how good it actually is. There's not really a verb associated with this - you can't just make something as good as it can possibly be (not even "in expectation" - you can only optimize proxies in expectation!). But it still seems like there's a loose sense in which you can be aiming for optimality.

Off the top of my head, there are a few ways proxy optimization can hurt, and most of them seem to come down to "better optimizing a worse proxy". You could deliberately alter the problem so that it is tractable for proxy optimization, or you could just invest too much in proxy optimization vs trying to construct a good proxy. This seems to roughly agree with your advice: investing lots in proxy optimization is particularly beneficial when the proxy is already pretty good, or when it will reveal very large differences in prospective plans (which are unlikely to be erased by considering a better proxy). I actually feel that some caution might be needed in the setting where there are apparently many orders of magnitude between the value of different plans (according to a proxy) - something like, if the system is apparently so sensitive to the things you are taking into account, then there's reason to believe it might also be quite sensitive to the things you're not taking into account.

Emrik @ 2022-06-24T09:09 (+3)

If you think of thinking as generating a bunch of a priori datapoints (your thoughts) and trying to find a model that fits those data, we can use this to classify some overthinking failure-modes. These classes may overlap somewhat.

  1. You overfit your model to the datapoints because you underestimate their variance (regularization failure).
  2. Your datapoints may not compare very well to the real-world thing you're trying to optimize, so by underestimating their bias you may make your model less generalizable out of training distribution (distribution mismatch).
  3. If you over-update on each new datapoint because you underestimate the breadth of the landscape (your a priori datapoints about a thing may be a very limited distribution compared to the thing-in-itself), you may prematurely descend into a local optimum (greediness failure).

Joseph Lemien @ 2022-06-17T01:47 (+3)

I'm afraid that I don't remember the specific name nor the specific formula (and a cursory Google search hasn't been able to jog my memory), but there is also the concept within operations management of not optimizing a system too much, because that decreases effectiveness. If my memory serves, you can roughly think of it that if you are too highly optimized, your system is rigid/fragile and lacks the slack/flexibility to deal with unexpected but inevitable shocks.

Owen Cotton-Barratt @ 2022-06-17T09:46 (+2)

I posted this to LessWrong as well, and one of the commenters there mentions the "performance / robustness stability tradeoff in controls theory". Is that the same as what you're thinking of?

Gavin @ 2022-06-19T17:48 (+3)

Reminds me of the result in queueing theory, where (in the simplest queue model) going above ~80% utilisation of your capacity leads to massive increases in waiting time.
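For a rough feel of that result, here is a small Python sketch. It assumes "the simplest queue model" means the M/M/1 queue (one server, random arrivals, exponentially distributed service times), which the comment does not state explicitly. In that model the mean time spent waiting in the queue, measured in units of the mean service time, is utilisation / (1 - utilisation). The function name is made up for this example.

```python
def mean_queue_wait(utilisation, mean_service_time=1.0):
    """Mean wait before service starts in an M/M/1 queue (assumed model)."""
    if not 0 <= utilisation < 1:
        raise ValueError("utilisation must be in [0, 1) for a stable queue")
    return utilisation / (1 - utilisation) * mean_service_time


for rho in (0.5, 0.8, 0.9, 0.95, 0.99):
    print(f"utilisation {rho:.2f}: mean wait = {mean_queue_wait(rho):6.1f} service times")
```

Under this assumption the wait is 1 service time at 50% utilisation, 4 at 80%, about 9 at 90% and about 99 at 99%, so the curve is fairly flat until around 80% and then climbs steeply.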

Joseph Lemien @ 2022-06-17T01:40 (+3)

I'm glad you included the Tony Hoare/Donald Knuth quote about premature optimization. As soon as I saw the title of this post I was hoping there would be at least some reference to that.

brb243 @ 2022-06-17T21:06 (+1)

This is funny, but I do not think that Goodhart's law is biting. Sometimes, measures can be optimal. Consider that all of Bob's 'suboptimal' development leads him to step back and see a better solution: the car driver is going to drive safely and Alice joins the introductions via videoconference. So, even being frantic about a metric which is possibly not the ultimate target (the meeting going well?) can lead to a solution that meets this target while also optimizing for the 'proxy' metric.

There is no need to explicitly reflect on purpose for this likely development toward an optimal solution. Actually, are you frantic about optimization for suboptimal optimization? I think so because perceived imposition may create inefficiencies. (Also take this as expressive writing.)