My notes on: Searching for outliers
By Vasco Grilo🔸 @ 2022-06-03T16:19 (+9)
In the table below, I summarise the properties of light-tailed and heavy-tailed distributions as described in Ben Kuhn's article Searching for outliers. Any errors or misinterpretations are my own. The following sections contain passages transcribed from Ben Kuhn's article for each of the properties.
Property | Light-tailed | Heavy-tailed |
---|---|---|
Ratio between the top percentiles and the median | Low | High |
Generation mechanism | Additive | Multiplicative |
Importance of outliers | Low | High |
Number of samples to get a really good outcome | Low | High |
What to filter for | "Probably good" | "Maybe amazing" |
Ratio between the top percentiles and the median
As a rule of thumb, a heavy-tailed distribution is one where the top few percent of outcomes are a large multiple of the typical or median outcome:
- Income is heavy-tailed: the median person globally lives on $2,500 a year, while the top 1% live on $45,000, almost 20× more.
- Height is light-tailed: the tallest people are only a few feet taller than average.
- If height followed the same distribution as income, Elon Musk, who made $121b in 2021, would be about 85,000 km tall, or about ¼ of the distance from the Earth to the Moon.
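The arithmetic behind these comparisons can be checked with a short sketch. The income figures come from the text above; the median height of 1.7 m and the Earth-Moon distance of 384,400 km are my own rough assumptions, so the result (about 82,000 km) only approximately matches the 85,000 km quoted above.

```python
# Back-of-the-envelope check of the income and height comparisons.
MEDIAN_INCOME_USD = 2_500          # from the text
TOP_1_PERCENT_INCOME_USD = 45_000  # from the text
MUSK_INCOME_2021_USD = 121e9       # from the text
MEDIAN_HEIGHT_M = 1.7              # assumption: rough adult height
EARTH_MOON_DISTANCE_KM = 384_400   # assumption: average Earth-Moon distance

# How many times the median income each group earns.
top_1_ratio = TOP_1_PERCENT_INCOME_USD / MEDIAN_INCOME_USD
musk_ratio = MUSK_INCOME_2021_USD / MEDIAN_INCOME_USD

# Scale the median height by the same multiple and convert metres to km.
musk_height_km = musk_ratio * MEDIAN_HEIGHT_M / 1_000
fraction_to_moon = musk_height_km / EARTH_MOON_DISTANCE_KM

print(f"Top 1% vs median income: {top_1_ratio:.0f}x")
print(f"Musk vs median income: {musk_ratio:,.0f}x")
print(f"Equivalent height: {musk_height_km:,.0f} km")
print(f"Fraction of the Earth-Moon distance: {fraction_to_moon:.2f}")
```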
Generation mechanism
Light-tailed distributions most often occur because the outcome is the sum of many independent contributions, while heavy-tailed distributions often arise from processes that are multiplicative or self-reinforcing:
- For example, the richer you are, the easier it is to earn more money.
- The more Twitter followers you have, the more retweets you’ll get, and the more you’ll be exposed to new potential followers.
- The cost-effectiveness of a global health intervention comes from multiplying many different variables, each of which is itself the product of several other factors:
  - How bad the disease you’re fighting is.
  - How much of an impact the intervention has on the disease.
  - How costly doing the intervention for one person is.
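To illustrate the difference between the two mechanisms, here is a minimal simulation sketch. It is my own toy model, not something from Ben Kuhn's article: each outcome combines 30 independent factors drawn uniformly between 0.5 and 1.5, either by adding them or by multiplying them. The function names and all parameters are hypothetical choices for illustration.

```python
import math
import random
import statistics

def simulate(n_outcomes, n_factors, combine, rng):
    """Draw n_outcomes, each combining n_factors independent random factors."""
    outcomes = []
    for _ in range(n_outcomes):
        # Assumption: every factor is uniform between 0.5 and 1.5.
        factors = [rng.uniform(0.5, 1.5) for _ in range(n_factors)]
        outcomes.append(combine(factors))
    return outcomes

def top_to_median_ratio(outcomes, top_fraction=0.01):
    """Ratio between the (1 - top_fraction) quantile and the median."""
    ordered = sorted(outcomes)
    index = int(len(ordered) * (1 - top_fraction))
    return ordered[index] / statistics.median(ordered)

rng = random.Random(0)  # fixed seed so the run is repeatable
additive = simulate(20_000, 30, sum, rng)
multiplicative = simulate(20_000, 30, math.prod, rng)

additive_ratio = top_to_median_ratio(additive)
multiplicative_ratio = top_to_median_ratio(multiplicative)
print(f"Additive: top 1% is {additive_ratio:.2f}x the median")
print(f"Multiplicative: top 1% is {multiplicative_ratio:.2f}x the median")
```

Under these assumptions, the additive outcomes stay close to their median, while the multiplicative ones have a top percentile that is a large multiple of the median, which is the first property in the table.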
Importance of outliers
Notably, in a light-tailed distribution, outliers don’t matter much:
- The 1% of tallest people are still close enough to the average person that you can safely ignore them most of the time.
- By contrast, in a heavy-tailed distribution, outliers matter a lot: even though 90% of people live on less than $15,000 a year, there are large groups of people making 1,000 times more.
- Because of this, heavy-tailed distributions are much less intuitive to understand or predict.
Number of samples to get a really good outcome
The most important thing to remember when sampling from heavy-tailed distributions is that getting lots of samples improves outcomes a ton:
- In a light-tailed context (say, picking fruit at the grocery store), it’s fine to look at two or three apples and pick the best-looking one:
  - It would be completely unreasonable to, for example, look through the entire bin of apples for that one apple that’s just a bit better than anything you’ve seen so far.
- In a heavy-tailed context, the reverse is true:
  - It would be similarly unreasonable to, say, pick your romantic partner by taking your favorite of the first two or three single people you run into.
  - Every additional sample you draw increases the chance that you get an outlier.
  - So one of the best ways to improve your outcome is to draw as many samples as possible.
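The value of extra samples can be sketched by comparing the best of 3 draws with the best of 100 draws. This is my own illustration under assumed distributions: a normal distribution (mean 100, standard deviation 10) stands in for the light-tailed case and a log-normal distribution (median 100, sigma 1.5) for the heavy-tailed one. The function names and parameters are hypothetical.

```python
import math
import random
import statistics

def average_best_of(n_samples, draw, rng, n_trials=2_000):
    """Average, over many repeated searches, of the best of n_samples draws."""
    return statistics.mean(
        max(draw(rng) for _ in range(n_samples)) for _ in range(n_trials)
    )

def light_tailed(rng):
    # Assumption: normally distributed quality, mean 100, standard deviation 10.
    return rng.gauss(100, 10)

def heavy_tailed(rng):
    # Assumption: log-normal quality with median 100 and a wide spread.
    return rng.lognormvariate(math.log(100), 1.5)

rng = random.Random(0)  # fixed seed so the run is repeatable

# How much better the best of 100 samples is than the best of 3 samples.
light_gain = (average_best_of(100, light_tailed, rng)
              / average_best_of(3, light_tailed, rng))
heavy_gain = (average_best_of(100, heavy_tailed, rng)
              / average_best_of(3, heavy_tailed, rng))

print(f"Light-tailed: best of 100 is {light_gain:.2f}x the best of 3")
print(f"Heavy-tailed: best of 100 is {heavy_gain:.2f}x the best of 3")
```

With these assumed parameters, searching through 100 samples instead of 3 barely helps in the light-tailed case, but improves the heavy-tailed result several times over.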
What to filter for
Another consequence of the numbers game is that the strategy that you use to filter your samples is very important:
- It’s very important for your filters to be as tightly correlated with what you actually care about as possible, so that you don’t rule candidates out for bad reasons.
- A subtlety here is that the traits that make a candidate a potential outlier are often very different from the traits that would make them “pretty good”.
- So improving your filtering process to produce more “pretty good” candidates won’t necessarily increase the rate of finding outliers, and might even decrease it.
- Because of this, it’s important to filter for “maybe amazing”, not “probably good”.
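A toy calculation can make the difference between the two filters concrete. The numbers below are entirely hypothetical: I assume each filter yields candidates whose quality score (think of it as the logarithm of the outcome) is normally distributed, with the "probably good" filter giving a higher average and a narrow spread, and the "maybe amazing" filter giving a lower average and a wide spread.

```python
from statistics import NormalDist

# Hypothetical pools of candidates passed by each filter (scores on a log scale).
probably_good = NormalDist(mu=1.0, sigma=0.5)   # reliable, rarely exceptional
maybe_amazing = NormalDist(mu=0.5, sigma=1.5)   # hit-or-miss, sometimes exceptional

PRETTY_GOOD = 0.5  # assumed score threshold for a "pretty good" candidate
OUTLIER = 4.0      # assumed score threshold for an outlier

def chance_above(pool, threshold):
    """Probability that a candidate from the pool scores above the threshold."""
    return 1 - pool.cdf(threshold)

for name, pool in [("probably good", probably_good),
                   ("maybe amazing", maybe_amazing)]:
    print(f"{name}: "
          f"P(pretty good) = {chance_above(pool, PRETTY_GOOD):.2f}, "
          f"P(outlier) = {chance_above(pool, OUTLIER):.2e}")
```

In this sketch, the "probably good" filter passes more "pretty good" candidates (about 84% versus 50%), but the "maybe amazing" filter is many orders of magnitude more likely to surface an outlier.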