My notes on: Searching for outliers

By Vasco Grilo @ 2022-06-03T16:19 (+9)

The table below summarises the properties of light-tailed and heavy-tailed distributions as described in Ben Kuhn's article Searching for outliers. Any errors or misinterpretations are my own. The following sections contain excerpts from Ben Kuhn's article on each of the properties.

| Distribution | Light-tailed | Heavy-tailed |
| --- | --- | --- |
| Ratio between the top percentiles and the median | Low | High |
| Generation mechanism | Additive | Multiplicative |
| Importance of outliers | Low | High |
| Number of samples to get a really good outcome | Low | High |
| What to filter for | "Probably good" | "Maybe amazing" |

Ratio between the top percentiles and the median

As a rule of thumb, a heavy-tailed distribution is one where the top few percent of outcomes are a large multiple of the typical or median outcome:
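To make the rule of thumb concrete, here is a minimal sketch of my own (not from Ben Kuhn's article) that compares the ratio of the 99th percentile to the median for a normal and a lognormal distribution. The two distributions and their parameters (mean 100, standard deviation 15, and a log-scale sigma of 1.5) are assumptions chosen purely for illustration.

```python
import math
from statistics import NormalDist

def normal_top_to_median(mean, sd, q=0.99):
    """Ratio of the q-th quantile to the median of a normal distribution."""
    # For a normal distribution the median equals the mean.
    return NormalDist(mean, sd).inv_cdf(q) / mean

def lognormal_top_to_median(sigma, q=0.99):
    """Ratio of the q-th quantile to the median of a lognormal distribution."""
    # If ln(X) ~ Normal(mu, sigma), then quantile/median = exp(sigma * z_q),
    # which does not depend on mu.
    z = NormalDist().inv_cdf(q)
    return math.exp(sigma * z)

# Illustrative parameters (assumptions, not taken from the article).
light_ratio = normal_top_to_median(mean=100, sd=15)
heavy_ratio = lognormal_top_to_median(sigma=1.5)

print(f"light-tailed (normal):    p99 / median = {light_ratio:.2f}")
print(f"heavy-tailed (lognormal): p99 / median = {heavy_ratio:.2f}")
```

Under these assumed parameters the normal ratio stays close to 1, while the lognormal ratio is in the tens.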

Generation mechanism

Light-tailed distributions most often occur because the outcome is the result of many independent contributions, while heavy-tailed distributions often arise from the result of processes that are multiplicative or self-reinforcing:
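The additive versus multiplicative contrast can be sketched with a small simulation. This is my own illustration, not something from the article: each outcome combines 20 independent factors drawn uniformly from 0.5 to 1.5 (an arbitrary assumption), once by summing them and once by multiplying them.

```python
import math
import random

def percentile(values, q):
    """Value at quantile q (between 0 and 1), using a simple sorted-index rule."""
    ordered = sorted(values)
    return ordered[min(len(ordered) - 1, int(q * len(ordered)))]

def simulate(combine, n_outcomes=10_000, n_factors=20, seed=0):
    """Draw n_outcomes outcomes, each combining n_factors independent factors."""
    rng = random.Random(seed)
    outcomes = []
    for _ in range(n_outcomes):
        # The factor range 0.5 to 1.5 is an arbitrary, illustrative assumption.
        factors = [rng.uniform(0.5, 1.5) for _ in range(n_factors)]
        outcomes.append(combine(factors))
    return outcomes

def top_to_median(values):
    """Ratio of the 99th percentile to the median."""
    return percentile(values, 0.99) / percentile(values, 0.5)

additive = simulate(sum)              # outcome = sum of the factors
multiplicative = simulate(math.prod)  # outcome = product of the factors

print(f"additive:       p99 / median = {top_to_median(additive):.2f}")
print(f"multiplicative: p99 / median = {top_to_median(multiplicative):.2f}")
```

With the same factors, summing gives a top percentile only slightly above the median, whereas multiplying gives a top percentile many times larger than the median.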

Importance of outliers

Notably, in a light-tailed distribution, outliers don’t matter much:
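One way to see how much outliers matter is to ask what share of the total comes from the top 1% of samples. The sketch below is my own and assumes a normal distribution (mean 100, standard deviation 15) as the light-tailed case and a lognormal distribution (sigma 1.5) as the heavy-tailed case; both choices are illustrative, not from the article.

```python
import random

def top_share(values, top_fraction=0.01):
    """Fraction of the total sum contributed by the largest top_fraction of values."""
    ordered = sorted(values, reverse=True)
    k = max(1, int(len(ordered) * top_fraction))
    return sum(ordered[:k]) / sum(ordered)

rng = random.Random(0)  # fixed seed so the run is reproducible
n = 100_000

# Illustrative distributions and parameters (assumptions).
light = [rng.gauss(100, 15) for _ in range(n)]
heavy = [rng.lognormvariate(0, 1.5) for _ in range(n)]

light_share = top_share(light)
heavy_share = top_share(heavy)

print(f"light-tailed: top 1% of samples hold {light_share:.1%} of the total")
print(f"heavy-tailed: top 1% of samples hold {heavy_share:.1%} of the total")
```

In the light-tailed case the top 1% contribute little more than their 1% head count, so dropping them barely changes the total; in the heavy-tailed case they account for a large part of it.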

Number of samples to get a really good outcome

The most important thing to remember when sampling from heavy-tailed distributions is that getting lots of samples improves outcomes a ton:
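The effect of sample count can be sketched by looking at the median of the best of n independent draws, which sits at the quantile 0.5^(1/n) of the underlying distribution. The code below is my own illustration; the normal (mean 100, standard deviation 15) and lognormal (sigma 1.5) distributions are assumed stand-ins for the light-tailed and heavy-tailed cases.

```python
import math
from statistics import NormalDist

def best_of_n_quantile(n):
    """Quantile at which the median of the maximum of n independent draws sits."""
    # P(max <= x) = F(x)**n, so the median of the max solves F(x) = 0.5**(1/n).
    return 0.5 ** (1.0 / n)

def typical_best_normal(n, mean=100.0, sd=15.0):
    """Median best-of-n outcome for a normal distribution (illustrative parameters)."""
    return NormalDist(mean, sd).inv_cdf(best_of_n_quantile(n))

def typical_best_lognormal(n, sigma=1.5):
    """Median best-of-n outcome for a lognormal distribution with mu = 0."""
    return math.exp(sigma * NormalDist().inv_cdf(best_of_n_quantile(n)))

# How much better is the typical best of 1000 samples than the typical best of 10?
light_gain = typical_best_normal(1000) / typical_best_normal(10)
heavy_gain = typical_best_lognormal(1000) / typical_best_lognormal(10)

print(f"light-tailed: best of 1000 is {light_gain:.2f}x the best of 10")
print(f"heavy-tailed: best of 1000 is {heavy_gain:.2f}x the best of 10")
```

Under these assumptions, going from 10 to 1000 samples improves the typical best outcome only modestly for the light-tailed distribution, and by roughly an order of magnitude for the heavy-tailed one.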

What to filter for

Another consequence of the numbers game is that the strategy that you use to filter your samples is very important:
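The difference between filtering for "probably good" and "maybe amazing" can be sketched with two hypothetical candidates whose outcomes are lognormal. This is my own toy example, not from the article: candidate A has a higher median but low variance, candidate B has a lower median but high variance, and the thresholds for "good" (1.5) and "amazing" (10) are arbitrary assumptions.

```python
import math
from statistics import NormalDist

def prob_exceeds(mu, sigma, threshold):
    """P(X > threshold) when ln(X) ~ Normal(mu, sigma)."""
    return 1.0 - NormalDist(mu, sigma).cdf(math.log(threshold))

# Hypothetical candidates: (mu, sigma) of the log-outcome.
candidate_a = (math.log(2.0), 0.3)  # median 2, low variance
candidate_b = (0.0, 1.5)            # median 1, high variance

GOOD = 1.5      # assumed threshold for a "good" outcome
AMAZING = 10.0  # assumed threshold for an "amazing" outcome

p_good_a = prob_exceeds(*candidate_a, GOOD)
p_good_b = prob_exceeds(*candidate_b, GOOD)
p_amazing_a = prob_exceeds(*candidate_a, AMAZING)
p_amazing_b = prob_exceeds(*candidate_b, AMAZING)

print(f"P(good):    A = {p_good_a:.3f}, B = {p_good_b:.3f}")
print(f"P(amazing): A = {p_amazing_a:.2e}, B = {p_amazing_b:.2e}")
```

A "probably good" filter would pick candidate A, while a "maybe amazing" filter would pick candidate B, because only B has a meaningful chance of clearing the higher threshold.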