On August 24, 1966, a talented playwright by the name of Tom Stoppard staged a play in Edinburgh, Scotland. The play had a curious title, “Rosencrantz and Guildenstern Are Dead.” Its central characters, Rosencrantz and Guildenstern, are childhood friends of Hamlet (of Shakespearean fame). The play opens with Guildenstern repeatedly tossing coins, which keep coming up Heads. Each outcome makes Guildenstern’s money-bag lighter and Rosencrantz’s heavier. As the drumbeat of Heads continues with a pitiless persistence, Guildenstern grows worried. He wonders whether he is secretly willing each coin to come up Heads as a self-inflicted punishment for some long-forgotten sin, or whether time stopped after the first flip and he and Rosencrantz are experiencing the same outcome over and over again.
Stoppard does a brilliant job of showing how the laws of probability are woven into our view of the world, into our sense of expectation, into the very fabric of human thought. When the 92nd flip also comes up as Heads, Guildenstern asks if he and Rosencrantz are within the control of an unnatural reality where the laws of probability no longer operate.
Guildenstern’s fears are of course unfounded. Granted, the likelihood of getting 92 Heads in a row with a fair coin is unimaginably small. In fact, it is (1/2)^92, which is a decimal point followed by 27 zeroes followed by 2. Guildenstern is more likely to be hit on the head by a meteorite.
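The probability of 92 Heads in a row with a fair coin is (1/2)^92. Here is a minimal Python sketch that evaluates it, first exactly and then as a decimal. The variable names are illustrative only.

```python
from fractions import Fraction

# Probability of 92 Heads in a row with a fair coin: (1/2)^92
p_exact = Fraction(1, 2) ** 92   # exact rational value
p = float(p_exact)               # floating-point approximation

print(f"scientific notation: {p:.3e}")
print(f"fixed-point (30 decimal places): {p:.30f}")
```

The fixed-point print makes it easy to count the zeroes between the decimal point and the first non-zero digit.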
Guildenstern only has to come back the next day to flip another sequence of 92 coin tosses and the result will almost certainly be vastly different. If he were to follow this routine every day, he would discover that on most days the number of Heads more or less matches the number of Tails. Guildenstern is experiencing a fascinating behavior of our universe known as the Law of Large Numbers.
The LLN, as it is called, comes in two flavors: the weak and the strong. The weak LLN can be more intuitive and easier to relate to. But it is also easy to misinterpret. I’ll cover the weak version in this article and leave the discussion on the strong version for a later article.
The weak Law of Large Numbers concerns itself with the relationship between the sample mean and the population mean. I’ll explain what it says in plain text:
Suppose you draw a random sample of a certain size, say 100, from the population. By the way, make a mental note of the term sample size. The size of the sample is the ringmaster, the grand pooh-bah of this law. Now calculate the mean of this sample and set it aside. Next, repeat this process many many times. What you’ll get is a set of imperfect means. The means are imperfect because there will always be a ‘gap’, a delta, a deviation between them and the true population mean. Let’s assume you’ll tolerate a certain deviation. If you select a sample mean at random from this set of means, there will be a chance that the absolute difference between the sample mean and the population mean will exceed your tolerance.
The weak Law of Large Numbers says that the probability of this deviation exceeding your selected level of tolerance will shrink to zero as the sample size grows either to infinity or to the size of the population.
No matter how tiny your selected level of tolerance is, as you draw sets of samples of ever-increasing size, it’ll become increasingly unlikely that the mean of a randomly chosen sample from the set deviates from the population mean by more than this tolerance.
To see how the weak LLN works we’ll run it through an example. And for that, allow me, if you will, to take you to the cold, brooding expanse of the Northeastern North Atlantic Ocean.
Every day, the Government of Ireland publishes a dataset of water temperature measurements taken from the surface of the North East North Atlantic. This dataset contains hundreds of thousands of measurements of surface water temperature indexed by latitude and longitude. For instance, the data for June 21, 2023 is as follows:
It’s kind of hard to imagine what eight hundred thousand surface temperature values look like. So let’s create a scatter plot to visualize this data. I’ve shown this plot below. The vacant white areas in the plot represent Ireland and the United Kingdom.
As a student of statistics, you will never have access to the ‘population’. So you’ll be correct in severely chiding me if I declare this collection of roughly 800,000 temperature measurements to be the ‘population’. But bear with me for a little while. You will soon see why, in our quest to understand the LLN, it helps us to consider this data as the ‘population’.
So let’s assume that this data is — ahem…cough — the population. The average surface water temperature across the 810219 locations in this population of values is 17.25840 degrees Celsius. 17.25840 is simply the average of the 810K temperature measurements. We’ll designate this value as the population mean, μ. Remember this value. You’ll need to refer to it often.
Now suppose this population of 810219 values is not accessible to you. Instead, all you have access to is a meager little sample of 20 random locations drawn from this population. Here’s one such random sample:
The mean temperature of the sample is 16.9452414 degrees C. This is our sample mean X_bar which is computed as follows:
X_bar = (X1 + X2 + X3 + … + X20) / 20
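As a rough illustration of this calculation, here is a minimal Python sketch. It assumes a synthetic stand-in population (normally distributed values with a made-up mean and spread), not the actual temperature dataset, so its numbers will differ from the ones quoted in this article. The names `population`, `sample`, and `x_bar` are illustrative.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed so the run is repeatable

# ASSUMPTION: synthetic stand-in for the temperature measurements.
# The mean (17.25) and spread (1.4) are made up for illustration.
population = [rng.gauss(17.25, 1.4) for _ in range(100_000)]
mu = statistics.fmean(population)  # "population mean" of the stand-in data

# Draw 20 values at random, with replacement, so the draws are independent.
sample = rng.choices(population, k=20)

# X_bar = (X1 + X2 + ... + X20) / 20
x_bar = sum(sample) / len(sample)

print(f"population mean mu = {mu:.4f}")
print(f"sample mean X_bar  = {x_bar:.4f}")
```

Changing the seed produces a different sample and therefore a different sample mean, while `mu` stays tied to the population.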
You can just as easily draw a second, a third, indeed any number of such random samples of size 20 from the same population. Here are a few random samples for illustration:
A quick aside on what a random sample really is
Before moving ahead, let’s pause a bit to get a certain degree of perspective on the concept of a random sample. It will make it easier to understand how the weak LLN works. And to acquire this perspective, I must introduce you to the casino slot machine:
The slot machine shown above contains three slots. Every time you crank down the arm of the machine, the machine fills each slot with a picture that the machine has selected randomly from an internally maintained population of pictures such as a list of fruit pictures. Now imagine a slot machine with 20 slots named X1 through X20. Assume that the machine is designed to select values from a population of 810219 temperature measurements. When you pull down the arm, each one of the 20 slots — X1 through X20 — fills with a randomly selected value from the population of 810219 values. Therefore, X1 through X20 are random variables that can each hold any value from the population. Taken together they form a random sample. Put another way, each element of a random sample is itself a random variable.
X1 through X20 have a few interesting properties:
- The value that X1 acquires is independent of the values that X2 through X20 acquire. The same applies to X2, X3, …, X20. Thus X1 through X20 are independent random variables.
- Because X1, X2, …, X20 can each hold any value from the population, the expected value of each of them is the population mean, μ. Using the notation E() for expectation, we write this result as follows:
E(X1) = E(X2) = … = E(X20) = μ.
- X1 through X20 have identical probability distributions.
Thus, X1, X2,…,X20 are independent, identically distributed (i.i.d.) random variables.
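The slot-machine picture can be sketched in a few lines of Python. This is an illustration under stated assumptions: the population is a synthetic stand-in rather than the real temperature data, and `pull_arm` is a hypothetical helper invented for this sketch. Averaging what lands in a single slot over many pulls should come out close to μ, which is what E(X1) = … = E(X20) = μ suggests.

```python
import random
import statistics

rng = random.Random(7)  # fixed seed for repeatability

# ASSUMPTION: synthetic stand-in population (made-up mean and spread).
population = [rng.gauss(17.25, 1.4) for _ in range(50_000)]
mu = statistics.fmean(population)

def pull_arm(n_slots=20):
    """One pull of the arm: fill every slot with an independent random
    draw (with replacement) from the population."""
    return rng.choices(population, k=n_slots)

pulls = [pull_arm() for _ in range(5_000)]

# Average the value seen in slot X1 (index 0) and in slot X20 (index 19)
# across all pulls. Both should be close to the population mean.
mean_x1 = statistics.fmean([p[0] for p in pulls])
mean_x20 = statistics.fmean([p[19] for p in pulls])

print(f"mu = {mu:.3f}, mean of X1 = {mean_x1:.3f}, mean of X20 = {mean_x20:.3f}")
```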
…and now we get back to showing how the weak LLN works
Let’s compute the mean (denoted by X_bar) of this 20 element sample and set it aside. Now let’s once again crank down the machine’s arm and out will pop another 20-element random sample. We’ll compute its mean and set it aside too. If we repeat this process one thousand times, we will have computed one thousand sample means.
Here’s a table of 1000 sample means computed this way. We’ll designate them as X_bar_1 to X_bar_1000:
Now consider the following statement carefully:
Since the sample mean is calculated from a random sample, the sample mean is itself a random variable.
At this point, if you are sagely nodding your head and stroking your chin, it is very much the right thing to do. The realization that the sample mean is a random variable is one of the most penetrating realizations one can have in statistics.
Notice also how each sample mean in the table above is some distance away from the population mean, μ. Let’s plot a histogram of these sample means to see how they are distributed around μ:
Most of the sample means seem to lie close to the population mean of 17.25840 degrees Celsius. However, there are some that are considerably distant from μ. Suppose your tolerance for this distance is 0.25 degrees Celsius. Now suppose you were to plunge your hand into this bucket of 1000 sample means, grab whichever mean falls within your grasp, and pull it out. What is the probability that the absolute difference between this mean and μ is equal to or greater than 0.25 degrees C? To estimate this probability, you must count the number of sample means that are at least 0.25 degrees away from μ and divide this number by 1000.
In the above table, this count happens to be 422, and so the probability P(|X_bar - μ| ≥ 0.25) works out to be 422/1000 = 0.422.
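The counting procedure just described can be sketched as follows. The sketch assumes a synthetic stand-in population with a made-up mean and spread, so the estimate it prints will not reproduce the 0.422 obtained from the real temperature data. The constant names are illustrative.

```python
import random
import statistics

rng = random.Random(2023)  # fixed seed for repeatability

# ASSUMPTION: synthetic stand-in population (made-up mean and spread).
population = [rng.gauss(17.25, 1.4) for _ in range(100_000)]
mu = statistics.fmean(population)

SAMPLE_SIZE = 20
NUM_SAMPLES = 1000
TOLERANCE = 0.25

# Draw 1000 random samples of size 20 and keep the mean of each one.
sample_means = [
    statistics.fmean(rng.choices(population, k=SAMPLE_SIZE))
    for _ in range(NUM_SAMPLES)
]

# Count the means that are at least TOLERANCE away from mu,
# then divide by the number of samples.
num_exceeds = sum(1 for x_bar in sample_means if abs(x_bar - mu) >= TOLERANCE)
probability = num_exceeds / NUM_SAMPLES

print(f"{num_exceeds} of {NUM_SAMPLES} means exceed the tolerance")
print(f"estimated probability = {probability:.3f}")
```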
Let’s park this probability for a minute.
Now repeat all of the above steps, but this time use a sample size of 100 instead of 20. So here’s what you will do: draw 1000 random samples each of size 100, take the mean of each sample, store away all those means, count the ones that are at least 0.25 degrees C away from μ, and divide this count by 1000. If that sounded like the labors of Hercules, you were not mistaken. So take a moment to catch your breath. And once you are all caught up, notice below what you have got as the fruit for your labors.
The table below contains the means from the 1000 random samples, each of size 100:
Out of these one thousand means, fifty-six happen to deviate by at least 0.25 degrees C from μ. That gives you the probability that you’ll run into such a mean as 56/1000 = 0.056. This probability is decidedly smaller than the 0.422 we computed earlier when the sample size was only 20.
If you repeat this sequence of steps multiple times, each time with a different sample size that increases incrementally, you will get yourself a table full of probabilities. I’ve done this exercise for you by dialing up the sample size from 10 through 490 in steps of 10. Here’s the outcome:
Each row in this table corresponds to 1000 different samples that I drew at random from the population of 810219 temperature measurements. The sample_size column mentions the size of each of these 1000 samples. Once drawn, I took the mean of each sample and counted the ones that were at least 0.25 degrees C apart from μ. The num_exceeds_tolerance column mentions this count. The probability column is num_exceeds_tolerance / 1000, that is, the count divided by the number of samples.
Notice how this count attenuates rapidly as the sample size increases. And so does the corresponding probability P(|X_bar - μ| ≥ 0.25). By the time the sample size reaches 320, the probability has decayed to zero. It blips up to 0.001 occasionally, but that’s because I have drawn a finite number of samples. If I were to draw 10000 samples each time instead of 1000, not only would the occasional blips flatten out but the attenuation of probabilities would also become smoother.
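A table of this kind can be generated with a short loop. The following sketch uses a synthetic stand-in population and only a handful of sample sizes to keep the run short, so it illustrates the procedure rather than reproducing the numbers above. `exceed_probability` is a hypothetical helper written for this sketch.

```python
import random
import statistics

rng = random.Random(11)  # fixed seed for repeatability

# ASSUMPTION: synthetic stand-in population (made-up mean and spread).
population = [rng.gauss(17.25, 1.4) for _ in range(100_000)]
mu = statistics.fmean(population)

def exceed_probability(sample_size, tolerance=0.25, num_samples=1000):
    """Fraction of num_samples sample means that lie at least
    `tolerance` away from the population mean mu."""
    num_exceeds = 0
    for _ in range(num_samples):
        x_bar = statistics.fmean(rng.choices(population, k=sample_size))
        if abs(x_bar - mu) >= tolerance:
            num_exceeds += 1
    return num_exceeds / num_samples

# A few sample sizes instead of the full 10..490 sweep.
results = {n: exceed_probability(n) for n in (10, 20, 50, 100, 200, 400)}

print("sample_size  probability")
for n, p in results.items():
    print(f"{n:11d}  {p:.3f}")
```

Raising `num_samples` makes each estimate less noisy, at the cost of a longer run.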
The following graph plots P(|X_bar - μ| ≥ 0.25) against sample size. It puts in sharp relief how the probability plunges to zero as the sample size grows.
In place of 0.25 degrees C, what if you chose a different tolerance — either a lower or a higher value? Will the probability decay irrespective of your selected level of tolerance? The following family of plots illustrates the answer to this question.
No matter how frugal, how tiny, your choice of the tolerance (ε) is, the probability P(|X_bar - μ| ≥ ε) will always converge to zero as the sample size grows. This is the weak Law of Large Numbers in action.
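To get a feel for how the choice of tolerance affects the decay, here is a small sketch that estimates the probability for three tolerances and three sample sizes. It again assumes a synthetic stand-in population, and the tolerances and sample sizes are arbitrary choices for illustration.

```python
import random
import statistics

rng = random.Random(5)  # fixed seed for repeatability

# ASSUMPTION: synthetic stand-in population (made-up mean and spread).
population = [rng.gauss(17.25, 1.4) for _ in range(100_000)]
mu = statistics.fmean(population)

TOLERANCES = (0.1, 0.25, 0.5)
SAMPLE_SIZES = (10, 100, 1000)
NUM_SAMPLES = 500

# table[eps][n] = estimated P(|X_bar - mu| >= eps) for sample size n
table = {eps: {} for eps in TOLERANCES}
for n in SAMPLE_SIZES:
    # One batch of sample means per sample size, reused for every tolerance.
    means = [
        statistics.fmean(rng.choices(population, k=n))
        for _ in range(NUM_SAMPLES)
    ]
    for eps in TOLERANCES:
        table[eps][n] = sum(abs(m - mu) >= eps for m in means) / NUM_SAMPLES

for eps in TOLERANCES:
    row = "  ".join(f"n={n}: {table[eps][n]:.3f}" for n in SAMPLE_SIZES)
    print(f"eps={eps}: {row}")
```

A smaller tolerance needs a larger sample before the estimated probability becomes negligible, but in each row the estimate falls as the sample size grows.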
The behavior of the weak LLN can be formally stated as follows:
Suppose X1, X2, …, Xn are i.i.d. random variables that together form a random sample of size n. Suppose X_bar_n is the mean of this sample. Suppose also that E(X1) = E(X2) = … = E(Xn) = μ. Then for any positive real number ε, the probability of X_bar_n being at least ε away from μ tends to zero as the size of the sample tends to infinity. The following exquisite equation captures this behavior:
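In standard notation, with \bar{X}_n standing for X_bar_n, the statement is usually written as:

```latex
\lim_{n \to \infty} P\left( \left| \bar{X}_n - \mu \right| \ge \varepsilon \right) = 0
\quad \text{for every } \varepsilon > 0
```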
Over the 310-year history of this law, mathematicians have been able to progressively relax the requirement that X1 through Xn be independent and identically distributed while still preserving the spirit of the law.
The principle of “convergence in probability”, the “plim” notation, and the art of saying really important things in really few words
The particular style of converging to some value using probability as the means of transport is called convergence in probability. In general, it is stated as follows:
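A standard way of writing the definition of convergence in probability, given here as a sketch in LaTeX, is:

```latex
\lim_{n \to \infty} P\left( \left| X_n - X \right| \ge \varepsilon \right) = 0
\quad \text{for every } \varepsilon > 0
```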
In the above equation, X_n and X are random variables. ε is a positive real number. The equation says that as n tends to infinity, X_n converges in probability to X.
Throughout the immense expanse of statistics, you’ll keep running into a quietly unassuming notation called plim. It’s pronounced ‘p lim’, or ‘plim’ (like the word ‘plum’ but with an ‘i’), or probability limit. plim is the short-form way of saying that a measure such as the mean converges in probability to a specific value. Using plim, the weak Law of Large Numbers can be stated pithily as follows:
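In LaTeX, one common rendering of this statement (assuming \bar{X}_n denotes the mean of a sample of size n) is:

```latex
\operatorname*{plim}_{n \to \infty} \bar{X}_n = \mu
```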
Or simply as:
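A common shorthand drops the limit subscript, on the understanding that n tends to infinity. As a sketch:

```latex
\operatorname{plim} \bar{X}_n = \mu
```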
The brevity of notation is not in the least surprising. Mathematicians are drawn to brevity like bees to nectar. When it comes to conveying profound truths, mathematics could well be the most ink-efficient field. And within this efficiency-obsessed field, plim occupies podium position. You will struggle to unearth as profound a concept as plim expressed in less ink, or fewer electrons.
But struggle no more. If the laconic beauty of plim left you wanting for more, here’s another, possibly even more efficient, notation that conveys the same meaning as plim:
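A widely used notation of this kind, and a reasonable guess at the one meant here, is an arrow annotated with the letter p. In LaTeX (assuming the amsmath package for \xrightarrow):

```latex
\bar{X}_n \xrightarrow{\;p\;} \mu
```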
At the top of this article, I mentioned that the weak Law of Large Numbers is noteworthy for what it does not say as much as for what it does say. Let me explain what I mean by that. The weak LLN is often misinterpreted to mean that as the sample size increases, its mean approaches the population mean or various generalizations of that idea. As we saw, such ideas about the weak LLN harbor no attachment to reality.
In fact, let’s bust a couple of myths regarding the weak LLN right away.
MYTH #1: As the sample size grows, the sample mean tends to the population mean.
This is quite possibly the most frequent misinterpretation of the weak LLN. However, the weak LLN makes no such assertion. To see why that is, consider the following situation: you have managed to get your arms around a really large sample. While you gleefully admire your achievement, you should also pose yourself the following questions: Just because your sample is large, must it also be well-balanced? What’s preventing nature from sucker punching you with a giant sample that contains an equally giant amount of bias? The answer is absolutely nothing! In fact, isn’t that what happened to Guildenstern with his sequence of 92 Heads? It was, after all, a completely random sample! If it just so happens to have a large bias, then despite the large sample size, the bias will blast away the sample mean to a point that is far away from the true population value. Conversely, a small sample can prove to be exquisitely well-balanced. The point is, as the sample size increases, the sample mean isn’t guaranteed to dutifully advance toward the population mean. Nature does not provide such unnecessary guarantees.
MYTH #2: As the sample size increases, pretty much everything about the sample — its median, its variance, its standard deviation — converges to the population values of the same.
This sentence is two myths bundled into one easy-to-carry package. Firstly, the weak LLN postulates a convergence in probability, not in value. Secondly, the weak LLN applies to the convergence in probability of only the sample mean, not any other statistic. The weak LLN does not address the convergence of other measures such as the median, variance, or standard deviation.
It’s one thing to state the weak LLN, and even demonstrate how it works using real-world data. But how can you be sure that it always works? Are there circumstances in which it will play spoilsport — situations in which the sample mean simply does not converge in probability to the population value? To know that, you must prove the weak LLN and, in doing so, precisely define the conditions in which it will apply.
It so happens that the weak LLN has a deliciously mouth-watering proof that uses, as one of its ingredients, the endlessly tantalizing Chebyshev’s Inequality. If that whets your appetite, stay tuned for my next article on the proof of the weak Law of Large Numbers.
It would be impolite to take leave of this topic without assuaging our friend Guildenstern’s worries. Let’s develop an appreciation for just how unquestionably unlikely a result it was that he experienced. We’ll simulate the act of tossing 92 unbiased coins using a pseudo-random generator. Heads will be encoded as 1 and Tails as 0. We’ll record the mean value of the 92 outcomes. The mean value is the fraction of times that the coin came up Heads. We’ll repeat this experiment ten thousand times to obtain ten thousand means of 92 coin tosses, and we’ll plot their frequency distribution. After completing this exercise, we will get the following kind of histogram plot:
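The simulation just described can be sketched in Python using only the standard library. The seed, the helper `fraction_of_heads`, and the crude text histogram (which stands in for a proper plot) are choices made for this sketch.

```python
import random
import statistics
from collections import Counter

rng = random.Random(1966)  # fixed seed for repeatability

NUM_TOSSES = 92
NUM_EXPERIMENTS = 10_000

def fraction_of_heads(num_tosses):
    """Toss a fair coin num_tosses times (Heads = 1, Tails = 0)
    and return the mean, i.e. the fraction of Heads."""
    heads = sum(1 for _ in range(num_tosses) if rng.random() < 0.5)
    return heads / num_tosses

means = [fraction_of_heads(NUM_TOSSES) for _ in range(NUM_EXPERIMENTS)]

# Crude text histogram: bucket the means to one decimal place.
histogram = Counter(round(m, 1) for m in means)
for bucket in sorted(histogram):
    print(f"{bucket:.1f}  {histogram[bucket]}")

# How many of the experiments produced 92 Heads in a row?
all_heads = sum(1 for m in means if m == 1.0)
print(f"mean of means = {statistics.fmean(means):.4f}")
print(f"experiments with 92 Heads in a row: {all_heads}")
```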
We see that most of the sample means are grouped around the population mean of 0.5. Guildenstern’s result, 92 Heads in a row, corresponds to a sample mean of 1.0 and is an exceptionally unlikely outcome. Therefore, the frequency of this outcome is also vanishingly small. But contrary to Guildenstern’s fears, there is nothing unnatural about the outcome, and the laws of probability continue to operate with their usual gusto. Because Heads is encoded as 1, Guildenstern’s outcome is simply lurking inside the remote regions of the right tail of the plot, waiting with infinite patience to pounce upon some luckless coin-flipper whose only mistake was to be unimaginably unlucky.