Central Limit Theorem explained in a few lines of code

Jaroslaw Goslinski
Jun 6, 2020

In machine learning the Central Limit Theorem is crucial; everyone should at least know what it is and how to use it. My personal feeling is that it isn't easy to find a good source that explains the CLT well. What is the CLT? “In probability theory, the central limit theorem (CLT) establishes that, in some situations, when independent random variables are added, their properly normalized sum tends toward a normal distribution (informally a bell curve) even if the original variables themselves are not normally distributed.” (Wikipedia)
So in plain English: if we add up (or average) many independent samples of data, the result tends toward a normal distribution, even when the data themselves are not normally distributed. That doesn't mean much on its own, does it? I found https://statisticsbyjim.com/basics/central-limit-theorem/, where the explanation is clear and comes with a good set of examples. The missing piece in the great majority of explanations on the internet is that it is the sample means that have a Gaussian distribution, not the pooled samples themselves. The important thing is the nomenclature used:
The statistical population — a set of items that have the same distribution.
The sample — a subset selected from the population (we draw samples from it).
The data points — the individual values that make up each sample (a short sketch below ties these terms to code).
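
To tie these terms to code, here is a minimal sketch (assuming NumPy and the chi-square population used in the example below): we draw one sample of 10 data points and compute its mean. The CLT is a statement about the distribution of many such sample means.

import numpy as np

# one sample: 10 data points drawn from the population (chi-square with k = 2)
sample = np.random.chisquare(2, 10)
sample_mean = np.mean(sample)  # a single sample mean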

That's it! Now let's construct the full example in Python.
To show the CLT at work we need a distribution with finite variance that is different from a Gaussian (a Gaussian works as well, but it makes the idea of the CLT harder to see). I decided to use the chi-square distribution with k = 2 degrees of freedom:

import numpy as np
import matplotlib.pyplot as plt

s10 = np.random.chisquare(2, (10, 1500000))

We produced 1.5M samples with 10 observations each, all drawn from a chi-square distribution. Now let's generate another 1.5M samples, but with 30 observations each:

s30 = np.random.chisquare(2, (30, 1500000))

Now we need to calculate the mean for every sample:

s_mean_30 = np.mean(s30, 0)
s_mean_10 = np.mean(s10, 0)
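
As a quick sanity check (reusing the arrays generated above), we can print the shapes of the resulting vectors:

print(s_mean_10.shape, s_mean_30.shape)  # (1500000,) (1500000,)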

Note that s_mean_10 and s_mean_30 each have 1.5M entries, one mean per sample. Finally, let's plot the histogram of the chi-square distribution together with the distributions of the means:

# create bins
bins = np.linspace(0, 10, 1000)
# plot a histogram of the raw chi-square values (the first data point of every sample)
# together with the distributions of the sample means
plt.hist(s30[0, :], bins, alpha=0.4, label='chi-square, k = 2')
plt.hist(s_mean_10, bins, alpha=0.4, label='distribution of sample means, s = 10')
plt.hist(s_mean_30, bins, alpha=0.4, label='distribution of sample means, s = 30')
plt.legend(loc='upper right')
plt.show()

We get the following plot:

We can see that the original distribution is right-skewed (chi-square), while the distributions of the sample means are Gaussian-like. The more data points per sample we have, the more "normal" the distribution of the means becomes. Another observation is that the more data points per sample, the narrower the resulting distribution. This follows from the CLT, which states that the mean of the sample means equals the mean of the original distribution, and that their standard deviation equals the population standard deviation divided by the square root of the number of data points per sample. A common rule of thumb is to take at least 30 data points per sample to get an approximately Gaussian distribution.
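
We can verify both claims numerically with a short sketch that reuses the arrays generated above. For a chi-square distribution with k degrees of freedom the mean is k and the variance is 2k, so here the population mean is 2 and the population standard deviation is 2:

mu, sigma = 2.0, 2.0  # population mean and standard deviation for chi-square with k = 2
print(np.mean(s_mean_10), mu)                  # both close to 2.0
print(np.std(s_mean_10), sigma / np.sqrt(10))  # both close to 0.63
print(np.std(s_mean_30), sigma / np.sqrt(30))  # both close to 0.37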

That's all! I hope you liked this post!

https://github.com/jaroslav87/CentralLimitTheorem
