Statistics - Central limit theorem (CLT)

1 - About

The central limit theorem (CLT) is a theorem of probability theory, sometimes called the unofficial sovereign of probability theory.

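It establishes that when many independent random variables are added, their suitably normalized sum tends toward a normal distribution, even if the original variables are not normally distributed.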

2 - History

The first version of this theorem was postulated by the French-born mathematician Abraham de Moivre who, in a remarkable article published in 1733, used the normal distribution to approximate the distribution of the number of heads resulting from many tosses of a fair coin.

The actual term “central limit theorem” (in German: “zentraler Grenzwertsatz”) was first used by George Pólya in 1920, in the title of his paper “On the central limit theorem of calculus of probability and the problem of moments” (written in German). He used the word “central” to emphasize the theorem's importance in probability theory.

3 - More

The suitably normalized sum of k independent random variables approaches a normal distribution as k increases.

The central limit theorem began in 1733 when de Moivre approximated binomial probabilities using the integral of $e^{-x^2}$ (the Gaussian function). It achieved its final form around 1935 in papers by Feller, Lévy, and Cramér.

The central limit theorem is a fundamental component of inferential statistics.

The theorem is a key concept in probability theory because it implies that probabilistic and statistical methods that work for normal distributions can be applied to many problems involving other types of distributions.
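
As a sketch, the statement above can be written more formally in its classical (Lindeberg-Lévy) form. The symbols X_i, μ, σ and Z_k are notation introduced here, and the variables are assumed to be independent and identically distributed with finite mean μ and finite variance σ²:

```latex
Z_k = \frac{\sum_{i=1}^{k} X_i - k\mu}{\sigma\sqrt{k}}
\;\xrightarrow{d}\; \mathcal{N}(0, 1)
\quad \text{as } k \to \infty
```

Equivalently, for large k the mean of the k variables is approximately normal with mean μ and standard deviation σ/√k.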

4 - Application

4.1 - Random Samples

The central limit theorem says that the means of many samples drawn from the same population (i.e. the sampling distribution of the mean) are approximately normally distributed, provided the samples follow the rules listed below.

Therefore:

The population does not have to be normally distributed. As long as we draw many samples of large enough size (N > 30 is a common rule of thumb), the sampling distribution of the mean will be approximately normal.

Rules

  • The sample must contain a large number of observations (N>30)
  • Each observation must be randomly generated (No relationship/dependencies between the observations)
  • Under these conditions, the shape of the distribution of sample means is approximately normal (not negatively or positively skewed, not uniform), whatever the shape of the population

Creation of a sampling distribution based on the mean estimator.


// A helper function to draw a histogram
function histogram(params) {

  var selector = params.selector;
  var data = params.data;

  // Range of the data
  var min = d3.min(data);
  var max = d3.max(data);

  // Graphics data
  var margin = { top: 30, right: 30, bottom: 30, left: 50 },
    width = 460 - margin.left - margin.right,
    height = 400 - margin.top - margin.bottom;
  // The (approximate) number of bins
  var Nbin = 20;

  // X axis: its ticks are reused as the thresholds (breaks) of the histogram
  var x = d3
    .scaleLinear()
    .domain([min, max]) // the data range
    .range([0, width]); // mapped to the width of the graphic

  // Set the parameters of the bin generator
  // (named binGenerator so that it does not shadow this function)
  var binGenerator = d3
    .histogram()
    .domain(x.domain()) // same domain as the x axis
    .thresholds(x.ticks(Nbin)); // approximately Nbin bins

  // Apply the generator to the data to get the bins
  var bins = binGenerator(data);

  // append the svg object to the body of the page
  // Set the dimensions and margins of the graph
  var svg = d3
    .select("#"+selector)
    .append("svg")
    .attr("width", width + margin.left + margin.right)
    .attr("height", height + margin.top + margin.bottom)
    .append("g")
    .attr("transform", "translate(" + margin.left + "," + margin.top + ")");

  // Add the x axis
  svg
    .append("g")
    .attr("transform", "translate(0," + height + ")")
    .call(d3.axisBottom(x));

  // Y axis: scale and draw:
  var y = d3
    .scaleLinear()
    .range([height, 0])
    .domain([
      0,
      d3.max(bins, function(d) {
        return d.length;
      })
    ]);

  svg.append("g").call(d3.axisLeft(y));

  // Append the bar rectangles to the svg element
  svg
    .selectAll("rect")
    .data(bins)
    .enter()
    .append("rect")
    .attr("x", 1)
    .attr("transform", function(d) {
      return "translate(" + x(d.x0) + "," + y(d.length) + ")";
    })
    .attr("width", function(d) {
      return x(d.x1) - x(d.x0) - 1;
    })
    .attr("height", function(d) {
      return height - y(d.length);
    })
    .style("fill", "#69b3a2");
}

  • Creating the population data (uniformly distributed random values)

var population_n = 10000;
var population_max = 100;
var population_data = [];

// Uniformly distributed integers between 0 and population_max - 1
for (var i = 0; i < population_n; i++) {
  var random_value = Math.floor(Math.random() * population_max);
  population_data.push(random_value);
}

histogram({ selector: "population", data: population_data});

  • Sampling the population 1000 times with a sample size of 20, calculating the mean and adding it to the sample distribution

// Sample data
var sample_distribution_data = [];
var sample_distribution_n = 1000; // number of samples
var sample_n = 20; // size of each sample
for (var j = 0; j < sample_distribution_n; j++) {
  var sample_data = [];
  for (var i = 0; i < sample_n; i++) {
    // Pick an index over the whole population
    // (not only over its first population_max entries)
    var population_random_index = Math.floor(
      Math.random() * population_data.length
    );
    sample_data.push(population_data[population_random_index]);
  }
  sample_distribution_data.push(d3.mean(sample_data));
}
histogram({ selector: "sample", data: sample_distribution_data });


<!-- The HTML page -->
<h1>The Population Distribution</h1>
<p>The population was generated with random data</p>
<div id="population"></div>
<h1>The Sample Distribution (Distribution of the sample mean)</h1>
<p>The sample distribution created from the mean of 1000 samples approximately follows a normal distribution, as stated by the Central Limit Theorem</p>
<div id="sample"></div>
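
The histogram above can also be checked numerically. The following is a minimal sketch in plain JavaScript (no d3), assuming a population of 10000 uniform integers like the one in the demo. The helper names `mean`, `standardDeviation` and `sampleMeans` are hypothetical, introduced only for this illustration. It compares the spread of the sample means with the value σ/√N that the theorem predicts.

```javascript
// Mean of an array of numbers
function mean(values) {
  return values.reduce(function (sum, v) { return sum + v; }, 0) / values.length;
}

// Population standard deviation of an array of numbers
function standardDeviation(values) {
  var m = mean(values);
  var squaredDeviations = values.map(function (v) { return (v - m) * (v - m); });
  return Math.sqrt(mean(squaredDeviations));
}

// Draw `sampleCount` samples of size `sampleSize` (with replacement)
// and return the mean of each sample
function sampleMeans(population, sampleSize, sampleCount) {
  var means = [];
  for (var j = 0; j < sampleCount; j++) {
    var sample = [];
    for (var i = 0; i < sampleSize; i++) {
      var index = Math.floor(Math.random() * population.length);
      sample.push(population[index]);
    }
    means.push(mean(sample));
  }
  return means;
}

// Hypothetical data mirroring the demo: 10000 uniform integers from 0 to 99
var checkPopulation = [];
for (var k = 0; k < 10000; k++) {
  checkPopulation.push(Math.floor(Math.random() * 100));
}

var checkMeans = sampleMeans(checkPopulation, 20, 1000);
// Spread predicted by the theorem: sigma / sqrt(sample size)
var predictedSpread = standardDeviation(checkPopulation) / Math.sqrt(20);

console.log("Population mean      :", mean(checkPopulation));
console.log("Mean of sample means :", mean(checkMeans));
console.log("Predicted spread     :", predictedSpread);
console.log("Observed spread      :", standardDeviation(checkMeans));
```

The two means should be close to each other, and the observed spread should be close to the predicted one. The exact figures change on every run because the data is random.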

4.2 - Tosses of a fair coin

Figure: probability of getting a given number of heads in a series of coin flips

If you flipped a coin 10 times, over and over, you would get 5 heads and 5 tails most often, 6 and 4 sometimes, and so on: the number of heads follows a binomial distribution whose shape is approximately normal.

A simple example of this is that if one flips a coin many times, the probability of getting a given number of heads in a series of flips approaches a normal curve, with mean equal to half the total number of flips in each series. As the number of flips per series grows toward infinity, the suitably rescaled distribution converges to a normal curve.
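
The approximation described above can be checked without any simulation. The sketch below compares the exact binomial probability of a given number of heads with the density of a normal curve that has the same mean (n/2) and standard deviation (√n/2). The function names `binomialProbability` and `normalDensity` are hypothetical helpers written for this illustration, and 50 flips per series is an assumption that matches the demo further down.

```javascript
// Exact probability of k heads in n flips of a fair coin: C(n, k) / 2^n
function binomialProbability(n, k) {
  var coefficient = 1;
  for (var i = 1; i <= k; i++) {
    coefficient = (coefficient * (n - k + i)) / i;
  }
  return coefficient / Math.pow(2, n);
}

// Density of the normal curve with the given mean and standard deviation
function normalDensity(x, mean, sd) {
  var z = (x - mean) / sd;
  return Math.exp(-0.5 * z * z) / (sd * Math.sqrt(2 * Math.PI));
}

var flips = 50;
var headsMean = flips / 2; // n * p with p = 0.5
var headsSd = Math.sqrt(flips) / 2; // sqrt(n * p * (1 - p))

// Print both values for a few numbers of heads around the mean
for (var heads = 15; heads <= 35; heads += 5) {
  console.log(
    heads + " heads: exact = " + binomialProbability(flips, heads).toFixed(4) +
    ", normal approximation = " + normalDensity(heads, headsMean, headsSd).toFixed(4)
  );
}
```

The two columns should agree closely near the mean, and the agreement improves as the number of flips per series grows.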


// A helper function to draw a histogram
function histogram(params) {

  var selector = params.selector;
  var data = params.data;
  var binCount = params.bins; // requested number of bins
  var min = params.min;
  var max = params.max;

  // Graphics data
  var margin = { top: 30, right: 30, bottom: 30, left: 50 },
    width = 460 - margin.left - margin.right,
    height = 400 - margin.top - margin.bottom;

  // X axis: its ticks are reused as the thresholds (breaks) of the histogram
  var x = d3
    .scaleLinear()
    .domain([min, max]) // the range passed by the caller
    .range([0, width]); // mapped to the width of the graphic

  // Set the parameters of the bin generator
  // (named binGenerator so that it does not shadow this function)
  var binGenerator = d3
    .histogram()
    .domain(x.domain()) // same domain as the x axis
    .thresholds(x.ticks(binCount)); // approximately binCount bins

  // Apply the generator to the data to get the bins
  var bins = binGenerator(data);

  // append the svg object to the body of the page
  // Set the dimensions and margins of the graph
  var svg = d3
    .select("#"+selector)
    .append("svg")
    .attr("width", width + margin.left + margin.right)
    .attr("height", height + margin.top + margin.bottom)
    .append("g")
    .attr("transform", "translate(" + margin.left + "," + margin.top + ")");

  // Add the x axis
  svg
    .append("g")
    .attr("transform", "translate(0," + height + ")")
    .call(d3.axisBottom(x));

  // Y axis: scale and draw:
  var y = d3
    .scaleLinear()
    .range([height, 0])
    .domain([
      0,
      d3.max(bins, function(d) {
        return d.length;
      })
    ]);

  svg.append("g").call(d3.axisLeft(y));

  // Append the bar rectangles to the svg element
  svg
    .selectAll("rect")
    .data(bins)
    .enter()
    .append("rect")
    .attr("x", 1)
    .attr("transform", function(d) {
      return "translate(" + x(d.x0) + "," + y(d.length) + ")";
    })
    .attr("width", function(d) {
      return x(d.x1) - x(d.x0) - 1;
    })
    .attr("height", function(d) {
      return height - y(d.length);
    })
    .style("fill", "#69b3a2");
}

  • Creating the coin flip simulation

var flip_n = 50; // flips per series
var head_distribution_n = 10000; // number of series
var head_distribution = []; // number of heads in each series
for (var i = 0; i < head_distribution_n; i++) {
  var flip_results = [];
  for (var j = 0; j < flip_n; j++) {
    var flip_value = Math.round(Math.random()); // 0 (tails) or 1 (heads)
    flip_results.push(flip_value);
  }
  head_distribution.push(d3.sum(flip_results));
}

histogram({ 
    selector: "head_distribution", 
    data: head_distribution, 
    bins: flip_n,
    min: 0,
    max: flip_n
    });


<h1>The heads probability distribution</h1>
<p>The probability of getting a given number of heads in a series of flips should approach a normal curve</p>
<p>This distribution is built from the number of heads counted in 10000 series of 50 coin flips</p>
<div id="head_distribution"></div>

4.3 - Errors of measurements

The occurrence of the Gaussian probability density in errors of measurement (which result from the combination of very many, very small elementary errors), in diffusion processes, and so on can be explained by the very same limit theorem.

The central limit theorem explains the common appearance of the “bell curve” in density estimates applied to real world data. In cases like electronic noise, examination grades, and so on, we can often regard a single measured value as the weighted average of many small effects.
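
As an illustration of this idea, the sketch below builds each simulated measurement as a true value plus the plain (unweighted) sum of many small independent errors, then checks how many measurements fall within one standard deviation of the true value. For a normal distribution that share is about 68%. The function `measure`, the uniform error model and all the numbers are assumptions chosen for the example, not part of any real measurement process.

```javascript
// One simulated measurement: a true value plus many small independent errors,
// each drawn uniformly from [-0.5, 0.5)
function measure(trueValue, errorCount) {
  var totalError = 0;
  for (var i = 0; i < errorCount; i++) {
    totalError += Math.random() - 0.5;
  }
  return trueValue + totalError;
}

var trueValue = 100; // hypothetical quantity being measured
var errorCount = 48; // hypothetical number of elementary errors
var measurementCount = 5000;

// Each uniform error has variance 1/12, so the sum has variance errorCount / 12
var expectedSd = Math.sqrt(errorCount / 12);

var withinOneSd = 0;
for (var m = 0; m < measurementCount; m++) {
  if (Math.abs(measure(trueValue, errorCount) - trueValue) <= expectedSd) {
    withinOneSd++;
  }
}

// For a normal distribution, about 68% of the values lie within one standard deviation
console.log("Share within one standard deviation:", withinOneSd / measurementCount);
```

Although every elementary error is uniform, the printed share should be close to the one expected from a bell curve.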

Demo with the error of a pseudo-random number generator:


// A helper function to draw precomputed histogram bins
function histogram_graphic(params) {

  var selector = params.selector;
  var data = params.data;
  var bins = params.bins; // bins already computed with d3.histogram()

  // Range of the data
  var min = d3.min(data);
  var max = d3.max(data);

  // Graphics data
  var margin = { top: 30, right: 30, bottom: 30, left: 50 },
    width = 460 - margin.left - margin.right,
    height = 400 - margin.top - margin.bottom;

  // X axis
  var x = d3
    .scaleLinear()
    .domain([min, max]) // the data range
    .range([0, width]); // mapped to the width of the graphic


  // append the svg object to the body of the page
  // Set the dimensions and margins of the graph
  var svg = d3
    .select("#"+selector)
    .append("svg")
    .attr("width", width + margin.left + margin.right)
    .attr("height", height + margin.top + margin.bottom)
    .append("g")
    .attr("transform", "translate(" + margin.left + "," + margin.top + ")");

  // Add the x axis
  svg
    .append("g")
    .attr("transform", "translate(0," + height + ")")
    .call(d3.axisBottom(x));

  // Y axis: scale and draw:
  var y = d3
    .scaleLinear()
    .range([height, 0])
    .domain([
      0,
      d3.max(bins, function(d) {
        return d.length;
      })
    ]);

  svg.append("g").call(d3.axisLeft(y));

  // Append the bar rectangles to the svg element
  svg
    .selectAll("rect")
    .data(bins)
    .enter()
    .append("rect")
    .attr("x", 1)
    .attr("transform", function(d) {
      return "translate(" + x(d.x0) + "," + y(d.length) + ")";
    })
    .attr("width", function(d) {
      return x(d.x1) - x(d.x0) - 1;
    })
    .attr("height", function(d) {
      return height - y(d.length);
    })
    .style("fill", "#69b3a2");
}

  • Creating the population data: 10000 randomly generated integer values between 0 and 99

var population_n = 10000;
var population_max = 100;
var population_data = [];

// Uniformly distributed integers between 0 and population_max - 1
for (var i = 0; i < population_n; i++) {
  var random_value = Math.floor(Math.random() * population_max);
  population_data.push(random_value);
}


var thresholds= [];
for (var i = 0; i <= population_max; i++) {
   thresholds.push(i);
};

var histogram = d3
    .histogram()
    .domain([0,population_max]) // then the domain of the graphic
    .thresholds(thresholds); // then the threshold
var bins = histogram(population_data);


histogram_graphic({ selector: "population", data: population_data, bins: bins});

  • Calculating the errors for each bin

// The length (count) of each bin
var lengths = bins.map(function (d) { return d.length; });
// The mean of the bin lengths
var lengths_mean = d3.mean(lengths);
console.log("Mean of the bin lengths = " + lengths_mean);
// The errors (length of each bin - mean)
// The filter drops one outlier, likely the empty last bin
// created by the threshold equal to population_max
var errors = bins
    .filter(function (d) { return Math.abs(d.length - lengths_mean) < 50; })
    .map(function (d) { return d.length - lengths_mean; });

// Plotting the errors
var errors_min = d3.min(errors);
var errors_max = d3.max(errors);
var errors_bins = d3
    .histogram()
    .domain([errors_min, errors_max]) // the range of the errors
    .thresholds(30)(errors); // approximately 30 bins
histogram_graphic({ selector: "error", data: errors, bins: errors_bins });


<h1>The Population Distribution</h1>
<p>A population was generated with pseudo-random data and has a uniform shape</p>
<p>Number of points (N) = 10000, Pseudo-Randomness Range = [0, 100), Number of bins = 100</p>
<div id="population"></div>
<h1>The Distribution of the error against the mean of each bin</h1>
<p>The error distribution should be approximately normal (as stated by the Central Limit Theorem)</p>
<div id="error"></div>

4.4 - Galton board

The Galton board is a physical model of the binomial distribution which beautifully illustrates the central limit theorem.

It is a visual proof (not a rigorous one) of the central limit theorem where:

  • The variable is whether it goes left or right. (The ball is not the variable.)
  • The randomness comes from the fact that every ball is randomly pushed left or right. The choice at each peg remains binary and random (50/50).
  • The final random variable is the bin
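
The three points above can be sketched as a small simulation. The function `galtonBoard` and the board dimensions (12 rows of pegs, 2000 balls) are hypothetical choices made for this illustration. Each ball makes one 50/50 choice per row, and its final bin is the number of times it went right.

```javascript
// Drop `ballCount` balls through `rowCount` rows of pegs.
// At each peg a ball goes right with probability 0.5; its final bin is
// the number of times it went right (from 0 to rowCount).
function galtonBoard(rowCount, ballCount) {
  var bins = new Array(rowCount + 1).fill(0);
  for (var b = 0; b < ballCount; b++) {
    var bin = 0;
    for (var r = 0; r < rowCount; r++) {
      if (Math.random() < 0.5) {
        bin++;
      }
    }
    bins[bin]++;
  }
  return bins;
}

var board = galtonBoard(12, 2000);

// Print a rough text histogram, one line per bin (one # per 20 balls)
board.forEach(function (count, index) {
  console.log(String(index).padStart(2) + " " + "#".repeat(Math.round(count / 20)));
});
```

The printed bars should pile up around the middle bins and thin out toward both ends, which is the bell shape seen on a physical board.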

More … see Galton board
