Statistics - Assumptions underlying correlation and regression analysis (Never trust summary statistics alone)


About

The magnitude of a correlation depends upon many factors, including:

Anscombe's quartet

In 1973, statistician Dr. Frank Anscombe developed a classic example to illustrate several of the assumptions underlying correlation and linear regression.

The scatter plots below have the same correlation coefficient and the same regression line.

They also have the same mean and variance.

<MATH> Y = 3 + 0.5 X </MATH>

Only the first one, on the upper left, satisfies the assumptions underlying correlation and linear regression.

Figure: Anscombe's quartet (four scatter plots)
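
The identical summary statistics can be checked numerically. The following is a minimal sketch, assuming the commonly published values of the four Anscombe datasets; the names `QUARTET` and `summarize` are hypothetical and exist only for this illustration.

```python
from statistics import mean, variance

# Commonly published values of Anscombe's quartet (assumed here, not given in the text).
X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
QUARTET = {
    "I": (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def summarize(x, y):
    """Return mean, sample variance, Pearson r and least-squares line for (x, y)."""
    mx, my = mean(x), mean(y)
    # Sample covariance (n - 1 in the denominator, matching statistics.variance).
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)
    var_x, var_y = variance(x), variance(y)
    slope = cov / var_x
    return {
        "mean_x": mx,
        "mean_y": my,
        "var_x": var_x,
        "var_y": var_y,
        "r": cov / (var_x * var_y) ** 0.5,
        "slope": slope,
        "intercept": my - slope * mx,
    }


for name, (x, y) in QUARTET.items():
    stats = summarize(x, y)
    print(name, {key: round(value, 2) for key, value in stats.items()})
```

Rounded to two decimals, the four rows should look nearly identical, even though the scatter plots do not. That is the point of the quartet: the numbers alone cannot tell the datasets apart.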

Datasaurus: Never trust summary statistics alone; always visualize your data


The Datasaurus Dozen. While different in appearance, each dataset has the same summary statistics (mean, standard deviation, and Pearson's correlation) to two decimal places.


Bring your own doodles linear regression

Most examples of linear regression just show a regression line drawn over some dataset. It is much more fun to understand it by drawing your own data points.

How to

test the assumptions in a regression analysis?

To test the assumptions in a regression analysis, we look at the residuals as a function of the predictor variable X (X stays on the x-axis and the residuals go on the y-axis).

For each individual, the residual is calculated as the difference between the actual score and the predicted score.
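
In symbols, and assuming the common sign convention of actual minus predicted (the text only says "difference"), the residual of individual i can be written as:

<MATH> e_i = Y_i - \hat{Y}_i </MATH>

where <MATH> \hat{Y}_i </MATH> is the score predicted by the regression line for the value <MATH> X_i </MATH>.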

If the assumptions hold, there must be:

  • no relationship between X and the residuals. They must be independent: the correlation coefficient between them must be zero and no systematic pattern (such as a curve) should remain.
  • points scattered both above and below zero, with a roughly constant spread across the whole range of X. This indicates homoscedasticity.
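
The residual check can be sketched in a few lines. This is only an illustration under stated assumptions: it uses the commonly published values of the first two Anscombe datasets, and the helper names (`pearson`, `fit_line`, `residuals`, `centered_squares`) are hypothetical. One caveat worth noting: with a least-squares line that includes an intercept, the linear correlation between X and the residuals is zero by construction, so the sketch also correlates the residuals with the squared, centered X as a crude probe for a leftover curve. That probe is an assumption of this example, not a method taken from the text.

```python
from statistics import mean

# Commonly published Anscombe datasets I (roughly linear) and II (curved); assumed values.
X = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
Y_LINEAR = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
Y_CURVED = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]


def pearson(a, b):
    """Pearson correlation coefficient between two equal-length sequences."""
    ma, mb = mean(a), mean(b)
    cov = sum((p - ma) * (q - mb) for p, q in zip(a, b))
    ss_a = sum((p - ma) ** 2 for p in a)
    ss_b = sum((q - mb) ** 2 for q in b)
    return cov / (ss_a * ss_b) ** 0.5


def fit_line(x, y):
    """Least-squares slope and intercept for y = intercept + slope * x."""
    mx, my = mean(x), mean(y)
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    return slope, my - slope * mx


def residuals(x, y):
    """Actual minus predicted score for every individual."""
    slope, intercept = fit_line(x, y)
    return [b - (intercept + slope * a) for a, b in zip(x, y)]


def centered_squares(x):
    """Squared distance of each x from the mean of x (the curvature probe)."""
    mx = mean(x)
    return [(a - mx) ** 2 for a in x]


for label, y in (("linear", Y_LINEAR), ("curved", Y_CURVED)):
    res = residuals(x=X, y=y)
    above = sum(e > 0 for e in res)
    print(label,
          "r(X, residuals) =", round(pearson(X, res), 3),
          "| r(X^2 probe, residuals) =", round(pearson(centered_squares(X), res), 3),
          "| points above zero:", above, "of", len(res))
```

In practice the residuals would also be plotted against X, since a plot shows patterns and changes in spread that a single coefficient can miss.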




