Statistics Learning - Prediction Error (Training versus Test)

Prediction error reflects the effect of noise through the contrast between training error and test error.

We fit our model to the training set, and then we apply it to new data that the model hasn't seen.

In general, more data means a larger sample size and more information, and therefore a lower error.
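This effect can be illustrated with a minimal pure-Python sketch. The data-generating process (y = 2x + 1 plus Gaussian noise) and the sample sizes are made up for illustration; the point is that the average test error of a model fit on a larger training set tends to be lower:

```python
import random

rng = random.Random(0)

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = a + b*x (closed form)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return my - b * mx, b  # intercept a, slope b

def avg_test_mse(n_train, trials=200):
    """Average test MSE over many random training sets of size n_train."""
    total = 0.0
    for _ in range(trials):
        xs = [rng.uniform(0, 10) for _ in range(n_train)]
        ys = [2 * x + 1 + rng.gauss(0, 1) for x in xs]
        a, b = fit_line(xs, ys)
        tx = [rng.uniform(0, 10) for _ in range(100)]   # fresh, unseen data
        ty = [2 * x + 1 + rng.gauss(0, 1) for x in tx]
        total += sum((y - (a + b * x)) ** 2 for x, y in zip(tx, ty)) / len(tx)
    return total / trials

err_small = avg_test_mse(n_train=5)    # little data: noisier fit, higher test error
err_large = avg_test_mse(n_train=200)  # more data: test error closer to the noise floor
```

Averaging over many trials smooths out the luck of any single train/test split, so the comparison reflects the sample-size effect rather than chance.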


Error calculation:


Most of the regression metrics are based on the residual; see Regression Accuracy metrics for a list.


Most of the classification metrics are based on the error rate; see Classification Accuracy metrics for a list.
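A short sketch of both building blocks (the numbers are made-up toy values): the residual is the difference between observed and predicted values, and the error rate is the fraction of misclassified points.

```python
from math import sqrt

# Regression: metrics are built on the residual (observed - predicted)
y_true = [3.0, 5.0, 7.0]
y_pred = [2.0, 5.0, 9.0]
residuals = [t - p for t, p in zip(y_true, y_pred)]        # [1.0, 0.0, -2.0]
rmse = sqrt(sum(r ** 2 for r in residuals) / len(residuals))

# Classification: metrics are built on the error rate (fraction misclassified)
labels = [1, 0, 1, 1]
preds  = [1, 1, 1, 0]
error_rate = sum(l != p for l, p in zip(labels, preds)) / len(labels)  # 2/4 = 0.5
```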



Training error is the error we get when applying the model to the same data on which it was trained.


Test error is the error that we incur on new data; it measures how well we'll do on future data the model hasn't seen.

Training vs Test

Training error almost always UNDERestimates test error, sometimes dramatically.

Training error usually UNDERestimates test error when the model is very complex (compared to the training-set size), and is a fairly good estimate when the model is not very complex. However, it's always possible that the split puts too few hard-to-predict points in the test set, or too many in the training set. In that case the test error can be LESS than the training error, simply because the test set happens to contain easier cases.
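The typical case, where a complex model's training error badly underestimates its test error, can be sketched with NumPy (the sine-plus-noise data and the degree-9 polynomial are illustrative choices, not part of any particular method):

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy(x):
    """True signal sin(x) plus Gaussian noise (illustrative data)."""
    return np.sin(x) + rng.normal(0, 0.3, size=x.shape)

x_train = np.linspace(0, 3, 12)
y_train = noisy(x_train)

# Degree-9 polynomial: very flexible relative to only 12 training points
coefs = np.polyfit(x_train, y_train, deg=9)

x_test = np.linspace(0, 3, 100)   # fresh data the model hasn't seen
y_test = noisy(x_test)

train_mse = np.mean((np.polyval(coefs, x_train) - y_train) ** 2)
test_mse = np.mean((np.polyval(coefs, x_test) - y_test) ** 2)
# train_mse is near zero because the flexible fit chases the noise;
# test_mse stays at or above the noise variance
```

With 10 coefficients and 12 points, the fit nearly interpolates the training data, so the training error is close to zero while the test error cannot drop below the irreducible noise.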

Discover More

(Statistics|Data Mining) - (K-Fold) Cross-validation (rotation estimation)
Data Mining - (Test|Expected|Generalization) Error
Data Mining - Noise (Unwanted variation)
Data Mining - Root mean squared (Error|Deviation) (RMSE|RMSD)
Data Mining - Test Set
Data Mining - Training Error
Machine Learning - (Overfitting|Overtraining|Robust|Generalization) (Underfitting)
Machine Learning - Linear (Regression|Model)
Model Building - ReSampling Validation
R - Feature Selection - Model selection with Direct validation (Validation Set or Cross validation)
