About
Resampling methods are a class of methods that estimate the test error by holding out a subset of the training set from the fitting process, and then applying the statistical learning method to those held-out observations.
For supervised learning problems, separate data sets are required for building (training) and for testing the predictive models.
The training error is too optimistic: the more closely we fit the model to the data, the lower the training error becomes, but the test error can get higher if we overfit, and it often will. For this reason, models are fitted on part of the data and then evaluated on a holdout set (test set).
Generally, the Build Activity splits the data into two mutually exclusive subsets:
- the training set for building the model. Performance estimates obtained on the training set are overly optimistic because of overfitting.
- the test set for testing the model. The model is evaluated on this separate test set.
However, if the data is already split into Build and Test subsets, you can run a Build activity and specify the Split step.
Of course, the build data (training data) and test data must have the same column structure.
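A mutually exclusive build/test split of this kind can be sketched as follows; the data set, the 60/40 proportion, and the column count are all illustrative assumptions, not part of the original.

```python
import numpy as np

# Hypothetical dataset: 100 rows, 3 feature columns (purely illustrative).
rng = np.random.default_rng(seed=0)
data = rng.normal(size=(100, 3))

# Shuffle row indices, then cut into two mutually exclusive subsets.
indices = rng.permutation(len(data))
n_train = int(0.6 * len(data))          # 60% for building the model
train = data[indices[:n_train]]
test = data[indices[n_train:]]

# Both subsets keep the same column structure, as required.
assert train.shape[1] == test.shape[1]
print(train.shape, test.shape)          # (60, 3) (40, 3)
```

Because the index sets are disjoint, no row appears in both subsets, so the test rows play no part in fitting.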
Methods
Practical rule of thumb:
- Lots of data? – use percentage_split
Resampling methods:
- Cross-validation is a very important tool to get a good idea of the test set error of a model.
- Bootstrap, on the other hand, is most useful to get an idea of the variability or standard deviation of an estimate and its bias.
Two-fold Validation
Randomly divide the available set of samples into two parts: a training set and a validation (hold-out) set.
Percentage Split
Percentage Split (Fixed or Holdout): leave out a random N% of the data. For example, you might select 60% of the rows for building the model and 40% for testing it. The algorithm is trained on the training portion, and the accuracy is calculated on the held-out portion.
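A minimal percentage-split sketch using the 60/40 example above; the toy data and the majority-class "model" are assumptions made only to keep the example self-contained.

```python
import random

# Toy labelled data: (feature, label) pairs (illustrative).
random.seed(1)
data = [(x, int(x > 0.5)) for x in [random.random() for _ in range(50)]]

# Percentage split: 60% of the rows build the model, 40% test it.
random.shuffle(data)
cut = int(0.6 * len(data))
build, holdout = data[:cut], data[cut:]

# Stand-in "model": predict the majority label seen in the build set.
labels = [y for _, y in build]
majority = max(set(labels), key=labels.count)

# Accuracy is computed on the held-out 40%, not on the build rows.
accuracy = sum(y == majority for _, y in holdout) / len(holdout)
print(round(accuracy, 2))
```

Swapping the majority-class stand-in for a real learner changes nothing about the split itself: the holdout rows stay unseen during building.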
Cross-validation
Variants:
- k-fold Cross-Validation: select k folds without replacement
- Leave-One-Out Cross-Validation (the special case where k equals the number of samples)
Bootstrap
Related: Bootstrap: generate new training sets by sampling with replacement.
The bootstrap is a very clever device for using the one training sample you have to estimate quantities such as the standard deviation of an estimate.
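A short sketch of that idea: resample the single sample with replacement many times, recompute the statistic each time, and take the spread of those replicates as the standard-error estimate. The data, the sample size, and the replicate count B are all illustrative assumptions.

```python
import random
import statistics

# The one training sample we have (hypothetical measurements).
random.seed(42)
sample = [random.gauss(10, 2) for _ in range(30)]

# Bootstrap: draw with replacement, recompute the statistic each time.
B = 1000
boot_means = [
    statistics.mean(random.choices(sample, k=len(sample)))  # with replacement
    for _ in range(B)
]

# The standard deviation of the bootstrap replicates estimates the
# standard error of the sample mean, using only this one sample.
se_mean = statistics.stdev(boot_means)
print(round(se_mean, 3))
```

The same loop, with a different statistic inside it, gives variability and bias estimates for medians, correlations, or model coefficients.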