Data Mining - Noise (Unwanted variation)



Information from all past experience can be divided into two groups:

  • information that is relevant for the future (“signal”)
  • information that is irrelevant (“noise”).

In many cases the factors causing the unwanted variation are unknown and must be inferred from the data.

Noise can be seen as the result of:

Noise is usually estimated by calculating the prediction error.
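As a minimal sketch of estimating noise through the prediction error, the snippet below generates data from a known signal plus random noise, fits a straight line, and treats the residuals (observed minus predicted) as the noise estimate. The linear true_function, the noise level of 0.5 and the helper fit_line are assumptions made for illustration only; they do not come from this page.

```python
import random
import statistics


def true_function(x):
    # Hypothetical "signal": a simple linear relationship chosen for illustration.
    return 2.0 * x + 1.0


def fit_line(xs, ys):
    # Ordinary least squares with a single predictor.
    mean_x = statistics.fmean(xs)
    mean_y = statistics.fmean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


rng = random.Random(0)   # fixed seed so the run is repeatable
noise_sd = 0.5           # assumed standard deviation of the noise
xs = [i / 10 for i in range(200)]
ys = [true_function(x) + rng.gauss(0.0, noise_sd) for x in xs]

slope, intercept = fit_line(xs, ys)

# Prediction error (residual) = observed value - predicted value.
residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]

# The spread of the residuals is an estimate of the noise level.
estimated_noise_sd = statistics.stdev(residuals)

print(f"fitted slope:       {slope:.3f}")
print(f"fitted intercept:   {intercept:.3f}")
print(f"estimated noise sd: {estimated_noise_sd:.3f}")
```

When the fitted model is close to the true function, the spread of the residuals should be close to the noise level used to generate the data. If the model is wrong (for example, a line fitted to a curve), the residuals also contain unexplained signal, so they overstate the noise.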



All information contains random noise that is related to the data collection process.

Example: GPS readings 'jump around', though they always remain within a few meters of the real position.
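The GPS example can be simulated in a few lines. This is only a sketch under stated assumptions: a one-dimensional position, a hypothetical true position of 100 m, and jitter drawn uniformly within 3 m of it. Real GPS error is not necessarily uniform or unbiased.

```python
import random
import statistics

rng = random.Random(42)   # fixed seed so the run is repeatable

true_position_m = 100.0   # hypothetical real position (1-D, in meters)
max_jitter_m = 3.0        # assumed bound on the measurement noise

# Each reading is the real position plus random measurement noise.
readings = [
    true_position_m + rng.uniform(-max_jitter_m, max_jitter_m)
    for _ in range(500)
]

# Error of every individual reading, and the largest one in absolute value.
errors = [r - true_position_m for r in readings]
worst_error_m = max(abs(e) for e in errors)

# Averaging many readings cancels much of the random noise.
mean_reading_m = statistics.fmean(readings)
mean_error_m = abs(mean_reading_m - true_position_m)

print(f"worst single-reading error: {worst_error_m:.2f} m")
print(f"error of the averaged reading: {mean_error_m:.2f} m")
```

Individual readings stay within the assumed bound but differ from one another, while their average lands much closer to the real position. This only works because the simulated noise is random and centered on the true value; a systematic offset (bias) would not average out.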


See Statistics - Bias (Sampling error)


