Data Mining - Noise (Unwanted variation)

Information from all past experience can be divided into two groups:

• information that is relevant for the future (“signal”)
• information that is irrelevant (“noise”).

In many cases the factors causing the unwanted variation are unknown and must be inferred from the data.

Noise can be seen as the result of random errors and unknown factors introduced during data collection.

Noise is typically estimated by calculating the prediction error: the difference between a model's predictions and the observed values.
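The idea of estimating noise from the prediction error can be sketched as follows. This is a minimal, hypothetical example: it assumes a true linear relationship y = 2x + 1, adds Gaussian noise to simulate observations, fits a line by ordinary least squares, and recovers the noise level from the residuals.

```python
import random

random.seed(42)

# Hypothetical true (noise-free) relationship: y = 2x + 1
xs = [x / 10 for x in range(100)]
truth = [2 * x + 1 for x in xs]

# Observations = signal + random noise (standard deviation 0.5)
ys = [t + random.gauss(0, 0.5) for t in truth]

# Fit a line by ordinary least squares (closed form for one predictor)
n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
        sum((x - mean_x) ** 2 for x in xs)
intercept = mean_y - slope * mean_x

# The prediction error (residuals) approximates the noise
residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]
noise_std = (sum(r * r for r in residuals) / n) ** 0.5
print(round(noise_std, 2))  # close to the simulated noise level of 0.5
```

The residual standard deviation recovers the noise level because the fitted line captures the signal (the part of the variation that generalizes), leaving the irrelevant variation in the residuals.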

Type

Random

All collected data contains some random noise introduced by the data collection process.

Example: GPS readings 'jump around', though they always remain within a few meters of the true position.
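The GPS example can be simulated with a small sketch. The true position, the jitter scale, and the degree-to-meter conversion below are all hypothetical assumptions chosen for illustration: each reading is the true position plus a random Gaussian offset of a few meters.

```python
import random

random.seed(0)

true_position = (48.8584, 2.2945)  # hypothetical true latitude/longitude

def gps_reading(pos, jitter_m=2.0):
    """Return one noisy reading: the true position plus a random offset."""
    # Roughly 1e-5 degrees of latitude corresponds to about one meter
    meter_deg = 1e-5
    return (pos[0] + random.gauss(0, jitter_m * meter_deg),
            pos[1] + random.gauss(0, jitter_m * meter_deg))

# Individual readings "jump around" the true position...
readings = [gps_reading(true_position) for _ in range(1000)]

# ...but because the noise is random, averaging many readings
# cancels it out and recovers something close to the truth.
avg = (sum(r[0] for r in readings) / len(readings),
       sum(r[1] for r in readings) / len(readings))
```

This illustrates the defining property of random noise: it carries no information about the future, so it averages away, while the signal (the true position) remains.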

