# Data Mining - Outliers Cases

Outliers are cases that are unusual because they fall outside the distribution that is considered normal for the data.

The distance from the centre of a normal distribution indicates how typical a given point is with respect to the distribution of the data. Each case can be ranked according to the probability that it is either typical or atypical.
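The ranking described above can be sketched as follows. This is a minimal illustration, assuming the data is roughly normal: it uses the z-score as the distance from the centre and the two-sided normal tail probability as the measure of how atypical a case is. The function name `rank_by_atypicality` and the income figures are hypothetical.

```python
import math
import statistics


def rank_by_atypicality(values):
    """Return (value, z_score, tail_probability) tuples, most atypical first.

    Assumes an approximately normal distribution (an assumption of this sketch).
    """
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        raise ValueError("all values are identical; no case is atypical")
    ranked = []
    for v in values:
        z = (v - mean) / std
        # Two-sided tail probability under a normal model: P(|Z| >= |z|)
        tail = math.erfc(abs(z) / math.sqrt(2))
        ranked.append((v, z, tail))
    # The smaller the tail probability, the more atypical the case
    return sorted(ranked, key=lambda item: item[2])


# Hypothetical household incomes, for illustration only
incomes = [62_000, 68_000, 70_000, 71_000, 75_000, 200_000]
ranked = rank_by_atypicality(incomes)
for value, z, tail in ranked:
    print(f"{value:>8} z={z:+.2f} tail probability={tail:.3f}")
```

The first entry of `ranked` is the case furthest from the centre, and the last is the most typical one.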

The presence of outliers can have a deleterious effect on many forms of data mining. Anomaly detection can be used to identify outliers before mining the data.

In a multidimensional dataset, outliers may only appear when several dimensions are considered together, whereas in any single dimension they may lie close to the mean or median.
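One way to illustrate this is the Mahalanobis distance, which measures the distance from the centre while taking the correlation between dimensions into account. The article does not name this technique; it is used here only as an assumed example. The helper `mahalanobis_2d` and the data points are hypothetical, and the sketch is restricted to two dimensions so that the covariance matrix can be inverted by hand.

```python
import math


def mahalanobis_2d(points):
    """Return the Mahalanobis distance of each (x, y) point from the centroid."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    # Population variances and covariance
    sxx = sum((x - mx) ** 2 for x, _ in points) / n
    syy = sum((y - my) ** 2 for _, y in points) / n
    sxy = sum((x - mx) * (y - my) for x, y in points) / n
    det = sxx * syy - sxy ** 2  # determinant of the 2x2 covariance matrix
    distances = []
    for x, y in points:
        dx, dy = x - mx, y - my
        # d^2 = [dx dy] * inverse(covariance) * [dx dy]^T
        d2 = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
        distances.append(math.sqrt(d2))
    return distances


# Hypothetical data: y follows x closely, except for the last point (3, 7)
points = [(1, 1.1), (2, 1.9), (3, 3.2), (4, 3.8), (5, 5.1),
          (6, 5.9), (7, 7.0), (8, 8.1), (9, 9.0), (3, 7)]
distances = mahalanobis_2d(points)
most_unusual = points[distances.index(max(distances))]
print(most_unusual)
```

In this made-up data, the point (3, 7) has an x value and a y value that are each well inside the observed range, yet the pair breaks the relationship between the two dimensions, so it gets the largest distance.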

## Example

For example, census data might show:

• a median household income of 70,000
• and a mean household income of 80,000,

but one or two households might have an income of 200,000. These cases would probably be identified as outliers.

## How to

### find them

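A common rule of thumb treats cases that lie more than three standard deviations from the mean as outliers. In a normal distribution, about 99.7% of the data falls within three standard deviations of the mean, so only about 0.3% falls beyond that threshold.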
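The three-standard-deviation rule can be sketched as below. This is a minimal illustration, not a definitive implementation: the function name `three_sigma_outliers`, the parameter `k`, and the income values are assumptions made for the example.

```python
import statistics


def three_sigma_outliers(values, k=3.0):
    """Return the values lying more than k standard deviations from the mean."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return []  # all values identical: nothing can be flagged
    return [v for v in values if abs(v - mean) > k * std]


# Hypothetical incomes: 40 households between 60,000 and 79,500,
# plus two households at 200,000
incomes = [60_000 + 500 * i for i in range(40)] + [200_000, 200_000]
outliers = three_sigma_outliers(incomes)
print(outliers)
```

Note that the mean and the standard deviation are themselves pulled by extreme values, so in very small samples a single extreme case can inflate the standard deviation enough to escape the threshold.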
