# Data Mining - (Parameters | Model) (Accuracy | Precision | Fit | Performance) Metrics

Accuracy is an evaluation metric of how well a model performs.

Normal accuracy metrics are not appropriate for evaluating methods for rare event detection.

## Problem type

### Regression

#### Parameters

• Hypothesis testing: t-statistic and p-value. The p-value and the t-statistic measure how strong the evidence is that there is a non-zero association. Even a weak effect can be extremely significant given enough data.
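The last point can be illustrated with a small sketch. For a simple linear regression, the t-statistic of the slope can be written in terms of the sample correlation $r$ and the sample size $n$ as $t = r\sqrt{(n-2)/(1-r^2)}$. The correlation value and the two sample sizes below are hypothetical and chosen only for illustration.

```python
import math


def t_statistic_from_correlation(r: float, n: int) -> float:
    """t-statistic for a zero-slope test in simple linear regression,
    expressed through the sample correlation r and the sample size n."""
    return r * math.sqrt((n - 2) / (1 - r * r))


weak_r = 0.05  # hypothetical weak association

# Same weak effect, two hypothetical sample sizes.
t_small = t_statistic_from_correlation(weak_r, 100)
t_large = t_statistic_from_correlation(weak_r, 1_000_000)

print(f"n = 100:       t = {t_small:.2f}")
print(f"n = 1,000,000: t = {t_large:.2f}")
```

With 100 observations the t-statistic is about 0.5, far from any usual significance threshold, while with a million observations the same weak correlation gives a t-statistic of about 50.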

#### Model

How good is the overall fit of the model, that is, its accuracy?

$R$ is the correlation between predicted and observed scores, whereas $R^2$ is the proportion of variance in $Y$ explained by the regression model.
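As a minimal sketch, both quantities can be computed by hand for a least-squares line. The data points and the helper names (`pearson_r`, `r_squared`) are assumptions made for this example. For an ordinary least-squares fit with an intercept, the squared correlation between predicted and observed values equals $R^2$.

```python
import math


def mean(values):
    return sum(values) / len(values)


def pearson_r(xs, ys):
    """Pearson correlation between two equally long sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def r_squared(observed, predicted):
    """Proportion of variance explained: 1 - SS_residual / SS_total."""
    m = mean(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - m) ** 2 for o in observed)
    return 1 - ss_res / ss_tot


# Hypothetical data, for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

# Simple least-squares fit of y on x.
mx, my = mean(x), mean(y)
slope = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sum(
    (xi - mx) ** 2 for xi in x
)
intercept = my - slope * mx
predicted = [intercept + slope * xi for xi in x]

r = pearson_r(predicted, y)
r2 = r_squared(y, predicted)
print(f"R = {r:.4f}, R^2 = {r2:.4f}")
```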

#### Error

List of several error calculations:

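• The squared error. The squared error is the sum of the squared differences between the actual values and the predicted values.

$$\text{Squared error}= \sum_{i=1}^{n} \left (x^{(i)} - \sum_{j=0}^{k}{w_j} \cdot {a_j^{(i)}} \right)^2$$

where $x^{(i)}$ is the actual value of instance $i$ and the predicted value is the weighted sum of its attribute values $a_j^{(i)}$ with weights $w_j$.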

• Mean Absolute Error, where $p_i$ is the predicted value and $a_i$ the actual value of instance $i$:

$$\text{Mean Absolute Error}= \frac{|p_1-a_1|+\dots+|p_n-a_n|}{n}$$

• Relative absolute error, where $\bar{a}$ is the mean of the actual values:

$$\text{Relative absolute error}= \frac{|p_1-a_1|+\dots+|p_n-a_n|}{|a_1-\bar{a}|+\dots+|a_n-\bar{a}|}$$

• Root relative squared error:

$$\text{Root relative squared error}= \sqrt{\frac{(p_1-a_1)^2+\dots+(p_n-a_n)^2}{(a_1-\bar{a})^2+\dots+(a_n-\bar{a})^2}}$$
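The error measures above can be sketched in plain Python as follows. The function names and the sample values are assumptions made for this illustration. The functions take the predicted values directly, so the squared error here does not recompute the prediction as a weighted sum of attributes.

```python
import math


def squared_error(predicted, actual):
    """Sum of squared differences between actual and predicted values."""
    return sum((a - p) ** 2 for p, a in zip(predicted, actual))


def mean_absolute_error(predicted, actual):
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)


def relative_absolute_error(predicted, actual):
    """Total absolute error relative to that of always predicting the mean."""
    a_bar = sum(actual) / len(actual)
    numerator = sum(abs(p - a) for p, a in zip(predicted, actual))
    denominator = sum(abs(a - a_bar) for a in actual)
    return numerator / denominator


def root_relative_squared_error(predicted, actual):
    """Square root of the squared error relative to that of the mean predictor."""
    a_bar = sum(actual) / len(actual)
    numerator = sum((p - a) ** 2 for p, a in zip(predicted, actual))
    denominator = sum((a - a_bar) ** 2 for a in actual)
    return math.sqrt(numerator / denominator)


# Hypothetical values, for illustration only.
actual = [3.0, 5.0, 7.0, 9.0]
predicted = [2.5, 5.5, 7.0, 10.0]

se = squared_error(predicted, actual)
mae = mean_absolute_error(predicted, actual)
rae = relative_absolute_error(predicted, actual)
rrse = root_relative_squared_error(predicted, actual)
print(se, mae, rae, round(rrse, 4))
```

The two relative measures compare the model with a predictor that always returns the mean of the actual values, so a value below 1 means the model does better than that simple predictor.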

### Classification

The accuracy metrics are calculated with the help of a confusion matrix.

#### Accuracy

$$\begin{array}{rcl} \text{Accuracy} & = & \frac{\text{Number of correct predictions}}{\text{Total of all cases to be predicted}} \\ & = & \frac{a + d}{a + b + c + d} \end{array}$$

where $a$ and $d$ count the correct predictions (the diagonal of the confusion matrix) and $b$ and $c$ count the misclassified cases.

Accuracy is not a reliable metric for the real performance of a classifier when the number of samples varies greatly between classes (unbalanced target), because it yields misleading results.

For example, if there were 95 cats and only 5 dogs in the data set, the classifier could easily be biased into classifying all the samples as cats. The overall accuracy would be 95%, yet the classifier would have a 100% recognition rate for the cat class and a 0% recognition rate for the dog class.
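The cat and dog example can be reproduced with a short sketch. The helper names `accuracy` and `recognition_rate` are assumptions made for this illustration, and the classifier is simulated by a list that predicts "cat" for every sample.

```python
def accuracy(actual, predicted):
    """Fraction of samples whose predicted label matches the actual label."""
    correct = sum(1 for a, p in zip(actual, predicted) if a == p)
    return correct / len(actual)


def recognition_rate(actual, predicted, label):
    """Fraction of the samples of class `label` that were predicted as `label`."""
    total = sum(1 for a in actual if a == label)
    hits = sum(1 for a, p in zip(actual, predicted) if a == label and p == label)
    return hits / total


# 95 cats and 5 dogs, and a biased classifier that always answers "cat".
actual = ["cat"] * 95 + ["dog"] * 5
predicted = ["cat"] * 100

overall = accuracy(actual, predicted)
cat_rate = recognition_rate(actual, predicted, "cat")
dog_rate = recognition_rate(actual, predicted, "dog")

print(f"accuracy: {overall:.0%}, cat: {cat_rate:.0%}, dog: {dog_rate:.0%}")
```

Looking at the per-class rates next to the overall accuracy makes the bias visible, which the single accuracy figure hides.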

The per-class (error|misclassification) rates are good complementary metrics to overcome this problem.

#### Null Rate

The null rate is the accuracy of the baseline classifier.

The baseline accuracy must always be checked before choosing a sophisticated classifier. (Simplicity first)

Accuracy isn’t enough. An accuracy of 90% needs to be interpreted against a baseline accuracy.

A baseline accuracy is the accuracy of a simple classifier, such as one that always predicts the most frequent class.

If the baseline accuracy is better than the accuracy of every algorithm tried, the attributes are not really informative.
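A minimal sketch of this check follows. It assumes that the baseline is a classifier that always predicts the most frequent class, which is one common choice of simple classifier and not the only one. The function name `null_rate`, the label counts and the 90% model accuracy are hypothetical.

```python
from collections import Counter


def null_rate(labels):
    """Accuracy of a baseline that always predicts the most frequent class
    (assumed baseline for this sketch)."""
    most_common_count = Counter(labels).most_common(1)[0][1]
    return most_common_count / len(labels)


# Hypothetical unbalanced target.
labels = ["cat"] * 95 + ["dog"] * 5
baseline = null_rate(labels)

# Hypothetical accuracy reported by a more sophisticated classifier.
model_accuracy = 0.90
beats_baseline = model_accuracy > baseline

print(f"baseline: {baseline:.0%}, model: {model_accuracy:.0%}, "
      f"better than baseline: {beats_baseline}")
```

In this setting a 90% accuracy is worse than the 95% obtained by always answering with the majority class, so the number alone says little about the quality of the model.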

## Glossary

### True

The true accuracy is the accuracy calculated on the entire data set (no data set split).
