# Data Mining - Algorithms

An algorithm is a mathematical procedure for solving a specific kind of problem.

For some data mining functions, you can choose among several algorithms.

## List

| Algorithm | Function | Type | Description |
|---|---|---|---|
| Decision Tree (DT) | Classification | Supervised | Decision trees extract predictive information in the form of human-understandable rules. The rules are if-then-else expressions; they explain the decisions that lead to the prediction. |
| Generalized Linear Models (GLM) | Classification and Regression | Supervised | GLM implements logistic regression for classification of binary targets and linear regression for continuous targets. GLM classification supports confidence bounds for prediction probabilities. GLM regression supports confidence bounds for predictions. |
| Minimum Description Length (MDL) | Attribute Importance | Supervised | MDL is an information-theoretic model selection principle. MDL assumes that the simplest, most compact representation of data is the best and most probable explanation of the data. |
| Naive Bayes (NB) | Classification | Supervised | Naive Bayes makes predictions using Bayes' Theorem, which derives the probability of a prediction from the underlying evidence, as observed in the data. |
| Support Vector Machine (SVM) | Classification and Regression | Supervised | Distinct versions of SVM use different kernel functions to handle different types of data sets. Linear and Gaussian (nonlinear) kernels are supported. SVM classification attempts to separate the target classes with the widest possible margin. SVM regression tries to find a continuous function such that the maximum number of data points lie within an epsilon-wide tube around it. |
| Apriori (AP) | Association | Unsupervised | Apriori performs market basket analysis by discovering co-occurring items (frequent itemsets) within a set. Apriori finds rules with support greater than a specified minimum support and confidence greater than a specified minimum confidence. |
| k-Means (KM) | Clustering | Unsupervised | k-Means is a distance-based clustering algorithm that partitions the data into a predetermined number of clusters. Each cluster has a centroid (center of gravity). Cases (individuals within the population) that are in a cluster are close to the centroid. Oracle Data Mining supports an enhanced version of k-Means. It goes beyond the classical implementation by defining a hierarchical parent-child relationship of clusters. |
| Non-Negative Matrix Factorization (NMF) | Feature Extraction | Unsupervised | NMF generates new attributes using linear combinations of the original attributes. The coefficients of the linear combinations are non-negative. During model apply, an NMF model maps the original data into the new set of attributes (features) discovered by the model. |
| One-Class Support Vector Machine (One-Class SVM) | Anomaly Detection | Unsupervised | One-class SVM builds a profile of one class and, when applied, flags cases that are somehow different from that profile. This allows for the detection of rare cases (such as outliers) that are not necessarily related to each other. |
| Orthogonal Partitioning Clustering (O-Cluster or OC) | Clustering | Unsupervised | O-Cluster creates a hierarchical, grid-based clustering model. The algorithm creates clusters that define dense areas in the attribute space. A sensitivity parameter defines the baseline density level. |
| Maximum Entropy (MaxEnt) | Classification | Supervised | |
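
The Apriori entry relies on two measures, support and confidence. The following is a minimal Python sketch of how they can be computed and how frequent itemsets can be searched level by level. The toy `transactions`, the function names, and the inclusive threshold (`>=`) are assumptions made for illustration; this is not the implementation of any particular product.

```python
from itertools import combinations

# Hypothetical market baskets, invented for this illustration.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """support(antecedent and consequent together) / support(antecedent)."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)


def frequent_itemsets(transactions, min_support):
    """Level-wise search: size-k candidates come only from frequent size-(k-1) sets."""
    items = sorted(set().union(*transactions))
    current = [frozenset([item]) for item in items]
    frequent = {}
    k = 1
    while current:
        # Keep the candidates that reach the threshold (inclusive: an assumption).
        level = {c: support(c, transactions) for c in current}
        level = {c: s for c, s in level.items() if s >= min_support}
        frequent.update(level)
        k += 1
        # Join step: unions of two frequent sets that have exactly k items.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        current = [
            c for c in candidates
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        ]
    return frequent


result = frequent_itemsets(transactions, min_support=0.5)
```

With these baskets, `{"milk", "butter"}` appears in only one of four transactions, so it is dropped, and the three-item candidate is pruned without being counted because one of its pairs is not frequent.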

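The classical k-Means procedure described in the table (a predetermined number of clusters, each with a centroid) can be sketched as two alternating steps. This is a minimal illustration of the textbook algorithm on 2-D points, not the enhanced hierarchical version mentioned for Oracle Data Mining; the function name, the sample points, and the caller-supplied starting centroids are assumptions.

```python
import math


def kmeans(points, centroids, iterations=10):
    """Classical k-Means on tuples of coordinates, starting from given centroids."""
    centroids = [tuple(c) for c in centroids]
    clusters = []
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(
                range(len(centroids)),
                key=lambda i: math.dist(p, centroids[i]),
            )
            clusters[nearest].append(p)
        # Update step: each centroid moves to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        new_centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break  # assignments can no longer change
        centroids = new_centroids
    return centroids, clusters


# Hypothetical data: two visibly separated groups of points.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = kmeans(points, centroids=[(0, 0), (10, 10)])
```

The number of clusters is fixed by the number of starting centroids, which mirrors the "predetermined number of clusters" in the description. `math.dist` requires Python 3.8 or later.
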
Machine learning techniques:

Group method of data handling (GMDH) is a family of inductive algorithms for computer-based mathematical modeling of multi-parametric datasets that features fully automatic structural and parametric optimization of models.

## Comparison

### Weka

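In the Weka Experimenter, each classifier is compared against the baseline classifier in column (1). A `v` marks a result that is significantly better than the baseline, a `*` marks one that is significantly worse, and no mark means there is no statistically significant difference, at the chosen significance level (5% and 1% are common). The “null hypothesis” is that the two classifiers perform the same.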

```
Dataset                   (1) trees.J4 | (2) rules (3) rules (4) bayes (5) lazy. (6) funct (7) funct (8) meta.
--------------------------------------------------------------------------------------------------------------
iris                     (100)   94.73 |   92.53     33.33 *   95.53     95.40     97.07     96.27     95.40
breast-cancer            (100)   74.28 |   66.91 *   70.30     72.70     72.85     67.77 *   69.52 *   71.62
german_credit            (100)   71.25 |   65.91 *   70.00     75.16 v   71.88     75.24 v   75.09 v   71.27
pima_diabetes            (100)   74.49 |   71.52     65.11 *   75.75     70.62     77.47     76.80     74.92
Glass                    (100)   67.63 |   57.40 *   35.51 *   49.45 *   69.95     62.84     57.36 *   44.89 *
ionosphere               (100)   89.74 |   82.28 *   64.10 *   82.17 *   87.10     87.72     88.07     90.89
--------------------------------------------------------------------------------------------------------------
(v/ /*) |   (0/2/4)   (0/2/4)   (1/3/2)   (0/6/0)   (1/4/1)   (1/3/2)   (0/5/1)

Key:
(1) trees.J48 '-C 0.25 -M 2' -217733168393644444
(2) rules.OneR '-B 6' -3459427003147861443
(3) rules.ZeroR '' 48055541465867954
(4) bayes.NaiveBayes '' 5995231201785697655
(5) lazy.IBk '-K 1 -W 0 -A \"weka.core.neighboursearch.LinearNNSearch -A \\\"weka.core.EuclideanDistance -R first-last\\\"\"' -3080186098777067172
(6) functions.Logistic '-R 1.0E-8 -M -1' 3932117032546553727
(7) functions.SMO '-C 1.0 -L 0.001 -P 1.0E-12 -N 0 -V -1 -W 1 -K \"functions.supportVector.PolyKernel -C 250007 -E 1.0\"' -6585883636378691736
(8) meta.AdaBoostM1 '-P 100 -S 1 -I 10 -W trees.DecisionStump' -7378107808933117974
```
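
The idea of testing the null hypothesis that two classifiers perform the same can be sketched with a plain paired t statistic over per-run accuracies. This is a simplified illustration only: the accuracy values below are invented, the function name is hypothetical, and Weka's own tester may apply a correction for overlapping resamples that this sketch omits.

```python
import math
import statistics


def paired_t_statistic(scores_a, scores_b):
    """t statistic of the per-run differences between two classifiers."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    spread = statistics.stdev(diffs)  # sample standard deviation
    if spread == 0:
        raise ValueError("all differences are identical; t is undefined")
    return statistics.mean(diffs) / (spread / math.sqrt(len(diffs)))


# Hypothetical accuracies of two classifiers on the same five runs.
baseline = [0.90, 0.92, 0.91, 0.93, 0.89]
other = [0.85, 0.88, 0.86, 0.90, 0.86]

t = paired_t_statistic(baseline, other)

# Two-sided 5% critical value of Student's t with 4 degrees of freedom.
CRITICAL_T_5_PERCENT_DF4 = 2.776
significant = abs(t) > CRITICAL_T_5_PERCENT_DF4
```

When `significant` is true, the null hypothesis is rejected at the 5% level, and the sign of `t` tells which classifier did better. In the table above, that outcome corresponds to a `v` or `*` mark next to a result.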
