# Statistics - (Discretizing|binning) (bin)

Discretization is the process of transforming numeric variables into nominal variables called bin.

The created variables are nominal but are ordered (which is a concept that you will not find in true nominal variable) and algorithms can exploit this ordering information.

## Characteristics / Problem Definition

• Split-point is a number and there are infinitely many numbers.
• finding a set of cut points to partition the range into a small number of intervals
• minimizing the number of intervals without significant loss of information

## Why?

• Discretization can be useful when creating probability mass functions – formally, in density estimation.
• Continuous explanatory attributes are not so easy to handle because they can take so many different values that is difficult to make a direct estimation of frequencies.
• A large number of Machine Learning and statistical techniques can only be applied to data sets composed entirely of nominal variables.

## Type

### Unsupervised

Unsupervised Discretization does not take into account the class.

There is two basic methods where the is discretized into K bin of:

• equal width (lengths, range value)
• equal frequencies (% of the total data, same number of observations per bin)

Equal-frequency binning is sensitive to the data distribution, which will probably make it perform better.

Unsupervised Discretization is usually performed prior to the learning process and it can be broken into tasks that must find.

• the number of discrete intervals.
• the boundaries of the intervals
• which method to use

They are experimental questions and there is no universally best method. It depends on the data.

#### Equal-width

The range of the numeric attribute is chopped into a certain number of equal parts, or bins. Wherever a numeric value falls into a bin, we take the bin name as the discretized version of the numeric value. The number of data points between bins may vary.

#### Equal-frequency

The algorithm will try to make the number of data points in each bin equal. It will adjust the size to make the number of instances that fall into each bin approximately the same.

Bin with the same frequency are quantile.

#### Using the ordering relationship

There's an ordering relationship in a continuous variable. However, when we discretize it into different bins, we are losing this information.

Which can be a problem in a tree: Before discretization, if we add a test such as “is x<v?”, after discretization, to get the equivalent test, we would need to ask “is y=a?”, “is y=b?”, “is y=c?” and replicate the tree underneath each of these nodes. That's clearly inefficient and is likely to lead to bad results.

Instead of discretizing into five different values a to e, we can discretize into four different binary attributes, k-1 binary attributes. The first attribute here says whether the value v is in this range, and the second attribute, z2, says whether it's in this range, a or b. The third, z3, says whether it's in this range, a, b, or c. The fourth says whether it's in the first four ranges.

If in our tree we have a test “is x<v?”, then if x is less than v, then z1, z2, and z3 are true and z4 is false. So an equivalent test on the binary attributes is “is z3=true?”.

The binary attributes include information about the ordering of the original numeric attribute values.

### Supervised (Highest gain)

Supervised discretization is about taking the class into account when making discretization decisions.

The most common way is to use an entropy heuristic. The heuristic will choose the split point with the smallest entropy, which corresponds to the largest information gain, and continue recursively until some stopping criterion is met.

You shouldn't use any information about the class values in the test set to help with the learning method, otherwise the model has already seen and captured the test set information.

In weka, see the classifier “FilteredClassifier” from “meta”. It's a “class for running an arbitrary classifier on data that has been passed through data modifications (in weka a filter). By using it, the test sets used within the cross-validation do not participate in choosing the discretization boundaries. The discretization operation is apply to the training set alone.

Discretization boundaries are determined in a more specific context but are based on a small subset of the overall information particularly lower down the tree, near the leaves.

For every internal node, the instances that reach it must be sorted separately for every numeric attribute

• and sorting has complexity O(n log n)
• but repeated sorting can be avoided with a better data structure

## Performance

• The classifiers SMO and SimpleLogistic implement linear decision boundaries in instance space. Discretization (with makeBinary enabled) would probably improve performance, because it allows a more complex model, no longer linear in the decision space.
• IBk can implement arbitrary boundaries in instance space. pre-discretization will not change the Ibk performance significantly (Performance = high accuracy)

## Bias

Binning data in bins of different size may introduce a bias.

The same data tells a different story depending on the level of detail you choose. Here's the same data about population growth in Europe (orange = growth, blue = decline) in five different units.

## Documentation / Reference

Discover More
Data Mining - Data (Preparation | Wrangling | Munging)

Data Preparation is a data step that prepares your data for further analyis. It's a key factor in any data project: mining, ai analytics Preparing has several steps that are explained below. ...
Data Visualisation - Histogram (Frequency distribution)

A histogram is a type of graph generally used to visualize a distribution An histogram is also known as a frequency distribution. Histograms can reveal information not captured by summary statistics...
Data Visualization - Bar Chart

Bar graphs plots: a categorical variable associated with quantities (number of case, sum of) histograms continuous variable associated with its bins They show: quantities as bar lengths...
Dimensional Data Modeling - Descriptif Attribute (Dimensional Attribute)

A descriptif attribute is class attribute that describe a property or characteristic of a dimension. They are used to label, filter and/or group on. measures Typical attributes for a product dimension...
Distribution - Quantile Analysis

A quantile is a statistic that identifies the data that is less than the given value (ie that fall at or below a score in a distribution). A quantile function will always rank the data before giving any...
GGplot - Stat - (Statistical transformation|Statistic)

The Statistical transformation (stat). Multiple layers, statistical transformation. It's often useful to transform your data before plotting, and that's what statistical transformations do. Every...
Ggplot - Bin (stat_bin)

See geom_histogram
How to use the Aggregate / Window Functions (sum, avg, ) in Sqlite ?

The aggregate / window function in Sqlite. Sqlite supports the following aggregate /window function : SUM (total) AVG Max Min Rank row_number more see the Specification...
Oracle Database - Histogram Analytic

How to build an histogram in Oracle (and binning) You create equi-width (all bin have the same distance) histogram with the WIDTH_BUCKET function. where: column is a date or number column ...
Quantile - Ntile (N Quantile)

Ntile (ie Nquantile) The Ntile function bins a data set in bucket of equal frequency. Each bucket has the same amount of observations. 4tile is a quartile 100tile is a percentile