# Data Mining - (Feature|Attribute) Extraction Function

Feature extraction is the second class of methods for dimension reduction (the other being feature selection).

It is sometimes referred to as dimension reduction itself, but strictly speaking it is only one class of dimension reduction methods.

It creates new attributes (features) using linear combinations of the (original|existing) attributes.

This function is useful for reducing the dimensionality of high-dimensional data (i.e., you get fewer columns).

Applicable for:

• latent semantic analysis,
• data decomposition and projection,
• and pattern recognition.

We project the $p$ predictors into an $M$-dimensional subspace, where $M < p$. This is achieved by computing $M$ different linear combinations, or projections, of the variables. These $M$ projections are then used as predictors to fit a linear regression model by least squares.
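The whole procedure can be sketched in a few lines of numpy. This is a minimal illustration, not a canonical implementation: the $\phi_{mj}$ are taken here to be the top-$M$ principal component loadings (the choice made by principal components regression), and the data and true coefficients are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, M = 100, 5, 2          # n observations, p predictors, M < p projections

# Synthetic data: only the first two predictors actually matter.
X = rng.normal(size=(n, p))
y = X @ np.array([1.0, 0.5, 0.0, 0.0, 0.0]) + rng.normal(scale=0.1, size=n)

# One possible choice of the phi_mj: the top-M principal component loadings.
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
Phi = Vt[:M]                  # shape (M, p): row m holds the phi_mj for Z_m

# The M projections Z_m = sum_j phi_mj * x_j, one column per Z_m.
Z = Xc @ Phi.T                # shape (n, M)

# Fit y on the M projections (plus an intercept) by ordinary least squares.
Z1 = np.column_stack([np.ones(n), Z])
theta, *_ = np.linalg.lstsq(Z1, y, rcond=None)   # intercept + M coefficients
```

Only $M + 1$ coefficients are estimated instead of $p + 1$, which is where the variance reduction comes from.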

Dimension reduction is a way of finding combinations of variables, extracting important combinations of variables, and then using those combinations as the features in regression.

Approaches: principal components regression (PCR) and partial least squares (PLS).

## Procedure

The idea is that by choosing the linear combinations wisely (in particular, by choosing the $\phi_{mj}$ ), we will be able to beat ordinary least squares (OLS) on the raw predictors.

### Linear Combinations

Let $Z_1, Z_2, \dots, Z_M$ represent $M$ linear combinations of the original $p$ predictors ($M < p$).

$$Z_m = \sum_{j=1}^p \phi_{mj} x_j$$

where:

• $\phi_{mj}$ is a constant
• $x_j$ are the original predictors

$M < p$, because if $M$ equals $p$, dimension reduction just reproduces least squares on the raw data.

### Model Fitting

Then the following linear model can be fitted using ordinary least squares:

$$y_i = \theta_0 + \sum_{m=1}^M \theta_m z_{im} + \epsilon_i$$

where:

• $i = 1, \dots, n$
• $z_{im}$ is the value of the projection $Z_m$ for observation $i$
• $\theta_0, \theta_1, \dots, \theta_M$ are the regression coefficients

This model can be thought of as a special case of the original linear regression model because (a little bit of algebra):

$$\sum_{m=1}^M \theta_m z_{im} = \sum_{m=1}^M \theta_m ( \sum_{j=1}^p \phi_{mj} x_{ij} ) = \sum_{j=1}^p (\sum_{m=1}^M \theta_m \phi_{mj} ) x_{ij} = \sum_{j=1}^p \beta_j x_{ij}$$

The last term is just a linear combination of the original predictors ($x_j$ ), where the coefficients of the combination are the $\beta_j$:

$$\beta_j = \sum_{m=1}^M \theta_m \phi_{mj}$$
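This identity is easy to verify numerically: fitted values computed from $\theta$ on the projections $Z$ coincide with fitted values computed from the implied $\beta = \Phi^\top \theta$ on the original $X$. A small sketch with arbitrary made-up $\phi_{mj}$ and $\theta_m$:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, M = 50, 4, 2
X = rng.normal(size=(n, p))

# Any fixed constants phi_mj define the projections z_im = sum_j phi_mj * x_ij.
Phi = rng.normal(size=(M, p))         # rows are the phi_m vectors
Z = X @ Phi.T                         # shape (n, M)

theta = rng.normal(size=M)            # some theta_m coefficients

# beta_j = sum_m theta_m * phi_mj, i.e. beta = Phi^T theta.
beta = Phi.T @ theta                  # shape (p,)

# Identical predictions: sum_m theta_m z_im == sum_j beta_j x_ij for every i.
same = np.allclose(Z @ theta, X @ beta)
```

So any model in the $Z$'s is also a model in the original $x$'s, just with constrained coefficients.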

Dimension reduction therefore fits a linear model in the original $x$'s through the definition of the new $z$'s, but one in which the $\beta_j$'s are constrained to take this specific form.

In a way, it's similar to ridge and LASSO: it's still least squares, and still a linear model in all the variables, but there's a constraint on the coefficients. The constraint is not a penalty added to the RSS but a restriction on the form the $\beta_j$ coefficients can take.

Dimension reduction aims to win the bias-variance trade-off: a simplified model with low bias and lower variance than plain vanilla least squares on the original features.

## Feature Extraction vs Ridge regression

Ridge regression looks really different from the dimension reduction methods (principal components regression and partial least squares), but it turns out that mathematically these ideas are all very closely related.

Principal components regression, for example, is just a discrete version of ridge regression.

Ridge regression is continuously shrinking variables, whereas principal components is doing it in a more choppy sort of way.
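That contrast can be made concrete through the singular values of the (centered) design matrix. In the principal-component basis, ridge shrinks the contribution of component $j$ by the factor $d_j^2 / (d_j^2 + \lambda)$, while PCR keeps the first $M$ components entirely and drops the rest. A small numpy sketch of the two shrinkage profiles, with made-up data and an arbitrary $\lambda$ and $M$:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(60, 5))
Xc = X - X.mean(axis=0)
d = np.linalg.svd(Xc, compute_uv=False)        # singular values, largest first

# Ridge: every principal component is shrunk, smoothly, by d^2 / (d^2 + lambda).
lam = 10.0
ridge_shrink = d**2 / (d**2 + lam)             # all strictly between 0 and 1

# PCR: the first M components are kept as-is, the rest are discarded outright.
M = 2
pcr_shrink = (np.arange(len(d)) < M).astype(float)   # [1, 1, 0, 0, 0]
```

Ridge's factors decay gradually as the components get smaller; PCR's jump from 1 straight to 0, which is the "choppy" behavior described above.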

## Example

Examples of feature extractors:

• unigrams,
• bigrams,
• unigrams and bigrams,
• unigrams with part-of-speech tags,
• and, given demographic data about a set of customers, grouping the attributes into general characteristics of the customers.
