# Statistics - (Factor Variable|Qualitative Predictor)

## About

A factor is a qualitative explanatory variable.

Each factor has two or more levels, i.e., different values of the factor.

Combinations of factor levels are called treatments.

Examples:

• a character variable,
• or a string variable.

## Modelling a factor

We can't put categorical predictors directly into a regression analysis function. We need to convert them into numeric variables in some way. That's where dummy coding comes in.

### Two levels

Example with gender, which has two levels (male or female). We create a new variable:

$$X = \left\{\begin{array}{ll} 1 & \text{ if the person is a male} \\ 0 & \text{ if the person is a female} \end{array}\right.$$

Resulting model:

$$Y_i = B_0 + B_1 X_i + \epsilon_i = \left\{\begin{array}{ll} B_0 + B_1 + \epsilon_i & \text{ if the ith person is a male} \\ B_0 + \epsilon_i & \text{ if the ith person is a female} \end{array}\right.$$
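A minimal sketch of this coding in Python, using only the standard library. The data values and the names `gender`, `y`, `b0` and `b1` are hypothetical, chosen for illustration. It suggests that, with a single 0/1 dummy, the least-squares intercept matches the mean of the group coded 0 (female) and the slope matches the difference between the two group means.

```python
from statistics import mean

# Hypothetical data: a two-level factor and a numeric response.
gender = ["male", "female", "female", "male", "female", "male"]
y = [5.0, 3.0, 4.0, 7.0, 2.0, 6.0]

# Dummy coding: X = 1 if the person is male, 0 if female.
x = [1 if g == "male" else 0 for g in gender]

# Simple least-squares estimates for Y = B0 + B1 * X.
x_bar, y_bar = mean(x), mean(y)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
sxx = sum((xi - x_bar) ** 2 for xi in x)
b1 = sxy / sxx
b0 = y_bar - b1 * x_bar

# Group means, for comparison with the coefficients.
mean_female = mean(yi for yi, xi in zip(y, x) if xi == 0)
mean_male = mean(yi for yi, xi in zip(y, x) if xi == 1)

print(b0, mean_female)              # intercept vs. female mean
print(b1, mean_male - mean_female)  # slope vs. difference in means
```

Under this coding, $B_0$ can be read as the expected response for females and $B_1$ as the expected difference between males and females.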

### More than two

With more than two levels, we create additional dummy variables.

For example, for a colour variable with three levels (blue, red, green), we create two dummy variables.

$$\begin{array}{lll} X_1 & = & \left\{\begin{array}{ll} 1 & \text{ if the colour is blue} \\ 0 & \text{ if the colour is not blue} \end{array}\right. \\ X_2 & = & \left\{\begin{array}{ll} 1 & \text{ if the colour is red} \\ 0 & \text{ if the colour is not red} \end{array}\right. \\ \end{array}$$

Then both of these variables can be used in the regression equation, in order to obtain the following model:

$$Y_i = B_0 + B_1 X_{i1} + B_2 X_{i2} + \epsilon_i = \left\{\begin{array}{ll} B_0 + B_1 + \epsilon_i & \text{ if the ith colour is blue} \\ B_0 + B_2 + \epsilon_i & \text{ if the ith colour is red} \\ B_0 + \epsilon_i & \text{ if the ith colour is green} \\ \end{array}\right.$$

There is always one fewer dummy variable than the number of levels. The level with no dummy variable (green in this example) is known as the baseline.
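The same idea can be sketched for any number of levels. The helper `dummy_code` below is a hypothetical function written for illustration, not part of any library. It builds one 0/1 column per non-baseline level, so a factor with three levels yields two dummy variables.

```python
def dummy_code(values, baseline):
    """Return the non-baseline levels and one row of 0/1 dummies per value."""
    if baseline not in values:
        raise ValueError("baseline must be one of the observed levels")
    # One dummy variable per level, except the baseline.
    levels = sorted(set(values) - {baseline})
    rows = [[1 if v == level else 0 for level in levels] for v in values]
    return levels, rows


# Hypothetical observations of a three-level colour factor.
colours = ["blue", "red", "green", "red", "blue"]
levels, rows = dummy_code(colours, baseline="green")

print(levels)  # the levels that received a dummy variable
for colour, row in zip(colours, rows):
    print(colour, row)  # the baseline row contains only zeros
```

Rows for the baseline level contain only zeros, which is why its expected response is carried by the intercept $B_0$ alone.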
