About
A factor is a qualitative explanatory variable.
Each factor has two or more levels, i.e., different values of the factor.
Combinations of factor levels are called treatments.
In a data set, a factor is typically stored as:
- a character variable,
- or a string variable.
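The vocabulary above can be sketched in Python. This is only an illustration: the data, the two factor names (`colour`, `gender`) and the variable names are assumptions made for the example. Each factor is stored as a list of strings, its levels are the distinct values, and the treatments are all combinations of levels.

```python
from itertools import product

# Hypothetical observations: each factor is stored as a list of strings.
colour = ["blue", "red", "green", "red", "blue"]
gender = ["male", "female", "female", "male", "female"]

# The levels of a factor are its distinct values.
colour_levels = sorted(set(colour))   # ['blue', 'green', 'red']
gender_levels = sorted(set(gender))   # ['female', 'male']

# A treatment is one combination of factor levels.
treatments = list(product(colour_levels, gender_levels))

for treatment in treatments:
    print(treatment)
```

With three colour levels and two gender levels, this enumerates 3 x 2 = 6 treatments.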
Modelling a factor
A categorical predictor cannot be passed directly to a regression analysis function. It must first be converted into one or more numeric variables. That's where dummy coding comes in.
Two levels
Example with gender, which has two levels (male or female). We create a new variable:
<MATH> X = \left\{\begin{array}{ll} 1 & \text{ if the person is a male} \\ 0 & \text{ if the person is a female} \end{array}\right. </MATH>
Resulting model:
<MATH> Y_i = B_0 + B_1 X_i + \epsilon_i = \left\{\begin{array}{ll} B_0 + B_1 + \epsilon_i & \text{ if the ith person is a male} \\ B_0 + \epsilon_i & \text{ if the ith person is a female} \end{array}\right. </MATH>
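As a minimal sketch of this model, the following Python snippet dummy-codes gender and fits the regression by ordinary least squares using the closed form for a single predictor. The data values and variable names are made up for illustration. It shows that the intercept plays the role of the female (baseline) mean and that the intercept plus the slope plays the role of the male mean.

```python
from statistics import mean

# Hypothetical toy data, made up for illustration only.
gender = ["male", "female", "male", "female", "male", "female"]
y = [72.0, 60.0, 78.0, 64.0, 75.0, 62.0]

# Dummy coding: X = 1 for male, X = 0 for female.
x = [1 if g == "male" else 0 for g in gender]

# Ordinary least squares with one predictor (closed form).
x_bar, y_bar = mean(x), mean(y)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
sxx = sum((xi - x_bar) ** 2 for xi in x)
b1 = sxy / sxx            # estimate of B_1
b0 = y_bar - b1 * x_bar   # estimate of B_0

# Group means, for comparison with the fitted coefficients.
female_mean = mean(yi for yi, xi in zip(y, x) if xi == 0)
male_mean = mean(yi for yi, xi in zip(y, x) if xi == 1)

print(b0, female_mean)       # intercept vs. mean of the X = 0 group
print(b0 + b1, male_mean)    # intercept + slope vs. mean of the X = 1 group
```

With a single dummy variable, the slope is therefore the difference between the two group means.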
More than two
With more than two levels, we create additional dummy variables.
For example, for a colour variable with three levels (blue, red, green), we create two dummy variables.
<MATH> \begin{array}{lll} X_1 & = & \left\{\begin{array}{ll} 1 & \text{ if the colour is blue} \\ 0 & \text{ if the colour is not blue} \end{array}\right. \\ X_2 & = & \left\{\begin{array}{ll} 1 & \text{ if the colour is red} \\ 0 & \text{ if the colour is not red} \end{array}\right. \\ \end{array} </MATH>
Then both of these variables can be used in the regression equation, in order to obtain the following model:
<MATH> Y_i = B_0 + B_1 X_{i1} + B_2 X_{i2} + \epsilon_i = \left\{\begin{array}{ll} B_0 + B_1 + \epsilon_i & \text{ if the ith colour is blue} \\ B_0 + B_2 + \epsilon_i & \text{ if the ith colour is red} \\ B_0 + \epsilon_i & \text{ if the ith colour is green} \\ \end{array}\right. </MATH>
There will always be one fewer dummy variable than the number of levels. The level with no dummy variable (green in this example) is known as the baseline.
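The coding scheme for more than two levels can be sketched as a small Python helper. The function name `dummy_code` and the sample data are hypothetical and chosen for this illustration. It creates one 0/1 column per non-baseline level, so a factor with k levels yields k - 1 dummy variables.

```python
def dummy_code(values, baseline):
    """Return (levels, rows): one 0/1 column per non-baseline level."""
    # The levels that receive a dummy variable (every level except the baseline).
    levels = sorted(set(values) - {baseline})
    # One row per observation, one column per dummy variable.
    rows = [[1 if v == level else 0 for level in levels] for v in values]
    return levels, rows


# Hypothetical observations of the colour factor.
colours = ["blue", "red", "green", "red", "blue"]
levels, rows = dummy_code(colours, baseline="green")

print(levels)  # ['blue', 'red']
for colour, row in zip(colours, rows):
    print(colour, row)  # green maps to [0, 0], the baseline
```

Here the first column corresponds to X_1 (blue) and the second to X_2 (red). An observation of the baseline level has 0 in every column.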