# Data Mining - Step Function (piecewise constants)

Step functions, are another way of fitting non-linearities. (especially popular in epidemiology and biostatistics)

## Procedure

Continuous variable are cut into discrete sub-ranges and fit a constant model in each of the regions. It's a piecewise constant (model|function).

## Characteristics

### Natural cut point

The piecewise constant functions are especially useful if they are some natural cut points that you want to use or/and are of interest.

### Summary

what is the average income for somebody below the age of 35? You read it straight off the plot. This is often good food for summaries (in newspapers, reports, …)

### Interaction

Useful way of creating interactions that are easy to interpret.

Example of interaction variable (between Year and Age) in a linear model:

X_1 = I(Year < 2005) . Age
X_2 = I(Year > 2005) . Age


where I is the R indicator function.

With this two variables, we will fit a different linear model as a function of age for the people who worked before 2005 and those after 2005. We will get two different linear functions in each age category. It's an easy way of seeing the effect of an interaction.

### Local

One advantage over polynomial is that the calculation is local.

With polynomials, we have a single function for the whole range of the x variable. If I change a point on the left side, it could potentially change the fit on the right side for polynomials. But for step functions, a point only affects the fit in its partition and not the others.

### Knots

Choice of cutpoints or knots can be problematic. For creating non-linearities, smoother alternatives such as splines are available.

## Implementation

The function of the sub-ranges must be seen as binary variable.

Is x less than 35?

• If yes, you make it 1.
• If not, you make it a 0.

You creates then a series of dummy variables (zero-one variables) representing each group and you just fit those with the linear model.

## Models

In the below plot, there is a cut:

• at 35
• at 50 (not really visible in the regression plot)
• and at 65.

$$C_1(X) = I(X < 35); C_2(X) = I(35 \leq X < 50), \dots, C_3(X) = I(X \geq 65)$$