Statistics - Smoothing (cubic) spline

Data System Architecture


A smoothing spline is a way of fitting splines without having to worry about knots.

They are a little bit more challenging mathematically as others splines and approaches the problem from a completely different point of view.

It's called a smoothing spline because the solution a weird spline that got a knot at every single unique value of the x's. It sounds ridiculous but that's what it is.


Criterion for fitting a smooth function g(x) to some data:

<MATH> \begin{array}{rrl} \underset{\displaystyle g \in S}{minimize} & \overbrace{ \underbrace{\sum_{i=1}^{n}(y_i - g(x_i))^2}_{\displaystyle \href{rss}{RSS}} + \underbrace{\lambda \int {g}''(t)^2 dt )}_{\displaystyle \text{Penalty Term}} }^{\displaystyle \text{Criterion}} \\ \end{array} </MATH>

It's a cubic spline because the second derivative is present in the function. If it was a k-th derivative, then it would be a 2k plus first spline

We're searching over all g functions of x. g is unknown.


The first term is the residual sum of squares (on the left) and tries to make <math>g(x)</math> match the data at each <math>x_i</math>

If we didn't have the second term, we'd just pick any crazy interpolating function that visited every y exactly.



The second term adds up all non-linearities in the function.

It will constrains the functions g to be smooth. This complex penalty is the second derivative of the function, squared (to gets rid of the sign), integrated over the whole domain. (Integrate means add them up.)

The second derivatives are going to pick up wiggles of the function, the departures, for example, from linearity.


lambda is called the roughness penalty and controls how wiggly the function can be. It modulates how much of the penalty you're imposing. It's then a tuning parameter. lambda is bigger than or equal to 0.

As lambda goes from 0 to infinity, we get a range of functions solutions that are going to go from a linear function to a very wiggly function that goes through the data.


The smaller lambda, the more wiggly the function can be. If lambda is 0, we'll get that interpolating function because we won't pay any price for being wiggly.


On the other hand, as lambda gets really large, if it gets off to infinity, we pay a big price and the function will eventually be forced to be linear because a function that's linear has a second derivative 0.

So the first term would be 0 only if lambda gets to infinity.

Fitted Matrix

The fitted values can be written as a matrix times the observed vector of responses.

<MATH> \hat{g}_\lambda = S_\lambda y </MATH> where:

  • <math>\hat{g}_\lambda</math> (g hat indexed by lambda) is a vector containing the fitted values at each of the observed x's.
  • <math>y</math> is the observed vector of responses.
  • <math>S_\lambda</math> S is an n by m matrix and is determined by the positions of the x's and the particular value of lambda used.

It's a special form called a linear smoother because it's a linear operator times y.

All polynomial regression (splines, …) have a linear operator as the linear regression. All polynomial regression, result in such expressions and it's mathematical properties are attractive for computing (such as see below the degree of freedom

Degree of freedom

The effective degrees of freedom are given by:

<MATH> df_\lambda = \sum_{i=1}^{n}{S_\lambda}_{ii} </MATH>

The degrees of freedom are much less because of the roughness penalty than the natural spline.

The effective degrees of freedom formula doesn't guarantee an integer.

Smoothing Spline Df

The degrees of freedom for the first model (16) was set manually whereas the second was find by cross-validation.

Cross Validation

You can estimate the smoothing parameter (lambda) by cross-validation.

leave-one-out (LOO)

Leave-one-out cross-validation is quite attractive because it has a really nice expression.

The leave-one-out (LOO) cross-validated error is given by

<MATH> RSS_{cv}(\lambda) = \sum^n_{i=1} (y_i - \hat{g}_\lambda^{-i}(x_i))^2 = \sum^n_{i=1} \left [\ \frac{y_i - \hat{g}_\lambda(x_i)}{1-\{S_\lambda\}_{ii}} \right ]^2 </MATH>

This is the sum of cross-validated residuals.

  • To compute the first whole expression, we would have to leave out each point and refit the function each time.
  • But It turns out you can just fit a smoothing spline to the original data and divide each of the residuals by 1 minus the corresponding diagonal element of that smoother matrix (the second expression). It's sort of like a magic formula.

You calculate it for a range of values of lambda and pick the value that minimizes it.

Discover More
Card Puncher Data Processing
R - Smoothing spline

This function will fit a smoothing spline. We can specify df rather than . It's like a polynomial of degree 10. It has that complexity, but it's smoother. So the degrees of freedom are not all spent...
Data System Architecture
Statistic - Smooth (Function Continuity) (Soft ?)

To smooth a data set is to create an approximating function that attempts to capture important patterns in the data, while leaving out noise. In mathematical analysis, smoothness has to do with how many...
Thomas Bayes
Statistics - Knots (Cut points)

Knots are cutpoints that defines different regions (or partitions) for a variable. In each regions, a fitting must occurs. The definition of different regions is a way to stay local in the fitting process....
Data System Architecture
Statistics - Splines

A spline is a piecewise polynomial model: Splines have the “maximum” amount of continuity over all knots function.

Share this page:
Follow us:
Task Runner