About
BIC is like AIC and Mallows' Cp, but it is derived from a Bayesian argument. The formulas are very similar.
Formula
<MATH> BIC = \frac{1}{n}\left(RSS + \log(n)\, d\, \hat{\sigma}^2\right) </MATH>
The formula takes the residual sum of squares and adds an adjustment term: the log of the number of observations, times d, the number of parameters in the model (the intercept and the regression coefficients), times sigma-hat squared.
As in AIC and Cp, sigma-hat squared is an estimate of the error variance, which may or may not be available depending on whether n is greater than or less than p.
With BIC, we're estimating the average test set RSS across the observations. We want it to be as small as possible. In feature selection, we're going to choose the model with the smallest BIC.
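As a rough sketch of the selection rule, here is a small Python example. The RSS values, the candidate model names, and the choice to estimate sigma-hat squared from the largest candidate model are all assumptions made purely for illustration.

```python
import numpy as np

def bic(rss, n, d, sigma2_hat):
    """BIC = (1/n) * (RSS + log(n) * d * sigma2_hat), as in the formula above."""
    return (rss + np.log(n) * d * sigma2_hat) / n

# Hypothetical candidate models: name -> (number of parameters d, RSS)
n = 100
candidates = {"X1": (2, 520.0), "X1+X2": (3, 410.0), "X1+X2+X3": (4, 405.0)}

# Assumption: sigma-hat squared comes from the largest model, RSS / (n - d)
d_full, rss_full = candidates["X1+X2+X3"]
sigma2_hat = rss_full / (n - d_full)

# Compute BIC for each candidate and choose the model with the smallest value
scores = {name: bic(rss, n, d, sigma2_hat) for name, (d, rss) in candidates.items()}
best = min(scores, key=scores.get)
print(scores, "-> best model:", best)
```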
AIC and BIC
The only difference between AIC and BIC is the choice of log n versus 2 in the penalty. In general, if n is greater than 7, then log n is greater than 2. So if you have more than seven observations in your data, BIC puts a heavier penalty on large models. In other words, BIC tends to choose smaller models than AIC does.
BIC is going to select models that have fewer variables than either Cp or AIC.
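A quick numerical check of that cutoff (purely illustrative):

```python
import numpy as np

# log(n) crosses 2 between n = 7 and n = 8, so for n >= 8 the BIC
# penalty per parameter exceeds the AIC/Cp penalty of 2.
for n in (5, 7, 8, 100, 1000):
    print(n, round(float(np.log(n)), 3), "log(n) > 2:", np.log(n) > 2)
# For n = 100, log(n) is about 4.6, so BIC penalizes each additional
# parameter more than twice as heavily as AIC does.
```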