About
Statistics - Bootstrap Resampling in R.
Bootstrap lets you get a look at the sampling distribution of statistics, for which it's really hard to develop theoretical versions. Bootstrap gives us a really easy way of doing statistics when the theory is very hard.
For example, if we calculate a statistics <math>\alpha</math> through a non-linear formula, the theory in this case is very hard and it's therefore difficult to answer the following questions:
- what is the sampling variability of <math>\alpha</math> ?
- what is the standard error of <math>\alpha</math> ?
- How variable is it going to be?
This is a case where the bootstrap really helps out in order to answer this questions.
Bootstrap is very handy way of getting very good reliable estimates of standard error for nasty statistics.
The Standard Bootstrap
The non-linear function
alpha=function(x,y){
vx=var(x)
vy=var(y)
cxy=cov(x,y)
(vy-cxy)/(vx+vy-2*cxy)
}
The <math>\alpha</math> function will return the last line that was evaluated.
Bootstrap Wrapper Function
alpha.fn=function(data, index){
with(data[index,],alpha(X,Y))
}
where:
- Index represents a bootstrap sample index with number between 1 and N. As the bootstrap does a re-sample of our training observations and some observations can be represented more than once and some not at all, the index may have the same observation index more than once, once or not at all.
- the function, with takes first argument of data frame and then some commands. Using the data in the data frame, execute the commands. And the main value of with is that you can use the named variables x and y that are in the data frame.
Bootstrap
Since a bootstrap involves random sampling and that we want reproducible results we set the random number seed.
require(boot)
set.seed(1)
# 1,000 bootstraps.
boot.out=boot(myDataFrame,alpha.fn,R=1000)
boot.out
ORDINARY NONPARAMETRIC BOOTSTRAP
Call:
boot(data = myDataFrame, statistic = alpha.fn, R = 1000)
Bootstrap Statistics :
original bias std. error
t1* 0.5758321 -7.315422e-05 0.08861826
This is really a fast function.
Plot
plot(boot.out)
- The first plot is a histogram of the bootstrap sampling distribution.
- The second plot is a qqplot, which plots the ordered values against the ordered statistics of a Gaussian. If it lines up on a straight line, you may say that the first plot looks like a pretty nice symmetric gaussian distribution
The block bootstrap
Correlation
A block bootstrap must be applied when there is a strong correlation between consecutive rows of the data matrix.
To see this correlation, you can plot the matrix.
matplot(dataFrame,type="l")
Effect of correlation on the standard error
When there is a strong correlation and that the data points are repeating, the sample size is in reality much smaller than the number of rows.
Which means that the standard error is underestimate because standard errors is divided by the sample size.
The wrapper function
getBetaX1=function(myData){
fit=lm(y~X1+X2,data=myData)
fit$coefficients["X1"]
}
From a data frame, it fits a linear model and return the regression coefficient of the variable X1.
The time series Bootstrapping function
Block bootstrap with a block size of 100
require(boot)
set.seed(1)
tsboot(myDataFrame,getBetaX1,R=50,sim="fixed",l=100)
BLOCK BOOTSTRAP FOR TIME SERIES
Fixed Block Length of 100
Call:
tsboot(tseries = Xy, statistic = getBetaX1, R = 50, l = 100,
sim = "fixed")
Bootstrap Statistics :
original bias std. error
t1* 0.1453263 0.06305497 0.1955884