Tuesday, April 21, 2015

Subsampling for cross validation in R

If you want to obtain cross-validated estimates of performance in R, you have to split your dataset into a number of roughly equally sized parts. For leave-one-out cross validation (LOO CV), this is easy (see example below). However, LOO CV generally provides overly optimistic estimates of performance, and a number of folds smaller than the number of observations (e.g., 5 or 10) is preferred. This may require some custom code if the number of observations is not a multiple of 5 or 10. Luckily, the resample.indices() function of the R package peperr provides a generic solution to the problem. Below, examples are provided for a linear regression model, but the point is that lm() can be replaced by any model-building function.
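To see what such a function has to handle, here is a minimal base-R sketch (not using peperr) of assigning observations to roughly equal folds when n is not a multiple of k; the numbers and variable names are illustrative:

```r
# assign n observations to k folds of nearly equal size (base R, no peperr):
n <- 503
k <- 10
set.seed(1)
# recycle the fold labels 1..k up to length n, then randomly permute them:
fold <- sample(rep(1:k, length.out = n))
# fold sizes differ by at most one observation:
table(fold)
```

The indices of fold i are then simply which(fold == i).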


R code example for k-fold cross validation:
library(peperr)
# simulate original dataset with 10 variables and 500 observations:
dataset <- data.frame(matrix(rnorm(5000), ncol=10, nrow=500))
# set number of folds:
k <- 10
# get observation ids for each fold:
ids <- resample.indices(n=nrow(dataset), sample.n=k, method="cv")
# create a list for gathering results:
xvaldatasets <- list()
# for each fold k:
for (i in 1:k){
  # make datasets of training and test observations:
  xvaldatasets[[i]] <- list(train=dataset[ids$sample.index[[i]],],
                            test=dataset[ids$not.in.sample[[i]],])
  # fit model on training observations, and use it to make predictions for test observations:
  xvaldatasets[[i]]$test$xvalpred <- predict(lm(X10 ~ ., data=xvaldatasets[[i]]$train),
                                          newdata=xvaldatasets[[i]]$test)
}

# evaluate predictive accuracy in each fold k:
xvalcors <- list()
for (i in 1:k){
  # for example, by calculating the correlation between the predicted and observed values of X10:
  xvalcors[[i]] <- cor(xvaldatasets[[i]]$test$X10, xvaldatasets[[i]]$test$xvalpred)
}
# print correlations between true and predicted values for each fold:
unlist(xvalcors)
# calculate mean:
mean(unlist(xvalcors))
# calculate SD:
sd(unlist(xvalcors))
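Correlation is only one way to score the folds; mean squared prediction error is a common alternative. A minimal sketch, with a helper function and toy vectors that are purely illustrative:

```r
# mean squared error between observed and predicted values:
mse <- function(observed, predicted) mean((observed - predicted)^2)
# toy example:
obs  <- c(1.0, 2.0, 3.0)
pred <- c(1.1, 1.9, 3.2)
mse(obs, pred)  # 0.02
```

In the loop above, this would replace the call to cor(), e.g. mse(xvaldatasets[[i]]$test$X10, xvaldatasets[[i]]$test$xvalpred).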




R code example for leave-one-out cross validation:
# simulate original dataset with 10 variables and 500 observations:
dataset <- data.frame(matrix(rnorm(5000), ncol=10, nrow=500))
# create objects for gathering results:
xvaldatasets <- list()
xvalpreds <- rep(NA, times=nrow(dataset))
# for each observation i:
for (i in 1:nrow(dataset)){
  # make datasets of training and test observations:
  xvaldatasets[[i]] <- list(train=dataset[-i,],
                            test=dataset[i,])
  # fit model on training observations, and use it to make a prediction for the test observation:
  xvalpreds[i] <- predict(lm(X10 ~ ., data=xvaldatasets[[i]]$train),
                          newdata=xvaldatasets[[i]]$test)
}
# evaluate predictive accuracy (pooled over all observations, since each fold contains only one):
cor(xvalpreds, dataset$X10)
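The same LOO loop can be written more compactly with sapply, since each iteration returns a single prediction. A sketch assuming the same simulated dataset as above:

```r
# simulate original dataset with 10 variables and 500 observations:
dataset <- data.frame(matrix(rnorm(5000), ncol = 10, nrow = 500))
# leave-one-out predictions: for each i, fit on all rows except i, predict row i:
xvalpreds <- sapply(1:nrow(dataset), function(i) {
  fit <- lm(X10 ~ ., data = dataset[-i, ])
  predict(fit, newdata = dataset[i, ])
})
# correlation between true and predicted values:
cor(xvalpreds, dataset$X10)
```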
