The forward stagewise regression (FSR) algorithm roughly works as follows:
- Initilize by setting i = 0, coefficient vector b_i = 0, and the residual vector rY_i = Y - X * b_i.
- Increase (or decrease*) the coefficient of the predictor variable most highly correlated with rY_i by a small step, yielding the current coefficient vector b_{i+1}.
- Eat potato chips (optional)
- Calculate rY_{i+1} = Y - X * b_{i+1}.
- Set i = i + 1, repeat steps 2 - 4 until convergence (i.e., consecutive b_i no longer change).
Although FSR can be performed with the R package lars, I wanted to program it myself, to check out the effect of different step sizes, and to see how the results compare to ordinary least squares regression (OLS). I created a function for performing FSR (called swReg), a function for plotting the coefficient paths (called plot.swReg), and applied it on the Boston housing dataset. Here are the code and the results:
R functions for performing FSR, plotting coefficient paths and computing cross-validated prediction error
swReg <- function(X, Y, stepsize = .1, threshold = .1) { sX <- scale(X) sY <- scale(Y) b <- vector(length = ncol(X)) rY <- sY sgn <- 0 lastsgn <- 0 index <- 0 lastindex <- 0 iteration <- 0 path <- list() while(abs(max(t(rY) %*% sX)) > threshold & (index != lastindex | sgn == lastsgn)) { iteration <- iteration + 1 lastindex <- index lastsgn <- sgn index <- which(abs(t(rY) %*% sX) == max(abs(t(rY) %*% sX))) sgn <- sign((t(rY) %*% sX)[index]) b[index] <- b[index] + sgn*stepsize rY <- sY - sX %*% b path[[iteration]] <- b } bUnstand <- (b/attr(sX, "scaled:scale"))*attr(sY, "scaled:scale") intercept <- attr(sY, "scaled:center") - bUnstand %*% attr(sX, "scaled:center") bUnstand <- c(intercept, bUnstand) names(bUnstand) <- c("(Intercept)", colnames(X)) return(list(unstandardized.coef = bUnstand, standardized.coef = b, iteration = iteration, stepsize = stepsize, threshold = threshold, coef.path = data.frame(matrix( unlist(path), ncol = ncol(X), nrow = iteration, byrow = TRUE, dimnames = list(1:iteration, colnames(X)))), data = list(X = X, Y = Y)) ) } plot.swReg <- function(object, legend = TRUE) { plot(object$coef.path[,1], type = "l", ylim = c(min(object$coef.path), max(object$coef.path)), ylab = "coefficient", xlab = "iteration", main = "Standardized coefficient paths", sub = paste("stepsize =", object$stepsize)) for (i in 2:ncol(X)) {lines(object$coef.path[,i], col = i)} if(legend) { legend("topleft", legend = colnames(object$data$X), cex = .5, lty = 1, col = 1:ncol(object$coef.path), y.intersp = .25, x.intersp = .25, bty = "n", seg.len = .5) } } predict.swReg <- function(object, newdata = object$data$X) { cbind(rep(1, times = nrow(newdata)), as.matrix(newdata)) %*% object$unstandardized.coef } swReg.xval <- function(object, k = 10, seed = 42) { set.seed(seed) ids <- peperr::resample.indices(n = nrow(object$data$X), sample.n = 10, method = "cv") models <- list() xvaldatasets <- list() for (i in 1:k){ xvaldatasets[[i]] <- list(train = list(X = object$data$X[ids$sample.index[[i]],], Y = object$data$Y[ids$sample.index[[i]]]), test = list(X = object$data$X[ids$not.in.sample[[i]],], Y = object$data$Y[ids$not.in.sample[[i]]])) models[[i]] <- swReg(xvaldatasets[[i]]$train$X, xvaldatasets[[i]]$train$Y, stepsize = object$stepsize, threshold = object$threshold) xvaldatasets[[i]]$test$xvalpredY <- predict.swReg(models[[i]], newdata = xvaldatasets[[i]]$test$X) } dataset <- cbind(object$data$X, fold = NA, Y = object$data$Y, Ycvpred = NA) coefficients <- list() for(i in 1:k){ dataset[,"fold"][ids$not.in.sample[[i]]] <- i dataset[,"Ycvpred"][ids$not.in.sample[[i]]] <- xvaldatasets[[i]]$test$xvalpredY coefficients[[i]] <- list(unstandardized.coef = models[[i]]$unstandardized.coef, standardized.coef = models[[i]]$standardized.coef) } return(list(dataset = dataset, cv.coefficients = coefficients)) }
Example: Boston housing data
## Data preparation: library(MASS) X <- as.matrix(Boston[,-14]) Y <- Boston$medv ## Run forward stagewise regression: tmp1 <- swReg(X, Y, st = .2, thr = .2) tmp2 <- swReg(X, Y, st = .1, thr = .1) tmp3 <- swReg(X, Y, st = .01, thr = .01) tmp4 <- swReg(X, Y, st = .001, thr = .001) ## Plot coefficient paths: par(mfrow=c(2,2)) plot.swReg(tmp1) plot.swReg(tmp2) plot.swReg(tmp3) plot.swReg(tmp4)
Coefficient paths
What we see is that the paths actually look very similar, but less iterations are required before convergence with larger step sizes. Also, larger steps sizes provide sparser solutions (i.e., less non-zero coefficients in the final solution). Note that the algorithm is applied on standardized X and Y values, so the step size could be interpreted as the minimal correlation between a predictor and the response variable that the user finds 'relevant'. The FSR function above provides both standardized and unstandardized regression coefficients.
Comparison with OLS
What we see is: the larger the step size, the sparser the final solution. With step size equal to .2, only 5 predictor variables are selected. With step size equal to .001, all variables are selected. Also, the smaller the step size, the more the final solution resembles the OLS solution. The absolute distances between the OLS and FSR solutions decrease with step size:
References
Efron, B., Hastie, T., Johnstone, I., & Tibshirani, R. (2004). Least angle regression. The Annals of Statistics, 32(2), 407-499. link to pdf
No comments:
Post a Comment