Sunday, September 25, 2016

Effect of step size in forward stagewise regression

Forward stagewise regression is a linear model selection algorithm. It is closely related to least angle regression (LARS; Efron, Hastie, Johnstone & Tibshirani, 2004) and can be computed through a simple modification of the LARS algorithm. It aims to select only a subset of the X (predictor) variables, for efficient prediction of a Y (response) variable.

The forward stagewise regression (FSR) algorithm roughly works as follows (a toy illustration of a single update follows below):
  1. Initialize by setting i = 0, coefficient vector b_i = 0, and the residual vector rY_i = Y - X * b_i.
  2. Increase (or decrease*) the coefficient of the predictor variable most highly correlated with rY_i by a small step, yielding the current coefficient vector b_{i+1}.
  3. Calculate rY_{i+1} = Y - X * b_{i+1}.
  4. Set i = i + 1, and repeat steps 2 - 3 until convergence (i.e., until no predictor is correlated with rY_i more strongly than some threshold).
*In step 2, the coefficient is increased or decreased, depending on the sign of the correlation between the predictor variable and rY_i. The step size, that is, the size of the increase (or decrease) in step 2, can be provided by the user, or an optimal value can be determined by cross validation.
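To make step 2 concrete, here is a minimal sketch of a single coefficient update on the standardized Boston housing data (the object names are mine; the full function follows below):

library(MASS)
sX <- scale(as.matrix(Boston[,-14]))   ## standardized predictors
sY <- scale(Boston$medv)               ## standardized response
rY <- sY                               ## residual; equals sY before the first step
cors <- t(rY) %*% sX                   ## inner products of residual with each predictor
index <- which.max(abs(cors))          ## predictor most strongly correlated with rY
b <- numeric(ncol(sX))
b[index] <- b[index] + sign(cors[index]) * .1  ## one step of size .1, in the direction of the correlation
rY <- sY - sX %*% b                    ## updated residual for the next iteration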

Although FSR can be performed with the R package lars, I wanted to program it myself, to check the effect of different step sizes, and to see how the results compare with ordinary least squares (OLS) regression. I created a function for performing FSR (called swReg) and a function for plotting the coefficient paths (called plot.swReg), and applied these to the Boston housing dataset; the code and results follow below.
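For comparison, the lars package computes the full FSR coefficient path directly. A minimal sketch (fsr is my object name; the data preparation is repeated here to keep the snippet self-contained):

library(lars)
library(MASS)
fsr <- lars(as.matrix(Boston[,-14]), Boston$medv, type = "forward.stagewise")
plot(fsr)  ## coefficient paths as computed by lars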


R functions for performing FSR, plotting coefficient paths and computing cross-validated prediction error
swReg <- function(X, Y, stepsize = .1, threshold = .1) {
  ## Standardize predictors and response:
  sX <- scale(X)
  sY <- scale(Y)
  b <- numeric(ncol(X))  ## standardized coefficients, initialized at zero
  rY <- sY               ## current residual vector
  sgn <- 0
  lastsgn <- 0
  index <- 0
  lastindex <- 0
  iteration <- 0
  path <- list()
  ## Iterate until no predictor-residual inner product exceeds the threshold, or
  ## until the same coefficient is alternately increased and decreased (oscillation):
  while(max(abs(t(rY) %*% sX)) > threshold & (index != lastindex | sgn == lastsgn)) {
    iteration <- iteration + 1
    lastindex <- index
    lastsgn <- sgn
    index <- which.max(abs(t(rY) %*% sX))  ## predictor most strongly correlated with rY
    sgn <- sign((t(rY) %*% sX)[index])     ## direction of that correlation
    b[index] <- b[index] + sgn*stepsize    ## take one small step
    rY <- sY - sX %*% b                    ## update the residual
    path[[iteration]] <- b
  }
  ## Back-transform the coefficients to the original scales of X and Y:
  bUnstand <- (b/attr(sX, "scaled:scale"))*attr(sY, "scaled:scale") 
  intercept <- attr(sY, "scaled:center") - bUnstand %*% attr(sX, "scaled:center")
  bUnstand <- c(intercept, bUnstand)
  names(bUnstand) <- c("(Intercept)", colnames(X))
  
  return(list(unstandardized.coef = bUnstand, standardized.coef = b,
            iteration = iteration, stepsize = stepsize, threshold = threshold, 
            coef.path = data.frame(matrix(
              unlist(path), ncol = ncol(X), nrow = iteration, byrow = TRUE, 
              dimnames = list(1:iteration, colnames(X)))),
            data = list(X = X, Y = Y))
         )  
}

plot.swReg <- function(object, legend = TRUE) {
  plot(object$coef.path[,1], type = "l", 
  ylim = c(min(object$coef.path), max(object$coef.path)), 
  ylab = "coefficient", xlab = "iteration", 
  main = "Standardized coefficient paths", 
  sub = paste("stepsize =", object$stepsize))
  for (i in 2:ncol(object$coef.path)) {lines(object$coef.path[,i], col = i)}
  if(legend) {
    legend("topleft", legend = colnames(object$data$X), cex = .5, 
    lty = 1, col = 1:ncol(object$coef.path), y.intersp = .25, 
    x.intersp = .25, bty = "n", seg.len = .5)
  }
}

predict.swReg <- function(object, newdata = object$data$X) {
  cbind(rep(1, times = nrow(newdata)), as.matrix(newdata)) %*% object$unstandardized.coef
}

swReg.xval <- function(object, k = 10, seed = 42) {
  set.seed(seed)
  ids <- peperr::resample.indices(n = nrow(object$data$X), sample.n = k, method = "cv")
  models <- list()
  xvaldatasets <- list()
  for (i in 1:k){
    xvaldatasets[[i]] <- list(train = list(X = object$data$X[ids$sample.index[[i]],],
                                           Y = object$data$Y[ids$sample.index[[i]]]),
                              test = list(X = object$data$X[ids$not.in.sample[[i]],],
                                          Y = object$data$Y[ids$not.in.sample[[i]]]))
    models[[i]] <- swReg(xvaldatasets[[i]]$train$X, xvaldatasets[[i]]$train$Y, 
                         stepsize = object$stepsize, threshold = object$threshold)
    xvaldatasets[[i]]$test$xvalpredY <- predict.swReg(models[[i]], newdata = xvaldatasets[[i]]$test$X)
  }
  dataset <- cbind(object$data$X, fold = NA, Y = object$data$Y, Ycvpred = NA)
  coefficients <- list()
  for(i in 1:k){
    dataset[,"fold"][ids$not.in.sample[[i]]] <- i
    dataset[,"Ycvpred"][ids$not.in.sample[[i]]] <- xvaldatasets[[i]]$test$xvalpredY 
    coefficients[[i]] <- list(unstandardized.coef = models[[i]]$unstandardized.coef, 
                              standardized.coef = models[[i]]$standardized.coef)
  }
  return(list(dataset = dataset, cv.coefficients = coefficients))
} 

Example: Boston housing data
## Data preparation:
library(MASS)
X <- as.matrix(Boston[,-14])
Y <- Boston$medv

## Run forward stagewise regression:
tmp1 <- swReg(X, Y, stepsize = .2, threshold = .2)
tmp2 <- swReg(X, Y, stepsize = .1, threshold = .1)
tmp3 <- swReg(X, Y, stepsize = .01, threshold = .01)
tmp4 <- swReg(X, Y, stepsize = .001, threshold = .001)

## Plot coefficient paths:
par(mfrow=c(2,2))
plot.swReg(tmp1)
plot.swReg(tmp2)
plot.swReg(tmp3)
plot.swReg(tmp4)
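
The swReg.xval function defined above can be used to compare the step sizes on cross-validated prediction error. A minimal sketch (cvs is my object name, not part of the functions above):

## 10-fold cross-validated mean squared prediction error, per step size:
cvs <- lapply(list(tmp1, tmp2, tmp3, tmp4), swReg.xval, k = 10)
sapply(cvs, function(cv) mean((cv$dataset[,"Y"] - cv$dataset[,"Ycvpred"])^2))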


[Figure: standardized coefficient paths for step sizes .2, .1, .01, and .001]


What we see is that the paths look very similar across step sizes, but fewer iterations are required before convergence with larger step sizes. Also, larger step sizes provide sparser solutions (i.e., fewer non-zero coefficients in the final solution). Note that the algorithm is applied to standardized X and Y values, so the threshold (set equal to the step size in the examples above) can roughly be interpreted as the smallest correlation between a predictor and the residualized response that the user still finds 'relevant'. The swReg function above returns both standardized and unstandardized regression coefficients.
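
Both observations can be verified directly from the fitted objects (a small sketch using the objects created above):

## Iterations until convergence, per step size (.2, .1, .01, .001):
sapply(list(tmp1, tmp2, tmp3, tmp4), function(m) m$iteration)
## Number of non-zero coefficients in the final solution:
sapply(list(tmp1, tmp2, tmp3, tmp4), function(m) sum(m$standardized.coef != 0))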


Comparison with OLS

[Figure: final FSR coefficients for each step size, compared with the OLS coefficients]

What we see is: the larger the step size, the sparser the final solution. With a step size of .2, only 5 predictor variables are selected; with a step size of .001, all variables are selected. Also, the smaller the step size, the more closely the final solution resembles the OLS solution: the absolute distances between the OLS and FSR coefficients shrink as the step size decreases.
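
These distances can be computed as follows (a sketch; ols and fsr.coefs are my object names):

## OLS coefficients on the same data:
ols <- coef(lm(Y ~ X))
## Summed absolute distance between OLS and FSR coefficients, per step size:
fsr.coefs <- sapply(list(tmp1, tmp2, tmp3, tmp4), function(m) m$unstandardized.coef)
colSums(abs(fsr.coefs - ols))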

References 

Efron, B., Hastie, T., Johnstone, I., & Tibshirani, R. (2004). Least angle regression. The Annals of Statistics, 32(2), 407-499.
