Thursday, November 3, 2016

Interactions in GLM(M)s: To center or not to center

Researchers interested in the interaction effect of two variables X1 and X2 on an outcome variable Y are often advised to center the predictor variables prior to creating the interaction term. The major advantage of this approach is said to be that the correlation between the predictor variables and the interaction term disappears, so that multicollinearity is no longer a problem.

# simulate two independent predictors and create both interaction terms
set.seed(42)
X1 <- rnorm(250, 50, 5)
X2 <- rnorm(250, 70, 5)
X1X2 <- X1 * X2                 # uncentered interaction
X1X2c <- (X1 - 50) * (X2 - 70)  # interaction of centered predictors
dataset <- data.frame(X1, X2, X1X2, X1X2c)
cor(dataset)            # correlations between predictors and interactions
apply(dataset, 2, mean)
apply(dataset, 2, sd)

In the example above, we see that without centering there is a substantial positive correlation between the predictor variables and the interaction term. With centering, the correlation all but disappears (for independent predictors the population correlation is exactly zero; the small negative value here is sampling noise). However, centering changes the form of the interaction term itself, and the interpretation of the interaction effect has to change accordingly.

To see how this could lead to problems, let's look at a few examples. First, let's create two outcome variables: Y, a function of the non-centered interaction plus an error term, and Yc, a function of the centered interaction plus an error term:

# Y is built from the uncentered interaction, Yc from the centered one
# (note the different error SDs, reflecting the very different scales
# of X1X2 and X1X2c)
Y <- X1X2 + rnorm(250, 0, 200)
Yc <- X1X2c + rnorm(250, 0, 10)

An interesting pattern emerges: if the true interaction effect of X1 and X2 is based on their uncentered values, we will not recover it if we center our predictor variables before creating the interaction term, and vice versa:

# highly non-linear associations:
plot(X1X2, Yc)
cor(X1X2, Yc)
plot(X1X2c, Y)
cor(X1X2c, Y)

By contrast, there is a strong, clearly linear association when we use the 'true' interaction term:

plot(X1X2c, Yc)
cor(X1X2c, Yc)
plot(X1X2, Y)
cor(X1X2, Y)
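
What drives this pattern is that the two interaction terms are themselves almost uncorrelated: the variance of the uncentered product is dominated by the linear contributions of X1 and X2, which the centered product strips out (more on this below). A quick check:

# the two interaction terms are nearly uncorrelated
cor(X1X2, X1X2c)  # close to zero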

Looking only at the bivariate correlations between the interaction term and the outcome variable may be too simplistic, though. What happens if we also include the main effects of X1 and X2? Note that the large differences in residual variance are due to the different error SDs used in constructing Y and Yc, but the estimated coefficient values are as expected:

# first, the models with the 'wrong' interaction type:
summary(lm(Yc ~ X1 + X2 + X1X2))
summary(lm(Y ~ X1 + X2 + X1X2c))

So, if we use the 'wrong' interaction type, the main effects of X1 and X2 become significant, too. Only when we use the 'true' interaction type do we recover the true model (i.e., no main effects of X1 and X2, only a 'pure' interaction):

summary(lm(Yc ~ X1 + X2 + X1X2c))
summary(lm(Y ~ X1 + X2 + X1X2))
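
Note, by the way, that once the main effects (and the intercept) are included, the centered and uncentered specifications are just reparameterizations of the same model: the fitted values and the interaction coefficient are identical, and only the intercept and main-effect estimates differ. A quick check using the objects defined above:

m1 <- lm(Y ~ X1 + X2 + X1X2)
m2 <- lm(Y ~ X1 + X2 + X1X2c)
all.equal(fitted(m1), fitted(m2))       # TRUE: identical fitted values
c(coef(m1)["X1X2"], coef(m2)["X1X2c"])  # identical interaction estimates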

If we include both interaction terms, one coefficient is not estimated; lm() reports it as NA ('not defined because of singularities'):

summary(lm(Y ~ X1 + X2 + X1X2 + X1X2c))
summary(lm(Yc ~ X1 + X2 + X1X2 + X1X2c))
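
The singularity is no accident: expanding the centered product gives (X1 - 50) * (X2 - 70) = X1X2 - 70*X1 - 50*X2 + 3500, so X1X2c is an exact linear combination of the intercept, X1, X2, and X1X2. We can verify this numerically:

# the centered interaction is an exact linear combination of the
# intercept, the main effects, and the uncentered interaction
all.equal(X1X2c, X1X2 - 70 * X1 - 50 * X2 + 3500)  # TRUE

This identity also explains the earlier results: fitting the 'wrong' interaction type simply pushes the linear part of the product into the main effects, which is why they turned significant.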

Of course, in reality we cannot be sure whether the 'true' interaction is centered or not. Therefore, whether you use centered or non-centered interactions, you should always:

1) Include the main effects of the predictor variables involved in interactions.
2) Refrain from interpreting the main effects in the presence of an interaction.

This is pretty common knowledge, and something I knew before, but now I actually understand why.
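
As a final practical note: R's formula interface takes care of point 1 automatically, because X1 * X2 expands to X1 + X2 + X1:X2. And as shown above, centering then changes only the intercept and main-effect estimates, not the interaction coefficient. A quick sketch (X1c and X2c are just helper variables for the centered predictors):

# main effects are included automatically by the * operator
summary(lm(Y ~ X1 * X2))
# the same model with sample-mean-centered predictors: identical fit
# and interaction estimate, different intercept and main effects
X1c <- X1 - mean(X1)
X2c <- X2 - mean(X2)
summary(lm(Y ~ X1c * X2c))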
