Saturday, March 18, 2017

Bifactor always fits best

Recently, I reviewed a paper in which the authors wanted to test whether a second-order factor model or a bifactor model fitted their data best. They had about 580 observations on 6 items, measuring two factors. With three indicators per factor, each measurement part on its own is just-identified, but with two (correlated) factors the model as a whole is overidentified. Still, I wondered whether their tests would be able to recover the true model. In general, I expect bifactor models to fit better, because they are less restricted versions of second-order factor models. I wondered whether bifactor models fit better even when the true model is a second-order factor model. I performed a small simulation to find out.
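To make the identification argument concrete, here is a back-of-the-envelope degrees-of-freedom count for the two correlated-factor model (plain arithmetic, no assumptions beyond the model described above):

```r
# 6 observed items yield 6 * 7 / 2 = 21 unique variances and covariances.
p <- 6
moments <- p * (p + 1) / 2
# Free parameters in the two correlated-factor model (with factor
# variances fixed to 1): 6 loadings + 6 residual variances + 1 factor covariance.
params <- 6 + 6 + 1
df <- moments - params
df  # 8 degrees of freedom, so the model is overidentified
```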

I generated 100 datasets of 580 observations on 6 items, loading on two first-order factors, which in turn load on a single second-order (general) factor:

library(lavaan)
set.seed(642)
rdata <- list()
for (i in 1:100) {
  # generate the second-order (general) factor:
  G_factor <- rnorm(580)
  # generate the two first-order factors:
  F1_factor <- G_factor + rnorm(580, sd = 1)
  F2_factor <- G_factor + rnorm(580, sd = 1)
  # generate the observed item scores:
  rdata[[i]] <- data.frame(X1 = F1_factor + rnorm(580, sd = .75), X2 = F1_factor + rnorm(580, sd = .75), 
                           X3 = F1_factor + rnorm(580, sd = .75), X4 = F2_factor + rnorm(580, sd = .75),
                           X5 = F2_factor + rnorm(580, sd = .75), X6 = F2_factor + rnorm(580, sd = .75))
}

Then I specified five models to fit to the datasets: a unidimensional (one-factor) model, a model with two correlated factors, a second-order factor model, and two bifactor models (the second with only one specific factor):

library(semPlot)
F1.mod <- '
  F =~ X1 + X2 + X3 + X4 + X5 + X6
'
semPaths(F1.mod)

F2.mod <- '
  F1 =~ X1 + X2 + X3
  F2 =~ X4 + X5 + X6
'
semPaths(F2.mod)

SO.mod <- '
  F1 =~ X1 + X2 + X3 
  F2 =~ X4 + X5 + X6
  # equality constraint on the second-order loadings,
  # needed for identification with only two first-order factors:
  F =~ a*F1 + a*F2
'
semPaths(SO.mod)
 
BF.mod <- '
  G =~ X1 + X2 + X3 + X4 + X5 + X6
  F1 =~ X1 + X2 + X3
  F2 =~ X4 + X5 + X6
'
semPaths(BF.mod)
 
BF2.mod <- '
  G =~ X1 + X2 + X3 + X4 + X5 + X6
  F1 =~ X1 + X2 + X3
'
semPaths(BF2.mod)

Then I fitted the models to each of the datasets and calculated p-values (of the chi-square statistic), CFIs and RMSEAs:

F1.fit <- list()
F2.fit <- list()
SO.fit <- list()
BF.fit <- list()
BF2.fit <- list()
results <- data.frame(dataset = rep(1:100, each = 5), pval = NA, cfi = NA, 
                      rmsea = NA, mod = rep(1:5, times = 100))
fit.ind <- c("pvalue", "cfi", "rmsea")
for (i in 1:100) {
  F1.fit[[i]] <- cfa(F1.mod, data = rdata[[i]], std.lv = TRUE)
  F2.fit[[i]] <- cfa(F2.mod, data = rdata[[i]], std.lv = TRUE)
  SO.fit[[i]] <- cfa(SO.mod, data = rdata[[i]], std.lv = TRUE)
  BF.fit[[i]] <- cfa(BF.mod, data = rdata[[i]], orthogonal = TRUE, std.lv = TRUE)
  BF2.fit[[i]] <- cfa(BF2.mod, data = rdata[[i]], orthogonal = TRUE, std.lv = TRUE)
  results[(i-1)*5+1, 2:4] <- fitmeasures(F1.fit[[i]], fit.ind)
  results[(i-1)*5+2, 2:4] <- fitmeasures(F2.fit[[i]], fit.ind)
  results[(i-1)*5+3, 2:4] <- fitmeasures(SO.fit[[i]], fit.ind)
  results[(i-1)*5+4, 2:4] <- fitmeasures(BF.fit[[i]], fit.ind)
  results[(i-1)*5+5, 2:4] <- fitmeasures(BF2.fit[[i]], fit.ind)
}
results$mod <- rep(c("one-factor", "two-factor", "second-order", "bifactor", "bifactor2"), times = 100)
head(results)
aggregate(results[2:4], by = list(results$mod), FUN = mean, na.rm = TRUE)

Not surprisingly, there were a lot of warnings for the first bifactor model, because it may not be identified. We should take the fit indices for that model with a grain of salt.
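To see how widespread the problems are, we can check convergence for each bifactor fit (a sketch reusing the `BF.fit` list from above):

```r
# lavInspect reports whether the optimizer converged for each fit:
converged <- sapply(BF.fit, lavInspect, what = "converged")
table(converged)
```

Note that a converged solution can still trigger warnings (e.g., negative residual variances), so this only catches part of the problem.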

On average, the p-values, RMSEA and CFI indicate bad fit for the one-factor model, as they should. The two correlated-factor and second-order models fit the data well on average, according to the p-value, CFI and RMSEA. Note that these two models are equivalent, so they should fit the data identically.
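Because the two models are equivalent, their chi-square values should agree within numerical precision. A quick check, reusing the fit lists from above:

```r
# Extract the chi-square statistic for each of the 100 fits of both models:
chisq_F2 <- sapply(F2.fit, fitmeasures, fit.measures = "chisq")
chisq_SO <- sapply(SO.fit, fitmeasures, fit.measures = "chisq")
# TRUE if the two models yield (numerically) identical chi-squares everywhere:
all(abs(chisq_F2 - chisq_SO) < 1e-4)
```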

The most striking finding is that the second bifactor model consistently fits the data better than the second-order factor model, even though the second-order model generated the data!
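The per-dataset picture is the same. The following sketch, reusing the `results` data frame from above (after the model names have been filled in), counts how often the second bifactor model beats the second-order model on each index:

```r
bf2 <- subset(results, mod == "bifactor2")
so  <- subset(results, mod == "second-order")
# Proportion of datasets in which bifactor2 fits better than second-order:
mean(bf2$pval  > so$pval)    # higher p-value = better fit
mean(bf2$cfi   > so$cfi)     # higher CFI     = better fit
mean(bf2$rmsea < so$rmsea)   # lower RMSEA    = better fit
```

The rows of both subsets are ordered by dataset, so the element-wise comparisons pair up the fits on the same data.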

How about the variances of the fit indices: do they differ much between models?

aggregate(results[2:4], by = list(results$mod), FUN = sd, na.rm = TRUE)

They do not differ much. The variance of the p-values is comparable between the bifactor and second-order factor models, the variance of the CFI is somewhat lower for the bifactor models, and the variance of the RMSEA is somewhat lower for the second-order factor model.


Conclusion 

There is not much sense in comparing fit between second-order and bifactor models if we want to know which model generated the data: a bifactor model will often fit better, even when the data were generated by a second-order factor model. I am curious whether this also holds with more indicators per factor.
