Notes on statistics, R and coding: Missing data

Longitudinal studies in psychology often suffer from substantial attrition, which can result in non-ignorable missingness in datasets. There's a whole area of research devoted to what to do with missing data, which I will not try to summarize here. Instead, I am noting down some helpful findings for future reference.

First, a very practical piece of advice taken from Graham (2009): "... if the loss of cases due to missing data is small (e.g., less than about 5%), biases and loss of power are both likely to be inconsequential." (p. 554).
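To check a dataset against this rule of thumb, count the cases that listwise deletion would drop. Here is a minimal sketch in Python; the toy data, the variable names and the helper `fraction_of_cases_lost` are made up for illustration.

```python
# Hypothetical toy data: one dict per participant, None marks a missing value.
participants = [
    {"id": 1, "wave1": 4.0, "wave2": 3.5},
    {"id": 2, "wave1": 2.5, "wave2": None},  # dropped out after wave 1
    {"id": 3, "wave1": 3.0, "wave2": 3.0},
    {"id": 4, "wave1": 5.0, "wave2": 4.5},
]


def fraction_of_cases_lost(rows):
    """Share of rows with at least one missing value (lost under listwise deletion)."""
    incomplete = sum(1 for row in rows if any(value is None for value in row.values()))
    return incomplete / len(rows)


lost = fraction_of_cases_lost(participants)
print(f"{lost:.0%} of cases would be lost to listwise deletion")
print("below the ~5% rule of thumb" if lost < 0.05 else "above the ~5% rule of thumb")
```

In this toy example one of four cases is incomplete, so the loss is well above 5% and the missingness mechanism starts to matter.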

If more than about 5% of the data are missing, the missingness mechanism matters: MCAR is better than MAR, which is better than MNAR.

MCAR: missing completely at random. Missingness is unrelated to both observed and unobserved data. This is great, but very unlikely. "If data are MCAR, then consistent results with missing data can be obtained by performing the analyses we would have used had there been no missing data, although there will generally be some loss of information. In practice this means that, under MCAR, the analysis of only those units with complete data gives valid inferences." (source)
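A small simulation can make this concrete. The sketch below uses made-up data and an arbitrary 30% missingness rate, and deletes values completely at random. The complete-case mean stays close to the full-data mean, and the only cost is a smaller sample.

```python
import random
import statistics

random.seed(1)  # arbitrary seed so the sketch is reproducible

# Simulated outcome with true mean 0 and SD 1 (made-up data).
y = [random.gauss(0, 1) for _ in range(20_000)]

# MCAR: every value has the same 30% chance of being missing,
# regardless of any observed or unobserved quantity.
observed = [value for value in y if random.random() >= 0.30]

full_mean = statistics.fmean(y)
complete_case_mean = statistics.fmean(observed)

print(f"full data:      n={len(y)}, mean={full_mean:.3f}")
print(f"complete cases: n={len(observed)}, mean={complete_case_mean:.3f}")
```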

MAR: missing at random. Missingness depends only on observed data. We can use methods like EM (expectation maximization) and FIML (full-information maximum likelihood) to estimate the model.
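To see why MAR calls for such methods, here is a sketch with simulated data in which an always-observed x drives the missingness in y. All numbers (the .7 correlation, the missingness probabilities) are assumptions chosen for illustration. For this simple pattern, with x complete and y partly missing, the maximum likelihood estimate of the mean of y should reduce to a regression-type correction, so the sketch computes that directly instead of running an iterative EM or FIML routine.

```python
import random
import statistics

random.seed(2)  # arbitrary seed
n = 20_000

# x is always observed; y correlates about .7 with x and has true mean 0.
x = [random.gauss(0, 1) for _ in range(n)]
y = [0.7 * xi + random.gauss(0, 1) * (1 - 0.7**2) ** 0.5 for xi in x]

# MAR: the chance that y is missing depends only on the observed x.
is_observed = [random.random() >= (0.6 if xi > 0 else 0.1) for xi in x]

x_cc = [xi for xi, keep in zip(x, is_observed) if keep]
y_cc = [yi for yi, keep in zip(y, is_observed) if keep]

# Naive estimate: mean of y among complete cases only.
complete_case_mean = statistics.fmean(y_cc)

# Regression-type estimate: fit y ~ x on the complete cases, then shift the
# complete-case mean by the slope times the gap between the two x means.
mx, my = statistics.fmean(x_cc), statistics.fmean(y_cc)
slope = sum((a - mx) * (b - my) for a, b in zip(x_cc, y_cc)) / sum(
    (a - mx) ** 2 for a in x_cc
)
ml_mean = my + slope * (statistics.fmean(x) - mx)

print(f"complete-case mean of y: {complete_case_mean:+.3f}")
print(f"corrected mean of y:     {ml_mean:+.3f}  (true value is 0)")
```

The complete-case mean is pulled away from zero because high-x (and therefore high-y) cases are lost more often, while the estimate that uses x lands close to the true value.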

MNAR: missing not at random. Missingness depends on unobserved data. Now we have a problem: there is no perfect way to deal with this. Check Graham (2009) for a starter on how to deal with this.
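The sketch below illustrates why MNAR is a problem, using simulated data with made-up numbers. The chance that y is missing depends on y itself, and a regression-type correction based on a fully observed x (the kind of adjustment that works when missingness depends only on x) still leaves a clearly biased estimate.

```python
import random
import statistics

random.seed(3)  # arbitrary seed
n = 20_000

# x is always observed; y correlates about .7 with x and has true mean 0.
x = [random.gauss(0, 1) for _ in range(n)]
y = [0.7 * xi + random.gauss(0, 1) * (1 - 0.7**2) ** 0.5 for xi in x]

# MNAR: the chance that y is missing depends on the value of y itself,
# which is exactly the part we do not get to see.
is_observed = [random.random() >= (0.6 if yi > 0 else 0.1) for yi in y]

x_cc = [xi for xi, keep in zip(x, is_observed) if keep]
y_cc = [yi for yi, keep in zip(y, is_observed) if keep]

mx, my = statistics.fmean(x_cc), statistics.fmean(y_cc)
slope = sum((a - mx) * (b - my) for a, b in zip(x_cc, y_cc)) / sum(
    (a - mx) ** 2 for a in x_cc
)

complete_case_mean = my
regression_mean = my + slope * (statistics.fmean(x) - mx)

print(f"complete-case mean of y:   {complete_case_mean:+.3f}")
print(f"regression-corrected mean: {regression_mean:+.3f}  (true value is 0)")
```

Using x helps a little because x carries some information about y, but the estimate does not return to zero.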

We can never be sure that data are MAR or MCAR rather than MNAR. Maybe we haven't observed enough data to find that data were not missing at random. With all the data in the world, we may be able to turn MNAR into MAR, because the causes of missingness would then be observed. But if we had all the data in the world, we wouldn't have missing data to begin with.

Collins, Schafer and Kam (2001) compared restrictive versus inclusive strategies for taking into account missing data. Restrictive is when no or minimal additional variables are used in missing data procedures, and inclusive is when additional variables are used in missing data procedures. Inclusive is always better. Of course, when the amount of missing data is minimal, or data is in fact MAR or MCAR, it never hurts to include additional variables, because they won't really influence imputed values or parameter estimates. This is in line with advice of, for example, Graham (2009), who states 'Use auxiliary variables' as the first strategy to reduce the biasing effects of attrition. Inclusive strategies can be performed both with maximum likelihood estimation (ML; where missingness is taken into account in parameter estimation) and with multiple imputation (MI; where several datasets are created where missing values are imputed). However, Collins et al. (2001) note that ML approaches as implemented in most software tend to encourage restrictive strategies, whereas MI approaches tend to encourage inclusive strategies.

Practical pieces of advice from Collins et al. (2001) for real-world data analysis: "Generally, we found that when missingness does not exceed 25% and the correlation between the cause of missingness and the variable subject to missingness was .4, omitting the cause of missingness from the analysis had a negligible effect. When missingness exceeds 25% or the correlation between the cause of missingness and the variable subject to missingness is as large as .9, our results indicate that there are often substantial problems with bias, efficiency, and coverage." (p.347).
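The difference between the two strategies can be sketched with a toy simulation. This is not a replication of Collins et al. (2001): the data, the roughly 25% missingness mechanism, the simple regression-type estimator and the function name `simulate` are all assumptions made for illustration, so the size of the bias is not comparable to their criteria. The sketch only shows the direction of the effect: omitting the cause of missingness z hurts more as its correlation with y grows, and including z removes the bias.

```python
import random
import statistics


def simulate(r, n=20_000, seed=4):
    """Return (restrictive, inclusive) estimates of the mean of y (true value 0)."""
    rng = random.Random(seed)
    z = [rng.gauss(0, 1) for _ in range(n)]  # cause of missingness
    y = [r * zi + rng.gauss(0, 1) * (1 - r**2) ** 0.5 for zi in z]
    # About 25% missing: y is lost half of the time when z is above average.
    keep = [not (zi > 0 and rng.random() < 0.5) for zi in z]
    z_cc = [zi for zi, k in zip(z, keep) if k]
    y_cc = [yi for yi, k in zip(y, keep) if k]
    mz, my = statistics.fmean(z_cc), statistics.fmean(y_cc)
    slope = sum((a - mz) * (b - my) for a, b in zip(z_cc, y_cc)) / sum(
        (a - mz) ** 2 for a in z_cc
    )
    restrictive = my  # ignores z entirely
    inclusive = my + slope * (statistics.fmean(z) - mz)  # uses z as auxiliary variable
    return restrictive, inclusive


for r in (0.4, 0.9):
    restrictive, inclusive = simulate(r)
    print(f"r={r}: restrictive={restrictive:+.3f}, inclusive={inclusive:+.3f}")
```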

References

Collins, L. M., Schafer, J. L., & Kam, C. M. (2001). A comparison of inclusive and restrictive strategies in modern missing data procedures. Psychological Methods, 6(4), 330-351.

Graham, J. W. (2009). Missing data analysis: Making it work in the real world. Annual Review of Psychology, 60, 549-576.

Wednesday, May 23, 2012
