Thursday, February 27, 2014

Where do AUCs come from?

I became familiar with ROC analysis and AUC values through their use in deriving optimal cut-off values for a questionnaire, for making 'at risk' and 'not at risk' classifications. In such a case, the input is a single (more or less) continuous variable (the test score on the questionnaire), and the output a dichotomous variable (a diagnosis). The AUC then provides an overall measure of how well the questionnaire's test score can discriminate between persons with, and persons without, a diagnosis. I understood this for the bivariate case: cut-offs for the continuous variable can be varied to find the optimal cut-off value for prediction of the dichotomous criterion (the cut-off value with optimal sensitivity and specificity).

I got confused about AUCs for the multiple-predictor case (i.e., when there is a model with several input variables and one output variable), because there are then several (continuous) predictor variables instead of one.


ROC example


In an ROC curve, the true positive rate (or sensitivity) is plotted against the false positive rate (or 1 - specificity). We want the former to be as close to 1 as possible, and the latter to be as close to 0 as possible. Therefore, we want the ROC curve to be as far toward the upper left corner of the graph as possible.
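
To see where the points on the curve come from, here is a minimal sketch in base R (with made-up scores and diagnoses, purely for illustration): each candidate cut-off classifies scores at or above it as 'positive', and yields one (false positive rate, true positive rate) pair, i.e., one point on the curve.

score <- c(1, 2, 3, 4, 5, 6, 7, 8)    # hypothetical test scores
truth <- c(0, 0, 0, 1, 0, 1, 1, 1)    # hypothetical diagnoses (1 = positive)

roc.point <- function(cutoff) {
  pred <- score >= cutoff                            # classify at this cut-off
  c(fpr = sum(pred & truth == 0) / sum(truth == 0),  # 1 - specificity
    tpr = sum(pred & truth == 1) / sum(truth == 1))  # sensitivity
}

# each cut-off contributes one point on the ROC curve:
t(sapply(sort(unique(score)), roc.point))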




To derive the sensitivity and specificity, the cut-off values of a single continuous variable are varied. In the bivariate case, this is just the original continuous variable. In the multivariate case, it is the prediction of a multivariate model: for example, in logistic regression, the model-derived (predicted) probabilities of belonging to the positive class. The R code for an example showing the calculation of the AUC for the bivariate and multivariate case is provided below.


Interpretation of AUC

The AUC has a straightforward interpretation: "When using normalized units, the area under the curve is equal to the probability that a classifier will rank a randomly chosen positive instance higher than a randomly chosen negative one (assuming 'positive' ranks higher than 'negative')." (source: Wikipedia)
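
This interpretation can be checked numerically. The sketch below assumes the self.esteem and diagnosis variables from the R code example further down have been created; it computes the proportion of (positive, negative) pairs in which the positive case has the higher score, counting ties as one half (the Mann-Whitney convention).

pos <- self.esteem[diagnosis]     # scores of cases with a diagnosis
neg <- self.esteem[!diagnosis]    # scores of cases without a diagnosis
# proportion of correctly ranked pairs, with ties counted as 0.5:
mean(outer(pos, neg, ">")) + 0.5 * mean(outer(pos, neg, "==")) 
# this should match auc(roc(diagnosis ~ self.esteem))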


Problems with AUC

However, there are some problems with the AUC. Hand (2009) noted that the AUC uses different misclassification cost distributions for different classifiers. This is problematic, because the relative severity of misclassifications should not depend on the classifier, but on the classification problem. Furthermore, Smits (2010) noted that using Youden's index (the sum of sensitivity and specificity minus one) to select the cut-off value for classification is problematic, as it makes the relative cost of false negatives and false positives dependent on the prevalence.
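
To make the Youden's index issue concrete: pROC's coords(..., "best") maximizes J = sensitivity + specificity - 1 by default, and its best.weights argument lets the chosen cut-off depend explicitly on the relative cost of a false negative and on the prevalence. A sketch, assuming the variables from the R code example below (the cost of 2 and prevalence of 0.3 are illustrative values, not from the example):

r <- roc(diagnosis ~ self.esteem)
cc <- coords(r, "all")                     # sensitivity and specificity per cut-off
j <- cc$sensitivity + cc$specificity - 1   # Youden's J at every cut-off
cc$threshold[which.max(j)]                 # the cut-off that maximizes J
# cost-weighted alternative: a false negative twice as costly as a false
# positive, prevalence 0.3:
coords(r, "best", best.method = "youden", best.weights = c(2, 0.3))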


R code example


library(pROC)

# generate predictor variables: self-esteem (questionnaire score), number of adverse life events, and presence of a depressed parent:
set.seed(25)
self.esteem <- round(rnorm(250, 20, 5))
tmp <- round(rnorm(1000, 0, 5))
adverse.events <- sample(tmp[tmp >= 0], 250)
depressed.parent <- round(runif(250, 0, 1))

# generate error:
error <- round(rnorm(250, 0, 3))

# continuous variable 'depression' is a function of the above variables:
depression <- -5 + 1.8*self.esteem + 1.5*adverse.events + 2.3*depressed.parent + 1.0*error

# dichotomize the continuous depression variable (i.e., depression diagnosis):
diagnosis <- depression >= 37


# bivariate ROC analysis: relationship between self-esteem and depression diagnosis 
plot(roc(diagnosis~self.esteem))
coords(roc(diagnosis~self.esteem), "best")

# in this example, self-esteem can be used for predicting a depression diagnosis, as the AUC > .5; a cut-off value of 18.5 provides the best combination of sensitivity and specificity



# illustration of bivariate ROC analysis: a dichotomized variable has an AUC of 1.0 when predicted with the original continuous variable:
plot(roc(diagnosis~depression))
coords(roc(diagnosis~depression), "best")
# when a continuous variable can discriminate perfectly between two non-overlapping classes, the AUC is 1.0. As expected, the cut-off value providing the best sensitivity and specificity is the one originally used for dichotomizing: 37.
  

# multivariate ROC analysis: relationship between self-esteem, adverse life events, presence of a depressed parent, and depression diagnosis
my.glm <- glm(diagnosis ~ self.esteem + adverse.events + depressed.parent, family="binomial")
predicted.probs <- predict(my.glm, type="response")
plot(roc(diagnosis~predicted.probs))
coords(roc(diagnosis~predicted.probs), "best")

# using self-esteem, the number of adverse events, and presence of a depressed parent as predictors improves predictive accuracy, compared to the bivariate model in which only self-esteem was used for prediction. The cut-off value for the model-derived probability of being in the positive class that provides the best sensitivity and specificity is 0.4004744.
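
# to quantify the improvement, the two AUCs can also be compared directly; as a
# sketch, pROC's roc.test() gives a DeLong test of the difference between two
# paired AUCs:
roc.bi <- roc(diagnosis ~ self.esteem)
roc.multi <- roc(diagnosis ~ predicted.probs)
auc(roc.bi)
auc(roc.multi)
roc.test(roc.bi, roc.multi)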
 





References

Hand, D. J. (2009). Measuring classifier performance: a coherent alternative to the area under the ROC curve. Machine Learning, 77(1), 103-123.

Smits, N. (2010). A note on Youden's J and its cost ratio. BMC Medical Research Methodology, 10(1), 89.
