Wednesday, 13 January 2016

Do BEDROC results correlate with AUC?

Following up on a post over at NM where I was perhaps overly harsh on AUC, it is natural to instead consider early recognition metrics such as EF, BEDROC and so forth as potentially better measures of performance in virtual screens. Here I want to pour cold water on the idea that results for AUC and BEDROC are highly correlated and so we might as well just use the more straightfoward of the two (i.e. AUC).

In a 2010 review of current trends by Geppert, Vogt and Bajorath [1], they interpret the "What do we know and when do we know it?" paper by Nicholls [2] as presenting "evidence for a strong correlation between AUC and BEDROC [3], suggesting AUC as a sufficient measure for virtual screening performance." Now, Nicholls doesn't quite say that but I can see that Figure 8 and 9 in that paper could be interpreted in that way. For the Warren study [4] data shown, the Pearson R2 between the two is 0.901. It is unfortunate that the correlation is only shown for the averaged data (Figure 9) - it would be nice to see the Spearman correlation for Figure 8. But either way, it's clear that they are highly correlated.

So a large AUC tends to imply a large BEDROC (and v.v), and so forth. But the thing is, that's not quite what we are interested in. The real question is whether the relative ordering of methods obtained with AUC are the same as those obtained with BEDROC - if it is, then we might as well use AUC. If not, we'll have to exercise those little grey cells (I've been reading a lot of Agatha Christie recently) and figure out which is best. In other words, it could be that methods A and B both have a high AUC, and so have a high BEDROC, but does A>B in AUC space imply A>B in BEDROC space?

As it happens I have the results to hand (as one does) for AUC (*) and BEDROC(20) for the Riniker and Landrum dataset [5] (88 proteins, 28 fingerprints, 50 repetitions). This is a fingerprint-based virtual screen as opposed to a docking study, but the conclusions should be the same. The 4400 Spearman correlations between the AUC and BEDROC results are shown here as a histogram (bins of 0.05):

The median value of the Spearman correlation is 0.713. This is not that high (remember that the square is 0.51) indicating that while there is a moderate correlate between the orderings obtained by BEDROC and AUC, they are not highly correlated.

One more nail in the AUC's coffin I hope. Grab your hammer and join me!

[1] Current Trends in Ligand-Based Virtual Screening: Molecular Representations, Data Mining Methods, New Application Areas, and Performance Evaluation. Hanna Geppert, Martin Vogt, and Jürgen Bajorath. J. Chem. Inf. Model. 2010, 50, 205–216.
[2] What do we know and when do we know it? Anthony Nicholls. J. Comput. Aided Mol. Des. 2008, 22, 239–255.
[3] Evaluating Virtual Screening Methods: Good and Bad Metrics for the “Early Recognition” Problem. Jean-François Truchon and Christopher I. Bayly. J. Chem. Inf. Model. 2007, 47, 488-508.
[4] A critical assessment of docking programs and scoring functions. Warren GL, Andrews CW, Capelli AM, Clarke B, LaLonde J, Lambert MH, Lindvall M, Nevins N, Semus SF, Senger S, Tedesco G, Wall ID, Woolven JM, Peishoff CE, Head MS. J. Med. Chem. 2006, 49, 5912-5931.
[5] Open-source platform to benchmark fingerprints for ligand-based virtual screening. Sereina Riniker and Gregory A. Landrum. J. Cheminf. 2013, 5, 26.

* I'm lazy so I just used the mean rank of the actives instead of integrating ROC curves. You get an identical ordering (see Eq. 12 in BEDROC paper [3] for the relationship). And yes, I did check.