Wednesday, 13 January 2016

Do BEDROC results correlate with AUC?

Following up on a post over at NM where I was perhaps overly harsh on AUC, it is natural to instead consider early recognition metrics such as EF, BEDROC and so forth as potentially better measures of performance in virtual screens. Here I want to pour cold water on the idea that results for AUC and BEDROC are highly correlated and so we might as well just use the more straightfoward of the two (i.e. AUC).

In a 2010 review of current trends by Geppert, Vogt and Bajorath [1], they interpret the "What do we know and when do we know it?" paper by Nicholls [2] as presenting "evidence for a strong correlation between AUC and BEDROC [3], suggesting AUC as a sufficient measure for virtual screening performance." Now, Nicholls doesn't quite say that but I can see that Figure 8 and 9 in that paper could be interpreted in that way. For the Warren study [4] data shown, the Pearson R2 between the two is 0.901. It is unfortunate that the correlation is only shown for the averaged data (Figure 9) - it would be nice to see the Spearman correlation for Figure 8. But either way, it's clear that they are highly correlated.

So a large AUC tends to imply a large BEDROC (and v.v), and so forth. But the thing is, that's not quite what we are interested in. The real question is whether the relative ordering of methods obtained with AUC are the same as those obtained with BEDROC - if it is, then we might as well use AUC. If not, we'll have to exercise those little grey cells (I've been reading a lot of Agatha Christie recently) and figure out which is best. In other words, it could be that methods A and B both have a high AUC, and so have a high BEDROC, but does A>B in AUC space imply A>B in BEDROC space?

As it happens I have the results to hand (as one does) for AUC (*) and BEDROC(20) for the Riniker and Landrum dataset [5] (88 proteins, 28 fingerprints, 50 repetitions). This is a fingerprint-based virtual screen as opposed to a docking study, but the conclusions should be the same. The 4400 Spearman correlations between the AUC and BEDROC results are shown here as a histogram (bins of 0.05):

The median value of the Spearman correlation is 0.713. This is not that high (remember that the square is 0.51) indicating that while there is a moderate correlate between the orderings obtained by BEDROC and AUC, they are not highly correlated.

One more nail in the AUC's coffin I hope. Grab your hammer and join me!

[1] Current Trends in Ligand-Based Virtual Screening: Molecular Representations, Data Mining Methods, New Application Areas, and Performance Evaluation. Hanna Geppert, Martin Vogt, and Jürgen Bajorath. J. Chem. Inf. Model. 2010, 50, 205–216.
[2] What do we know and when do we know it? Anthony Nicholls. J. Comput. Aided Mol. Des. 2008, 22, 239–255.
[3] Evaluating Virtual Screening Methods: Good and Bad Metrics for the “Early Recognition” Problem. Jean-François Truchon and Christopher I. Bayly. J. Chem. Inf. Model. 2007, 47, 488-508.
[4] A critical assessment of docking programs and scoring functions. Warren GL, Andrews CW, Capelli AM, Clarke B, LaLonde J, Lambert MH, Lindvall M, Nevins N, Semus SF, Senger S, Tedesco G, Wall ID, Woolven JM, Peishoff CE, Head MS. J. Med. Chem. 2006, 49, 5912-5931.
[5] Open-source platform to benchmark fingerprints for ligand-based virtual screening. Sereina Riniker and Gregory A. Landrum. J. Cheminf. 2013, 5, 26.

* I'm lazy so I just used the mean rank of the actives instead of integrating ROC curves. You get an identical ordering (see Eq. 12 in BEDROC paper [3] for the relationship). And yes, I did check.


greg landrum said...

We also looked at this in the benchmarking paper. Figure 2 ( in that shows the correlations between the various evaluation metrics across all of the data and table S1 in the supplementary material ( has the r2 values and RMSEs.

greg landrum said...

Also: I get that you're evaluating the ordering instead of just the numeric values, but I wonder why, when looking at metrics like this, that's really important.

Noel O'Boyle said...

The better the correlations in Figure 2 (where AUC and BEDROC have R^2 of 0.895), I'm sure the better (and narrower) the correlations in my histogram. But just like with the diagram in the Nicholls paper, having a high correlation between the scores for AUC and BEDROC is a necessary, but not sufficient, condition for them to order the fingerprints in the same way.

Unknown said...

If the exercise is to pick 100 compounds for screening out of a set of 100,000 then BEDROC, together with a large exponential weighting factor, is going to make you a lot happier. As you pick more and more compounds the front-weighting of the list becomes less and less important, until absurdly you screen the entire set, at which point the ordering is of no interest. So when would you ever use plain vanilla ROC? And if you're not interested in front weighting, you probably wouldn't use either one. The correlation is self-evident, but irrelevant.

Noel O'Boyle said...

So am I flogging a dead horse? These papers are from 2008/2010 and it could be that everyone has moved on.

Unknown said...

Perhaps the issue itself is old hat, but this is a great example that illustrates that statistics does not always capture usefulness.