Saturday, 28 December 2024

The effect of ties on evaluation of virtual screens

For a virtual screening method whose score is prone to ties (e.g. a fingerprint comparison), care must be taken to handle the ties appropriately or the results will be biased, making either the actives or the inactives appear to rank better than they should.

Let's work through a specific example, starting with some setup code assigning scores to a set of actives and separately to a set of inactives:

import random
random.seed(42)

INACTIVE, ACTIVE = range(2) # flag values: INACTIVE is 0, ACTIVE is 1

def mean_rank_of_actives(data):
    # 'data' is a list of (score, flag) tuples sorted best-first;
    # the rank of an entry is its 0-indexed position in the list.
    ranks = [i for (i, x) in enumerate(data) if x[1]==ACTIVE]
    return sum(ranks)/len(ranks)

if __name__ == "__main__":
    # Generate scores for actives and inactives
    actives = [1.0]*10 + [0.5]*10   # 20 actives, half tied at 0.5
    random.shuffle(actives)
    inactives = [0.5]*10 + [0.0]*70 # 80 inactives, 10 tied with the actives
    random.shuffle(inactives)

I've baked in a large number of ties. Half of the actives have a score of 0.5, a value shared by an equal number of inactives. We will use the mean rank of the actives as a proxy here for ROC AUC; the two are linearly related (see the sketch below).
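Specifically, with the 0-indexed ranks used above, the mean rank can be converted directly to an AUC. Here is a minimal sketch of the conversion (auc_from_mean_rank is a hypothetical helper, not part of the code in this post):

def auc_from_mean_rank(mean_rank, n_actives, n_inactives):
    # The rank sum of the actives, minus its minimum possible value,
    # counts the (active, inactive) pairs where the inactive is ranked
    # ahead of the active; dividing by the number of pairs and
    # subtracting from 1 gives the ROC AUC.
    return 1.0 - (mean_rank - (n_actives - 1)/2) / n_inactives

print(auc_from_mean_rank(9.5, 20, 80))  # 1.0 (every active ranked first)
print(auc_from_mean_rank(12.0, 20, 80)) # 0.96875 (the tie-corrected value)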

Our first attempt at evaluating performance is as follows:

def first(actives, inactives):
    everything = [(x, ACTIVE) for x in actives] + [(y, INACTIVE) for y in inactives]
    everything.sort(reverse=True) # rank the entries - highest score first
    return mean_rank_of_actives(everything)

It turns out that all of the actives are sorted ahead of the inactives, giving an overoptimistic mean rank of 9.5. Why? Because ties are resolved by comparing the second item of the tuples, and ACTIVE (i.e. 1) is greater than INACTIVE (i.e. 0). In fact, swapping the values of these flags changes the result to the overpessimistic value of 14.5. Using text labels instead, e.g. "active" or "inactive", has the same problem.
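The tie-breaking is easy to demonstrate in isolation, since tuples compare element by element and only fall through to the flag when the scores are equal:

print((0.5, ACTIVE) > (0.5, INACTIVE)) # True - the flag decides the tie
# With reverse=True, "greater" sorts first, so every tied active
# is placed ahead of every tied inactive.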

No worries - when we sort, let's make sure we sort using the score values only:

def second(actives, inactives):
    everything = [(x, ACTIVE) for x in actives] + [(y, INACTIVE) for y in inactives]
    everything.sort(reverse=True, key=lambda x:x[0]) # sort on the score only
    return mean_rank_of_actives(everything)

Unfortunately, the result is still 9.5. Python's sort is a stable sort, which means that items retain their original order if there are ties. Since all of the actives come first in the original list, the tied actives still come before the tied inactives after sorting. To get rid of the influence of the original order, we need to shuffle the list into a random order:

def third(actives, inactives):
    everything = [(x, ACTIVE) for x in actives] + [(y, INACTIVE) for y in inactives]
    random.shuffle(everything) # randomise the order of ties before the stable sort
    everything.sort(reverse=True, key=lambda x:x[0])
    return mean_rank_of_actives(everything)

This gives 12.0 +/- 0.67 (from 1000 repetitions), a more accurate assessment of the performance.
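The driver for those numbers is not shown here, but it would look something like the following sketch (assuming the +/- is a sample standard deviation over the shuffled runs):

from statistics import mean, stdev

results = [third(actives, inactives) for _ in range(1000)]
print(f"{mean(results):.1f} +/- {stdev(results):.2f}")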

I think that this is an interesting little problem because it's rather subtle; even when you think you've solved it (as in the second solution above), some other issue rears its head. The only way to be sure is to test with tied data.

Note: If we are interested specifically in handling ties correctly in ranks, a common approach is to assign every member of a tie the same averaged rank (see the sketch below). That's not really the point I want to make here, though. In practice, the evaluation metric might be something quite different, such as enrichment at 1%, which would not be amenable to such a solution.
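For completeness, here is a sketch of that averaged-rank approach in pure Python (average_ranks is a hypothetical helper, not part of the code above):

from itertools import groupby

def average_ranks(scores):
    # Give every member of a tied group the mean of the 0-indexed
    # ranks that the group occupies (best score first).
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    ranks = [0.0] * len(scores)
    pos = 0
    for _, group in groupby(order, key=lambda i: scores[i]):
        group = list(group)
        shared = pos + (len(group) - 1)/2
        for i in group:
            ranks[i] = shared
        pos += len(group)
    return ranks

# Demo on the data from this post: the tie-corrected mean rank of the
# actives comes out as exactly 12.0, with no shuffling required.
scores = [1.0]*10 + [0.5]*10 + [0.5]*10 + [0.0]*70
flags = [ACTIVE]*20 + [INACTIVE]*80
ranks = average_ranks(scores)
active_ranks = [r for (r, f) in zip(ranks, flags) if f == ACTIVE]
print(sum(active_ranks)/len(active_ranks)) # 12.0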
