Friday, 9 October 2020

Comparing methods two-by-two

It is common to compare different methods using results from N distinct datasets. My earlier blogpost described why the mean rank is not a good measure of performance in these cases. Essentially, the relative performance of two methods (e.g. A and B) can be altered based on the performance of other methods (e.g. C, D and E).

But it's not just the mean rank that's the problem. It's the use of any performance measure where the assessment of the pairwise performance (e.g. between methods A and B) can be altered by the performance of other methods.

At the recent (virtual) AI in Chemistry Meeting organised by the RSC, one of the speakers showed an assessment of different methods based on how frequently that method came first relative to the other methods. Is this a reasonable way to assess performance? Let's look at an example...

Consider two methods A and B assessed using this metric on 10 datasets, where A comes first 9 times and B comes first once. Clearly A is better than B, and this is reflected by this metric.

Now let's add a method C to this comparison. It turns out that C does better than A on every dataset but still fails to beat B on the 10th. This means that A never comes first, but B still comes first once. In other words, by adding method C to the comparison, the relative performance of A and B has been inverted according to this metric. Which can't be right - A is still better than B - other methods have nothing to say about this.

So what's the solution? Well, one possibility is to read my previous blog post starting from "So what's the solution?"

Having done so, let's apply that solution. The key point is that it only makes sense to compare the methods pairwise. So let's do so by giving each dataset a vote on which method is best. This is a paired comparison (greater ability to resolve differences). 10 say C>A, 8 (net, see note 1 below) say C>B, and 8 again say A>B. These results are depicted above (see note 2 below). We can summarise this (but lose some information in the general case) with some transitive reduction by removing the C--B edge.

Will this approach catch on? It's tricky because this is one of those areas where the obvious solution seems quite reasonable, whereas the problem is quite subtle, nor have I ever seen it discussed in the field (or any field). Despite this, I will continue to pipe my thoughts directly to /dev/noel here.


1. If you're wondering why 9 x C>B and 1 x B>C leads to a net difference of 8, this is to handle the case of C=B. If it were 9 x C > B and 1 x B = C, the net difference would be 9.

2. This was generated from the following graphviz file using "dot -Tpng -o myfile.png":

digraph D {
C -> A [label="10"]
C -> B [label="8"]
A -> B [label="8"]


James said...

Would using the median of the rank solve some of these problems?

Noel O'Boyle said...

No version of the average will help as they allow methods C, D, etc. to influence the question of whether A > B.