In the fingerprint similarity paper I published last year, I made the following observation: "However the use of the mean rank is itself problematic as the pairwise similarity of two methods can be altered (and even inverted) by adding additional methods to an evaluation." I thought about providing an example illustrating this, but I didn't want to digress too much. And besides, that's perfect blog fodder...
Let's say we want to compare N methods, and we have their scores for 10 diverse datasets. Because they are diverse, we think to ourselves that it wouldn't be an accurate evaluation if we used the scores directly, e.g. taking the mean scores or some such, because some of the datasets are easy (all of the scores are high but spread out) and some are hard (all of the scores are low, but close together). As I reference in the paper, Bob Sheridan has discussed this problem, and among other solutions suggested using the mean rank.
So here's the thing. Consider two methods, A and B, where A outperforms B on 9 out of the 10 datasets, i.e. A has rank 1 nine times and rank 2 once, while vice versa for B. So the mean rank for A is 1.1 while that for B is 1.9, implying that A is better than B. So far so good.
Now let's suppose I start tweaking the parameters of B to generate 9 additional methods (B1 to B9). Unfortunately, while they have similar performance to B, they are all slightly worse on each dataset. So the rank order for 9 of the 10 datasets in now A > B > B1...B9, and that for the 10th dataset is B > B1...B9 > A. So A has rank 1 nine times (as before) and rank 11 once (giving a mean rank of 2.0), while B has the same rank as before. So now B is better than A.
Wait a second. Neither A nor B has changed, so how can our evaluation of their relative performance have changed? And therein lies the problem.
So what's the solution? Even a cat knows the answer: as I stated at the start, on 9 out of 10 datasets A outperformed B. So A is better than B, or A>B. It's as simple as that. We don't need to calculate the mean anything. Our goal is not to find a summary value for the overall performance for a method, and then to compare those. It is to evaluate relative performance, and for diverse datasets each dataset has an equal vote towards that answer. So on dataset 1 (taking a different imaginary dataset), each method dukes it out giving A>B, A>C, A>D, B>C, B>D, C=D. Then round 2, on dataset 2. Toting up the values over all datasets will yield something like A>B 10 times, while B>A 5 times, and A=B twice, so A>B overall (and similar results for other pairwise comparisons).
The nice thing about all of these pairs is that you can then plot a Hasse diagram to summarise the info. There are some niggly details like handling incomparable values (e.g. A>B, B>C, C>A) but when it's all sorted out, the resulting diagram is quite information-rich.
I didn't mention statistical significance, but the assessment of whether you can say A>B overall requires a measurement of significance. In my case, I could generate multiple datasets from the same population, and so I used a a basic T-test over the final results from each of the datasets (and corrected for multiple testing).
I've gone on and gone on a bit about what's really quite a simple idea. In fact, it's so simple that I was sure it must be how people already carry out performance evaluations across diverse datasets. However, I couldn't find any evidence of this, nor could I find any existing description of this method. Any thoughts?
Saturday, 14 January 2017
So let's fix the world.
What is written down (i.e. on the Daylight website, the OpenSMILES spec) is the following:
1. For atoms in brackets, the number of hydrogens is either listed or zero. e.g. [Na] has 0 zeros, [NaH5] has 5.
2. For atoms outside brackets (which means they must be in the so-called 'organic subset'), if they are not written lowercase (indicating aromaticity), the number of hydrogens is described by the SMILES valence model (see the OpenSMILES specification for the details), e.g. the carbons in CC have 3 implicit hydrogens.
What is not written down (apart from references to pyrrole-/pyridine-type hydrogens) is what to do about (unbracketed) aromatic atoms when reading. These should be handled as follows (as far as I can tell - see also John's investigations below):
1. For elements other than carbon, there are zero implicit hydrogens. (The corrolary is that if there were a hydrogen present, it would be bracketed.)
2. For carbon, apply the valence model of rounding up to 3 explicit neighbours. In practice, this means either 1 hydrogen present if 2 neighbours, and 0 otherwise (for 3 neighbours).
Probably because of the lack of clarity around these latter rules, not every writer or reader follows them, and Wikipedia in particular is rife with generated SMILES from
This raises the question as to what to do when presented with, for example, the SMILES c1cncc1? A correct SMILES for pyrrole would be c1c[nH]cc1. As written, the first SMILES is not kekulisable according to the Daylight aromaticity model (a neutral 'n' without a hydrogen must have a double bond, or alternatively, radicals contibute only a single electron) but one could infer that the structure intended was pyrrole. This is a slippery slope, though, once you consider aromatic rings with multiple nitrogens where it may not be possible to unambiguously assign hydrogens. Also, I would argue that it is not the job of a reader to change the structure because "it knows best".
For related reading, see John Mayfield's posts on SMILES implicit valence of aromatic atoms and New SMILES behaviour - parsing (CDK 1.5.4). Also worth noting is that at no point in identifying hydrogen locations was determination of aromatic systems or kekulisation required. Of course, if you go down the route of editing erroneous structures that may be a different story.
Image credit: Licensed CC-BY-NC by Sean Davis (image on Flickr)