Tuesday, 17 September 2019

Least Publishable Unit #3 - Which similarity metric is best?

Following on from #1 and #2, here's some work I began (maybe twice) but never completed.

Back in 2016 I published a paper entitled "Which fingerprint is best?" - actually, no, someone talked me down into the title "Comparing structural fingerprints using a literature-based similarity benchmark", a title which accurately describes what we did, but not why we did it. Anyway, it's work I'm very proud of - for example, it shows there is a detectable performance difference between 16384 and 4096 bit ECFP fingerprints, and that the fingerprints most appropriate for finding very close analogs are different than for those more distant.

The obvious follow-up to this work would be compare similarity metrics, for some particular fingerprint, e.g. the 16K ECFP4 fingerprint. There's a great paper by Todeschini et al that compares 51 similarity metrics and has a nice table where the equations are all given in one place. So I took these as the starting point for my own work.

After coding them all up and running them against the benchmark, I found that a number of the metrics gave identical rank orders across the entire dataset. That part wasn't surprising - the paper itself notes that several metrics are correlated, and the Riniker and Landrum paper has an appendix with a worked proof that shows that Tanimoto and Dice are correlated. What was surprising is that the ones I was finding as correlated were not necessarily the same as those in the paper (there was some overlap, but...).

Regarding the main reason for doing the paper, Tanimoto surprisingly (or not) did turn out to be one of the best (if not the best, I can't remember exactly). Perhaps more interesting were the results I got from looking at the effect of changing the weighting of the Tversky similarity; I can't remember the details but the best results were not where I expected them to be, and I never got to the bottom of why.


Thomas Evangelidis said...

So what do you suggest? Taking the average of CT4, HL, and JT (namings according to Todeschini et al) as a better, unbiased similarity metric?

Noel O'Boyle said...

If I had finished the work I would probably have some suggestions.