Saturday 24 October 2020

The SMILES reading benchmark - two years on

In August 2017, after attending an InChI meeting at the NIH in Bethesda, I had the idea of putting together a SMILES reading benchmark. I already had the bones of one to test my rewrite of Open Babel's reading of aromatic SMILES, but after attending a workshop led by Greg Landrum on Open File Formats for Chemical Information I decided to tidy it up and broaden the scope.

My goals were to identify issues affecting interoperability, to resolve those issues by working with developers, and to provide a resource to help future implementations avoid problems. This last goal has recently been realised through Rich Apodaca's work on a Rust-based SMILES parser where he gives an extensive write-up on the role of the SMILES benchmark. The benchmark has also been of use to the IUPAC SMILES+ project, which grew out of Greg's workshop at the NIH and is led by Vin Scalfani.

Results and progress were described in a poster at the ICCS in June 2018, and subsequently (with updates) at the ACS in Aug of that year in "A de facto standard or a free-for-all? A benchmark for reading SMILES". I've thought about writing up a paper but I was never really keen - the point wasn't to write a paper, or point out software that had problems, but to improve SMILES. 

Back in the heady days of 2017-18, my approach with the benchmark was to work with, or at least nudge, various software vendors/developers towards improved interoperability. A tricky task when I worked for a software vendor myself, was a developer of a cheminformatics toolkit, and was sometimes neither a customer nor a user. Despite this, the benchmark was reasonably successful...but not completely, and two years down the line I find myself in a different environment relying on different tools, and wondering if some more nudging in the right direction might help.

In this spirit, let's take a look at an example from the ChemDraw results in the benchmark (to be found here), illustrate the problem and work out the solution by hand.

Figure 1 (left) shows entry 26359 in the benchmark. The CDK generates the following aromatic SMILES for this structure: c1(=O)c2c(c3=c1n(nco3)C)cccc2. However, when this SMILES is pasted into ChemDraw, the depiction in Figure 1 (middle) is obtained, which resolves to the structure on the right on hitting Alt+K. No error or warning appeared that might indicate problems when reading the SMILES.

Figure 1

Now let's do this by hand. Figure 2 shows the structure as described by the SMILES string. A key point to note/remember is that a SMILES string exactly describes the hydrogen count on every atom - we 'just' need to work out the bond orders of the aromatic bonds making sure that every atom that needs a double bond gets exactly one.

Figure 2

For the actual details of the algorithm, check out the source code of Open Babel or my partialsmiles project (also the CDK, but that's a different algorithm than described here). But you can think of it like solving Minesweeper - to begin with we tackle the bits we are sure about, before we have to start guessing. The two bonds to the carbonyl carbon must be single bonds; ditto for the bonds to NMe, and to the O in the ring (see here for some details). The remaining bonds to be kekulized are shown in black in Figure 3 (left):

Figure 3

We'll call this point A. Each of the remaining black atoms needs to have a double bond. But which to start with? If we put the first double bond in the wrong place we might end up having to start over. Again, you should start with those you are certain about - and that's those black atoms that have a single black bond. This must be a double bond. Once you've placed those, set the other neighbouring bonds to single, and updated the list of atoms that need a double bond, your structure will look like Figure 3 (middle). 

At this point, there are no black atoms with just a single black bond, so it's time to guess: just choose one and place a double bond. Now update the list of atoms that need a double bond, and go back to point A. Keep repeating until all the bonds are kekulized...or there are no bonds left to choose.

For more than 95% of the cases in the benchmark this will result in a kekulized structure. For the remaining cases, you instead end up with a pair of black atoms that don't have a double bond. To fix this, do a DFS to find an alternating path ('augmenting path') that joins them, and then flip the bond orders along the path. For example, consider the situation below, where I started by placing the double bond along the bond joining the 6-membered rings. To fix, just flip the bond orders from C-C=C-C to C=C-C=C.

Figure 4

The described procedure will successfully kekulize any structure that can be kekulized. Feel free to reach out if you have any questions.

Sunday 11 October 2020

Finding matched pairs of a peptide at the RDKit UGM

The recent RDKit UGM was a masterclass in how to organise a conference virtually, successfully replicating at least some of the in-person experience. This was due to the extensive use of Discord (best known as a chat server for gamerz) to manage questions, answers, discussion and networking, but also the technical support for Discord (thanks to Floriane Montanari) and Zoom (thanks to Christiane from Knime). With previous virtual meetings I have attended, the meeting only had an existence while someone was speaking; here discussions filled the interims between, and indeed the duration of, the talks.

I contributed a lightning talk to the meeting entitled "An efficient algorithm to find matched pairs of a peptide". Somehow I managed to give a talk on peptides without showing any peptide structures, which I'll blame on the 5 minute time limit and not on a perverse sense of humour.

Friday 9 October 2020

Comparing methods two-by-two

It is common to compare different methods using results from N distinct datasets. My earlier blogpost described why the mean rank is not a good measure of performance in these cases. Essentially, the relative performance of two methods (e.g. A and B) can be altered based on the performance of other methods (e.g. C, D and E).

But it's not just the mean rank that's the problem. It's the use of any performance measure where the assessment of the pairwise performance (e.g. between methods A and B) can be altered by the performance of other methods.

At the recent (virtual) AI in Chemistry Meeting organised by the RSC, one of the speakers showed an assessment of different methods based on how frequently that method came first relative to the other methods. Is this a reasonable way to assess performance? Let's look at an example...

Consider two methods A and B assessed using this metric on 10 datasets, where A comes first 9 times and B comes first once. Clearly A is better than B, and this is reflected by this metric.

Now let's add a method C to this comparison. It turns out that C does better than A on every dataset but still fails to beat B on the 10th. This means that A never comes first, but B still comes first once. In other words, by adding method C to the comparison, the relative performance of A and B has been inverted according to this metric. Which can't be right - A is still better than B - other methods have nothing to say about this.

So what's the solution? Well, one possibility is to read my previous blog post starting from "So what's the solution?"

Having done so, let's apply that solution. The key point is that it only makes sense to compare the methods pairwise. So let's do so by giving each dataset a vote on which method is best. This is a paired comparison (greater ability to resolve differences). 10 say C>A, 8 (net, see note 1 below) say C>B, and 8 again say A>B. These results are depicted above (see note 2 below). We can summarise this (but lose some information in the general case) with some transitive reduction by removing the C--B edge.

Will this approach catch on? It's tricky because this is one of those areas where the obvious solution seems quite reasonable, whereas the problem is quite subtle, nor have I ever seen it discussed in the field (or any field). Despite this, I will continue to pipe my thoughts directly to /dev/noel here.


1. If you're wondering why 9 x C>B and 1 x B>C leads to a net difference of 8, this is to handle the case of C=B. If it were 9 x C > B and 1 x B = C, the net difference would be 9.

2. This was generated from the following graphviz file using "dot -Tpng -o myfile.png":

digraph D {
C -> A [label="10"]
C -> B [label="8"]
A -> B [label="8"]