Saturday, 24 October 2020

The SMILES reading benchmark - two years on

In August 2017, after attending an InChI meeting at the NIH in Bethesda, I had the idea of putting together a SMILES reading benchmark. I already had the bones of one to test my rewrite of Open Babel's reading of aromatic SMILES, but after attending a workshop led by Greg Landrum on Open File Formats for Chemical Information I decided to tidy it up and broaden the scope.

My goals were to identify issues affecting interoperability, to resolve those issues by working with developers, and to provide a resource to help future implementations avoid problems. This last goal has recently been realised through Rich Apodaca's work on a Rust-based SMILES parser where he gives an extensive write-up on the role of the SMILES benchmark. The benchmark has also been of use to the IUPAC SMILES+ project, which grew out of Greg's workshop at the NIH and is led by Vin Scalfani.

Results and progress were described in a poster at the ICCS in June 2018, and subsequently (with updates) at the ACS in Aug of that year in "A de facto standard or a free-for-all? A benchmark for reading SMILES". I've thought about writing up a paper but I was never really keen - the point wasn't to write a paper, or point out software that had problems, but to improve SMILES. 


Back in the heady days of 2017-18, my approach with the benchmark was to work with, or at least nudge, various software vendors/developers towards improved interoperability. A tricky task when I worked for a software vendor myself, was a developer of a cheminformatics toolkit, and was sometimes neither a customer nor a user. Despite this, the benchmark was reasonably successful...but not completely, and two years down the line I find myself in a different environment relying on different tools, and wondering if some more nudging in the right direction might help.

In this spirit, let's take a look at an example from the ChemDraw results in the benchmark (to be found here), illustrate the problem and work out the solution by hand.

Figure 1 (left) shows entry 26359 in the benchmark. The CDK generates the following aromatic SMILES for this structure: c1(=O)c2c(c3=c1n(nco3)C)cccc2. However, when this SMILES is pasted into ChemDraw, the depiction in Figure 1 (middle) is obtained, which resolves to the structure on the right on hitting Alt+K. No error or warning appeared that might indicate problems when reading the SMILES.

Figure 1

Now let's do this by hand. Figure 2 shows the structure as described by the SMILES string. A key point to note/remember is that a SMILES string exactly describes the hydrogen count on every atom - we 'just' need to work out the bond orders of the aromatic bonds making sure that every atom that needs a double bond gets exactly one.

Figure 2

For the actual details of the algorithm, check out the source code of Open Babel or my partialsmiles project (also the CDK, but that's a different algorithm than described here). But you can think of it like solving Minesweeper - to begin with we tackle the bits we are sure about, before we have to start guessing. The two bonds to the carbonyl carbon must be single bonds; ditto for the bonds to NMe, and to the O in the ring (see here for some details). The remaining bonds to be kekulized are shown in black in Figure 3 (left):

Figure 3

We'll call this point A. Each of the remaining black atoms needs to have a double bond. But which to start with? If we put the first double bond in the wrong place we might end up having to start over. Again, you should start with those you are certain about - and that's those black atoms that have a single black bond. This must be a double bond. Once you've placed those, set the other neighbouring bonds to single, and updated the list of atoms that need a double bond, your structure will look like Figure 3 (middle). 

At this point, there are no black atoms with just a single black bond, so it's time to guess: just choose one and place a double bond. Now update the list of atoms that need a double bond, and go back to point A. Keep repeating until all the bonds are kekulized...or there are no bonds left to choose.

For more than 95% of the cases in the benchmark this will result in a kekulized structure. For the remaining cases, you instead end up with a pair of black atoms that don't have a double bond. To fix this, do a DFS to find an alternating path ('augmenting path') that joins them, and then flip the bond orders along the path. For example, consider the situation below, where I started by placing the double bond along the bond joining the 6-membered rings. To fix, just flip the bond orders from C-C=C-C to C=C-C=C.

Figure 4

The described procedure will successfully kekulize any structure that can be kekulized. Feel free to reach out if you have any questions.

No comments: