Thursday, 2 November 2017

Announcing more closures

It is a good idea to avoid extending an existing format. If additional information needs to be stored, many formats provide a means to store it separately so that software that does not support the extension will still be able to read and process the core information. For example, with SMILES, ChemAxon have famously used the title field to store additional info.

However, this is not always possible if one runs up against a fundamental limitation of a format.

I recently ran into a situation where I needed to deal with the fact that the SMILES syntax only supports at most two digit ring closure numbers. This is the %nn notation. The problem is that %111 does not mean ring closure 111 but rather ring closures 11 and 1. If you need a three digit number you are out of luck - it simply can't be expressed as a SMILES string.

So, is it really necessary to have this ability in the first place? Daylight didn't think so. And they were mainly right - when writing SMILES, if you reuse the numbers and pick a traversal order that closes rings as soon as possible then it may be possible to avoid requiring three digits (though a diamond or graphene substructure may cause problems - a refutable hypothesis).

But the thing is, any particular implementation may not be reusing ring closure numbers, or may be favoring a different traversal order (e.g. nicer looking SMILES are made by following ring substituents first). Indeed, remembering that this syntax can be used to create a bond between any two atoms, you may wish to use the ring closure syntax as a means to construct a molecule, and there you don't want to be limited to just 100 bonds; e.g. by stitching together a large molecule by combining shorter SMILES strings (see SmiLib for example, and the note under Restrictions), or via this SMILES hack of Andrew's.

When Bob Hanson of Jmol had to deal with this issue he settled on the syntax "%(number)" (discussed here in 2012, and more recently in J. Cheminf. 2016, 8, 50). One alternative might be "%%", but again this would be limited, this time to three digits. Might as well come up with a general solution as Bob has done. It uses an extra character for the three-digit case but this is going to be rare in the first place, and these SMILES are going to be so big that having a few more characters is not going to break the bank.

Support for this syntax has recently been added to RDKit and Open Babel. For more info see the associated pull requests (here and here). Hopefully we will start to see support for this more generally in other toolkits.

1. According to the Jmol docs, "%(n)" supports any non-negative number (integer, presumably?). My implementation supports up to 5 digits.

Image credit:
image by InExtremiss on Flickr licensed CC-BY