Monday 28 August 2017

My ACS talk on Kekulization and aromatic SMILES

Here are the slides for the talk I presented last week at the ACS meeting in Washington. It describes my understanding of the Daylight toolkit as deduced by John:
The fundamental reason for attempting to describe the behaviour of the Daylight toolkit is the assumption that it defines the correct way to read and write SMILES, and that any deviation is either wrong or at least needs to be justified. The background is that I recently worked on kekulization and SMILES reading in Open Babel, but found that many of the details were missing from the Daylight documentation. As I say on the final slide, please let me know if you feel I have made any mistakes. You can do this by email or by leaving a comment below.

Over at the NextMove blog, Peter Shenkin brought up the biphenylene case, which (to my mind) illustrates alternative approaches to reading aromatic SMILES. Consider the SMILES string c12ccccc1c3ccccc23. Some toolkits may read this, work out that only the two six-membered rings can be aromatic, and then make sure that the double bonds are not placed in the four-membered ring. I refer to this approach as dearomatisation; Open Babel used to work this way. It involves ring detection, 4n+2 counting and so forth. Apart from taking some time, an obvious problem is that the reader and the writer may use different aromaticity models. The reader may then drop aromaticity from a particular ring, typically by setting those bonds to single bonds and adjusting hydrogens, resulting in a different structure from the one intended.
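
To make the "4n+2 counting" step concrete, here is a toy sketch in Python. It is not taken from Open Babel or any other toolkit: the rings are listed by hand rather than detected, the names `satisfies_huckel` and `huckel_report` are my own, and I assume each aromatic carbon contributes one π electron.

```python
def satisfies_huckel(pi_electrons):
    """True if the electron count has the form 4n + 2 for some integer n >= 0."""
    return pi_electrons >= 2 and (pi_electrons - 2) % 4 == 0


# Rings of biphenylene, written by hand (no ring detection here).
# Atom indices follow the order of the atoms in c12ccccc1c3ccccc23.
RINGS = {
    "six-membered ring A": [0, 1, 2, 3, 4, 5],
    "six-membered ring B": [6, 7, 8, 9, 10, 11],
    "four-membered ring": [0, 5, 6, 11],
}


def huckel_report(rings, pi_per_atom=1):
    """Apply the 4n + 2 test to each ring (assumption: one pi electron per atom)."""
    return {
        name: satisfies_huckel(len(atoms) * pi_per_atom)
        for name, atoms in rings.items()
    }


REPORT = huckel_report(RINGS)
for name, ok in REPORT.items():
    print(name, "passes 4n+2" if ok else "fails 4n+2")
```

Under these assumptions only the two six-membered rings pass, which is the conclusion a dearomatising reader relies on. A reader with a different electron-counting model could reach a different conclusion.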

In any case, this is not the approach used by the Daylight toolkit, which did not consider 4n+2, or even detect cycles. The approach is described in the talk above so I won't repeat it here. For the SMILES above, I believe that it would generate one of two Kekulé forms depending on the atom order; one with the two double bonds in the four-membered ring, and one with two benzenes. It's for this reason that Daylight would never generate that SMILES for biphenylene ("don't generate aromatic SMILES that you can't kekulize"), but always write a single bond symbol for the bonds connecting the phenyl rings (e.g. something like c12-c3c(-c2cccc1)cccc3). When written that way, kekulization always gives the desired form.
This use of a single bond symbol is rather subtle. When writing an aromatic SMILES, the rule is "use a single bond symbol when a ring bond is between aromatic atoms but is not itself aromatic". This corrects for the fact that all default bonds joining aromatic atoms are themselves considered aromatic (except where the bond is not in a ring but the atoms are).
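
The effect of the single bond symbols can be checked with a small brute-force sketch in Python. To be clear, this is not the Daylight algorithm (which is described in the slides). It simply enumerates every Kekulé form as a perfect matching over the bonds that are allowed to become double. The bond list is written out by hand from the SMILES string, and the name `kekule_forms` is my own.

```python
# Biphenylene, atoms numbered in the order they appear in c12ccccc1c3ccccc23.
CHAIN = [(i, i + 1) for i in range(11)]      # bonds between adjacent atoms in the string
RING_CLOSURES = [(0, 5), (0, 11), (6, 11)]   # ring-closure digits 1, 2 and 3
BONDS = CHAIN + RING_CLOSURES
INTER_RING = {(5, 6), (0, 11)}               # the two bonds joining the six-membered rings


def kekule_forms(n_atoms, candidate_bonds):
    """Return every set of double bonds in which each atom has exactly one."""
    neighbours = {a: set() for a in range(n_atoms)}
    for a, b in candidate_bonds:
        neighbours[a].add(b)
        neighbours[b].add(a)

    forms = []

    def extend(unmatched, chosen):
        if not unmatched:
            forms.append(frozenset(chosen))
            return
        a = min(unmatched)  # the lowest unmatched atom must get a double bond
        for b in sorted(neighbours[a] & unmatched):
            extend(unmatched - {a, b}, chosen + [(a, b)])

    extend(frozenset(range(n_atoms)), [])
    return forms


# Case 1: every bond between aromatic atoms is treated as aromatic.
ALL_AROMATIC = kekule_forms(12, BONDS)
# Case 2: the inter-ring bonds carry a single bond symbol, so they can never be double.
WITH_SINGLE_SYMBOLS = kekule_forms(12, [b for b in BONDS if b not in INTER_RING])

UNWANTED = [form for form in ALL_AROMATIC if form & INTER_RING]
print(len(ALL_AROMATIC), "forms without single bond symbols,", len(UNWANTED), "unwanted")
print(len(WITH_SINGLE_SYMBOLS), "forms with single bond symbols")
```

In the first case at least one matching puts double bonds on the inter-ring bonds, so which form a reader ends up with depends on how it walks the atoms. In the second case no such matching exists, whatever the atom order.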

Following up on a comment by Rajarshi, while differences in aromaticity models are a problem for 'dearomatisation' algorithms, they are not a problem for the kekulization algorithm used by Daylight. So long as the structure is kekulizable (and appropriate single bond symbols are used), it can be read in without loss of information no matter what aromaticity model is used.