Saturday, 14 January 2017

Counting hydrogens in a SMILES string - The Rules

Hydrogens are not usually listed explicitly in a SMILES string, but instead can be inferred from a set of rules. To be precise, when reading a SMILES, the position of every hydrogen is known unambiguously once you know the rules. Oh - did I forget to mention the catch? - those rules are not written down anywhere. Nice.

So let's fix the world.

What is written down (i.e. on the Daylight website, the OpenSMILES spec) is the following:
1. For atoms in brackets, the number of hydrogens is either listed or zero, e.g. [Na] has 0 hydrogens, [NaH5] has 5.
2. For atoms outside brackets (which means they must be in the so-called 'organic subset'), if they are not written lowercase (indicating aromaticity), the number of hydrogens is described by the SMILES valence model (see the OpenSMILES specification for the details), e.g. each carbon in CC has 3 implicit hydrogens.
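
To make the second rule concrete, here is a minimal Python sketch of the valence lookup for unbracketed, non-aromatic atoms. The table follows the normal valences listed for the organic subset in the OpenSMILES specification. The names NORMAL_VALENCES and implicit_hydrogens are made up for illustration and do not come from any toolkit, and returning zero when the bond order sum exceeds every normal valence is an assumption of this sketch.

```python
# Normal valences of the SMILES "organic subset" (per the OpenSMILES spec).
# Hypothetical names: this is an illustrative sketch, not a toolkit API.
NORMAL_VALENCES = {
    "B": (3,), "C": (4,), "N": (3, 5), "O": (2,),
    "P": (3, 5), "S": (2, 4, 6),
    "F": (1,), "Cl": (1,), "Br": (1,), "I": (1,),
}


def implicit_hydrogens(element, bond_order_sum):
    """Implicit hydrogen count for an unbracketed, non-aromatic atom.

    Picks the lowest normal valence that is at least the bond order sum
    and fills the remainder with hydrogens.
    """
    for valence in NORMAL_VALENCES[element]:
        if valence >= bond_order_sum:
            return valence - bond_order_sum
    # Assumption: an atom already beyond its highest normal valence
    # gets no implicit hydrogens.
    return 0


if __name__ == "__main__":
    print(implicit_hydrogens("C", 1))  # each carbon in CC
    print(implicit_hydrogens("N", 4))  # nitrogen falls through to valence 5
```

For each carbon in CC the bond order sum is 1, so the function returns 4-1 = 3, matching the example in the list above.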

What is not written down (apart from references to pyrrole-/pyridine-type hydrogens) is what to do about (unbracketed) aromatic atoms when reading. These should be handled as follows: [Updated 03/Aug/2017]
1. Calculate the bond order sum by treating aromatic bonds as single bonds
2. Apply normal SMILES implicit valence rules using this sum
3. Subtract one from the number of implicit hydrogens, if there are any

As an example, consider aromatic C and N as in pyridine, c1ccncc1. The bond order sum for both the c and the n is 2. Their implicit valences are 4 and 3 respectively, and so the number of implicit hydrogens would be 4-2 and 3-2 respectively. After subtracting one, this gives 1 and 0, respectively.

As another example, consider 'cn'. Without worrying too much about whether this is a sensible SMILES string, we can still read off the number of hydrogens based on the rules above: 4-1 on the carbon, and 3-1 on the nitrogen, then subtract one from each, and so [CH2] and [NH] connected by an aromatic bond (if kekulized, this should give H2C=NH).

As a final example, consider the nitrogen of c1ccn(C)c1: 3-3 on the nitrogen gives 0, and we can't do any further subtraction.
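
The three steps for unbracketed aromatic atoms can be sketched in Python as below. This is only an illustration under stated assumptions: the caller is assumed to have already computed the bond order sum with every aromatic bond counted as 1, the valence table is taken from the OpenSMILES organic subset, and the function names are hypothetical rather than part of any toolkit.

```python
# Normal valences for the aromatic organic-subset atoms, keyed by the
# capitalised element symbol (values as in the OpenSMILES spec).
# Hypothetical names: an illustrative sketch, not a toolkit API.
NORMAL_VALENCES = {
    "B": (3,), "C": (4,), "N": (3, 5), "O": (2,),
    "P": (3, 5), "S": (2, 4, 6),
}


def aliphatic_implicit_hydrogens(element, bond_order_sum):
    """Step 2: the normal SMILES implicit valence rule."""
    for valence in NORMAL_VALENCES[element]:
        if valence >= bond_order_sum:
            return valence - bond_order_sum
    return 0  # assumption: no hydrogens beyond the highest normal valence


def aromatic_implicit_hydrogens(symbol, bond_order_sum):
    """Implicit hydrogens on an unbracketed aromatic atom such as 'c' or 'n'.

    bond_order_sum must already treat aromatic bonds as single (step 1).
    """
    count = aliphatic_implicit_hydrogens(symbol.capitalize(), bond_order_sum)
    # Step 3: subtract one hydrogen, but only if there is one to subtract.
    return count - 1 if count > 0 else 0


if __name__ == "__main__":
    # (symbol, bond order sum) pairs taken from the worked examples.
    for symbol, bonds in [("c", 2), ("n", 2), ("c", 1), ("n", 1), ("n", 3)]:
        print(symbol, bonds, aromatic_implicit_hydrogens(symbol, bonds))
```

Running it on the cases above gives 1 and 0 for the pyridine c and n, 2 and 1 for the atoms of 'cn', and 0 for the substituted nitrogen of c1ccn(C)c1. Note that nothing here needs ring perception or kekulisation: the symbol and the bond order sum are enough.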

Probably because of the lack of clarity around these latter rules, not every writer or reader follows them, and Wikipedia in particular is rife with generated SMILES from REDACTED which writes lowercase n whether or not the nitrogen has a hydrogen present. Some toolkits fail to read such SMILES, or interpret them differently than intended.

This raises the question of what to do when presented with, for example, the SMILES c1cncc1. A correct SMILES for pyrrole would be c1c[nH]cc1. As written, the first SMILES is not kekulisable according to the Daylight aromaticity model (a neutral 'n' without a hydrogen must have a double bond, or alternatively, radicals contribute only a single electron) but one could infer that the structure intended was pyrrole. This is a slippery slope, though, once you consider aromatic rings with multiple nitrogens where it may not be possible to unambiguously assign hydrogens. Also, I would argue that it is not the job of a reader to change the structure because "it knows best".

For related reading, see John Mayfield's posts on "SMILES implicit valence of aromatic atoms" and "New SMILES behaviour - parsing (CDK 1.5.4)". Also worth noting is that at no point in identifying hydrogen locations was determination of aromatic systems or kekulisation required. Of course, if you go down the route of editing erroneous structures, that may be a different story.

Image credit: Licensed CC-BY-NC by Sean Davis (image on Flickr)
