Saturday 23 September 2017

How many cheminformaticians does it take to read a SMILES string?

A few days ago I posted the following poll question on Twitter:
How many Hs are on the N in the molecule described by this SMILES string? N(C)(C)(C)C
I provided four possible answers:
  • Zero
  • One
  • Depends
  • Can't say as no such molecule
Please take a moment to choose your own answer. To guide you, here is Inigo Montoya.
The results of the survey were interesting. I selected the answers to highlight common misconceptions, and indeed each was chosen at least as much as the correct answer. In fact, only two people got the correct answer, and one of those people shares the office with me.

First of all, it's not "can't say as no such molecule" (8/25 people). To even say whether any such molecule exists, one needs to read the SMILES string, and so could answer the question. Also, the scope of SMILES strings is any molecule that can be represented by a valence model; that is, it is not restricted to only molecules that exist. I think what people were really indicating here is that they work back from the molecule to answer how many hydrogens it must have; given that they couldn't work out what molecule it is, they weren't able to count the hydrogens.

And what about "depends"? (2 people) I'm guessing that people chose this answer to indicate that it's either 0 or 1 (or something else) but it's not possible to be certain. This might be where people are trying to second guess what the corret molecule might have been, given that the one presented is clearly wrong in some way. Here, the key point that people are missing is that a SMILES string exactly represents a molecule including all of its hydrogens.

But what is that hydrogen count? 0 or 1? Well, 13 people went for zero and 2 for one.

Which highlights the final misconception, that the number of hydrogens on an atom in a SMILES string is the same as the answer to "how many hydrogens would I guesstimate if I didn't know how many were attached?". The answer to that is probably zero here. However, when reading SMILES, the number of hydrogens on the nitrogen is exactly described by a set of rules outlined by Daylight at [1], and somewhat more clearly by OpenSMILES at [2] (a project which was spearheaded by former Daylight developer, Craig James). The rules are simple and have no special cases. For nitrogen, the rule is add hydrogens to round up the valence to 3 (if valence <= 3), or up to 5 (if 3 < valence <= 5), and zero otherwise. In short, one hydrogen should be added.

If at this point you're thinking, "but that doesn't make sense - you shouldn't add a hydrogen", you are missing the point. The structure that the SMILES writer was presented with had a hydrogen there. We can argue about whether it was intended, but it did have it. The SMILES writer then used the SMILES valence rule described above to decide whether or not it could omit square brackets. It worked out that it could, and then decided to omit them. Any SMILES reader that respects the rules would then read it in and perfectly reproduce the original structure.

...Or would they? I've just been testing the ability of a number of cheminformatics toolkits and drawing programs to agree on the number of hydrogens on each atom of a given SMILES string. The goal is not to give anyone a kicking, but to improve SMILES interoperability across the board, and in the last two weeks I've given feedback to (I think) 7 different toolkits with this goal in mind. So far I have test results for the Avalon toolkit, BIOVIA Draw, CACTVS, the CDK, ChemDraw, Indigo, IWToolkit, JChem, OEChem, Open Babel, OpenChemLib, the RDKit, and Weininger's CEX. If you would like to add another program to this list, then I encourage you to get in touch.

[1] (section 3.2.1)
[2] (section 3.1.5)