Friday 24 April 2009

Broken symmetry - Can SMILES and InChI ensure canonicalisation?

I've been working on the SMILES code in OpenBabel over the last while. The longer I've spent on it, the more impressed I've been with how it has been handled in the code and also with what a great idea SMILES was in the first place. The same goes for InChI, which has a slightly different goal, but which goes the extra mile and solves normalisation problems which I didn't even know existed.

But do they work? Can their canonicalisation procedures ensure that two identical molecular graphs result in the same canonical SMILES or InChI?

The InChI canonicalisation procedure is summarised in Rich's post. The Daylight algorithm is in Weininger*3, JCICS, 1989, 29, 97. And the review of the field that throws both into question is Ivanciuc's review of Processing Constitutional Information in Gasteiger's Handbook of Chemoinformatics.

The key question here is whether the SMILES and InChI algorithms are capable of identifying automorphisms. There is a brute force way to do this, but both SMILES and InChI try to avoid it by identifying symmetry classes using extended connectivity and various graph invariants. An explicit automorphism check is not described as part of either algorithm, yet Ivanciuc argues repeatedly (e.g. at the end of 5.1.4) that any canonicalisation algorithm that does not include an explicit automorphism check "is incomplete, and its use in a chemical database...is unreliable".
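To make the distinction concrete, here is a minimal Python sketch that contrasts the brute force route with a refinement based on connectivity. It is a toy and not the Daylight or InChI procedure: the function names, the (element, degree) starting invariant and the 8-atom test graph are all assumptions made for illustration. The graph is chosen so that every atom is a carbon with three neighbours, which leaves a purely local refinement with nothing to split on.

```python
from itertools import permutations

def automorphisms(elements, bonds):
    """Brute force: every atom permutation that preserves elements and bonds."""
    n = len(elements)
    bond_set = {frozenset(b) for b in bonds}
    result = []
    for perm in permutations(range(n)):
        if any(elements[i] != elements[perm[i]] for i in range(n)):
            continue
        # A bijection that maps every bond onto a bond maps the bond set onto itself.
        if all(frozenset((perm[a], perm[b])) in bond_set for a, b in bonds):
            result.append(perm)
    return result

def orbits(elements, bonds):
    """True symmetry classes: atoms that some automorphism maps onto each other."""
    autos = automorphisms(elements, bonds)
    return {frozenset(p[i] for p in autos) for i in range(len(elements))}

def refinement_classes(elements, bonds):
    """Toy extended-connectivity refinement (an illustration only).

    Start from (element, degree) and repeatedly extend each atom's key with
    the sorted labels of its neighbours until no new classes appear.
    """
    n = len(elements)
    nbrs = [[] for _ in range(n)]
    for a, b in bonds:
        nbrs[a].append(b)
        nbrs[b].append(a)
    keys = [(elements[i], len(nbrs[i])) for i in range(n)]
    while True:
        ranks = {k: r for r, k in enumerate(sorted(set(keys)))}
        labels = [ranks[k] for k in keys]
        new_keys = [(labels[i], tuple(sorted(labels[j] for j in nbrs[i])))
                    for i in range(n)]
        if len(set(new_keys)) == len(set(keys)):
            break  # the partition has stopped splitting
        keys = new_keys
    return {frozenset(i for i in range(n) if keys[i] == k) for k in set(keys)}

# Hypothetical test graph: an 8-membered ring plus four chords, all carbons,
# every atom with exactly three neighbours.
elements = ["C"] * 8
ring = [(i, (i + 1) % 8) for i in range(8)]
chords = [(0, 2), (1, 3), (4, 6), (5, 7)]
bonds = ring + chords

print(sorted(sorted(c) for c in orbits(elements, bonds)))
print(sorted(sorted(c) for c in refinement_classes(elements, bonds)))
```

The brute force check splits the atoms into two classes, {0, 3, 4, 7} and {1, 2, 5, 6}, while the toy refinement lumps all eight together. That says nothing directly about SMILES or InChI, which use richer invariants and tie-breaking, but it shows why refinement on its own can be coarser than the true symmetry classes. The price of the brute force route is n! permutations, which is fine for 8 atoms and hopeless for a drug-sized molecule.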

The funny thing is that although the SMILES paper came out many years prior to Gasteiger's handbook (1989 vs. 2003), the review does not reference it. Furthermore, the InChI developers have followed the same route more recently.

I leave the following question as an exercise for the reader: if a counterexample to the SMILES or InChI algorithms existed, how would one find it?
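One possible line of attack, sketched below in Python, is to feed a canonicaliser many different atom orderings of the same graph and check that the output never changes. Everything here is hypothetical scaffolding: `find_counterexample`, the deliberately naive `naive_canonical` (which breaks ties by input order) and the exhaustive `brute_force_canonical` are stand-ins, not the SMILES or InChI algorithms. For real molecules one would sample random orderings rather than enumerate them all.

```python
from itertools import permutations

def relabel(bonds, perm):
    """Apply an atom renumbering to a bond list."""
    return [(perm[a], perm[b]) for a, b in bonds]

def normalise(bonds):
    """Sorted list of bonds, each bond written with its smaller index first."""
    return sorted(tuple(sorted(b)) for b in bonds)

def naive_canonical(n_atoms, bonds):
    """Deliberately flawed: ranks atoms by degree, ties broken by input order."""
    degree = [0] * n_atoms
    for a, b in bonds:
        degree[a] += 1
        degree[b] += 1
    order = sorted(range(n_atoms), key=lambda i: (degree[i], i))
    new_label = {old: new for new, old in enumerate(order)}
    return normalise(relabel(bonds, new_label))

def brute_force_canonical(n_atoms, bonds):
    """Correct but factorial-time: the smallest bond list over all renumberings."""
    return min(normalise(relabel(bonds, p)) for p in permutations(range(n_atoms)))

def find_counterexample(canonicalise, n_atoms, bonds):
    """Return a renumbering whose canonical form differs from the reference, or None."""
    reference = canonicalise(n_atoms, bonds)
    for perm in permutations(range(n_atoms)):
        shuffled = relabel(bonds, perm)
        if canonicalise(n_atoms, shuffled) != reference:
            return shuffled
    return None

# Butane skeleton: four atoms in a chain.
butane = [(0, 1), (1, 2), (2, 3)]
print(find_counterexample(naive_canonical, 4, butane))
print(find_counterexample(brute_force_canonical, 4, butane))
```

The naive version is caught on a four-atom chain, while the exhaustive version passes. A harness like this can only ever prove a canonicaliser wrong, never right, so the hard part of the exercise is choosing graphs that are likely to trip it up, such as highly symmetrical ones.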

Image credit: _Blaster_

Tuesday 14 April 2009

Are you on my side or not? It's E/Z

Handling cis/trans stereochemistry with SMILES should be easy, right? You have the canonical examples for trans:
A. I/C=C/Cl (I is down, Cl is up)
B. I\C=C\Cl (I is up, Cl is down)
and cis:
C. I/C=C\Cl (both are down)
D. I\C=C/Cl (both are up)
The "/" or "\" symbol should be chosen according to whether the substituent is written before or after the double-bond atom it is attached to. Bearing this in mind, the following represents the same trans structure as A:
E. C(=C/Cl)\I
Note that the effect of moving the "I" from one side of the "C" to the other (that is, A vs E) causes the bond symbol to change.

When a ring closure occurs on a double-bond atom, a further complication arises: the ring-closure bond is written twice, once at each end of the ring closure. The symbol indicating the stereochemistry should appear only at the end that sits on the double bond:
F. I/C=C\1/CCCN1
Of course, where two substituents are shown explicitly at one end of a double bond, it's not necessary to show the stereochemistry for both of the bonds (although it makes things clearer). That is, the following two representations are identical to F:
G. I/C=C1/CCCN1
H. I/C=C\1CCCN1
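The bookkeeping in examples A to H can be written down as a small Python sketch. It does not parse SMILES. Each substituent is described by hand as a pair (bond symbol, whether it is written before its double-bond atom), and the helper names `side` and `geometry` are made up for this illustration.

```python
def side(symbol, written_before):
    """Return "up" or "down" for a substituent relative to its double-bond atom.

    symbol is "/" or a backslash; written_before is True when the substituent
    appears in the string before the double-bond atom it is attached to.
    """
    if symbol not in ("/", "\\"):
        raise ValueError("expected '/' or a backslash")
    rises = symbol == "/"  # "/" is a bond that rises from left to right
    if written_before:
        # The substituent sits at the left-hand end of the bond.
        return "down" if rises else "up"
    # The substituent sits at the right-hand end of the bond.
    return "up" if rises else "down"

def geometry(left, right):
    """Classify one substituent on each end of the double bond as cis or trans."""
    return "cis" if side(*left) == side(*right) else "trans"

examples = {
    "A": (("/", True), ("/", False)),    # I/C=C/Cl
    "B": (("\\", True), ("\\", False)),  # I\C=C\Cl
    "C": (("/", True), ("\\", False)),   # I/C=C\Cl
    "D": (("\\", True), ("/", False)),   # I\C=C/Cl
    "E": (("\\", False), ("/", False)),  # C(=C/Cl)\I, I written after its C
}
for name, (left, right) in examples.items():
    print(name, geometry(left, right))

# Example F, I/C=C\1/CCCN1: both ring bonds are written after the second C.
iodine = side("/", True)                 # I before the first C
ring_nitrogen = side("\\", False)        # ring closure \1 leads to N
ring_carbon = side("/", False)           # /C is the next ring atom
print(iodine, ring_nitrogen, ring_carbon)
```

A, B and E come out as trans and C and D as cis, matching the list above. In F the two ring bonds land on opposite sides of the second carbon, which is why either one alone (G or H) is enough to fix the stereochemistry.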

Image credit: suttonhoo

Friday 3 April 2009

Some short stories

  • I want to flag up Andrew Dalke's course at the end of April on Python and cheminformatics. While I might disagree with Andrew's toolkit of choice, there's no doubt that the skills learnt will be of great benefit to any cheminformatician in their day-to-day work. As well as a cheminformatics portion, the course includes matplotlib (plotting), communicating with Excel, XML processing, subprocess (for calling command-line programs), NumPy, R, SQL and Django.
  • The first issue of Journal of Cheminformatics has hit the electronic shelves. Point your RSS reader to the feed. Best of luck to Christoph and David.
  • Is 2009 the year of OChRe on the desktop? After almost a decade of little development in this area, we have in quick succession papers on ChemReader, OSRA and now Clide Pro, an update of the venerable Clide. The techniques used by the new version are described in detail in the paper. Unfortunately, there is little in the way of comparison either to the original Clide or other OChRe software. On the plus side, the dataset of images discussed in the paper has been made available as supporting material with the intention of forming part of a community benchmark for performance comparisons (although it's not clear whether this dataset was also used for training the software).
  • There seems to be some confusion over the name of this field. Is it OSR (Optical Structure Recognition, according to OSRA), OCR (Optical Chemical Recognition, à la chemOCR), OCSR (you guessed it, Optical Chemical Structure Recognition, as referred to in the Clide Pro paper), or OChRe (Optical Chemical Recognition again, but spelling out a real word; it also has that InChI up-and-down thing going on)?
  • Did your experiments fail again? Tell me about it. I mean that literally, because you've got your choice of journals to publish in. There's the All Results Journal ("all results are good results") or (for the more mathematically inclined) Rejecta Mathematica.