Friday 24 April 2009

Broken symmetry - Can SMILES and InChI ensure canonicalisation?

I've been working on the SMILES code in OpenBabel over the last while. The longer I've spent on it, the more impressed I've been with how it has been handled in the code and also with what a great idea SMILES was in the first place. The same goes for InChI, which has a slightly different goal, but which goes the extra mile and solves normalisation problems which I didn't even know existed.

But do they work? Can their canonicalisation procedures ensure that two identical molecular graphs result in the same canonical SMILES or InChI?

The InChI canonicalisation procedure is summarised in Rich's post. The Daylight algorithm is in Weininger*3, JCICS, 1989, 29, 97. And the review of the field that throws both into question is Ivanciuc's review of Processing Constitutional Information in Gasteiger's Handbook of Cheminformatics.

The key question here is whether the SMILES and InChI algorithms are capable of identifying automorphisms. There is a brute force way to do this, but both SMILES and InChI try to avoid this by identifying symmetry classes using extended connectivity and various graph invariants. An explicit automorphism check is not described as part of either algorithm but yet Ivanciuc argues repeatedly (e.g. at the end of 5.1.4) that any canonicalisation algorithm that does not include an explicit automorphism check "is incomplete, and its use in a chemical database...is unreliable".

The funny thing is that although the SMILES paper came out several years prior to Gasteiger's handbook (1993 vs. 2003), it is not referenced. Furthermore, the InChI developers have followed the same route more recently.

I leave the following question as an exercise for the reader: if a counterexample to the SMILES or InChI algorithms existed, how would one find it?

Image credit: _Blaster_

4 comments:

Andrew Dalke said...

I know the OpenEye developers included brute-force testing of their canonicalization routines, including their recent support for canonical Kékule-form SMILES. They started with the graph structure and permuted the atom and bond orders, generated a new SMILES, and made sure that all permutations gave the same result. This is empirical testing to back up the mathematical reasons behind the "shortcuts". The result? No problems found.

If "any canonicalisation algorithm that does not include an explicit automorphism check 'is incomplete, and its use in a chemical database...is unreliable'" then I'll put forth Sagan's statement, that extraordinary claims require extraordinary proof. After 20 years of Daylight canonicalization (which builds on previous research), and after the nauty approach used in InChI, and after all the empirical tests, I'll need something much more explicit, like a mathematical proof or a specific counter-example, before I'll treat this as something other than an article of unsupported faith.

Noel O'Boyle said...

The general point that Ivanciuc makes is true - it's easy to show that an arbitary set of graph invariants (e.g. just using the atom type for each node) may not be sufficient to identify graph-symmetrical atoms.

I would say that the extraordinary claim is really by the SMILES and InChi developers. The onus is on them to show (even if only empirically) that their algorithms can identify the symmetry classes.

A more direct test of this question would simply be to generate the permutations corresponding to the symmetry classes and check for non-automorphism (this is the test that Ivanciuc advocates in general).

Andrew Dalke said...

The algorithm in Weininger et al. does not work for the general case of all graphs. There are ambiguities, cleared up mostly after experience, as well as problems with certain classes of graphs. Brian Kelley, in his Dr. Dobb's article on this, mentioned that isospectral graphs were one problem, but I wasn't able to track down what he meant by that.

I've pointed out elsewhere that the Weininger paper fails to handle ambiguities resolvable only by bond type. It's possible I think in theory to have two atoms which are in the same symmetry class but connected to another atom through two different bond types. I wasn't able to construct one. The Weininger algorithm preferentially chooses highest bond orders first, which also limited the possibility of failures. My conclusion was that there might be problems, but only with non-realistic chemistry.

As for InChI - I thought the basic algorithm was akin to "of all possible representations, find the one which is alphabetically smallest." But I haven't dug into the details of that one.

How much more empirical testing would suffice? If this test is "easy" and only requires brute force, why didn't Ivanciuc do it already?

Noel O'Boyle said...

I understood that those ambiguities in the Weininger algorithm are related to the canonicalisation step. I haven't heard that the symmetry classes derivation is incomplete.

The "easy" doesn't refer to testing this (which would require the set of all molecules) but simply to showing that not all graph invariant partitionings are going to be able to find symmetry classes (consider labelling all atoms with 1, for example - this will place all atoms with the same number of neighbours into the same symmetry classes, but will not distinguish further).

Thanks for the pointer to Kelley's article. The Dr Dobbs page is here for anyone interested.