Wednesday, 19 September 2012

Using the InChI to canonicalise SMILES

I believe that the Open chemistry community will wish to move towards InChI as the definitive approach for all canonicalisation in their codes. We have found that "unique SMILES" is not precisely defined and there is no accepted reference implementation that is freely available. For example a given molecule (e.g. caffeine) has at least 9 representations on the public Web.
- Peter Murray-Rust, Feb 2005, Open Babel mailing list

Different software generates different canonical SMILES. The reason for this is simple: no one has described a canonicalisation scheme for SMILES that includes stereochemistry. Even if we wanted to generate the same SMILES, we cannot do so. Back in 2005, PMR pointed out that the InChI could be used for this purpose. As ever, PMR was way ahead of the times, and to my knowledge no one took up this idea until...

A paper of mine has just been published in J. Cheminf.:
Towards a Universal SMILES representation - A standard method to generate canonical SMILES based on the InChI
NM O'Boyle, Journal of Cheminformatics 2012, 4:22. doi:10.1186/1758-2946-4-22 

I describe two approaches to generate a canonical SMILES, one based on roundtripping through the InChI (and so it incorporates the InChI normalisation as a side-effect), and one that just takes the canonical labels from the InChI (so the structure is unchanged). These approaches are available in the development version of Open Babel as options to SMILES output, and should soon be available in Open Babel 2.3.2.
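
To give a feel for the second approach (reusing canonical labels while leaving the structure alone), here is a toy sketch in plain Python. It is not the procedure from the paper and does not call Open Babel or the InChI library. It assumes an acyclic molecule with single bonds only, ignores stereochemistry, and takes the canonical labels as hand-supplied input standing in for the labels a canonicaliser such as the InChI would provide. The function name smiles_from_labels and the data layout are hypothetical.

```python
from collections import defaultdict


def smiles_from_labels(atoms, bonds, labels):
    """Write a SMILES string for an acyclic, single-bonded molecule.

    atoms  : list of element symbols, indexed by input atom id
    bonds  : list of (i, j) pairs of input atom ids
    labels : canonical rank of each atom (assumed unique); supplied by the
             caller here, standing in for labels from a canonicaliser
    """
    neighbours = defaultdict(list)
    for i, j in bonds:
        neighbours[i].append(j)
        neighbours[j].append(i)

    def visit(atom, parent):
        # Visit unvisited neighbours in order of increasing canonical label.
        children = sorted(
            (n for n in neighbours[atom] if n != parent),
            key=lambda n: labels[n],
        )
        out = atoms[atom]
        # Every child except the last is written as a parenthesised branch.
        for child in children[:-1]:
            out += "(" + visit(child, atom) + ")"
        if children:
            out += visit(children[-1], atom)
        return out

    # Start the traversal at the atom with the lowest canonical label.
    start = min(range(len(atoms)), key=lambda a: labels[a])
    return visit(start, None)


# Propan-2-ol entered with two different input atom orders. The labels
# (methyl = 1 and 2, central carbon = 3, oxygen = 4) are assigned by hand.
order_a = smiles_from_labels(
    ["C", "C", "C", "O"], [(0, 1), (1, 2), (1, 3)], [1, 3, 2, 4]
)
order_b = smiles_from_labels(
    ["O", "C", "C", "C"], [(1, 0), (3, 1), (2, 1)], [4, 3, 1, 2]
)
print(order_a, order_b)
assert order_a == order_b
```

Because the traversal is driven entirely by the labels rather than by the input atom order, both inputs produce the same string. A real implementation also has to handle ring closures, bond orders, aromaticity and stereochemistry, which is where most of the difficulty lies.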

I'm hoping that other toolkits will see merit in this approach and add similar capability. This will allow, for the first time, different toolkits to generate the same SMILES, and it will finally be clear how different toolkits disagree on aspects of their chemical model. Only then will we make some progress on sorting out standard algorithms for stereocentre detection, aromatic models and so forth. And all this will be good for toolkits, and good for users.


Unknown said...

I haven't looked at InChI lately but I think it is critical that the canonicalization algorithm is exposed declaratively and with unit tests. This will then allow developers to code it in different languages, and to separate it from the InChI serialization. We should have one API where we input a structure and normalize it, and a second where the normalized structure is canonicalized.

baoilleach said...

Thanks Peter. All of that would indeed be nice; the InChI as it stands is not intended for the use to which I've put it here and so to a certain extent I'm squashing a square peg into a round hole. (It works surprisingly well though.)

gilleain said...

Unit tests would make re-implementation much easier. I've idly dreamed of doing this myself, but the worry is that the core canonicalization routine is quite sensitive to how it is set up.

I know that partition-refinement gives a different ordering depending on various choices made in the algorithm. Hopefully, those particular choices could be carried over into a different implementation.