Thursday 7 October 2010

Measuring information loss in file format conversion Part II

In a comment on the previous post, Richard Hall asked about the error rate going from SDF->SMI->InChI. A very good question. Clearly, this tests the fidelity of reading and writing to and from SMILES. But there's a second less obvious effect...

But first, the results.

For the initial set of 18053 molecules, we find 108 disagreements (0.6%) with the InChIs obtained by converting using Open Babel straight from SDF->InChI. Of these, 25 have an error in the molecular formula in the InChI. These are straightforward bugs in how Open Babel determines the number of implicit hydrogens when reading some SMILES (Update 08/10/2010: Now fixed).
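As a rough illustration of how such a tally could be made, here is a minimal Python sketch that sorts disagreements into "formula differs" and "something else differs". It is not the script used for this post. The names formula_layer and classify are made up for the example, it assumes the molecular formula is the second '/'-separated field of the InChI (true for ordinary molecules, not for something like a bare proton), and the "made-up" entry is an invented pair rather than a molecule from the dataset.

```python
def formula_layer(inchi: str) -> str:
    """Return the molecular formula layer of an InChI string.

    Assumption: the formula is the second '/'-separated field.
    """
    parts = inchi.strip().split("/")
    return parts[1] if len(parts) > 1 else ""


def classify(direct: dict, via_smiles: dict) -> dict:
    """Count agreements, formula disagreements and other disagreements."""
    counts = {"agree": 0, "formula": 0, "other": 0}
    for mol_id, ref in direct.items():
        test = via_smiles[mol_id]
        if ref == test:
            counts["agree"] += 1
        elif formula_layer(ref) != formula_layer(test):
            counts["formula"] += 1
        else:
            counts["other"] += 1
    return counts


# One real pair from this post plus one invented formula mismatch.
direct = {
    "CID 15550": "InChI=1S/C8H12/c1-2-4-6-8-7-5-3-1/h1-4H,5-8H2/b3-1-,4-2-",
    "made-up": "InChI=1S/CH4/h1H4",
}
via_smiles = {
    "CID 15550": "InChI=1S/C8H12/c1-2-4-6-8-7-5-3-1/h1-4H,5-8H2",
    "made-up": "InChI=1S/CH3/h1H3",
}

counts = classify(direct, via_smiles)
print(counts)                 # {'agree': 0, 'formula': 1, 'other': 1}
print(f"{108 / 18053:.1%}")   # 0.6%
```

The last line just checks the arithmetic behind the 0.6% figure quoted above.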

The others are more interesting disagreements: when converting from SDF -> InChI, the InChI library itself gets to decide which are the stereocenters*; when converting from SMI -> InChI, the InChI library needs to accept what Open Babel tells it. In other words, disagreements arise when the internal stereochemistry models in the two libraries disagree.

I took a look at three which appeared to disagree in different ways.

[CID 15550] Going through SMI leads to loss of stereo at a double bond in a ring system
From SDF: InChI=1S/C8H12/c1-2-4-6-8-7-5-3-1/h1-4H,5-8H2/b3-1-,4-2-
From SMI: InChI=1S/C8H12/c1-2-4-6-8-7-5-3-1/h1-4H,5-8H2

Whoops, it's those pesky double bonds in ring systems as discussed in the previous post. Might be time to look into this. The ring in question is an 8-membered ring. Is it possible to have a trans bond in an 8-membered ring?

[CID 17567] Going through SMI leads to loss of stereo at a double bond
From SDF: InChI=1S/C4H9N/c1-4(2)3-5/h3-5H,1-2H3/b5-3+
From SMI: InChI=1S/C4H9N/c1-4(2)3-5/h3-5H,1-2H3

The double bond in question is a [H]N=C bond. Open Babel doesn't think this can be a cis/trans bond; InChI thinks it can. Anyone actually know?

[CID 15456] Going through SMI leads to loss of stereo at two tetrahedral centers.
From SDF: InChI=1S/C11H22N3O3P/c1-6-17-9(15)12-18(16,13-7-10(13,2)3)14-8-11(14,4)5/h6-8H2,1-5H3,(H,12,15,16)/t13-,14-/m0/s1
From SMI: InChI=1S/C11H22N3O3P/c1-6-17-9(15)12-18(16,13-7-10(13,2)3)14-8-11(14,4)5/h6-8H2,1-5H3,(H,12,15,16)

The tetrahedral centers in question are both sp3 nitrogens where two of the (three) bonds to the nitrogen are part of the same ring. Again, there is a disagreement between Open Babel and InChI on whether such nitrogens can be tetrahedral centers.
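The three cases above differ in which InChI layers go missing, and that can be checked mechanically. The following is a minimal Python sketch, not part of Open Babel or the InChI library: split_layers and lost_layers are hypothetical helper names, and the code assumes each layer prefix letter occurs at most once in the string, which holds for these three InChIs but not for InChIs with isotopic or fixed-H sections.

```python
def split_layers(inchi: str) -> dict:
    """Map each layer prefix (c, h, b, t, m, s, ...) to its content.

    Assumptions: the formula is the second field, and no prefix repeats.
    """
    parts = inchi.strip().split("/")
    layers = {"version": parts[0]}
    if len(parts) > 1:
        layers["formula"] = parts[1]
    for part in parts[2:]:
        if part:  # skip empty fields
            layers[part[0]] = part[1:]
    return layers


def lost_layers(ref: str, test: str) -> list:
    """Return the layer prefixes present in ref but absent from test."""
    ref_layers = split_layers(ref)
    test_layers = split_layers(test)
    return sorted(key for key in ref_layers if key not in test_layers)


# (from SDF, from SMI) pairs copied from the three examples above.
examples = {
    "CID 15550": (
        "InChI=1S/C8H12/c1-2-4-6-8-7-5-3-1/h1-4H,5-8H2/b3-1-,4-2-",
        "InChI=1S/C8H12/c1-2-4-6-8-7-5-3-1/h1-4H,5-8H2",
    ),
    "CID 17567": (
        "InChI=1S/C4H9N/c1-4(2)3-5/h3-5H,1-2H3/b5-3+",
        "InChI=1S/C4H9N/c1-4(2)3-5/h3-5H,1-2H3",
    ),
    "CID 15456": (
        "InChI=1S/C11H22N3O3P/c1-6-17-9(15)12-18(16,13-7-10(13,2)3)14-8-11(14,4)5"
        "/h6-8H2,1-5H3,(H,12,15,16)/t13-,14-/m0/s1",
        "InChI=1S/C11H22N3O3P/c1-6-17-9(15)12-18(16,13-7-10(13,2)3)14-8-11(14,4)5"
        "/h6-8H2,1-5H3,(H,12,15,16)",
    ),
}

for cid, (from_sdf, from_smi) in examples.items():
    print(cid, lost_layers(from_sdf, from_smi))
# CID 15550 ['b']
# CID 17567 ['b']
# CID 15456 ['m', 's', 't']
```

The /b layer holds double bond stereo and the /t, /m and /s layers hold the tetrahedral stereo information, so the output matches the descriptions given for each case.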

The good news about these results is that we're almost down to the level where the only disagreements we see are disagreements on stereocenters rather than plain buggy bugs.

The bad news is that it's not clear what to do about these disagreements. Setting aside the "Open Babel is wrong - no, InChI is wrong!" discussion, another cheminformatics library will produce different InChIs yet again depending on how it defines stereochemical centers. If we could all agree on what constitutes a reasonable stereocenter there wouldn't be any problem. Alternatively, we could just follow whatever InChI says is a stereocenter even if we don't agree...?

* I apologise for use of the American spelling throughout. It's a symptom of preparing a paper for an ACS journal. Normal service will resume shortly.

3 comments:

nyc dad said...

If you want interoperability and linking of information you need an (arbitrary) standard. While I am biased towards InChI, my argument is not about chemistry/stereochemistry, but that InChI has the backing of the publishers and database producers. Thus using the InChI (arbitrary) rules is the correct practical solution.

Between chemists and chemistry there never has been and will never be agreement, so we should/must find a workable solution.

Steve Heller

Noel O'Boyle said...

I'm coming around to the same viewpoint, Steve. It's just that I hadn't seen this problem coming.

But to play the devil's advocate, if InChI sees cis/trans isomers where none exist in reality (the C=N bond in the post above being a potential example) then we will see two different InChIs being calculated for the same molecule just by chance depending on whether the apparent geometry in the input file is cis or trans.

Perhaps the set of possibly stereogenic centers listed in the InChI technical manual could be tightened up here or there.

R Stephan said...

I don't know about InChI but I have seen stereo centers at het nitrogen atoms. The electron pair of nitrogen, when free, switches its conformation at 100 Hz so there is no preference, but when bound this is prevented, so within molecules N effectively becomes a stereo center. This is widely ignored, of course, until people see a crystal structure and begin to ask questions.

The example I encountered was reserpine which somehow wasn't correctly modeled by CORINA compared to the published x-ray structure.

I'm glad InChI has support, and that you noted it for Open Babel.