In a comment on the previous post, Richard Hall asked about the error rate going from SDF->SMI->InChI. A very good question. Clearly, this tests the fidelity of reading and writing SMILES. But there's a second, less obvious effect...
But first, the results.
For the initial set of 18053 molecules, we find 108 disagreements (0.6%) with the InChIs obtained by converting straight from SDF->InChI using Open Babel. Of these, 25 have an error in the molecular formula in the InChI. These are straightforward bugs in how Open Babel determines the number of implicit hydrogens when reading some SMILES (Update 08/10/2010: Now fixed).
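For anyone wanting to repeat this sort of tally, here is a minimal sketch in plain Python. It is not the script used for the numbers above. It assumes you already have two dictionaries mapping an identifier (e.g. a CID) to an InChI string, one per conversion route; the names `formula_layer` and `tally` are made up for illustration, and the entry with key 0 is invented data to show what a formula error looks like.

```python
def formula_layer(inchi):
    """Return the molecular formula layer, e.g. 'C8H12' from 'InChI=1S/C8H12/c...'."""
    parts = inchi.split("/")
    return parts[1] if len(parts) > 1 else ""


def tally(direct, via_smiles):
    """Compare InChIs from the two routes.

    Returns (disagreements, formula_errors): identifiers whose InChIs differ
    at all, and the subset whose molecular formula layers differ.
    """
    disagreements = [
        cid for cid in direct
        if cid in via_smiles and direct[cid] != via_smiles[cid]
    ]
    formula_errors = [
        cid for cid in disagreements
        if formula_layer(direct[cid]) != formula_layer(via_smiles[cid])
    ]
    return disagreements, formula_errors


# CID 15550 is taken from this post; key 0 is an invented example of a
# wrong implicit hydrogen count, not real output from Open Babel.
direct = {
    15550: "InChI=1S/C8H12/c1-2-4-6-8-7-5-3-1/h1-4H,5-8H2/b3-1-,4-2-",
    0: "InChI=1S/CH4/h1H4",
}
via_smiles = {
    15550: "InChI=1S/C8H12/c1-2-4-6-8-7-5-3-1/h1-4H,5-8H2",
    0: "InChI=1S/CH3/h1H3",
}

disagreements, formula_errors = tally(direct, via_smiles)
print(len(disagreements), "disagreements,", len(formula_errors), "with a formula error")
```

Anything that disagrees but is not in `formula_errors` falls into the "more interesting" category.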
The others are more interesting disagreements: when converting from SDF -> InChI, the InChI library itself gets to decide which are the stereocenters*; when converting from SMI -> InChI, the InChI library needs to accept what Open Babel tells it. In other words, disagreements arise when the internal stereochemistry models in the two libraries disagree.
I took a look at three that appeared to disagree in different ways.
[CID 15550] Going through SMI leads to loss of stereo at a double bond in a ring system
From SDF: InChI=1S/C8H12/c1-2-4-6-8-7-5-3-1/h1-4H,5-8H2/b3-1-,4-2-
From SMI: InChI=1S/C8H12/c1-2-4-6-8-7-5-3-1/h1-4H,5-8H2
Whoops, it's those pesky double bonds in ring systems as discussed in the previous post. Might be time to look into this. The ring in question is an 8-membered ring. Is it possible to have a trans bond in an 8-membered ring?
[CID 17567] Going through SMI leads to loss of stereo at a double bond
From SDF: InChI=1S/C4H9N/c1-4(2)3-5/h3-5H,1-2H3/b5-3+
From SMI: InChI=1S/C4H9N/c1-4(2)3-5/h3-5H,1-2H3
The double bond in question is a [H]N=C bond. Open Babel doesn't think this can be a cis/trans bond; InChI thinks it can. Anyone actually know?
[CID 15456] Going through SMI leads to loss of stereo at two tetrahedral centers
From SDF: InChI=1S/C11H22N3O3P/c1-6-17-9(15)12-18(16,13-7-10(13,2)3)14-8-11(14,4)5/h6-8H2,1-5H3,(H,12,15,16)/t13-,14-/m0/s1
From SMI: InChI=1S/C11H22N3O3P/c1-6-17-9(15)12-18(16,13-7-10(13,2)3)14-8-11(14,4)5/h6-8H2,1-5H3,(H,12,15,16)
The tetrahedral centers in question are both sp3 nitrogens where two of the (three) bonds to the nitrogen are part of the same ring. Again, there is a disagreement between Open Babel and InChI on whether such nitrogens can be tetrahedral centers.
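In all three cases the SMI route simply drops one or more stereo layers from the InChI. A rough way to spot this automatically is to split each InChI into its layers and list the ones that went missing. The sketch below is plain Python, with the function names `inchi_layers` and `lost_layers` invented for the purpose. It assumes each layer prefix appears only once, which holds for the InChIs shown here but not in general (for example, when an isotopic layer repeats the stereo layers).

```python
def inchi_layers(inchi):
    """Split an InChI into a dict of layers keyed by their one-letter prefix.

    Simplification: assumes a formula layer is present and that no prefix
    is repeated (a repeated prefix would overwrite the earlier one).
    """
    body = inchi.split("=", 1)[1]      # drop the leading "InChI="
    segments = body.split("/")
    layers = {"version": segments[0]}
    if len(segments) > 1:
        layers["formula"] = segments[1]
    for seg in segments[2:]:
        layers[seg[0]] = seg[1:]       # e.g. "b3-1-,4-2-" -> {"b": "3-1-,4-2-"}
    return layers


def lost_layers(direct, via_smiles):
    """Return the layer prefixes present in `direct` but absent from `via_smiles`."""
    a, b = inchi_layers(direct), inchi_layers(via_smiles)
    return sorted(key for key in a if key not in b)


# The three pairs discussed above: (from SDF, from SMI).
examples = {
    15550: ("InChI=1S/C8H12/c1-2-4-6-8-7-5-3-1/h1-4H,5-8H2/b3-1-,4-2-",
            "InChI=1S/C8H12/c1-2-4-6-8-7-5-3-1/h1-4H,5-8H2"),
    17567: ("InChI=1S/C4H9N/c1-4(2)3-5/h3-5H,1-2H3/b5-3+",
            "InChI=1S/C4H9N/c1-4(2)3-5/h3-5H,1-2H3"),
    15456: ("InChI=1S/C11H22N3O3P/c1-6-17-9(15)12-18(16,13-7-10(13,2)3)14-8-11(14,4)5"
            "/h6-8H2,1-5H3,(H,12,15,16)/t13-,14-/m0/s1",
            "InChI=1S/C11H22N3O3P/c1-6-17-9(15)12-18(16,13-7-10(13,2)3)14-8-11(14,4)5"
            "/h6-8H2,1-5H3,(H,12,15,16)"),
}

for cid, (from_sdf, from_smi) in examples.items():
    print(cid, lost_layers(from_sdf, from_smi))
```

The two double bond cases lose only the `b` layer, while the nitrogen case loses `t` along with the `m` and `s` layers that accompany it. Note that this only catches layers that disappear entirely; a layer that is present in both but with different content would need a further comparison.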
The good news about these results is that we're almost down to the level where the only disagreements we see are disagreements on stereocenters rather than plain buggy bugs.
The bad news is that it's not clear what to do about these disagreements. Setting aside the "Open Babel is wrong - no, InChI is wrong!" discussion, another cheminformatics library will produce different InChIs yet again depending on how it defines stereochemical centers. If we could all agree on what constitutes a reasonable stereocenter there wouldn't be any problem. Alternatively, we could just follow whatever InChI says is a stereocenter even if we don't agree...?
* I apologise for the use of the American spelling throughout. It's a symptom of preparing a paper for an ACS journal. Normal service will resume shortly.