Thursday, 12 July 2007

Pybel - Just how unique are your molecules? Part II

I recently posted on using SMILES, Fingerprints and InChIs to identify unique molecules in a dataset. Geoff pointed out that I should have been using canonical SMILES ("can" in Open Babel), instead of non-canonical SMILES ("smi" in Open Babel). Why?

Well, the same molecule will always have the same canonical SMILES, whereas its non-canonical SMILES may differ depending on the order of the atoms in the input file. Since I was interested in identifying unique molecules, I should have been using canonical SMILES.
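
To make the point concrete, here is a minimal sketch of grouping molecules by the string they write in a given format. It assumes only that each molecule has a `title` attribute and a `write(format)` method returning a string, as `pybel.Molecule` does. The `FakeMol` class and the SMILES strings in it are hand-written placeholders so that the sketch runs without Open Babel; they are not real Open Babel output.

```python
from collections import defaultdict


def group_by_format(mols, fmt="can"):
    """Group molecule titles by the string each molecule writes in `fmt`."""
    groups = defaultdict(list)
    for mol in mols:
        # "smi" and "can" output looks like "SMILES<tab>title"; keep the first field.
        fields = mol.write(fmt).split()
        key = fields[0] if fields else ""
        groups[key].append(mol.title)
    return dict(groups)


class FakeMol:
    """Hypothetical stand-in for pybel.Molecule (illustration only)."""

    def __init__(self, title, strings):
        self.title = title
        self._strings = strings

    def write(self, fmt):
        return "%s\t%s\n" % (self._strings[fmt], self.title)


# Two records of the same molecule with the atoms listed in a different order.
# The strings are made up for the example.
mols = [
    FakeMol("a", {"smi": "OCC", "can": "CCO"}),
    FakeMol("b", {"smi": "CCO", "can": "CCO"}),
]

by_smi = group_by_format(mols, "smi")
by_can = group_by_format(mols, "can")
print(len(by_smi), len(by_can))
```

With real data, the list of molecules would come from something like `pybel.readfile`, and any key that maps to more than one title marks a set of duplicates.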


The analysis is the same as before, just with "can" instead of "smi", so the code used is only slightly different from that in the previous post.


Here is the output of the script (see code below):


The number of molecules is 12712

Are there any molecules with the same FP, SMI and InChI?

There are 2 molecules with 2 duplicates
There is 1 molecule with 162 duplicates
There is 1 molecule with 661 duplicates
There is 1 molecule with 1098 duplicates

The number of (unique) molecules is 10792

Are there any remaining molecules with the same canonical SMILES?

There are 815 molecules with 2 duplicates
There are 61 molecules with 3 duplicates
There are 36 molecules with 4 duplicates
There are 10 molecules with 5 duplicates
There are 9 molecules with 6 duplicates
There are 9 molecules with 7 duplicates
There is 1 molecule with 8 duplicates
There are 3 molecules with 10 duplicates
There are 3 molecules with 11 duplicates
There are 2 molecules with 13 duplicates
There is 1 molecule with 14 duplicates
There is 1 molecule with 24 duplicates
There is 1 molecule with 34 duplicates

Are there any remaining molecules with the same fingerprint?

None found

Are there any remaining molecules with the same InChI?

There are 2 molecules with 2 duplicates
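
As a sanity check on the numbers above, here is a small sketch of how such a summary can be built from a list of keys (canonical SMILES, InChIs and so on). It is not the original script: the function names are my own, and it assumes that "N molecules with K duplicates" means N distinct keys that are each shared by K records.

```python
from collections import Counter


def duplicate_histogram(keys):
    """Map group size -> number of distinct keys shared by that many records."""
    sizes = Counter(keys).values()
    return dict(Counter(size for size in sizes if size > 1))


def report(histogram):
    """Format a histogram in the style of the output shown above."""
    lines = []
    for size in sorted(histogram):
        n = histogram[size]
        if n == 1:
            lines.append("There is 1 molecule with %d duplicates" % size)
        else:
            lines.append("There are %d molecules with %d duplicates" % (n, size))
    return lines or ["None found"]


def unique_count(total, histogram):
    """Records left after keeping one representative of each duplicate group."""
    return total - sum((size - 1) * n for size, n in histogram.items())


# Figures taken from the first block of output above.
first_pass = {2: 2, 162: 1, 661: 1, 1098: 1}
print(unique_count(12712, first_pass))  # 10792
```

Under that reading, 2 + 161 + 660 + 1097 = 1920 records are redundant, which accounts for the drop from 12712 to 10792 molecules.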


If you compare the results here with those in my previous post, you will find that there are two instances where two molecules have the same FP, InChI and non-canonical SMILES, but different canonical SMILES! By eye, the two molecules are identical. This appears to be a bug in the canonicalisation routine and I've reported it to OpenBabel. The details are:

non-canonical SMILES is CC1OC(C[NH](C1)CC(C(F)(F)F)O)C but canonical SMILES is...
  • ZINC03883383: C[C@H]1O[C@@H](C)C[NH](C1)C[C@H](O)C(F)(F)F
  • ZINC03883386: C[C@@H]1O[C@H](C)C[NH](C1)C[C@H](O)C(F)(F)F

non-canonical SMILES is C1C(CC(C(C1O)O)O)(C(=O)O)O but canonical SMILES is...
  • ZINC03870192: O[C@@H]1CC(O)(C[C@H](O)C1O)C(=O)O
  • ZINC03870194: O[C@H]1CC(O)(C[C@@H](O)C1O)C(=O)O
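
A quick way to see that the members of each pair differ only in their chirality markers is to compare the strings with every "@" removed. This is a purely textual check on the four canonical SMILES listed above, not a chemical one: the stripped strings are only meant for comparison and should not be fed back into a SMILES parser. The helper names are my own.

```python
import re

PAIRS = [
    ("C[C@H]1O[C@@H](C)C[NH](C1)C[C@H](O)C(F)(F)F",   # ZINC03883383
     "C[C@@H]1O[C@H](C)C[NH](C1)C[C@H](O)C(F)(F)F"),  # ZINC03883386
    ("O[C@@H]1CC(O)(C[C@H](O)C1O)C(=O)O",             # ZINC03870192
     "O[C@H]1CC(O)(C[C@@H](O)C1O)C(=O)O"),            # ZINC03870194
]


def strip_chirality(smiles):
    """Remove the @ / @@ markers, leaving the rest of the string untouched."""
    return smiles.replace("@", "")


def count_chiral_tags(smiles):
    """Count atoms carrying a chirality marker (@ or @@)."""
    return len(re.findall(r"@@?", smiles))


for first, second in PAIRS:
    same_skeleton = strip_chirality(first) == strip_chirality(second)
    print(same_skeleton, count_chiral_tags(first), count_chiral_tags(second))
```

Both pairs collapse to the same string once the markers are removed, so the canonical SMILES disagree only about the configuration assigned to the marked centres.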

Although my original aim was simply to give an example of how to use Pybel, this trivial analysis of ZINC has identified areas for improvement in both ZINC and OpenBabel. Without access to such a large dataset as ZINC, errors such as those identified here would go unnoticed. If anyone has any more ideas for tests, let me know...


GMC2007 said...

Possibly a little confusion here. The @ and @@ specify configuration and when these are present you're dealing with what sometimes gets called an isomeric SMILES. Double bond geometry and isotopes can also be specified in isomeric SMILES. The SMILES that you're calling non-canonical does not have the stereochemistry specified. However it may still be canonical. It's just not an isomeric SMILES.

baoilleach said...

I think my interpretation is still correct. The non-canonical SMILES are definitely not canonicalised to begin with, so for sure there is a potential to give rise to different SMILES.

Secondly, the two molecules with different isomeric SMILES (thanks for the definition) have identical 3D structures, which surely is inconsistent with different isomeric SMILES.

GMC2007 said...

SMILES certainly need to be canonicalised if you're going to be using them for duplicate recognition. However, the SMILES that you appear to have generated with 'smi' (as opposed to 'can') have lost stereochemical information. You might want to check that 'smi' isn't just writing a canonical SMILES without stereochemical information. Does 'smi' reproduce the order of the input SMILES and does it output stereochemical information?

I’ve taken a look at your SMILES. I believe that ZINC03883383 and ZINC03883386 are distinct structures because the chiral center in the substituent renders the two faces of the morpholine ring non-equivalent. One worrying feature of these two SMILES is that the 4-connected nitrogen does not have a positive charge so the SMILES encode free radicals.

The other two SMILES present more of a mystery since they appear to encode the same structure. Only two of the stereo centers have defined configuration and the two that do not are C1 and C4 of a cyclohexane ring. Encoding cis/trans relationships in 1,4-disubstituted cyclohexanes with SMILES did pose problems in the past (maybe still) because the plane of symmetry is incompatible with chirality and some software ‘knows’ that the relevant carbon atoms can’t be chiral centers. It could be that the canonicalisation has somehow lost some information about stereochemical relationships. There are a couple of things that you could check. First, if you started with the structures in another format, you may be able to see which stereocenters have defined configuration and therefore whether any of this has been lost in SMILES generation and/or canonicalisation. Secondly, you could edit in a specific configuration for each (and both) of the stereocenters with undefined configuration and check that this information is not lost in the canonicalisation process.

baoilleach said...

Thanks for the feedback, GMC. For the record, I will be following up these issues either with the OpenBabel development team, or with John Irwin, over at ZINC. I will post an update if and when.

Webster said...

The first pair of SMILES are different enantiomers. They are also incorrect in that they have a 4-valent uncharged nitrogen. If you remove the H, CONCORD will generate different 3D conformations.

The second pair, as written, are identical since the two are superimposable on each other. The chiral markings serve to differentiate them from the trans case where one hydroxyl is above the ring and the other below. That the canonicalizer doesn't generate identical SMILES is a bug.

Also, I think that there are two other centers that are not marked