Thursday, 5 July 2007

Pybel - Just how unique are your molecules?

Cheminformaticians often want to remove or identify duplicates in a dataset of molecules. There are a couple of popular ways to do this: ensure that each molecule has a unique SMILES string, has a unique Daylight-type fingerprint, or has a unique InChI.

But what's the difference? Is there any? Does a unique fingerprint imply a unique InChI, and vice versa?


Enough with the questions. Let's just get some numbers. Our test set is the first slice of the drug-like subset of the ZINC database. Specifically, it's the file 3_p0.0.mol2.gz. According to ZINC, this should contain "only a single form of each molecule". We're going to use Pybel, the Python module included with Open Babel, to calculate the three common measures of uniqueness and check for duplicates.


Here is the output of the script (see code below):


The total number of molecules is 12712

Are there any molecules with the same FP, SMI and InChI?
There are 4 molecules with 2 duplicates
There is 1 molecule with 162 duplicates
There is 1 molecule with 661 duplicates
There is 1 molecule with 1098 duplicates

The total number of (unique) molecules is 10790

Are there any remaining molecules which have the same FP \
and SMI?

There are 846 molecules with 2 duplicates
There are 55 molecules with 3 duplicates
There are 23 molecules with 4 duplicates
There are 3 molecules with 5 duplicates
There is 1 molecule with 7 duplicates

Are there any remaining molecules which have the same InChI?


So it seems that uniqueness is not quite as easy to measure as you might think. Even once you remove all molecules that share the same fingerprint, SMILES string, and InChI, there are still many that share the same fingerprint and SMILES string, although the InChIs are unique.

The results shown here either expose errors in OpenBabel, or possible errors in the creation of the ZINC database. In either case, this is an example of two Open chemistry resources being used to test and validate each other. Hopefully these results will aid in the development of even better software or data.


import os
import glob
import pybel
import cPickle

def duplicates(keys, data):
"""Find the duplicates in 'data' in terms of particular

'keys' is a list containing some selection of the
indices 1, 2 and 3.
ans = {}
for d in data:
ans.setdefault(tuple([d[key] for key in keys]),[]) \
result = {}
for k, v in ans.iteritems():
if len(v) > 1: # len(v) is the number of duplicates
result[len(v)] = result.get(len(v), 0) + 1
for k in sorted(result.keys()):
if result[k]>1:
print "There are %d molecules with %d" \
" duplicates" % (result[k], k)
print "There is %d molecule with %d" \
" duplicates" % (result[k], k)
return ans

def readfrommol2(filename):
data = []
for mol in pybel.readfile("mol2", filename):
data.append(( mol.title,
"%s" % mol.calcfp('FP2').bits,
mol.write("inchi").rstrip() ))
outputfile = open("dups.pickle", "w")
cPickle.dump(data, outputfile)
return data

def readfrompickle():
inputfile = open("dups.pickle", "r")
data = cPickle.load(inputfile)
return data

if __name__ == "__main__":
print "Starting...\n"

if not os.path.isfile("dups.pickle"):
data = readfrommol2("3_p0.0.mol2")
data = readfrompickle()

TITLE, FP, SMI, INC = range(4)

# How many molecules are there?
print "The number of molecules is %d\n" \
% len(data)

# Find molecules with the same FP, SMI and INC
print "Are there any molecules with the same" \
" FP, SMI and InChI?"
answer = duplicates( [SMI, FP, INC], data)

# Select only unique molecules
uniquedata = [(v[0], k[0], k[1], k[2]) for k, v \
in answer.iteritems()]
print "\nThe number of (unique) molecules is %d\n" \
% len(uniquedata)

# Find molecules with the same FP and SMI but not INC
print "Are there any remaining molecules which have" \
" the same FP and SMI?\n"
duplicates( [SMI, FP], uniquedata )

# Find any remaining molecules with the same InChI
print "\nAre there any remaining molecules which have" \
" the same InChI?\n"
duplicates( [INC], uniquedata ) # Nothing is found

print "...Finished!"


Rajarshi said...

I don't think using FP's to characterize uniqueness is a reliable approach. If a fingerprint is defined as N bits corresponding to N structural features, it is possible that 2 molecule have the same set of M structural features, but one of them has a feature that is not considered by the fingerprint. This would imply that both structures would have the same fingerprint, but are actually different.

This situation is directly applicable to the structural keys (MACCS, BCI etc).

Obviously, this problem is alleviated by a large enough feature set or else by using hashed fingerprints.

Noel O'Boyle said...

I think that path-based FPs (such as the Daylight-type FP used here), although not guaranteed to be unique, hardly ever have clashes. It would be interesting to select all molecules in the entire ZINC dataset which are unique by InChI (not including the stereochemistry level, which a FP doesn't detect) and see whether there are any clashes. I bet there won't be.

I would tend to agree about the structural feature fingerprints though (which Open Babel also has). These are probably more suited to ensuring that a dataset has a diverse range of chemistry, rather than to avoid duplicates.

Unknown said...

What's with stereochemistry?

And, you should check

John D. MacCuish, Christos Nicolaou, and Norah E. MacCuish
J. Chemical Information & Computer Sciences 41(1):134–146, 2001. DOI 10.1021/ci000069q

Cheers, Joerg

Geoff Hutchison said...

Actually, you'd be better off using the "can" (Canonical SMILES) format, rather than SMILES for detecting duplicates / uniques. If you use regular SMILES, you're assuming the atom ordering is the same in every entry, which is unlikely.

Both the FP and InChI approaches will perform some measure of canonicalization, so you might as well pick the Canonical SMILES. It's in Open Babel 2.1.x and later.

I'd be curious if you get different results (but please use 2.1.1 or later.)

Felix said...

wouldn't it make sense to look at a few examples to see if the molecules are different or not and not just look at numbers?

Noel O'Boyle said...

@Felix: It was originally intended as simply an example of using Pybel, rather than an analysis of ZINC.

However, due in particular to Geoff's comment on using canonical SMILES, there will be a follow-up post with more information.

Anonymous said...

Could you let me know how many duplicates you found with .smi and can smiles. I am using both to check the duplicates

Noel O'Boyle said...

This post is from more than 2y ago, so I don't really have those figures to hand at this stage. I should say that the canonical SMILES code in OpenBabel has been improved since then (particularly in terms of stereochemistry), and is undergoing further improvements right now (for 2.3). When that's done, it should be a very useful tool for accurately identifying duplicates.