But what's the difference? Is there any? Does a unique fingerprint imply a unique InChI, and vice versa?
Method
Enough with the questions. Let's just get some numbers. Our test set is the first slice of the drug-like subset of the ZINC database. Specifically, it's the file 3_p0.0.mol2.gz. According to ZINC, this should contain "only a single form of each molecule". We're going to use Pybel, the Python module included with Open Babel, to calculate the three common measures of uniqueness and check for duplicates.
Results
Here is the output of the script (see code below):
Discussion
Starting...
The total number of molecules is 12712
Are there any molecules with the same FP, SMI and InChI?
There are 4 molecules with 2 duplicates
There is 1 molecule with 162 duplicates
There is 1 molecule with 661 duplicates
There is 1 molecule with 1098 duplicates
The total number of (unique) molecules is 10790
Are there any remaining molecules which have the same FP \
and SMI?
There are 846 molecules with 2 duplicates
There are 55 molecules with 3 duplicates
There are 23 molecules with 4 duplicates
There are 3 molecules with 5 duplicates
There is 1 molecule with 7 duplicates
Are there any remaining molecules which have the same InChI?
...Finished!
So it seems that uniqueness is not quite as easy to measure as you might think. Even once you remove all molecules that share the same fingerprint, SMILES string, and InChI, there are still many that share the same fingerprint and SMILES string, although the InChIs are unique.
The results shown here either expose errors in OpenBabel, or possible errors in the creation of the ZINC database. In either case, this is an example of two Open chemistry resources being used to test and validate each other. Hopefully these results will aid in the development of even better software or data.
Code
import os
import glob
import pybel
import cPickle
def duplicates(keys, data):
"""Find the duplicates in 'data' in terms of particular
keys.
'keys' is a list containing some selection of the
indices 1, 2 and 3.
"""
ans = {}
for d in data:
ans.setdefault(tuple([d[key] for key in keys]),[]) \
.append(d[TITLE])
result = {}
for k, v in ans.iteritems():
if len(v) > 1: # len(v) is the number of duplicates
result[len(v)] = result.get(len(v), 0) + 1
for k in sorted(result.keys()):
if result[k]>1:
print "There are %d molecules with %d" \
" duplicates" % (result[k], k)
else:
print "There is %d molecule with %d" \
" duplicates" % (result[k], k)
return ans
def readfrommol2(filename):
data = []
for mol in pybel.readfile("mol2", filename):
data.append(( mol.title,
"%s" % mol.calcfp('FP2').bits,
mol.write("smi").split()[0],
mol.write("inchi").rstrip() ))
outputfile = open("dups.pickle", "w")
cPickle.dump(data, outputfile)
outputfile.close()
return data
def readfrompickle():
inputfile = open("dups.pickle", "r")
data = cPickle.load(inputfile)
inputfile.close()
return data
if __name__ == "__main__":
print "Starting...\n"
if not os.path.isfile("dups.pickle"):
data = readfrommol2("3_p0.0.mol2")
else:
data = readfrompickle()
TITLE, FP, SMI, INC = range(4)
# How many molecules are there?
print "The number of molecules is %d\n" \
% len(data)
# Find molecules with the same FP, SMI and INC
print "Are there any molecules with the same" \
" FP, SMI and InChI?"
answer = duplicates( [SMI, FP, INC], data)
# Select only unique molecules
uniquedata = [(v[0], k[0], k[1], k[2]) for k, v \
in answer.iteritems()]
print "\nThe number of (unique) molecules is %d\n" \
% len(uniquedata)
# Find molecules with the same FP and SMI but not INC
print "Are there any remaining molecules which have" \
" the same FP and SMI?\n"
duplicates( [SMI, FP], uniquedata )
# Find any remaining molecules with the same InChI
print "\nAre there any remaining molecules which have" \
" the same InChI?\n"
duplicates( [INC], uniquedata ) # Nothing is found
print "...Finished!"
8 comments:
I don't think using FP's to characterize uniqueness is a reliable approach. If a fingerprint is defined as N bits corresponding to N structural features, it is possible that 2 molecule have the same set of M structural features, but one of them has a feature that is not considered by the fingerprint. This would imply that both structures would have the same fingerprint, but are actually different.
This situation is directly applicable to the structural keys (MACCS, BCI etc).
Obviously, this problem is alleviated by a large enough feature set or else by using hashed fingerprints.
I think that path-based FPs (such as the Daylight-type FP used here), although not guaranteed to be unique, hardly ever have clashes. It would be interesting to select all molecules in the entire ZINC dataset which are unique by InChI (not including the stereochemistry level, which a FP doesn't detect) and see whether there are any clashes. I bet there won't be.
I would tend to agree about the structural feature fingerprints though (which Open Babel also has). These are probably more suited to ensuring that a dataset has a diverse range of chemistry, rather than to avoid duplicates.
What's with stereochemistry?
And, you should check
John D. MacCuish, Christos Nicolaou, and Norah E. MacCuish
J. Chemical Information & Computer Sciences 41(1):134–146, 2001. DOI 10.1021/ci000069q
Cheers, Joerg
Actually, you'd be better off using the "can" (Canonical SMILES) format, rather than SMILES for detecting duplicates / uniques. If you use regular SMILES, you're assuming the atom ordering is the same in every entry, which is unlikely.
Both the FP and InChI approaches will perform some measure of canonicalization, so you might as well pick the Canonical SMILES. It's in Open Babel 2.1.x and later.
I'd be curious if you get different results (but please use 2.1.1 or later.)
wouldn't it make sense to look at a few examples to see if the molecules are different or not and not just look at numbers?
@Felix: It was originally intended as simply an example of using Pybel, rather than an analysis of ZINC.
However, due in particular to Geoff's comment on using canonical SMILES, there will be a follow-up post with more information.
Could you let me know how many duplicates you found with .smi and can smiles. I am using both to check the duplicates
This post is from more than 2y ago, so I don't really have those figures to hand at this stage. I should say that the canonical SMILES code in OpenBabel has been improved since then (particularly in terms of stereochemistry), and is undergoing further improvements right now (for 2.3). When that's done, it should be a very useful tool for accurately identifying duplicates.
Post a Comment