Noel O'Blog: Pybel - Just how unique are your molecules?

Cheminformaticians often want to remove or identify duplicates in a dataset of molecules. There are a couple of popular ways to do this: ensure that each molecule has a unique SMILES string, has a unique Daylight-type fingerprint, or has a unique InChI.

But what's the difference? Is there any? Does a unique fingerprint imply a unique InChI, and vice versa?

Method

Enough with the questions. Let's just get some numbers. Our test set is the first slice of the drug-like subset of the ZINC database. Specifically, it's the file 3_p0.0.mol2.gz. According to ZINC, this should contain "only a single form of each molecule". We're going to use Pybel, the Python module included with Open Babel, to calculate the three common measures of uniqueness and check for duplicates.

Results

Here is the output of the script (see code below):


Starting...

The total number of molecules is 12712

Are there any molecules with the same FP, SMI and InChI?
There are 4 molecules with 2 duplicates
There is 1 molecule with 162 duplicates
There is 1 molecule with 661 duplicates
There is 1 molecule with 1098 duplicates

The total number of (unique) molecules is 10790

Are there any remaining molecules which have the same FP \
    and SMI?

There are 846 molecules with 2 duplicates
There are 55 molecules with 3 duplicates
There are 23 molecules with 4 duplicates
There are 3 molecules with 5 duplicates
There is 1 molecule with 7 duplicates

Are there any remaining molecules which have the same InChI?

...Finished!

Discussion

So it seems that uniqueness is not quite as easy to measure as you might think. Even once you remove all molecules that share the same fingerprint, SMILES string, and InChI, there are still many that share the same fingerprint and SMILES string, although the InChIs are unique.

The results shown here either expose errors in OpenBabel, or possible errors in the creation of the ZINC database. In either case, this is an example of two Open chemistry resources being used to test and validate each other. Hopefully these results will aid in the development of even better software or data.

Code


import os
import glob
import pybel
import cPickle

def duplicates(keys, data):
    """Find the duplicates in 'data' in terms of particular
    keys.

    'keys' is a list containing some selection of the
    indices 1, 2 and 3.
    """
    ans = {}
    for d in data:
        ans.setdefault(tuple([d[key] for key in keys]),[]) \
            .append(d[TITLE])
    result = {}
    for k, v in ans.iteritems():
        if len(v) > 1: # len(v) is the number of duplicates
            result[len(v)] = result.get(len(v), 0) + 1
    for k in sorted(result.keys()):
        if result[k]>1:
            print "There are %d molecules with %d" \
                  " duplicates" % (result[k], k)
        else:
            print "There is %d molecule with %d" \
                  " duplicates" % (result[k], k)
    return ans

def readfrommol2(filename):
    data = []
    for mol in pybel.readfile("mol2", filename):
        data.append(( mol.title,
            "%s" % mol.calcfp('FP2').bits,
            mol.write("smi").split()[0],
            mol.write("inchi").rstrip() ))
    outputfile = open("dups.pickle", "w")
    cPickle.dump(data, outputfile)
    outputfile.close()
    return data

def readfrompickle():
    inputfile = open("dups.pickle", "r")
    data = cPickle.load(inputfile)
    inputfile.close()
    return data

if __name__ == "__main__":
    print "Starting...\n"

    if not os.path.isfile("dups.pickle"):
        data = readfrommol2("3_p0.0.mol2")
    else:
        data = readfrompickle()

    TITLE, FP, SMI, INC = range(4)

    # How many molecules are there?
    print "The number of molecules is %d\n" \
            % len(data)

    # Find molecules with the same FP, SMI and INC
    print "Are there any molecules with the same" \
          " FP, SMI and InChI?"
    answer = duplicates( [SMI, FP, INC], data)

    # Select only unique molecules
    uniquedata = [(v[0], k[0], k[1], k[2]) for k, v \
            in answer.iteritems()]
    print "\nThe number of (unique) molecules is %d\n" \
            % len(uniquedata)

    # Find molecules with the same FP and SMI but not INC
    print "Are there any remaining molecules which have" \
          " the same FP and SMI?\n"
    duplicates( [SMI, FP], uniquedata )

    # Find any remaining molecules with the same InChI
    print "\nAre there any remaining molecules which have" \
          " the same InChI?\n"
    duplicates( [INC], uniquedata ) # Nothing is found

    print "...Finished!"

8 comments:

Rajarshi6 July 2007 at 23:10
I don't think using FP's to characterize uniqueness is a reliable approach. If a fingerprint is defined as N bits corresponding to N structural features, it is possible that 2 molecule have the same set of M structural features, but one of them has a feature that is not considered by the fingerprint. This would imply that both structures would have the same fingerprint, but are actually different.

This situation is directly applicable to the structural keys (MACCS, BCI etc).

Obviously, this problem is alleviated by a large enough feature set or else by using hashed fingerprints.
Noel O'Boyle7 July 2007 at 09:22
I think that path-based FPs (such as the Daylight-type FP used here), although not guaranteed to be unique, hardly ever have clashes. It would be interesting to select all molecules in the entire ZINC dataset which are unique by InChI (not including the stereochemistry level, which a FP doesn't detect) and see whether there are any clashes. I bet there won't be.

I would tend to agree about the structural feature fingerprints though (which Open Babel also has). These are probably more suited to ensuring that a dataset has a diverse range of chemistry, rather than to avoid duplicates.
Unknown7 July 2007 at 17:59
What's with stereochemistry?

And, you should check

John D. MacCuish, Christos Nicolaou, and Norah E. MacCuish
J. Chemical Information & Computer Sciences 41(1):134–146, 2001. DOI 10.1021/ci000069q

Cheers, Joerg
Geoff Hutchison8 July 2007 at 05:50
Actually, you'd be better off using the "can" (Canonical SMILES) format, rather than SMILES for detecting duplicates / uniques. If you use regular SMILES, you're assuming the atom ordering is the same in every entry, which is unlikely.

Both the FP and InChI approaches will perform some measure of canonicalization, so you might as well pick the Canonical SMILES. It's in Open Babel 2.1.x and later.

I'd be curious if you get different results (but please use 2.1.1 or later.)
Felix9 July 2007 at 22:00
wouldn't it make sense to look at a few examples to see if the molecules are different or not and not just look at numbers?
Noel O'Boyle9 July 2007 at 22:25
@Felix: It was originally intended as simply an example of using Pybel, rather than an analysis of ZINC.

However, due in particular to Geoff's comment on using canonical SMILES, there will be a follow-up post with more information.
Anonymous3 September 2009 at 15:22
Could you let me know how many duplicates you found with .smi and can smiles. I am using both to check the duplicates
Noel O'Boyle5 September 2009 at 23:40
This post is from more than 2y ago, so I don't really have those figures to hand at this stage. I should say that the canonical SMILES code in OpenBabel has been improved since then (particularly in terms of stereochemistry), and is undergoing further improvements right now (for 2.3). When that's done, it should be a very useful tool for accurately identifying duplicates.

Thursday, 5 July 2007

Pybel - Just how unique are your molecules?

8 comments: