Thursday, 25 April 2013

How fast is chemfp? We investigate!

fingerprintThat is, um, I investigate. I described chemfp in an earlier blogpost. Simply put, it comprises software by Python/C guru Andrew Dalke to handle and generate binary fingerprints for molecules. It enables easy comparison of fingerprints from different toolkits, and provides super-fast similarity methods. With a new even faster version out in Feb 2013, I thought I should put it through its paces.


The standard way to carry out a fast similarity search with Open Babel is the so-called fastsearch method, which precalculates an index file of fingerprints for an sdf file, and then linearly-scans through this for hits against a single query structure. With chemfp, the initial step is similar; you convert everything to fps files, a standard file format for fingerprints developed by Andrew. Previously you needed to use chemfp's ob2fps to do this, but OB 2.3.2 added support for fps files and can generate them directly 2 to 3 times faster.
$ obabel chembl_15mod.sdf -O chembl.fs # for fastsearch
$ obabel chembl_15mod.sdf -O chembl.fps # for chemfp

Single-molecule query

Let's take the first molecule in the database as a query, and find those molecules in ChEMBL that are within 0.8 similarity (Tanimoto). simsearch is the name of chemfp's tool for similarity searching:
$ obabel chembl.fs -osmi -s query.sdf -aa -at 0.8 > threshold.fs.txt
$ simsearch --threshold 0.8 -q query.fps chembl.fps > threshold.fps.txt
They both take about 1.2 seconds. I'm being a bit vague because the exact value doesn't matter; for some query molecules fastsearch takes longer, for some less. It's just to give an idea. But in short, fastsearch appears to be in the same ballpark as simsearch for single molecule queries.

Multi-molecule query

fastsearch doesn't provide an easy way to do multiple queries. If you want to do it yourself, you would just have to do each search one-by-one. So 1000 searches would take 1000s, let's say. In contrast, with simsearch 1000 searches on ChEMBL takes only 8s:
$ simsearch --threshold 0.8 -q largequery.fps chembl.fps > thresholdb.fps.txt
How does it manage this? Well, first of all, it runs in parallel on 4 available CPUs by default (...see Andrew's comment below). These figures come from a server which has 4 hyperthreading CPUs. The rest of the improvement comes from the use of in-memory data structures and algorithmic magic, some of which Andrew has described on his blog over the last year or so.

With speeds like this, it brings all-against-all similarity searches within reach, and simsearch provides an option just for this. The following finds the nearest 8 molecules for each molecule in the dataset:
$ simsearch -k 8 --NxN largequery.fps > NxN.fps.txt
I tried this for the first 10000 molecules in ChEMBL, and it took about 1s; 50K took 11s; 100K took 32s; 250K took 144s; and the whole of ChEMBLdb (1.2 million) took 2900s (48m 20s). To put this in context, an all-against-all search of ChEMBLdb using fastsearch would take something like 14 days.

Single-molecule query revisited

The alert reader may be wondering why, if the multi-molecule query is so darned fast, the single-molecule query is no faster than Open Babel's fastsearch. The answer is rather simple (...but not quite right - see Andrew's comment below): 99% of the time spent on the single-molecule query is setup. Once the search datastructure has been initialised, then the response for a query is within milliseconds. You can see timings for these over at the chemfp website. To achieve these timings in practice, you would need to write a Python script that used the chemfp library and was accessed via some sort of client/server architective, e.g. a local webservice. As Andrew points out, this would allow search results to be returned instantly as a chemist sketches a structure. Trying this out is left as an exercise for the reader. :-)

In conclusion

So, in short, if you have any interest in running multiple queries against a database, comparing two large datasets, or finding pairwise similarity within a dataset, check out chemfp.

1. I should point out that Andrew is making chemfp available as open source but with delayed release. Commercial licensees get support and the latest features.
2. For ease of use with Open Babel, the ChEMBLdb SDF file was modified to add the chembl_id to the title. Oh ye gods of ChEMBL please consider doing this for your humble users.

Image credit: Russell J Watkins on Flickr

Monday, 15 April 2013

Talk on Universal SMILES at New Orleans ACS

Early on Wednesday I presented my recent paper on Universal SMILES at the New Orleans ACS. This is a canonical SMILES string that uses the InChI canonical labels. Usually I tell the audience that the slides will be made available, but this time there was someone in the audience who was standing up every so often and taking photos; I thought this was so awesome I said nothing.

Anyway, here are the slides. They are very wordy, partly because I used up all my visualisation skills fiddling around with the other formal talk I was giving (appearing soon on the NextMove blog), and partly because I was aware that for web readers a picture of a donkey surfing might not spell out how to create canonical SMILES quite as well as traditional bulletpoints.

Universal (and Inchified) SMILES are available right now in Open Babel. Rumour has it that the CDK and RDKit are considering supporting Universal SMILES. If you use these or any other toolkits, and think that having support for Universal SMILES would be nothing short of paradigm-shifting awesome, ask them to add support.

Thursday, 11 April 2013

Talk on Open Babel at New Orleans ACS

Now that Open Babel 2.3.2 has been released 6 months, I thought it might almost be time to talk about what's new, and also mention what's under development.

Here's the talk I gave at Rajarshi's CINF Flash session last Sunday. It was my first time at the flash session but I'm definitely going to make a point of attending and presenting at this in future; it is preceded by a free lunch!