Pre-processing
The standard way to carry out a fast similarity search with Open Babel is the so-called fastsearch method, which precalculates an index file of fingerprints for an sdf file, and then linearly-scans through this for hits against a single query structure. With chemfp, the initial step is similar; you convert everything to fps files, a standard file format for fingerprints developed by Andrew. Previously you needed to use chemfp's ob2fps to do this, but OB 2.3.2 added support for fps files and can generate them directly 2 to 3 times faster.$ obabel chembl_15mod.sdf -O chembl.fs # for fastsearch $ obabel chembl_15mod.sdf -O chembl.fps # for chemfp
Single-molecule query
Let's take the first molecule in the database as a query, and find those molecules in ChEMBL that are within 0.8 similarity (Tanimoto). simsearch is the name of chemfp's tool for similarity searching:$ obabel chembl.fs -osmi -s query.sdf -aa -at 0.8 > threshold.fs.txt $ simsearch --threshold 0.8 -q query.fps chembl.fps > threshold.fps.txtThey both take about 1.2 seconds. I'm being a bit vague because the exact value doesn't matter; for some query molecules fastsearch takes longer, for some less. It's just to give an idea. But in short, fastsearch appears to be in the same ballpark as simsearch for single molecule queries.
Multi-molecule query
fastsearch doesn't provide an easy way to do multiple queries. If you want to do it yourself, you would just have to do each search one-by-one. So 1000 searches would take 1000s, let's say. In contrast, with simsearch 1000 searches on ChEMBL takes only 8s:$ simsearch --threshold 0.8 -q largequery.fps chembl.fps > thresholdb.fps.txtHow does it manage this? Well, first of all, it runs in parallel on 4 available CPUs by default (...see Andrew's comment below). These figures come from a server which has 4 hyperthreading CPUs. The rest of the improvement comes from the use of in-memory data structures and algorithmic magic, some of which Andrew has described on his blog over the last year or so.
With speeds like this, it brings all-against-all similarity searches within reach, and simsearch provides an option just for this. The following finds the nearest 8 molecules for each molecule in the dataset:
$ simsearch -k 8 --NxN largequery.fps > NxN.fps.txtI tried this for the first 10000 molecules in ChEMBL, and it took about 1s; 50K took 11s; 100K took 32s; 250K took 144s; and the whole of ChEMBLdb (1.2 million) took 2900s (48m 20s). To put this in context, an all-against-all search of ChEMBLdb using fastsearch would take something like 14 days.
Single-molecule query revisited
The alert reader may be wondering why, if the multi-molecule query is so darned fast, the single-molecule query is no faster than Open Babel's fastsearch. The answer is rather simple (...but not quite right - see Andrew's comment below): 99% of the time spent on the single-molecule query is setup. Once the search datastructure has been initialised, then the response for a query is within milliseconds. You can see timings for these over at the chemfp website. To achieve these timings in practice, you would need to write a Python script that used the chemfp library and was accessed via some sort of client/server architective, e.g. a local webservice. As Andrew points out, this would allow search results to be returned instantly as a chemist sketches a structure. Trying this out is left as an exercise for the reader. :-)In conclusion
So, in short, if you have any interest in running multiple queries against a database, comparing two large datasets, or finding pairwise similarity within a dataset, check out chemfp.Notes:
1. I should point out that Andrew is making chemfp available as open source but with delayed release. Commercial licensees get support and the latest features.
2. For ease of use with Open Babel, the ChEMBLdb SDF file was modified to add the chembl_id to the title. Oh ye gods of ChEMBL please consider doing this for your humble users.
Image credit: fingerprint by Russell J Watkins on Flickr (CC BY-NC-SA 2.0)
5 comments:
really interesting!
how fast is the conversion of compounds into the fps format? and how long would it take to do this calculation with all compounds in pubchem?
I can't remember the time required for conversion to fps. Something like 30 to 40 minutes maybe. If you split it over multiple CPUs (by hand) you could do much better obviously.
I'll assume the second question is rhetorical, as you can easily try this out for yourself. :-)
Thanks for the review Noel. Good to know that obabel's interface is significantly faster than ob2fps. I'll have to see what I can do to speed things up, since the Python overhead shouldn't be that large.
Some minor points. The simsearch single-molecule query doesn't actually build the search data structure. There's special purpose code which does the search directly against the FPS file contents. This takes constant memory overhead and is faster than building the data structure, but only appropriate for a single query. (The tradeoff point is something like 5 queries.)
You can use the --memory option to force the single-molecule query to load everything into memory, or use the --scan option to force everything to use the text scanner. The most you can do at once is given by --batch-size.
"it runs in parallel on all available CPUs"
It should, by default, use 4 OpenMP threads. For more threads, set the environment variable OMP_NUM_THREADS to the number of threads you want. This will be a command-line option in future versions.
"Oh ye gods of ChEMBL please consider doing this for your humble users"
Preach it, brother! That would be an advantage of using ob2fps. :)
Thanks Andrew for the clarification.
I saw CPU usage of over 700%; I presume this is the magic hyperthreading then.
"I presume this is the magic hyperthreading then."
Beware Noel. The 'magic' of hyperthreading can often make things slower in practice. But it looks awfully cool in htop.
Post a Comment