Sunday, 6 July 2008

Cheminformatics toolkit face-off: Speed (Python vs Java vs C++)

It's a bit meaningless to compare the speeds of different toolkits performing the same operations. Functionality is really the reason you are going to favour one toolkit over another. Here I'm focusing on comparing the speed of accessing the same toolkit from CPython or Jython versus accessing it directly in its original language. In other words, what price do you pay to be able to work in Python rather than C++ or Java? (I'll discuss the advantages of working in Python in a later post.)

To begin with, let's consider accessing the CDK from Python versus from Java. The test input file for all of these tests is the first subset of the drug-like ligands in ZINC, 3_p0.0.sdf, which contains 24098 molecules. The three test cases are (1) iterate over all of the molecules, (2) iterate over all of the molecules and write out the molecular weight, (3) calculate 25 descriptor values for the first 228 molecules. I implemented these both in Java and using cinfony. For example, here's the cinfony version for test case 2:
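import time
from cinfony import cdk

t = time.time()  # start the clock just before reading the file
for mol in cdk.readfile("sdf", "3_p0.0.sdf"):
    print mol.molwt  # test case 2: write out the molecular weight
print time.time() - t  # elapsed wall-clock time in seconds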

Here are the results (times are in seconds):
Method                          Test 1  Test 2  Test 3
Java                              22.2    38.9    31.7
CPython (cinfony, cdkjpype)       34.0    72.6    38.2
Jython (cinfony, cdkjython)       23.7    44.4    34.0

It's clear that accessing the CDK from Jython is almost as fast as using it in a Java program. However, there is an overhead associated with using it from CPython except where, as in Test 3, most of the time is spent in computation.
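
To put rough numbers on that overhead, here is a minimal sketch that converts the timings in the table above into a percentage slowdown relative to Java. The helper name overhead_percent and the dictionary layout are my own for illustration, not part of cinfony, and since these are single runs the percentages should be read as indicative only.

```python
# Timings in seconds for Tests 1, 2 and 3, copied from the CDK table above.
cdk_times = {
    "Java": [22.2, 38.9, 31.7],
    "CPython (cdkjpype)": [34.0, 72.6, 38.2],
    "Jython (cdkjython)": [23.7, 44.4, 34.0],
}


def overhead_percent(native, wrapped):
    """Return the extra time of `wrapped` relative to `native`, in percent."""
    # Hypothetical helper for this post, not a cinfony function.
    return [100.0 * (w - n) / n for n, w in zip(native, wrapped)]


baseline = cdk_times["Java"]
for method in ("CPython (cdkjpype)", "Jython (cdkjython)"):
    percents = overhead_percent(baseline, cdk_times[method])
    print("%s: %s" % (method, ", ".join("%.0f%%" % p for p in percents)))
```

The CPython overhead is largest for Test 2 and smallest for Test 3, which fits the observation that the cost matters least when most of the time is spent in computation.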

Next, let's look at accessing OpenBabel from Python versus from C++. Here I will compare the following test cases: (1) iterate over all of the molecules, (2) iterate over all of the molecules and write out the molecular weight, (3) apply 30 steps of a forcefield optimisation to the first 200 molecules. Here's an example of the cinfony script for (2). Notice any similarities to the one above? :-)
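import time
from cinfony import pybel

t = time.time()  # start the clock just before reading the file
for mol in pybel.readfile("sdf", "3_p0.0.sdf"):
    print mol.molwt  # test case 2: write out the molecular weight
print time.time() - t  # elapsed wall-clock time in seconds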

Here are the results (note: measured on a different machine than the tests above):
Method                          Test 1  Test 2  Test 3
C++                               77.7   126.0    56.8
CPython (cinfony, Pybel)          78.3   132.9    60.0
Jython (cinfony, Jybel)           80.4   135.7    59.4

In short, the cost of using Pybel or Jybel is small.
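
One way to see how small is to spread the extra time over the 24098 molecules in the input file. The sketch below does this for Tests 1 and 2 only, since Test 3 uses just the first 200 molecules. The function name extra_microseconds_per_molecule is made up for this illustration, and the figures come from single runs, so treat the results as rough estimates.

```python
N_MOLECULES = 24098  # molecules in 3_p0.0.sdf, all of which are read in Tests 1 and 2

# Timings in seconds for Tests 1 and 2, copied from the OpenBabel table above.
cpp_times = [77.7, 126.0]
wrapped_times = {
    "CPython (Pybel)": [78.3, 132.9],
    "Jython (Jybel)": [80.4, 135.7],
}


def extra_microseconds_per_molecule(native, other, n_molecules):
    """Return the extra time per molecule, in microseconds, for each test."""
    # Hypothetical helper for this post, not part of Pybel or Jybel.
    return [1e6 * (o - n) / n_molecules for n, o in zip(native, other)]


for method in ("CPython (Pybel)", "Jython (Jybel)"):
    extra = extra_microseconds_per_molecule(
        cpp_times, wrapped_times[method], N_MOLECULES)
    print("%s: %.0f us (Test 1), %.0f us (Test 2)" % (method, extra[0], extra[1]))
```

On these figures the extra cost stays well under a millisecond per molecule in every case.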

Technical notes: For the CDK, I calculated the values of all descriptors (except IP) that didn't return an array of values. This came to 25 descriptors. I also skipped over one molecule, #20, that took several seconds to process. Jython can natively access Java libraries such as the CDK, but to access the CDK from CPython, cinfony uses JPype. cinfony uses the Python SWIG wrappers to access OpenBabel from CPython; Jython is using the Java SWIG wrappers for OpenBabel. I need to repeat the runs a few times on a quiet machine to get better figures, but I note that the figures do not include the cost of loading the OpenBabel DLL or starting up the JVM.
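
For the repeated runs, a "best of N" helper along the following lines would do. This is only a sketch: best_of and iterate_only are names invented here, and the workload is a stand-in loop rather than a real pass over the SD file. Because the timer wraps only the call itself, anything done beforehand (importing the toolkit, starting the JVM) stays outside the measurement, as in the figures above.

```python
from timeit import default_timer


def best_of(func, repeats=3):
    """Call func() `repeats` times and return the shortest wall-clock time in seconds."""
    best = None
    for _ in range(repeats):
        start = default_timer()
        func()
        elapsed = default_timer() - start
        if best is None or elapsed < best:
            best = elapsed
    return best


def iterate_only():
    # Stand-in workload. A real test would loop over the molecules instead,
    # e.g. over pybel.readfile("sdf", "3_p0.0.sdf").
    for _ in range(100000):
        pass


print("best of 3: %.4f s" % best_of(iterate_only))
```

Taking the minimum rather than the mean is a common choice for this kind of benchmark, since interference from other processes can only make a run slower.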

Image credit: MacRonin47

6 comments:

Rajarshi said...

Interesting results - are you using the same input SDF for the CDK and OB tests? If so, why does the C++ version take 77s to iterate over the file vs 22s for the Java version?

baoilleach said...

Sorry - different machine, but yes same input.

baoilleach said...

Just an update. I've run the CPython OpenBabel test 1 on the same machine as the one used for the CDK tests, and the time is 88.0 (best of 3). So, it does look like the CDK iterates much faster.

Rajarshi said...

Hmm, I wonder why that's the case?

Anonymous said...

Why can't you mix Pyrex/PyPy with the Python code and see the results?

Also, did you check using the Psyco module?

Cheers,
Sam

baoilleach said...

@Sam: Pyrex converts Python code to the equivalent C and compiles it as an extension module. Here, the hard work is already done by an extension module, OpenBabel. Similarly, in relation to Psyco, I think this would only speed up the part done in Python, but this might be worth trying if a couple of seconds' difference is important. I don't think PyPy could help here.

BTW, you can find the final figures in the Cinfony paper, described here.