To begin with, let's consider accessing the CDK from Python versus from Java. The input file for all of these tests is the first subset of the drug-like ligands in ZINC, 3_p0.0.sdf, which contains 24098 molecules. The three test cases are: (1) iterate over all of the molecules, (2) iterate over all of the molecules and write out the molecular weight, (3) calculate 25 descriptor values for the first 228 molecules. I implemented these in Java and, using cinfony, in Python. For example, here's the cinfony version for test case 2:
    import time
    from cinfony import cdk

    t = time.time()
    for mol in cdk.readfile("sdf", "3_p0.0.sdf"):
        print mol.molwt
    print time.time() - t
Here are the results (times are in seconds):
Method | Test 1 | Test 2 | Test 3 |
Java | 22.2 | 38.9 | 31.7 |
CPython (cinfony, cdkjpype) | 34.0 | 72.6 | 38.2 |
Jython (cinfony, cdkjython) | 23.7 | 44.4 | 34.0 |
Accessing the CDK from Jython is almost as fast as using it from a Java program: Jython takes roughly 7 to 14% longer than Java across the three tests. From CPython, however, there is a clear overhead (about 53% longer on Test 1 and 87% longer on Test 2), except where, as in Test 3, most of the time is spent in computation (about 20% longer).
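To put numbers on that overhead, here is a small sketch that divides each timing in the CDK table by the corresponding Java timing. The timings are copied from the table; the dictionary and the `slowdown` helper are names made up for this illustration, and the snippet is written for Python 3 rather than the Python 2 used in the benchmark scripts.

```python
# Timings in seconds for Tests 1-3, copied from the CDK table above.
cdk_times = {
    "Java": (22.2, 38.9, 31.7),
    "CPython (cinfony, cdkjpype)": (34.0, 72.6, 38.2),
    "Jython (cinfony, cdkjython)": (23.7, 44.4, 34.0),
}


def slowdown(times, baseline="Java"):
    """Return each method's time divided by the baseline's time, per test."""
    base = times[baseline]
    return {
        method: tuple(round(t / b, 2) for t, b in zip(values, base))
        for method, values in times.items()
    }


ratios = slowdown(cdk_times)
for method, per_test in ratios.items():
    # A ratio of 1.0 means "as fast as Java"; 1.5 means 50% longer.
    print(method, per_test)
```

The same helper could be pointed at the OpenBabel timings by passing a different dictionary and `baseline="C++"`.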
Next, let's look at accessing OpenBabel from Python versus from C++. Here I will compare the following test cases: (1) iterate over all of the molecules, (2) iterate over all of the molecules and write out the molecular weight, (3) apply 30 steps of a forcefield optimisation to the first 200 molecules. Here's an example of the cinfony script for (2). Notice any similarities to the one above? :-)
    import time
    from cinfony import pybel

    t = time.time()
    for mol in pybel.readfile("sdf", "3_p0.0.sdf"):
        print mol.molwt
    print time.time() - t
Here are the results (note: measured on a different machine than the tests above):
Method | Test 1 | Test 2 | Test 3 |
C++ | 77.7 | 126.0 | 56.8 |
CPython (cinfony, Pybel) | 78.3 | 132.9 | 60.0 |
Jython (cinfony, Jybel) | 80.4 | 135.7 | 59.4 |
Technical notes: For the CDK, I calculated the values of all descriptors (except IP) that didn't return an array of values. This came to 25 descriptors. I also skipped over one molecule, #20, that took several seconds to process. Jython can natively access Java libraries such as the CDK, but to access the CDK from CPython, cinfony uses JPype. To access OpenBabel, cinfony uses the Python SWIG wrappers from CPython and the Java SWIG wrappers from Jython. I need to repeat the runs a few times on a quiet machine to get better figures. Note also that the figures do not include the cost of loading the OpenBabel DLL or starting up the JVM.
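Repeating the runs could be done with something like the following "best of N" helper. This is only a sketch, not the script used for the figures above: `best_of` and `example_workload` are hypothetical names, the workload is a stand-in for one of the test cases, and it assumes Python 3 (for `time.perf_counter`). Because the timer starts inside an already running process, it excludes interpreter or JVM startup, in line with the figures above.

```python
import time


def best_of(func, repeats=3):
    """Call func() `repeats` times; return the shortest wall-clock time in seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        timings.append(time.perf_counter() - start)
    return min(timings)


def example_workload():
    # Stand-in for a real test case, such as iterating over an SD file.
    return sum(i * i for i in range(100000))


print("best of 3: %.4f s" % best_of(example_workload))
```

Taking the minimum rather than the mean is a common choice for this kind of benchmark, since other activity on the machine can only make a run slower, never faster.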
Image credit: MacRonin47
6 comments:
Interesting results - are you using the same input SDF for the CDK and OB tests? If so, why does the C++ version take 77s to iterate over the file vs 22s for the Java version?
Sorry - different machine, but yes same input.
Just an update. I've run the CPython OpenBabel test 1 on the same machine as the one used for the CDK tests, and the time is 88.0 (best of 3). So, it does look like the CDK iterates much faster.
Hmm, I wonder why that's the case?
Why can't you mix Pyrex/PyPy with Python code and see the results?
Also, did you check using the Psyco module?
Cheers,
Sam
@Sam: Pyrex converts Python code to the equivalent C and compiles it as an extension module. Here, the hard work is already done by an extension module, OpenBabel. Similarly, in relation to Psyco, I think this would only speed up the part done in Python, but it might be worth trying if a couple of seconds' difference is important. I don't think PyPy could help here.
BTW, you can find the final figures in the Cinfony paper, described here.