Friday 28 March 2008

Pybel as a generic API for cheminformatics libraries - proof of concept using CDK

I'm very interested in interoperability of open source chemistry codes. Following a comment by Egon on a recent post of mine, I started wondering whether the Pybel API could be used with other cheminformatics libraries as a backend.

The advantages of this for the user would be (a) a reduced learning curve: if you know how to use Pybel, you can access any of several different cheminformatics libraries with the same syntax; (b) the same scripts could be used to carry out a particular analysis using different cheminformatics libraries, which may have different fingerprints, descriptors or implementations of particular algorithms (this is of course also useful for cross-checking the results of different programs); and (c) a smaller divide between different cheminformatics toolkits (interoperability!!).

The rationale behind Pybel (described in the paper) lends itself to this use. Pybel doesn't attempt to wrap all the functionality of OpenBabel, but only the most common tasks in cheminformatics. For advanced options, or additional functionality, you can go behind the scenes and access OpenBabel directly. As a result, I propose that the Pybel API represents a generic API (one of many possible, of course) for accessing any cheminformatics library.
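To make the idea concrete, here is a minimal sketch of one common API sitting on two different backends. Everything in it is hypothetical: the ToyToolkitA and ToyToolkitB classes and their method names are made-up stand-ins, not the OpenBabel or CDK APIs, and the wrappers are not how Pybel or CDKabel are actually written. The only thing borrowed from Pybel is the shape of the user-facing attributes (molwt and atoms).

```python
# Hypothetical sketch: the toolkit classes below are toy stand-ins,
# not real OpenBabel or CDK objects.

class ToyToolkitA:
    """Stand-in for one toolkit's native molecule (made-up method names)."""
    def __init__(self, atom_masses):
        self._masses = list(atom_masses)

    def GetMolWt(self):
        return sum(self._masses)

    def NumAtoms(self):
        return len(self._masses)

    def GetAtomMass(self, i):
        return self._masses[i]


class ToyToolkitB:
    """Stand-in for a second toolkit with a different native API."""
    def __init__(self, atom_masses):
        self.atomList = list(atom_masses)


class MoleculeA:
    """Common API implemented on top of ToyToolkitA."""
    def __init__(self, native):
        self.native = native  # escape hatch: go behind the scenes when needed

    @property
    def molwt(self):
        return self.native.GetMolWt()

    @property
    def atoms(self):
        n = self.native.NumAtoms()
        return [self.native.GetAtomMass(i) for i in range(n)]


class MoleculeB:
    """The same common API implemented on top of ToyToolkitB."""
    def __init__(self, native):
        self.native = native

    @property
    def molwt(self):
        return sum(self.native.atomList)

    @property
    def atoms(self):
        return list(self.native.atomList)


def describe(mol):
    """User script: only touches the common API, so any backend works."""
    return "Molecule has molwt of %.2f and %d atoms" % (mol.molwt, len(mol.atoms))


masses = [12.0, 1.0, 1.0, 1.0, 1.0]  # toy data, not a real file
print(describe(MoleculeA(ToyToolkitA(masses))))
print(describe(MoleculeB(ToyToolkitB(masses))))
```

The user script (describe) never mentions a toolkit, while the native attribute keeps the toolkit's full functionality reachable for anything the common API doesn't cover.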

To test this, I have created CDKabel, a proof of concept which shows that the Chemistry Development Kit (CDK) can be accessed using Pybel syntax through Jython. CDKabel does not yet pass all of the Pybel tests, but there's enough to show that the approach has some merit. Compare the following: here's some Python code using Pybel and OpenBabel:
C:\Documents and Settings\oboyle>python25
Python 2.5 (r25:51908, Sep 19 2006, 09:52:17) [MSC v.1310 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> from pybel import *
>>> for mol in readfile("sdf", "head.sdf"):
...     print "Molecule has molwt of %.2f and %d atoms" % (mol.molwt, len(mol.atoms))
...
Molecule has molwt of 122.12 and 15 atoms
Molecule has molwt of 332.49 and 28 atoms
>>>
Now here's some Jython code with CDKabel and CDK:
D:\Tools\CDK>set CLASSPATH=cdk-1.0.2.jar

D:\Tools\CDK>..\jython2.2.1\jython
Jython 2.2.1 on java1.6.0_05
Type "copyright", "credits" or "license" for more information.
>>> from cdkabel import *
>>> for mol in readfile("sdf", "head.sdf"):
...     print "Molecule has molwt of %.2f and %d atoms" % (mol.molwt, len(mol.atoms))
...
Molecule has molwt of 122.04 and 15 atoms
Molecule has molwt of 331.96 and 28 atoms
>>>
Well, at least they agree on the number of atoms :-) (It's my fault - CDK has, like, ten different ways of calculating the molecular mass, and I just chose one at random :-) )
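One possible explanation for the gap (a guess, not something confirmed here against either toolkit) is that one number is the average molecular weight and the other is the monoisotopic mass. The sketch below works this through for C7H6O2. That formula is an assumption: it is simply a 15-atom formula whose average weight matches the first molecule, and the contents of head.sdf aren't shown in this post. The atomic masses are standard rounded values.

```python
# Atomic masses in unified atomic mass units (standard rounded values).
AVERAGE = {"C": 12.011, "H": 1.008, "O": 15.999}           # natural isotope mix
MONOISOTOPIC = {"C": 12.0, "H": 1.00782503, "O": 15.99491462}  # 12C, 1H, 16O only

# Assumption: a 15-atom formula consistent with the first molecule above.
formula = {"C": 7, "H": 6, "O": 2}


def mass(formula, table):
    """Sum the per-element mass times the element count."""
    return sum(table[element] * count for element, count in formula.items())


print("%d atoms" % sum(formula.values()))
print("average molecular weight: %.2f" % mass(formula, AVERAGE))
print("monoisotopic mass:        %.2f" % mass(formula, MONOISOTOPIC))
# 15 atoms
# average molecular weight: 122.12
# monoisotopic mass:        122.04
```

Under that assumption, both toolkits would be right about the first molecule; they would just be answering slightly different questions, which is exactly the kind of thing a common API has to pin down.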

I've only spent a few minutes throwing CDKabel together, so it doesn't do much beyond the example shown. However, if you're interested, you can download it and try it for yourself.

I'd appreciate comments on the idea that there is a core Python API that could be usefully applied to several cheminformatics libraries. Would anyone use CDKabel if it were available?

1 comment:

Andrew Dalke said...

Here are some of my experiences with the basic API from PyDaylight, which as you know is similar to pybel's, although it doesn't handle file I/O.

I wrote code for one of my clients using PyDaylight. Years later they ported everything over to OEChem. I in effect reimplemented enough of PyDaylight in OEChem to make everything work. The biggest problem was SMARTS portability: OEChem doesn't support "vector bindings" and changes the semantics of a few terms.

I spent several days going through the regression failures to characterize all of these differences. It's something you'll have to worry about with a "cross-platform" pybel. Code isn't that portable if it does different things on each platform.

Another client had tools that worked with OEChem but wanted to be able to demo it on machines without an OEChem license. They didn't use much of OEChem and I was able to emulate that on top of FROWNS. It was a C++ library which turned around and called a Python library. Strange, but it worked well enough.

The biggest problem there was the performance. I don't recall exactly, but I think it was about 100 times slower than OEChem.

In both those cases the drive to switch was because of licensing problems. With pybel on top of free toolkits, that's not so important.