Wednesday 30 April 2008

Dear LazyWeb - How to write a unittest for when a Python module is unavailable?

I have a Python module (cinfony.pybel) which I am testing with the standard unittest module. One of the classes in the module has an optional dependency; if the Python Imaging Library (PIL) is available, a particular draw() method will work, otherwise it should raise an ImportError with a helpful message (I am simplifying the situation somewhat but this is the general problem).

I have two related questions:

(1) Should I place the "import Image" statement for PIL at the start of the draw() method, or at the start of the module (and catch the exception but set a flag)? Personally, I feel that its more Pythonic to keep all of the imports at the start of the module; that is, to be up-front about dependencies, optional or not.

(2) How can I test this code in a unittest? That is, how can I call this draw() method with PIL available, and a second time, without PIL available?

Thursday 24 April 2008

Cheminformatics toolkit face-off - LogP and TPSA

Updated 23/04/08 (see below)

In a recent post, I discussed some thoughts I had about using the Pybel API as a generic API for cheminformatics toolkits. I've started to implement this (more on this at a later date) and have begun writing some tests to compare results from different toolkits. To begin with, I was hoping that RDKit (Jan2008), CDK (1.0.2) and OpenBabel (dev, soon-to-be 2.2.0) could agree on LogP and TPSA values calculated with the same group contribution approach.

The code is below, but first here are the results:
ALogP
=====
****Example1*****
The calculated value from JCICS, 1999, 39, 868 is 1.400000
cinfony.rdk {'MolLogP': 1.4007999999999998,
'pyMolLogP': 1.4007999999999998}
cinfony.pybel {'LogP': 0.80749999999999988}



****Example2*****
The calculated value from JCICS, 1999, 39, 868 is 2.750000
cinfony.rdk {'MolLogP': 2.7486000000000006,
'pyMolLogP': 2.7486000000000006}
cinfony.pybel {'LogP': 1.6415999999999995}



****methane*****
The calculated value from JCICS, 1999, 39, 868 is 0.144100
cinfony.rdk {'MolLogP': 0.6331, 'pyMolLogP': 0.6331}
cinfony.pybel {'LogP': 0.14410000000000001}



TPSA
====
****metoprolol*****
The calculated value from JMC, 2000, 43, 3714 is 50.700000
cinfony.cdk {'tpsa': 50.719999999999999}
cinfony.rdk {'TPSA': 50.719999999999999}
cinfony.pybel {'TPSA': 50.719999999999999}



****nordiazempam*****
The calculated value from JMC, 2000, 43, 3714 is 41.500000
cinfony.cdk {'tpsa': 41.460000000000001}
cinfony.rdk {'TPSA': 41.460000000000001}
cinfony.pybel {'TPSA': 41.460000000000001}
The good news is that they all agree on the TPSA of the two examples. The bad news is that CDK 1.0.2 does not calculate ALogP, and OpenBabel doesn't seem to be doing well for the more complicated examples (which are taken from the JCICS paper). It's possible that OpenBabel is using a different parameterisation than the paper I reference. I'm not sure either if there's any difference between the two ALogP values that RDKit calculates. Hopefully, the people that know will comment below.

Update (23/04/08): (1) I calculated the LogP of methane incorrectly - it should be 0.63610. (2) RDKit has a transposition error in the value for C in methane (only a minor error). (3) OpenBabel calculates the correct answer if hydrogens are added. I'm going to recommend that the descriptor adds the hydrogens itself, to avoid user errors in future. (Done! Thanks to Chris Morley.) (4) The two ALogP descriptors in RDKit are implementations in two different languages, one in C++, the other in Python. They should be identical. (Thanks to Greg Landrum of RDKit for identifying several of these issues)

Here's the code...
from cinfony import cdk, rdk, pybel


descs = [("ALogP", "JCICS, 1999, 39, 868"),
("TPSA", "JMC, 2000, 43, 3714")]
toolkits = [(cdk, {'ALogP': None, 'TPSA': ['tpsa']}),
(rdk, {'ALogP': ['MolLogP', 'pyMolLogP'],
'TPSA': ['TPSA']}),
(pybel, {'ALogP': ['LogP'], 'TPSA': ['TPSA']})]
testcases = {'ALogP':
{"methane": (0.1441, 'C'),
"Example1": (1.40, 'c1(O)ccccc1(OC)'),
"Example2": (2.75, 'c1ccccc1c1ccccn1')},
'TPSA':
{"metoprolol": (50.7,
"c1(OCC(O)CNC(C)C)ccc(CCOC)cc1"),
"nordiazempam": (41.5,
"c1c(Cl)ccc2NC(=O)CN=C(c3ccccc3)c12")}}

for desc, ref in descs:
print desc + "\n" + "="*len(desc)
for k, v in testcases[desc].iteritems():
print "****%s*****" % k
print "The calculated value from %s is %f" % (ref, v[0])
for toolkit in toolkits:
if toolkit[1][desc]:
mol = toolkit[0].readstring("smi", v[1])
print toolkit[0].__name__,
print mol.calcdesc(toolkit[1][desc])
print "\n\n"
Image credit: pfly

Tuesday 22 April 2008

Cheminformatics toolkit face-off - Molecular weight

Next up in this series of toolkit comparisons is the molecular weight. Here, cinfony is going to support two ideas:
The gold standard is the element data in the Blue Obelisk Data Repository (BODR), which was created for just this purpose.

The following are the results for a hydrogen atom and a carbon atom according to the BODR and each of the toolkits. The first figure in parentheses is the molwt, then the exactmass.
BODR: (1.00794, 1.007823032)
(12.0107, 12)

Pybel: (1.00794, 1.007825032)
(12.0107, 12.0)

RDKit: (1.008)
(12.011)

CDK: (1.0079400539398193, 2.0156500339508057)
(12.010700225830078, 16.031299591064453)

OpenBabel is the only toolkit that gets everything right. RDKit is doing okay, but should consider using BODR data in future. The CDK presents the most intriguing results. I'm not sure whether the Python/Java interface has introduced noise into the molwt values, but they are exactly in agreement with the BODR up to the 7th decimal place or so, after which something goes weird. On the other hand, it's quite clear that the CDK's exactmass is simply not behaving as advertised. It is using Deuterium for the hydrogen and C-16 for the carbon. (I note that MFAnalyser has already been replaced in the CDK development code.)

Image credit: Atomium by bugmonkey (CC BY-NC 2.0)

Thursday 3 April 2008

Noel O'Blog celebrates one year at ACS New Orleans

To celebrate the first anniversary of Noel O'Blog, I'm going to be attending the ACS National Meeting in New Orleans, so say hi if you see me. I'll mostly be hanging around the CINF and COMP sessions, or at the CCDC (Cambridge Crystallographic Data Centre) stand in the Exposition Hall, and for sure, at the CINF reception on Tuesday evening.

If you're tired of hearing talks about how docking doesn't work, etc., etc., come along early on Monday and I'll explain one way to improve a scoring function (in the docking software GOLD) to pick out those pesky actives. Any other ideas people might have would be very welcome.