Noel O'Blog: Cheminformatics toolkit face-off

Updated 23/04/08 (see below)

In a recent post, I discussed some thoughts I had about using the Pybel API as a generic API for cheminformatics toolkits. I've started to implement this (more on this at a later date) and have begun writing some tests to compare results from different toolkits. To begin with, I was hoping that RDKit (Jan2008), CDK (1.0.2) and OpenBabel (dev, soon-to-be 2.2.0) could agree on LogP and TPSA values calculated with the same group contribution approach.

The code is below, but first here are the results:

ALogP
=====
****Example1*****
The calculated value from JCICS, 1999, 39, 868 is 1.400000
cinfony.rdk {'MolLogP': 1.4007999999999998,
'pyMolLogP': 1.4007999999999998}
cinfony.pybel {'LogP': 0.80749999999999988}



****Example2*****
The calculated value from JCICS, 1999, 39, 868 is 2.750000
cinfony.rdk {'MolLogP': 2.7486000000000006,
'pyMolLogP': 2.7486000000000006}
cinfony.pybel {'LogP': 1.6415999999999995}



****methane*****
The calculated value from JCICS, 1999, 39, 868 is 0.144100
cinfony.rdk {'MolLogP': 0.6331, 'pyMolLogP': 0.6331}
cinfony.pybel {'LogP': 0.14410000000000001}



TPSA
====
****metoprolol*****
The calculated value from JMC, 2000, 43, 3714 is 50.700000
cinfony.cdk {'tpsa': 50.719999999999999}
cinfony.rdk {'TPSA': 50.719999999999999}
cinfony.pybel {'TPSA': 50.719999999999999}



****nordiazempam*****
The calculated value from JMC, 2000, 43, 3714 is 41.500000
cinfony.cdk {'tpsa': 41.460000000000001}
cinfony.rdk {'TPSA': 41.460000000000001}
cinfony.pybel {'TPSA': 41.460000000000001}

The good news is that they all agree on the TPSA of the two examples. The bad news is that CDK 1.0.2 does not calculate ALogP, and OpenBabel doesn't seem to be doing well for the more complicated examples (which are taken from the JCICS paper). It's possible that OpenBabel is using a different parameterisation than the paper I reference. I'm not sure either if there's any difference between the two ALogP values that RDKit calculates. Hopefully, the people that know will comment below.

Update (23/04/08): (1) I calculated the LogP of methane incorrectly - it should be 0.63610. (2) RDKit has a transposition error in the value for C in methane (only a minor error). (3) OpenBabel calculates the correct answer if hydrogens are added. I'm going to recommend that the descriptor adds the hydrogens itself, to avoid user errors in future. (Done! Thanks to Chris Morley.) (4) The two ALogP descriptors in RDKit are implementations in two different languages, one in C++, the other in Python. They should be identical. (Thanks to Greg Landrum of RDKit for identifying several of these issues)

Here's the code...

from cinfony import cdk, rdk, pybel


descs = [("ALogP", "JCICS, 1999, 39, 868"),
("TPSA", "JMC, 2000, 43, 3714")]
toolkits = [(cdk, {'ALogP': None, 'TPSA': ['tpsa']}),
(rdk, {'ALogP': ['MolLogP', 'pyMolLogP'],
'TPSA': ['TPSA']}),
(pybel, {'ALogP': ['LogP'], 'TPSA': ['TPSA']})]
testcases = {'ALogP':
{"methane": (0.1441, 'C'),
"Example1": (1.40, 'c1(O)ccccc1(OC)'),
"Example2": (2.75, 'c1ccccc1c1ccccn1')},
'TPSA':
{"metoprolol": (50.7,
"c1(OCC(O)CNC(C)C)ccc(CCOC)cc1"),
"nordiazempam": (41.5,
"c1c(Cl)ccc2NC(=O)CN=C(c3ccccc3)c12")}}

for desc, ref in descs:
print desc + "\n" + "="*len(desc)
for k, v in testcases[desc].iteritems():
print "****%s*****" % k
print "The calculated value from %s is %f" % (ref, v[0])
for toolkit in toolkits:
if toolkit[1][desc]:
mol = toolkit[0].readstring("smi", v[1])
print toolkit[0].__name__,
print mol.calcdesc(toolkit[1][desc])
print "\n\n"

Image credit: pfly

3 comments:

Unknown22 April 2008 at 19:22
Noel, I really appreciate this effort and I hope that you keep track of the differences ;-)

If you are unlucky the same SMARTS patterns might lead to different results, since the chemical perception cascades might be very different (aromaticity, hybridisation, kekulization, stereo assignment, and finally atom typing).

This was basically one reason why people on OpenSmiles could not easily agree on an aromatic standard. If there would exist one, we could just share the text definition for it ;-)

I said this already several times, if we want clean solutions we need text, better XML, perceptions, only. This would require to modularize single perception layers, which take XML input. The XML can then be distributed and all packages are fine, just by implementing a simple XML to perception layer, which would be the same ;-)

Problem, we are far from having a query language strong enough for this task. I was hoping that OpenSmiles might serve as starting point, but the discussion seems stalled. MQL is still my favourite for several reasons, but I guess I am not objective on this.

Anyway, great work ... keep going all together ... accepting the problem is already a good way in improving problems ;-)
Noel O'Boyle24 April 2008 at 15:03
Thanks for the encouraging comments, Joerg. For sure, there are problems at several levels. Perhaps the aim of this 'first generation' of cheminformatics toolkits is simply to identify which problems are the most serious going forward, so that subsequent versions will be built to handle such problems from the start, e.g. Rich's series on "Can your cheminformatics tool handle this?".
Anonymous10 March 2009 at 09:16
Good post Noel, your keeping track of things!!!. I am reading gasteiger book (cover to cover) and developing apps using ruby.

By the way, I wanna say that
"Ruby my Love!"
"Ruby the future!!"

python is not good some how...OOPs suks badly. "___" suks too much...

Lot of syntax inconsistencies in python but lot of libraries are there.

Once Ruby has as much libraries the name "python" will be deleted from my head.

My choice "(Ruby + JRuby) ||
(Ruby + C)"
for any task.

Love,
Chemboy.

Noel O'Blog

Thursday, 24 April 2008

Cheminformatics toolkit face-off - LogP and TPSA

3 comments: