Tuesday, 22 April 2008

Cheminformatics toolkit face-off - Molecular weight

Next up in this series of toolkit comparisons is the molecular weight. Here, cinfony is going to support two values: molwt, the molecular weight, and exactmass, the exact mass.
The gold standard is the element data in the Blue Obelisk Data Repository (BODR), which was created for just this purpose.

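The following are the results for a hydrogen atom and a carbon atom according to the BODR and each of the toolkits. For each source, the first line is hydrogen and the second is carbon; the first figure in parentheses is the molwt and the second is the exactmass. RDKit gives a single figure, the molwt.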
BODR: (1.00794, 1.007825032)
(12.0107, 12)

Pybel: (1.00794, 1.007825032)
(12.0107, 12.0)

RDKit: (1.008)
(12.011)

CDK: (1.0079400539398193, 2.0156500339508057)
(12.010700225830078, 16.031299591064453)

OpenBabel is the only toolkit that gets everything right. RDKit is doing okay, but should consider using BODR data in future. The CDK presents the most intriguing results. I'm not sure whether the Python/Java interface has introduced noise into the molwt values, but they agree with the BODR up to the 7th decimal place or so, after which they diverge. On the other hand, it's quite clear that the CDK's exactmass is simply not behaving as advertised. It appears to be using deuterium for the hydrogen and C-16 for the carbon. (I note that MFAnalyser has already been replaced in the CDK development code.)
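Both CDK oddities can be checked with a little arithmetic. The sketch below is only that, a sketch: `as_float32` is a helper name made up here, and the idea that the exactmass figures include implicit hydrogens (H2 and CH4) is an assumption that the numbers happen to fit, not something confirmed from the CDK itself. It first rounds the BODR molwt values to single precision, then compares the CDK exactmass figures with the deuterium mass and with the masses of H2 and CH4.

```python
import struct


def as_float32(x):
    """Round a Python float (double precision) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


# Figures quoted in the post.
BODR_H_MOLWT, BODR_C_MOLWT = 1.00794, 12.0107
H_EXACT, C_EXACT = 1.007825032, 12.0  # exact masses of 1H and 12C
CDK_H = (1.0079400539398193, 2.0156500339508057)  # (molwt, exactmass)
CDK_C = (12.010700225830078, 16.031299591064453)
DEUTERIUM = 2.014101778  # deuterium mass as quoted in the comments, rounded

# Check 1: are the CDK molwt values just the BODR values stored as float32?
molwt_is_float32 = (
    abs(as_float32(BODR_H_MOLWT) - CDK_H[0]) < 1e-12
    and abs(as_float32(BODR_C_MOLWT) - CDK_C[0]) < 1e-12
)

# Check 2 (assumption): do the exactmass figures match H2 and CH4,
# i.e. the atom plus implicit hydrogens, rather than a single isotope?
h2 = 2 * H_EXACT
ch4 = C_EXACT + 4 * H_EXACT
TOL = 1e-6  # about the resolution of float32 at these magnitudes

print("molwt explained by float32 rounding:", molwt_is_float32)
print("CDK H exactmass - deuterium:", CDK_H[1] - DEUTERIUM)
print("CDK H exactmass - H2:       ", CDK_H[1] - h2)
print("CDK C exactmass - CH4:      ", CDK_C[1] - ch4)
print("matches H2 / CH4 within TOL:",
      abs(CDK_H[1] - h2) < TOL, abs(CDK_C[1] - ch4) < TOL)
```

If both checks pass, the molwt noise is consistent with the values passing through single-precision storage somewhere along the way, and the exactmass figures sit much closer to H2 and CH4 than to any single isotope. Neither is proof of what the CDK does internally.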

Image credit: bugmonkey

4 comments:

Egon Willighagen said...

Yeah, rounding problem for the CDK. Not sure what happens with the 16C and 2H... what code did you use? Also for the bridging bits, please.

baoilleach said...

>>> mol = cdk.readstring("smi", "[C]").Molecule
>>> mf = cdk.cdk.tools.MFAnalyser
>>> mf(mol).getMass()
16.031299591064453
>>> mf(mol).getCanonicalMass()
12.010700225830078

(Not quite sure what you mean by the bridging bits...hope this is sufficient)

Andrew Dalke said...

If you want, I can run tests against the OpenEye libraries. That depends on if this is a "toolkit face-off" or an "open source toolkit face-off" :)

There is an "OECalculateMolecularWeight()" function which does a MW calculation ignoring isotopic weight, but nothing more. Instead, you get to build your own from:

>>> OEGetIsotopicWeight(1, 1)
1.007825032
>>> OEGetIsotopicWeight(1, 2)
2.0141017780000001
>>> OEGetIsotopicWeight(1, 3)
3.0160492680000002

Accurate MW calculation is a bit of a problem. Do you know the isotopic distributions for an element? Does the material come from a land source or a sea source? They have different distributions. Indeed, corn-based foods have their own distinctive carbon isotope distribution.

Even if you stick with exact values, the measured values do change. Just checking now I see that there are 2007 numbers from IUPAC (see http://old.iupac.org/news/archives/2007/atomic-weights_revised07.html ) which are different than the BODR data. It looks like BODR uses the 2005 numbers.

That said, thanks for doing the cross-check. Typos and mistakes are so easy to do, and hard to verify when there's just lists upon lists of numbers.

baoilleach said...

Thanks for the offer Andrew, but I'm actually just running through test cases for cinfony, but dressing it up as a "face-off" :-)

Regarding the figures in the BODR, I'll forward the comment to those responsible.