Tuesday 21 December 2010

All I want for Christmas is...comprehensive user documentation

Want something to read over Christmas? I've got 137 pages of technical docs right here for you.

It's the Open Babel user documentation. It still has a (*cough*) few TODOs here and there, but it brings together various bits of documentation scattered over the wiki and covers a bit of everything, I think. It's been on the web as HTML for a while, but I've just set up the automated build for a PDF and so am giving it a shout out.

So grab a few mince pies, throw another logP on the fire, and get ready for some riveting reading. Then dial it back a notch and check out the Open Babel docs.

P.S. If you find any mistakes or have any suggestions, feel (very) free to just fix them yourself or add some text by forking the docs on GitHub.

See you in the New Year...

Monday 13 December 2010

Go dig in Indigo

About a year ago, SciTouch LLC announced the release of the open source cheminformatics library Indigo. At that time, the full API of the library was not exposed to the user. Instead, a set of simplified APIs was available along with a number of command-line applications.

Just a few weeks ago at Goslar, Dmitry and Mikhail of Indigo made a couple of announcements. First, the library, documentation, etc. are no longer at SciTouch, but at GGA Software. Second, the full API is now being made available. It's actually a C library, but wrappers are available for C#, Java and Python. If you're interested in accessing it from Python, follow the Download link to the Python API.

The Indigo library is currently at the version 1.0 Beta 3 stage, but new API functions are being added on request over at the Indigo mailing list. So if there's something missing that you'd like to have, get over there now and ask about it.

Naturally, I'm interested in adding support for Indigo to Cinfony. I've already pretty much done the Python bindings. Just put indy.py into the same directory as indigo.py and away you go. As usual, let me know if you find any bugs.

Wednesday 8 December 2010

Name that stereochemistry - When Mol files go wrong

Here's an image from PubChem. I count two chiral centers, but for how many of these is the chirality specified?

As Egon pointed out in a comment on the original post, it's impossible to interpret this image without assuming the use of a particular wedge/hash bond convention. Either the stereochemistry is defined at both ends of the wedge, or it's defined only at one end.

The problem with MOL files

But this isn't just a problem with images of molecules; this same problem affects 2D MOL files. The underlying MOL file for the above molecule has that same bond marked as a wedge. Any time this happens to a bond connecting two chiral centers, there is a resulting ambiguity that requires a particular convention to be assumed. If your primary means of storing chemical data is a 2D MOL file, you should start feeling nervous right about now.
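The check itself is easy to sketch without any toolkit. The following is only an illustration, not Open Babel code: it reads the bond block of a V2000 MOL file and reports wedge (stereo flag 1) or hash (stereo flag 6) bonds whose two atoms both appear in a caller-supplied set of stereocenters. The function name `find_ambiguous_stereobonds` and the `stereocenters` argument are assumptions made for this example; in practice the set of stereocenters would have to come from a cheminformatics toolkit.

```python
def find_ambiguous_stereobonds(molblock, stereocenters):
    """Return (begin, end, kind) for wedge/hash bonds joining two stereocenters.

    molblock      -- text of a V2000 MOL file
    stereocenters -- 1-based atom numbers regarded as chiral centers
                     (hypothetical input; normally supplied by a toolkit)
    """
    lines = molblock.splitlines()
    counts = lines[3]                       # counts line: aaabbb...
    natoms = int(counts[0:3])
    nbonds = int(counts[3:6])
    kinds = {1: "wedge", 6: "hash"}         # V2000 bond stereo flags
    centers = set(stereocenters)
    found = []
    # The bond block follows the three header lines, the counts line
    # and the atom block.
    for line in lines[4 + natoms:4 + natoms + nbonds]:
        begin = int(line[0:3])
        end = int(line[3:6])
        stereo = int(line[9:12].strip() or "0")
        if stereo in kinds and begin in centers and end in centers:
            found.append((begin, end, kinds[stereo]))
    return found
```

An empty list means that no stereobond in the file joins two of the given centers.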

To be fair, the example discussed is really an isolated case in PubChem. Among the first 23071 molecules in PubChem, there are 14362 bonds connecting two chiral centers, but only 21 of these have been marked as wedge or hash. (I note that for all of these cases, it was possible to choose a different stereobond to avoid this problem.)

Another database whose primary means of storing chemical information is 2D MOL files is the ChEMBL database. This contains 635933 molecules, with 482773 inter-chiralcenter bonds. Of these bonds, there are 7335 marked as stereobonds. In other words, more than 1% of the molecules have an ambiguous stereocenter, simply because of the way the stereochemistry was encoded into the MOL file. This is probably a bit high, and I expect that ChEMBL will either fix this or point out an error in my calculations.
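As a quick sanity check, the ratios can be recomputed from the counts quoted above. The variable names are purely illustrative. Note that the per-molecule figure is an upper bound, since it assumes that no molecule contains more than one flagged bond.

```python
# Counts quoted in the text above
pubchem_flagged, pubchem_bonds = 21, 14362
chembl_flagged, chembl_bonds, chembl_mols = 7335, 482773, 635933

# Fraction of inter-chiralcenter bonds marked as wedge or hash
pubchem_bond_pct = 100.0 * pubchem_flagged / pubchem_bonds
chembl_bond_pct = 100.0 * chembl_flagged / chembl_bonds

# Upper bound on affected molecules: assumes at most one flagged bond each
chembl_mol_pct = 100.0 * chembl_flagged / chembl_mols

print("PubChem: %.2f%% of inter-center bonds flagged" % pubchem_bond_pct)
print("ChEMBL:  %.2f%% of inter-center bonds flagged" % chembl_bond_pct)
print("ChEMBL:  at most %.2f%% of molecules affected" % chembl_mol_pct)
```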

Interpret this

Okay. So we know the problem. But we still need to answer the original question...what stereochemistry was intended for the molecule in question?

First, here's the PubChem record.

The most common (and recommended) convention for handling a wedge/hash bond connecting two stereocenters is to consider the stereochemistry as defined only at the thin end of the wedge. If you look at the SMILES string calculated by OEChem, and the InChI string (calculated by the official InChI binary?), you will see that this is how the MOL file was interpreted.
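To make the two readings concrete, here is a small sketch of which atoms a wedge/hash bond is taken to define under each convention. It relies on the fact that in a MOL file the narrow end of a wedge or hash bond is the first atom listed on the bond line. The function name and the convention labels `thin_end` and `both_ends` are made up for this illustration and are not part of any toolkit.

```python
def atoms_defined_by_stereobond(begin, end, stereocenters, convention="thin_end"):
    """Return the stereocenters whose configuration a wedge/hash bond defines.

    begin, end    -- atom numbers as listed on the MOL bond line; the
                     narrow end of the wedge is at `begin`
    stereocenters -- atom numbers regarded as chiral centers
    convention    -- "thin_end" (the common, recommended reading) or
                     "both_ends" (hypothetical labels for this sketch)
    """
    if convention == "thin_end":
        candidates = [begin]
    elif convention == "both_ends":
        candidates = [begin, end]
    else:
        raise ValueError("unknown convention: %r" % convention)
    # Only atoms that are actually stereocenters can have stereo defined
    return [atom for atom in candidates if atom in stereocenters]
```

For a wedge from atom 1 to atom 2 where both are stereocenters, the first reading defines stereochemistry only at atom 1, while the second defines it at both atoms. That difference is exactly the ambiguity at issue.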

But, here's the thing...

I think that the convention that the generator of the MOL file was following was that the stereochemistry was defined at both ends. Why? Well, firstly, similar molecules in the database have their stereochemistry clearly defined, most notably ascorbic acid of which this is a derivative. And secondly, the chiral marks in the connection table in the MOL file indicate that the stereochemistry is defined at both centers. Correction (see comment by WDI below): The chiral marks do not indicate this - they actually say that the stereochem is only defined at one center.

What to do

The reason I'm even looking into this is because I'm trying to figure out how Open Babel should handle these cases. When reading them, should it just assume that the MOL file was following the common convention but issue a warning to flag up the fact that the MOL file sucks? Should it provide an option to read other conventions? Should it avoid writing files that contain inter-chiralcenter stereobonds even if they were in the input, or will that upset users who expect Open Babel to pass wedge/hash bonds through unchanged?

Of course, this could all be avoided if people would just fix their 2D MOL files. With that in mind, here are a couple of lines of code to identify such problems (requires dev version of OB 2.3):

Monday 6 December 2010

Name that stereochemistry

Here's an image from PubChem. I count two chiral centers, but for how many of these is the chirality specified?