Thursday, 13 October 2011

Recognise this? Roundtripping chemical images

With the imminent release of Open Babel 2.3.1, I thought I'd come up with some examples of use for a new feature, PNG depiction.

To generate a PNG with Open Babel you just use the PNG output format:
> obabel -:CC(=O)Cl -O tmp.png

Open Babel also lets you embed the chemical structure (in any format) directly into a new or existing PNG file; in the commands that follow, -xO smi embeds it as SMILES. If you do this, then you can roundtrip as follows:
> obabel -:CC(=O)Cl -O tmp.png -xO smi
> obabel tmp.png -osmi
CC(=O)Cl
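
To see what "embedded" means in practice, here is a minimal Python sketch that lists the text chunks in a PNG file. It relies only on the PNG chunk layout; that Open Babel stores the structure in a plain (uncompressed) tEXt chunk, and which keyword it uses, are assumptions on my part, so treat the output as something to inspect rather than a documented interface. The function name read_text_chunks is just for illustration.

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def read_text_chunks(data: bytes) -> dict:
    """Return {keyword: text} for every uncompressed tEXt chunk in a PNG."""
    if not data.startswith(PNG_SIGNATURE):
        raise ValueError("not a PNG file")
    chunks = {}
    pos = len(PNG_SIGNATURE)
    while pos + 8 <= len(data):
        # Each chunk is: 4-byte big-endian length, 4-byte type, payload, 4-byte CRC.
        length, ctype = struct.unpack(">I4s", data[pos:pos + 8])
        payload = data[pos + 8:pos + 8 + length]
        if ctype == b"tEXt":
            # A tEXt payload is "keyword NUL text", both Latin-1 encoded.
            keyword, _, text = payload.partition(b"\x00")
            chunks[keyword.decode("latin-1")] = text.decode("latin-1")
        if ctype == b"IEND":
            break
        pos += 12 + length  # skip length, type, payload and CRC
    return chunks
```

Reading tmp.png as bytes and passing it to read_text_chunks should show whatever text was embedded, if the assumption about the chunk type holds. Compressed (zTXt) chunks are deliberately ignored here, and the CRC is not checked.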

If you haven't embedded a chemical structure in the image, you'll have to use optical chemical recognition software such as the open source OSRA (Igor Filippov) or Imago (GGA Software). Both of these can output a MOL/SDF file, which contains the 2D coordinates of the perceived structure, and this can be depicted. I did this for a set of 450 images from the Japanese Patent Office as follows:
> for %a in (*.tif) do "C:\Program Files (x86)\osra\1.3.8\osra.bat" %a --format sdf | obabel -isdf -O %~na_osra.png -d
> for %a in (*_chem.png) do "C:\Program Files\GGA Software\Imago Toolkit\alter_ego.exe" %a -o tmp.mol -q && obabel tmp.mol -O %~na_imago.png -d
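
For anyone not on Windows, the OSRA one-liner could be sketched in Python roughly as follows. The command-line options are the ones used above; the executable name "osra" and the function names are assumptions for illustration, so adjust them to your own installation. Unlike the cmd loop, this writes each depiction next to its input image rather than into the current directory.

```python
import subprocess
from pathlib import Path

# Assumption: OSRA is on the PATH under this name; change it if not.
OSRA = "osra"


def osra_pipeline(image: Path) -> tuple:
    """Build the two command lines: OSRA to SDF, then Open Babel to PNG."""
    out_png = image.with_name(image.stem + "_osra.png")
    recognise = [OSRA, str(image), "--format", "sdf"]
    depict = ["obabel", "-isdf", "-O", str(out_png), "-d"]
    return recognise, depict


def depict_with_osra(image: Path) -> None:
    """Feed OSRA's SDF output to obabel on stdin, mirroring the pipe above."""
    recognise, depict = osra_pipeline(image)
    sdf = subprocess.run(recognise, capture_output=True, check=True).stdout
    subprocess.run(depict, input=sdf, check=True)


def depict_all(folder: Path) -> None:
    """Process every .tif file in a folder, one at a time."""
    for image in sorted(folder.glob("*.tif")):
        depict_with_osra(image)
```

Calling depict_all(Path(".")) would then process every TIFF in the current directory. With check=True, the first image that either tool fails on stops the run, which may or may not be what you want for a batch of 450.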

The results are here: Subset 1 2 3.

Notes:
1. Open Babel depiction for large molecules needs to be fixed, as the lines get faint and disappear in some cases. [Update (26/03/2012): Now fixed]
2. The TIFF files needed to be converted to PNGs for Imago (I used a "for" loop with ImageMagick's convert).
3. In the case of multiple molecules in the OSRA output, only the first molecule is depicted (I think).
4. Several structures gave error messages when depicting the Imago output due to unrecognised labels. I think there's a way around this but I didn't look into it.
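
The TIFF to PNG conversion mentioned in note 2 might look something like this in Python. This is a sketch rather than the loop I actually ran: it assumes ImageMagick's convert is on the PATH (newer ImageMagick releases also provide it as magick, and on Windows the name can clash with a system tool), and the function names and the output naming are made up for the example.

```python
import subprocess
from pathlib import Path


def convert_command(tif: Path) -> list:
    """Build the ImageMagick command that turns one TIFF into a PNG."""
    # Assumption: the PNG keeps the same name with a .png extension.
    return ["convert", str(tif), str(tif.with_suffix(".png"))]


def convert_all(folder: Path) -> None:
    """Convert every .tif file in a folder, stopping on the first failure."""
    for tif in sorted(folder.glob("*.tif")):
        subprocess.run(convert_command(tif), check=True)
```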

7 comments:

Egon Willighagen said...

Really, really, really nice!

But what's up with the coloring in that image? Is that the same green?

(Recent OB makes me jealous...)

baoilleach said...

Have you thought about adopting the MCDL Java code for layout? That's what we use (as converted to C++ by the author, Sergei Trepalin), and it seems to work great.

gilleain said...

Hmm. I've tried out the MCDL editor (and read the paper) but SourceForge doesn't have the source code...

Geoff said...

You should mention that OB can also round-trip its own PNG files, since it includes the molecule as text inside the graphic.

baoilleach said...

@gilleain: I see some source code there as a download.

@Geoff: I think I did mention that, but I'll make it a bit clearer...

gilleain said...

@baoilleach: Ah, right. Found it; thanks.

The code is ... interesting. It's essentially already in a non-Java language (C++?) with things like VarInt as a way around the call-by-value only nature of Java functions.

It might be the 100 or so templates that make the layout code, but I haven't looked very closely.

baoilleach said...

I think it was originally Delphi and then ported to Java and C++. But you should talk to Sergei if interested.