Saturday, 8 November 2008

Of OChRe, OSRA and OASA (but not OSCAR)

The field of Optical Chemical Recognition (OChRe) has been around a while with a number of well-established players (see Antony Williams' post on the subject). There's just a single open source program though, OSRA by Igor Filippov of the NIH, which was released in July 2007. Given an image containing one or several chemical structures, OSRA returns the SMILES string for the structures.

The dependencies of OSRA give an insight into the term "open source ecosystem". Rather than reinvent the wheel, OSRA makes use of open source libraries for optical character recognition (OCRAD, GOCR), bitmap to vector image conversion (POTRACE), messing about with images (ImageMagick, GREYCstoration, ThinImage, CImg) and of course, cheminformatics (it uses either OpenBabel or RDKit). Luckily for Windows users, Igor provides a compiled version.

Back in July, I emailed Igor to request a new feature, the ability to output an SDF file containing the coordinates taken from the image. Three hours later he replied that he'd added it. And now, only four months later, I've gotten around to testing it (insert excuse here)...

Anyway, one nice way of testing conversion code is by roundtripping (or "There And Back Again"). So I took the now legendary depiction faceoff test file (see, for example, here), used the coordinates therein to create a PNG image using OASA (via Pybel or cinfony), ran OSRA on the resulting image to get some coordinates, and then used those coordinates to generate a PNG again. By eyeballing the two PNG images, it's possible to discover errors. So, here are the results of the OChRe.

Notes:
(1) If you notice any trends in the errors, comment below and Igor might fix them.
(2) Where there are missing images after the OChRe, this is where OSRA missed a bond (probably reasonably), generated two molecules, and caused OASA a headache (it only handles single molecules).

Here's the code:
import os
import popen2

import pybel

odir = "images"
for mol in pybel.readfile("sdf", "onecomponent.sdf"):
    title = mol.title
    print title
    # Depict the original coordinates with OASA
    mol.draw(usecoords=True, show=False, filename=os.path.join(odir, "%s_oasa.png" % title))
    # Run OSRA on that image, asking for SDF output (popen3 returns stdout first)
    o, i, e = popen2.popen3("../osra-trunk/osra -f sdf %s/%s_oasa.png" % (odir, title))
    osrasdf = o.read()
    newmol = pybel.readstring("sdf", osrasdf)
    try:
        # Depict the coordinates that OSRA recovered from the image
        newmol.draw(usecoords=True, show=False, filename=os.path.join(odir, "%s_osra.png" % title))
    except AssertionError:
        # OASA only handles a single connected molecule
        print "Unconnected!"
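
The popen2 module no longer exists in Python 3, so for anyone re-running this on a newer setup, here is a minimal sketch of the OSRA step using subprocess instead. The helper names (run_osra, count_fragments) are made up for illustration; the osra path and the "-f sdf" option are the same as in the script above. count_fragments assumes OSRA writes a V2000 connection table, and simply counts connected components so that a two-fragment result can be skipped or logged before it gets anywhere near OASA.

```python
import subprocess


def run_osra(image_path, osra_path="../osra-trunk/osra"):
    """Run OSRA on one image with SDF output and return its stdout as text."""
    result = subprocess.run(
        [osra_path, "-f", "sdf", image_path],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
    )
    return result.stdout


def count_fragments(molblock):
    """Count connected components in a V2000 connection table (assumed format)."""
    lines = molblock.splitlines()
    counts = lines[3]                      # counts line follows the 3 header lines
    n_atoms, n_bonds = int(counts[0:3]), int(counts[3:6])
    parent = list(range(n_atoms + 1))      # union-find over 1-based atom indices

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    first_bond = 4 + n_atoms               # bond block starts after the atom block
    for line in lines[first_bond:first_bond + n_bonds]:
        a, b = int(line[0:3]), int(line[3:6])
        parent[find(a)] = find(b)          # merge the two atoms' components
    return len({find(i) for i in range(1, n_atoms + 1)})
```

In the loop above, something like "if count_fragments(osrasdf) > 1" could then replace the try/except around the second draw call. Note that the sketch does no error handling: empty OSRA output would raise an IndexError.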

Image credit: Jason.Hudson

17 comments:

Egon Willighagen said...

Nice analysis! BTW, I think we can all forgive 14384490 for going wrong :)

Maybe you can add some summary on the top of the page to list success/fail rates. And maybe categorize them accordingly too?

Geoff Hutchison said...

If there's any sort of "trend," it's that the right column often has wedge bonds to non-chiral atoms. So I'd suggest adding a pass after recognition: if two bonded atoms are both non-chiral, then the bond should not have wedge/hash notation.

That would solve a fair number of minor bugs. It's harder to classify the major issues.

But I can attest from Open Babel that these kinds of round-trip tests are hugely useful.

Joerg Kurt Wegner said...

Love the roundtripping idea, very nice work!

Do you think it would be possible to use the very same framework with an additional image->PDF->image conversion and different resolutions? This would be a closer reality check.

Thanks again ... looking forward to the next study ;-)

Igor Filippov said...

Noel, many thanks for this analysis and the kind words!

To Geoff's comment - the wedge bonds are detected based on the line thickness, and in some cases that doesn't work too well. In most cases I've seen before it was the other way around, though - a wedge bond mis-recognized as a regular single bond. I'm not sure, however, that a post-image-processing check for chiral atoms is simple; there are a lot of weird stereochemistry cases out there...

To Joerg - unfortunately PDF is only getting processed at 150 dpi right now. There are two reasons for that - 1) Speed. A multi-page document can take quite a while even at a single smaller resolution. 2) There are some strange problems with Ghostscript that I've seen when attempting to render a PDF at higher resolutions - memory usage going through the roof, program crashes etc.

Thank you for the comments,
I really appreciate the input!

Egon Willighagen said...

Igor, if multipage PDFs are a problem, why not split them up into single page PDFs first? Would also allow easy parallelization of PDF processing...

Check out:

http://www.pdfhacks.com/pdftk/

available from many GNU/Linux distributions, like Ubuntu:

http://packages.ubuntu.com/search?keywords=pdftk

Igor Filippov said...

Egon,

The ImageMagick library, which I use, already supports reading PDFs one page at a time (using Ghostscript). The problem is not that there are many pages in a document but keeping the processing reasonably fast so that the user doesn't get bored and walk away :) Also, there seem to be some strange issues with Ghostscript at high resolutions. I would consider a replacement for Ghostscript, but it does make things more complicated if I have to use an image processing framework outside of ImageMagick.

baoilleach said...

@egon, I'll add the summary if you give me the figures. I'm trying to build a bazaar here, not a cathedral. :-)

@igor: I've never really understood the DPI issue and OSRA. What DPI should I use to analyse the images here (how do I find out)?

Egon Willighagen said...

Noel, you created the images from a CT, right? And that's what OSRA outputs... I thought you could easily compare those...

What numbers did you have in mind instead?

baoilleach said...

@egon: Checking identity of connection tables would be pretty fast. But categorising the errors...it would be faster just to eyeball the images and make notes which any reader of the blog can do. And the exact figure doesn't seem very interesting to me; what can it be compared to?
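
For what it's worth, the identity check itself could look something like the rough sketch below. It assumes that comparing canonical SMILES from Open Babel is an acceptable stand-in for comparing connection tables, and the function names (canonical_smiles, success_rate) are just made up for illustration.

```python
def canonical_smiles(sdf_text):
    """Canonical SMILES of the first record in sdf_text (needs Open Babel)."""
    try:
        from openbabel import pybel   # Open Babel 3.x layout
    except ImportError:
        import pybel                  # older installs
    # write("can") returns "SMILES<tab>title"; keep only the SMILES part
    return pybel.readstring("sdf", sdf_text).write("can").split()[0]


def success_rate(pairs, canonical=canonical_smiles):
    """Count how many (original_sdf, osra_sdf) pairs give the same canonical form.

    Returns (n_identical, n_total). The canonical argument can be swapped
    for any other function that maps SDF text to a comparable key.
    """
    identical = total = 0
    for original, recovered in pairs:
        total += 1
        if canonical(original) == canonical(recovered):
            identical += 1
    return identical, total
```

That only gives a single number though; it says nothing about what kind of error occurred.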

baoilleach said...

46 out of 90 are converted without error. I wonder is the color causing problems regarding wedge detection - I should be able to check this...

Igor said...

Noel, about the DPI - if you leave it out (or set it to 0), OSRA will try 72, 150, and 300 dpi and pick the best one automatically. It seems to work quite well, if I say so myself. Usually screen captures and computer-generated images designed to be viewed on the screen are 72 dpi, scanned documents are 300 (it's rather a convention in OCR that 300 dpi is what one should focus on), and 150 is just in-between :)
For PDF/PS the document itself is actually already in vector representation; ImageMagick renders it to raster format, then I process it as any other image. If you know that you've scanned your image at a different resolution, or if you're just looking for faster processing time, then use the -r option to specify the resolution you'd like to use.

baoilleach said...

Firstly, black+white gave the same results. Secondly, it seems that Beda Kosata and Daniel Svozil are already engaged in a thorough investigation of OChRe statistics for different programs. I look forward to seeing the results.

Igor said...

Beda Kosata and Daniel Svozil - sounds interesting, do you have a link?

baoilleach said...

"personal communication". I suggest you get in touch with Beda if interested.

Igor said...

I believe mis-recognized wedge bonds have a simple explanation - whenever a single bond ends at an intersection, the intersection can be mistaken for a thicker-looking bond end, so there is a chance of miscategorization.
This chance is greater for low-res images. Even though the bond thickness is sampled at three different points, all away from the very ends, the measurements for a regular single bond could potentially come out as 1, 2 and 3 pixels at the beginning, the middle and close to the end of the bond.

Igor said...

Version 1.1.0 is now released. It has a better wedge bond detection algorithm, which will hopefully resolve the unfortunate "trend" with mis-categorized stereochemistry.
http://cactus.nci.nih.gov/osra/

baoilleach said...

I will repeat the analysis (in a new blog post). I've also figured out how to handle the cases where OSRA generates two disconnected molecules.