Thursday, 8 May 2008

Cheminformatics toolkit face-off - Depiction

[See update (28/10/08).]

Now for some pretty pictures as well as some not so pretty. Yes, it's the turn of the structure diagram generators (SDGs) to strut their stuff and throw some shapes. How do they perform for 100 random compounds from PubChem?

Here are my results for depiction and structure diagram generation (Note: I will move these links to my regular hosting in the near future). The images were generated using cinfony, the code is here and the dataset is here.

Some comments:
(0) Rich Apodaca has written an overview of Open Source SDGs.
(1) 2D coordinate generation is independent of depiction. A SDG typically has both parts but coordinates could be generated with one toolkit and depicted with another.
(2) Looking good is not the same as chemical accuracy. But looking good is important too! :-)
(3) OASA (Obviously Another Stupid Acronym) is a Python SDG which is part of BKchem by Beda Kosata. To depict images, it uses Pycairo. OASA currently does not support stereochemistry (e.g. across double bonds) and so the generated coordinates will not be chemically-accurate in some circumstances. This of course does not affect its ability to depict coordinates generated by other toolkits. OASA can depict wedges and bonds but I haven't yet implemented this in cinfony.
(4) It is not possible to use the CDK's depiction mechanism natively from CPython using JPype (it's fine from Jython though). That is why I use OASA to depict the CDK's generated coordinates. Technically, this is because JPype doesn't allow subclassing of Java classes (JPanel in this case). It does however, allow Python classes to implement a Java interface. It would be great if the CDK could add a convenience interface to allow this (I can supply more details off-blog).
(5) The PubChem images and coordinates are generated by the Cactvs toolkit.
(6) There are two sets of RDKit images. Since the current release has quite poor depictions, I decided to also include the results from a development branch, which uses Aggdraw.
(7) It is important to consider how to handle hydrogens. With OASA, I just drew all the hydrogens. This is probably not a good idea.
(8) It is likely that I have not used the best parameters, etc. for generating some of these images. I welcome any suggestions to improve image quality.
(9) OASA wasn't able to layout CID 1373132, and so that molecule was not included (I guess I should have just caught the exception and continued). The error message was:
Traceback (most recent call last):
File "cdkjpype.py", line 464, in draw
oasa.cairo_out.cairo_out().mol_to_cairo(mol, filename)
File "bkchem\oasa\oasa\cairo_out.py", line 85, i
n mol_to_cairo
self._draw_edge( e)
File "bkchem\oasa\oasa\cairo_out.py", line 121,
in _draw_edge
side += reduce( operator.add, [geometry.on_which_
side_is_point( 
start+end, (
self.transformer.transform_xy( a.x,a.y))) for a
in ring if a!=v1 and a!=v2])
File "bkchem\oasa\oasa\geometry.py", line 92, in
on_which_side_is_point
b = atan2( y2-y1, x2-x1)
ValueError: math domain error

(10) PubChem entries with more than 1 connected component were not included in this test. (As a result, the number of molecules shown is actually less than 100.)

Image: Creation by Ariana Rose Taylor-Stanley (CC BY 2.0)

6 comments:

Wolf Ihlenfeldt said...

In PubChe, both coordinates and depictions are *exclusively* generated by the Cactvs toolkit (free for academics, www.xemistry.com/academic). There is *zero* OpenEye involvement in these.

greg landrum said...

I think using the Hs with OASA was unfair to it. Having the Hs in there during the coordinate generation makes things super crowded (and thus hard to depict) and having them in the drawings looks a bit silly.

The new RDKit molecule drawing code you used isn't drawing dashed bonds correctly (obviously). The author (ahem!) forgot to check in the line dashing code. That will soon be fixed.

Noel O'Boyle said...

I got PMRed! Hope the server can handle it. See further comments there

Noel O'Boyle said...

@wolf: I've corrected my error. My mistake was due to the fact that the SDF files contain "OEChem" in the header.

@drgergl: It's now clear that I need to remove the Hs before using OASA. If OASA had an option to remove relevant Hs itself I would have used it. In RDKit, some Hs are retained, so it seems that it's not just a case of removing all Hs.

Wolf Ihlenfeldt said...

Here are some comments on the more subtle mechanics working behind the scenes of the PubChem layouter. This may help to explain why the PubChem images (mostly) look familiar, recognizable and pleasant, and why other tools fail:

a) Ring system alignment. Dominant ring systems are drawn in a standard orientation (13199309). We do not want to see steroids in weird orientations.

b) Ring system substituents and embedded hetero atoms also influence the orientation - substituents preferentially on top, heteroatoms on bottom (14610101).

c) You want a clear indication whether a double bond is stereo or not. Stereo double bonds have explicit H, non-stereogenic double bonds are crossed (3834779). Plotting a stereogenic double bond in a random cis or trans fashion without crossing bonds, wavy substituent bonds or other indicators is a grave error.

d) The stereochemistry of ring double bonds and double bonds attached to a ring (11870297) is definitely important when they are part of a macrocyle. No other renderer seems to get this right.

e) In some cases, wedges are best put on explicit hydrogens (14785619). None of the other renders manage to display any recognizable stereochemistry for this cpd.

f) Charges and isotope labels are an important part of the structure data. Omitting them is just wrong (12389967).

g) Dynamic autoscaling is a must. For many applications, you cannot allow huge compounds to break out of your box (9682082). All PubChem images are the same size. If a structure is too big to fit into the box, it is scaled down.

h) It is often possible to avoid atom overlap with a little effort in postprocessing (11829069, 19040534). And if you need to make a compromise (14384490), better allow bonds and rings to overlap than atoms.

Some other important features not shown in the sample set, but handled by the renderer are:

- Stereochemically correct odd and even allenes, quadratic planar cpds etc. are important, too, and do appear in PubChem. And SF4 should not be drawn the same way as CH4 -atom hybridization is part of the equation.

- The renderer can compute and display contracted symbols (NH2, COOH), but this is currently not enabled for the PubChem images. Also, it can both use available z coordinate information, or automatically infer it for cage ring systems, in order to render crossing bonds with a gap.

- If there are multiple fragments, special alignment rules bring opposite charge pairs into vicinity, etc. And if somebody encodes an alloy as 90 atoms of metal A and 10 atoms of metal B (yes, you can find them in PubChem), you do not want to see a chain of pearls in the image, but something more reasonably grouped.

The Cactvs display routines obtain information about defined stereochemistry and connectivity from code implemented with the aid of another toolkit, so it is not to blame for the choice of strange tautomers such as in 20834758.

The layout component of the toolkit code is more than 25K lines. This count includes ring analysis, rendering functions etc. but not auxiliary support routines (I/O, low-level imaging). This is probably more than the total line count of some of the other tool kits. As always, your results tend to be related to what you invest.

Unknown said...

oh oh, WDI entered the scence and you are a very dangerous and very welcome person having in any cheminformatics discussion ;-)

Thanks, for clarifying !

And, any interest in screwing some of our naive OpenSMILES discussions? Your comments are wholeheartedly welcome.