Sunday 28 February 2010

How to store stereochemistry in Mol files

It's almost a year now since Tim Vandermeersch started to remove all of the stereochemistry code from OpenBabel and write it all from scratch. The first fruits of this work made it out into OB 2.2.3 where stereochemistry in SMILES uses the new code and works correctly.

I've helped out with reading and writing code for various formats. Right now, I'm adding stereo (i.e. double bond stereochemistry, and chirality) to the MDL Mol format. There are three places where stereochemical information can be stored in these files: the coordinates, the atom parity (in the atom block), the bond stereo (in the bond block).

My current understanding is that where 3D coordinates are present, there's no need to store stereochemical information in either the atom parity or the bond block. I think I'll probably set the atom parity anyway (since I've already written the code, and it helps when you look at the file to be able to easily identify the chiral centers).

For 2D coordinates, there's no need to store the bond stereochemistry (as this can be worked out from the coordinates), but chirality needs to be stored explicitly. The normal way to store this is not using atom parity (but I'll set this anyway for the same reasons as above), but by setting one of the bonds on the tetrahedral center to up or down.

For 0D coordinates, there are no guidelines. I propose to store cis/trans stereo using the bond stereo (you know, UP [or DOWN] at both ends of a double bond means cis), and chirality using the atom parity. The MDL spec states that atom parity should be ignored when read, but the alternative is to just forget the stereochemistry, or else to store both cis/trans stereo *and* chirality in the bond block, which may just about be possible but is likely to be a real mess.

Any thoughts?

Image credit: pt

Thursday 18 February 2010

Chemistry in R - Upcoming workshop

Some time ago, I described the available libraries for carrying out cheminformatics in R. Much of this work has been carried out by Rajarshi Guha, who has developed the R packages RCDK, RPubChem and fingerprint.

Now's your chance to have a masterclass in the use of these packages because Rajarshi is going to be giving an intensive workshop on their use in May at the EBI. Information on the workshop is available on its webpage, but all you need to know is that it costs a whopping £25 per day (the second day is on metabolomics and R) and you might want to book early.

I already have. See you there!

Here's the timetable on the first day:
  • 09:15 Brief Introduction to R
  • A very brief overview of R from a programming and application point of view. Will look at some basic programming constructs and then survey packages that are useful for cheminformatics problems. This session will also briefly touch on RDBMS access from R.
  • 10:45 Brief Introduction to the CDK
  • An overview of CDK functionality. This will be relatively high level and will not go into the nitty gritty details of Java programming. Will serve to highlight what can (and cannot) be done from R.
  • 11:30 Input/Output & Molecular Manipulations
  • Reading and writing chemical structures from various sources and in various formats. What does I/O entail in the CDK programming model? How does it affect working in R? Once we have a set of molecules, what can we do with them? We’ll cover accessing atoms and bonds, setting and getting properties on molecules and so on.
  • 13:30 Fingerprints & Similarity Searching
  • I’ll discuss accessing the various fingerprint methods of the CDK and manipulating fingerprints using the fingerprint R package. I’ll also address reading fingerprint data from files generated by other programs.
  • 14:00 Descriptors and QSAR Modeling
  • QSAR modeling is a common cheminformatics task. Key to developing QSAR models is the evaluation of molecular descriptors. In this session, I’ll cover the available types of descriptors and how one evaluates them. We’ll then run through examples of developing QSAR models, starting from molecule loading and ending at a final model.
  • 15:45 R, CDK and Chemical Databases
  • Getting chemical structure and bioassay information from PubChem using R. This section will overview the functions that let one retrieve structures, assay information and assay data. I will also highlight current limitations in terms of data size and ways around these limits.
  • 16:45 Adding New Functionality
  • This session will highlight how one might go about extending the package – either by wrapping calls to the CDK in R or by adding your own Java methods and then calling them.
  • 17:15 Additional Practical and Questions

Wednesday 10 February 2010

An RSS feed for the CCL list....Update

Some time ago I described the (unofficial) RSS feed I developed to keep up to date with the Computational Chemistry List (

If you were subscribed to that feed (and Google Reader tells me that there were 40 people at least following it) you may have noticed that it has not been updated since mid January. This was due to my use of a domain name which has since lapsed. Not to worry, just resubscribe to with your favourite feed reader, and all will be well again.

By the way, if you're interested in subscribing to Indiana University's cheminformatics list (CHMINF-L), I've just realised that there's an RSS feed provided in the bottom right of the archive page. Now you can free your inbox and still keep up to date with these high-volume lists.

Tuesday 2 February 2010

Similarity searching using OpenBabel - find similar molecules in ChEMBLdb

You can use the 'babel' program provided with OpenBabel (on Windows as well as Linux) to search in a database for molecules similar to a particular query. The full details are in the Fingerprint Tutorial on the OpenBabel wiki, but here is a case study using ChEMBLdb which is available as an SDF file of 517261 molecules.

Note that we are using the default OpenBabel fingerprint for all of these analyses. This fingerprint is FP2, a path-based fingerprint (somewhat similar to the Daylight fingerprints).

(1) Download Version 2 of ChEMBLdb from

(2) After unzipping it, make a fastsearch index (this took 18 minutes on my machine, for the 500K+ molecules).
babel chembl_02.sdf -ofs

(3) Let's use the first molecule in the sdf file as a query. Using Notepad (or on Linux, "head -79 chembl_02.sdf") extract the first molecule and save it as "first.sdf". Note that the molecules in the ChEMBL sdf do not have titles; instead, their IDs are stored in the "chebi_id" property field.

(4) This first molecule is 100183. Check its ChEMBL page. It's pretty weird, but is there anything similiar in ChEMBLdb? Let's find the 5 most similar molecules. Because we have created the fastsearch index, this is extremely fast - on my machine it just takes 2 seconds:
babel chembl_02.fs mostsim.sdf -Sfirst.sdf -at5

(5) The results are stored in mostsim.sdf, but how similar are these molecules to the query?
babel first.sdf mostsim.sdf -ofpt
> Tanimoto from first mol = 1
Possible superstructure of first mol
> Tanimoto from first mol = 0.986301

> Tanimoto from first mol = 0.924051
Possible superstructure of first mol
> Tanimoto from first mol = 0.869048
Possible superstructure of first mol
> Tanimoto from first mol = 0.857143
6 molecules converted
76 audit log messages

(6) That's all very well, but it would be nice to show the ChEBI IDs. Let's set the title field of mostsim.sdf to the content of the "chebi_id" property field, and repeat step 5.
babel mostsim.sdf mostsim_withtitle.sdf --append "chebi_id"
babel first.sdf mostsim_withtitle.sdf -ofpt
>100183 Tanimoto from first mol = 1
Possible superstructure of first mol
>124893 Tanimoto from first mol = 0.986301
>206983 Tanimoto from first mol = 0.924051
Possible superstructure of first mol

>207022 Tanimoto from first mol = 0.869048
Possible superstructure of first mol
>607087 Tanimoto from first mol = 0.857143
6 molecules converted
76 audit log messages

(7) Here are the ChEMBL pages for these molecules: 100183, 124893, 206983, 207022, 607087. I think it is fair to say that they are pretty similiar. In particular, the output states that 206983 and 207022 are possible superstructures of the query molecule, and that is indeed true.