Wednesday 23 May 2012

When Mol files go wrong III

Let's play spot the difference. Are the following structures the same? (Mol files from CHEMBL186139 and CHEMBL1180158.)
But if they're the same, then how come there are two distinct entries for this in the database? Well guess what - they don't have the same InChI:
InChI=1S/C30H36N4/c1-2-10-20-32-28-18-24-34(30-16-8-6-14-26(28)30)22-12-4-3-11-21-33-23-17-27(31-19-9-1)25-13-5-7-15-29(25)33/h5-8,13-18,23-24H,1-4,9-12,19-22H2/p+2/b31-27+,32-28?
InChI=1S/C30H36N4/c1-2-10-20-32-28-18-24-34(30-16-8-6-14-26(28)30)22-12-4-3-11-21-33-23-17-27(31-19-9-1)25-13-5-7-15-29(25)33/h5-8,13-18,23-24H,1-4,9-12,19-22H2/p+2/b31-27-,32-28+
The nitrogen attached to the ring is treated as a C=N once the two protons are added to neutralise the charge. The InChI code then considers the stereochemistry across that double bond to be defined in one case (177.2°) but undefined in the other (179.1°). Here are the pictures from winchi:
I'm not quite sure where the problem is. Is the InChI correct to make the distinction? Any thoughts?
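To see exactly where the two InChI strings above diverge, here is a minimal sketch that splits each string into layers and reports the ones that differ. The helper names (inchi_layers, differing_layers) are my own, and the parsing is deliberately naive: it assumes a standard InChI with a formula layer in which each single-letter prefix occurs only once, which holds for these two strings but not for every InChI.

```python
def inchi_layers(inchi):
    """Split an InChI string into a dict keyed by layer prefix.

    Assumes a standard InChI with a formula layer and no repeated
    layer prefixes (a simplification, not a general parser).
    """
    parts = inchi.split("/")
    layers = {"version": parts[0], "formula": parts[1]}
    for part in parts[2:]:
        layers[part[0]] = part[1:]
    return layers


def differing_layers(inchi_a, inchi_b):
    """Return {prefix: (value_a, value_b)} for every layer that differs."""
    a, b = inchi_layers(inchi_a), inchi_layers(inchi_b)
    return {
        key: (a.get(key), b.get(key))
        for key in sorted(set(a) | set(b))
        if a.get(key) != b.get(key)
    }


first = (
    "InChI=1S/C30H36N4/c1-2-10-20-32-28-18-24-34(30-16-8-6-14-26(28)30)"
    "22-12-4-3-11-21-33-23-17-27(31-19-9-1)25-13-5-7-15-29(25)33/"
    "h5-8,13-18,23-24H,1-4,9-12,19-22H2/p+2/b31-27+,32-28?"
)
second = (
    "InChI=1S/C30H36N4/c1-2-10-20-32-28-18-24-34(30-16-8-6-14-26(28)30)"
    "22-12-4-3-11-21-33-23-17-27(31-19-9-1)25-13-5-7-15-29(25)33/"
    "h5-8,13-18,23-24H,1-4,9-12,19-22H2/p+2/b31-27-,32-28+"
)

diff = differing_layers(first, second)
print(diff)  # {'b': ('31-27+,32-28?', '31-27-,32-28+')}
```

Only the /b (double bond stereo) layer differs; the connectivity, hydrogen and charge layers are identical.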

Monday 21 May 2012

When Mol files go wrong II

With time I've become more convinced that the SMILES format is more capable of faithfully storing stereochemistry than a 2D format such as Mol. Here is another tale of woe, related to tetrahedral stereocentres with one implicit bond, illustrated and annotated by Symyx Draw (am I the only one who thinks this is better than ChemDraw?).

Did you know that the stereochemistry of the wedge is interpreted differently in the two following cases?
Easy peasy, eh? But what about the in-between case, where the angle between the two plane bonds is close to 180° (see below on left)? Guess what - you're in trouble if you do this. Some software will regard the stereocentre as undefined; some will carry on regardless. If you look at the InChI string you can see that it regards the stereo as undefined, whereas the SMILES string does contain a stereocentre. In short, you've got a problem. If you're in charge of a database, you should identify such cases and fix them (manually), for example by redrawing as shown on the bottom right (if that is the correct stereo) or by adding the implicit hydrogen.
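As a rough illustration of how such cases might be flagged, the sketch below computes the angle between the two plane bonds at a stereocentre from 2D coordinates and reports it as suspicious if it is within a tolerance of 180°. The function names and the 5° tolerance are my own assumptions; real toolkits use their own thresholds, and the coordinates would have to be read from the atom block of the Mol file first.

```python
import math


def plane_bond_angle(centre, neighbour_a, neighbour_b):
    """Angle in degrees (0 to 180) at `centre` between the bonds to two
    neighbours, using 2D (x, y) coordinates."""
    ax, ay = neighbour_a[0] - centre[0], neighbour_a[1] - centre[1]
    bx, by = neighbour_b[0] - centre[0], neighbour_b[1] - centre[1]
    dot = ax * bx + ay * by
    cross = ax * by - ay * bx
    return abs(math.degrees(math.atan2(cross, dot)))


def is_nearly_linear(centre, neighbour_a, neighbour_b, tolerance=5.0):
    """True if the two plane bonds are within `tolerance` degrees of 180.

    The default tolerance is an arbitrary choice for illustration.
    """
    angle = plane_bond_angle(centre, neighbour_a, neighbour_b)
    return 180.0 - angle <= tolerance


# Hypothetical coordinates: a stereocentre at the origin with two plane
# neighbours that are almost collinear with it, then a normal zigzag.
print(is_nearly_linear((0.0, 0.0), (-1.0, 0.0), (1.0, 0.02)))   # True
print(is_nearly_linear((0.0, 0.0), (-1.0, 0.0), (0.5, 0.866)))  # False
```

Any centre that is flagged still needs to be checked by eye, since the script cannot know which stereochemistry was intended.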
In the course of other work, I've come across some instances of this problem in ChEMBL and will be talking to the team about sorting it out. Does anyone have other examples of potential stereo problems in Mol files and how to identify them?

Wednesday 2 May 2012

Speed up repeated calls to Python functions

If your Python script has repeated calls to a function with the same parameters each time, you can speed things up by caching the result. This is called memoization. It's not rocket science.

What is rocket science is that with a little bit of Python magic (see the code here), you can simply add memoization to any function with the @memoized decorator, e.g.
@memoized
def calcHOMO(smiles):
   # Generate Gaussian input file with Open Babel
   # and run Gaussian to find the HOMO.
   return homo
Calling this function with a SMILES string the first time would return the HOMO after 10 minutes. Calling it a second time with the same SMILES string would return the cached result instantly.
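For readers who want to see roughly what such a decorator does, here is a minimal sketch. It is not necessarily the same as the linked code: it only handles positional, hashable arguments and never evicts anything from the cache. The slow_square function is a stand-in for an expensive calculation.

```python
import functools


def memoized(func):
    """Cache results of `func`, keyed on its positional arguments.

    Minimal sketch: arguments must be hashable, keyword arguments are
    not supported, and the cache grows without limit.
    """
    cache = {}

    @functools.wraps(func)
    def wrapper(*args):
        if args not in cache:
            cache[args] = func(*args)
        return cache[args]

    return wrapper


calls = []  # records each time the real function body runs


@memoized
def slow_square(x):
    calls.append(x)
    return x * x


slow_square(12)  # computed
slow_square(12)  # served from the cache
print(calls)     # [12]
```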

Update (11/05/2012): This feature is available in the Python standard library as of Python 3.2, as functools.lru_cache. See Andrew's comment below.
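With Python 3.2 or later, the standard library version can be used along these lines. The function here is a hypothetical stand-in that just returns the length of the SMILES string instead of running a real calculation.

```python
from functools import lru_cache

call_count = 0  # counts how often the function body actually runs


@lru_cache(maxsize=None)  # maxsize=None means the cache is unbounded
def calc_homo_stub(smiles):
    """Placeholder for an expensive calculation (not a real HOMO)."""
    global call_count
    call_count += 1
    return len(smiles)


calc_homo_stub("CCO")  # computed
calc_homo_stub("CCO")  # served from the cache
print(call_count)                   # 1
print(calc_homo_stub.cache_info())  # hit and miss statistics
```

As with the hand-written decorator, the arguments must be hashable, which a SMILES string is.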