Showing posts with label OpenBabel. Show all posts
Showing posts with label OpenBabel. Show all posts

Wednesday, 9 September 2009

Running the Windows OpenBabel GUI under Linux on the Windows desktop - Need some Wine?


A friend of mine (Ed Cannon) recently showed me the OpenBabel GUI running on Linux. The surprising thing about this is that OpenBabel currently does not have a Linux version of the GUI (Update 16/Sept/09: Now it does). He was running our Windows release on Linux using Wine, the Windows emulator ("sudo apt-get install wine"). Cool, I thought - I didn't realise that that would even work. Cue blog post.

To get a screenshot I needed Linux. As I described earlier, it's easy (and free) to run Linux on Windows using VMWare Player. This time I installed an Ubuntu 9.04 image. And then (after running "sudo vmware-config-tools.pl") I discovered a new feature called Unity mode. This allows you to use the virtual machine to start Linux applications that appear directly on your Windows desktop (rather than enclosed in a Linux desktop). So I decided to get a screenshot of the Windows OpenBabel GUI running under Wine/Linux together with it running natively on Windows.

The only catch is that I wasn't able to screen capture the Linux application in Windows so in the end, despite all my hard work, I had to Gimp two images together. The resulting image is accurate though, and you can click for a larger version.

Wednesday, 26 August 2009

Using OpenBabel from Java

OpenBabel 2.2.3 has just been released and with that, a new release of the Java bindings. The full details on using OpenBabel from Java are available on our wiki. On Windows openbabel.jar is included with the OpenBabel GUI so no additional installation is necessary. You just start Eclipse, add the jar file and away you go.

The following example shows how to use OpenBabel from Java. It includes an example of file format conversion, iteration over atoms, and using the SMARTS matcher.
import org.openbabel.*;

public class Test {

public static void main(String[] args) {
// Initialise
System.loadLibrary("openbabel_java");

// Read molecule from SMILES string
OBConversion conv = new OBConversion();
OBMol mol = new OBMol();
conv.SetInFormat("smi");
conv.ReadString(mol, "C(Cl)(=O)CCC(=O)Cl");

// Print out some general information on the molecule, atoms
conv.SetOutFormat("can");
System.out.print("Canonical SMILES: " + conv.WriteString(mol));
System.out.println("The molecular weight of the molecule is "
+ mol.GetMolWt());
for(OBAtom atom : new OBMolAtomIter(mol)) {
System.out.println("Atom " + atom.GetIdx() +
": atomic number = " + atom.GetAtomicNum() +
", hybridisation = " + atom.GetHyb());
}

// What are the indices of the carbon atoms
// of the acid chloride groups?
OBSmartsPattern acidpattern = new OBSmartsPattern();
acidpattern.Init("C(=O)Cl");
acidpattern.Match(mol);

vvInt matches = acidpattern.GetUMapList();
System.out.println("There are " + matches.size() +
" acid chloride groups");
System.out.print("The carbon atoms of the matches are: ");
for(int i=0; i<matches.size(); i++)
System.out.print(matches.get(i).get(0) + " ");
}
}
The output is as follows:
Canonical SMILES: ClC(=O)CCC(=O)Cl
The molecular weight of the molecule is 154.97935999999999
Atom 1: atomic number = 6, hybridisation = 2
Atom 2: atomic number = 17, hybridisation = 0
Atom 3: atomic number = 8, hybridisation = 2
Atom 4: atomic number = 6, hybridisation = 3
Atom 5: atomic number = 6, hybridisation = 3
Atom 6: atomic number = 6, hybridisation = 2
Atom 7: atomic number = 8, hybridisation = 2
Atom 8: atomic number = 17, hybridisation = 0
There are 2 acid chloride groups
The carbon atoms of the matches are: 1 6
Note: although using OpenBabel from Eclipse on Windows works fine, some users have reported problems on Linux with the default OpenBabel build. You probably need to build OpenBabel statically on Linux if you want to use it from Eclipse, but I haven't tested this. In any case, you can just compile it from the command line.

Wednesday, 18 March 2009

The Clockwisdom of SMILES

I was recently confronted with a question that many of us face at some point in our lives: how many ways can the groups attached to a chiral C be moved around in a SMILES string while retaining the clockwisdom?

What's all this about clockwisdom? Well, a chiral SMILES string can indicate R or S around a tetrahedral centre using C@ or C@@. The difference is that R or S refer to clockwisdom of groups arranged by CIP priority (with the lowest priority facing away), whereas @ and @@ refer to clockwisdom of groups arranged in order of their appearance in the SMILES string (with the first appearing facing towards) [1]. Whether this was a good design decision by the Daylight gurus, I'm not 100% sure, but that's how it is.

So in short, if you change the order of groups in the SMILES string, you may need to change the clockwisdom to ensure that stereochemistry is preserved. Specifically, if you swap two groups you will get the other enantiomer ("putting the SMILES on the other face"?) unless you flip the clockwisdom; that is, Cl[C@@](Br)(C)I is the same enantiomer as Cl[C@](Br)(I)C. Another swap and we get back a SMILES string with the original clockwisdom.

So I started off by trying to think of a clever program to identify how many swaps were required to convert between two orderings of groups. Next I tried to write a few loops that would simply perform all possible swaps of groups to generate all of rearrangments, but that missed a few. In the end, I just wrote the dumbest program I could think of and got the following results. For an original ordering of groups 0123, the following orderings have the same clockwisdom: 1032, 3021, 2013, 3210, 1320, 3102, 0123, 0231, 0312, 2301, 1203, 2130.

And the point of all this? OpenBabel was not generating the correct stereochemistry around tetrahedral carbons in canonical SMILES. Now fixed.

Update (19/03/09): Tim Vandermeersch pointed out to me a neat way of determining the parity of a particular ordering of groups. Simply count the number of pairs in the ordering where one number is larger than another number to its right. For example, for 1032, there are two pairs (10, 32); for 3021, there are 3 pairs (32, 31, 21). Orderings with even numbers of pairs have one parity while orderings with odd number of pairs have the opposite parity.

[1] The OpenSMILES specification on stereochemistry

Image credit: Swamibu

Wednesday, 7 May 2008

ANN: cinfony 0.1 - the toolkit of cheminformatics toolkits

What?
cinfony presents a common Python API to three open source cheminformatics toolkits: OpenBabel, RDKit and the CDK. cinfony makes it easy to carry common cheminformatics tasks, but allows you to access the underlying toolkit objects at any point.

Why?
(1) In the past, cheminformaticians have chosen a particular toolkit and have only used that toolkit. This is because of the impedance mismatch between APIs of different toolkits, and indeed, the language differences between different toolkits (CDK is Java, OpenBabel and RDKit are C++, for example). cinfony makes it easy to start using a new toolkit because the API for reading/writing files, calculating fingerprints, creating 2D depictions, etc. is the same no matter whether you are using OpenBabel, RDKit or the CDK.

(2) Different toolkits have different capabilities. cinfony allows you to access all of these capabilities in the same Python script. For example, you can read a molecule using OpenBabel, draw it using RDKit and calculate descriptors using the CDK.

Where?
cinfony 0.1 is now available for download from googlecode.

Who?
Early adopters. This is pretty much a test release.

Why are there so many dependencies?
Oops, look at the time. Gotta go. Have fun installing.

Image: DryIcons (modified)

Thursday, 24 April 2008

Cheminformatics toolkit face-off - LogP and TPSA

Updated 23/04/08 (see below)

In a recent post, I discussed some thoughts I had about using the Pybel API as a generic API for cheminformatics toolkits. I've started to implement this (more on this at a later date) and have begun writing some tests to compare results from different toolkits. To begin with, I was hoping that RDKit (Jan2008), CDK (1.0.2) and OpenBabel (dev, soon-to-be 2.2.0) could agree on LogP and TPSA values calculated with the same group contribution approach.

The code is below, but first here are the results:
ALogP
=====
****Example1*****
The calculated value from JCICS, 1999, 39, 868 is 1.400000
cinfony.rdk {'MolLogP': 1.4007999999999998,
'pyMolLogP': 1.4007999999999998}
cinfony.pybel {'LogP': 0.80749999999999988}



****Example2*****
The calculated value from JCICS, 1999, 39, 868 is 2.750000
cinfony.rdk {'MolLogP': 2.7486000000000006,
'pyMolLogP': 2.7486000000000006}
cinfony.pybel {'LogP': 1.6415999999999995}



****methane*****
The calculated value from JCICS, 1999, 39, 868 is 0.144100
cinfony.rdk {'MolLogP': 0.6331, 'pyMolLogP': 0.6331}
cinfony.pybel {'LogP': 0.14410000000000001}



TPSA
====
****metoprolol*****
The calculated value from JMC, 2000, 43, 3714 is 50.700000
cinfony.cdk {'tpsa': 50.719999999999999}
cinfony.rdk {'TPSA': 50.719999999999999}
cinfony.pybel {'TPSA': 50.719999999999999}



****nordiazempam*****
The calculated value from JMC, 2000, 43, 3714 is 41.500000
cinfony.cdk {'tpsa': 41.460000000000001}
cinfony.rdk {'TPSA': 41.460000000000001}
cinfony.pybel {'TPSA': 41.460000000000001}
The good news is that they all agree on the TPSA of the two examples. The bad news is that CDK 1.0.2 does not calculate ALogP, and OpenBabel doesn't seem to be doing well for the more complicated examples (which are taken from the JCICS paper). It's possible that OpenBabel is using a different parameterisation than the paper I reference. I'm not sure either if there's any difference between the two ALogP values that RDKit calculates. Hopefully, the people that know will comment below.

Update (23/04/08): (1) I calculated the LogP of methane incorrectly - it should be 0.63610. (2) RDKit has a transposition error in the value for C in methane (only a minor error). (3) OpenBabel calculates the correct answer if hydrogens are added. I'm going to recommend that the descriptor adds the hydrogens itself, to avoid user errors in future. (Done! Thanks to Chris Morley.) (4) The two ALogP descriptors in RDKit are implementations in two different languages, one in C++, the other in Python. They should be identical. (Thanks to Greg Landrum of RDKit for identifying several of these issues)

Here's the code...
from cinfony import cdk, rdk, pybel


descs = [("ALogP", "JCICS, 1999, 39, 868"),
("TPSA", "JMC, 2000, 43, 3714")]
toolkits = [(cdk, {'ALogP': None, 'TPSA': ['tpsa']}),
(rdk, {'ALogP': ['MolLogP', 'pyMolLogP'],
'TPSA': ['TPSA']}),
(pybel, {'ALogP': ['LogP'], 'TPSA': ['TPSA']})]
testcases = {'ALogP':
{"methane": (0.1441, 'C'),
"Example1": (1.40, 'c1(O)ccccc1(OC)'),
"Example2": (2.75, 'c1ccccc1c1ccccn1')},
'TPSA':
{"metoprolol": (50.7,
"c1(OCC(O)CNC(C)C)ccc(CCOC)cc1"),
"nordiazempam": (41.5,
"c1c(Cl)ccc2NC(=O)CN=C(c3ccccc3)c12")}}

for desc, ref in descs:
print desc + "\n" + "="*len(desc)
for k, v in testcases[desc].iteritems():
print "****%s*****" % k
print "The calculated value from %s is %f" % (ref, v[0])
for toolkit in toolkits:
if toolkit[1][desc]:
mol = toolkit[0].readstring("smi", v[1])
print toolkit[0].__name__,
print mol.calcdesc(toolkit[1][desc])
print "\n\n"
Image credit: pfly

Tuesday, 22 April 2008

Cheminformatics toolkit face-off - Molecular weight

Next up in this series of toolkit comparisons is the molecular weight. Here, cinfony is going to support two ideas:
The gold standard is the element data in the Blue Obelisk Data Repository (BODR), which was created for just this purpose.

The following are the results for a hydrogen atom and a carbon atom according to the BODR and each of the toolkits. The first figure in parentheses is the molwt, then the exactmass.
BODR: (1.00794, 1.007823032)
(12.0107, 12)

Pybel: (1.00794, 1.007825032)
(12.0107, 12.0)

RDKit: (1.008)
(12.011)

CDK: (1.0079400539398193, 2.0156500339508057)
(12.010700225830078, 16.031299591064453)

OpenBabel is the only toolkit that gets everything right. RDKit is doing okay, but should consider using BODR data in future. The CDK presents the most intriguing results. I'm not sure whether the Python/Java interface has introduced noise into the molwt values, but they are exactly in agreement with the BODR up to the 7th decimal place or so, after which something goes weird. On the other hand, it's quite clear that the CDK's exactmass is simply not behaving as advertised. It is using Deuterium for the hydrogen and C-16 for the carbon. (I note that MFAnalyser has already been replaced in the CDK development code.)

Image credit: Atomium by bugmonkey (CC BY-NC 2.0)

Friday, 28 March 2008

Pybel as a generic API for cheminformatics libraries - proof of concept using CDK

I'm very interested in interoperability of open source chemistry codes. Following a comment by Egon on a recent post of mine, I started wondering whether the Pybel API could used with other cheminformatics libraries as a backend.

The advantage of this for the user would be (a) to reduce the learning curve - if you know how to use Pybel, you can access any of several different cheminformatics libraries with the same syntax, (b) the same scripts could be used to carry out a particular analysis using different cheminformatics libraries - different libraries may have different fingerprints, descriptors or implementations of particular algorithms (this is of course also useful for cross-checking the results of different programs) and (c) help reduce the divide between different cheminformatics toolkits (interoperability!!).

The rationale behind Pybel (described in the paper) lends itself to this use. Pybel doesn't attempt to wrap all the functionality of OpenBabel, but only the most common tasks in cheminformatics. For advanced options, or additional functionality, you can go behind the scenes and access OpenBabel directly. As a result, I propose that the Pybel API represents a generic API (one of many possible, of course) for accessing any cheminformatics library.

To test this, I have created CDKabel, a proof of concept which shows that the Chemistry Development Kit (CDK) can be accessed using Pybel syntax through Jython. CDKabel does not yet pass all of the Pybel tests, but there's enough to show that the approach has some merit. Compare the following: here's some Python code using Pybel and OpenBabel:
C:\Documents and Settings\oboyle>python25
Python 2.5 (r25:51908, Sep 19 2006, 09:52:17) [MSC v.1310 32
bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more inf
ormation.
>>> from pybel import *
>>> for mol in readfile("sdf", "head.sdf"):
... print "Molecule has molwt of %.2f and %d atoms" %
(mol.molwt, len(mol.atoms))
...
Molecule has molwt of 122.12 and 15 atoms
Molecule has molwt of 332.49 and 28 atoms
>>>
Now here's some Jython code with CDKabel and CDK:
D:\Tools\CDK>set CLASSPATH=cdk-1.0.2.jar

D:\Tools\CDK>..\jython2.2.1\jython
Jython 2.2.1 on java1.6.0_05
Type "copyright", "credits" or "license" for more informa
tion.
>>> from cdkabel import *
>>> for mol in readfile("sdf", "head.sdf"):
... print "Molecule has molwt of %.2f and %d atoms" %
(mol.molwt, len(mol.atoms))
...
Molecule has molwt of 122.04 and 15 atoms
Molecule has molwt of 331.96 and 28 atoms
>>>
Well, at least they agree on the number of atoms :-) (It's my fault - CDK has like, ten different ways of calculating the molecular mass, and I just chose randomly :-) )

I've only spent a few minutes throwing CDKabel together, so it doesn't do much beyond the example shown. However, if interested, you can download it and try it for yourself.

I'd appreciate comments on the idea that there is a core Python API that could be usefully applied to several cheminformatics libraries. Would anyone use CDKabel if it were available?

Wednesday, 26 March 2008

MadMol - Chemistry-aware code

There are definitely better things you can do with the functional group fingerprint (FP4) of OpenBabel, but instead, I've created MadMol. Basically, give MadMol a SMILES string such as CC(=O)CCl and it will tell you all you ever wanted to know about that molecule:
That there's a Primary_carbon. Isn't that a Alkylchloride? Wake up and smell the Ketone, cause that's what it is. Most people wouldn't realise this is a C_ONS_bond. It's a 1,3-Tautomerizable (or your money back). I don't believe it, it's a Rotatable_bond! Wake up and smell the CH-acidic, cause that's what it is.

And here's the code (requires Pybel, and the file SMARTS_InteLigand.txt):

import sys
import random

import pybel

def readsmartsfile(filename="SMARTS_InteLigand.txt"):
patterns = []
inputfile = open(filename, "r")
for line in inputfile:
line = line.strip()
if line and line[0]!="#":
colon = line.find(":")
name = line[:colon]
smarts = line[colon+1:].strip()
patterns.append([pybel.Smarts(smarts), name])
return patterns

phrases = ["I don't believe it, it's a %s!",
"Isn't that a %s?",
"It's a whadyamacallit, a %s.",
"Looks like a %s to me.",
"That there's a %s.",
"Most people wouldn't realise this is a %s.",
"It's a %s (or your money back).",
"Wow, a %s. Last time I saw one of these, I hit the"
"fire alarm and ran.",
"Could be a %s...yes, I'm sure of it.",
"It's a %s if I've ever seen one.",
"Wake up and smell the %s, cause that's what it is.",
"It's a %s. I wish I had one.",
"You've hit the jackpot, you and your %s!",
"A %s. You know, back in the day, we used to have"
"fun with these.",
"It takes me back years, this %s does."]

if __name__=="__main__":
if not len(sys.argv)==2:
sys.exit("You need a SMILES string like CC(=O)CCl")
molecule = pybel.readstring("smi", sys.argv[1])
print "So you want me to tell you about %s, do you?\n" % (
sys.argv[1],)
patterns = readsmartsfile()
print " ".join([random.choice(phrases) % name for
smarts, name in patterns if smarts.findall(molecule)])

Monday, 10 March 2008

Pybel publication in Chemistry Central Journal

Our Pybel paper has just been published in the new Open Access journal, Chemistry Central Journal:
Pybel: a Python wrapper for the OpenBabel cheminformatics toolkit Noel M O'Boyle, Chris Morley and Geoffrey R Hutchison. Chemistry Central Journal 2008, 2:5. [DOI]

Here's the abstract:
Background

Scripting languages such as Python are ideally suited to common programming tasks in cheminformatics such as data analysis and parsing information from files. However, for reasons of efficiency, cheminformatics toolkits such as the OpenBabel toolkit are often implemented in compiled languages such as C++. We describe Pybel, a Python module that provides access to the OpenBabel toolkit.

Results

Pybel wraps the direct toolkit bindings to simplify common tasks such as reading and writing molecular files and calculating fingerprints. Extensive use is made of Python iterators to simplify loops such as that over all the molecules in a file. A Pybel Molecule can be easily interconverted to an OpenBabel OBMol to access those methods or attributes not wrapped by Pybel.

Conclusion

Pybel allows cheminformaticians to rapidly develop Python scripts that manipulate chemical information. It is open source, available cross-platform, and offers the power of the OpenBabel toolkit to Python programmers.

The paper describes the rationale behind Pybel, its implementation and some examples of its use. If you are looking for help on how to install or use Pybel, see the Python page on the OpenBabel website, or the examples page.

And if you don't like Python (!), you can also access OpenBabel from Perl, Ruby, Java and C++. So check it out.

Saturday, 23 February 2008

Calculate circular fingerprints with Pybel

I've just realised that a friend of mine, Florian Nigsch, has started blogging about cheminformatics. In one of his posts, he describes the calculation of circular fingerprints with Python.

In his program, he parses the structure of a MOL2 file himself to work out the connectivity and so on. I wondered whether it would be easier with Pybel. I was hoping that the breadth-first iterator that comes with OpenBabel could be used to simplify things, but in the end I needed to do the breadth-first search myself (here I use a queue).
import os
import sys
from collections import deque # Better than list.pop(0)

import pybel
import openbabel as ob

# Set up a translator to Sybyl atom types
ttab = ob.OBTypeTable()
ttab.SetFromType("INT")
ttab.SetToType("SYB")

def bfs(mol, startatom):
visited = [False] * (mol.OBMol.NumAtoms() + 1)
ans = []
queue = deque([(startatom, 0)])
while queue:
atom, depth = queue.popleft()
idx = atom.GetIdx()
visited[idx] = True
atomtype = ttab.Translate(atom.GetType())
ans.append("%s%s" % (["", "%d-" % depth] [depth>0],
atomtype))
if depth < D:
for atom in ob.OBAtomAtomIter(atom):
if not visited[atom.GetIdx()]:
queue.append((atom, depth+1))
return ";".join(ans)

if __name__ == "__main__":
D = 2 # default max depth
if len(sys.argv)>1:
inputfile = sys.argv[1]
if len(sys.argv)>2:
D = int(sys.argv[2])
else:
sys.exit("Usage: BFS.py inputfile [depth=2]")

for mol in pybel.readfile(inputfile.split(".")[-1],
inputfile):
fp = []
for atom in ob.OBMolAtomIter(mol.OBMol):
fp.append(bfs(mol, atom))
print "\t".join([mol.title] + fp)
The advantage of using a library like Pybel is that this program will work for any of several file formats including SMILES. Also, it's trivial to use other atom types apart from Sybyl atom types if desired. Note that the output is not exactly identical to that of Florian's program. In particular, I don't fold together multiple environments that occur at the same depth.

Update (25/02/08): Improved the code after review by Andrew Dalke (see below for original and comments).

Friday, 12 October 2007

ANN: Frog donates code to OpenBabel for SMILES to 3D conversion


Recently, researchers at the French research institutes INSERM and CNRS developed an online service for converting SMILES string to 3D conformers: "FRee Online druG 3D conformation generator (Frog)". A description of this service was published in T. Bohme Leite, D. Gomes, M.A. Miteva, J. Chomilier, B.O. Villoutreix and P. Tufféry. Nucleic Acids Research, 2007, 35, W568-W572:
Frog is an on-line service aimed at generating 3D conformations for drug-like compounds starting from their 1D or 2D descriptions. Given the atomic constitution of the molecules and connectivity information, Frog can identify the different unambiguous isomers corresponding to each compound, and generate single or multiple low-to-medium energy 3D conformations, using an assembly process that does not presently consider ring flexibility. Tests show that Frog is able to generate bioactive conformations close to those observed in crystallographic complexes.

On behalf of the OpenBabel project, I am pleased to announce that Dr. Bruno Villoutreix (INSERM, University of Paris 5) and Dr. Pierre Tufféry (INSERM, University of Paris 7) have generously donated their code to OpenBabel. This code will be incorporated into OpenBabel under the GPL in the coming months, making fast and accurate SMILES-to-3D conformer generation available to the open source community for the first time.

The absence of an open source 3D conformer generation algorithm has increasingly become a problem in recent years due to the popularity of SMILES strings for the description of molecular information. Fortunately, this problem has now been solved. Thanks again to all those involved in the development and release of this code.

For further information on Frog, please contact the corresponding author of the Frog paper.

Image credit: gottcha78

Friday, 14 September 2007

RDKit: Not just yet another cheminformatics toolkit

I'm sitting here with a bandage covering my head, as RDKit has just blown my mind.

RDKit is an cheminformatics toolkit written in C++ and Python. It was developed in-house in a company called Rational Discovery (hence the RD) since 2001. In Feb 2006, it appeared on SourceForge under a liberal license (BSD, except for the GPL Qt code), an appearance which presumably coincided with the demise of Rational Discovery (the company, not the concept, that is :-) ). And there it stayed, actively developed by two developers, but unknown among the open source chemistry community until...

A month ago, I happened to be glancing through the SourceForge software map of chemistry software and I was intrigued by the description of RDKit as "A collection of cheminformatics and machine-learning software developed at Rational Discovery". The website was pretty minimal and there didn't even appear to be any documentation. I dashed off an email to Greg Landrum, the main developer (who it turns out is also the developer of YAeHMOP (Yet Another extended Huckel Molecular Orbital Package) ), and asked him what the story was. Two days ago, he returned from holidays and pointed me to the correct website and the documentation, and I couldn't believe what I was seeing...

Some features that I think are cool:
(1) Molecules based on the Boost Graph Library
(2) All the Python stuff works for me on Windows!
(3) 2D depiction!!!
(4) 2D depiction that mimics 3D conformations!!!
(5) 2D --> 3D conversion in a similar method to Rajarshi's smi23D! (doesn't use stochastic promixity embedding though)

Here's a summary of some of the rest: SMILES, substructure searching, sophisticated fingerprints, machine learning stuff, a GUI, clustering, MACCS keys, descriptors (84 or so), chemical reaction transformations, implementation of Recap (not sure what this is, but there's a ref in the docs), basic pharmacophore stuff, and two types of SSSR (there's some text about this in the docs).

Cool! There's obviously a lot of overlap between OpenBabel and RDKit, and hopefully we can use this to both projects' advantage in terms of testing against each other and developing interfaces. In the meanwhile, here's to diversity, and to discovering that someone else has implemented 2D depiction in C++ so I Don't Have To.

For more info, see www.rdkit.org, and in particular, the Python interface documentation which gives a good overview.

Image credit: Toolkit by Neil T (CC BY-SA 2.0)

Monday, 3 September 2007

Python code for Huckel theory calculations

Over at Chemical Quantum Images, Felix has posted Python code that will read a file containing a molecular structure and calculate the molecular orbitals using Huckel theory. The code uses the OpenBabel Python module to read the structure, and NumPy, the Python extension for numerical computing, to do the heavy lifting (to diagonalise a matrix).

Now, if I only understood the theory...:-)

Friday, 10 August 2007

Access embedded molecular information in images

Recently Rich Apodaca has been discussing (here, here and here) embedding molecular information in images of molecules, such as a PNG file depicting a 2D structure.

I'm going to show how to extract this type of embedded metadata using Python.

First of all, you'll need an image to work with. Grab the PNG file, rosiglitazone.png, from Rich's post.

Next, you'll need the Python Imaging Library (PIL), a 3rd-party Python extension library available from Pythonware.

Here's the text of an interactive Python session showing how to access the image metadata:

C:\>python
ActivePython 2.4.1 Build 247 (ActiveState Corp.) based on
Python 2.4.1 (#65, Jun 20 2005, 17:01:55) [MSC v.1310 32 bit
(Intel)] on win32
Type "help", "copyright", "credits" or "license" for more inf
ormation.
>>> import Image
>>> myimage = Image.open("rosiglitazone.png")
>>> dir(myimage)
['_Image__transformer', '_PngImageFile__idat', '__doc__', '__
init__', '__module__', '_copy', '_dump', '_expand', '_makesel
...
im', 'getpalette', 'getpixel', 'getprojection', 'histogram',
'im', 'info', 'load', 'load_end', 'load_prepare', 'load_read'
, 'mode', 'offset', 'palette', 'paste', 'png', 'point', 'puta
lpha', 'putdata', 'putpalette', 'putpixel', 'quantize', 'read
...
transform', 'transpose', 'verify']
>>> myimage.info
{'molfile': 'name\nparams\ncomments\n 25 27 0 0 0 0 0 0
0 0 0 V2000\n 1.6910 -6.1636 0.0000 C 0 0 0
0 0 0 0 0 0 0 0 0\n 2.5571 -6.6636 0.0000 C
0 0 0 0 0 0 0 0 0 0 0 0\n 3.4231 -6.1636
0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0\n 3.4231 -
5.1636 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0\n
2.5571 -4.6636 0.0000 C 0 0 0 0 0 0 0 0 0 0
n 7 8 1 0 0 0 0\n 8 9 1 0 0 0 0\n 7 10 1 0
...
...
0 0 0\n 9 11 1 0 0 0 0\n 11 12 1 0 0 0 0\n 12 13
2 0 0 0 0\n 13 14 1 0 0 0 0\n 14 15 2 0 0 0 0
\n 15 16 1 0 0 0 0\n 16 17 2 0 0 0 0\n 17 12 1 0
0 0 0\n 15 18 1 0 0 0 0\n 18 19 1 0 0 0 0\n 19 2
0 1 0 0 0 0\n 20 21 1 0 0 0 0\n 21 22 1 0 0 0
0\n 22 23 1 0 0 0 0\n 23 19 1 0 0 0 0\n 22 24 2 0
0 0 0\n 20 25 2 0 0 0 0\nM END', 'aspect': (1, 1)}
>>> moldata = myimage.info['molfile']


Cool. So now we could write this information to a file (print >> open("myoutputfile.mol"), moldata), or convert it into an OpenBabel molecule and calculate some properties:


>>> import pybel
>>> mymol = pybel.readstring("MOL", moldata)
>>> print mymol.molwt
357.42676

Thursday, 12 July 2007

Pybel - Just how unique are your molecules? Part II

I recently posted on using SMILES, Fingerprints and InChIs to identify unique molecules in a dataset. Geoff pointed out that I should have been using canonical SMILES ("can" in Open Babel), instead of non-canonical SMILES ("smi" in Open Babel). Why?

Well, the same molecule will always have the same canonical SMILES. Whereas depending on the order of the atoms in the input file, the same molecule might have a different non-canonical SMILES. Since I was interested in identifying unique molecules, I should have been using canonical SMILES.

Method

The same as before, just with "can" instead of "smi". The code used is only slightly different from that in the previous post.

Results

Here is the output of the script (see code below):

Starting...

The number of molecules is 12712

Are there any molecules with the same FP, SMI and InChI?

There are 2 molecules with 2 duplicates
There is 1 molecule with 162 duplicates
There is 1 molecule with 661 duplicates
There is 1 molecule with 1098 duplicates


The number of (unique) molecules is 10792

Are there any remaining molecules with the same canonical
SMILES?

There are 815 molecules with 2 duplicates
There are 61 molecules with 3 duplicates
There are 36 molecules with 4 duplicates
There are 10 molecules with 5 duplicates
There are 9 molecules with 6 duplicates
There are 9 molecules with 7 duplicates
There is 1 molecule with 8 duplicates
There are 3 molecules with 10 duplicates
There are 3 molecules with 11 duplicates
There are 2 molecules with 13 duplicates
There is 1 molecule with 14 duplicates
There is 1 molecule with 24 duplicates
There is 1 molecule with 34 duplicates

Are there any remaining molecules with the same fingerprint?

None found

Are there any remaining molecules with the same InChI?

There are 2 molecules with 2 duplicates


...Finished!
Discussion

If you compare the results here with those in my previous post, you will find that that there are two instances where two molecules have the same FP, InChI and non-canonical SMILES, but they have different canonical SMILES! By eye the two molecules are identical. This appears to be a bug in the canonicalisation routine and I've reported it to OpenBabel. The details are:

non-canonical SMILES is CC1OC(C[NH](C1)CC(C(F)(F)F)O)C but canonical SMILES is...
  • ZINC03883383: C[C@H]1O[C@@H](C)C[NH](C1)C[C@H](O)C(F)(F)F
  • ZINC03883386: C[C@@H]1O[C@H](C)C[NH](C1)C[C@H](O)C(F)(F)F

non-canonical SMILES is C1C(CC(C(C1O)O)O)(C(=O)O)O but canonical SMILES is...
  • ZINC03870192: O[C@@H]1CC(O)(C[C@H](O)C1O)C(=O)O
  • ZINC03870194: O[C@H]1CC(O)(C[C@@H](O)C1O)C(=O)O

Although my original aim was simply to give an example of how to use Pybel, this trivial analysis of ZINC has identified areas for improvement in both ZINC and OpenBabel. Without access to such a large dataset as ZINC, errors such as those identified here would go unnoticed. If anyone has any more ideas for tests, let me know...

Tuesday, 10 July 2007

Pybel - Hack that SD file

At some point, most cheminformaticians need to hack at SD files. Maybe they need to extract the values of particular data fields, or add in some new ones. Unfortunately, our friends 'awk', 'grep' and 'Excel' (!) are not very useful for this so we need to use a scripting language.

The examples shown here use the Python module Pybel, included with the Open Babel distribution. Although it's a Python module, all of the hard work is done by the underlying C++ library.

Convert an SD file to a spreadsheet

Given an SD file with calculated descriptor values in the fields, I want to write out a Tab-separated file containing a SMILES string in the first column, the title in the second column, and then the descriptor values in the remaining columns.

import pybel

inputfile = pybel.readfile("sdf", "ace_ligands.sdf")
# We need to read the first molecule to find
# the descriptor names

mol = inputfile.next()
print "\t".join(["SMILES", "Title"] + mol.data.keys())

# Print out information on the first molecule
print "\t".join([mol.write("smi").split()[0], mol.title]
+ mol.data.values())
# Print out information on the remaining molecules
for mol in inputfile:
print "\t".join([mol.write("smi").split()[0], mol.title]
+ mol.data.values())
Notes:
  • The data fields in an SD file are in the 'data' attribute of a molecule, which behaves like a dictionary with keys() and values().
  • The "\t".join(mylist) concatenates a list of strings around Tab characters.
  • If you convert a molecule to a SMILES string, the result is a string containing the SMILES string, a Tab, the molecule title, and a carraige return. To simply access the SMILES string, you need to split() it and access the first element.

Add Rule-of-5 descriptor values to an SD file

Although Lipinksi's Rule-of-5 (or fives) may be both a Sacred Cow and an Evil Metric (according to the Great Molecular Crapshoot), we still love it. Here's how to add RO5 descriptor values to an SD file (requires Open Babel 2.1.1):

import pybel

HBD = pybel.Smarts("[!#6;!H0]")
HBA = pybel.Smarts("[$([$([#8,#16]);!$(*=N~O);" +
"!$(*~N=O);X1,X2]),$([#7;v3;" +
"!$([nH]);!$(*(-a)-a)])]")

def moredesc(mol):
ans = {}
ans['RotBonds'] = mol.OBMol.NumRotors()
ans['HBD'] = len(HBD.findall(mol))
ans['HBA'] = len(HBA.findall(mol))
ans['molwt'] = mol.molwt
return ans

output = pybel.Outputfile("sdf", "LipinskiRulesOK.sdf")

for mol in pybel.readfile("sdf", "NoRules.sdf"):
mol.data.update(mol.calcdesc())
mol.data.update(moredesc(mol))
output.write(mol)

output.close()
Notes:
  • The 'calcdesc' method of a Molecule returns a dictionary containing descriptor values for LogP, Molecular Refractivity (MR), and Polar Surface Area (PSA).
  • The 'update' method of a dictionary adds the contents of one dictionary to another.

Sunday, 8 July 2007

ANN: Open Babel 2.1.1 Release


(To paraphrase Geoff...)

I am pleased to announce the release of Open Babel 2.1.1, the latest stable version of the open source chemistry toolbox.

This release represents a stable bug-fix release and is highly recommended for all users, even those who have not had problems with 2.1.0. It includes fixes for several important crashes, and many, many bug fixes and improvements.

What's new? See the full release notes.

To download, see our install page.

For more information, see the project website.

Have fun!

Thursday, 5 July 2007

Pybel - Just how unique are your molecules?

Cheminformaticians often want to remove or identify duplicates in a dataset of molecules. There are a couple of popular ways to do this: ensure that each molecule has a unique SMILES string, has a unique Daylight-type fingerprint, or has a unique InChI.

But what's the difference? Is there any? Does a unique fingerprint imply a unique InChI, and vice versa?

Method

Enough with the questions. Let's just get some numbers. Our test set is the first slice of the drug-like subset of the ZINC database. Specifically, it's the file 3_p0.0.mol2.gz. According to ZINC, this should contain "only a single form of each molecule". We're going to use Pybel, the Python module included with Open Babel, to calculate the three common measures of uniqueness and check for duplicates.

Results

Here is the output of the script (see code below):

Starting...

The total number of molecules is 12712

Are there any molecules with the same FP, SMI and InChI?
There are 4 molecules with 2 duplicates
There is 1 molecule with 162 duplicates
There is 1 molecule with 661 duplicates
There is 1 molecule with 1098 duplicates

The total number of (unique) molecules is 10790

Are there any remaining molecules which have the same FP \
and SMI?

There are 846 molecules with 2 duplicates
There are 55 molecules with 3 duplicates
There are 23 molecules with 4 duplicates
There are 3 molecules with 5 duplicates
There is 1 molecule with 7 duplicates

Are there any remaining molecules which have the same InChI?

...Finished!
Discussion

So it seems that uniqueness is not quite as easy to measure as you might think. Even once you remove all molecules that share the same fingerprint, SMILES string, and InChI, there are still many that share the same fingerprint and SMILES string, although the InChIs are unique.

The results shown here either expose errors in OpenBabel, or possible errors in the creation of the ZINC database. In either case, this is an example of two Open chemistry resources being used to test and validate each other. Hopefully these results will aid in the development of even better software or data.

Code

import os
import glob
import pybel
import cPickle

def duplicates(keys, data):
"""Find the duplicates in 'data' in terms of particular
keys.

'keys' is a list containing some selection of the
indices 1, 2 and 3.
"""
ans = {}
for d in data:
ans.setdefault(tuple([d[key] for key in keys]),[]) \
.append(d[TITLE])
result = {}
for k, v in ans.iteritems():
if len(v) > 1: # len(v) is the number of duplicates
result[len(v)] = result.get(len(v), 0) + 1
for k in sorted(result.keys()):
if result[k]>1:
print "There are %d molecules with %d" \
" duplicates" % (result[k], k)
else:
print "There is %d molecule with %d" \
" duplicates" % (result[k], k)
return ans

def readfrommol2(filename):
data = []
for mol in pybel.readfile("mol2", filename):
data.append(( mol.title,
"%s" % mol.calcfp('FP2').bits,
mol.write("smi").split()[0],
mol.write("inchi").rstrip() ))
outputfile = open("dups.pickle", "w")
cPickle.dump(data, outputfile)
outputfile.close()
return data

def readfrompickle():
inputfile = open("dups.pickle", "r")
data = cPickle.load(inputfile)
inputfile.close()
return data

if __name__ == "__main__":
print "Starting...\n"

if not os.path.isfile("dups.pickle"):
data = readfrommol2("3_p0.0.mol2")
else:
data = readfrompickle()

TITLE, FP, SMI, INC = range(4)

# How many molecules are there?
print "The number of molecules is %d\n" \
% len(data)

# Find molecules with the same FP, SMI and INC
print "Are there any molecules with the same" \
" FP, SMI and InChI?"
answer = duplicates( [SMI, FP, INC], data)

# Select only unique molecules
uniquedata = [(v[0], k[0], k[1], k[2]) for k, v \
in answer.iteritems()]
print "\nThe number of (unique) molecules is %d\n" \
% len(uniquedata)

# Find molecules with the same FP and SMI but not INC
print "Are there any remaining molecules which have" \
" the same FP and SMI?\n"
duplicates( [SMI, FP], uniquedata )

# Find any remaining molecules with the same InChI
print "\nAre there any remaining molecules which have" \
" the same InChI?\n"
duplicates( [INC], uniquedata ) # Nothing is found

print "...Finished!"

Friday, 29 June 2007

Parsing SMILES with a frown?

In a recent post, Andrew Dalke compares SMILES parsing in Frowns (a now unsupported Open Source Python cheminformatics library) and the CDK (an actively developed Open Source Java cheminformatics library). Andrew is somewhat of a parsing expert - indeed, BioPython is built upon a parser called Martel which Andrew developed. The Frowns SMILES parser, contributed by Andrew, shows what an expert can do. It would be interesting to know whether similiar code could be incorporated into the CDK and OpenBabel, and if so, what is the speed tradeoff involved?

Andrew also discusses the compression of SMILES strings, which James Melville and Johnathan Hirst have also recently been studying. Rajarshi has also reviewed this work.

Saturday, 2 June 2007

Open Babel Python module for Windows (1.2) released

Announcing the release of the OpenBabel module for Python, version 1.2, for Windows.

OpenBabel is a chemical toolbox designed to speak the many languages of chemical data. It's an open, collaborative project allowing anyone to search, convert, analyze, or store data from molecular modeling, chemistry, solid-state materials, biochemistry, or related areas. This Python module allows you to access this popular C++ library in your Python scripts.

New features:
(1) Data fields in file formats like MOL2 and SDF can now be accessed and edited
(2) Unit cell data in crystallographic file formats such as CIF can now be accessed
(3) A list of all detected input and output file formats can be accessed

Core features:
(1) Read and write over 80 molecular file formats
(2) Access molecular properties like molecular weight, formula, charge
(3) Daylight-type fingerprints and calculation of the Tanimoto coefficient
(4) SMARTS pattern matching
(5) Graph algorithms such as the Smallest Set of Smallest Rings (SSSR) and Depth-First Iteration over atoms

Documentation:
(1) The Python module
(2) The C++ toolkit
(3) The OpenBabel web site

Support:
(1) If you have any questions, send an email to openbabel-scripting@lists.sourceforge.net
(2) Report a bug

P.S. The Python module is also available for Linux and MacOSX. It is also possible to use OpenBabel from C++, Perl, Ruby and Java, although getting these to work on Windows may be more difficult.