Workflow packages such as Pipeline Pilot, Taverna and KNIME allow the user to graphically create a pipeline to process molecular data. A downside of these packages is that the units of the workflow, the nodes, process data sequentially; that is, no data gets to Node 2 until Node 1 has finished processing all of it.
Correction (thanks Egon): the previous sentence is plain incorrect. Both KNIME and Taverna 2, at least, pass on partially processed data as soon as it's available.
Wouldn't it be nicer if they worked more like Unix pipes? That is, as soon as some data comes out of Node 1, it gets passed on to the next node, and so on. This would have three advantages: (1) you get the first result sooner, (2) you don't use up loads of memory storing all of the intermediate results, and (3) you can run things in parallel; for example, Node 2 could start processing the data from Node 1 immediately, perhaps even on a different computer.
Luckily, there is a neat feature in Python called a generator that allows you to create a pipeline that processes data lazily, one item at a time, just like a Unix pipe. Generators are functions that return a sequence of values. However, unlike a function that simply returns a list, a generator only calculates and returns the next item in the sequence when it is requested. One reason this is useful is that the sequence of items could be very large, or even infinite in length. (For a more serious introduction, see
David Beazley's talk at PyCon'08, which is the inspiration for this blog post.)
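As a minimal illustration (my own example, not from the talk), here's a generator that yields an endless stream of even numbers; no value is computed until one is requested:
def evens():
    # Yield the even numbers 0, 2, 4, ... indefinitely
    n = 0
    while True:
        yield n
        n += 2

gen = evens()     # nothing has been computed yet
print(next(gen))  # 0 -- computed on demand
print(next(gen))  # 2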
Let's create a three-node pipeline for processing an SDF file: (1) a filter node that looks for the word "ZINC00" in the title of the molecule, (2) a filter node for Tanimoto similarity to a target molecule, and (3) an output node that returns the molecule title. (The full program is presented at the end of this post.)
# Pipeline Python!
pipeline = createpipeline((titlematches, "ZINC00"),
                          (similarto, targetmol, 0.50),
                          (moltotitle,))
# Create an input source
dataset = pybel.readfile("sdf", inputfile)
# Feed the pipeline
results = pipeline(dataset)
The variable 'results' is a generator, so nothing actually happens until we request the values it yields; a single next(results), for example, would pull just one title through the entire pipeline...
# Print out each answer as it comes
for title in results:
    print(title)
The titles of the matching molecules will appear on the screen one by one as they are found, just like in a Unix pipe. Note how easy it is to combine nodes into a pipeline.
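Under the hood, createpipeline simply chains the generator functions together, so the pipeline above is equivalent to nesting the calls by hand:
# Equivalent to the pipeline above: each node wraps the one before it
results = moltotitle(similarto(titlematches(dataset, "ZINC00"),
                               targetmol, 0.50))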
Here's the full program:
import re
import os

import pybel  # or: from cinfony import pybel
def createpipeline(*filters):
    # Chain the filter nodes together; each filter is a tuple of
    # (generator function, extra arguments)
    def pipeline(dataset):
        piped_data = dataset
        for filt in filters:
            piped_data = filt[0](piped_data, *filt[1:])
        return piped_data
    return pipeline
def titlematches(mols, patt):
    # Pass through only those molecules whose title matches the pattern
    p = re.compile(patt)
    return (mol for mol in mols if p.search(mol.title))
def similarto(mols, target, cutoff=0.7):
    # Pass through only those molecules with a Tanimoto similarity
    # to the target of at least the cutoff
    target_fp = target.calcfp()
    return (mol for mol in mols if (mol.calcfp() | target_fp) >= cutoff)
def moltotitle(mols):
    # Convert a stream of molecules into a stream of their titles
    return (mol.title for mol in mols)
if __name__ == "__main__":
    inputfile = os.path.join("..", "face-off", "timing", "3_p0.0.sdf")

    # Find the target molecule for the similarity filter
    dataset = pybel.readfile("sdf", inputfile)
    findtargetmol = createpipeline((titlematches, "ZINC00002647"),)
    targetmol = next(findtargetmol(dataset))

    # Pipeline Python!
    pipeline = createpipeline((titlematches, "ZINC00"),
                              (similarto, targetmol, 0.50),
                              (moltotitle,))

    # Create an input source
    dataset = pybel.readfile("sdf", inputfile)

    # Feed the pipeline
    results = pipeline(dataset)

    # Print out each answer as it comes through the pipeline
    for title in results:
        print(title)
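Adding another node to the pipeline is just a matter of writing one more generator function. As a sketch (not part of the program above, and assuming pybel's Molecule.molwt attribute), here's a molecular-weight filter that could be dropped in anywhere in the chain:
def lighterthan(mols, maxwt):
    # Pass through only those molecules below the given molecular weight
    # (hypothetical node; relies on pybel's mol.molwt)
    return (mol for mol in mols if mol.molwt < maxwt)

pipeline = createpipeline((titlematches, "ZINC00"),
                          (lighterthan, 300.0),
                          (similarto, targetmol, 0.50),
                          (moltotitle,))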
So, if in future someone tells you that Python generators can be used to make a workflow, don't say "I never node that".
Image: Pipeline by Travis S. (CC BY-NC 2.0)