Friday 27 July 2007

Quick thoughts: Poignant Ruby, Bryan Vickery, Python in Browsers, Another ChemRank

If I ever learned a programming language just to read the book, the language would be Ruby and the 'book', "Why's (Poignant) Guide to Ruby" (check out the introduction :-) ). (Sorry Rich, was considering Rails, but am going with Django.)

Bryan Vickery has been interviewed by David Bradley over at Reactive Reports. Someone buy this man a beer.

Python and Ruby may soon be as ubiquitous as JavaScript in browsers. Guess what I'll be using.

Hmmm...ChemRank was a neat idea, but it looks like they've already done it in Biology. No voting down, though!

Tuesday 17 July 2007

ccWeb - A lightweight web service for parsing computational chemistry


cclib is a Python library for enabling package-independent computational chemistry algorithms. At its core is a parser for comp chem output files. The latest release, cclib 0.7, can extract information which can be used, for example, to implement charge decomposition analysis, or population analysis, or to convolute the Raman spectrum.

Let's imagine that you cannot or don't want to install cclib, but you would like to be able to access the parsed data from a comp chem output file (e.g. using a different programming language, or in a workflow environment like Taverna). The solution is ccWeb, a web service that you can send your comp chem file to, and it will send you back the result.

Specifically, you need to POST to http://www.redbrick.dcu.ie/~noel/cclib/ccWeb.py with the contents of your logfile in the parameter 'logfile'. In Python, this is done as follows:

import urllib

myurl = "http://www.redbrick.dcu.ie/~noel/cclib/ccWeb.py"
logfiletext = open("mylogfile.log", "r").read()
params = urllib.urlencode({'logfile': logfiletext})
webconnection = urllib.urlopen(myurl, params)
result = webconnection.read()
The results are returned in JSON format as a dictionary of parsed attributes, which can be directly evaluated as an expression in Python and Javascript, or more securely, parsed using a third-party library. Here's how to process the result in Python. If the third-party library, simplejson, is installed, it will be used to safely parse the results:

try:
import simplejson
except:
simplejson = None

if simplejson:
data = simplejson.loads(result)
else:
result = result.strip()
assert result[0] == '{' and result[-1] == '}', \
"Don't trust this JSON!"
data = eval(result)

The actual code for the webservice is as follows:

#!/usr/bin/env python

import cgitb; cgitb.enable() # For traceback information
import os
import cgi
import tempfile
import logging

import simplejson

from cclib.parser import ccopen

fields = cgi.FieldStorage()


if fields.has_key("logfile"):

handle, filename = tempfile.mkstemp()
tmpfile = open(filename, "w")
print >> tmpfile, fields.getfirst("logfile")
tmpfile.close()

failed = False
try:
a = ccopen(filename)
a.logger.setLevel(logging.ERROR)
data = a.parse()
except:
failed = True
os.close(handle)
os.remove(filename)

if not failed:
print "Content-Type: application/json; charset=utf-8"
print "Status: 200 Ok"
print
print data.writejson()
else:
print "Content-Type: application/json; charset=utf-8"
print "Status: 500 Internal Server Error"
print
print simplejson.dumps({'error': 'cclib was not able \
to parse this logfile without errors. Please \
send an email to cclib-users@lists.sf.net.'})
else:
print "Content-Type: application/json; charset=utf-8"
print "Status: 400 Bad Request"
print
print simplejson.dumps({'error': 'You need to set the \
value of the logfile parameter to the contents of \
a logfile'})
Note: ccWeb is based on the development version of the forthcoming cclib 0.8.

Thursday 12 July 2007

Pybel - Just how unique are your molecules? Part II

I recently posted on using SMILES, Fingerprints and InChIs to identify unique molecules in a dataset. Geoff pointed out that I should have been using canonical SMILES ("can" in Open Babel), instead of non-canonical SMILES ("smi" in Open Babel). Why?

Well, the same molecule will always have the same canonical SMILES. Whereas depending on the order of the atoms in the input file, the same molecule might have a different non-canonical SMILES. Since I was interested in identifying unique molecules, I should have been using canonical SMILES.

Method

The same as before, just with "can" instead of "smi". The code used is only slightly different from that in the previous post.

Results

Here is the output of the script (see code below):

Starting...

The number of molecules is 12712

Are there any molecules with the same FP, SMI and InChI?

There are 2 molecules with 2 duplicates
There is 1 molecule with 162 duplicates
There is 1 molecule with 661 duplicates
There is 1 molecule with 1098 duplicates


The number of (unique) molecules is 10792

Are there any remaining molecules with the same canonical
SMILES?

There are 815 molecules with 2 duplicates
There are 61 molecules with 3 duplicates
There are 36 molecules with 4 duplicates
There are 10 molecules with 5 duplicates
There are 9 molecules with 6 duplicates
There are 9 molecules with 7 duplicates
There is 1 molecule with 8 duplicates
There are 3 molecules with 10 duplicates
There are 3 molecules with 11 duplicates
There are 2 molecules with 13 duplicates
There is 1 molecule with 14 duplicates
There is 1 molecule with 24 duplicates
There is 1 molecule with 34 duplicates

Are there any remaining molecules with the same fingerprint?

None found

Are there any remaining molecules with the same InChI?

There are 2 molecules with 2 duplicates


...Finished!
Discussion

If you compare the results here with those in my previous post, you will find that that there are two instances where two molecules have the same FP, InChI and non-canonical SMILES, but they have different canonical SMILES! By eye the two molecules are identical. This appears to be a bug in the canonicalisation routine and I've reported it to OpenBabel. The details are:

non-canonical SMILES is CC1OC(C[NH](C1)CC(C(F)(F)F)O)C but canonical SMILES is...
  • ZINC03883383: C[C@H]1O[C@@H](C)C[NH](C1)C[C@H](O)C(F)(F)F
  • ZINC03883386: C[C@@H]1O[C@H](C)C[NH](C1)C[C@H](O)C(F)(F)F

non-canonical SMILES is C1C(CC(C(C1O)O)O)(C(=O)O)O but canonical SMILES is...
  • ZINC03870192: O[C@@H]1CC(O)(C[C@H](O)C1O)C(=O)O
  • ZINC03870194: O[C@H]1CC(O)(C[C@@H](O)C1O)C(=O)O

Although my original aim was simply to give an example of how to use Pybel, this trivial analysis of ZINC has identified areas for improvement in both ZINC and OpenBabel. Without access to such a large dataset as ZINC, errors such as those identified here would go unnoticed. If anyone has any more ideas for tests, let me know...

Tuesday 10 July 2007

Pybel - Hack that SD file

At some point, most cheminformaticians need to hack at SD files. Maybe they need to extract the values of particular data fields, or add in some new ones. Unfortunately, our friends 'awk', 'grep' and 'Excel' (!) are not very useful for this so we need to use a scripting language.

The examples shown here use the Python module Pybel, included with the Open Babel distribution. Although it's a Python module, all of the hard work is done by the underlying C++ library.

Convert an SD file to a spreadsheet

Given an SD file with calculated descriptor values in the fields, I want to write out a Tab-separated file containing a SMILES string in the first column, the title in the second column, and then the descriptor values in the remaining columns.

import pybel

inputfile = pybel.readfile("sdf", "ace_ligands.sdf")
# We need to read the first molecule to find
# the descriptor names

mol = inputfile.next()
print "\t".join(["SMILES", "Title"] + mol.data.keys())

# Print out information on the first molecule
print "\t".join([mol.write("smi").split()[0], mol.title]
+ mol.data.values())
# Print out information on the remaining molecules
for mol in inputfile:
print "\t".join([mol.write("smi").split()[0], mol.title]
+ mol.data.values())
Notes:
  • The data fields in an SD file are in the 'data' attribute of a molecule, which behaves like a dictionary with keys() and values().
  • The "\t".join(mylist) concatenates a list of strings around Tab characters.
  • If you convert a molecule to a SMILES string, the result is a string containing the SMILES string, a Tab, the molecule title, and a carraige return. To simply access the SMILES string, you need to split() it and access the first element.

Add Rule-of-5 descriptor values to an SD file

Although Lipinksi's Rule-of-5 (or fives) may be both a Sacred Cow and an Evil Metric (according to the Great Molecular Crapshoot), we still love it. Here's how to add RO5 descriptor values to an SD file (requires Open Babel 2.1.1):

import pybel

HBD = pybel.Smarts("[!#6;!H0]")
HBA = pybel.Smarts("[$([$([#8,#16]);!$(*=N~O);" +
"!$(*~N=O);X1,X2]),$([#7;v3;" +
"!$([nH]);!$(*(-a)-a)])]")

def moredesc(mol):
ans = {}
ans['RotBonds'] = mol.OBMol.NumRotors()
ans['HBD'] = len(HBD.findall(mol))
ans['HBA'] = len(HBA.findall(mol))
ans['molwt'] = mol.molwt
return ans

output = pybel.Outputfile("sdf", "LipinskiRulesOK.sdf")

for mol in pybel.readfile("sdf", "NoRules.sdf"):
mol.data.update(mol.calcdesc())
mol.data.update(moredesc(mol))
output.write(mol)

output.close()
Notes:
  • The 'calcdesc' method of a Molecule returns a dictionary containing descriptor values for LogP, Molecular Refractivity (MR), and Polar Surface Area (PSA).
  • The 'update' method of a dictionary adds the contents of one dictionary to another.

Sunday 8 July 2007

ANN: Open Babel 2.1.1 Release


(To paraphrase Geoff...)

I am pleased to announce the release of Open Babel 2.1.1, the latest stable version of the open source chemistry toolbox.

This release represents a stable bug-fix release and is highly recommended for all users, even those who have not had problems with 2.1.0. It includes fixes for several important crashes, and many, many bug fixes and improvements.

What's new? See the full release notes.

To download, see our install page.

For more information, see the project website.

Have fun!

Thursday 5 July 2007

Pybel - Just how unique are your molecules?

Cheminformaticians often want to remove or identify duplicates in a dataset of molecules. There are a couple of popular ways to do this: ensure that each molecule has a unique SMILES string, has a unique Daylight-type fingerprint, or has a unique InChI.

But what's the difference? Is there any? Does a unique fingerprint imply a unique InChI, and vice versa?

Method

Enough with the questions. Let's just get some numbers. Our test set is the first slice of the drug-like subset of the ZINC database. Specifically, it's the file 3_p0.0.mol2.gz. According to ZINC, this should contain "only a single form of each molecule". We're going to use Pybel, the Python module included with Open Babel, to calculate the three common measures of uniqueness and check for duplicates.

Results

Here is the output of the script (see code below):

Starting...

The total number of molecules is 12712

Are there any molecules with the same FP, SMI and InChI?
There are 4 molecules with 2 duplicates
There is 1 molecule with 162 duplicates
There is 1 molecule with 661 duplicates
There is 1 molecule with 1098 duplicates

The total number of (unique) molecules is 10790

Are there any remaining molecules which have the same FP \
and SMI?

There are 846 molecules with 2 duplicates
There are 55 molecules with 3 duplicates
There are 23 molecules with 4 duplicates
There are 3 molecules with 5 duplicates
There is 1 molecule with 7 duplicates

Are there any remaining molecules which have the same InChI?

...Finished!
Discussion

So it seems that uniqueness is not quite as easy to measure as you might think. Even once you remove all molecules that share the same fingerprint, SMILES string, and InChI, there are still many that share the same fingerprint and SMILES string, although the InChIs are unique.

The results shown here either expose errors in OpenBabel, or possible errors in the creation of the ZINC database. In either case, this is an example of two Open chemistry resources being used to test and validate each other. Hopefully these results will aid in the development of even better software or data.

Code

import os
import glob
import pybel
import cPickle

def duplicates(keys, data):
"""Find the duplicates in 'data' in terms of particular
keys.

'keys' is a list containing some selection of the
indices 1, 2 and 3.
"""
ans = {}
for d in data:
ans.setdefault(tuple([d[key] for key in keys]),[]) \
.append(d[TITLE])
result = {}
for k, v in ans.iteritems():
if len(v) > 1: # len(v) is the number of duplicates
result[len(v)] = result.get(len(v), 0) + 1
for k in sorted(result.keys()):
if result[k]>1:
print "There are %d molecules with %d" \
" duplicates" % (result[k], k)
else:
print "There is %d molecule with %d" \
" duplicates" % (result[k], k)
return ans

def readfrommol2(filename):
data = []
for mol in pybel.readfile("mol2", filename):
data.append(( mol.title,
"%s" % mol.calcfp('FP2').bits,
mol.write("smi").split()[0],
mol.write("inchi").rstrip() ))
outputfile = open("dups.pickle", "w")
cPickle.dump(data, outputfile)
outputfile.close()
return data

def readfrompickle():
inputfile = open("dups.pickle", "r")
data = cPickle.load(inputfile)
inputfile.close()
return data

if __name__ == "__main__":
print "Starting...\n"

if not os.path.isfile("dups.pickle"):
data = readfrommol2("3_p0.0.mol2")
else:
data = readfrompickle()

TITLE, FP, SMI, INC = range(4)

# How many molecules are there?
print "The number of molecules is %d\n" \
% len(data)

# Find molecules with the same FP, SMI and INC
print "Are there any molecules with the same" \
" FP, SMI and InChI?"
answer = duplicates( [SMI, FP, INC], data)

# Select only unique molecules
uniquedata = [(v[0], k[0], k[1], k[2]) for k, v \
in answer.iteritems()]
print "\nThe number of (unique) molecules is %d\n" \
% len(uniquedata)

# Find molecules with the same FP and SMI but not INC
print "Are there any remaining molecules which have" \
" the same FP and SMI?\n"
duplicates( [SMI, FP], uniquedata )

# Find any remaining molecules with the same InChI
print "\nAre there any remaining molecules which have" \
" the same InChI?\n"
duplicates( [INC], uniquedata ) # Nothing is found

print "...Finished!"