Noel O'Blog

Saturday, 11 August 2007

Move over FORTRAN: Python makes inroads into computational chemistry

Old-school dyed-in-the-wool computational chemists like to use FORTRAN and C a lot, argue about which is better, and create programs whose names are written all in CAPITALS and will only compile on an SGI-IRIX. Although you gotta love them for trying, it's good to see that times are a-changing.

Looking at "Early View" for Wiley's Journal of Computational Chemistry, you will see two "Software News and Updates": pyVib and pyFrag. Yes, you've guessed it - they're both Python programs. And if that wasn't enough, I might as well announce here that a paper on the Python library cclib has just been accepted by the same journal.

It's likely that at least some of this interest in Python for computational chemistry is due to PyQuante, developed by Rick Muller. Who'd have believed that a scripting language could be used to carry out quantum chemical calculations? Of course, what people don't realise is that Python has an extension library for efficient numerical computation. Or maybe they're starting to realise it...

Librarians go Web 2.0

Yes, librarians are doing it too. To begin with, my Greasemonkey userscript for adding bloggers' quotes to journal pages has just gotten an enthusiastic write up by Mark Rabnett, a hospital librarian and blogger.

He learned of this userscript by reading a recent paper in ACS Chemical Biology by D.P. Martinsen, "Scholarly Communication 2.0: Evolution or Design?". This was news to me, so I checked it up. It turns out that it's pretty much a review of the Spring ACS sessions on Web 2.0. He begins by giving a good description of what the term Web 2.0 means, and why scientists should know about it. Then he goes on to discuss the presentations by Nick Day, Henzy Rzepa and Colin Batchelor among others (these are just the people I know or know of).

Then we come to the good bit. At the end of page 370 it says:

Two additional items, unrelated to the ACS meeting, are significant. Using Greasemonkey, a Firefox extension that allows anyone to write scripts that can change the way a web page looks, the Blue Obelisk group, a community of chemists who develop open source applications and databases in chemistry [ref to BO paper], has created several such scripts to enable chemistry-related features. One of these tools will insert links to blog stories about journal articles into the tables of contents of any ACS, RSC, Wiley, or NPG journal [ref to old BO wiki]. This enhancement to a journal’s table of contents is completely independent of the journal publisher.

That's a pretty lucid summing up of the userscript and its significance. Somewhere I suspect PMR's hand in this. :-)

Friday, 10 August 2007

Access embedded molecular information in images

Recently Rich Apodaca has been discussing (here, here and here) embedding molecular information in images of molecules, such as a PNG file depicting a 2D structure.

I'm going to show how to extract this type of embedded metadata using Python.

First of all, you'll need an image to work with. Grab the PNG file, rosiglitazone.png, from Rich's post.

Next, you'll need the Python Imaging Library (PIL), a 3rd-party Python extension library available from Pythonware.

Here's the text of an interactive Python session showing how to access the image metadata:


C:\>python
ActivePython 2.4.1 Build 247 (ActiveState Corp.) based on
Python 2.4.1 (#65, Jun 20 2005, 17:01:55) [MSC v.1310 32 bit
(Intel)] on win32
Type "help", "copyright", "credits" or "license" for more inf
ormation.
>>> import Image
>>> myimage = Image.open("rosiglitazone.png")
>>> dir(myimage)
['_Image__transformer', '_PngImageFile__idat', '__doc__', '__
init__', '__module__', '_copy', '_dump', '_expand', '_makesel
...
im', 'getpalette', 'getpixel', 'getprojection', 'histogram',
'im', 'info', 'load', 'load_end', 'load_prepare', 'load_read'
, 'mode', 'offset', 'palette', 'paste', 'png', 'point', 'puta
lpha', 'putdata', 'putpalette', 'putpixel', 'quantize', 'read
...
transform', 'transpose', 'verify']
>>> myimage.info
{'molfile': 'name\nparams\ncomments\n 25 27  0  0  0  0  0  0
  0  0  0 V2000\n    1.6910   -6.1636    0.0000 C   0  0  0
0  0  0  0  0  0  0  0  0\n    2.5571   -6.6636    0.0000 C
 0  0  0  0  0  0  0  0  0  0  0  0\n    3.4231   -6.1636
0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0\n    3.4231   -
5.1636    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0\n
2.5571   -4.6636    0.0000 C   0  0  0  0  0  0  0  0  0  0
n  7  8  1  0  0  0  0\n  8  9  1  0  0  0  0\n  7 10  1  0
...
...
0  0  0\n  9 11  1  0  0  0  0\n 11 12  1  0  0  0  0\n 12 13
  2  0  0  0  0\n 13 14  1  0  0  0  0\n 14 15  2  0  0  0  0
\n 15 16  1  0  0  0  0\n 16 17  2  0  0  0  0\n 17 12  1  0
 0  0  0\n 15 18  1  0  0  0  0\n 18 19  1  0  0  0  0\n 19 2
0  1  0  0  0  0\n 20 21  1  0  0  0  0\n 21 22  1  0  0  0
0\n 22 23  1  0  0  0  0\n 23 19  1  0  0  0  0\n 22 24  2  0
  0  0  0\n 20 25  2  0  0  0  0\nM  END', 'aspect': (1, 1)}
>>> moldata = myimage.info['molfile']

Cool. So now we could write this information to a file (print >> open("myoutputfile.mol"), moldata), or convert it into an OpenBabel molecule and calculate some properties:


>>> import pybel
>>> mymol = pybel.readstring("MOL", moldata)
>>> print mymol.molwt
357.42676

Friday, 27 July 2007

Quick thoughts: Poignant Ruby, Bryan Vickery, Python in Browsers, Another ChemRank

If I ever learned a programming language just to read the book, the language would be Ruby and the 'book', "Why's (Poignant) Guide to Ruby" (check out the introduction :-) ). (Sorry Rich, was considering Rails, but am going with Django.)

Bryan Vickery has been interviewed by David Bradley over at Reactive Reports. Someone buy this man a beer.

Python and Ruby may soon be as ubiquitous as JavaScript in browsers. Guess what I'll be using.

Hmmm...ChemRank was a neat idea, but it looks like they've already done it in Biology. No voting down, though!

Tuesday, 17 July 2007

ccWeb - A lightweight web service for parsing computational chemistry

cclib is a Python library for enabling package-independent computational chemistry algorithms. At its core is a parser for comp chem output files. The latest release, cclib 0.7, can extract information which can be used, for example, to implement charge decomposition analysis, or population analysis, or to convolute the Raman spectrum.

Let's imagine that you cannot or don't want to install cclib, but you would like to be able to access the parsed data from a comp chem output file (e.g. using a different programming language, or in a workflow environment like Taverna). The solution is ccWeb, a web service that you can send your comp chem file to, and it will send you back the result.

Specifically, you need to POST to http://www.redbrick.dcu.ie/~noel/cclib/ccWeb.py with the contents of your logfile in the parameter 'logfile'. In Python, this is done as follows:


import urllib

myurl = "http://www.redbrick.dcu.ie/~noel/cclib/ccWeb.py"
logfiletext = open("mylogfile.log", "r").read()
params = urllib.urlencode({'logfile': logfiletext})
webconnection = urllib.urlopen(myurl, params)
result = webconnection.read()

The results are returned in JSON format as a dictionary of parsed attributes, which can be directly evaluated as an expression in Python and Javascript, or more securely, parsed using a third-party library. Here's how to process the result in Python. If the third-party library, simplejson, is installed, it will be used to safely parse the results:


try:
    import simplejson
except:
    simplejson = None

if simplejson:
    data = simplejson.loads(result)
else:
    result = result.strip()
    assert result[0] == '{' and result[-1] == '}', \
           "Don't trust this JSON!"
    data = eval(result)

The actual code for the webservice is as follows:


#!/usr/bin/env python

import cgitb; cgitb.enable() # For traceback information
import os
import cgi
import tempfile
import logging

import simplejson

from cclib.parser import ccopen

fields = cgi.FieldStorage()


if fields.has_key("logfile"):

  handle, filename = tempfile.mkstemp()
  tmpfile = open(filename, "w")
  print >> tmpfile, fields.getfirst("logfile")
  tmpfile.close()

  failed = False
  try:
    a = ccopen(filename)
    a.logger.setLevel(logging.ERROR)
    data = a.parse()
  except:
    failed = True
  os.close(handle)
  os.remove(filename)

  if not failed:
    print "Content-Type: application/json; charset=utf-8"
    print "Status: 200 Ok"
    print
    print data.writejson()
  else:
    print "Content-Type: application/json; charset=utf-8"
    print "Status: 500 Internal Server Error"
    print
    print simplejson.dumps({'error': 'cclib was not able \
            to parse this logfile without errors. Please \
            send an email to cclib-users@lists.sf.net.'})
else:
  print "Content-Type: application/json; charset=utf-8"
  print "Status: 400 Bad Request"
  print
  print simplejson.dumps({'error': 'You need to set the \
      value of the logfile parameter to the contents of \
      a logfile'})

Note: ccWeb is based on the development version of the forthcoming cclib 0.8.

Thursday, 12 July 2007

Pybel - Just how unique are your molecules? Part II

I recently posted on using SMILES, Fingerprints and InChIs to identify unique molecules in a dataset. Geoff pointed out that I should have been using canonical SMILES ("can" in Open Babel), instead of non-canonical SMILES ("smi" in Open Babel). Why?

Well, the same molecule will always have the same canonical SMILES. Whereas depending on the order of the atoms in the input file, the same molecule might have a different non-canonical SMILES. Since I was interested in identifying unique molecules, I should have been using canonical SMILES.

Method

The same as before, just with "can" instead of "smi". The code used is only slightly different from that in the previous post.

Results

Here is the output of the script (see code below):


Starting...

The number of molecules is 12712

Are there any molecules with the same FP, SMI and InChI?

There are 2 molecules with 2 duplicates
There is 1 molecule with 162 duplicates
There is 1 molecule with 661 duplicates
There is 1 molecule with 1098 duplicates


The number of (unique) molecules is 10792

Are there any remaining molecules with the same canonical 
   SMILES?

There are 815 molecules with 2 duplicates
There are 61 molecules with 3 duplicates
There are 36 molecules with 4 duplicates
There are 10 molecules with 5 duplicates
There are 9 molecules with 6 duplicates
There are 9 molecules with 7 duplicates
There is 1 molecule with 8 duplicates
There are 3 molecules with 10 duplicates
There are 3 molecules with 11 duplicates
There are 2 molecules with 13 duplicates
There is 1 molecule with 14 duplicates
There is 1 molecule with 24 duplicates
There is 1 molecule with 34 duplicates

Are there any remaining molecules with the same fingerprint?

None found

Are there any remaining molecules with the same InChI?

There are 2 molecules with 2 duplicates


...Finished!

Discussion

If you compare the results here with those in my previous post, you will find that that there are two instances where two molecules have the same FP, InChI and non-canonical SMILES, but they have different canonical SMILES! By eye the two molecules are identical. This appears to be a bug in the canonicalisation routine and I've reported it to OpenBabel. The details are:

non-canonical SMILES is CC1OC(C[NH](C1)CC(C(F)(F)F)O)C but canonical SMILES is...

ZINC03883383: C[C@H]1O[C@@H](C)C[NH](C1)C[C@H](O)C(F)(F)F
ZINC03883386: C[C@@H]1O[C@H](C)C[NH](C1)C[C@H](O)C(F)(F)F

non-canonical SMILES is C1C(CC(C(C1O)O)O)(C(=O)O)O but canonical SMILES is...

ZINC03870192: O[C@@H]1CC(O)(C[C@H](O)C1O)C(=O)O
ZINC03870194: O[C@H]1CC(O)(C[C@@H](O)C1O)C(=O)O

Although my original aim was simply to give an example of how to use Pybel, this trivial analysis of ZINC has identified areas for improvement in both ZINC and OpenBabel. Without access to such a large dataset as ZINC, errors such as those identified here would go unnoticed. If anyone has any more ideas for tests, let me know...

Tuesday, 10 July 2007

Pybel - Hack that SD file

At some point, most cheminformaticians need to hack at SD files. Maybe they need to extract the values of particular data fields, or add in some new ones. Unfortunately, our friends 'awk', 'grep' and 'Excel' (!) are not very useful for this so we need to use a scripting language.

The examples shown here use the Python module Pybel, included with the Open Babel distribution. Although it's a Python module, all of the hard work is done by the underlying C++ library.

Convert an SD file to a spreadsheet

Given an SD file with calculated descriptor values in the fields, I want to write out a Tab-separated file containing a SMILES string in the first column, the title in the second column, and then the descriptor values in the remaining columns.


import pybel

inputfile = pybel.readfile("sdf", "ace_ligands.sdf")
#  We need to read the first molecule to find
#  the descriptor names
mol = inputfile.next()
print "\t".join(["SMILES", "Title"] + mol.data.keys())

# Print out information on the first molecule
print "\t".join([mol.write("smi").split()[0], mol.title]
                 + mol.data.values())
# Print out information on the remaining molecules
for mol in inputfile:
    print "\t".join([mol.write("smi").split()[0], mol.title]
                 + mol.data.values())

Notes:

The data fields in an SD file are in the 'data' attribute of a molecule, which behaves like a dictionary with keys() and values().
The "\t".join(mylist) concatenates a list of strings around Tab characters.
If you convert a molecule to a SMILES string, the result is a string containing the SMILES string, a Tab, the molecule title, and a carraige return. To simply access the SMILES string, you need to split() it and access the first element.

Add Rule-of-5 descriptor values to an SD file

Although Lipinksi's Rule-of-5 (or fives) may be both a Sacred Cow and an Evil Metric (according to the Great Molecular Crapshoot), we still love it. Here's how to add RO5 descriptor values to an SD file (requires Open Babel 2.1.1):


import pybel

HBD = pybel.Smarts("[!#6;!H0]")
HBA = pybel.Smarts("[$([$([#8,#16]);!$(*=N~O);" +
                   "!$(*~N=O);X1,X2]),$([#7;v3;" +
                   "!$([nH]);!$(*(-a)-a)])]")

def moredesc(mol):
    ans = {}
    ans['RotBonds'] = mol.OBMol.NumRotors()
    ans['HBD'] = len(HBD.findall(mol))
    ans['HBA'] = len(HBA.findall(mol))
    ans['molwt'] = mol.molwt
    return ans

output = pybel.Outputfile("sdf", "LipinskiRulesOK.sdf")

for mol in pybel.readfile("sdf", "NoRules.sdf"):
    mol.data.update(mol.calcdesc())
    mol.data.update(moredesc(mol))
    output.write(mol)

output.close()

Notes:

The 'calcdesc' method of a Molecule returns a dictionary containing descriptor values for LogP, Molecular Refractivity (MR), and Polar Surface Area (PSA).
The 'update' method of a dictionary adds the contents of one dictionary to another.