Wednesday 19 December 2007

Matplotlib tips

Recently I've been doing some plots with matplotlib. Although it has all gone well, it hasn't been entirely obvious how to do certain things. So, for the benefit of posterity (i.e. me in a month's time) I'm going to record some of my newly-gained knowledge here.
  • Colours: 'gray' is a colour too, 'k' is short for black.
  • Don't like the border on a legend? You need to stick the legend into a variable, "leg = legend()" for example, and say "leg.draw_frame(False)" (this pearl of wisdom comes courtesy of the matplotlib source code).
  • Want to make the legend text (and hence the legend) smaller? Try legend(prop={"size":10}).
  • To adjust the position or font of text in a plot, it may be easiest to save as SVG, open in Inkscape (a truly fantastic program), click on the image and hit 'Ungroup' a few times (CTRL+SHIFT+G), make the adjustments, resave the SVG, and finally Export to Bitmap (300 dpi is probably good enough).
  • Want two overlapping histograms on the same plot? Use the width keyword to hist() to set the bar widths to half the bin size. Also, catch the rectangles used to make the second histogram so that you can offset the bars by adding half the bin size:

    a, b, c = hist(x, bins, facecolor='k', width=width)

    for rect in c:
    rect.set_x(rect.get_x() + width)
    Update 11/05/2010: With current version of Matplotlib, you need to replace width=width with rwidth=0.5.
    You might also want to store the first rectangle of each histogram in a list and pass it as a first argument to legend.
  • To add a regression line between x=100 and x=600 (and get the R and significance values for free), try:

    from scipy import stats
    grad, inter, r, p, std_err = stats.linregress(x, y)
    plot((100, 600), [grad*x + inter for x in [100, 600]])
  • Line styles: "-" for solid, "--" for dashed, ":" for dotted. Unfortunately, the gaps in 'dashed' are too big. To define your own line style, you need to catch the line in a variable, e.g. "lines = plot(...)", and use lines[0].set_dashes((5,2)) for dashes of length 5 separated by gaps of 2 (unknown units).
  • Want Greek symbols and subscripts in axes labels? It worked for me on windows with something like the following:

    rc('text', usetex=True)
    xlabel(r"$\rho_{8.0}", fontsize="large")
    Note that the first time you run this on Windows, matplotlib might need to connect to the internet to download a needed component (MiKTeX). Also, when you make changes to the TeX stuff you might need to run your program twice to pick up on the changes.
    Update 03/01/2011: With current version of Matplotlib, a TeX parser is included.
    The Angstrom symbol can be include in a legend as follows: "RMSD ($\AA$)". For more information on available symbols, see the Matplotlib mathtext page.

Wednesday 12 December 2007

Using Jmol for Drug Design - Depict Ligand in Active Site

In the previous article, I showed how to install Jmol. This article will look at depicting a ligand in a protein active site with a few extra complications. This example was not chosen to showcase Jmol - this is a real-life example of a figure I need to create. You can find on the web lots of examples of scripting Jmol to turn on/off ball-and-stick models. Figuring out how to do difficult things is a bit harder, and I hope others will find this example useful...

Specifically, I want to depict the ligand in the protein active site, with the protein represented by an isosurface coloured by depth in the protein (values which I will supply), and with protein-ligand hydrogen bonds shown. I have a protein .MOL2 file with hydrogens added; I have a ligand .MOL file, I have the corresponding protein .PDB file (no hydrogens and the ligand and waters have been removed); and I have a text file containing, for each atom in the protein .MOL2 file, a value related to the depth of that atom in the protein. And here is the final image (screenshotted from the browser , cropped a little, and saved as a JPG with zero compression - click for the original image):Before presenting the necessary code, to help understanding I note that Jmol imposes some constraints:
  • bonds can only exist between molecules in the same model
  • if the slab is turned on, it cuts through every model
  • Jmol can identify the residue names in the PDB file, but not in the .MOL2 file
  • Jmol does not automatically identify protein-ligand hydrogen bonds
  • if mapping property data onto an isosurface, it needs to be supplied for every atom in the model (actually, there is an alternative, but this is a good rule of thumb)
I should also point out that Jmol can colour isosurfaces by depth in the protein automatically, but I wanted to use my own measure of depth. Update 13/12/07: Due to popular demand, here is the actual applet in motion: [low quality] [high quality]. And finally, after all that, here is the code (available also as a download):
<html>
<head>
  <script type="text/javascript" 
  src="file:///D:/Tools/Jmol/jmol-11.3.53/Jmol.js">
  </script>
  </head>
<body>
<script type="text/javascript">
  var protein = "1p62";
  jmolInitialize("file:///D:/Tools/Jmol/jmol-11.3.53", true);
  var jmolcmds = [
  // Load the molecules
  "load file:///D:/Ligands/" + protein + "_ligand.mol",
  "set appendNew = false",
  "load APPEND file:///D:/Proteins/" + protein + "_protein.mol2",
  "set appendNew = true",
  "load APPEND file:///D:/Proteins/" + protein + ".pdb",

  // Turn up display quality but don't yet show anything
  "display none; set antialiasDisplay true",

  // The following is the output of "show orientation"
  "reset;center {67.99645 35.5767 19.92025}",
  "rotate z -21.47; rotate y 25.81; rotate z 74.65",
  "zoom 400.0; set rotationRadius 35.03",

  // Add the HBonds
  "model 1.1",
  "connect 3.1 (molecule=1 and elemno=8) \
(molecule!=1 and elemno=7) HBOND RADIUS 0.07 CREATE",
  "connect 3.2 (molecule=1 and elemno=7) \
          (molecule!=1 and elemno=8) HBOND RADIUS 0.07 CREATE",
  "connect 3.1 (molecule=1 and elemno=8) \
(molecule!=1 and elemno=8) HBOND RADIUS 0.07 CREATE",
  "connect 3.1 (molecule=1 and elemno=7) \
(molecule!=1 and elemno=7) HBOND RADIUS 0.07 CREATE",

  // Load info for isosurface and draw it
  "{model=1.1}.property_d = load(\"D:/Data/" + protein +"_rd.txt\")",
  "isosurface select(molecule!=1 and within(10.0, molecule=1)) \
ignore(molecule=1) resolution 5 sasurface 0 \
COLOR ABSOLUTE 30 120 REVERSECOLOR MAP property_d",

  // Select and label the HBonded protein atoms in model 2.1
  "define hbondedproteinatoms within(3.2, true, \
model=1.1 and molecule=1 and (elemno=7 or elemno=8)) \
and not within(0.1, true, model=1.1 and molecule=1) \
and model=2.1 and not elemno=1 and not solvent",
  "select hbondedproteinatoms; font label 15 SANSERIF BOLD",
  "set labelFront; color label white; label %n%R",

  // Add to selection the ligand and
  // its HBonded protein atoms in model 1.1
  "select selected or (((molecule=1 and model=1.1) or \
within(3.2, molecule=1 and model=1.1 and \
(elemno=7 or elemno=8))) and not elemno=1)",

  // Slice through everything with a slab
  "slab on; slab 51",

  // Remove the Jmol logo (I will reference Jmol instead)
  "set frank off",

  // Show the selected data from all models
  "model 0; display selected",
  ];
  jmolApplet(700, jmolcmds.join("; "));
</script>
</body>
</html>

Tuesday 11 December 2007

Using Jmol for Drug Design - Installation

Jmol is well known as the program for showing 3D structures in webpages. It's free, it's open source and it has excellent support. However, I've often felt that Jmol's potential is not being fully realised by chemists in general, and by me in particular. So this series of posts will describe how to use Jmol to create publication quality pictures of drug molecules in a protein active site. More about that in the next post; first we have to install Jmol.

I'm going to be running Jmol as an applet in Firefox on my Win XP machine. Note that Jmol also comes as a stand-alone application. The benefit in running Jmol in a web browser is that you have access to the JavaScript scripting language. The only downside is that you need to remember to use the signed applet (as described below) or else you will have problems opening files at arbitary locations on your computer.

Step one is to download the absolutely latest Jmol. If you look at the download page, see under Jmol Prerelease Tests for the cutting edge (currently 11.3.54). There's a new version practically every day. The file you want is called something like jmol-11.3.54-binary.zip. Download and unzip the file into something like D:\Tools\Jmol\jmol-11.3.54.

Next, download the protein structure 1hnn from the PDB and save as something like D:\Proteins\1hnn.pdb.

Now we can create the "Hello World" of Jmol applets, which simply loads and displays the protein structure you have just downloaded. Save the following code as something like D:\Applet\HelloWorld.htm. You should change the paths in the following HTML to correspond to the actual paths on your computer:
<html>
<head>
<script type="text/javascript"
src="file:///D:/Tools/Jmol/jmol-11.3.54/Jmol.js">
</script>
</head>
<body>
<script type="text/javascript">
jmolInitialize("file:///D:/Tools/Jmol/jmol-11.3.54", true);
var jmolcmds = [
"load file:///D:/Proteins/1hnn.pdb",
];
jmolApplet(700, jmolcmds.join("; "));
</script>
</body>
</html>

Open the HTML file with Firefox. The first time you do this, you will be asked whether you trust Jmol to access your hard drive responsibly. You should say yes to this if you wish to proceed. This warning indicates that you are using the signed applet rather than the unsigned applet. Using the signed applet on a live website will probably discourage users, so is best avoided, but it is very useful when running locally on your machine, as the unsigned applet is restricted to opening files in particular directories.

The next post takes Jmol to the next level, uh, level 2.

Wednesday 5 December 2007

Python gives you wings

Along with Piled Higher and Deeper, I really like xkcd. Today's comic is on a theme close to my heart:To pander to the readership of this blog, here's one on a more popular scripting language:

Tuesday 13 November 2007

Using R from Python - the best of both worlds

R is a programming environment containing every known statistical method; however, it's ugly and difficult to program in. Python has only a small number of statistical methods (which are available through SciPy), but it's elegant and easy to program in. Both have graphing libraries which leave a lot to be desired in terms of ease of use, but the graphs from Python's matplotlib are a lot more polished. So, I'd like to use R from Python...

In order to do this, you will need to install the RPy library (as well as Python and R, of course). I'm working on Windows, but I've used this previously on Linux. As an example, here are three ways to access the same hclust object from Python. They each illustrate different aspects of the Python/R interface:
from rpy import *

# START OF METHOD 1
hclust = r("""
a <- read.table("cpOfSimMatrix.txt")
mydist <- dist(1-a)
hclust(mydist)
""")
r("rm(a)") # Ends with an error if you leave anything in memory
# END OF METHOD 1

# START OF METHOD 2
set_default_mode(NO_CONVERSION)
a = r.read_table("cpOfSimMatrix.txt")
mydist = r.dist(r["-"](1,a)) # Note the trick for '1-a' here
set_default_mode(BASIC_CONVERSION)
hclust = r.hclust(mydist) # Converts R object to Python dict
# END OF METHOD 2

# START OF METHOD 3 (here's one I created earlier)
r.load(".RData")
hclust = r('myHclust')
# END OF METHOD 3

For further examples, check out Peter Cock's excellent pages on combining R and Python.

Thursday 8 November 2007

Take a molecule for a spin with Avogadro

Avogadro 0.2 is an early stage release of a new open source molecular viewer and editor. It's available for Windows, Linux and MacOSX, so there's no excuse not to try it. I took it for a test drive for the first time a week or so ago (this was on Windows), and was very impressed. Sure, there are some rough edges, as you might expect with such a low version number, but it's already capable of matching the existing competition and is sure to become the molecular editor of choice with the next release.

Once it starts, open up a molecular structure file (for example, a PDB entry) and immediately choose the navigation tool (next to the Pencil in the Tools toolbar). The left mouse button rotates the molecule, the middle zooms, and the right translates. So far, so Rasmol.

However, it's the attention to detail in the user interface that makes Avogadro stand out from the crowd. If the initial click to rotate is on an atom, the molecule rotates around the atom, whereas otherwise the molecule rotates around its centroid. Something else you will notice is the eye-candy, which really aids clarity. When rotating, a set of curvy arrows indicate the degree of rotation. If you choose the Bond Centric Manipulation Tool (indicated by an icon containing the number 90) and click on a bond, you will see bond angles lovingly displayed as semi-transparent segments. Then if you click and drag on an atom adjacent to the bond, the visualisation of the dihedral angles as you alter them is pretty cool (see below).

So what else has Avogadro got? I didn't try the molecular builder, the force-field optimisation nor the export to POV-Ray, but they're all there. I'm more interested in what's to come. Avogadro uses a plugin architecture, which means that it will be easy to incorporate third-party add-ons. Along with forthcoming support for scripting languages, this will allow me to incorporate elements of cclib or GaussSum into Avogadro.

But that's all in the future. For now, I'm just wondering how many times I can rotate this dihedral angle until it falls off...

Wednesday 7 November 2007

Preventing Spam on Mediawiki at SourceForge

Mail spam is annoying. But spam on a wiki is even more annoying. If you've spent your free time writing documentation, there's no way you want to see junk text or link spam inserted into the middle of your flowing prose. In an earlier article, I described how to install a Mediawiki wiki on SourceForge. After a while you will start to get your first spam. And so the battle begins...

Wikis don't look very impressive as a website, but for an open source project they are a great way of maintaining documentation. Nobody likes writing documentation, which is why it needs to be made as easy as possible. For example, if a user mails the project mailing list with a question about installing that software, it usually means that the documentation needs to be updated and this can be done in about a minute. (On the other hand, if a project has a frequently-asked questions page, it means that they couldn't be bothered improving the documentation.)

Fortunately, spam can simply be controlled using permissions (thanks to David Wild for some of these suggestions). To begin with, make sure you keep an eye on the Recent Changes feed on your wiki (use an RSS reader). (As an aside, if your Recent Changes feed is broken, a possible cause is that you have installed an extension and included a rouge blank line after the final "?>" - this happened to me.)

Next you need to edit the permissions in LocalSettings.php to disable anonymous edits and disable account creation except by users who already have accounts. Of course, the admin is always allowed to do whatever. I'll explain the rationale for this below, but first here are the settings:
# No anonymous editing allowed
$wgGroupPermissions['*']['edit'] = false;
$wgGroupPermissions['user']['edit'] = true;
$wgGroupPermissions['sysop']['edit'] = true;
# Only users with accounts can create accounts
$wgGroupPermissions['*']['createaccount'] = false;
$wgGroupPermissions['user']['createaccount'] = true;
$wgGroupPermissions['sysop']['createaccount'] = true

On one of the wikis I am involved with on SourceForge, we had already disabled anonymous edits. However, anyone could create an account. At some point the spammers upgraded their spam software and were then able to create accounts on the wiki. Since the RSS feed for recent changes was not working (for the reason described above), it was three days and about 1000 spam accounts later before I realised the problem. At that stage it was too late to implement the solution I described above. Instead, I created a new group 'human', added the 10 or so real accounts to that group and gave them permissions while simultaneously removing all permissions from the regular 'user' account.
# Only 'humans' can edit
$wgGroupPermissions['*']['edit'] = false;
$wgGroupPermissions['user']['edit'] = false;
$wgGroupPermissions['human']['edit'] = true;
$wgGroupPermissions['sysop']['edit'] = true;
# Only 'humans' can create accounts
$wgGroupPermissions['*']['createaccount'] = false;
$wgGroupPermissions['user']['createaccount'] = false;
$wgGroupPermissions['human']['createaccount'] = true;
$wgGroupPermissions['sysop']['createaccount'] = true;

Bye-bye spam. Of course, I still had to revert about 30 edits...:-/

Image credit: Spam wall by freezelight

Friday 2 November 2007

ANN: cclib 0.8 released - parsers and algorithms for comp chem


On behalf of the cclib development team, I am pleased to announce that cclib 0.8 is now available for download.

cclib is an open source library, written in Python, for parsing and interpreting the results of computational chemistry packages. It currently parses output files from ADF, GAMESS (US), GAMESS-UK, Gaussian, Jaguar, Molpro and PC GAMESS. A paper is currently in press.

The main changes in this release are:
* the addition of a Molpro parser
* support for writing data to/from files as JSON
* API additions: charge, multiplicity, Natural Orbital coefficients
* New method: Lowdin population analysis
* use of Numpy instead of Numeric

Among other data, cclib extracts:
* coordinates and energies
* information about geometry optimization
* atomic orbital information
* molecular orbital information
* information on vibrational modes
* the results of a TD-DFT calculation

cclib also provides some calculation methods for interpreting the electronic properties of molecules using analyses such as:
* Mulliken and Lowdin population analyses
* Overlap population analysis
* Calculation of Mayer's bond orders.

For more information see our website, read the tutorial, or send us an email.

Tuesday 30 October 2007

A few brief words about journal names

If you solved the cryptic crossword that is today's blog title, you will have realised that I'm talking about journal abbreviations. For some reason, almost all journals require you to abbreviate the names of the journals you cite, while simultaneously (and this is the bit that really gets me) not letting you know what abbreviations you should use: it is "J. Mol. Struc." or "J. Mol. Struct." or even "J. Mol. Struct. THEOCHEM"?

Perhaps you do as I do; in that last-minute tidy up of citations, I google "journal abbreviations" and try to find suitable abbreviations for the last few remaining articles. Or sometimes I spend a quiet afternoon searching through the list of citations in recent articles from that journal just to find how out to cite that particular journal!

But sometimes a little voice inside me cries "Why!" - why am I wasting the best years of my life solving the non-problem that is how to abbreviate the title of a particular journal? Don't journals want to be cited correctly? In this age of electronic linking of data, misspelling the journal abbreviation could mean that a citation to a particular journal would be missed. Since journals seem to think that Impact Factors are so important, couldn't they at least give some hint what abbreviation they think you should use?

So, on to my favourite part: "The Case Study", aka "The one that pushed me over the edge". I wanted to cite a paper from Nucleic Acids Research. This is commonly known (to me) as NAR. Indeed, the website, is nar.oxfordjournals.org. And the blurb on the front page says "Nucleic Acids Research (NAR) is a fully Open Access journal". Fantastic, NAR must be the accepted journal abbreviation. Job done.

Just to be sure, let's check the Instructions for Authors. Here they give an example of how they wish references to be formatted. The great thing is that the examples they give are from NAR:
1. Schmitt,E., Panvert,M., Blanquet,S. and Mechulam,Y. (1995) Transition state stabilisation by the 'high' motif of class I aminoacyl-tRNA synthetases: the case of Escherichia coli methionyl-tRNA synthetase. Nucleic Acids Res., 23, 4793-4798.
...but wait a second, that's "Nucleic Acids Res." not "NAR"!

Now our curiosity (aka annoyance) is piqued, so let's check out the current issue of Nucleic Acids Research. But what's that in the title of the page, "Nucl. Acids Res. -- Table of Contents...".

So NAR, or "Nucl. Acids Res." or "Nucleic Acids Res."? And don't even get me started on why journals require these abbreviations in the first place even if they are web-only...

Image credit: Raistrick's Index to Legal Citations and Abbreviations by ex_libris_gul (CC BY-NC-SA 2.0)

Saturday 20 October 2007

At last - merge, split and create PDFs with open source tools

I recently had to manipulate some PDFs, and I was pleasantly surprised that things had improved so much since the last time I had to do this, a couple of years ago.

Whatever you feel about PDFs (for example, you may believe they are a very effective way of destroying all scientific data in the literature), your opinion will suddenly take a nosedive the first time you have to, for example, extract a single page from a PDF, or merge two PDFs together. At this point, you will suddenly realise that Adobe own you, and that you will need to buy Adobe's software if you want to perform this trivial task.

But help is at hand in the form of Open Source software. I found that I was able to manipulate PDFs by using Open Source tools, and on Windows. The first thing to do is to install PDFCreator (GPL). Once it's installed, when you print from any application (for example, Word) you just choose the PDFCreator 'printer', and click OK. After a couple of seconds, a dialog box will pop up where you can just click "Save" (or you might want to adjust the page size to A4 or Letter), and it will make a PDF with the same name as your original document.

A couple of years ago, I used PDFCreator and it worked 96% of the time. That is, Word documents sometimes had strange symbols inserted instead of Greek letters or bullet points; also, extra spacing was sometimes inserted in lines with some text in superscript. This time, it worked perfectly. Well done, PDFCreator creators.

Scanned documents are now often provided as PDFs. I needed to merge some pages of one scanned document with my new PDF. For this, I needed Pdftk (GPL). The blurb says it all:
If PDF is electronic paper, then pdftk is an electronic staple-remover, hole-punch, binder, secret-decoder-ring, and X-Ray-glasses. Pdftk is a command-line tool for doing everyday things with PDF documents.
Here's an example commandline which is similar to the one I actually used (again I did this on Windows). It creates a combined PDF which consists of pages 1-7 from one.pdf, 1-5 from two.pdf, and ends with page 8 from one.pdf.
pdftk A=one.pdf B=two.pdf cat A1-7 B1-5 A8 output combined.pdf
If you ever need to deal with PDFs, hopefully these tools can help reduce the pain.

Friday 12 October 2007

ANN: Frog donates code to OpenBabel for SMILES to 3D conversion


Recently, researchers at the French research institutes INSERM and CNRS developed an online service for converting SMILES string to 3D conformers: "FRee Online druG 3D conformation generator (Frog)". A description of this service was published in T. Bohme Leite, D. Gomes, M.A. Miteva, J. Chomilier, B.O. Villoutreix and P. Tufféry. Nucleic Acids Research, 2007, 35, W568-W572:
Frog is an on-line service aimed at generating 3D conformations for drug-like compounds starting from their 1D or 2D descriptions. Given the atomic constitution of the molecules and connectivity information, Frog can identify the different unambiguous isomers corresponding to each compound, and generate single or multiple low-to-medium energy 3D conformations, using an assembly process that does not presently consider ring flexibility. Tests show that Frog is able to generate bioactive conformations close to those observed in crystallographic complexes.

On behalf of the OpenBabel project, I am pleased to announce that Dr. Bruno Villoutreix (INSERM, University of Paris 5) and Dr. Pierre Tufféry (INSERM, University of Paris 7) have generously donated their code to OpenBabel. This code will be incorporated into OpenBabel under the GPL in the coming months, making fast and accurate SMILES-to-3D conformer generation available to the open source community for the first time.

The absence of an open source 3D conformer generation algorithm has increasingly become a problem in recent years due to the popularity of SMILES strings for the description of molecular information. Fortunately, this problem has now been solved. Thanks again to all those involved in the development and release of this code.

For further information on Frog, please contact the corresponding author of the Frog paper.

Image credit: gottcha78

O'woe is me - It's a capostrophe

"The username contains an illegal character." This is the final straw. I can take no more. I need to share the pain of living with an apostrophe.

Yesterday I read Pedro's post and decided to get on the crest of that Web 2.0 wave. So I requested a beta account on JournalFire, which sounds an interesting way to review the literature. Today, I received permission to create an account. It asked me to: "Enter your name as it would appear for publication. For example: John M. Delacruz". So I entered "Noel M. O'Boyle". And I received the infamous message, "The username contains an illegal character"!!

What's the story with this apostrophe anyway? Well, at some point in history, when my ancestor was called "Ó Baoighill", he had this great idea: "hey, what's the story with this accent on the O; I want to use this new thing called an apostrophe that's going to mess up web accounts for my descendants" (or more likely "Yó, cad é an scéal leis an fada seo; ba mhaith liom usaid a bhaint as an uaschamóg nua sin, a scriosfaidh saol mo pháistí ar an idirlíon"). I'd have appreciated a bit more foresight, forefather.

On the web, the Irish are outlaws. This isn't the first time this has happened. A lot of websites think I'm trying to hack into their systems with my carefully-crafted surname. SourceForge is one of the only sites that allows me to use the apostrophe. But even there, they spell my name "Noel O\'Boyle" just to say..."okay, you can use the apostrophe, but we're keeping an eye on you to make sure you don't try anything crazy with it". Actually, I've just noticed that it's gotten even worse. Now, my name is spelt "Noel O\\\'Boyle". That's like two pairs of handcuffs.

Certain organisations should be more familiar with author names than others; for example, a journal. A paper of mine was recently accepted by J. Comp. Chem. On the PDF proof, the names are all in capitals, i.e. "NOEL M. O'BOYLE" for mine. However, on the web abstract, in the Table of Contents, and in the XML they provided to PubMed, my name is "O'boyle". I feel diminished.

It's not only on the web, of course. When spelling your name over the phone you find yourself saying things like "it's a comma in the air" when trying to describe an apostrophe. Sometimes you just give up and spell your name "Oboyle" with the result that people think it's "O-boy-lay". I've even gotten letters in the post addressed to "Noel O?Boyle". Which, right now, is how I feel.

Friday 5 October 2007

Bring PDB codes to life on web pages

Wouldn't it be nice if you could click on a PDB code on a web page and you could instantly see the actual structure in 3D? For example, you might be reading the HTML version of a paper which is discussing a particular protein structure.

Some time ago I wrote a Greasemonkey userscript to do just that. You can get it on the Blue Obelisk web site, as well as several other userscripts. Since I haven't previously mentioned it on this blog I thought I might do so now.

To avoid false positives, it only runs on web pages containing the words "protein", "pdb" or "enzyme". For any PDB codes found, it adds the appropriate link to Eric Martz's FirstGlance in Jmol site. For example, the PDB codes on this page are tranformed to:(The yellow Jmol links were added by the script)

Friday 14 September 2007

RDKit: Not just yet another cheminformatics toolkit

I'm sitting here with a bandage covering my head, as RDKit has just blown my mind.

RDKit is an cheminformatics toolkit written in C++ and Python. It was developed in-house in a company called Rational Discovery (hence the RD) since 2001. In Feb 2006, it appeared on SourceForge under a liberal license (BSD, except for the GPL Qt code), an appearance which presumably coincided with the demise of Rational Discovery (the company, not the concept, that is :-) ). And there it stayed, actively developed by two developers, but unknown among the open source chemistry community until...

A month ago, I happened to be glancing through the SourceForge software map of chemistry software and I was intrigued by the description of RDKit as "A collection of cheminformatics and machine-learning software developed at Rational Discovery". The website was pretty minimal and there didn't even appear to be any documentation. I dashed off an email to Greg Landrum, the main developer (who it turns out is also the developer of YAeHMOP (Yet Another extended Huckel Molecular Orbital Package) ), and asked him what the story was. Two days ago, he returned from holidays and pointed me to the correct website and the documentation, and I couldn't believe what I was seeing...

Some features that I think are cool:
(1) Molecules based on the Boost Graph Library
(2) All the Python stuff works for me on Windows!
(3) 2D depiction!!!
(4) 2D depiction that mimics 3D conformations!!!
(5) 2D --> 3D conversion in a similar method to Rajarshi's smi23D! (doesn't use stochastic promixity embedding though)

Here's a summary of some of the rest: SMILES, substructure searching, sophisticated fingerprints, machine learning stuff, a GUI, clustering, MACCS keys, descriptors (84 or so), chemical reaction transformations, implementation of Recap (not sure what this is, but there's a ref in the docs), basic pharmacophore stuff, and two types of SSSR (there's some text about this in the docs).

Cool! There's obviously a lot of overlap between OpenBabel and RDKit, and hopefully we can use this to both projects' advantage in terms of testing against each other and developing interfaces. In the meanwhile, here's to diversity, and to discovering that someone else has implemented 2D depiction in C++ so I Don't Have To.

For more info, see www.rdkit.org, and in particular, the Python interface documentation which gives a good overview.

Image credit: Toolkit by Neil T (CC BY-SA 2.0)

Wednesday 12 September 2007

Exclusive: PRISM's response to scientific community's outrage

In a world exclusive, I can reveal the considered response of PRISM to the detailed arguments raised by Peter Suber, Peter Murray-Rust, Steve Harnad, etc., etc.:



For more thoughtful lampoonery, see Trapped in the USA and PISD.

* Image credit: This is a derivative of the image in the Wikipedia article "All your base are belong to us". I plead fair use, and a sense of humour.

Wikifying chemistry

There has been some interest in using wikis to annotate molecules: e.g. the ChemSpider blog where Antony is interested in using a local wiki to annotate entries, chem-bla-ics where Egon is trying to ensure that molecular data on Wikipedia can be accessed like a database, and most recently where PMR has commented on DBPedia.

This idea is already being used in biology. The Rfam database at the Sanger Institute is directly using wikipedia to create annotations for the major RNA families, which each have a page on Wikipedia. The full list of Wikipedia pages seems to be available here. From what I've heard second-hand, interested academics are invited to contribute to the pages on their favourite families. Every day the pages are downloaded and backed up, and the information made available through the Rfam database. All edits are tracked using Wikipedia's own tracking facilities (i.e. watchlists) so that vandalism is easily detected, although apparently this hasn't been a problem.

I'm not sure how much this idea could be used in chemistry (perhaps a database of drugs...?), but it sure is some food for thought...

Wednesday 5 September 2007

Scooby, D. D., where are you? - Searching for papers online

One of my pet annoyances is searching on a journal website for an article that I know exists, but not being able to find it. Let's take a real-life example...

I met Douglas Hawkins at the ACS and wanted to look up his papers in JCIM/JCICS. I've bookmarked the JCIM TOC page, so I go there. At the top is a handy little shortcut for finding papers by a particular author. So I type in "Hawkins" next to the Author drop-down box, and click "Search"...

10 documents. Of which half are Douglas Hawkins (the right guy), and the other half are Donald Hawkins (the wrong guy). In addition, both JCIM and JCICS were searched. Great. This is where I should have stopped. Instead I decided I only wanted to find those papers by Douglas Hawkins. The following are my attempts to find his papers:

Hawkins, D. M.: 7 article. All false positives.
Hawkins D M: ditto
"Hawkins, D. M.": 0 documents
D M Hawkins: It's my favourite 7 articles again.
D Hawkins: 24 articles. No articles by D. M. Hawkins included.

But how can I have found 24 articles for the last search? That's more than I found with just "Hawkins". Wait a second, I've been moved onto the "Advanced Article Search" page, which is searching all of the journals. So what did it find instead? It found "...Mass, J. D.; Hawkins, A. R.;". Pretty advanced searching, eh?

Thankfully, I happen to know that unlike most chemistry journals, the ACS has contributed data to PubMed. So off I go, and try:

Hawkins, D. M.: first hit contains "M. M. Hawkins"
DM Hawkins: first hit contains "Hawkins EC"
Hawkins DM: jackpot!

Now I just want those papers where Basak is a co-author:
Hawkings DM Basak: jackpot!

Incidentally, this is the first time that Pubmed has ever worked for me. This is because I always try "DM Hawkins Basak" which gives no hits (even after reading the instructions I was never able to find anything). As for how to limit the results to a particular journal? Why isn't there a drop-down box with a list of journals? Why do I have to read the instructions for the obscure syntax used by PubMed every time I want to find a paper?

What about Google scholar? "DM Hawkins" or "D. M. Hawkins" gives 1700 hits, of which the first page at least are true positives. Advanced Search allows me to specify the journal, and I get 1 hit with "J Chem Inf Comput Sci", a different hit for "J Chem Inf Comp Sci" (of which there are 5 versions, one of which has "Comput Sci" instead of "Comp Sci") and 1 for "J Chem Inf Model". In fact, Hawkins has 3 papers in JCICS and 2 in JCIM. You look good Google, but you're not trying very hard.

Searching by author should be easy. It's not quantum mechanics. It's not even rocket science. Can't they make it work better?

Monday 3 September 2007

Python code for Huckel theory calculations

Over at Chemical Quantum Images, Felix has posted Python code that will read a file containing a molecular structure and calculate the molecular orbitals using Huckel theory. The code uses the OpenBabel Python module to read the structure, and NumPy, the Python extension for numerical computing, to do the heavy lifting (to diagonalise a matrix).

Now, if I only understood the theory...:-)

Saturday 11 August 2007

Move over FORTRAN: Python makes inroads into computational chemistry

Old-school dyed-in-the-wool computational chemists like to use FORTRAN and C a lot, argue about which is better, and create programs whose names are written all in CAPITALS and will only compile on an SGI-IRIX. Although you gotta love them for trying, it's good to see that times are a-changing.

Looking at "Early View" for Wiley's Journal of Computational Chemistry, you will see two "Software News and Updates": pyVib and pyFrag. Yes, you've guessed it - they're both Python programs. And if that wasn't enough, I might as well announce here that a paper on the Python library cclib has just been accepted by the same journal.

It's likely that at least some of this interest in Python for computational chemistry is due to PyQuante, developed by Rick Muller. Who'd have believed that a scripting language could be used to carry out quantum chemical calculations? Of course, what people don't realise is that Python has an extension library for efficient numerical computation. Or maybe they're starting to realise it...

Librarians go Web 2.0

Yes, librarians are doing it too. To begin with, my Greasemonkey userscript for adding bloggers' quotes to journal pages has just gotten an enthusiastic write up by Mark Rabnett, a hospital librarian and blogger.

He learned of this userscript by reading a recent paper in ACS Chemical Biology by D.P. Martinsen, "Scholarly Communication 2.0: Evolution or Design?". This was news to me, so I checked it up. It turns out that it's pretty much a review of the Spring ACS sessions on Web 2.0. He begins by giving a good description of what the term Web 2.0 means, and why scientists should know about it. Then he goes on to discuss the presentations by Nick Day, Henzy Rzepa and Colin Batchelor among others (these are just the people I know or know of).

Then we come to the good bit. At the end of page 370 it says:
Two additional items, unrelated to the ACS meeting, are significant. Using Greasemonkey, a Firefox extension that allows anyone to write scripts that can change the way a web page looks, the Blue Obelisk group, a community of chemists who develop open source applications and databases in chemistry [ref to BO paper], has created several such scripts to enable chemistry-related features. One of these tools will insert links to blog stories about journal articles into the tables of contents of any ACS, RSC, Wiley, or NPG journal [ref to old BO wiki]. This enhancement to a journal’s table of contents is completely independent of the journal publisher.
That's a pretty lucid summing up of the userscript and its significance. Somewhere I suspect PMR's hand in this. :-)

Friday 10 August 2007

Access embedded molecular information in images

Recently Rich Apodaca has been discussing (here, here and here) embedding molecular information in images of molecules, such as a PNG file depicting a 2D structure.

I'm going to show how to extract this type of embedded metadata using Python.

First of all, you'll need an image to work with. Grab the PNG file, rosiglitazone.png, from Rich's post.

Next, you'll need the Python Imaging Library (PIL), a 3rd-party Python extension library available from Pythonware.

Here's the text of an interactive Python session showing how to access the image metadata:

C:\>python
ActivePython 2.4.1 Build 247 (ActiveState Corp.) based on
Python 2.4.1 (#65, Jun 20 2005, 17:01:55) [MSC v.1310 32 bit
(Intel)] on win32
Type "help", "copyright", "credits" or "license" for more inf
ormation.
>>> import Image
>>> myimage = Image.open("rosiglitazone.png")
>>> dir(myimage)
['_Image__transformer', '_PngImageFile__idat', '__doc__', '__
init__', '__module__', '_copy', '_dump', '_expand', '_makesel
...
im', 'getpalette', 'getpixel', 'getprojection', 'histogram',
'im', 'info', 'load', 'load_end', 'load_prepare', 'load_read'
, 'mode', 'offset', 'palette', 'paste', 'png', 'point', 'puta
lpha', 'putdata', 'putpalette', 'putpixel', 'quantize', 'read
...
transform', 'transpose', 'verify']
>>> myimage.info
{'molfile': 'name\nparams\ncomments\n 25 27 0 0 0 0 0 0
0 0 0 V2000\n 1.6910 -6.1636 0.0000 C 0 0 0
0 0 0 0 0 0 0 0 0\n 2.5571 -6.6636 0.0000 C
0 0 0 0 0 0 0 0 0 0 0 0\n 3.4231 -6.1636
0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0\n 3.4231 -
5.1636 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0\n
2.5571 -4.6636 0.0000 C 0 0 0 0 0 0 0 0 0 0
n 7 8 1 0 0 0 0\n 8 9 1 0 0 0 0\n 7 10 1 0
...
...
0 0 0\n 9 11 1 0 0 0 0\n 11 12 1 0 0 0 0\n 12 13
2 0 0 0 0\n 13 14 1 0 0 0 0\n 14 15 2 0 0 0 0
\n 15 16 1 0 0 0 0\n 16 17 2 0 0 0 0\n 17 12 1 0
0 0 0\n 15 18 1 0 0 0 0\n 18 19 1 0 0 0 0\n 19 2
0 1 0 0 0 0\n 20 21 1 0 0 0 0\n 21 22 1 0 0 0
0\n 22 23 1 0 0 0 0\n 23 19 1 0 0 0 0\n 22 24 2 0
0 0 0\n 20 25 2 0 0 0 0\nM END', 'aspect': (1, 1)}
>>> moldata = myimage.info['molfile']


Cool. So now we could write this information to a file (print >> open("myoutputfile.mol"), moldata), or convert it into an OpenBabel molecule and calculate some properties:


>>> import pybel
>>> mymol = pybel.readstring("MOL", moldata)
>>> print mymol.molwt
357.42676

Friday 27 July 2007

Quick thoughts: Poignant Ruby, Bryan Vickery, Python in Browsers, Another ChemRank

If I ever learned a programming language just to read the book, the language would be Ruby and the 'book', "Why's (Poignant) Guide to Ruby" (check out the introduction :-) ). (Sorry Rich, was considering Rails, but am going with Django.)

Bryan Vickery has been interviewed by David Bradley over at Reactive Reports. Someone buy this man a beer.

Python and Ruby may soon be as ubiquitous as JavaScript in browsers. Guess what I'll be using.

Hmmm...ChemRank was a neat idea, but it looks like they've already done it in Biology. No voting down, though!

Tuesday 17 July 2007

ccWeb - A lightweight web service for parsing computational chemistry


cclib is a Python library for enabling package-independent computational chemistry algorithms. At its core is a parser for comp chem output files. The latest release, cclib 0.7, can extract information which can be used, for example, to implement charge decomposition analysis, or population analysis, or to convolute the Raman spectrum.

Let's imagine that you cannot or don't want to install cclib, but you would like to be able to access the parsed data from a comp chem output file (e.g. using a different programming language, or in a workflow environment like Taverna). The solution is ccWeb, a web service that you can send your comp chem file to, and it will send you back the result.

Specifically, you need to POST to http://www.redbrick.dcu.ie/~noel/cclib/ccWeb.py with the contents of your logfile in the parameter 'logfile'. In Python, this is done as follows:

import urllib

myurl = "http://www.redbrick.dcu.ie/~noel/cclib/ccWeb.py"
logfiletext = open("mylogfile.log", "r").read()
params = urllib.urlencode({'logfile': logfiletext})
webconnection = urllib.urlopen(myurl, params)
result = webconnection.read()
The results are returned in JSON format as a dictionary of parsed attributes, which can be directly evaluated as an expression in Python and Javascript, or more securely, parsed using a third-party library. Here's how to process the result in Python. If the third-party library, simplejson, is installed, it will be used to safely parse the results:

try:
import simplejson
except:
simplejson = None

if simplejson:
data = simplejson.loads(result)
else:
result = result.strip()
assert result[0] == '{' and result[-1] == '}', \
"Don't trust this JSON!"
data = eval(result)

The actual code for the webservice is as follows:

#!/usr/bin/env python

import cgitb; cgitb.enable() # For traceback information
import os
import cgi
import tempfile
import logging

import simplejson

from cclib.parser import ccopen

fields = cgi.FieldStorage()


if fields.has_key("logfile"):

handle, filename = tempfile.mkstemp()
tmpfile = open(filename, "w")
print >> tmpfile, fields.getfirst("logfile")
tmpfile.close()

failed = False
try:
a = ccopen(filename)
a.logger.setLevel(logging.ERROR)
data = a.parse()
except:
failed = True
os.close(handle)
os.remove(filename)

if not failed:
print "Content-Type: application/json; charset=utf-8"
print "Status: 200 Ok"
print
print data.writejson()
else:
print "Content-Type: application/json; charset=utf-8"
print "Status: 500 Internal Server Error"
print
print simplejson.dumps({'error': 'cclib was not able \
to parse this logfile without errors. Please \
send an email to cclib-users@lists.sf.net.'})
else:
print "Content-Type: application/json; charset=utf-8"
print "Status: 400 Bad Request"
print
print simplejson.dumps({'error': 'You need to set the \
value of the logfile parameter to the contents of \
a logfile'})
Note: ccWeb is based on the development version of the forthcoming cclib 0.8.

Thursday 12 July 2007

Pybel - Just how unique are your molecules? Part II

I recently posted on using SMILES, Fingerprints and InChIs to identify unique molecules in a dataset. Geoff pointed out that I should have been using canonical SMILES ("can" in Open Babel), instead of non-canonical SMILES ("smi" in Open Babel). Why?

Well, the same molecule will always have the same canonical SMILES. Whereas depending on the order of the atoms in the input file, the same molecule might have a different non-canonical SMILES. Since I was interested in identifying unique molecules, I should have been using canonical SMILES.

Method

The same as before, just with "can" instead of "smi". The code used is only slightly different from that in the previous post.

Results

Here is the output of the script (see code below):

Starting...

The number of molecules is 12712

Are there any molecules with the same FP, SMI and InChI?

There are 2 molecules with 2 duplicates
There is 1 molecule with 162 duplicates
There is 1 molecule with 661 duplicates
There is 1 molecule with 1098 duplicates


The number of (unique) molecules is 10792

Are there any remaining molecules with the same canonical
SMILES?

There are 815 molecules with 2 duplicates
There are 61 molecules with 3 duplicates
There are 36 molecules with 4 duplicates
There are 10 molecules with 5 duplicates
There are 9 molecules with 6 duplicates
There are 9 molecules with 7 duplicates
There is 1 molecule with 8 duplicates
There are 3 molecules with 10 duplicates
There are 3 molecules with 11 duplicates
There are 2 molecules with 13 duplicates
There is 1 molecule with 14 duplicates
There is 1 molecule with 24 duplicates
There is 1 molecule with 34 duplicates

Are there any remaining molecules with the same fingerprint?

None found

Are there any remaining molecules with the same InChI?

There are 2 molecules with 2 duplicates


...Finished!
Discussion

If you compare the results here with those in my previous post, you will find that that there are two instances where two molecules have the same FP, InChI and non-canonical SMILES, but they have different canonical SMILES! By eye the two molecules are identical. This appears to be a bug in the canonicalisation routine and I've reported it to OpenBabel. The details are:

non-canonical SMILES is CC1OC(C[NH](C1)CC(C(F)(F)F)O)C but canonical SMILES is...
  • ZINC03883383: C[C@H]1O[C@@H](C)C[NH](C1)C[C@H](O)C(F)(F)F
  • ZINC03883386: C[C@@H]1O[C@H](C)C[NH](C1)C[C@H](O)C(F)(F)F

non-canonical SMILES is C1C(CC(C(C1O)O)O)(C(=O)O)O but canonical SMILES is...
  • ZINC03870192: O[C@@H]1CC(O)(C[C@H](O)C1O)C(=O)O
  • ZINC03870194: O[C@H]1CC(O)(C[C@@H](O)C1O)C(=O)O

Although my original aim was simply to give an example of how to use Pybel, this trivial analysis of ZINC has identified areas for improvement in both ZINC and OpenBabel. Without access to such a large dataset as ZINC, errors such as those identified here would go unnoticed. If anyone has any more ideas for tests, let me know...

Tuesday 10 July 2007

Pybel - Hack that SD file

At some point, most cheminformaticians need to hack at SD files. Maybe they need to extract the values of particular data fields, or add in some new ones. Unfortunately, our friends 'awk', 'grep' and 'Excel' (!) are not very useful for this so we need to use a scripting language.

The examples shown here use the Python module Pybel, included with the Open Babel distribution. Although it's a Python module, all of the hard work is done by the underlying C++ library.

Convert an SD file to a spreadsheet

Given an SD file with calculated descriptor values in the fields, I want to write out a Tab-separated file containing a SMILES string in the first column, the title in the second column, and then the descriptor values in the remaining columns.

import pybel

inputfile = pybel.readfile("sdf", "ace_ligands.sdf")
# We need to read the first molecule to find
# the descriptor names

mol = inputfile.next()
print "\t".join(["SMILES", "Title"] + mol.data.keys())

# Print out information on the first molecule
print "\t".join([mol.write("smi").split()[0], mol.title]
+ mol.data.values())
# Print out information on the remaining molecules
for mol in inputfile:
print "\t".join([mol.write("smi").split()[0], mol.title]
+ mol.data.values())
Notes:
  • The data fields in an SD file are in the 'data' attribute of a molecule, which behaves like a dictionary with keys() and values().
  • The "\t".join(mylist) concatenates a list of strings around Tab characters.
  • If you convert a molecule to a SMILES string, the result is a string containing the SMILES string, a Tab, the molecule title, and a carraige return. To simply access the SMILES string, you need to split() it and access the first element.

Add Rule-of-5 descriptor values to an SD file

Although Lipinksi's Rule-of-5 (or fives) may be both a Sacred Cow and an Evil Metric (according to the Great Molecular Crapshoot), we still love it. Here's how to add RO5 descriptor values to an SD file (requires Open Babel 2.1.1):

import pybel

HBD = pybel.Smarts("[!#6;!H0]")
HBA = pybel.Smarts("[$([$([#8,#16]);!$(*=N~O);" +
"!$(*~N=O);X1,X2]),$([#7;v3;" +
"!$([nH]);!$(*(-a)-a)])]")

def moredesc(mol):
ans = {}
ans['RotBonds'] = mol.OBMol.NumRotors()
ans['HBD'] = len(HBD.findall(mol))
ans['HBA'] = len(HBA.findall(mol))
ans['molwt'] = mol.molwt
return ans

output = pybel.Outputfile("sdf", "LipinskiRulesOK.sdf")

for mol in pybel.readfile("sdf", "NoRules.sdf"):
mol.data.update(mol.calcdesc())
mol.data.update(moredesc(mol))
output.write(mol)

output.close()
Notes:
  • The 'calcdesc' method of a Molecule returns a dictionary containing descriptor values for LogP, Molecular Refractivity (MR), and Polar Surface Area (PSA).
  • The 'update' method of a dictionary adds the contents of one dictionary to another.

Sunday 8 July 2007

ANN: Open Babel 2.1.1 Release


(To paraphrase Geoff...)

I am pleased to announce the release of Open Babel 2.1.1, the latest stable version of the open source chemistry toolbox.

This release represents a stable bug-fix release and is highly recommended for all users, even those who have not had problems with 2.1.0. It includes fixes for several important crashes, and many, many bug fixes and improvements.

What's new? See the full release notes.

To download, see our install page.

For more information, see the project website.

Have fun!

Thursday 5 July 2007

Pybel - Just how unique are your molecules?

Cheminformaticians often want to remove or identify duplicates in a dataset of molecules. There are a couple of popular ways to do this: ensure that each molecule has a unique SMILES string, has a unique Daylight-type fingerprint, or has a unique InChI.

But what's the difference? Is there any? Does a unique fingerprint imply a unique InChI, and vice versa?

Method

Enough with the questions. Let's just get some numbers. Our test set is the first slice of the drug-like subset of the ZINC database. Specifically, it's the file 3_p0.0.mol2.gz. According to ZINC, this should contain "only a single form of each molecule". We're going to use Pybel, the Python module included with Open Babel, to calculate the three common measures of uniqueness and check for duplicates.

Results

Here is the output of the script (see code below):

Starting...

The total number of molecules is 12712

Are there any molecules with the same FP, SMI and InChI?
There are 4 molecules with 2 duplicates
There is 1 molecule with 162 duplicates
There is 1 molecule with 661 duplicates
There is 1 molecule with 1098 duplicates

The total number of (unique) molecules is 10790

Are there any remaining molecules which have the same FP \
and SMI?

There are 846 molecules with 2 duplicates
There are 55 molecules with 3 duplicates
There are 23 molecules with 4 duplicates
There are 3 molecules with 5 duplicates
There is 1 molecule with 7 duplicates

Are there any remaining molecules which have the same InChI?

...Finished!
Discussion

So it seems that uniqueness is not quite as easy to measure as you might think. Even once you remove all molecules that share the same fingerprint, SMILES string, and InChI, there are still many that share the same fingerprint and SMILES string, although the InChIs are unique.

The results shown here either expose errors in OpenBabel, or possible errors in the creation of the ZINC database. In either case, this is an example of two Open chemistry resources being used to test and validate each other. Hopefully these results will aid in the development of even better software or data.

Code

import os
import glob
import pybel
import cPickle

def duplicates(keys, data):
"""Find the duplicates in 'data' in terms of particular
keys.

'keys' is a list containing some selection of the
indices 1, 2 and 3.
"""
ans = {}
for d in data:
ans.setdefault(tuple([d[key] for key in keys]),[]) \
.append(d[TITLE])
result = {}
for k, v in ans.iteritems():
if len(v) > 1: # len(v) is the number of duplicates
result[len(v)] = result.get(len(v), 0) + 1
for k in sorted(result.keys()):
if result[k]>1:
print "There are %d molecules with %d" \
" duplicates" % (result[k], k)
else:
print "There is %d molecule with %d" \
" duplicates" % (result[k], k)
return ans

def readfrommol2(filename):
data = []
for mol in pybel.readfile("mol2", filename):
data.append(( mol.title,
"%s" % mol.calcfp('FP2').bits,
mol.write("smi").split()[0],
mol.write("inchi").rstrip() ))
outputfile = open("dups.pickle", "w")
cPickle.dump(data, outputfile)
outputfile.close()
return data

def readfrompickle():
inputfile = open("dups.pickle", "r")
data = cPickle.load(inputfile)
inputfile.close()
return data

if __name__ == "__main__":
print "Starting...\n"

if not os.path.isfile("dups.pickle"):
data = readfrommol2("3_p0.0.mol2")
else:
data = readfrompickle()

TITLE, FP, SMI, INC = range(4)

# How many molecules are there?
print "The number of molecules is %d\n" \
% len(data)

# Find molecules with the same FP, SMI and INC
print "Are there any molecules with the same" \
" FP, SMI and InChI?"
answer = duplicates( [SMI, FP, INC], data)

# Select only unique molecules
uniquedata = [(v[0], k[0], k[1], k[2]) for k, v \
in answer.iteritems()]
print "\nThe number of (unique) molecules is %d\n" \
% len(uniquedata)

# Find molecules with the same FP and SMI but not INC
print "Are there any remaining molecules which have" \
" the same FP and SMI?\n"
duplicates( [SMI, FP], uniquedata )

# Find any remaining molecules with the same InChI
print "\nAre there any remaining molecules which have" \
" the same InChI?\n"
duplicates( [INC], uniquedata ) # Nothing is found

print "...Finished!"