Monday, 29 June 2009

I'll fix the bug...but only if you give me a public domain test file

Recently, Avogadro/OpenBabel have been increasing their support for computational chemistry log files. I am hoping that they will learn from our experience at GaussSum/cclib.

GaussSum was the first Python program I ever wrote, and still bears the hallmarks. When I first started GaussSum (a program which analyses the results of comp chem calculations), I would use the test cases from users to fix bugs. Then over time, I'd lose the test cases as I moved from computer to computer. I couldn't place the test cases in my version control system as the test cases might have been the results of someone's research, and they mightn't be happy to see them publicly available.

Things came to a head when dealing with the parsing of vibrational frequencies in the various versions of GAMESS. It turned out that each version of GAMESS (PC-GAMESS, WinGAMESS and GAMESS US) had slightly different output for vibrational frequencies. I ended up bouncing between code that worked for WinGAMESS but not GAMESS and vice versa, depending on who sent me the last bug report. In other words, I was wasting my time fixing bugs which might reappear later. It was around this time that I realised (a) I needed a test suite, and (b) I needed public domain test files, so I could use them in my test suite.

The parser used by GaussSum is now available as a separate project, cclib, and is developed in collaboration with Adam Tenderholt and Karol Langner. This time I put a lot of thought into the test suite, and I think we've done very well. The parsers are initially developed using a set of calculations which are the same for each comp chem package; our test suite ensures that the same results are found in each case and that the units are consistent. We only fix bugs for which a public domain test file is provided ("I place this file in the public domain" is all we need to hear), and regression tests are easily added to the test suite. Our test suite has the final say on commits; commits are reverted if they cause an existing test to fail. This guarantees that cclib can only improve over time.
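To make the idea concrete, here is a minimal sketch of a regression test of this kind. It is not cclib's actual code: the two log excerpts, the `parse_frequencies` function and the test class are all hypothetical, and the toy formats only stand in for the way different GAMESS variants print the same data slightly differently.

```python
import unittest

# Hypothetical excerpts: two invented formats reporting the same frequencies.
# Real GAMESS output looks different; these only illustrate the problem.
SAMPLE_LOGS = {
    "variant_a": "FREQUENCY:   1602.31   3655.12\n",
    "variant_b": "FREQUENCIES (CM**-1) =  1602.31  3655.12\n",
}


def parse_frequencies(text):
    """Return vibrational frequencies (cm^-1) from either toy format."""
    freqs = []
    for line in text.splitlines():
        if line.startswith("FREQUENCY:"):
            # variant_a: values follow the label, separated by whitespace
            freqs.extend(float(x) for x in line.split()[1:])
        elif line.startswith("FREQUENCIES"):
            # variant_b: values follow an equals sign
            freqs.extend(float(x) for x in line.split("=")[1].split())
    return freqs


class TestFrequencies(unittest.TestCase):
    def test_all_variants_agree(self):
        # Every variant must give the same numbers in the same units.
        for name, text in SAMPLE_LOGS.items():
            with self.subTest(variant=name):
                self.assertEqual(parse_frequencies(text), [1602.31, 3655.12])
```

Saved as a module, this runs with `python -m unittest`. Each new public domain bug report then becomes one more entry in the sample data, so a fix for one variant cannot silently break another.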

The inevitable consequence of this policy is that some reported bugs don't get fixed. Sometimes the reporter simply does not respond to the request to place the file in the public domain. On two occasions, the reporter was working in a pharmaceutical company and felt it was more hassle than it was worth to do the necessary paperwork to place it in the public domain. So it goes... On the other hand, we do now have a set of more than 200 comp chem log files which go a long way to ensuring that our parsers can handle anything that is thrown at them. The best way of getting these files is to check the data directory of cclib out of subversion and run wget.sh.

In conclusion, if you are thinking of writing software that handles comp chem files, either try to collaborate with others who are working on the same problem (e.g. cclib or OpenBabel), or at the very least take into account some of the comments here. Otherwise, you are simply building a house of cards.

Friday, 19 June 2009

Using PyActiveResource to access ChemCaster

ChemCaster, from Rich's Metamolecular, is a platform for developing web-based cheminformatics applications. The advantage of such a system is that the user does not need to install any special software, nor does the application developer need to maintain a server.

Rich invited me to take it for a spin, so I signed up for a trial account and moved quickly on to my first problem: how do I access the API through Python?

It turns out that RESTful APIs tend to have common patterns, a fact which is taken advantage of by Active Resource, a Ruby library for defining classes which directly map onto the objects implied by a RESTful API. Or something like that - I neglected to read any documentation. Instead I just took Rich's example and tried to code it up in Python using PyActiveResource (this is a documentation-free project so using it is quite exciting).

Et voilà

Tuesday, 9 June 2009

From zero to Zotero - One man's journey out of PDF hell

Zotero is a reference management software. Sorry, let me correct that - Zotero is THE reference management software. I had tried Zotero before, and it certainly looked good; but frankly I couldn't figure out how to get it to work and so reverted to my usual system, the 'zero' of the title. Hearing the news that Endnote vs. Zotero was just thrown out of court, I decided to try it again.

And it's just amazing.

Let me begin by describing a typical workflow:
(1) Go to the summary page for an ACS paper online
(2) Click on the icon that appears in the address bar (looks like a sheet of paper with writing).

That's it. You've just saved the PDF, the HTML full-text and the paper's metadata.

If you've created an account on zotero.org (free of course!), you can synch your library so that multiple computers can share the same data. And best of all you can also synch the attachments (i.e. PDFs, HTML pages) if you have a WebDAV account (e.g. from your university or in my case, JungleDisk Plus/Amazon S3). If that wasn't enough, it also integrates with Word to make it easy to prepare a publication. (Update: it works just fine, but you first need to install the bibliographic styles you need from Zotero settings/Preferences/Styles/Get additional styles.)

In other words, Zotero makes it easy to download papers, back them up, make them accessible from any computer and reference them in papers.

Zotero is open source and freely available from www.zotero.org.

Notes: I'm using Zotero 2.0b5. In the Zotero preferences (click on the gear icon), choose "Automatically attach PDFs and other files when saving items" in the General tab. JungleDisk and Amazon cost money (we're talking around $1.50 a month), but there may be free alternatives for WebDAV. For any websites that aren't currently supported by Zotero, adding new translators has been made easy. All of the JavaScript files for the translators are stored in a folder on your computer and can easily be extended or added to. That said, I've had no trouble downloading PDFs from ScienceDirect, ACS, RSC, Wiley or BMC.

Image credit: jazzmodeus

Friday, 5 June 2009

The best time to optimise

As a scientist, I worry more about bugs in software than about speed. Changing correct code to improve speed can introduce errors as well as make it unreadable for others. Sometimes though it's nice to find cases where simple changes can improve the performance.

The 3D structure generation code in OpenBabel uses templates to handle the geometry of rings. There are about 2500 templates, which are represented by SMARTS patterns and associated coordinates (see fragments.txt in the distribution). The SMARTS patterns are ordered from large to small. Now, testing 2500 SMARTS patterns against a molecule takes a wee while so I was interested in seeing whether the process could be speeded up.

To begin with, I timed the code for a test set of 1000 PubChem molecules: it took 60ms per structure. Considering that the easiest way to speed something up is to avoid doing it in the first place, I changed the loop to terminate once all ring atoms had been matched. This brought it down to 38ms per structure. Then I changed it so that it skipped any SMARTS patterns that had more atoms than the number of ring atoms in the molecule: now down to 30ms. This is now within an order of magnitude of greased lightning.
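The two shortcuts described above can be sketched in a few lines. This is a toy model, not the OpenBabel code (which is C++): `assign_ring_templates`, the template names and the `match` callback are all hypothetical, and a real SMARTS match is replaced by a lookup of precomputed atom sets.

```python
def assign_ring_templates(ring_atoms, templates, match):
    """Toy version of the template loop.

    templates: list of (pattern, n_atoms), ordered from large to small.
    match:     callable returning the atom-index sets a pattern matches.
    Returns the patterns used and the number of match attempts made.
    """
    unmatched = set(ring_atoms)
    n_ring = len(unmatched)
    used, attempts = [], 0
    for pattern, n_atoms in templates:
        if not unmatched:
            break  # shortcut 1: every ring atom already has coordinates
        if n_atoms > n_ring:
            continue  # shortcut 2: pattern is larger than all ring atoms combined
        attempts += 1  # only now pay for the (expensive) match
        for atoms in match(pattern):
            if atoms <= unmatched:
                unmatched -= atoms
                used.append(pattern)
    return used, attempts


# A benzene-like molecule: six ring atoms, four candidate templates.
toy_matches = {"six_ring": [frozenset(range(6))]}
templates = [("big_fused", 10), ("six_ring", 6), ("five_ring", 5), ("three_ring", 3)]
used, attempts = assign_ring_templates(
    range(6), templates, lambda p: toy_matches.get(p, [])
)
print(used, attempts)
```

Here the 10-atom template is skipped without being tested, the 6-atom template matches, and the loop stops before the two smaller templates are ever tried, so only one match attempt is made instead of four.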

In fact, I could have done slightly better than this; I could have skipped any SMARTS patterns with more atoms than the number of atoms in the largest isolated ring system in the molecule. Calculating this value is a bit of work though and may offset the associated performance gain, and so this has been left as an exercise for the reader.

How else could this code be speeded up? Well, the SMARTS matcher can itself be improved. It currently uses an exhaustive depth-first search algorithm instead of something more optimal like Vflib2. This would improve performance across the board as the SMARTS code is widely used for a variety of tasks. Alternatively, the SMARTS patterns could be fingerprinted based on particular common patterns, e.g. 5-membered rings. If a molecule had no 5-membered rings, such patterns could be skipped.
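The fingerprinting idea might look something like the following sketch. Everything here is an assumption made for illustration: the "fingerprint" is simply the set of ring sizes a template requires, and the names `prescreen` and `toy_templates` are hypothetical rather than anything in OpenBabel.

```python
def prescreen(templates, mol_ring_sizes):
    """Keep only templates whose required ring sizes all occur in the molecule.

    templates:      list of (name, required_ring_sizes) pairs.
    mol_ring_sizes: set of ring sizes present in the molecule.
    """
    return [name for name, required in templates if required <= mol_ring_sizes]


toy_templates = [
    ("indole_like", frozenset({5, 6})),
    ("cyclopentane_like", frozenset({5})),
    ("naphthalene_like", frozenset({6})),
]

# A molecule containing only 6-membered rings never needs the first two patterns.
candidates = prescreen(toy_templates, {6})
print(candidates)
```

The set comparison is far cheaper than a substructure search, so the screen pays for itself as long as it rules out a reasonable fraction of the templates.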

To begin with, though, the code should be profiled more precisely. It may be that 25 of those 30ms have nothing to do with this loop. In that case, further optimisations may be more work than they are worth.
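As a rough illustration of that first step, the sketch below measures what share of the total run time is spent in one section of code. It uses Python stand-ins (`template_loop`, `everything_else`) for what would really be C++ routines, so the names and workloads are hypothetical; a proper profiler would give a finer breakdown.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

timings = defaultdict(float)  # label -> accumulated seconds


@contextmanager
def timed(label):
    """Add the wall-clock time spent inside the block to timings[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[label] += time.perf_counter() - start


def template_loop():
    # Stand-in for the template-matching loop.
    return sum(i * i for i in range(20000))


def everything_else():
    # Stand-in for the rest of structure generation.
    return sum(i * i for i in range(50000))


def generate_structure():
    with timed("total"):
        with timed("template loop"):
            template_loop()
        everything_else()


for _ in range(20):
    generate_structure()

share = timings["template loop"] / timings["total"]
print(f"template loop: {share:.0%} of total time")
```

If the share turns out to be small, the loop is not where the time goes and further work on it is unlikely to be worth the effort.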

These are the sorts of small studies that would fit nicely into a summer project for an undergrad computer science or chemistry student. If you want to sponsor OpenBabel development in this way, contact us.

Image credit: jpctalbot

Wednesday, 27 May 2009

The RSC - Value for money?

I don't usually advertise for chemical societies, but in these recessionary times I thought the following might be of interest to some readers.

RSC members have:

  • free access to Wiley, Elsevier, and Springer chemistry journals
  • free access to 913 chemistry e-books from a variety of sources
  • 20% off Pearson Education Books, 30% off Wiley, 35% off Blackwell
  • and most importantly, £5 off Pizza Express Club membership

Sure, chemistry societies organise conferences, enable networking, provide travel grants, and lobby politicians; but any society that doesn't look after its most vulnerable members by providing discounted pizza is not a society I want to be a member of.

Thursday, 21 May 2009

Have your hamburger and eat it - Edit molecules in PDFs II

In Part I, I showed how to hack some code together that allowed you to paste images directly from the clipboard (e.g. from a PDF) into Beda's BKChem, a 2D drawing program. The magic conversion from image to chemical was done by Igor's OSRA.

Well, Igor has taken this idea and run with it. The latest version of OSRA now includes plugins for BKChem, Symyx Draw, MolSketch and Pipeline Pilot.

If you use the Windows installer, the Symyx Draw plugin is automatically installed and adds an "Import Structures from OSRA" option to the File menu. The first time you choose it, you will need to change the path to something like "C:\Program Files\osra\osra.exe" under "Settings...". Here's the plugin in action:

Note that the other plugins appear to be only available from the Windows .zip release.

Saturday, 16 May 2009

How do enzyme mechanisms evolve?

Evolution is a fascinating topic. Although the principal mechanism by which evolution occurs is quite simple to understand, namely the introduction of changes (mutations) into the DNA, the consequences that follow are enormous.

The term selective pressure is used to describe an imaginary operator that affects the incidence of particular mutations in a population. What makes evolution difficult for me to get my head around is that selection operates on many levels. In a population, a particular physical characteristic might be more advantageous (think of the famous finches) or more attractive. In your DNA, a particular mutation might preserve the amino acid coded for, or it may change it to another amino acid that does not affect the protein's function. On the other hand, if the amino acid is involved in the catalytic action of the protein it's going to be conserved, right? But then how do new mechanisms evolve?

My former postdoc supervisor, Dr. John Mitchell, is currently advertising a PhD position on "Modelling the Evolution of Enzyme Catalysis" at the University of St. Andrews. I'm particularly interested in this project as it builds on earlier work I carried out in the Mitchell Group along with Gemma Holliday and Daniel Almonacid. Here's an excerpt from the project description:

We will create a simulation using a population of model enzyme-catalysed reactions, mimicking a state early in evolutionary history, and allow them to evolve in EC space. The reactions will consist of steps and be represented, in a manner familiar from genetic algorithms, by "chromosomes" describing the chemical properties of each step. Parameters will control the likelihood of different kinds of evolutionary event, such as a change of substrate with the same underlying chemical mechanism, taking place. The simulations will be calibrated, and then compared with the results from a study of real-world convergent and divergent evolution.

Cool. Closing date 31 July.

Image credit: Colin Purrington