Monday 29 June 2009

I'll fix the bug...but only if you give me a public domain test file

Recently, Avogadro/OpenBabel have been increasing their support for computational chemistry log files. I am hoping that they will learn from our experience at GaussSum/cclib.

GaussSum was the first Python program I ever wrote, and still bears the hallmarks. When I first started GaussSum (a program which analyses the results of comp chem calculations), I would use the test cases from users to fix bugs. Then over time, I'd lose the test cases as I moved from computer to computer. I couldn't place the test cases in my version control system as the test cases might have been the results of someone's research, and they mightn't be happy to see them publicly available.

Things came to a head when dealing with the parsing of vibrational frequencies in the various versions of GAMESS. It turned out that each version of GAMESS (PC-GAMESS, WinGAMESS and GAMESS US) had slightly different output for vibrational frequencies. I ended up bouncing between code that worked for WinGAMESS but not GAMESS and vice versa, depending on who sent me the last bug report. In other words, I was wasting my time fixing bugs which might reappear later. It was around this time that I realised that (a) I needed a test suite, and (b) I needed public domain test files that I could put in it.

The parser used by GaussSum is now available as a separate project, cclib, and is developed in collaboration with Adam Tenderholt and Karol Langner. This time I put a lot of thought into the test suite, and I think we've done very well. The parsers are initially developed using a set of calculations which are the same for each comp chem package; our test suite ensures that the same results are found in each case and that the units are consistent. We only fix bugs for which a public domain test file is provided ("I place this file in the public domain" is all we need to hear), and regression tests are easily added to the test suite. Our test suite has the final say on commits; commits are reverted if they cause an existing test to fail. This guarantees that cclib can only improve over time.
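As a rough illustration of the idea (not cclib's actual test code), a regression test of this kind might look like the sketch below. The two log layouts, the `parse_frequencies` helper and the expected values are all invented for the example; they stand in for contributed public domain files and a real parser.

```python
import re
import unittest

# Two invented layouts for the same calculation, standing in for public
# domain test files from different program versions (not real GAMESS output).
LOG_VARIANT_A = "FREQUENCY:      1595.32    3657.05    3755.93\n"
LOG_VARIANT_B = "FREQUENCY:   1595.32 CM-1   3657.05 CM-1   3755.93 CM-1\n"

EXPECTED = [1595.32, 3657.05, 3755.93]  # cm-1, the same for every variant


def parse_frequencies(text):
    """Toy parser: collect the decimal numbers on each FREQUENCY line."""
    freqs = []
    for line in text.splitlines():
        if line.startswith("FREQUENCY:"):
            freqs.extend(float(x) for x in re.findall(r"\d+\.\d+", line))
    return freqs


class VibFreqRegressionTest(unittest.TestCase):
    """One test per contributed file, so a fix for B cannot silently break A."""

    def test_variant_a(self):
        self.assertEqual(parse_frequencies(LOG_VARIANT_A), EXPECTED)

    def test_variant_b(self):
        self.assertEqual(parse_frequencies(LOG_VARIANT_B), EXPECTED)
```

Saved as a module, this can be run with `python -m unittest`. Because every contributed file stays in the suite, a change that fixes one variant but breaks another fails immediately.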

The inevitable consequence of this policy is that some reported bugs don't get fixed. Sometimes the reporter simply does not respond to the request to place the file in the public domain. On two occasions, the reporter was working in a pharmaceutical company and felt that the paperwork needed to release the file was more hassle than it was worth. So it goes... On the other hand, we do now have a set of more than 200 comp chem log files which go a long way to ensuring that our parsers can handle anything that is thrown at them. The best way of getting these files is to check the data directory of cclib out of subversion and run wget.sh.

In conclusion, if you are thinking of writing software that handles comp chem files, either try to collaborate with others who are working on the same problem (e.g. cclib or OpenBabel), or at the very least take into account some of the comments here. Otherwise, you are simply building a house of cards.

Friday 19 June 2009

Using PyActiveResource to access ChemCaster

ChemCaster, from Rich's Metamolecular, is a platform for developing web-based cheminformatics applications. The advantage of such a system is that the user does not need to install any special software, nor does the application developer need to maintain a server.

Rich invited me to take it for a spin, so I signed up for a trial account and moved quickly on to my first problem: how do I access the API through Python?

It turns out that RESTful APIs tend to have common patterns, a fact which is taken advantage of by Active Resource, a Ruby library for defining classes which directly map onto the objects implied by a RESTful API. Or something like that - I neglected to read any documentation. Instead I just took Rich's example and tried to code it up in Python using PyActiveResource (this is a documentation-free project so using it is quite exciting).

Et voilà

Tuesday 9 June 2009

From zero to Zotero - One man's journey out of PDF hell

Zotero is reference management software. Sorry, let me correct that - Zotero is THE reference management software. I had tried Zotero before, and it certainly looked good; but frankly I couldn't figure out how to get it to work and so reverted to my usual system, the 'zero' of the title. Hearing the news that Endnote vs. Zotero was just thrown out of court, I decided to try it again.

And it's just amazing.

Let me begin by describing a typical workflow:
(1) Go to the summary page for an ACS paper online
(2) Click on the icon that appears in the address bar (looks like a sheet of paper with writing).

That's it. You've just saved the PDF, the HTML full-text and the paper's metadata.

If you've created an account on zotero.org (free of course!), you can sync your library so that multiple computers can share the same data. And best of all you can also sync the attachments (i.e. PDFs, HTML pages) if you have a WebDAV account (e.g. from your university or in my case, JungleDisk Plus/Amazon S3). If that wasn't enough, it also integrates with Word to make it easy to prepare a publication. (Update: I hadn't tested this when I first wrote this post, but it works just fine; you first need to install the bibliographic styles you need from Zotero settings/Preferences/Styles/Get additional styles.)

In other words, Zotero makes it easy to download papers, back them up, make them accessible from any computer and reference them in papers.

Zotero is open source and freely available from www.zotero.org.

Notes:
(1) I'm using Zotero 2.0b5.
(2) In the Zotero preferences (click on the gear icon), choose "Automatically attach PDFs and other files when saving items" in the General tab.
(3) JungleDisk and Amazon cost money (we're talking around $1.50 a month), but there may be free alternatives for WebDAV.
(4) For any websites that aren't currently supported by Zotero, adding new translators has been made easy. All of the JavaScript files for the translators are stored in a folder on your computer and can easily be extended or added to. That said, I've had no trouble downloading PDFs from ScienceDirect, ACS, RSC, Wiley or BMC.

Image credit: jazzmodeus

Friday 5 June 2009

The best time to optimise

As a scientist, I worry more about bugs in software than about speed. Changing correct code to improve speed can introduce errors as well as make it unreadable for others. Sometimes though it's nice to find cases where simple changes can improve the performance.

The 3D structure generation code in OpenBabel uses templates to handle the geometry of rings. There are about 2500 templates, which are represented by SMARTS patterns and associated coordinates (see fragments.txt in the distribution). The SMARTS patterns are ordered from large to small. Now, testing 2500 SMARTS patterns against a molecule takes a wee while so I was interested in seeing whether the process could be speeded up.

To begin with, I timed the code for a test set of 1000 PubChem molecules: it took 60ms per structure. Considering that the easiest way to speed something up is to avoid doing it in the first place, I changed the loop to terminate once all ring atoms had been matched. This brought it down to 38ms per structure. Then I changed it so that it skipped any SMARTS patterns that had more atoms than the number of ring atoms in the molecule: now down to 30ms. This is now within an order of magnitude of greased lightning.
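The two shortcuts can be sketched in a few lines of Python. This is a toy model, not the OpenBabel code (which is C++): the `Template` class, the `toy_match` stand-in for the SMARTS matcher and the example data are all assumptions made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Template:
    """Stand-in for a ring template (a SMARTS pattern plus coordinates)."""
    name: str
    atoms: frozenset  # atom indices this toy template covers when it matches

    @property
    def natoms(self):
        return len(self.atoms)


def toy_match(template, ring_atoms):
    """Placeholder for the expensive SMARTS match: a plain subset test."""
    return template.atoms <= ring_atoms


def assign_templates(templates, ring_atoms):
    """Apply templates (ordered large to small) to a molecule's ring atoms.

    Returns the names of the templates used and the number of match calls.
    """
    unmatched = set(ring_atoms)
    used, calls = [], 0
    for template in templates:
        if not unmatched:
            break  # every ring atom is covered: stop early
        if template.natoms > len(ring_atoms):
            continue  # more atoms than there are ring atoms: cannot match
        calls += 1
        if toy_match(template, ring_atoms) and template.atoms & unmatched:
            used.append(template.name)
            unmatched -= template.atoms
    return used, calls


TEMPLATES = [
    Template("big", frozenset(range(10))),
    Template("six", frozenset(range(6))),
    Template("five", frozenset(range(5))),
    Template("three", frozenset(range(3))),
]

print(assign_templates(TEMPLATES, set(range(6))))  # (['six'], 1)
```

Of the four toy templates, only one is actually handed to the matcher: "big" is skipped on size alone, and the loop stops before "five" and "three" because no ring atoms remain unmatched.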

In fact, I could have done slightly better than this; I could have skipped any SMARTS patterns with more atoms than the number of atoms in the largest isolated ring system in the molecule. Calculating this value is a bit of work though and may offset the associated performance gain, and so this has been left as an exercise for the reader.

How else could this code be speeded up? Well, the SMARTS matcher can itself be improved. It currently uses an exhaustive depth-first search algorithm instead of something more efficient like Vflib2. This would improve performance across the board as the SMARTS code is widely used for a variety of tasks. Alternatively, the SMARTS patterns could be fingerprinted based on particular common patterns, e.g. 5-membered rings. If a molecule had no 5-membered rings, such patterns could be skipped.
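The fingerprint idea might be sketched as follows, using ring sizes as the only feature. The bit-mask encoding, the function names and the pattern names are assumptions for illustration; a real screen would need ring sizes to be perceived in the same way for patterns and molecules.

```python
def ring_size_fingerprint(ring_sizes):
    """Encode ring sizes as a bit mask (bit n set = an n-membered ring is present)."""
    bits = 0
    for size in ring_sizes:
        bits |= 1 << size
    return bits


def may_match(pattern_fp, molecule_fp):
    """A pattern can only match if every bit it needs is set in the molecule."""
    return (pattern_fp & molecule_fp) == pattern_fp


# Hypothetical patterns, fingerprinted once up front.
PATTERNS = {
    "five-ring": ring_size_fingerprint([5]),
    "six-ring": ring_size_fingerprint([6]),
    "fused-five-six": ring_size_fingerprint([5, 6]),
}

# A molecule with two 6-membered rings and no 5-membered ring.
molecule_fp = ring_size_fingerprint([6, 6])

candidates = [name for name, fp in PATTERNS.items() if may_match(fp, molecule_fp)]
print(candidates)  # ['six-ring']
```

The screen is only a cheap necessary condition: patterns that pass it still have to go through the full SMARTS match, but those that fail it can be skipped without one.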

To begin with, though, the code should be profiled more precisely. It may be that 25 of those 30ms have nothing to do with this loop. In that case, further optimisations may be more work than they are worth.
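For what it's worth, this is the kind of question a profiler answers directly. OpenBabel itself is C++ and would need a native profiler, so the sketch below only shows the idea in Python using the standard library's cProfile; `match_templates` and `generate_structure` are made-up stand-ins, not real OpenBabel functions.

```python
import cProfile
import io
import pstats


def match_templates(n):
    """Stand-in for the template-matching loop."""
    return sum(i * i for i in range(n))


def generate_structure(n=20000):
    """Stand-in for 3D structure generation, which calls the loop above."""
    return match_templates(n) + sum(range(n))


def profile_report(func, top=5):
    """Run func under cProfile and return a report sorted by cumulative time."""
    profiler = cProfile.Profile()
    profiler.enable()
    func()
    profiler.disable()
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(top)
    return buffer.getvalue()


print(profile_report(generate_structure))
```

The cumulative-time column shows how much of the total is spent inside the loop as opposed to everything around it, which is what decides whether further work on the loop is worthwhile.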

These are the sorts of small studies that would fit nicely into a summer project for an undergrad computer science or chemistry student. If you want to sponsor OpenBabel development in this way, contact us.

Image credit: jpctalbot