Friday, 5 June 2009

The best time to optimise

As a scientist, I worry more about bugs in software than about speed. Changing correct code to improve speed can introduce errors as well as make it unreadable for others. Sometimes though it's nice to find cases where simple changes can improve the performance.

The 3D structure generation code in OpenBabel uses templates to handle the geometry of rings. There are about 2500 templates, which are represented by SMARTS patterns and associated coordinates (see fragments.txt in the distribution). The SMARTS patterns are ordered from large to small. Now, testing 2500 SMARTS patterns against a molecule takes a wee while so I was interested in seeing whether the process could be speeded up.

To begin with, I timed the code for a test set of 1000 PubChem molecules: it took 60ms per structure. Considering that the easiest way to speed something up is to avoid doing it in the first place, I changed the loop to terminate once all ring atoms had been matched. This brought it down to 38ms per structure. Then I changed it so that it skipped any SMARTS patterns that had more atoms than the number of ring atoms in the molecule: now down to 30ms. This is now within an order of magnitude of greased lightning.

In fact, I could have done slightly better than this; I could have skipped any SMARTS patterns with more atoms than the number of atoms in the largest isolated ring system in the molecule. Calculating this value is a bit of work though and may offset the associated performance gain, and so this has been left as an exercise for the reader.

How else could this code be speeded up? Well, the SMARTS matcher can itself be improved. It currently uses an exhaustive depth-first search algorithm instead of something more optimal like Vflib2. This would improve performance across the board as the SMARTS code is widely used for a variety of tasks. Alternatively, the SMARTS patterns could be fingerprinted based on particular common patterns, e.g. 5-membered rings. If a molecule had no 5-membered rings, such patterns could be skipped.

To begin with, though, the code should be profiled more precisely. It may be that 25 of those 30ms have nothing to do with this loop. In that case, further optimisations may be more work than they are worth.

These are the sorts of small studies that would fit nicely into a summer project for an undergrad computer science or chemistry student. If you want to sponsor OpenBabel development in this way, contact us.

Image credit: jpctalbot

No comments: