Sunday 23 July 2017

Faster toolkit, faster!

After an extended hiatus, I've been back doing a bit of work on Open Babel, and specifically I've been working on improving performance. Basic reading and writing of molecules should be limited by disk I/O speed and not CPU, but this is not (yet) the case with Open Babel. Through a combination of avoiding unneccessary work, trimming inefficiencies in existing code and using more efficient algorithms, I'm hoping to speed things up.

1. Replace slower algorithms with faster algorithms.

This is perhaps the hardest one, as it takes time to get grips with the existing code and figure out whether and where it can be improved. So far the only thing I've done in this line is replace our previous kekulisation procedure with a perfect matching algorithm. Though also, I guess, in this category are the changes I made to replace the original set of SMARTS patterns for aromaticity 'atom-typing' with logic to do the typing directly in code.

2. Streamline existing code.

This can be tedious and doesn't give a big win, but it's a case of avoiding what Roger refers to as Death by a Thousand Cuts. Individually, they don't count for much (and Stack Overflow would have you believe you shouldn't worry about them), but you should think of reading a molecule as the inner loop and consider that it might be done millions of times (e.g. when processing ChEMBL).

In particular, in the context of file format reading and writing, unnecessary string copies are to be avoided. This can be everything from a function that takes a std::string as a parameter, copying part of the input buffer unneccessarily, or concatenating strings with strcat.

3. Avoid unneccesary work by considering the OBMol to be a container for the contents of the file format.

This is a roundabout way of saying that the file format reader should not worry too much about the chemical content of the described molecule and shouldn't spend time checking and validating it. If there's a carbon atom with a +20 charge, fine. If there's a septuple bond between hydrogens, sure, go right ahead. Just read it in and bung it in an OBMol. That's not to say that there isn't a role for validation, but it should be an option as it takes time, may be completely unneccessary (e.g. you have just written out this molecule yourself) and is, strictly speaking, distinct from file format reading.

We already did this to a certain extent, but we didn't follow through completely. If the user said an atom was aromatic, there was no way to preserve this and avoid reperception. This has now been fixed in the current master, and the SMILES reader has an option to preserve aromaticity. Similarly, we currently reperceive stereocenters rather than accept them at face value as present in SMILES for example. This is next on my list of things to change.

Related pull requests:
* Improve performance of element handling
* Improve performance of SMILES parser
* Keep count of implicit hydrogens instead of inferring them
* Change the OBAromTyper from using SMARTS patterns to a switch statement

Image credit:
Renato Carvalho on Flickr