Recently, Avogadro/OpenBabel have been increasing their support for computational chemistry log files. I am hoping that they will learn from our experience at GaussSum/cclib.
GaussSum was the first Python program I ever wrote, and it still bears the hallmarks. When I first started GaussSum (a program which analyses the results of comp chem calculations), I would use the test cases from users to fix bugs. Then over time, I'd lose the test cases as I moved from computer to computer. I couldn't place the test cases in my version control system as they might have been the results of someone's research, and that person mightn't be happy to see them publicly available.
Things came to a head when dealing with the parsing of vibrational frequencies in the various versions of GAMESS. It turned out that each version of GAMESS (PC-GAMESS, WinGAMESS and GAMESS US) had slightly different output for vibrational frequencies. I ended up bouncing between code that worked for WinGAMESS but not GAMESS US and vice versa, depending on who sent me the last bug report. In other words, I was wasting my time fixing bugs which might reappear later. It was around this time that I realised (a) I needed a test suite, and (b) I needed public domain test files, so I could use them in my test suite.
The parser used by GaussSum is now available as a separate project, cclib, and is developed in collaboration with Adam Tenderholt and Karol Langner. This time I put a lot of thought into the test suite, and I think we've done very well. The parsers are initially developed using a set of calculations which are the same for each comp chem package; our test suite ensures that the same results are found in each case and that the units are consistent. We only fix bugs for which a public domain test file is provided ("I place this file in the public domain" is all we need to hear), and regression tests are easily added to the test suite. Our test suite has the final say on commits; commits are reverted if they cause an existing test to fail. This guarantees that cclib can only improve over time.
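To make the idea concrete, here is a minimal sketch of that kind of cross-package consistency test. It is not cclib's actual test suite: the two log layouts, the parser names (parse_a, parse_b) and the frequency values are all made up for illustration. The point is only that every parser is run on "the same" calculation and checked against a single reference.

```python
import re
import unittest

# Two hypothetical log excerpts for the same calculation, written in
# different layouts (stand-ins for different comp chem packages).
LOG_A = """\
 FREQUENCY:   1595.20   3657.10   3756.40
"""
LOG_B = """\
 Mode 1  freq = 1595.2 cm-1
 Mode 2  freq = 3657.1 cm-1
 Mode 3  freq = 3756.4 cm-1
"""


def parse_a(text):
    """Hypothetical parser: all frequencies on one 'FREQUENCY:' line."""
    freqs = []
    for line in text.splitlines():
        if line.strip().startswith("FREQUENCY:"):
            freqs.extend(float(x) for x in line.split()[1:])
    return freqs


def parse_b(text):
    """Hypothetical parser: one 'freq = ... cm-1' entry per line."""
    return [float(m) for m in re.findall(r"freq =\s*([-\d.]+) cm-1", text)]


# Every parser is paired with its own log file for the same calculation.
PARSERS = {"format_a": (parse_a, LOG_A), "format_b": (parse_b, LOG_B)}


class TestVibFreqs(unittest.TestCase):
    def test_same_frequencies_from_every_format(self):
        reference = [1595.2, 3657.1, 3756.4]  # cm-1, illustrative values
        for name, (parser, log) in PARSERS.items():
            with self.subTest(fmt=name):
                freqs = parser(log)
                self.assertEqual(len(freqs), len(reference))
                for got, want in zip(freqs, reference):
                    self.assertAlmostEqual(got, want, places=1)
```

Saved as a module, this runs with python -m unittest. A regression test for a newly reported bug is then just one more (parser, log file) pair, provided the log file has been placed in the public domain.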
The inevitable consequence of this policy is that some reported bugs don't get fixed. Sometimes the reporter simply does not respond to the request to place the file in the public domain. On two occasions, the reporter was working in a pharmaceutical company and felt it was more hassle than it was worth to do the necessary paperwork. So it goes... On the other hand, we do now have a set of more than 200 comp chem log files which go a long way to ensuring that our parsers can handle anything that is thrown at them. The best way of getting these files is to check the data directory of cclib out of subversion and run wget.sh.
In conclusion, if you are thinking of writing software that handles comp chem files, either try to collaborate with others who are working on the same problem (e.g. cclib or OpenBabel), or at the very least take into account some of the comments here. Otherwise, you are simply building a house of cards.
7 comments:
Just want to thank you Noel for bringing this perspective to my attention two years ago. Trying to parse all sorts of output files (with no testing) is exactly what I was doing before I got on board the cclib team.
Me too, Karol, me too :-)
I think we're building larger databases of test files. Jmol has some, Open Babel has some, and now there's a Blue Obelisk repository:
http://blueobelisk.svn.sourceforge.net/viewvc/blueobelisk/ctfr/trunk/
I think it's probably a good idea to merge all our test files into the Blue Obelisk one -- much better to have a large, combined database.
Incidentally, I certainly learned from your experience. We have the same problem with Open Babel bugs. For example, someone reports "adding hydrogens fails", but you can't get the actual structure to sort out the bug.
Or my favorite: an e-mail giving me a detailed discussion of stereochemistry issues in Open Babel, from someone who wouldn't contribute code or test files.
Geoff, how would we go about combining the cclib test files with the BO database?
Just a point to note - at cclib, we store the test files on the SF web server rather than in the version control system. This is because they are quite big, even zipped. I'm not sure what the limit is for SVN at SF, but we should avoid maxing it out needlessly. We could simply check our download script into the BO repo, so that running the script will download all the files from cclib...(?)
My question, of course, concerned the technical aspects... which Noel has started to consider. Perhaps it would be better to continue the discussion on some mailing list.