Wednesday, 1 January 2014

QM Speed Test: NWChem

When I started the QM Speed Test I was afraid that I'd lose steam (and face!) before getting too far along. At the same time, I hoped that others might be inspired to send in some results for their systems and QM packages. This would make it easier for me (since half the work is setting up the input file), verify that my results were not pure invention, and extend the speed test to packages to which I don't have access.

I've just checked in the results for NWChem 6.3 compiled on CentOS (also 6.3). But in the meantime, others have shot ahead by forking the speed test and checking in their own results:
  • Eric Berquist, a grad student at the Uni of Pittsburgh, has checked in results for Dalton, ORCA, and Q-Chem.
  • Michael Banck, Debichem guru, has checked in results for all those packages available on Debian 7 (and presumably Ubuntu and other derivatives), namely NWChem, MPQC, and PSI4, as well as ERKALE.

Dudes, I salute you.

Let's start with my results. Rather than repeat the table here, you can check out the figures on the GitHub page. The summary is that NWChem is quite a bit faster at geo-opts than ERKALE. This is down to a combination of more steps in ERKALE's geo-opt and faster energy calculations in NWChem. So if you are using some of ERKALE's other features, it would probably make sense to geo-opt first with NWChem. I did recompile ERKALE with "-O3 -ffast-math -funroll-loops -fPIC -march=native" and got a 10% improvement, but that's not enough to change things.

From the other calculations, what do we now know? Well, all of the packages tested can actually do geo-opts at the HF/6-31G and B3LYP/6-31G levels of theory. Figuring this out by looking at their websites is not always as easy as you might think. Also worth noting is the relative slowdown when using B3LYP versus HF. In some cases the slowdown is of the order of 30-40%; in other cases the DFT calculations are twice as slow.
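
To make the B3LYP-versus-HF comparison concrete, here is a minimal Python sketch of how the relative slowdown can be computed from two wall-clock times. The function name and the timings are made up for illustration; they are not taken from anyone's checked-in results.

```python
def relative_slowdown(hf_seconds, dft_seconds):
    """Return the fractional slowdown of a DFT run relative to an HF run.

    0.35 means the DFT job took 35% longer; 1.0 means it took twice as long.
    """
    if hf_seconds <= 0:
        raise ValueError("HF time must be positive")
    return (dft_seconds - hf_seconds) / hf_seconds


# Hypothetical timings in seconds (illustrative only, not measured results).
example_timings = {
    "package A": (100.0, 135.0),
    "package B": (100.0, 200.0),
}

for name, (hf, dft) in example_timings.items():
    print(f"{name}: B3LYP is {relative_slowdown(hf, dft):.0%} slower than HF")
```

Expressing the slowdown relative to the HF time keeps the numbers comparable between packages even when their absolute timings differ a lot.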

Michael's results show that NWChem is faster than any of the other Open Source QM packages tested. Very useful information; again, it points at a large time-saving by using NWChem for the initial geo-opt before turning to whatever package you need for the subsequent analysis. Eric has compared several proprietary packages (I was going to say commercial but I can't tell if ORCA is available for purchase) and Q-Chem is much faster than ORCA and Dalton.

What we haven't looked at is scaling. Maybe the orders all change when we double the size of the molecule and increase the basis set. But that's a question for another day. Right now, the goal is to get as much breadth as possible on the original question.

Notes on compiling NWChem

1. Boy, were there some quirks with this one. First of all, although the instructions on the website give everything in terms of the C shell (remember csh? Big on Unix, not so big these days), you can just use the bash equivalents.
2. I wanted only single-CPU calculations so I tried to turn off MPI support. My first attempt, "USE_MPI=n", caused build failures, and the only way to recover from a failed build appeared to be to delete the folder and untar the source again. Following the instructions in this post, I used "export ARMCI_NETWORK=SOCKETS" instead.
3. My favourite quirk was that when you untar the source, the generated folder has a really long name, which after some time causes the compile to fail as follows:

    The directory name chosen for NWCHEM_TOP is longer than
    the maximum allowed value of 65 characters
    current NWCHEM_TOP=/home/noel/Tools/Quantum/nwchem-6.3.revision2-src.2013-10-17/src

   Renaming to "nwchem" also failed (a different error; it couldn't find something or other, I think). Renaming to "nwchem-6.3" worked.
4. I couldn't figure out how to get NWChem to use the system LAPACK library, but got it to link to BLAS as described in the docs.
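
As a rough illustration of the csh-to-bash point in note 1, the sketch below rewrites csh "setenv NAME value" lines as bash "export NAME=value" lines. It is written in Python to keep it self-contained. The helper name csh_to_bash and the example path are hypothetical, and it only handles plain setenv lines with a simple value (no spaces, quoting, or other csh syntax).

```python
def csh_to_bash(line):
    """Rewrite a csh 'setenv NAME value' line as a bash 'export NAME=value' line.

    Lines that are not simple three-part setenv statements are returned unchanged.
    """
    parts = line.strip().split(None, 2)
    if len(parts) == 3 and parts[0] == "setenv":
        _, name, value = parts
        return f"export {name}={value}"
    return line


csh_lines = [
    "setenv NWCHEM_TOP /path/to/nwchem-6.3",  # hypothetical path
    "setenv ARMCI_NETWORK SOCKETS",
]

for line in csh_lines:
    print(csh_to_bash(line))
```

For a handful of variables it is just as easy to retype them by hand; the point is only that the mapping from setenv to export is mechanical.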

15 comments:

  1. Both ORCA and Dalton are free for academic usage. ORCA requires accepting a license, but only binaries are distributed, with no access to the source. Dalton requires a signed (!) license agreement, but is only distributed as source.

  2. Thanks for the details. I'll try to get licenses for all of these.

  3. A useful further benchmark might be a calculation that involves 2nd derivatives (frequencies). These are tough to compute quickly and to do so is essential for transition states and intrinsic reaction coordinates.

    My gut feeling is that there will be a vast difference in this (depending on the Hamiltonian): some codes only have finite-difference 2nd derivatives available (which take an age, and scale horribly as the molecule gets larger).

    It would be important when running a benchmark to include a reasonably large molecule (~100 atoms?), since an important aspect of the benchmark is how it scales with molecule size. And 2nd derivatives can potentially scale very badly.

    I can guess that two codes will beat all the others, but they are both commercial (and one is not allowed to publish benchmarks for one of these codes). But in a sense, for this type of calculation, you pays for what you gets.

  4. Another benchmark follow up might be to see how any code scales in parallel mode. Some scale linearly with the number of processors up to at least 16 processors, but it is desirable for this to extend up to say 64 processors, which are becoming more common.

    Another crucial aspect is how the benchmark scales with available memory. Obviously, the more memory the better (and some configurations now have access to 96 Gbytes), but how does the code cope on lower memory systems?

  5. @Eric Berquist. Could you try to repeat the calcs with LSDalton also, in case you've built that as well?

  6. Noel, if you are going for non-free/proprietary packages, I suggest also comparing GAMESS (probably the US version, as the UK version is really only accessible to UK residents I think), which appears to be rather popular overall.

  7. @Eric Berquist: the DALTON/ORCA results you have posted are using VWN5 for the LYP correlation functional, not VWN3.

    VWN5 is generally considered more correct; however, the B3LYP functional in Gaussian uses VWN3, so most other codes use that as well by default.

    Note the rather large energy difference between the DALTON/ORCA results and Q-Chem. Not sure whether those packages allow for an explicit tuning of the VWN variant.

    (Erkale and NWChem both use VWN3 by default)

  8. At least Dalton does. From the manual:

    B3LYPg
    Hybrid functional with VWN3 form used for correlation. This is the form used by
    the Gaussian quantum chemistry program. Keyword B3LYPGauss is a synonym for B3LYPg. This functional can be explicitly set up by the

    Combine HF=0.2 Slater=0.8 Becke=0.72 LYP=0.81 VWN3=0.19

    option in the .DFT section, cf. the manual.

  9. @everyone: I love the peer reviewing of the QM. Seriously. Open Notebook Science or what! :-)

    @Henry: The scope of this benchmark is deliberately constrained to make the barrier to entry fairly easy (small molecule, small basis, etc.). If successful, we can start looking at scaling and other effects that you mention.

    However, I am specifically targeting serial calculations, because I feel that speed for serial calculations has been forgotten in the rush to massively scale QM for supercomputers (see my initial post).

    @Michael: I hope to try as many as possible of the comp chem packages listed on the Wikipedia page.

  10. @Noel: I got my hands on GAMESS(US), MOLPRO, Turbomole and some other proprietary package, as well as NWChem for reference, via a login at a compute facility. Will report back.

  11. I also ran the inputs through several programs. Noel, should we send you pull requests? :)

  12. Great. Just to note that I'm not merging the forks. We are basically using the repos as independent records of people's calculations.

  13. Why not try Firefly? I would love to see speed tests of Firefly vs. other free software, as its authors claim it is one of the fastest codes available.

  14. Hello to all learned people here.
    I know I am joining too late. I just came across this blog of @Noel Sir. I work specifically with FIREFLY and have done a few tests on single-point and Hessian calculations.
    I found that for single-point calculations a single core works most efficiently, while for Hessians multiple cores perform better (though the performance boost is not substantial). In general a single core seems the best bet, considering the RAM, even for large molecules.
    I haven't yet tested any GPU-based calculations though. Any reviews on that would be great.
