Noel O'Blog: 2014

Thursday, 7 August 2014

In memory of Jean-Claude Bradley Part II

Here is the talk I gave at the symposium in memory of Jean-Claude Bradley.

I start off with a description of an open notebook science experiment I did for JCB, a protein-ligand docking related to malaria. I was able to pull all the details from his wiki, the datasets, the method, the reasons for certain choices, the results and even the edit history.

Next I discuss his use of webservices to develop chemistry applications and show several examples from my past. Finally, I suggest that today JCB would use MineCraft instead of Second Life if he was looking for an immersive environment in which to build chemistry activities for students.

Chemistrify the Web from baoilleach

Thursday, 5 June 2014

In memory of Jean-Claude Bradley

This Autumn I will be attending an ACS Meeting in San Francisco for the second time. The first time was in 2010 when I co-organised a symposium with Jean-Claude Bradley and Andy Lang.

I was pretty nervous. I stumbled through some opening remarks before finding my feet and paying tribute to the memory of Warren DeLano, another pioneer of openness in chemistry. When Jean-Claude arrived the next day to chair the second session, I remember thinking wow, this guy is so relaxed and confident he can just turn up in bermuda shorts and a casual shirt and not worry about whether his tie is sending out the right signals - I wish I was like that.

Subsequently, I found out that it hadn't always been like that. He been like, well, everyone else: wearing suits every day eagerly trying to make a good impression, following the funding, playing the game. A day came when he tired of it, looked at what he was doing, and decided it was not going to make the world a better place. So he sat down and thought about how to identify what areas of chemistry were actually "useful":

The best answer I could come up with is to trust what human researchers have to say in their papers. I developed a set of search phrases such as "what is needed now" or "what is missing is" and ran them through Google Scholar and Scirus. One of the results was "there is a pressing need for identifying and developing new drug-based antimalarial therapies".

Reactive Reports #51, 2006 Interview with David Bradley.

...and that was how he started the Useful Chemistry project. You can see the genesis of the project in his initial blog posts.

Others have commented on his legacy in Open Notebook Science. For me, his story of starting Useful Chemistry was what impressed me most: how many of us have the courage to look at our work and ask ourselves, is it useful?

To pay tribute to his remarkable vision, I will be speaking at the Jean-Claude Bradley Memorial Symposium on July 14th, organised by Andy Lang, Tony Williams and Peter Murray-Rust in Cambridge, UK. I encourage you to come along to celebrate the work of an inspiring person and to hear how others are building on his legacy.

Tuesday, 6 May 2014

cclib 1.2 now available

On behalf of the cclib development team (namely Karol Langner and Adam Tenderholt who did all the work), I am pleased to announce the release of cclib 1.2, which is now available for download. This version marks the first stable release to target Python 3, and includes several new features and bug fixes (see below).

cclib is an open source library, written in Python, for parsing and interpreting the results of computational chemistry packages. It currently parses output files from ADF, Firefly, GAMESS (US), GAMESS-UK, Gaussian, Jaguar, Molpro and ORCA.

Among other data, cclib extracts:

coordinates and energies
information about geometry optimization
atomic orbital information
molecular orbital information
information on vibrational modes
the results of a TD-DFT calculation

cclib also provides some calculation methods for interpreting the electronic properties of molecules using analyses such as:

Mulliken and Lowdin population analyses
Overlap population analysis
Calculation of Mayer's bond orders

For information on how to use cclib, see the tutorial. If you need help, find a bug, want new features or have any questions, please send an email to our mailing list.

If your published work uses cclib, please support its development by citing the following article:
N. M. O'Boyle, A. L. Tenderholt, K. M. Langner, cclib: a library for package-independent computational chemistry algorithms, J. Comp. Chem. 2008, 29, 839-845

Major changes since cclib 1.1

Move project to github
Transition to Python 3 (Python 2.7 will still work)
Add a multifile mode to ccget script
Extract vibrational displacements for ORCA
Extract natural atom charges for Gaussian (Fedor Zhuravlev)
Updated test file versions to ADF2013.01, GAMESS-US 2012, Gaussian09, Molpro 2012 and ORCA 3.0.1
Many bugfixes, thanks to Scott McKechnie, Tamilmani S, Melchor Sanchez, Clyde Fare, Julien Idé and Andrew Warden

Saturday, 3 May 2014

Turning vim into an IDE: Part 2

Something that's really useful is syntax checking. It can save a lot of time (not to mention frustration) if you can find errors *before* you run your scripts. The Syntastic plugin for vim can be used for this. As far as I can tell, it doesn't do any syntax checking itself, but rather is the glue that links other syntax checkers to vim.

First things first, to simplify matters (*), let's rename _gvimrc to .vimrc (easier to use the commandline for this).

Next let's install Syntastic:

1. Create the folder "vimfiles" in your home directory if not already present, e.g. C:\Users\Noel\vimfiles
2. If you don't already have it, you need to install the Pathogen plugin (this plugin makes it easy to install other plugins):
a. Download the zip
b. Extract the zip into vimfiles (thus creating vimfiles/autoload/pathogen.vim)
c. Create a directory called bundle in vimfiles (this is where to put plugins managed by Pathogen)
d. Add "call pathogen#infect()" near the start of your .vimrc
3. Install Syntastic as described on its website:
a. cd C:\Users\Noel\vimfiles\bundle
b. git clone https://github.com/scrooloose/syntastic.git
c. Test by opening a file in gvim, and checking for error messages after typing ":Helptags"

Next, let's say we want to add some syntax checkers for Python. Your choices are pyflakes, pylint, flakes8 and Python itself. It's not either/or - you can include as many as you want. Note that flakes8 is a combination of pyflakes, pep8 (PEP8 style checker), and McCabe (cyclomatic complexity checker). I'm not so keen on style checkers and so I'm going to go just with pyflakes and Python.

1. If you don't have pip (the Python package manager), then install it
2. Install pyflakes with pip ("C:\Python27\Scripts\pip.exe install --upgrade pyflakes")
3. Add C:\Python27\Scripts to the front of your Windows PATH if not already present
3. Configure Syntastic by adding the following to .vimrc:
let g:syntastic_python_checkers = ['pyflakes', 'python']
let g:syntastic_auto_loc_list = 1
4. Open a Python file and check that pyflakes and python are listed when you type ":SyntasticInfo"

The screenshot at the top of the post shows the plugin in action. It's triggered every time you save a file.

* This is necessary because the call to pathogen#infect() is ignored if in _gvimrc (I don't know why). If instead we used _vimrc, then the _vimrc installed in C:\Program Files (x86)\_vimrc is ignored (and we miss CTRL+C/V behaviour for example). This leaves us with .vimrc.

Sunday, 20 April 2014

QM Speed Test: Firefly and wrap up

You can find my latest results, this time for Firefly, over on my Github page. The summary is that it's slightly faster than NWChem for HF calculations, but much faster for B3LYP calculations. I also tried using FreeON but although it appeared to compile fine, it didn't run correctly on the CentOS system I have been using for testing.

Someone raised the point that these are not like-for-like comparisons in terms of all of the program parameters (integration grid size, convergence criteria, etc, etc.). Pedro Silva investigated the effect of grid-size on the speed and results, and tried to normalise across GAMESS, Q-Chem and Firefly. However, our intention here is to give an idea of comparative speeds using the defaults as some sort of baseline, but also because it reflects typical use.

Unfortunately, I ran out of time to continue this project back in late Jan or so. The first victim of the time shortages were the Turbomole results. The folks at COSMOlogic came forward with an evaluation copy, but I simply ran out of time before having a chance to figure it out and use it. The "tmole" input files are already there though in the repo (from Michael Banck).

The others involved have been much more active. To see all of the results gathered so far, check out the Github pages of:

Eric Berquist - Dalton, MPQC, NWChem, ORCA, Q-Chem
Karol Langner - Dalton, GAMESS, Molpro, NWChem, ORCA
Michael Banck - Erkale, MPQC, NWChem, PSI4
Noel O'Boyle - Erkale, Firefly, NWChem
Pedro Silva - GAMESS, Firefly, Q-Chem

Where to from here? Well, nowhere for me, at least for the immediate future. But these input files are out there now, and if you want to compare different packages, they make a good starting point. An interesting question is whether the relative speeds change once you start looking at larger systems. I don't know the answer to that...maybe you can try finding out.

Thursday, 27 February 2014

Screencasting on Windows

I always find screencasting quite difficult to set up on Windows. Here are some notes on a recent successful attempt that uses the well-known ffmpeg library/command-line tool to record from a Direct Show screen capture device.

Set up the screen capture device

Install Screen Capture Recorder and configure by selecting the shortcut "configure by resizing a transparent window". Start broadcasting the desktop by choosing the shortcut "stream desktop local LAN", and "Start Normal 10 fps".

Capture the screen

Install 32-bit ffmpeg (even on 64-bit Windows) and capture as follows (press "q" to stop):

ffmpeg -f dshow -i video=screen-capture-recorder -r 10
       -y recording.mp4

Convert to the target format

For example, convert to .wmv and retain just the first 60 seconds (see "-ss" for skipping the start and "-q:v 10" to control the size/quality):

ffmpeg -i recording.mp4 -t 60 -b:v 1000k
       -vcodec wmv2 myfile.wmv

Monday, 3 February 2014

When Python goes wrong: Off-by-one

Many niggling problems with Python were fixed in Python 3, but here's one that recently bit me in the expunged. It relates to rounding in format strings and so is a potential pitfall for anyone handling floats.

The thing is, "%f" rounds differently than "%d".

>>> "%.1f" % 1.56 ### I expect 1.6
'1.6'
>>> "%d" % 1.6    ### By analogy, I expect 2
'1'

The correct solution is either "%.0f" (yes - it exists) or explicitly round it yourself how you like, e.g. "%d" % int(x+0.5) or using round().

This was causing off-by-one results in my code which I fortunately identified prior to submitting a manuscript. But I'm disappointed, Python; and don't blame C.

Note: As a further issue, if you need to handle negative values also, doublecheck the rounding behaviour of your code...it may not always be what you expect.

Friday, 31 January 2014

Turning vim into an IDE: Part 1

Probably like many other users of vim (or gvim), I have never taken the time to set it up so that it works like a modern IDE with auto-completion, syntax checking, and whatever else. I've decided to start sorting this out.

My starting point is the following _gvimrc (in C:\Users\noel) which sets the behaviour of the tab key and indentation for several different file formats. I also highlight whitespace at the end of lines so that I don't add it by accident.

autocmd FileType python set expandtab
autocmd FileType python set sw=4
autocmd FileType python set ts=4
autocmd FileType python set autoindent

autocmd FileType cmake set expandtab
autocmd FileType cmake set sw=2
autocmd FileType cmake set ts=2
autocmd FileType cmake set autoindent

autocmd FileType html set expandtab
autocmd FileType html set sw=2
autocmd FileType html set ts=2
autocmd FileType html set autoindent

autocmd FileType javascript set expandtab
autocmd FileType javascript set sw=2
autocmd FileType javascript set ts=2
autocmd FileType javascript set autoindent

autocmd FileType css set expandtab
autocmd FileType css set sw=2
autocmd FileType css set ts=2
autocmd FileType css set autoindent

" Highlight whitespace at end of line
highlight ExtraWhitespace ctermbg=red guibg=red
autocmd Syntax * syn match ExtraWhitespace /\s\+$\| \+\ze\t/

" Don't create those annoying backup files ending with ~
set nobackup

Testing webapps from Python with Selenium

I'm a great believer in tests to ensure code quality, although naturally I don't always practise what I preach. Recently I've been working on a webapp with a Python server backend. I can test the server independently of the webapp, but testing the webapp itself involves clicking on various things, waiting for the server to respond, and then checking the result. Doing this manually quickly becomes a chore and that's without even considering different browsers.

The solution is Python of course (as usual on this blog). The aptly-named Selenium allows you to automate this sort of thing from the point of view of writing tests. It's available in several languages (I think they all use the same server backend) but I'm going to focus here on the Python. "pip install selenium" on Windows installs the basics; this will allow you to test against Firefox and Safari. For Chrome and IE, you need to download two additional files and add them to the PATH.

The Selenium docs are a bit confusing because (a) they keep referring back to an earlier version of the software (Selenium 1) which had different behaviour and (b) the commands for the Selenium IDE are very similar but not the same as the commands for the Python bindings.

Anyhoo, here's a modified version of the Selenium script that I'm using just to give some idea of how it works. When you run this at the command-line, a new Firefox window pops up, resizes to the specified dimensions, flies through the test, shuts down, and then I get the usual unittest output from the Python script. Specifying a browser name (e.g. Chrome) runs the test against Chrome instead.

import sys
import unittest

from selenium import webdriver
from selenium.webdriver.common.action_chains import ActionChains

import selenium.webdriver.support.ui as ui

browsers = {"firefox": webdriver.Firefox, "ie": webdriver.Ie, "chrome": webdriver.Chrome, "safari": webdriver.Safari}

class Tests(unittest.TestCase):
    def setUp(self):
        self.driver = browser()
        self.driver.set_window_size(1380, 880)
        self.driver.get("file:///C:/Tools/myfile.htm")

        # Convenience functions
        self.driver.by_id = self.driver.find_element_by_id
        self.driver.by_xpath = self.driver.find_element_by_xpath
        self.wait = ui.WebDriverWait(self.driver, 5)

    def test_the_webpage(self):
        d = self.driver
        d.refresh()

        self.drag_and_drop(d.by_id("ID1"), d.by_xpath("//div[@id='things']/img[3]"))
        self.wait.until(lambda d: d.by_xpath("//table[@id='results']"))
        self.assertEqual("45", d.by_xpath("//table[@id='results']/tbody/tr/td[2]").text)

        d.by_id("thingy").click()
        self.wait.until(lambda d: d.by_xpath("//table[@id='results']"))
        self.assertEqual("64", d.by_xpath("//table[@id='results']/tbody/tr/td[2]").text)

    def drag_and_drop(self, elem, target):
        ActionChains(self.driver).drag_and_drop(elem, target).perform()

    def tearDown(self):
        self.driver.quit()

if __name__ == "__main__":
    browsername = "firefox"
    if len(sys.argv) == 2:
        browsername = sys.argv[1].lower()
    browser = browsers[browsername]

    suite = unittest.TestLoader().loadTestsFromTestCase(Tests)
    unittest.TextTestRunner().run(suite)

Note: For another Python project related to browser automation, see also twill.

Monday, 13 January 2014

Convert distance matrix to 2D projection with Python

In my continuing quest to never use R again, I've been trying to figure out how to embed points described by a distance matrix into 2D. This can be done with several manifold embeddings provided by scikit-learn. The diagram below was generated using metric multi-dimensional scaling based on a distance matrix of pairwise distances between European cities (docs here and here).

import csv
import numpy as np
import matplotlib.pyplot as plt
from sklearn import manifold

# Distance file available from RMDS project:
#    https://github.com/cheind/rmds/blob/master/examples/european_city_distances.csv
reader = csv.reader(open("european_city_distances.csv", "r"), delimiter=';')
data = list(reader)

dists = []
cities = []
for d in data:
    cities.append(d[0])
    dists.append(map(float , d[1:-1]))

adist = np.array(dists)
amax = np.amax(adist)
adist /= amax

mds = manifold.MDS(n_components=2, dissimilarity="precomputed", random_state=6)
results = mds.fit(adist)

coords = results.embedding_

plt.subplots_adjust(bottom = 0.1)
plt.scatter(
    coords[:, 0], coords[:, 1], marker = 'o'
    )
for label, x, y in zip(cities, coords[:, 0], coords[:, 1]):
    plt.annotate(
        label,
        xy = (x, y), xytext = (-20, 20),
        textcoords = 'offset points', ha = 'right', va = 'bottom',
        bbox = dict(boxstyle = 'round,pad=0.5', fc = 'yellow', alpha = 0.5),
        arrowprops = dict(arrowstyle = '->', connectionstyle = 'arc3,rad=0'))

plt.show()

Notes: If you don't specify a random_state, then a slightly different embedding may be generated each time (with arbitary rotation) in the 2D plane. If it's slow, you can use multiple CPUs via n_jobs=N.

Wednesday, 1 January 2014

QM Speed Test: NWChem

When I started the QM Speed Test I was afraid that I'd lose steam (and face!) before getting too far along. At the same time, I hoped that others might be inspired to send in some results for their systems and QM packages. This would make it easier for me (since half the work is setting up the input file), verify that my results were not pure invention, and extend the speed test to packages to which I don't have access.

I've just checked in the results for NWChem 6.3 compiled on Centos (also 6.3). But in the meanwhile, others have shot ahead by forking the speed test and checking in their own results:

Eric Berquist, a grad student at the Uni of Pittsburgh, has checked in results for Dalton, ORCA, and Q-Chem.
Michael Banck, Debichem guru, has checked in results for all those packages available on Debian 7 (and presumably Ubuntu and other derivatives) namely NWChem, MPQC, and PSI4, as well as ERKALE.

Dudes, I salute you.

Let's start with my results. Rather than repeat the table here, you can check out the figures on the Github page. The summary is that NWChem is quite a bit faster at geo-opts than ERKALE. This is a combination of more steps in the geo-opt, and faster energy calculations. So if you are using some of ERKALE's other features, it would probably make sense to geo-opt first with NWChem. I did recompile ERKALE with "-O3 -ffast-math -funroll-loops -fPIC -march=native" and got a 10% improvement but that's not enough to change things.

From the other calculations, what do we now know? Well all of the packages tested can actually do geo-opts at the HF/6-31G and B3LYP/6-31G level of theory. Figuring this out by looking at their websites is not always as easy as you might think. Also worth noting is the relative slowdown when using B3LYP versus HF. In some cases the slowdown is of the order of 30-40%, in other cases the DFT calculations are twice as slow.

Michael's show that NWChem is faster than any of the other Open Source QM packages tested. Very useful information; again, it points at a large time-saving by using NWChem for the initial geo-opt before turning to whatever package you need for the subsequent analysis. Eric has compared several proprietary packages (I was going to say commercial but I can't tell if ORCA is available for purchase) and Q-Chem is much faster than ORCA and Dalton.

What we haven't looked at is scaling. Maybe the orders all change when we double the size of the molecule and increase the basis set. But that's a question for another day. Right now, the goal is to get as much breadth as possible on the original question.

Notes on compiling NWChem

1. Boy were there some quirks with this one. First of all, although the instructions on the website give everything in terms of the CShell (remember csh? - big on Unix, not so big these days), you can just use the bash equivalents.
2. I wanted only single-CPU calculations so I tried to turn off MPI support. My first attempt "USE_MPI=n" caused build failures, and the only way to get around build failures appeared to be to delete the folder and untar the source again. Following the instructions on this post, I used "export ARMCI_NETWORK=SOCKETS" instead.
3. My favourite quirk was that when you untar the source, the generated folder has a really long name which causes a compile failure after some time as follows:

The directory name chosen for NWCHEM_TOP is longer than
the maximum allowed value of 65 characters
current NWCHEM_TOP=/home/noel/Tools/Quantum/nwchem-6.3.revision2-src.2013-10-17/src

Renaming to "nwchem" also failed (different error - can't find something or other I think). Renaming to "nwchem-6.3" worked.
4. I couldn't figure out how to get NWChem to use the system lapack library but got it to link to blas as described in the docs.