Wednesday, 6 February 2013

A compilation of speeds - Compiler face-off

[Image: Compilers cake!]

Let's get right into this one. I've compiled Open Babel with g++ in various ways, and am going to compare the speed with the MSVC++ release. Specifically, I'm going to compare the wall-clock time to convert 10000 molecules (the first 10000 in ChEMBL 13) from an SDF file to SMILES.

Our starting point is the time for the MSVC++ compiled release:
29.6s (MSVC++ 2010 Express 32-bit)

I have a Linux Mint 12 VM (VMWare) on the same machine, so let's run the same executable under Wine on Linux:
37.3s (MSVC++ 32-bit under Wine/Linux)
...so it's slower, pretty much as expected. The not-an-emulation layer slows things down a bit.

How about the MinGW compilation described in the previous post?:
24.1s (MinGW g++ 4.6.2 32-bit)
g++ beats MSVC++. To be honest, I was a bit surprised to see this, although I understand from Roger that g++ is surprisingly highly-optimised for cheminformatics toolkits. Maybe we should look into an official MinGW release in future.

What about Open Babel compiled with Cygwin's g++?:
39.5s (Cygwin g++ 4.5.3 32-bit)
As expected it runs like a pig compared to the MinGW version. Cygwin's handy, but when you're in a hurry it's maybe not the best choice.

So far, so not very unexpected. Now we will enter the realm of weirdness. Let's compile it on Linux in the VM and run it there:
14.8s (Linux Mint 12 g++ 4.6.1 64-bit)

So, in short, the fastest way to run Open Babel on Windows is to use a VM to run Linux. Huh? The like-with-like comparison of MinGW's 24.1 versus Linux's 14.8 is the most intriguing. It suggests that the slowdown is either due to rubbish file I/O by Windows, or sub-optimal platform-specific code in Open Babel's I/O handling code.

Either way, it's a pretty interesting result.

Notes:
1. Hardware was a Dell Latitude E6400 bought 3 years ago (Core 2 Duo 2.4 GHz, 4 GB RAM) running Win 7 64-bit. The timing was the best of three after timings had stabilised (the first one or two are usually a second or two slower).
2. After the initial post, I compiled clang on Linux, and then used it to compile Open Babel. Running the conversion took 15.3s.
3. Also, I ran the MinGW compiled version under Linux, and it took 30.7s.
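
For anyone wanting to repeat the measurement, the best-of-three timing described in note 1 could be scripted along the following lines. This is only a sketch: the helper name best_of is made up here, and the Open Babel command line mentioned in the comment (with its file names) is a hypothetical example, not necessarily the exact command used for the numbers above.

```python
import subprocess
import sys
import time


def best_of(command, repeats=3):
    """Run command `repeats` times; return the fastest wall-clock time in seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        subprocess.check_call(command)  # raises CalledProcessError on a non-zero exit
        timings.append(time.perf_counter() - start)
    return min(timings)


# Stand-in command so that the sketch runs anywhere; for a real comparison
# substitute something like ["obabel", "first10000.sdf", "-O", "out.smi"]
# (hypothetical file names).
elapsed = best_of([sys.executable, "-c", "pass"])
print("Best of three: %.2fs" % elapsed)
```

Note that this does not include the one or two warm-up runs mentioned above; run it twice and keep the second result if you want the timings to have stabilised first.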

Image credit: Venkatesh Srinivas (Extrudedaluminiu on Flickr)

Compiling Open Babel with MinGW on Windows

If you want to compile on Windows using GCC, you have two alternatives: Cygwin's GCC and MinGW's. The one from Cygwin is easier to use (easier installation) but has the disadvantage that the resulting software does not run natively on Windows: various system calls go through Cygwin's emulation layer, which slows things down. Here I'll show how to compile Open Babel with MinGW.

Installing MinGW

I've previously found this a bit confusing. This time I did a manual installation by creating a folder C:\MinGW, and then downloading all the relevant dlls on the installation page. To do this quickly just middle click on several links, wait a few seconds, and then hit Save on all the dialog boxes. Once they are all downloaded, move them to C:\MinGW and unzip them there.

Installing MSYS

No need to install MSYS (a kind of build environment for MinGW) for a project such as Open Babel that uses CMake to build. Why do I mention it then? Because the MinGW page talks all about it.

Compiling Open Babel

1. Add C:\MinGW\bin to the PATH
2. Get Cygwin's stuff off the PATH (if it's there). This is most easily accomplished by renaming C:\Cygwin to C:\oldCygwin or so.
3. Configure CMake to create makefiles for MinGW. I had some problems (at runtime) with a shared library version, so I went with the static one:
cmake -G "MinGW Makefiles" ../openbabel-2.3.2 -DWITH_INCHI=FALSE -DBUILD_SHARED=FALSE
4. Build it with MinGW's make.
mingw32-make
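
If the configure step picks up the wrong compiler, steps 1 and 2 are the first thing to check. Here is a small sketch that inspects a PATH-style string for MinGW and Cygwin entries. The function name check_build_path and the example PATH value are made up for illustration, and the check is by folder name only, so it will still flag an entry that points at a Cygwin folder you have already renamed (which is harmless).

```python
import os


def check_build_path(path_value, sep=os.pathsep):
    """Return (mingw_found, cygwin_entries) for a PATH-style string."""
    entries = [entry for entry in path_value.split(sep) if entry]
    mingw_found = any("mingw" in entry.lower() for entry in entries)
    cygwin_entries = [entry for entry in entries if "cygwin" in entry.lower()]
    return mingw_found, cygwin_entries


# Example PATH value (made up); pass os.environ["PATH"] to check the real one.
example = r"C:\Windows\system32;C:\cygwin\bin;C:\MinGW\bin"
mingw_found, cygwin_entries = check_build_path(example, sep=";")
print("MinGW on PATH: %s" % mingw_found)
print("Cygwin entries found: %s" % cygwin_entries)
```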
Hmmm...I wonder if it's as fast as the MSVC-compiled version we distribute?

Thursday, 24 January 2013

You can QSAR that again - Reproducible research with IPython

I've mentioned the IPython Notebook before (here and here). It's an interactive Python session that runs in the web browser, and can capture and display the output including plots. It can be saved, loaded and exported to a static HTML page. Entries in the notebook can be edited, and the whole notebook can be run in order to regenerate the output.

In other words, it's the perfect tool for documenting and presenting an analysis of data, thus bringing us one step closer to the goal of reproducible research. There is one area in which it is a particularly good fit for cheminformatics, and that's QSAR.

Greg Landrum and Nikolas Fechner of Novartis have led the way here. Check out this series of IPython notebooks originally presented at the RDKit UGM in 2012, and in particular the one on Using SciKit-Learn and Descriptors to Build Regression Models. Here's an excerpt:
It's pretty much a complete record of how they went about analysing a particular dataset from start to finish. The only thing that I would add is that I would ask the software used (RDKit, IPython, matplotlib and scikit-learn) to print out their version numbers at the top of the notebook (and add some pretty pictures of outliers too of course).
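
As a rough illustration of the version-number suggestion, something like the following could go in the first cell of a notebook. It is a sketch, not part of the notebooks mentioned above: report_versions is a made-up helper, the import names are assumed to be rdkit, IPython, matplotlib and sklearn, and it relies on each package exposing a __version__ attribute (falling back to "unknown" if not).

```python
import importlib


def report_versions(module_names):
    """Map each module name to its __version__ string, or a note if unavailable."""
    versions = {}
    for name in module_names:
        try:
            module = importlib.import_module(name)
        except ImportError:
            versions[name] = "not installed"
            continue
        versions[name] = str(getattr(module, "__version__", "unknown"))
    return versions


# Import names assumed for the packages mentioned above.
packages = ["rdkit", "IPython", "matplotlib", "sklearn"]
for name, version in sorted(report_versions(packages).items()):
    print("%s %s" % (name, version))
```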

Hopefully others will follow in these footsteps. It would certainly be something to see such a Notebook included as part of the Methods section in a QSAR paper. Almost makes me want to do some QSAR work again...(almost). :-)

Saturday, 19 January 2013

Chemistrify your Raspberry Pi Part III

Following on from Parts I and II, now for the chemistry bit. It turns out this is the easiest part:
apt-get install python-cinfony python-imaging python-imaging-tk openbabel openbabel-gui indigo-utils python-chemfp python-cclib gausssum pymol jmol rasmol avogadro
Pretty easy huh? A single line install for 13 or so chemistry packages. Note that this install command should work on any other Linux distribution based on Debian (e.g. Linux Mint or Ubuntu).

Specifically, this installs Cinfony 1.1 and all its dependencies (Open Babel, RDKit, CDK, Indigo, OPSIN). Then there's Andrew Dalke's chemfp. Not to mention the 'mols' (Jmol, PyMol, Rasmol) and Avogadro. And let's not forget shameless self-promotion of cclib and GaussSum.

After installation here are some examples of things you could do:
$ obabel -:"CC(=O)Cl" -O testOB.png
$ indigo-depict - "CC(=O)Cl" testIndigo.png
$ gpicview test*.png # Display images
$
$ obgui   # Runs fine
$
$ obabel -:"CC(=O)Cl MyMol" -O tmp.mol --gen2d
$ ob2fps tmp.mol # Run ChemFP
$
$ wget http://www.rcsb.org/pdb/files/1PTQ.pdb
$ pymol 1PTQ.pdb  # Fails to start (no OpenGL GLX extension on RPi)
$ rasmol 1PTQ.pdb # Runs fine
$ jmol 1PTQ.pdb   # Runs fine
$ avogadro        # Fails to start (OpenGL problem)
$
$ cclib-get --list mycompchemfile.log
$ gausssum
To use Cinfony, you need to set some variables first as the Java parts don't work out-of-the-box:
$ export JPYPE_JVM=/usr/lib/jvm/java-6-openjdk-armhf/jre/lib/arm/server/libjvm.so
$ export CLASSPATH=/usr/share/java/cdk-nonotify.jar:/usr/share/java/cdk-io.jar:/usr/share/java/cdk-formula.jar:/usr/share/java/cdk-forcefield.jar:/usr/share/java/cdk-atomtype.jar:/usr/share/java/cdk-pdb.jar:/usr/share/java/cdk-fingerprint.jar:/usr/share/java/cdk-qsar.jar:/usr/share/java/cdk-ionpot.jar:/usr/share/java/cdk-annotation.jar:/usr/share/java/cdk-builder3d.jar:/usr/share/java/cdk-libiocml.jar:/usr/share/java/cdk-libiomd.jar:/usr/share/java/cdk-pcore.jar:/usr/share/java/cdk-ioformats.jar:/usr/share/java/cdk-qsarmolecular.jar:/usr/share/java/cdk-qsaratomic.jar:/usr/share/java/cdk-valencycheck.jar:/usr/share/java/cdk-extra.jar:/usr/share/java/cdk-structgen.jar:/usr/share/java/cdk-dict.jar:/usr/share/java/cdk-smarts.jar:/usr/share/java/cdk-control.jar:/usr/share/java/cdk-render.jar:/usr/share/java/cdk-builder3dtools.jar:/usr/share/java/cdk-qsarprotein.jar:/usr/share/java/cdk-data.jar:/usr/share/java/cdk-charges.jar:/usr/share/java/cdk-qm.jar:/usr/share/java/cdk-qsarionpot.jar:/usr/share/java/cdk-standard.jar:/usr/share/java/cdk-interfaces.jar:/usr/share/java/cdk-core.jar:/usr/share/java/cdk-sdg.jar:/usr/share/java/cdk-isomorphism.jar:/usr/share/java/cdk-qsarbond.jar:/usr/share/java/cdk-reaction.jar:/usr/share/java/cdk-diff.jar:/usr/share/java/cdk-smiles.jar:/usr/share/java/jaxen.jar:/usr/share/java/opsin-1.2.0.jar:/usr/share/java/opsin.jar
$ python
>>> from cinfony import opsin, webel
>>> webel.readstring("name", "aspirin").write("iupac")
'2-acetyloxybenzoic acid'
>>> opsin.readstring("iupac", "2-acetyloxybenzoic acid").write("smi")
'C(C)(=O)OC1=C(C(=O)O)C=CC=C1'
There are many other packages of interest; look under the Science category in synaptic (see Notes below). Some examples include autodock, ballview, bkchem, and kalzium. Or to max out on chemistry just install the package science-chemistry.

Notes:
(1) If using apt-get to install software is too hard-core for you, there's also a GUI called synaptic. To install, use "apt-get install synaptic".
(2) After installation, to actually see what has been installed, use "dpkg -L 'package-name'". For example, anything that was installed in /usr/bin is a new command.
(3) The list of CDK jars was created using the following Python script:
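import glob

# Collect the CDK jars, skipping any whose name ends in "1.2.10.jar"
all_jars = glob.glob("/usr/share/java/cdk-*.jar")
jars = [jar for jar in all_jars if not jar.endswith("1.2.10.jar")]
# Join with ":" to give a CLASSPATH-style string
print(":".join(jars))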
(4) Compiling Open Babel oneself works fine but takes 3 or so hours. A similar experience for RDKit has been reported by Jan Holst Jensen.

Tuesday, 15 January 2013

Compiling RDKit with MSVC 2012

Compiling RDKit is a bit like the recipe for Elephant Soup. It's straightforward, but first we have to compile Boost (as there are no binaries provided for MSVC 2012). Unfortunately, the Boost build instructions are very poor. The HTML instructions are full of text, none of which simply tells you how to get the job done.

Preparation
1. Just to be safe, as both Boost and RDKit compile against Python, I deleted all my Python install folders except C:\Python27.
2. Make sure that bison and flex are installed in Cygwin, and that they are on the PATH.

Compiling Boost 1.49
1. Choose the right version. Too new, and the API will have changed and RDKit will not compile; too old, and it won't compile with MSVC 2012. I'm using boost 1.49.
2. Unzip into C:\Boost\boost_1_49_0. Do not bother using a different folder as it will install into C:\Boost\lib in any case.
3. Compile bjam as follows:
cd C:\Boost\boost_1_49_0
bootstrap
4. Start the MSVC2012 command prompt (or else bjam won't find 'cl'). Now we're going to compile the bits of Boost that RDKit needs. Some of these are built as shared (dynamically linked) libraries and some as static libraries.
bjam.exe --with-regex --with-python --with-date_time --with-thread link=shared toolset=msvc-11.0 release install -j4
bjam.exe --with-thread --with-date_time toolset=msvc-11.0 release stage -j4
5. Copy the files from C:\Boost\boost_1_49_0\stage\lib to C:\Boost\lib

Compiling RDKit Q3 2012
1. The setup is all done by one command:
C:\Tools\RDKit\newbuild>cmake -G "Visual Studio 11" ..\RDKit_2012_09_1
-- Check for working C compiler using: Visual Studio 11
-- Check for working C compiler using: Visual Studio 11 -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working CXX compiler using: Visual Studio 11
-- Check for working CXX compiler using: Visual Studio 11 -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check if the system is big endian
-- Searching 16 bit integer
-- Looking for sys/types.h
-- Looking for sys/types.h - found
-- Looking for stdint.h
-- Looking for stdint.h - found
-- Looking for stddef.h
-- Looking for stddef.h - found
-- Check size of unsigned short
-- Check size of unsigned short - done
-- Using unsigned short
-- Check if the system is big endian - little endian
-- Found PythonLibs: C:/Python27/libs/python27.lib (found version "2.7.3")
-- Found PythonInterp: C:/Python27/python.exe (found version "2.7.3")
-- Boost version: 1.49.0
-- Found the following Boost libraries:
--   python
-- Found BISON: C:/cygwin/bin/bison.exe
-- Found FLEX: C:/cygwin/bin/flex.exe
-- Looking for include file pthread.h
-- Looking for include file pthread.h - not found.
-- Found Threads: TRUE
-- Boost version: 1.49.0
-- Found the following Boost libraries:
--   regex
-- Configuring done
-- Generating done
-- Build files have been written to: C:/Tools/RDKit/newbuild
2. Type "start RDKit.sln", change to Release build, and build the ALL_BUILD target, followed by the INSTALL target.

Running the RDKit tests
1. Close Visual Studio, and at the command line type:
set RDBASE=C:\Tools\RDKit\RDKit_2012_09_1
set PYTHONPATH=%RDBASE%
set PATH=%RDBASE%\lib;C:\Boost\lib;%PATH%
start RDKit.sln
2. Now you can run the tests by 'building' the RUN_TESTS target
1>  100% tests passed, 0 tests failed out of 76
1>  
1>  Total Test time (real) =  82.36 sec

Notes: For a debug build, you need the debug build of Boost. Just replace release by debug in the bjam command-lines above (to speed things up, use 'stage' for both).

Saturday, 5 January 2013

IPython notebook and animated FiPy simulations

I was back home this Christmas, and met up with a friend, Johan Hjelm. Naturally the conversation turned to the awesomeness of Python, and whether there was any way to create animated FiPy simulations directly in the IPython Notebook.

FiPy is "an object oriented, partial differential equation (PDE) solver, written in Python, based on a standard finite volume (FV) approach." Fair enough. More usefully, there are a couple of examples on the website that model diffusion, electrodeposition and convection. I focussed on the mesh20x20 diffusion example.

If you run the example at the command-line it pops up a matplotlib window showing the progress of the simulation. However, direct entry of the example into an IPython Notebook just results in a single graph for the simulation. To adapt it, I added a call to clear_output(), and used IPython's display() command to directly display the matplotlib figure associated with the simulation.
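
The pattern boils down to a loop that advances the simulation, clears the cell's previous output, and displays the figure again. Here is a minimal sketch of that loop; it is not the code from the notebook itself. The function name animate is made up, and the display and clear_output callables are passed in as arguments so that the sketch can be run (with stand-ins) outside a notebook.

```python
def animate(figure, advance, n_steps, display, clear_output):
    """Advance the simulation n_steps times, redrawing the figure in place each time."""
    for _ in range(n_steps):
        advance()        # e.g. solve one time step and update the plot
        clear_output()   # remove the previous frame from the output cell
        display(figure)  # show the current state of the figure


# Stand-ins so the sketch can be run outside a notebook; they just record the calls.
calls = []
animate(
    figure="<figure>",
    advance=lambda: calls.append("advance"),
    n_steps=2,
    display=lambda fig: calls.append("display"),
    clear_output=lambda: calls.append("clear"),
)
print(calls)
```

In a real notebook you would instead pass in display and clear_output imported from IPython.display, the matplotlib figure associated with the simulation, and a function that performs one step of the FiPy solve.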

In short, here are the results as an IPython notebook, a Python script, and as an HTML page (and another HTML page, created on-the-fly from the notebook URL by http://nbviewer.ipython.org).

Notes: I used IPython 0.13.1 on Windows. To downgrade the notebook to earlier versions, see this discussion. Also, the statement "from IPython.display import clear_output" may need to be changed to "from IPython.core.display import clear_output".

Wednesday, 2 January 2013

Chemistrify your Raspberry Pi Part II

In Part I, we got this very lean mean machine up and running. Now we want to take a look at what's going on in its tiny tiny silicon brain. If you have a monitor/TV and USB keyboard and mouse it's easy - just plug them in. In my case I don't so...

...I'm going to log in remotely using my laptop over the network. The good news is that there's an ssh server running by default on the RPi. The username is pi and password is raspberry. All we need is the RPi's IP address.

Connect the RPi to your router using an ethernet cable. If both are turned on, the router will assign the RPi an IP address. You can find out the value by logging into your router and looking at the details (or you can just guess the IP address by changing the number at the end of your laptop's IP address). Once you have the IP address, you can log in with Putty or Cygwin's ssh (remember, username pi).
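
If you go down the guessing route, a few lines of Python can do the guessing for you. The sketch below assumes that the RPi and the laptop share the first three numbers of the IP address (typical for a home router, but an assumption nonetheless); the function names and the example laptop address are made up for illustration.

```python
import socket


def candidate_addresses(laptop_ip):
    """List the other addresses that share the laptop's first three octets."""
    prefix, last = laptop_ip.rsplit(".", 1)
    return ["%s.%d" % (prefix, n) for n in range(1, 255) if n != int(last)]


def ssh_port_open(host, port=22, timeout=0.3):
    """Return True if something accepts TCP connections on host:port."""
    try:
        sock = socket.create_connection((host, port), timeout)
    except OSError:
        return False
    sock.close()
    return True


# 192.168.1.23 is a made-up laptop address; substitute your own.
addresses = candidate_addresses("192.168.1.23")
print("%d addresses to try, starting with %s" % (len(addresses), addresses[0]))
# To run the actual scan (slow: up to 0.3s per address that does not answer):
# print([host for host in addresses if ssh_port_open(host)])
```

Any address reported by the scan is only a candidate, since other machines on the network may also be running an ssh server.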

The first time you log in, it asks you to run 'sudo raspi-config'. I did so, to set the timezone, expand the root filesystem (otherwise it doesn't use the whole SD card), and reduce the video memory to 32MB from 64MB (under "Memory split"). When you hit Finish it reboots, killing ssh, so you have to wait a minute before logging back in.

While some believe that the Unix command line is the perfect user interface, let's see what the Raspbian GUI looks like. To do so, we are going to use VNC (Virtual Network Computing), and specifically a piece of software called TightVNC. We will set up a server on the RPi, and a viewer on the laptop.

On the RPi:
$ sudo apt-get install tightvncserver
$ tightvncserver :1
If you set a password, make a note of it.

On the Windows laptop, install TightVNC. Please note that when you run the installer, you should untick the box that sets TightVNC running as a Windows service. This would be a BAD idea, as it would mean that your desktop is being broadcast over the network.

Now run "TightVNC Viewer" and connect to the RPi by entering the IP address of the RPi followed by ":5901" (VNC display :1 corresponds to TCP port 5900 + 1), e.g. 128.128.0.1:5901. If you set a password, you will need to enter it. Finally, you should see something like this:


I still haven't done any chemistry but I guess that's all in Part III...

Notes: From time to time the router changes the IP address it allocates. If you want to assign a fixed IP address to the RPi, see the information here (untested). If you want the RPi to automatically start a TightVNC server on booting, see the information in the same article.