Thursday, 24 January 2013

You can QSAR that again - Reproducible research with IPython

I've mentioned the IPython Notebook before (here and here). It's an interactive Python session that runs in the web browser, and can capture and display the output including plots. It can be saved, loaded and exported to a static HTML page. Entries in the notebook can be edited, and the whole notebook can be run in order to regenerate the output.

In other words, it's the perfect tool for documenting and presenting an analysis of data, thus bringing us one step closer to the goal of reproducible research. There is one area in which it is a particularly good fit for cheminformatics, and that's QSAR.

Greg Landrum and Nikolas Fechner of Novartis have led the way here. Check out this series of IPython notebooks originally presented at the RDKit UGM in 2012, and in particular the one on Using SciKit-Learn and Descriptors to Build Regression Models. Here's an excerpt:
It's pretty much a complete record of how they went about analysing a particular dataset from start to finish. The only thing that I would add is that I would ask the software used (RDKit, ipython, matplotlib and scikits-learn) to print out their version numbers of the top of the notebook (and add some pretty pictures of outliers too of course).

Hopefully others will follow in these footsteps. It would certainly be something to see such a Notebook included as part of the Methods section in a QSAR paper. Almost makes me want to do some QSAR work again...(almost). :-)

Saturday, 19 January 2013

Chemistrify your Raspberry Pi Part III

Following on from Parts I and II, now for the chemistry bit. It turns out this is the easiest part:
apt-get install python-cinfony python-imaging python-imaging-tk openbabel openbabel-gui indigo-utils python-chemfp python-cclib gausssum pymol jmol rasmol avogadro
Pretty easy huh? A single line install for 13 or so chemistry packages. Note that this install command should work on any other Linux distribution based on Debian (e.g. Linux Mint or Ubuntu).

Specifically, this installs Cinfony 1.1 and all its dependencies (Open Babel, RDKit, CDK, Indigo, OPSIN). Then there's Andrew Dalke's chemfp. Not to mention the 'mols' (Jmol, PyMol, Rasmol) and Avogadro. And let's not forget shameless self-promotion of cclib and GaussSum.

After installation here are some examples of things you could do:
$ obabel -:"CC(=O)Cl" -O testOB.png
$ indigo-depict - "CC(=O)Cl" testIndigo.png
$ gpicview test*.png # Display images
$ obgui   # Runs fine
$ obabel -:"CC(=O)Cl MyMol" -O tmp.mol --gen2d
$ ob2fps tmp.mol # Run ChemFP
$ wget
$ pymol 1PTQ.pdb  # Fails to start (no OpenGL GLX extension on RPi)
$ rasmol 1PTQ.pdb # Runs fine
$ jmol 1PTQ.pdb   # Runs fine
$ avogadro        # Fails to start (OpenGL problem)
$ cclib-get --list mycompchemfile.log
$ gausssum
To use Cinfony, you need to set some variables first as the Java parts don't work out-of-the-box:
$ export JPYPE_JVM=/usr/lib/jvm/java-6-openjdk-armhf/jre/lib/arm/server/
$ export CLASSPATH=/usr/share/java/cdk-nonotify.jar:/usr/share/java/cdk-io.jar:/usr/share/java/cdk-formula.jar:/usr/share/java/cdk-forcefield.jar:/usr/share/java/cdk-atomtype.jar:/usr/share/java/cdk-pdb.jar:/usr/share/java/cdk-fingerprint.jar:/usr/share/java/cdk-qsar.jar:/usr/share/java/cdk-ionpot.jar:/usr/share/java/cdk-annotation.jar:/usr/share/java/cdk-builder3d.jar:/usr/share/java/cdk-libiocml.jar:/usr/share/java/cdk-libiomd.jar:/usr/share/java/cdk-pcore.jar:/usr/share/java/cdk-ioformats.jar:/usr/share/java/cdk-qsarmolecular.jar:/usr/share/java/cdk-qsaratomic.jar:/usr/share/java/cdk-valencycheck.jar:/usr/share/java/cdk-extra.jar:/usr/share/java/cdk-structgen.jar:/usr/share/java/cdk-dict.jar:/usr/share/java/cdk-smarts.jar:/usr/share/java/cdk-control.jar:/usr/share/java/cdk-render.jar:/usr/share/java/cdk-builder3dtools.jar:/usr/share/java/cdk-qsarprotein.jar:/usr/share/java/cdk-data.jar:/usr/share/java/cdk-charges.jar:/usr/share/java/cdk-qm.jar:/usr/share/java/cdk-qsarionpot.jar:/usr/share/java/cdk-standard.jar:/usr/share/java/cdk-interfaces.jar:/usr/share/java/cdk-core.jar:/usr/share/java/cdk-sdg.jar:/usr/share/java/cdk-isomorphism.jar:/usr/share/java/cdk-qsarbond.jar:/usr/share/java/cdk-reaction.jar:/usr/share/java/cdk-diff.jar:/usr/share/java/cdk-smiles.jar:/usr/share/java/jaxen.jar:/usr/share/java/opsin-1.2.0.jar:/usr/share/java/opsin.jar
$ python
>>> from cinfony import opsin, webel
>>> webel.readstring("name", "aspirin").write("iupac")
'2-acetyloxybenzoic acid'
>>> opsin.readstring("iupac", "2-acetyloxybenzoic acid").write("smi")
There are many other packages of interest; see under Science category in synaptic (see Notes below). Some examples include autodock, ballview, bkchem, and kalzium. Or to max out on chemistry just install the package science-chemistry.

(1) If using apt-get to install software is too hard-core for you, there's also a GUI called synaptic. To install, use "apt-get install synaptic".
(2) After installation, to actually see what has been installed, use "dpkg -L 'package-name'". For example, anything that was installed in /usr/bin is a new command.
(3) The list of CDK jars was created using the following Python script:
import glob

cdk = glob.glob("/usr/share/java/cdk-*.jar")
jar = [jar for jar in cdk if not jar.endswith("1.2.10.jar")]
print ":".join(jar)
(4) Compiling Open Babel oneself works fine but takes 3 or so hours. A similar experience for RDKit has been reported by Jan Holst Jensen.

Tuesday, 15 January 2013

Compiling RDKit with MSVC 2012

Compiling RDKit is a bit like the recipe for Elephant Soup. It's straightforward, but first we have to compile Boost (as there are no binaries provided for MSVC 2012). Unfortunately, the boost build instructions are very poor. The HTML instructions are full of text, none of which will simply tell you how to get the job done.

1. Just to be safe, as both Boost and RDKit compile against Python, I deleted all my Python install folders except C:\Python2.7.
2. Make sure that bison and flex are installed in Cygwin, and that they are on the PATH.

Compiling Boost 1.49
1. Choose the right version. Too new, and the API will have changed and RDKit will not compile; too old, and it won't compile with MSVC 2012. I'm using boost 1.49.
2. Unzip into C:\Boost\boost_1_49_0. Do not bother using a different folder as it will install into C:\Boost\lib in any case.
3. Compile bjam as follows:
cd C:\Boost\boost_1_49_0
4. Start the MSVC2012 command prompt (or else bjam won't find 'cl'). Now we're going to compile the bits of Boost that RDKit needs. Some of these are shared libraries and some are dynamically linked libraries.
bjam.exe --with-regex --with-python --with-date_time --with-thread link=shared toolset=msvc-11.0 release install -j4
bjam.exe --with-thread --with-date_time toolset=msvc-11.0 release stage -j4
5. Copy the files from C:\Boost\boost_1_49_0\stage\lib to C:\Boost\lib

Compiling RDKit Q3 2012
1. The setup is all done by one command:
C:\Tools\RDKit\newbuild>cmake -G "Visual Studio 11" ..\RDKit_2012_09_1
-- Check for working C compiler using: Visual Studio 11
-- Check for working C compiler using: Visual Studio 11 -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working CXX compiler using: Visual Studio 11
-- Check for working CXX compiler using: Visual Studio 11 -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check if the system is big endian
-- Searching 16 bit integer
-- Looking for sys/types.h
-- Looking for sys/types.h - found
-- Looking for stdint.h
-- Looking for stdint.h - found
-- Looking for stddef.h
-- Looking for stddef.h - found
-- Check size of unsigned short
-- Check size of unsigned short - done
-- Using unsigned short
-- Check if the system is big endian - little endian
-- Found PythonLibs: C:/Python27/libs/python27.lib (found version "2.7.3")
-- Found PythonInterp: C:/Python27/python.exe (found version "2.7.3")
-- Boost version: 1.49.0
-- Found the following Boost libraries:
--   python
-- Found BISON: C:/cygwin/bin/bison.exe
-- Found FLEX: C:/cygwin/bin/flex.exe
-- Looking for include file pthread.h
-- Looking for include file pthread.h - not found.
-- Found Threads: TRUE
-- Boost version: 1.49.0
-- Found the following Boost libraries:
--   regex
-- Configuring done
-- Generating done
-- Build files have been written to: C:/Tools/RDKit/newbuild
2. Type "start RDKit.sln", change to Release build, and build the ALL_BUILD target, followed by the INSTALL target.

Running the RDKit tests
1. Close Visual Studio, and at the command line type:
set RDBASE=C:\Tools\RDKit\RDKit_2012_09_1
set PATH=%RDBASE%\lib;C:\Boost\lib;%PATH%
start RDKit.sln
2. Now you can run the tests by 'building' the RUN_TESTS target
1>  100% tests passed, 0 tests failed out of 76
1>  Total Test time (real) =  82.36 sec

Notes: For a debug build, you need the debug build of Boost. Just replace release by debug in the bjam command-lines above (to speed things up, use 'stage' for both).

Saturday, 5 January 2013

IPython notebook and animated FiPy simulations

I was back home this Christmas, and met up with a friend, Johan Hjelm. Naturally the conversation turned to the awesomeness of Python, and whether there was any way to create animated FiPy simulations directly in the IPython Notebook.

FiPy is "an object oriented, partial differential equation (PDE) solver, written in Python, based on a standard finite volume (FV) approach." Fair enough. More usefully, there are a couple of examples on the website that model diffusion, electrodeposition and convection. I focussed on the mesh20x20 diffusion example.

If you run the example at the command-line it pops up a matplotlib window showing the progress of the simulation. However, direct entry of the example into an IPython Notebook just results in a single graph for the simulation. To adapt it, I added a call to clear_output(), and used IPython's display() command to directly display the matplotlib figure associated with the simulation.

In short, here are the results as an IPython notebook, a Python script, and as an HTML page (and another HTML page, created on-the-fly from the notebook URL by

Notes: I used IPython 0.13.1 on Windows. To downgrade the notebook to earlier versions, see this discussion. Also, the statement "from IPython.display import clear_output" may need to be changed to "from IPython.core.display import clear_output".

Wednesday, 2 January 2013

Chemistrify your Raspberry Pi Part II

In Part I, we got this very lean mean machine up and running. Now we want to take a look at what's going on in its tiny tiny silicon brain. If you have a monitor/TV and USB keyboard and mouse it's easy - just plug them in. In my case I don't so...

...I'm going to log in remotely using my laptop over the network. The good news is that there's an ssh server running by default on the RPi. The username is pi and password is raspberry. All we need is the RPi's IP address.

Connect the RPi to your router using an ethernet cable. If both are turned on, the router will assign the RPi an IP address. You can find out the value by logging into your router and looking at the details (or you can just guess the IP address by changing the number at the end of your laptop's IP address). Once you have the IP address, you can log in with Putty or Cygwin's ssh (remember, username pi).

The first time you log in, it asks you to run 'sudo raspi-config'. I did so, to set the timezone, expand the root filesystem (otherwise it doesn't use the whole SD card), and reduce the video memory to 32MB from 64MB (under "Memory split"). When you hit Finish it reboots, killing ssh, so you have to wait a minute before logging back in.

While some believe that the Unix command line is the perfect user interface, let's see what the Raspbian GUI looks like. To do so, we are going to use VNC (Virtual Network Computing), and specifically a piece of software called TightVNC. We will set up a server on the RPi, and a viewer on the laptop.

On the RPi:
$ sudo apt-get install tightvncserver
$ tightvncserver :1
If you set a password, make a note of it.

On the Windows laptop, install TightVNC. Please note that when you run the installer, you should untick the box that sets TightVNC running as a Windows service. This would be a BAD idea, as it would mean that your desktop is being broadcast over the network.

Now run "TightVNC Viewer" and connect to the RPi by entering the IP address of the RPi followed by ":5901", e.g. If you set a password, you will need to enter it. Finally, you should see something like this:

I still haven't done any chemistry but I guess that's all in Part III...

Notes: From time to time the router changes the IP address it allocates. If you want to assign a fixed IP address to the RPi, see the information here (untested). If you want the RPi to automatically start a TightVNC server on booting, see the information in the same article.

Tuesday, 1 January 2013

Chemistrify your Raspberry Pi

So you've just gotten a Raspberry Pi Model B for Christmas. Yay - let's install all known cheminformatics software! But you don't have a monitor, TV or a USB keyboard. What are we going to do?

Well, it turns out we don't need them. We do need a couple of things though first:

1. A 5V power supply with a micro USB cable and supplying >700mA. £15 at Maplins - a bit overpriced. You'll probably do better on Amazon, or recycling a phone or Kindle charger.

2. An SDHC card. I bought a 32GB one at ASDA for £15, but you can get by with a 4GB one. Apparently they come in different speeds - I didn't realise this, but your Raspberry Pi might be faster if you get a class 10 SDHC (see list of compatible cards).

Getting the Linux distro image ("Raspbian" - a customised Debian) onto the SD card is the hardest part, at least from Windows, as the recommended method is using Win32DiskImager which (as of version 0.6) has a critical bug that might bork your hard-drive (YMMV). I created a Knoppix Live CD, booted off that, and used dd to copy the image to the SD card*. If you don't know what you're doing you might be better off buying an SD card with Raspbian pre-installed.

Your RPi is now good to go. Stick the SD card into the appropriate place, plug in the micro USB, and turn on the power. Eh walla!

Well, actually, the only thing you'll see are lights blinking on the RPi. But Part II will continue this epic narrative...

(* I used gparted to figure out which device was which.)