Wednesday 2 December 2015

Coin-tossing power

I've been doing some coin tossing. Well actually it's cheminformatics, but the problem reduces to coin tossing. Given that I toss a coin 1000 times, how many heads must I observe to believe that it is a biased coin. My usual ramshackle approach would be to estimate the standard deviation of the random distribution (have my computer do some coin tossing) and then say something about being two sigmas away from the mean. I've done that just to be sure - the sigma is just under 16.
import random
random.seed(1)
import matplotlib.pyplot as plt
from collections import defaultdict
import scipy
import scipy.optimize
import numpy as np

values = [01]

ans = defaultdict(int)
for N in range(100000):
    tot = 0
    for i in range(1000):
        tot += random.choice(values)
    ans[tot] += 1

start, end = 350650
x = np.arange(start, end)
y = []
for i in range(start, end):
    if i in ans:
        y.append(ans[i])
    else:
        if i > start:
            y.append(y[-1])
        else:
            y.append(0)

n = len(x)
mean = 500.
sigma = 10.

def gaus(x,a,sigma):
    return a*scipy.exp(-(x-mean)**2/(2*sigma**2))

popt,pcov = scipy.optimize.curve_fit(gaus,x,y,p0=[1.0,sigma])
print popt

plt.plot(x,y,'b+:',label='data')
plt.plot(x,gaus(x,*popt),'ro:',label='fit')
plt.legend()
plt.show()
PMC points out that a slightly easier way of getting the same value would be to use an equation: variance = n(1-p)p where n = 1000 and p is 0.5. The standard deviation is then the square root of this (i.e. sqrt of 250), 15.8.

However, the proper way of handling the original question is a power calculation. Typically power calculations are associated with calculating the necessary sample size. Here I have a fixed sample size (n=1000) and I want to figure out, given that the alternative hypothesis is true (biased coin), what number of heads must I observe to be 99.9% sure (power=0.999) that it would be show up as significant at the 0.1% level (alpha=0.001). The y = 0.5 below refers to the proportion of heads expected under the null hypothesis (unbiased coin). Since n=1000, we can use the normal approximation to the binomial distribution.
import math
import scipy.stats

def qnorm(p):
    return scipy.stats.norm.ppf(p)

def solvequadratic(a, b, c):
    return [(-b+math.sqrt((b**2)-(4*(a*c))))/(2.*a),
            (-b-math.sqrt((b**2)-(4*(a*c))))/(2.*a)]

if __name__ == "__main__":
    alpha = 0.001 # Significance level
    gamma = 0.999 # Power
    qgamma = qnorm(gamma)
    qalpha = qnorm(1.-(alpha/2.))
    p = (qalpha+qgamma)**2
    n = 1000
    y = 0.5
    a = 2*n+p
    b = (2*p-4*n)*y - 2*p
    # Note that if y is 0.5, then, b = -2*n-p (i.e. -a)
    c = -2*p*y + (2*n+p)*y*y
    print solvequadratic(a, b, c)
...and the answer? 642. Thanks to ALC for help with this, though all errors are my own.

Notes: (added 04/12/2015)
Both the power and the significance level are conditional probabilities:
  • The significance level (alpha) is the probability of rejecting the null hypothesis given that it is true, ie. prob(rejecting null hypothesis | null hypothesis is correct).
  • The power (gamma) is the probability of detecting a difference, if indeed the difference does exist, ie. prob(rejecting null hypothesis | null hypothesis is incorrect).

Thursday 27 August 2015

The whole of cheminformatics best practice in 15 minutes

Here's the talk I co-presented with Rajarshi at the recent Boston ACS in a session about best practices in cheminformatics. It was entitled "So I have an SD File...What do I do next?" and went something like this...

Friday 10 April 2015

A reason for optimism - improvements in solar cells

Back in the day, I did some work in the area of dye-sensitised solar cells, and more recently on organic solar cells. So I like to keep an eye on how things are going. Since now and then I hear negative comments about renewable energy - "we're never going to get X% of our energy from it" - I thought I'd post this graph here as it shows that real progress continues to be made and we should be a bit more optimistic (click image for original):
This graph, "courtesy of the National Renewable Energy Laboratory, Golden, CO", contains efficiency measurements of test solar cells as measured in a standardised test by an independent test lab (e.g. the NREL in the US). Of course, there is quite a distance between an efficient test solar cell (a couple of square cm in size) and a viable product for the market, but improving efficiency is important (lifetime, processability and cost are other considerations).

It should also be noted that it may not make much sense to compare efficiencies of different cell types as the price/efficiency ratio may be much different (e.g. semiconductor-grade silicon crystals are more expensive than an organic polymer). Also, efficiency isn't even the full story, e.g. silicon solar cells are top of the pile, but if we all switched to this technology tomorrow, there would likely be a net increase in electricity usage (i.e. when you consider the energy cost of making them). Similarly any technology that relies on rare elements is not going to save the world (but may make a lot of money for some mobile phone makers).

Anyhoo, I like to look at this graph every so often to remind myself that things are improving, and that in the long run, we're going to make it. Or at least as far as efficient solar cells can take us.

For more info on the graph, check out the NREL's photovoltaic page.

Tuesday 7 April 2015

See you at EuroSciPyDataCon?

I've been checking up the various Python conferences this year. Using www.pycon.org as a starting point, I see that for me (based in Cambridge, UK) the possibilities include:
Some of these are already open for talks or registration, so check 'em out soon if interested.

While it would be nice to swan around Bilbao and Berlin, I'll probably aim to attend or speak at EuroSciPy which will be in my backyard. I'm kind of reluctant to speak though; I've never been to a conference that wasn't a straight-up scientific meeting so I'm not quite sure what the audience would expect. Who's interested in chemistry and Python that's not already using chemistry and Python? So I'm not quite sure about that.

Leave a comment if you're going to be there too!

Tuesday 10 March 2015

Rasmol.js

"That guy over there said he wrote RasMol." We were at the MGMS Meeting in Dublin in 2005, and one of the PhD students in our group came over to us excitedly. Obviously we thought the student's claim dubious to say the least; no-one wrote RasMol, it was just one of those pieces of software that had always been around.

Cut to 2015, and I'm working for the guy who wrote RasMol. Having seen Jmol take the leap to JavaScript (an impressive piece of work) some years ago already, I wondered could Emscripten make it possible for RasMol to run for the first time in the browser (sshh...don't mention Chime).

And so here's the result (note: only works in Firefox):

Notes:
1. Everything works pretty much identically to the original code, though I've not bothered to get export/save functionality working, nor reading scripts.
2. The port to the browser was done by first porting to libSDL, a graphics library used by many games (also by PyGame). Emscripten provides a library_sdl.js that emulates/replaces all of the libSDL calls. Just to note that I changed the configuration settings at the top of the library_sdl.js to speed things up.
3. The maximum speed is 60fps, which is controlled by the browser, which you may or may not hit depending on fullscreen or not and your hardware (and to a lesser extent the size of the protein model). I did some profiling (direct timings, not based on fps) to see whether I could shave some more time off the screen refresh, but it turned out that the time for a probably-avoidable screen buffer copy was dwarfed (about 15-20x) by the browser's own copy of the buffer to the canvas (putImageData) - I don't get why that is so slow.
4. While there may not be much point in porting Rasmol to the web in this day-and-age, given the rise of WebGL and indeed Jmol, it was a nice spare time project for me over the last while, and gave me practice with Emscripten. This latter lead me to figure out how to compile the cheminformatics toolkits I discussed in the previous posts.
5. Roger was particularly tickled by the fact that if you "set shadows on" for a spacefill or ball-and-stick model, you get ray tracing in JavaScript in the browser.
6. Here's the code.

Saturday 14 February 2015

cheminformatics.js: Conclusion

And that's brings me to the end of my emscripten adventures. I also looked into compiling Indigo, the core of which compiles fine (*), but as the SVG rendering requires Cairo, it was no-go (although others seem to have gotten it working).

I didn't discuss performance as I didn't look into it. Emscripten creates asm.js friendly JavaScript, which is where some of the extra speed boost comes from, and based on benchmarks the author claims that it runs between half and one tenth of the speed of native code...which I think is pretty good.

For converting Java toolkits to JS, an alternative approach using Google Web Toolkit has been adopted by JMol and JME. And finally, one could actually write a cheminformatics library directly in JavaScript - this is the approach used by the Kemia project.

* A minor patch to gzguts.h is needed (add #include <unistd.h>)

Image credit: "Alon Zakai lowres - Emscripten" by Anna Lena Schiller (CC BY-NC-ND 2.0)

Friday 13 February 2015

cheminformatics.js: Helium

The previous two posts covered some heavyweights; this time let's turn to a newer C++ toolkit called Helium, by Tim Vandermeersch. With this toolkit, as well as depiction, I also decided to showcase Tim's SMILES validator functionality (also to be found in Open Babel as the 'smy' SMILEY format).

And here's the result.

Coming up, how 'twas done.

Compile Boost with Emscripten

Unlike with RDKit where we managed to get around the use of Boost libraries, here we need to deal with Boost face-on as Helium links to five different Boost libraries.

Following this info on StackExchange, I ran bootstrap.sh 'normally' and then edited project-config.jam by replacing "using gcc" with:
using gcc : : "/home/noel/Tools/emscripten/em++" ;
Note that the spaces are significant (e.g. preceding the final semicolon). I also set the install directory as local:
option.set prefix : /home/noel/Tools/boostinstall ;
  option.set exec-prefix : /home/noel/Tools/boostinstall ;
  option.set libdir : /home/noel/Tools/boostinstall/lib ;
  option.set includedir : /home/noel/Tools/boostinstall/include ;
For building static libraries you need the following additional changes in tools/build/src/tools/gcc.jam. Change:
toolset.flags gcc.archive .AR $(condition) : $(archiver[1]) ;
     and
toolset.flags gcc.archive .RANLIB $(condition) : $(ranlib[1]) ;
to
toolset.flags gcc.archive .AR $(condition) : "/full/path/to/emscripten/emar" ;
     and
toolset.flags gcc.archive .RANLIB $(condition) : "/full/path/to/emscripten/emranlib" ;
Now use "./b2" to build the libraries (iostreams, chrono, timer, filesystem, system) as follows (the NO_BZIP2 was added because it was complaining about not being able to find the relevant header file):
./b2 link=static variant=release runtime-link=static threading=single --with-iostreams -sNO_BZIP2=1 --with-chrono --with-timer --with-filesystem --with-system  install

Compile Helium with Emscripten

CMake has some trouble finding the Boost libraries so you just have to set all of the variables manually:
cmake .. -DCMAKE_TOOLCHAIN_FILE=/home/noel/Tools/emscripten/cmake/Modules/Platform/Emscripten.cmake -DBOOST_ROOT=/home/noel/Tools/boostinstall -DBoost_INCLUDE_DIR=/home/noel/Tools/boostinstall/include -DBOOST_LIBRARYDIR=/home/noel/Tools/boostinstall/lib -DBoost_CHRONO_LIBRARY_RELEASE=/home/noel/Tools/boostinstall/lib/libboost_chrono.a -DBoost_SYSTEM_LIBRARY_RELEASE=/home/noel/Tools/boostinstall/lib/libboost_system.a -DBoost_TIMER_LIBRARY_RELEASE=/home/noel/Tools/boostinstall/lib/libboost_timer.a -DBoost_IOSTREAMS_LIBRARY_RELEASE=/home/noel/Tools/boostinstall/lib/libboost_iostreams.a -DBoost_FILESYSTEM_LIBRARY_RELEASE=/home/noel/Tools/boostinstall/lib/libboost_filesystem.a -DEIGEN3_INCLUDE_DIR=/usr/include/eigen3
make -j2
The default build has "-O2"; I changed this to "-O3" in the top-level CMakeLists.txt. It also (separately) had "-g" so I removed this.

To test, use nodejs to run helium.js.

Add Helium to a webpage

The easiest way to add an additional executable is to add another 'tool'. Place webdepict.cpp in the tools directory and edit tools/CMakeLists.txt to compile it. As before, looking at the output of "VERBOSE=1 make webdepict", we can modify it to generate the HTML page which is then tweaked as desired:
/home/noel/Tools/emscripten/em++   -O3  -pedantic -Wall -Wno-long-long -Wno-sign-compare     @CMakeFiles/webdepict.dir/objects1.rsp  -o webdepict.html  ../src/libhelium.so /home/noel/Tools/boostinstall/lib/libboost_chrono.a /home/noel/Tools/boostinstall/lib/libboost_timer.a /home/noel/Tools/boostinstall/lib/libboost_system.a /home/noel/Tools/boostinstall/lib/libboost_iostreams.a /home/noel/Tools/boostinstall/lib/libboost_filesystem.a -s EXPORTED_FUNCTIONS="['_ValidateSmiles', '_SmilesToSVG']" -s DISABLE_EXCEPTION_CATCHING=0  --closure 1

Thursday 12 February 2015

cheminformatics.js: RDKit

Following on from the previous post, the second toolkit we'll look at is RDKit. Again, we want to convert RDKit to JavaScript and use it to depict SMILES in the browser.

And here's the result.

Here's how the sausage was made...

Compile RDKit with Emscripten

I used the latest RDKit release (2014-09-2). When configuring, the key insight is that by turning off every option, we don't need any boost libs. Yippee! I can see myself doing this more often in future...
~/Tools/cmake-3.1.2/bin/cmake .. -DCMAKE_TOOLCHAIN_FILE=/home/noel/Tools/emscripten/cmake/Modules/Platform/Emscripten.cmake -DRDK_BUILD_PYTHON_WRAPPERS=OFF -DRDK_BUILD_CPP_TESTS=OFF -DRDK_BUILD_SLN_SUPPORT=OFF -DBoost_INCLUDE_DIR=/home/noel/Tools/boostinstall/include
make -j2

Add RDKit to a webpage

Create a webdepict.cpp file with the SmilesToSVG functionality and compile it by linking against RDKit. The following CMakeLists.txt may be useful:
project(webdepict)
  cmake_minimum_required(VERSION 3.0)
  include_directories(${RDKIT_INCLUDE_DIR} ${Boost_INCLUDE_DIR})
  add_executable(webdepict webdepict.cpp)
  target_link_libraries(webdepict ${RDKIT_LIBRARIES})
...which can be configured with...
export RDLIB=/home/noel/Tools/rdkit-2014-09-2/lib
  ~/Tools/cmake-3.1.2/bin/cmake .. -DCMAKE_TOOLCHAIN_FILE=/home/noel/Tools/emscripten/cmake/Modules/Platform/Emscripten.cmake -DRDKIT_INCLUDE_DIR=/home/noel/Tools/rdkit-2014-09-2/Code -DBoost_INCLUDE_DIR=/home/noel/Tools/boostinstall/include -DRDKIT_LIBRARIES=${RDLIB}/libRDGeometryLib.so;${RDLIB}/libDepictor.so;${RDLIB}/libSmilesParse.so;$RDLIB/libGraphMol.so;${RDLIB}/libRDGeneral.so"
If tested with nodejs, it will write out the SVG for a molecule.

Finally, as with the previous toolkit, modify the CMake build command to create the HTML page which we then edit as desired. Note that this time we need to turn on exception catching:
/home/noel/Tools/emscripten/em++ -O3 @CMakeFiles/webdepict.dir/objects1.rsp -o webdepict.html @CMakeFiles/webdepict.dir/linklibs.rsp -s EXPORTED_FUNCTIONS="['_SmilesToSVG']" -s DISABLE_EXCEPTION_CATCHING=0

Wednesday 11 February 2015

cheminformatics.js: Open Babel

Following on from the preamble, let's convert Open Babel into JavaScript and then use it to depict SMILES in the browser.

And here's the result.

The following gives an overview of how I did this...

Prerequisites

I did all of this on Linux, in a Xubuntu 14.04 VM running on Windows. To begin with, I'll assume you've installed Emscripten master (and all of its associated dependencies) as well as something like CMake 3.1.2. Note that you need a recent version of CMake; for example the one provided with Ubuntu 14.04 (2.8.x) won't handle the toolchain file correctly.

Compile Open Babel with Emscripten

Check out the latest code from github. For the record, I used revision 75414ad.

We need to compile Open Babel with the plugins included statically. Also, since we only need 4 plugins (ASCII, Smiles, SVG, 2D coordinate generation), we don't want to build and include the other 100 or so. To achieve this aim, some delicate customisation of the build files with a machete is required: apply the sharp edge of said instrument to include/openbabel/plugin.h, src/plugin.cpp, src/formats/formats.cmake and src/CMakeLists.txt (*).

Building now would cause some complaints about asciipainter.cpp, so change the #include for openbabel/obutil.h to <math.h> (note to self: push this upstream).;

Create a build directory 'embuild', and from it run cmake as follows:

~/Tools/cmake-3.1.2/bin/cmake -DCMAKE_TOOLCHAIN_FILE=/home/noel/Tools/emscripten/cmake/Modules/Platform/Emscripten.cmake .. -DBUILD_SHARED=OFF -DENABLE_TESTS=OFF 
make -j2
This builds everything, right down to an obabel.js, which you can test as follows with node.js:
nodejs obabel.js -:c1ccccc1 -oascii

Add Open Babel to a webpage

We're going to create a convenience function for use from the webpage, SmilesToSVG. The appropriate code is in webdepict.cpp. To simplify building, add webdepict alongside babel and obabel in tools/CMakeLists.txt so that it is built as part of the Open Babel build.

Running 'make webdepict/fast' will generate webdepict.js, but we want to create a HTML page instead and tweak some of the settings. I've found that the easiest way to do this is to find the command that CMake is using and then edit it. We can find this with "touch webdepict.cpp && VERBOSE=1 make webdepict/fast". I edited this to give the following
$EMSCRIPTEN/em++   -static  -O3  @CMakeFiles/webdepict.dir/objects1.rsp  -o webdepict.html @CMakeFiles/webdepict.dir/linklibs.rsp -s EXPORTED_FUNCTIONS="['_SmilesToSVG']" --closure 1

This generates webdepict.html which you can open in a webbrowser. I modified the result to add some JavaScript that uses the SmilesToSVG function, et voila.

* A tar.gz file, containing the files I refer to, can be found here.

Saturday 7 February 2015

cheminformatics.js: Preamble

Back in the day I attempted to convert the InChI toolkit to JavaScript with the help of a project called Emscripten. It didn't work great to be honest, but I was excited by the possibilities. Since then, both Emscripten and the project on which it builds, the LLVM/Clang compiler architecture, have become much more mature, with Emscripten being adopted as a Mozilla project (now employing the main developer for example), and Clang now the default compiler on MacOSX. And maybe I've changed too...or read the manual properly...or something.

Anyhoo, let's try to convert a few C++ open source cheminformatics toolkits to JavaScript and see how it goes. The goal in each case will be to create a web page where the user can type in a SMILES string which is depicted as they type.

The first post in the series will attempt to do this with Open Babel...