Tuesday 24 May 2011

(Almost) Translate the InChI code into JavaScript Part III

Following on from Parts I and II...

OK, so we've converted the InChI library to JavaScript. There are two ways to go from here: either call it directly from JavaScript, or write a C function that calls it, convert that to JavaScript too, and then call that instead. The second plan might seem like more work, but it actually makes things easier: calling these JavaScriptified C functions is a bit tricky, especially if we need to pass anything beyond basic C types.

The following code uses the InChI API to read in an InChI, and list the atoms and bonds in a molecule:
#include <stdio.h>
#include <string.h>
#include "inchi_api.h"

int InChI_to_Struct(char* inchi)
{
    inchi_InputINCHI inp;
    inchi_OutputStruct out;
    int i, j, retval;

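    /* Zero both structs so that any fields we don't set are not left uninitialised */
    memset(&inp, 0, sizeof(inp));
    memset(&out, 0, sizeof(out));
    inp.szInChI = inchi;

    retval = GetStructFromINCHI(&inp, &out);
    printf("number of atoms: %d\n", out.num_atoms);
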
    for(i=0;i<out.num_atoms;++i)
    {
      inchi_Atom* piat = &out.atom[i];
      printf("Atom %d: %s\n", i, piat->elname);
      for(j=0;j<piat->num_bonds;++j)
      {
        printf("Bond from %d to %d of type %d\n", i, piat->neighbor[j], piat->bond_type[j]);
      }
    }

    FreeStructFromINCHI( &out );
    return retval;
}   

int test_InChI_to_Struct()
{
    int retval;
    char inchi [] = "InChI=1S/CHClO/c2-1-3/h1H";

    retval = InChI_to_Struct(inchi);
    return retval;
}

I saved the above code along with the InChI library's own C files in inchi_dll/mycode.c, and added it in the two appropriate places in the Makefile so that the compilation as described in Part II created two extra functions in inchi.js.

To test at the command line, you need to edit the run() method to call InChI_to_Struct, and then call the run() method itself. When you do this, you will find that _strtod is not implemented (so you need to add an implementation - I just pass the call to _strtol) and that there are calls to some clock-related functions (I make these just return 0 - to sort this out properly you would need to look at the original C code and figure out what they are used for in this context). So, here it is in action if I call run("InChI=1S/CHClO/c2-1-3/h1H"):
user@ubuntu:~/Tools/inchidemo$ ~/Tools/v8-repo/d8 inchi.js 
number of atoms: 3
Atom 0: C
Bond from 0 to 1 of type 1
Bond from 0 to 2 of type 2
Atom 1: Cl
Bond from 1 to 0 of type 1
Atom 2: O
Bond from 2 to 0 of type 2
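
Each bond appears twice in the listing above, once from each end. As a rough sketch (Python 3, not part of the original code), the text can be folded into one atom table and one bond table. The function name parse_listing and the regular expressions are assumptions written against the exact format printed above.

```python
import re

# Sample text copied from the d8 run shown above.
OUTPUT = """\
number of atoms: 3
Atom 0: C
Bond from 0 to 1 of type 1
Bond from 0 to 2 of type 2
Atom 1: Cl
Bond from 1 to 0 of type 1
Atom 2: O
Bond from 2 to 0 of type 2
"""

ATOM_RE = re.compile(r"^Atom (\d+): (\w+)$")
BOND_RE = re.compile(r"^Bond from (\d+) to (\d+) of type (\d+)$")


def parse_listing(text):
    """Return (atoms, bonds) parsed from the printed listing.

    atoms maps atom index -> element symbol; bonds maps a sorted
    (i, j) index pair -> bond type, so each bond is stored only once.
    """
    atoms, bonds = {}, {}
    for line in text.splitlines():
        line = line.strip()
        m = ATOM_RE.match(line)
        if m:
            atoms[int(m.group(1))] = m.group(2)
            continue
        m = BOND_RE.match(line)
        if m:
            i, j, bond_type = (int(g) for g in m.groups())
            bonds[tuple(sorted((i, j)))] = bond_type
    return atoms, bonds


atoms, bonds = parse_listing(OUTPUT)
print(atoms)
print(bonds)
```

Keying the bonds on a sorted index pair is what removes the duplicates; lines that match neither pattern (such as the atom count) are simply skipped.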

Once tested, you can make a webpage that incorporates it. Using Chrome, check out the InChI JavaScript demo.

So...does it work? Well, almost. For some simple InChIs it works perfectly. For others, it returns an error. There are a couple of ways of tracking down the problem but, you know, I have to draw the line somewhere so I'll leave that as an exercise for the reader. Also, the page needs to be refreshed after each InChI, so there's something wrong there with the way I've set it up. The file size is currently too big, but that can be reduced by leaving out unnecessary functions (for example) as well as by using the techniques discussed in the previous post. Perhaps the biggest problem is that the code maxes out the stack space on Firefox/Spidermonkey - this can probably only be addressed by discussion with the emscripten author and changes to the InChI source code.

So that's where I'll leave it for now. I'm very impressed with how well this works - the whole idea is really quite amazing and I didn't expect to get this far, especially with such a complex piece of code. I'll leave the interested reader with a few questions. Can you track down all the problems and sort them out? What other C/C++ libraries could usefully be converted to JavaScript? And what other languages can be generated from LLVM bytecode?

Supporting info: Various versions of the InChI JavaScript code: vanilla, for running at command-line, ready for webpage, and finally minified.

Acknowledgement: Thanks to kripken, the main emscripten author, for the rapid fix to my reported bug.

Tuesday 17 May 2011

Excel with the Chemistry Development Kit

One of the projects that really astounded me at MIOSS was presented by Kevin Lawson of Syngenta. He has managed to integrate chemistry into Excel, and done so using the freely available and open source toolkit, the Chemistry Development Kit (CDK) (and only using three different programming languages!). The project is called the LICCS System (or Excel-CDK), and the website is at googlecode.

Big deal? Yes - big deal. We may say that R can do everything better than Excel, but the ubiquity of Excel and everyone's familiarity with the spreadsheet metaphor mean that targeting Excel brings basic cheminformatic analysis into the hands of non-cinfs like our colleagues, our bosses and our students.

So, what's this all about? Well, once a spreadsheet has been "chemistry-enabled" you can...
  • click on a SMILES string to see the structure
  • click on a point in a graph and see the structure
  • filter the data by substructure searching (for large datasets you can speed this up by calculating fingerprints first)
  • cluster the data
  • create R group tables
  • calculate molecular properties
If you are at all interested in this, go to the website and check out the Flash video available as a download. This takes you through some of the capabilities of the system.
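
To illustrate why precalculated fingerprints speed up substructure filtering, here is a toy sketch in Python 3. It is not how Excel-CDK or the CDK implement it. Everything in it is an assumption made for illustration: plain substring search on SMILES text stands in for the real (expensive) substructure match, and the fingerprint sets one bit per two-character fragment. The point is only the screening step: if any bit of the query is missing from a molecule's fingerprint, the molecule cannot match and the slow test is skipped.

```python
import zlib

NBITS = 64  # toy fingerprint length; real fingerprints are much longer


def fingerprint(s):
    """Set one bit per distinct two-character fragment of s."""
    fp = 0
    for k in range(len(s) - 1):
        fp |= 1 << (zlib.crc32(s[k:k + 2].encode()) % NBITS)
    return fp


def slow_match(query, target):
    """Stand-in for the expensive substructure test (plain substring search)."""
    return query in target


def screened_search(query, targets, fps):
    """Run slow_match only on targets whose fingerprint contains all query bits."""
    qfp = fingerprint(query)
    hits, tested = [], 0
    for target, tfp in zip(targets, fps):
        if (qfp & tfp) != qfp:  # a query bit is missing: cannot be a match
            continue
        tested += 1
        if slow_match(query, target):
            hits.append(target)
    return hits, tested


library = ["CCO", "CC(=O)O", "c1ccccc1", "CC(=O)Nc1ccccc1", "ClC=O"]
library_fps = [fingerprint(s) for s in library]  # computed once, reused per query

hits, tested = screened_search("C(=O)", library, library_fps)
print(hits, "found after", tested, "full tests out of", len(library))
```

The screen can let through false positives (bits collide), which is why the full test still runs on the survivors, but it never discards a true match: every fragment of the query also occurs in any string that contains the query.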

One interesting aspect is that the software has been cleverly designed for use in a corporate environment. First of all, no installation is required (i.e. admin access is not needed) and secondly, chemistry-enabled spreadsheets can be shared with users who haven't installed the chemistry add-in, so long as they have access to a shared network drive with the required files.

Just a note, to get it all working the first time, you may have to override some security settings in Excel. In Excel 2007, I had to go into the "Trust Center", and in "Macro Settings", enable "Trust access to the VBA project object model". [Update (18/05/11): Kevin says this is only necessary if creating the spreadsheet, not if using one that has already been created.]

Remember - it's an open source project, so get in there and give a hand if you have any ideas for additional features.

Friday 13 May 2011

(Almost) Translate the InChI code into JavaScript Part II

So, following on from Part I...

Let's download the InChI code and try to convert it to JavaScript. To put some sort of figure on the size of the codebase, the C code in INCHI_API/inchi_dll comes to 106K lines (counting everything with "wc -l"), or 4.8 MB.

The usual procedure to compile the InChI code is to type "make" in INCHI_API/gcc_so_makefile. Instead, comment out line 2 of the Makefile and then do the following to run make:
export EMMAKEN_COMPILER=/home/user/Tools/llvm-2.9/cbuild/Release/bin/clang
LINKER=/home/user/Tools/emscripten-git/tools/emmaken.py SHARED_LINK=/home/user/Tools/emscripten-git/tools/emmaken.py C_COMPILER=/home/user/Tools/emscripten-git/tools/emmaken.py make
cd result
export PATH=~/Tools/llvm-2.9/cbuild/Release/bin:$PATH
llvm-dis -show-annotations libinchi.so.1.03.00
This creates libinchi.so.1.03.00.ll, composed of LLVM disassembled bytecode, which we now convert to JavaScript in the same way as with "Hello World" previously:
# Run emscripten
$EMSCRIPTEN/emscripten.py libinchi.so.1.03.00.ll $V8/d8 > inchi.js

# Run the Javascript using v8
$V8/d8 inchi.js
Running it, of course, does nothing - it's just a library. Well, I say "just", but it's about 400K lines of code weighing in at 15 MB. With minification (YUI Compressor) we can get that down to ~7.5 MB, which zips down to 2 MB. The emscripten author recommends passing it through Google Closure (which optimises and minifies the code), but Closure crashes out with some complaint about hitting a recursion limit. I don't know if it's a problem with the JavaScript code, a bug in Closure or just a feature of InChI generation. The generated code also causes SpiderMonkey (and hence Firefox) to complain about maxing out on stack space. Again, I don't know whether there's a way around this.
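
For anyone wanting to put similar figures on their own build, the following Python 3 sketch reports raw and compressed sizes. It is an assumption that gzip is a fair stand-in for "zips" here, and size_report is a made-up helper name; the sample data is synthetic, not the real inchi.js.

```python
import gzip


def size_report(data):
    """Return raw and gzip-compressed sizes (in bytes) for a bytes object."""
    packed = gzip.compress(data)
    ratio = len(packed) / len(data) if data else 0.0
    return {"raw": len(data), "gzip": len(packed), "ratio": ratio}


# Synthetic stand-in for generated code: highly repetitive text.
sample = b"function _f($a) { var $b = $a + 1; return $b; }\n" * 2000
report = size_report(sample)
print(report)
```

To try it on a real file, pass in the file's bytes, for example size_report(open("inchi.js", "rb").read()).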

The next step is to write some code that does something useful with the library. That's all covered in Part III of course.

Tuesday 10 May 2011

Cinfony presentation at MIOSS

I presented Cinfony 1.1 at the recent Wellcome Trust Workshop on Molecular Informatics Open Source Software (MIOSS) at the EBI near Cambridge, UK. Cinfony is a Python library that makes it easy to access several cheminformatics resources through a common and simple API.

This new version of Cinfony is currently in beta while I wait for the release of Open Babel 2.3.1, but is available for download from the Cinfony website (install instructions). The main new features are support for the Indigo toolkit (the general cheminformatics toolkit from GGA Software) and OPSIN (IUPAC name -> structure converter from Daniel Lowe in PMR's group).

The following code shows an example of using OPSIN to read an IUPAC name (this is taken from the TOC graphic for the OPSIN paper) and then using Indigo to calculate the molecular weight:
>>> from cinfony import indy, opsin
>>> opsinmol = opsin.readstring("iupac",
        "(1R,2R,3R,4S)-11-diazo-2,3,4,9-tetrahydroxy-2-"
        "methyl-5,10-dioxo-2,3,4,5,10,11-hexahydro-1H-"
        "benzo[b]fluoren-1-yl acetate")
>>> print indy.Molecule(opsinmol).molwt
412.349639893
Here's the talk I gave: [embedded slides; more presentations from baoilleach]

Monday 9 May 2011

(Almost) Translate the InChI code into JavaScript

If you follow Rich Apodaca's Signals blog (for example), you will be aware that more and more chemistry applications are being implemented in JavaScript. Wouldn't it be nice to be able to take an existing Java or C++ cheminformatics library and convert it to JavaScript?

Well, guess what - in Oct of last year, a project appeared called emscripten that will do just that for C/C++. So without further ado, let's convert the InChI code.

Actually, maybe it'd make more sense to begin with "Hello World":
#include <stdio.h>

int main()
{
   printf("Hello World!\n");
   return 0;
}
To start with, compile llvm, clang, spidermonkey and v8 as described in the install instructions.

Then convert to JavaScript as follows:
#!/bin/sh
LLVM_BINDIR=~/Tools/llvm-2.9/cbuild/Release/bin
EMSCRIPTEN=~/Tools/emscripten-git
V8=~/Tools/v8-repo

$LLVM_BINDIR/clang hello.c -o hello
$LLVM_BINDIR/clang hello.c -S -emit-llvm

$LLVM_BINDIR/llvm-as hello.s
$LLVM_BINDIR/llvm-dis hello.s.bc -show-annotations

# Run emscripten
$EMSCRIPTEN/emscripten.py hello.s.ll $V8/d8 > hello.js

# Run the Javascript using v8
$V8/d8 hello.js
After some trivial edits to the code, we can run hello.js in the browser.

Part II shows my attempt to repeat this procedure with the InChI code.

Sunday 8 May 2011

MIOSS - Open Source in Chemistry workshop

I'm just back from the EBI where I participated in a Wellcome Trust workshop "Molecular Informatics Open Source Software", organised by Mark Forster of Syngenta. I gave a talk on Cinfony (slides to follow), while fellow Open Babel developers Tim Vandermeersch and Chris Morley spoke about Open Babel.

What a great meeting.

It was 50% Open Source Software developers, and 50% pharma industry. Or so I thought at the start. It quickly became apparent that the scientists from pharma were also developing or supporting open source software through a variety of methods. Rajarshi's excellent overview gives a list of some of the talks from industry in this space. I will be following up some of these over the next while.

Of the projects from academia, some that were new to me were Bio-Linux (bio-focused Linux distro available on request as bootable USB sticks from NERC [one of the UK research agencies] - especially useful for teaching), ChemT (GUI for chemical library generation), MOLA (use USB sticks to turn a typical computer lab into a cluster for running AutoDock jobs), and OpenStructure (somewhat similar to PyMol).

The workflow tools Taverna and KNIME are both doing well. Taverna is widely adopted in the bioinformatics community, while KNIME has considerable mindshare in pharma. An interesting aspect of Taverna is that workflows can be stored at http://myexperiment.org, and users can combine other workflows with full attribution. KNIME now has nodes for RDKit and Indigo (as well as most commercial vendors), toolkits which should be familiar to regular readers here.

Silicos now has several open source offerings for library design and general computer-aided drug design: Sieve (filtering on properties), Pharao (pharmacophore alignment), Piramid (shape-based alignment), Stripper (extract molecular scaffolds according to a variety of schemes), and Spectrophores (3D fingerprint for QSAR). An interesting development is the ProfiLib web application, which returns a PDF that characterises an uploaded molecular library (in SDF form).

There may be some outcomes from this meeting over the next while, so stay tuned...

Sunday 1 May 2011

Questions for On-line Cheminformatics Tutorial

Some time ago I set up an on-line cheminformatics tutorial based on the "Try Python" software of Michael Foord (see my blog post for details).

When a reader mentioned in passing that he found this tutorial useful for introducing people to cheminformatics concepts, it made me realise that it might actually be useful for something. And so, when I had to put together a practical on the subject of Cheminformatics a few weeks ago, I decided to use this tutorial as the basis. I had the students go through the tutorial and answer a series of questions based around each of the three main chapters (Introduction to Cinfony, Descriptors, and Fingerprints).

Does anyone have any suggestions for other cheminformatics practicals?

Notes: This is Windows and Mac only. "Try Python" should (but does not) run on Linux under Moonlight. The problem was reported to the Moonlight devs back in 2009.