Noel O'Blog: 2011

Wednesday, 21 December 2011

Open Babel: 10 Years and Future Directions

On the Open Babel mailing list, Geoff looks back on 10 years of the project, and looks forward to the future:

As 2011 draws to a close, Open Babel is over 10 years old! At this point, it's used by over 40 open source projects, downloaded over 200,000 times, and been used in over 400 academic papers. And of course, there have been 15 releases and dozens of contributors.

...Read the rest

See you in 2012!

Monday, 5 December 2011

Poll over and discuss Goslar

So maybe my poll question was not very difficult (given that 50% of you got it right), but I thought it was quite surprising nonetheless when I came across a guest editorial by Steve Heller* in Anal. Chim. Acta in 1982, entitled "Where have all the data gone?":

"Unless something real and practical is done in the near future, it will become impossible to find or use scientific data with the resulting loss of time and money for those who need to repeat experiments."

The future alluded to is the one in which we now live. A follow-up letter, "Computer readable analytical chemical data - comments on a critical need" in Trends in Anal. Chem. discusses this further.

I was put in mind of these articles at the recent German Conference on Chemoinformatics (GCC2011) in Goslar, when (a) I met Steve Heller, and (b) in Prof Johnny Gasteiger's talk, he highlighted this same problem as one of the outstanding challenges that we should be sorting out. PMR of course has been discussing this issue for some time, but it's the first time I'd heard Prof Gasteiger mention it.

Since I'm on the subject of the GCC, it was good to meet several people who I know through the Open Babel mailing lists, and in particular Michael Banck, who plays a major role in curating chemistry software for Debian. For example, see this list of packages. His talk is available on Lanyrd.

In a recent blogpost I mentioned that Open Access makes it easy to redistribute copies of papers, and I wondered why OA journals don't take advantage of this. Well, it turns out they are - Jan Kuras of Chemistry Central was giving out nice colour copies of the Open Babel paper printed in booklet form, along with similar booklets summarising the three series they have recently published on RDF, PubChem3D and PMR's Symposium.

And finally here's a picture of me trying to steal a pretzel from one of the FIZ-Chemie Berlin Award winners, Dr. Volker Dirk Hähnke, who gave a very interesting talk on using sequence alignment methods to align a string representation of a chemical graph:

Footnote:
* You may know of Steve from such string representations as the InChI. Incidentally, I thought I was blazing a trail putting my talks on the web, but check out Steve's page.

Thursday, 1 December 2011

Cinfony 1.1 released

Cinfony presents a common API to several cheminformatics toolkits. It uses the Python programming language, and builds on top of Open Babel, the RDKit, the CDK, Indigo, OPSIN and cheminformatics webservices.

Cinfony 1.1 is now available for download.

The two major additions in this release are support for using the Indigo cheminformatics toolkit (the indy module) and support for OPSIN (IUPAC name to structure, the opsin module).

As usual, Cinfony has been updated to use the latest stable releases of each toolkit: Open Babel 2.3.1, CDK 1.4.5, RDKit 2011.09, Indigo 1.0 and OPSIN 1.1. Installation on Windows has also been simplified somewhat as Open Babel 2.3.1 now includes the necessary .jar file and .NET libraries (for use from Jython and IronPython).

The Cinfony website has a somewhat condensed (and only slightly contrived :-) example showing the use of all of these resources in just 12 lines of Python. Here's a small example showing that roundtripping of IUPAC names is now easy to play with:

>>> from cinfony import opsin, webel
>>> mol = opsin.readstring("iupac",
                           "1-chloro-2-bromopropane")
>>> print webel.Molecule(mol).write("iupac")
2-Bromo-1-chloropropane

To support Cinfony, please cite:

N.M. O'Boyle, G.R. Hutchison, Chem. Cent. J., 2008, 2, 24. [link]

Friday, 25 November 2011

Your turn - Poll up and answer

It's been a while, so here's a poll (see the sidebar on the left).

In which decade was the following statement made? Guess before you google it, and extra marks if you know who the author was also.

With the ever-increasing number of publications, coupled with higher printing costs, there is great pressure brought on journal editors to keep manuscripts as short as possible. While this is quite understandable, it has, in my opinion, lead to a very serious problem. The vast majority of (hopefully) good analytical data, such as spectroscopic, kinetic and thermodynamic measurements, is never readily made available to the scientific community. Published data are often so "compressed" that one is unable to examine alternative interpretations, as the published data are not sufficient. Partial data are preferred to complete data...

Why doesn't [______] make it a policy to require authors to submit full data on spectroscopic and other data for which there are existing data centres? Furthermore, I propose that the editors of this journal and other such journals establish criteria for collecting relevant data for which no data center exists today in order to prepare for the future. Perhaps it is time for a conference of journal editors to meet and propose a solution to this problem. Unless something real and practical is done in the near future, it will become impossible to find or use scientific data with the resulting loss of time and money for those who need to repeat experiments.

Note:
(1) I've mixed up the spelling of center/re to protect the innocent.
(2) Poll closes in 7 days.
(3) Please - no spoilers in the comments.

Friday, 18 November 2011

Displaying pharmacophores from Pharao in Avogadro

Following on from the equivalent post for Pharmer, here is an extended Avogadro plugin that works for both Pharmer and Pharao generated pharmacophores.

The plugin allows the user to click through all of the pharmacophores defined in a file and display them one by one. See install instructions in the previous post. The code is kinda rough-and-ready so be prepared to hack around if you need to sort something out:

Wednesday, 16 November 2011

Plotting accesses on the axis

It's quite interesting to monitor the page accesses for publications in J. Cheminf. (which incidentally has just gotten a much improved interface). While page accesses are not citations, they do provide a measure of general interest in a paper. Also, if you reach a certain level (relative to other publications in the journal), your paper gets the "Highly Accessed" badge, which helps to highlight it. PMR has been talking about this too.

I monitored the page accesses in the first month for two recent papers, the Blue Obelisk paper and the Open Babel paper. I missed a day or two here or there, but the results are shown in this Google spreadsheet graph:
It is worth noting that about half of the accesses in the month occur in the first week. At that stage only the preliminary manuscript is available (I think it was two weeks later that the final HTML and PDF was produced). Furthermore, the DOI was not registered until the HTML went live and thus you need provide the direct URL rather than DOI if promoting the paper (I've complained about this to the journal).

Thursday, 3 November 2011

My new book, made of 100% recycled papers

There are many advantages associated with publishing with an Open Access journal...but I'm not going to go into these now (see recent posts on PMR's blog [e.g. this one] for a comprehensive background if the advantages are not self-evident). Here the point I want to make is that the authors of OA papers can do things with their papers which are not allowed by traditional publishers.

As a trivial example, authors of OA publications can legally distribute copies of their papers - sort of handy if you're a scientist, eh? :-) With most Open Access publishers, the author does not transfer copyright to the publisher but rather licenses it to them under a Creative Commons License. In other words, the author retains the right to do whatever they want with the paper: they can copy+paste the text into their blog (insert example here if someone can send me one), they can insert the paper as an appendix to another publication (as I did with the Open Babel book), they can magnify it to A0 size and present it as a poster or art installation or whatever... (I've sometimes wondered why Open Access journals don't make more capital out of this difference between themselves and closed journals - for example, they could hand out the top 10 most accessed papers at conferences.)

Well, in the spirit of exercising my right to recycle my OA papers as I like, I've put together a book (well, a PDF at least) consisting of the various PDFs stitched together with a Table of Contents. Here is the result (a 15MB PDF):

If I wanted to, I could upload this to Lulu and allow people to order a copy in the post, all nicely bound with a cover. Maybe another day.
Notes: The PDF was created based on comments on the Debian message board. To avoid confusion over page numbers, I added the short form of the appropriate reference to the footer of all of the papers. The following Python script was used to automate some of the steps:

import os

latex="""\\documentclass[12pt,a4paper]{book}
\\usepackage{multido}
\\usepackage[hmargin=.8cm,vmargin=0.5cm,nohead,nofoot]{geometry}
\\usepackage[explicit]{titlesec}

\\newcommand*\\Hide{%%
\\titleformat{\\chapter}[display]
  {}{}{0pt}{\\Huge\\thispagestyle{empty}}
\\titleformat{\\part}
  {}{}{0pt}{}
}

\\begin{document}
\\pagestyle{empty}

\\title{Open Access Publications of\\\\Noel O'Boyle}
\\maketitle
\\tableofcontents

%s

\\end{document}
"""

data = {
    "BO.pdf": [15,
"""Open Data, Open Source and Open Standards in chemistry: The Blue Obelisk five years on

N. M. O'Boyle, R. Guha, E. L. Willighagen, S. E. Adams, J. Alvarsson, J.-C. Bradley, I. V. Filippov, R. M. Hanson, M. D. Hanwell, G. R. Hutchison, C. A. James, N. Jeliazkova, A. S. I. D. Lang, K. M. Langner, D. C. Lonie, D. M. Lowe, J. Pansanel, D. Pavlov, O. Spjuth, C. Steinbeck, A. L. Tenderholt, K. J. Theisen and P. Murray-Rust.
J. Cheminf. 2011, 3, 37."""],
    "GM.pdf": [12,
"""Userscripts for the life sciences

E. L. Willighagen, N. M. O'Boyle, H. Gopalakrishnan, D. Jiao, R. Guha, C. Steinbeck and D. J. Wild.
BMC Bioinformatcs. 2007, 8, 487."""],
    "Cinfony.pdf": [10, """Cinfony - combining Open Source cheminformatics toolkits behind a common interface

N. M. O'Boyle and G. R. Hutchison.
Chem. Cent. J. 2008, 2, 24."""],
    "Confab.pdf": [9, """Confab - Systematic generation of diverse low-energy conformers

N. M. O'Boyle, T. Vandermeersch, C. J. Flynn, A. R. Maguire and G. R. Hutchison.
J. Cheminf. 2011, 3, 8."""],
    "AntColony.pdf": [15, """Simultaneous feature selection and parameter optimisation using an artificial ant colony: case study of melting point prediction

N. M. O'Boyle, D. S. Palmer, F. Nigsch and J. B. O. Mitchell.
Chem. Cent. J. 2008, 2, 21."""],
    "DataAnalysis.pdf": [2, """Review of ``Data Analysis with Open Source Tools"

N. M. O'Boyle.
J. Cheminf. 2011, 3, 10."""],
    "MACIE2005.pdf": [2, """MACiE: a database of enzyme reaction mechanisms

G. L. Holliday, G. J. Bartlett, D. E. Almonacid, N. M. O'Boyle, P. Murray-Rust, J. M. Thornton and J. B. O. Mitchell.
Bioinformatics. 2005, 21, 4315-4316."""],
    "MACIE2007.pdf": [6, """MACiE (Mechanism, Annotation and Classification in Enzymes): novel tools for searching catalytic mechanisms

G. L. Holliday, D. E. Almonacid, G. J. Bartlett, N. M. O'Boyle, J. W. Torrance, P. Murray-Rust, J. B. O. Mitchell and J. M. Thornton.
Nucleic Acid Res. 2007, 35, D515-D520."""],
    "OB.pdf": [14, """Open Babel: An open chemical toolbox

N. M. O'Boyle, M. Banck, C. A. James, C. Morley, T. Vandermeersch and G. R. Hutchison.
J. Cheminf. 2011, 3, 33."""],
    "PyChem.pdf": [2, """PYCHEM: a multivariate analysis package for python

R. M. Jarvis, D. Broadhurst, H. Johnson, N. M. O'Boyle and R. Goodacre.
Bioinformatics. 2006, 22, 2565-2566."""],
    "Pybel.pdf": [7, """Pybel: a Python wrapper for the OpenBabel cheminformatics toolkit

N. M. O'Boyle, C. Morley and G. R. Hutchison.
Chem. Cent. J. 2008, 2, 5."""]
    }

order = [["Cheminformatics toolkits", ['Pybel', 'Cinfony', 'OB']],
         ["Enzyme reaction mechanisms", ['MACIE2005', 'MACIE2007']],
         ["QSAR", ['PyChem', 'AntColony']],
         ["The Rest", ["GM", "Confab", "DataAnalysis", "BO"]]
         ]

def formatjournal(text):
    """Format the journal metadata

    >>> formatjournal("Chem. Cent. J. 2008, 2, 5.")
    '\\\\textit{Chem. Cent. J.} \\\\textbf{2008}, \\\\textit{2}, 5.'
    """
    broken = text.split(" ")
    pages = broken[-1][:-1]
    year = broken[-3][:-1]
    volume = broken[-2][:-1]
    journal = " ".join(broken[:-3])
    return "\\textit{%s} \\textbf{%s}, \\textit{%s}, %s." % (journal, year, volume, pages)
    
def test():
    import doctest
    doctest.testmod()

if __name__ == "__main__":
    # Sanity checks
    N = 0
    for x, y in data.iteritems():
        assert os.path.isfile(x)
        N += y[0]
    assert len(data) == sum(len(y) for x, y in order)

    # Write latex
    output = []
    for a, b, in order:
        output.append("\\part{%s}{\\Hide" % a)
        for paper in b:
            pages, biblio = data["%s.pdf" % paper]
            broken = biblio.split("\n")
            title = broken[0]
            output.append("\\chapter{%s}" % title)
            authors = broken[2]
            journaldata = formatjournal(broken[3])
            output.append("""\\multido{}{%d}{\\null\\vfill
    %s
    \\newpage}""" % (pages, journaldata))
        output.append("}")

    with open("generated.tex", "w") as f:
        f.write(latex % "\n".join(output))

    # Join Papers together
    pages = 1
    paper_idx = 1
    names = "A=generated.pdf"
    cat = "A2 A2 A2 A2"
    pages += 4
    for a, b in order:
        cat += " A2 A2"
        pages += 2
        for paper in b:
            # New chapters always on odd-number pages
            if pages % 2 == 0:
                cat += " A2"
                pages += 1
##            print pages
            paper_idx += 1
            papername = chr(64+paper_idx)
            names += " %s=%s.pdf" % (papername, paper)
            cat += " %s" % papername
            pages += data["%s.pdf" % paper][0]
    print >> open("run.bat", "w"), "pdftk %s cat %s output combined.pdf" % (names, cat)

    print """Now run...
pdflatex generated
pdflatex generated
run.bat
pdftk combined.pdf burst output tmp\\file_%03d.pdf
pdftk generated.pdf burst output tmp\\numbers_%03d.pdf

bash-3.2$ cd tmp && for i in `seq -w 1 109`; do pdftk file_$i.pdf background numbers_$i.pdf output new-$i.pdf; done

pdftk tmp\\new-???.pdf output new.pdf
"""

Thursday, 20 October 2011

Open Babel 2.3.1 released

Coming close on the heels of the Open Babel paper is the release of Open Babel 2.3.1.

As announced by Geoff on the mailing list:

I am very happy to finally announce the release of Open Babel 2.3.1, a major update release of the open source chemistry toolbox.

Open Babel has been downloaded nearly 200,000 times and is used in over 45 projects and over 400 publications.
http://www.jcheminf.com/content/3/1/33

This release represents a major bug-fix-release and should be a stable upgrade, strongly recommended for all users of Open Babel. Many bugs and enhancements have been added in the last year since the 2.3.0 release.

What's new? See the full release notes

See the new user guide.

See the updated developer documentation.

To download, see:
http://sourceforge.net/projects/openbabel/files/

For more information, see the project website.

I'd particularly like to thank Chris Morley, Noel O'Boyle, and many others who put large amounts of time testing and improving this release.

This is a community project and we couldn't have made this release without you. Many thanks to all the contributors to Open Babel including those of you who submitted feedback, bug reports, and code.

Cheers,
-Geoff

---
Prof. Geoffrey Hutchison
Department of Chemistry
University of Pittsburgh
email: geoffh@pitt.edu
web: http://hutchison.chem.pitt.edu/

Monday, 17 October 2011

The Blue Obelisk - An update after 5 years

The Blue Obelisk group was established at the Spring ACS Meeting in 2005. Following on from the presentation/poster I gave at PMR's Symposium in January, I put together an overview of the activities of the group over the past 5 years with help from Rajarshi Guha, Egon Willighagen and Peter Murray-Rust, and further contributions on particular projects from many more.

This paper has just appeared in Journal of Cheminformatics as part of the PMR Symposium themed issue:
Open Data, Open Source and Open Standards in chemistry: The Blue Obelisk five years on Noel M O'Boyle, Rajarshi Guha, Egon L Willighagen, Samuel E Adams, Jonathan Alvarsson, Jean-Claude Bradley, Igor V Filippov, Robert M Hanson, Marcus D Hanwell, Geoffrey R Hutchison, Craig A James, Nina Jeliazkova, Andrew SID Lang, Karol M Langner, David C Lonie, Daniel M Lowe, Jerome Pansanel, Dmitry Pavlov, Ola Spjuth, Christoph Steinbeck, Adam L Tenderholt, Kevin J Theisen, Peter Murray-Rust.
Journal of Cheminformatics 2011, 3:37.

Here's the abstract:

Background

The Blue Obelisk movement was established in 2005 as a response to the lack of Open Data, Open Standards and Open Source (ODOSOS) in chemistry. It aims to make it easier to carry out chemistry research by promoting interoperability between chemistry software, encouraging cooperation between Open Source developers, and developing community resources and Open Standards.

Results
This contribution looks back on the work carried out by the Blue Obelisk in thjavascript:void(0)e past 5 years and surveys progress and remaining challenges in the areas of Open Data, Open Standards, and Open Source in chemistry.

Conclusions
We show that the Blue Obelisk has been very successful in bringing together researchers and developers with common interests in ODOSOS, leading to development of many useful resources freely available to the chemistry community.

Check out the other papers in the themed series over at Journal of Cheminformatics.

Thursday, 13 October 2011

Recognise this? Roundtripping chemical images

With the imminent release of Open Babel 2.3.1, I thought I'd come up with some examples of use for a new feature, PNG depiction.

To generate a PNG with Open Babel you just use the PNG output format:

obabel -:CC(=O)Cl -O tmp.png

Open Babel actually allows you to embed the chemical structure (in any format) directly into a new or existing PNG file. If you do this, then you can roundtrip as follows:

> obabel -:CC(=O)Cl -O tmp.png -xO smi
> obabel tmp.png -osmi
CC(=O)Cl

If you haven't embedded a chemical structure in the image, you'll have to use optical chemical recognition software such as the open source OSRA (Igor Filippov) or Imago (GGA Software). Both of these can output a MOL/SDF file, which contains the 2D coordinates of the perceived structure, and this can be depicted. I did this for a set of 450 images from the Japanese Patent Office as follows:

> for %a in (*.tif) do "C:\Program Files (x86)\osra\1.3.8\osra.bat" %a --format sdf | obabel -isdf -O %~na_osra.png -d
> for %a in (*_chem.png) do "C:\Program Files\GGA Software\Imago Toolkit\alter_ego.exe" %a -o tmp.mol -q && obabel tmp.mol -O %~na_imago.png -d

The results are here: Subset 1 2 3.

Notes:
1. Open Babel depiction for large molecules needs to be fixed, as the lines get faint and disappear in some cases. [Update (26/03/2012): Now fixed]
2. The tiff files needed to be converted to pngs for Imago (used a "for" loop with Imagemagick convert).
3. In the case of multiple molecules in the OSRA output, only the first molecule is depicted (I think).
4. Several structure gave error messages when depicting the Imago structures due to unrecognised labels. I think there's a way around this but I didn't look into it.

Friday, 7 October 2011

Open Babel paper published in Journal of Cheminformatics

After almost 10 years as an independent project, the paper is finally here...

Open Babel: An open chemical toolbox N. M. O'Boyle, M. Banck, C. A. James, C. Morley, T. Vandermeersch and G. R. Hutchison. Journal of Cheminformatics 2011, 3:33.

Here's the abstract:

Background
A frequent problem in computational modeling is the interconversion of chemical structures between different formats. While standard interchange formats exist (for example, Chemical Markup Language) and de facto standards have arisen (for example, SMILES format), the need to interconvert formats is a continuing problem due to the multitude of different application areas for chemistry data, differences in the data stored by different formats (0D versus 3D, for example), and competition between software along with a lack of vendor-neutral formats.

Results
We discuss, for the first time, Open Babel, an open-source chemical toolbox that speaks the many languages of chemical data. Open Babel version 2.3 interconverts over 110 formats. The need to represent such a wide variety of chemical and molecular data requires a library that implements a wide range of cheminformatics algorithms, from partial charge assignment and aromaticity detection, to bond order perception and canonicalization. We detail the implementation of Open Babel, describe key advances in the 2.3 release, and outline a variety of uses both in terms of software products and scientific research, including applications far beyond simple format interconversion.

Conclusions
Open Babel presents a solution to the proliferation of multiple chemical file formats. In addition, it provides a variety of useful utilities from conformer searching and 2D depiction, to filtering, batch conversion, and substructure and similarity searching. For developers, it can be used as a programming library to handle chemical data in areas such as organic chemistry, drug design, materials science, and computational chemistry. It is freely available under an open-source license from http://openbabel.org.

Monday, 26 September 2011

chemfp 1.0 - Get your fingerprints off this

Andrew Dalke has just released V1.0 of his chemical fingerprint package, chemfp. The goal of the project is to produce a standardised file format called FPS for chemical fingerprints, as well as a set of tools around them. The project itself is a Python module with some superfast C code.

Why is this needed? Well, right now, if you wanted to use several different toolkits to generate fingerprints and then compare and contrast their use for some application (e.g. 2D similarity searching), you would have to go to quite some lengths to figure out how to handle the file format from each toolkit, and so forth. You could of course avoid the file format completely and use a standard API such as Cinfony, but it's often useful to precalculate the fingerprint and store as a file (also, you may not have access to the toolkit, only a file of fingerprints).

So, if you have the appropriate toolkit, you can use chemfp to generate fingerprints in the FPS format. For example, if you have the Python bindings for Open Babel, you can use generate the FP2, FP3, FP4 and MACCS fingerprints in FPS format. Other toolkits are supported, namely OEChem and RDKit, with their own fingerprints. Of course, it would make sense for this format to be supported by the toolkits themselves, and indeed Cactvs already supports the FPS format, as does Rajarshi's fingerprint R package. Direct FPS support by Open Babel is also on the cards.

What about the tools around this format? Right now, there's only a similarity search tool, but that already is very useful as (for example) it supports "many against many" searches, a feature which I have heard requested by several Open Babel users. More tools are on the way though. It's also possible to write your own tools using the Python API. For example, Andrew has written up examples on generating a distance matrix and drawing a cluster dendrogram (in about 20 lines of code, albeit with the help of a few libraries), and on Taylor-Butina clustering (this one maybe 40 lines).

So check it out at http://chem-fingerprints.googlecode.com. In particular, the documentation is excellent.

Image credit: Jack of Spades

Tuesday, 13 September 2011

Comments on 5th Meeting on U.S. Government Chemical Databases and Open Chemistry

As previously discussed, I attended and presented at the 5th Meeting on U.S. Government Chemical Databases and Open Chemistry in Frederick, Maryland. Talks were 20 min plus 5 for questions and so quite a lot of ground was covered over the two days.

I understand that Marc is going to put up many of the talks on the web somewhere (insert link here from future), but in the meanwhile you can check out several of the talks over at Lanyrd. The image above is from John Overington's talk (included with permission), and is meant as a joke as in reality the various databases really do cooperate quite well. In fact there was some talk (-ing) of just having a single database maintained by several partners (like the PDB does it).

Now some random notes on the meeting...

It's interesting to note that many of the ideas championed by PMR have now mainstreamed. As an example, the database people were very clear that chemical data should be available directly from publications (in this day and age) rather than have to manually (or automatically) read the literature and enter it in as ChEMBL (and many others) are doing. Attendees on various editorial boards were planning to try to make this happen, although it was noted that submissions to non-prestige journals decrease with every additional hurdle for authors. The counter argument was made that providing these data would make the publication more discoverable, something academia is very keen on.

Jonathan Goodman took a break from finding InChIKey collisions to describe some work on the development of RInChI, a variant on InChI for reaction indexing.

Jasmine Young from the PDB pointed out that reviewers have a role in ensuring data quality too; the PDB is an archival site, not a policing site, and journal reviewers should request the validation report from the PDB when reviewing papers with protein crystal structures.

I learnt, at least a bit, about computational toxicology programmes. Ann Richard from the National Center for Computational Toxicology (part of the EPA) described Tox21 and ToxCast, two large screening programmes. Tox21 has 10,000 chemicals with 50->100 assays, while ToxCast has 960 chemical and 500 assays (E&OE). The overall idea is to generate high quality tox data that can be used to develop tox models and essentially push the whole field forward. QC review of the chemicals bought from suppliers was an important and essential part of the process; 20->30% of purchased chemicals had "issues" (purity but also identity).

Nina Jeliazkova described the EU OpenTox programme (see slides at link above). Apparently, this programme is just complete but it has successfully developed a set of resources, principally a REST interface to a series of webservices for QSAR. Essentially, these webservices can be used to build a model and make predictions. Applications can be built using these webservices, e.g. ToxPredict and ToxCreate.

On the way back from the meeting, I decided to stop off in Keflavik airport in Iceland due to smoke in the cockpit. Any recommendations for must-have Icelandish merch for next time? I held back from buying a jar of volcanic ash from Eyjafjallajökull, and went for the liquorice instead.

Monday, 29 August 2011

Presentation on Open Babel and Databases at 5th Meeting on U.S. Government Chemical Databases and Open Chemistry

I was fortunate to be able to attend and present at last week's 5th Meeting on U.S. Government Chemical Databases and Open Chemistry in Frederick, Maryland. This was really an excellent meeting, and I have to thank Marc Nicklaus for the invitation and support to attend, and especial thanks to Julia Lam for the superb organisation.

I'll blog separately about the meeting itself, but first here's my talk, "Improving the quality of chemical databases with community-developed tools (and vice versa)", which discussed improvements in Open Babel over the last 2 years in the handling of chemical data, and using Open Babel to find errors in databases:

View more presentations from baoilleach

In case it's not obvious, the work presented in the first half involved all of the Open Babel developers, each working on different aspects of the toolkit.

Wednesday, 3 August 2011

The electron density is not an isosurface

There's no need for us to keep drawing isosurfaces for the electron density and other volumetric data generated by QM calculations. Driven by the engineering field, there are a couple of really amazing open source visualisation toolkits out there that we can use to do so much more. One of these is the VTK.

David Lonie, as part of Google Summer of Code with Marcus Hanwell of VTK/Avogadro as mentor, is integrating aspects of the VTK into Avogadro (see this post for example). But I can't wait, so I've busted out the VTK myself and now present the ELF of water:

To generate this type of image, install Mayavi2 (not exactly trivial to be honest), and then run the following Python script to generate PNG files (this script is adapted from a Mayavi example, which you may want to run first):

"""
In this example, we display the H2O molecule, and use volume rendering to
display the electron localization function.

The atoms and the bounds are displayed using mlab.points3d and
mlab.plot3d, with scalar information to control the color.

The electron localization function is displayed using volume rendering.
Good use of the `vmin` and `vmax` argument to
`mlab.pipeline.volume` is critical to achieve a good visualization: the
`vmin` threshold should placed high-enough for features to stand out.

The original is an electron localization function from Axel Kohlmeyer.
"""
# Author: Gael Varoquaux <gael.varoquaux@normalesup.org>
# Copyright (c) 2008, Enthought, Inc.
# License: BSD Style.
# Edited by: Noel O'Boyle <baoileach@gmail.com>

# Retrieve the electron localization data for H2O ##############################
import os
if not os.path.exists('h2o-elf.cube'):
    # Download the data
    import urllib
    print 'Downloading data, please wait'
    opener = urllib.urlopen(
        'http://code.enthought.com/projects/mayavi/data/h2o-elf.cube'
        )
    open('h2o-elf.cube', 'wb').write(opener.read())


# Plot the atoms and the bonds #################################################
import numpy as np
from mayavi import mlab
mlab.options.offscreen = True
mlab.figure(1, bgcolor=(0, 0, 0), size=(350, 350))
mlab.clf()

# The position of the atoms
atoms_x = np.array([2.9, 2.9, 3.8])*40/5.5
atoms_y = np.array([3.0, 3.0, 3.0])*40/5.5
atoms_z = np.array([3.8, 2.9, 2.7])*40/5.5

O = mlab.points3d(atoms_x[1:-1], atoms_y[1:-1], atoms_z[1:-1],
                  scale_factor=3,
                  resolution=20,
                  color=(1, 0, 0),
                  scale_mode='none')

H1 = mlab.points3d(atoms_x[:1], atoms_y[:1], atoms_z[:1],
                   scale_factor=2,
                   resolution=20,
                   color=(1, 1, 1),
                   scale_mode='none')

H2 = mlab.points3d(atoms_x[-1:], atoms_y[-1:], atoms_z[-1:],
                   scale_factor=2,
                   resolution=20,
                   color=(1, 1, 1),
                   scale_mode='none')

# The bounds between the atoms, we use the scalar information to give
# color
mlab.plot3d(atoms_x, atoms_y, atoms_z, [1, 2, 1],
            tube_radius=0.4, colormap='Reds')

# Display the electron localization function ###################################

## Load the data, we need to remove the first 8 lines and the '\n'
str = ' '.join(file('h2o-elf.cube').readlines()[9:])
data = np.fromstring(str, sep=' ')
data.shape = (40, 40, 40)

source = mlab.pipeline.scalar_field(data)
min = data.min()
max = data.max()
vol = mlab.pipeline.volume(source, vmin=min+0.65*(max-min),
                                   vmax=min+0.9*(max-min))

mlab.view(132, 54, 45, [21, 20, 21.5])

##mlab.show()

f = mlab.gcf()
for i in range(36):
    f.scene.camera.azimuth(10)
    f.scene.render()
    mlab.savefig('example_%02d.png' % i)

To create an AVI from these PNGs, use mencoder:

C:\Tools\mayavi>..\MPlayer-p4-svn-33883\mencoder.exe "mf://example_%02d.png" -mf fps=10 -o anim.avi -ovc lavc -lavcopts vcodec=msmpeg4v2:vbitrate=500

To create an animated GIF use ImageMagick:

C:\Tools\mayavi>convert -delay 10 -loop 0 example_*.png animation.gif

Sunday, 24 July 2011

Depict a chemical structure...without graphics Part II

In an earlier post, I showed how PNGs depicting 2D molecular structures could be viewed in a text terminal. This can be useful when you're logged into a remote server and don't have access to a graphical viewer.

Just for fun, I've thrown together an ASCII depiction format for Open Babel that uses aalib. But this time, rather than create a PNG depiction and then convert that to ASCII, this "Painter" (in the language of the Open Babel API) draws the structure directly onto an ASCII canvas. This should lead to better resolution, although I haven't done a proper head-to-head comparison. Here's how it looks in action (compare to the images in my earlier post):

Note that I haven't checked this code into SVN, and I'm not sure I will. For my own reference, I've stored some of the relevant files at this link (diff against r4530). Includes code taken from aa-helper.

Thursday, 21 July 2011

Presentation on Solar Cells at Goslar (Nov 2010)

Now that the paper has come out, I think it's safe to post the corresponding talk I gave at the 6th German Conference on Chemoinformatics at Goslar last November:

De novo design of molecular wires with optimal properties for solar energy conversion

View more presentations from baoilleach

Tuesday, 19 July 2011

Displaying pharmacophores from Pharmer in Avogadro

Avogadro has a plugin architecture that allows you to drop a Python script into a particular folder to add additional functionality, whether it be a new menu item, a new display method or whatever. I've previously used this to write an extension to read comp chem log files using cclib. This time I need to add a new display method (or Engine in the lingo of Avogadro) that will depict a pharmacophore.

So what's Pharmer? Well, until the release of Pharao by Silicos last year there were no open source pharmacophore programs. And the recent release of Pharmer by David Koes bumps the number up to two. Pharmer focuses on pharmacophore search - ultra-fast pharmacophore search, that is. It does this by precomputing an index that allows very efficient searching. The common denominator of Pharmer and Pharao is not only "Phar", but also that both use the Open Babel library.

The file format used by Pharmer is simply JSON, which is easily edited by a text editor. Here's an example that was created using the ZINC Pharmer interface: HSP90.txt. If you install Avogadro 1.0.1 with Python support on Windows, also have Python 2.6 installed in C:\Python26, and copy the Python script shown below to C:\Program Files (x86)\Avogadro\bin\engineScripts, then you can add the Pharmacophore engine (click "Add" in the "Display Types" panel; note: you can add more than one Pharmacophore engine if you want to display more than one pharmacophore at the same time). Once added, click on Settings and Browse to open the Pharmer JSON file. If nothing appears on the screen, try rotating the display slightly.

Note: The image above was created by joining the output of the following Pharmer command using "obabel --join" and depicting with Avogadro:

pharmer dbsearch -dbdir db_712_actives -in HSP90.txt -out results.sdf -reduceconfs 1 -sort-rmsd

pharmacophore.py:

import os
from PyQt4.Qt import *
import Avogadro as avo
from numpy import array

# Get JSON from my own Python installation
import sys
sys.path.append("C:\\Python26\\Lib\\site-packages")
sys.path.append("C:\\Python26\\lib")
import json

colorlookup = {
  'Hydrophobic': [0, 139, 69],
  'HydrogenAcceptor': [255, 165, 0],
  'HydrogenDonor': [211, 211, 211],
  'Aromatic': [205, 0, 205],
  'PositiveIon': [0, 0, 238],
  'NegativeIon': [255, 0, 0],
  }

class Engine(QObject):
  # declare the changed() signal
  __pyqtSignals__ = ("changed()",)

  # constructor
  def __init__(self):
    QObject.__init__(self)
    self.widget = None
    self.textbox = None
    self.filename = ""
    self.pharma = []

  def name(self):
    return "Pharmacophore"

  def flags(self):
    return avo.EngineFlags.NoFlags

  def debug(self, text):
    return
    filename = os.path.join("C:\\", "Users", "Noel", "Desktop", "tmp.txt")
    output = open(filename, "a")
    print >> output, text
    output.close()

  @pyqtSignature("")
  def browse(self):
    self.filename = str(QFileDialog.getOpenFileName())
    if not self.filename:
      return
    
    self.textbox.setText(self.filename)
    with open(self.filename) as f:
      query = json.load(f)
      self.pharma = [x for x in query['points'] if x['enabled']]
    if len(self.pharma) > 0:
      for p in self.pharma:
        x += p['x']
        y += p['y']
        z += p['z']
      x /= len(self.pharma)
      y /= len(self.pharma)
      z /= len(self.pharma)
      for p in self.pharma:
        p['x'] -= x
        p['y'] -= y
        p['z'] -= z
    self.debug("Pharma: %s" % self.pharma)
    self.emit(SIGNAL("changed()"))

  @pyqtSignature("")
  def clear(self):
    self.filename = ""
    self.textbox.setText(self.filename)
    self.pharma = []
    self.emit(SIGNAL("changed()"))
    
  def settingsWidget(self):
    self.widget = QWidget()

    layout = QVBoxLayout(self.widget)

    self.textbox = QLineEdit(self.filename)
    self.textbox.setReadOnly(True)
    layout.addWidget(self.textbox)
    
    buttons = QHBoxLayout(self.widget)
    layout.addLayout(buttons)

    self.browsebtn = QPushButton("Browse")
    self.clearbtn = QPushButton("Clear")
    buttons.addWidget(self.browsebtn)
    buttons.addWidget(self.clearbtn)
    
    QObject.connect(self.browsebtn, SIGNAL("clicked()"), self, SLOT("browse()"))
    QObject.connect(self.clearbtn, SIGNAL("clicked()"), self, SLOT("clear()"))

    return self.widget

  def renderOpaque(self, pd):
    self.debug("Len of pharma: %d" % len(self.pharma))
    if len(self.pharma) == 0:
      return
    
    # Painter 
    painter = pd.painter
    # Molecule
    molecule = pd.molecule

    # Color
    color = pd.colorMap

    for point in self.pharma:
      if point['name'] in colorlookup:
        colvals = [float(x) / 255. for x in colorlookup[point['name']]] + [1.0]
        color.setFromRgba(*colvals)
      else:
        color.setFromRgba(0.3, 0.6, 1.0, 1.0)
      painter.setColor(color)        
      
      r = point['radius']
      self.debug("Radius: %.1f" % r)
      x, y, z = point['x'], point['y'], point['z']
      painter.drawSphere(array([x, y, z]), r)

Tuesday, 12 July 2011

The alpha and omega of SMILES strings Part II

In the previous post I described why it's useful to be able to generate SMILES strings starting and ending with particular atoms. This post describes how to do it.

To begin with, one way to do this is using ring closure notation, as Rajarshi and Andrew pointed out in replies to a question of mine at the Blue Obelisk Q&A early this year. Andrew went on to write a Python script that allowed complete reordering of all of the atoms of a molecule using this method.

I was hoping to find a more elegant method than using ring closures. Also, wholesale use of ring closures can cause problems with stereochemistry as this is a corner case that not all toolkits handle correctly. In any case, I kept thinking about this on and off, and eventually got around to trying some ideas out. In the end, the solution was easier than I'd thought.

A SMILES string is generated from a depth-first tree traversal of a graph. Every atom (except for the root) has a parent atom, and 0 or more child atoms. Setting the start atom is trivial; just make that the root. It turns out that setting the end atom requires only two rules: (1) parenthesise all of the child trees of the end atom, and all but the last child trees of other atoms (this latter should be the default in any case), (2) visit child trees that do not have a route (through 'unvisited' atoms) to the end atom first.

That's it...except for the corner cases. Click Comp Chem involves replacing an implicit H of the start and end atoms; if there is an explicit bracketed H present, then it needs to be removed to free up the valence. For example, if the endatom is [nH] or [C@@H] then the H needs to go. Otherwise you have weirdoddities like the following 5-coordinate C:

C[C@@H](Br)(Cl) + I gives C[C@@H](Br)(Cl)I

Note however that this may result in a SMILES string that does not accurately represent the original structure (but that's not the point of the exercise) e.g.

c1c[nH]cc1 -> c1c[n](cc1)

What about the stereochemistry? Well, I toss that out too at the moment; an alternative would be to allow the user to specify the resulting stereochemistry at the start and end atoms.

Image credit: crashoverreason

Saturday, 9 July 2011

The alpha and omega of SMILES strings

I've just added a feature to the SMILES output in Open Babel that allows the user to specify a particular start and end atom for a SMILES string. No, no, come back! - don't go away! - it's really useful...

If you're still here, let me explain why. It allows you to take advantage of a really nice feature of SMILES strings: concatenating two SMILES strings allows you to easily generate new molecules. I call this (as of two minutes ago) Click Comp Chem or CCC (which of course is a SMILES string itself!).

With Click Comp Chem, the first atom of the second SMILES string will become joined to the "last atom" of the first SMILES string. When I say "last atom", I mean the last atom not within parentheses (i.e. not on a branch), e.g. if I add an atom to "CC(I)(Cl)" by appending "Br" to the end, then the Br will be attached to the second C atom (not the Cl).

CCC is particularly useful if generating polymers from a set of monomers (which is what we were doing in our solar cell paper), or generating a virtual library of combinatorial chemistry products (such as Jean-Claude Bradley did for his CombiUgi project, or as Duffy et al did to a create virtual library of peptides).

But a requirement for CCC is the ability to write a SMILES string such that it starts with a particular atom and ends with another. I'll explain the details of how I implemented this in a separate post. For now, here it is in action for a SMILES string for aspirin, "c1c(C(=O)O)c(OC(=O)C)ccc1". Let's suppose that we want to have the methyl of the ester (atom 10) at the end of the SMILES, and to start the SMILES with the ring carbon para to the ester (atom 13). The output options are "f" and "l" for "first" and "last":

>obabel -:"c1c(C(=O)O)c(OC(=O)C)ccc1" -osmi
c1c(C(=O)O)c(OC(=O)C)ccc1
1 molecule converted

>obabel -:"c1c(C(=O)O)c(OC(=O)C)ccc1" -osmi -xf 13
c1cc(C(=O)O)c(OC(=O)C)cc1
1 molecule converted

>obabel -:"c1c(C(=O)O)c(OC(=O)C)ccc1" -osmi -xl 10
c1c(C(=O)O)c(ccc1)OC(=O)C
1 molecule converted

>obabel -:"c1c(C(=O)O)c(OC(=O)C)ccc1" -osmi -xf 13 -xl 10
c1cc(C(=O)O)c(cc1)OC(=O)C
1 molecule converted

Image credit: Alan Cleaver

Thursday, 7 July 2011

We can design molecular wires for you wholesale

As part of a project led by Dr. Geoffrey Hutchison (University of Pittsburgh), I have been collaborating on the development of a method to design so-called "molecular wires" (conducting organic polymers) for use in organic solar cells. Using a combination of computational chemistry and cheminformatics, we have been able to find organic polymers with the required electronic structure predicted to perform well as the electron donor in solar cells.

This work is now available as an ASAP at J. Phys. Chem. C:

N.M. O'Boyle, C.M. Campbell, G.R. Hutchison.
Computational Design and Selection of Optimal Organic Photovoltaic Materials
J. Phys. Chem. C 2011. In press.

Rather than repeat the abstract here, I'll summarise the general idea. Current solar cells are based on semiconductor technology and are expensive to make both in terms of materials and energy. Organic solar cells, while never going to be as efficient at converting light to electricity, offer the possibility of cheap solar energy due to the ease of manufacture and low cost of materials. In 2006, Scharber et al described how to calculate the efficiency of an organic solar cell based on the electronic structure (i.e. HOMO, LUMO) of the solar cell components, so the question was: could we find organic polymers with the required structure to maximise efficiency?

To cut a long story short, we could. What we did was combine Open Babel/Pybel (for structure generation and forcefield optimisation), Gaussian09 (for ZINDO/S//PM6 calculations), cclib (for extracting the results) and a Python script implementing a genetic algorithm, and out popped molecular wires predicted to be highly efficient. The top candidates were then filtered according to additional criteria et voilà. Overall, over 90,000 polymers were tested - still only a small fraction (about 4%) of what we'd have had to test without the genetic algorithm.

If you're interested in a copy of the paper and don't have access to the journal, get in touch - we can distribute a number of copies through the ACS website.

Thursday, 16 June 2011

Using Zotero for Chemistry

Zotero keeps improving, and I was thinking it was time I started using it for my own papers. But how well do the translators work for Chemistry journals?

I tested the ability of Zotero to extract the correct metadata, the abstract, the journal abbreviation, the PDF, and the full-text HTML from the abstract page of a paper from a current issue of a journal from various publishers in Chemistry.

The following results are ranked by how well the translator works, starting with the best:

ACS Journals - Missing journal abbreviation.
BMC Journals - Misses author initials, doesn't recognise J Cheminf, missing journal abbreviation.
Elsevier (J Mol Struct THEOCHEM) - Metadata has slightly wrong DOI, markup included in abstract, missing journal abbreviation.
Wiley (J Comput Chem) - Only metadata (no PDF, or full-text HTML)
Springer (JCAMD) - Only metadata (no PDF, or full-text HTML)
Oxford (Nucleic Acids Res) - Only metadata (no PDF, or full-text HTML)
RSC (Chem Comm) - Only metadata, and that missing the page numbers (RSC must not be providing it to CrossRef).

I'm going to have a go at improving these by-and-by (keep an eye on my bitbucket account), but feel free to sort them out yourself if you want (leave a comment below if you do).

I was initially hesitant doing this as there was no test framework in place for translators, and there didn't seem to be much point in writing a translator that might break at any point without anyone knowing. But Avram Lyon is currently adding support for a test framework to Scaffold (the tool you use for writing Zotero translators), and so this should soon be available.

That's not normal for a correlation - Pearson vs Spearman

Let me present some random data, 100 data points with x and y values chosen from a uniform distribution between 0.8 and 0.9:

Correlation coefficients: -0.074 (Pearson), -0.073 (Spearman)

Now let's add a single data point at 0.1, 0.1:

Correlation coefficients: 0.866 (Pearson), -0.042 (Spearman)

Here's another interesting situation, with two random datasets of 50 datapoints chosen from the intervals (0.0, 0.5) and (0.5, 1.0):

Correlation coefficients: 0.668 (Pearson), 0.711 (Spearman)

What happens when we contract the areas though?

Correlation coefficients: 0.995 (Pearson), 0.719 (Spearman)

Tuesday, 7 June 2011

A bunch of stuff: Chemfps, cinfcasts, spreadchems, and MIOSS talks

Two of Andrew Dalke's projects have finally come out of stealth mode: chem-fingerprints (announcement and project page) and the world's first cheminformatics podcast, Molecular Coding.

Rich Apodaca has also been busy adding chemistry to spreadsheets, this time to Google Spreadsheets. Check out his video which sums up the whole thing.

And all of the talks from MIOSS are now available.

Thursday, 2 June 2011

Molecular zooming with Open Babel SVG

How cool is SVG? Way, I say. Combine it with some sweet sweet JavaScript (indeed, is there any other kind?) and you can visualise large numbers of molecules with low memory footprint (vectors don't use much memory) even in the tiny 600px margin of this blog post: (scroll mouse wheel to zoom, drag to move around)

The above shows a depiction of the first 100 molecules in PubChem, as drawn by Open Babel:

obabel PubChem3D.sdf -l100 -d -xC -O mols_100.svg

Want to see 1000 molecules?

Notes:
1. If you don't see anything above, you may want to upgrade your browser.
2. Open Babel is not the only Open Source chemistry software with SVG support: there's also the CDK, Indigo, OASA/BKChem and probably others.
3. In the Open Babel GUI, if you tick the box "Display in firefox" it shows these depictions when you convert.
4. One nice thing about SVG is that it can be styled with CSS, so that you could have a button on website that allows you to instantly change the colours of the bonds or labels, or the line-width. I don't think that the Open Babel SVG is set up for these shenanigans at the moment, but it could easily be done.

Tuesday, 24 May 2011

(Almost) Translate the InChI code into JavaScript Part III

Following on from Parts I and II...

Ok, so we've converted the InChI library to JavaScript. There are two ways to go from here, either call it directly from JavaScript or write a C function to call it and convert that to JavaScript (and then call that). It might seem that the second plan is more work, but it actually makes things easier as calling these JavaScriptified C functions is a bit tricky especially if we need to pass anything beyond basic C types.

The following code uses the InChI API to read in an InChI, and list the atoms and bonds in a molecule:

#include <stdio.h>
#include <string.h>
#include "inchi_api.h"

int InChI_to_Struct(char* inchi)
{
    inchi_InputINCHI inp;
    inchi_OutputStruct out;
    int i, j, retval;

    inp.szInChI = inchi;
    memset(&out, 0, sizeof(out));

    retval = GetStructFromINCHI(&inp, &out);
    printf("number of atoms: %d\n" , out.num_atoms);
    
    for(i=0;i<out.num_atoms;++i)
    {
      inchi_Atom* piat = &out.atom[i];
      printf("Atom %d: %s\n", i, piat->elname);
      for(j=0;j<piat->num_bonds;++j)
      {
        printf("Bond from %d to %d of type %d\n", i, piat->neighbor[j], piat->bond_type[j]);
      }
    }

    FreeStructFromINCHI( &out );
    return retval;
}   

int test_InChI_to_Struct()
{
    int retval;
    char inchi [] = "InChI=1S/CHClO/c2-1-3/h1H";

    retval = myInChI_to_Struct(inchi);
    return retval;
}

I saved the above code along with the InChI library's own C files in inchi_dll/mycode.c, and added it in the two appropriate places in the Makefile so that the compilation as described in Part II created two extra functions in inchi.js.

To test at the command line, you need to edit the run() method to call InChI_to_Struct, and then call the run() method itself. When you do this, you will find that _strtod is not implemented (so you need to add an implementation - I just pass the call to _strtol) and that there is a call to some clock-related functions (I make this just return 0 - to sort this out properly you would need to look at the original C code and figure out what this function is used for in this context). So, here it is in action if I call run("InChI=1/S/CHClO/c2-1-3/h1H"):

user@ubuntu:~/Tools/inchidemo$ ~/Tools/v8-repo/d8 inchi.js 
number of atoms: 3
Atom 0: C
Bond from 0 to 1 of type 1
Bond from 0 to 2 of type 2
Atom 1: Cl
Bond from 1 to 0 of type 1
Atom 2: O
Bond from 2 to 0 of type 2

Once tested, you can make a webpage that incorporates it. Using Chrome, check out the InChI JavaScript demo.

So...does it work? Well, almost. For some simple InChIs it works perfectly. For others, it returns an error. There are a couple of ways of tracking down the problem but, you know, I have to draw the line somewhere so I'll leave that as an exercise for the reader. Also, the page needs to be refreshed after each InChI, so there's something wrong there with the way I've set it up. The file size is currently too big, but that can be reduced by leaving out unnecessary functions (for example) as well as by using the techniques discussed in the previous post. Perhaps the biggest problem is that the code maxes out the stack space on Firefox/Spidermonkey - this can probably only be addressed by discussion with the emscripten author and changes to the InChI source code.

So that's where I'll leave it for now. I'm very impressed with how well this works - the whole idea is really quite amazing and I didn't expect to get this far, especially with such a complex piece of code. I'll leave the interested reader with a few questions: can you track down all the problems and sort them out?, what other C/C++ libraries could usefully be converted to JavaScript?, and what other languages can be generated from LLVM bytecode?

Supporting info: Various versions of the InChI JavaScript code: vanilla, for running at command-line, ready for webpage, and finally minified.

Acknowledgement: Thanks to kripken, the main emscripten author, for the rapid fix to my reported bug.

Tuesday, 17 May 2011

Excel with the Chemistry Development Kit

One of the projects that really astounded me at MIOSS was presented by Kevin Lawson of Syngenta. He has managed to integrate chemistry into Excel, and done so using the freely available and open source toolkit, the Chemistry Development Kit (CDK) (and only using three different programming languages!). The project is called the LICCS System (or Excel-CDK), and the website is at googlecode.

Big deal? Yes - big deal. We may say that R can do everything better than Excel, but the ubuiqity of Excel and the familiarity of everyone with the spreadsheet metaphor, means that targeting Excel brings basic cheminformatic analysis into the hands of non-cinfs like our colleagues, our bosses and our students.

So, what's this all about? Well, once a spreadsheet has been "chemistry-enabled" you can...

click on a SMILES strings to see the structure
click on a point in a graph and see the structure
filter the data by substructure searching (for large datasets you can speed this up by calculating fingerprints first)
cluster the data
create R group tables
calculate molecular properties

If you are at all interested in this, go to the website, and check out the flash video available as a download. This takes you through some of the capabilities of the system.

One interesting aspect is that the software has been cleverly designed for use in a corporate environment. First of all, no installation is required (i.e. admin access is not needed) and secondly, chemistry-enabled spreadsheets can be shared with users who haven't installed the chemistry add-in, so long as they have access to a shared network drive with the required files.

Just a note, to get it all working the first time, you may have to override some security settings in Excel. In Excel 2007, I had to go into the "Trust Center", and in "Macro Settings", enable "Trust access to the VBA project object model". [Update (18/05/11): Kevin says this is only necessary if creating the spreadsheet, not if using one that has already been created.]

Remember - it's an open source project, so get in there and give a hand if you have any ideas for additional features.

Friday, 13 May 2011

(Almost) Translate the InChI code into JavaScript Part II

So, following on from Part I...

Let's download the InChI code and try to convert it to JavaScript. To put some sort of figure on the size of the codebase, the C code in INCHI_API/inchi_dll comes to 106K lines (including everything via "wc -l") or 4.8M.

The usual procedure to compile the InChI code is to type "make" in INCHI_API/gcc_so_makefile. Instead, comment out line 2 of the Makefile and then do the following to run make:

export EMMAKEN_COMPILER=/home/user/Tools/llvm-2.9/cbuild/Release/bin/clang
LINKER=/home/user/Tools/emscripten-git/tools/emmaken.py SHARED_LINK=/home/user/Tools/emscripten-git/tools/emmaken.py C_COMPILER=/home/user/Tools/emscripten-git/tools/emmaken.py make
cd result
export PATH=~/Tools/llvm-2.9/cbuild/Release/bin:$PATH
llvm-dis -show-annotations libinchi.so.1.03.00

This creates libinchi.so.1.03.00.ll, composed of LLVM disassembled bytecode, which we now convert to JavaScript in the same way as with "Hello World" previously:

# Run emscripten
$EMSCRIPTEN/emscripten.py libinchi.so.1.03.00.ll $V8/d8 > inchi.js

# Run the Javascript using v8
$V8/d8 inchi.js

Running it, of course, does nothing - it's just a library. Well, I say "just", but it's about 400K lines of code weighing in at 15M. With minification (YUI Compressor) we can get that down to ~7.5M, which zips down to 2MB. The emscripten author recommends passing it through Google Closure (which optimises and minifies the code) but it crashes out with some complaint about hitting a recursion limit. I don't know if it's a problem with the JavaScript code, a bug in Closure or just a feature of InChI generation. It also causes Spidermonkey (and hence Firefox) to complain about maxing out on stack space. Again, I don't know whether there's a way around this.

The next step is to write some code that does something useful with the library. That's all covered in Part III of course.

Tuesday, 10 May 2011

Cinfony presentation at MIOSS

I presented Cinfony 1.1 at the recent Wellcome Trust Workshop on Molecular Informatics Open Source Software (MIOSS) at the EBI near Cambridge, UK. Cinfony is a Python library that makes it easy to access several cheminformatics resources through a common and simple API.

This new version of Cinfony is currently in beta while I wait for the release of Open Babel 2.3.1, but is available for download from the Cinfony website (install instructions). The main new features are support for the Indigo toolkit (the general cheminformatics toolkit from GGA Software) and OPSIN (IUPAC name -> structure convertor from Daniel Lowe in PMR's group).

The following code shows an example of using OPSIN to read an IUPAC name (this is taken from the TOC graphic for the OPSIN paper) and then using Indigo to calculate the molecular weight:

>>> from cinfony import indy, opsin
>>> opsinmol = opsin.readstring("iupac",
        "(1R,2R,3R,4S)-11-diazo-2,3,4,9-tetrahydroxy-2-"
        "methyl-5,10-dioxo-2,3,4,5,10,11-hexahydro-1H-"
        "benzo[b]fluoren-1-yl acetate")
>>> print indy.Molecule(opsinmol).molwt
412.349639893

Here's the talk I gave:

View more presentations from baoilleach

Monday, 9 May 2011

(Almost) Translate the InChI code into JavaScript

If you follow Rich Apodaca's Signals blog (for example), you will be aware that more and more chemistry applications are being implemented in JavaScript. Wouldn't it be nice to be able to take an existing Java or C++ cheminformatics library and convert it to JavaScript?

Well, guess what - in Oct of last year, a project appeared called emscripten that will do just that for C/C++. So without further ado, let's convert the InChI code.

Actually, maybe it'd make more sense to begin with "Hello World":

#include <stdio.h>

int main()
{
   printf("Hello World!\n");
   return 0;
}

To start with, compile llvm, clang, spidermonkey and v8 as described in the install instructions.

Then convert to javascript as follows:

#!/bin/sh
LLVM_BINDIR=~/Tools/llvm-2.9/cbuild/Release/bin
EMSCRIPTEN=~/Tools/emscripten-git
V8=~/Tools/v8-repo

$LLVM_BINDIR/clang hello.c -o hello
$LLVM_BINDIR/clang hello.c -S -emit-llvm

$LLVM_BINDIR/llvm-as hello.s
$LLVM_BINDIR/llvm-dis hello.s.bc -show-annotations

# Run emscripten
$EMSCRIPTEN/emscripten.py hello.s.ll $V8/d8 > hello.js

# Run the Javascript using v8
$V8/d8 hello.js

After some trivial edits to the code, we can run hello.js in the browser.

Part II shows my attempt to repeat this procedure with the InChI code.

Sunday, 8 May 2011

MIOSS - Open Source in Chemistry workshop

I'm just back from the EBI where I participated in a Wellcome Trust workshop "Molecular Informatics Open Source Software", organised by Mark Forster of Syngenta. I gave a talk on Cinfony (slides to follow), while fellow Open Babel developers Tim Vandermeersch and Chris Morley spoke about Open Babel.

What a great meeting.

It was 50% Open Source Software developers, and 50% pharma industry. Or so I thought at the start. It quickly became apparent that the scientists from pharma were also developing or supporting open source software through a variety of methods. Rajarshi's excellent overview gives a list of some of the talks from industry in this space. I will be following up some of these over the next while.

Of the projects from academia, some which were new to me were Bio-Linux (bio-focused Linux distro available on request as bootable USB sticks from NERC [one of the UK research agencies] - especially useful for teaching), ChemT (GUI for chemical library generation), MOLA (use USB sticks to turn a typical computer lab into a cluster for running AutoDock jobs), and OpenStructure (somewhat similar to PyMol).

The workflow softwares Taverna and KNIME are both doing well. Taverna is widely adopted in the bioinformatics community, while KNIME has considerable mindshare in pharma. An interesting aspect of Taverna is that workflows can be stored at http://myexperiment.org, and users can combine other workflows with full attribution. KNIME now has nodes for RDKit and Indigo (as well as most commercial vendors), toolkits which should be familiar to regular readers here.

Silicos now has several open source offerings for library design and general computer-aided drug design: Sieve (filtering on properties), Pharao (pharmacophore alignment), Piramid (shape-based alignment), Stripper (extract molecular scaffolds according to a variety of schemes), Spectrophores (3D fingerprint for QSAR). An interesting development is the ProfiLib web application which returns a PDF that characterises an uploaded molecular library (in SDF form).

There may be some outcomes from this meeting over the next while, so stay tuned...

Sunday, 1 May 2011

Questions for On-line Cheminformatics Tutorial

Some time ago I set up an on-line cheminformatics tutorial based on the "Try Python" software of Michael Foord (see my blog post for details).

When a reader mentioned in passing that he found this tutorial useful for introducing people to cheminformatics concepts, it made me realise that it might actually be useful for something. And so, when I had to put together a practical on the subject of Cheminformatics a few weeks ago, I decided to use this tutorial as the basis. I had the students go through the tutorial and answer a series of questions based around each of the three main chapters (Introduction to Cinfony, Descriptors, and Fingerprints).

Does anyone have any suggestions for other cheminformatics practicals?

Notes: This is Windows and Mac only. "Try Python" should (but does not) run on Linux under Moonlight. The problem was reported to the Moonlight devs back in 2009.