Noel O'Blog: September 2012

Monday, 24 September 2012

Crowd-sourced malaria drug discovery

Here's a quick shout out for the FightMalaria@Home project run by Anthony Chubb in UCD, Dublin, which is currently looking for more people to lend compute time to the project.

The goal of the project is to identify the drug targets of the Novartis/GSK malaria datasets (about 19K molecules) through large-scale distributed protein-ligand docking experiments. The details are on the project website, but involve identification of hits using Autodock Vina running on BOINC. Experimental follow-up of the top hits will be carried out.

Consider donating some compute time to this project. The dockings are run when your computer is idle so it shouldn't affect your normal use. The results of the study will be integrated into ChEMBL and be available to everyone.

To get involved, head over to www.fight-malaria.org. Details on the BOINC server can be found here.

Wednesday, 19 September 2012

Using the InChI to canonicalise SMILES

I believe that the Open chemistry community will wish to move towards InChI as the definitive approach for all canonicalisation in their codes. We have found that "unique SMILES" is not precisely defined and there is no accepted reference implementation that is freely available. For example a given molecule (e.g. caffeine) has at least 9 representations on the public Web.

- Peter Murray-Rust, Feb 2005, Open Babel mailing list

Different software generates different canonical SMILES. The reason for this is simple: no one has described a canonicalisation scheme for SMILES that includes stereochemistry. Even if we wanted to generate the same SMILES, we could not do so. Back in 2005, PMR pointed out that the InChI could be used for this purpose. As ever, PMR was way ahead of the times, and to my knowledge no one took up this idea until...

A paper of mine has just been published in J. Cheminf.:

Towards a Universal SMILES representation - A standard method to generate canonical SMILES based on the InChI

NM O'Boyle, Journal of Cheminformatics 2012, 4:22. doi:10.1186/1758-2946-4-22

I describe two approaches to generate a canonical SMILES, one based on roundtripping through the InChI (and so it incorporates the InChI normalisation as a side-effect), and one that just takes the canonical labels from the InChI (so the structure is unchanged). These approaches are available in the development version of Open Babel as options to SMILES output, and should soon be available in Open Babel 2.3.2.

I'm hoping that other toolkits will see merit in this approach and add similar capability. This will allow, for the first time, different toolkits to generate the same SMILES, and it will finally be clear how different toolkits disagree on aspects of their chemical model. Only then will we have some progress on sorting out standard algorithms for stereocentre detection, aromatic models and so forth. And all this will be good for toolkits, and good for users.

Monday, 17 September 2012

A bit of a SMILES - Canonical fragments

A well-hidden feature of OB's SMILES writer is support for writing SMILES strings that represent fragments of a molecule. For example, if we read the SMILES string "CC(=O)Cl" but on writing specify the fragment containing the first two atoms, we get just "CC".

In OB 2.3.2 (coming soon), this can be done with the "F" SMILES output option:

obabel -:"CC(=O)Cl" -osmi -xF "1 2"

...but is a bit more awkward with OB 2.3.1:

obabel -:"CC(=O)Cl" -osmi --property SMILES_Fragment "[ 1 2 ]"

If you specify atoms that are not connected, you get a dot-disconnected representation:

> obabel -:"CC(=O)Cl" -osmi -xF "1 4"
C.Cl

So far that's pretty much as expected. But now, let's push it a bit. How about fragments that involve an "aromatic" atom?

> obabel -:"c1ccccc1F" -osmi -xF "6 7"
cF

Mmmm... interesting. Clearly this isn't a valid SMILES string. In fact, none of these "fragment SMILES" are proper SMILES strings: some happen to be valid SMILES, but read on their own they describe a different molecule ("CC" is ethane, not the first two atoms of acetyl chloride). In short, the SMILES format does not support fragments.

So what's the point of these? Well, let's consider the canonicalised version, e.g.

> obabel -:"O=C(Cl)C" --property SMILES_Fragment "[ 1 2 ]" -ocan
C=O

Now imagine that you want to create a fragment-based fingerprint; all you need to do is generate the corresponding canonical fragment SMILES and hash them. Job done.
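As a rough illustration of the "generate and hash" idea, here is a minimal Python sketch. It assumes the canonical fragment SMILES have already been produced (for example by collecting the "-ocan" output above); the function name, the 1024-bit length and the example fragment strings are all made up for illustration and are not part of Open Babel.

```python
import hashlib


def fragment_fingerprint(fragment_smiles, n_bits=1024):
    """Map canonical fragment SMILES strings onto a set of 'on' bit positions.

    Hypothetical helper: the input strings are assumed to be canonical
    already, so the same fragment always gives the same string.
    """
    on_bits = set()
    for smi in fragment_smiles:
        # Use hashlib rather than the built-in hash(): string hashes from
        # hash() are salted per process, so they differ between runs.
        digest = hashlib.sha256(smi.encode("utf-8")).digest()
        on_bits.add(int.from_bytes(digest[:4], "big") % n_bits)
    return on_bits


# Made-up fragment strings standing in for real canonical output
fragments = ["C=O", "CC", "CCl"]
print(sorted(fragment_fingerprint(fragments)))
```

Because each bit depends only on the fragment string, the order in which fragments are generated does not matter; two different fragments can still collide on the same bit, which is the usual trade-off with hashed fingerprints.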

Another potential use would be to...oh oh...dinner time...you'll have to use your imagination. Before I go, just to note that credit for this feature, and most of the SMILES writer indeed, goes to Craig James.

Thursday, 13 September 2012

Plotting accesses on the axis Part II

In an earlier post, I showed the accesses in the first month for the Blue Obelisk and Open Babel papers from late last year.

I should have stopped there.

Instead I decided to see how the recent Avogadro paper compared (Update 15/09/12: thanks to Geoff for filling in the missing points):

Hmphhh. BTW, Avogadro 1.1.0 was just released yesterday. Check out the new features.

Monday, 10 September 2012

Moving to pastures new, but still in the same field Part II

Following on from my previous post on the topic, last Thursday was my last day at University College Cork, and indeed my last day in academia.

I've just moved back to Cambridge (UK) where I have joined the growing team at Roger Sayle's cheminformatics company, NextMove Software. I've mentioned the company before - it has several software products in the area of chemical text mining and name recognition.

I guess some of you are wondering what this means for my involvement in the various Open Source projects to which I contribute. No? Oh well, and I had a whole spiel prepared too...:-) Anyway, the good news is that I'll now be able to attend some conferences again, so hopefully you'll see me around some time in the future.

Tuesday, 4 September 2012

The Imp Act Factor strikes again

Here's a quick question: for what shape of distribution does the mean convey the least useful information? Well, there are many answers, but a prime candidate is a one-sided long-tail distribution of the type exhibited by journal citations. The mean, standard deviation and Pearson correlation are all summary statistics developed for the two-sided normal distribution (exercise for the reader: what are their non-parametric equivalents?). Applying them to anything else is like putting lipstick on a pig (ok, a poor analogy, but it sounds funny :-), but this porcine paintjob is exactly the method used to calculate a journal's Impact Factor.

So, what's the problem? In the context of a one-sided long-tail distribution, the mean is highly sensitive to outliers, and thus almost useless. Let's take an example. Let's suppose there were 99 papers published and each was cited once, giving an Impact Factor of 1.0 (99*1/99). Now let's suppose a single additional paper was published which garnered 100 citations. The Impact Factor of the journal is now (99*1 + 1*100)/100 = 1.99, or near enough 2.0. So a single paper, an outlier, has all but doubled the Impact Factor.
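The toy example is easy to check in a few lines of Python. This is only a sketch of the arithmetic above: `impact_factor` is a made-up helper that takes the plain mean of citations per paper, and the median is printed alongside it as one possible outlier-resistant comparison.

```python
from statistics import median


def impact_factor(citations):
    """Mean citations per paper (hypothetical helper for the toy example)."""
    return sum(citations) / len(citations)


# 99 papers, each cited once
citations = [1] * 99
print(impact_factor(citations), median(citations))

# Add a single outlier paper with 100 citations
citations.append(100)
print(impact_factor(citations), median(citations))
```

The mean jumps to just under 2 while the median stays at 1, since 99 of the 100 papers are unchanged.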

But that wouldn't happen in practice, right? No - you don't get it. The distribution of journal citations has a shape that guarantees that this happens; all those impact factors you read are just measures of outliers. How about instead "the Outlier Factor", or better still the "Extreme Value Factor"?

Still don't believe me? Well, let's take a concrete example. Thomson ISI has just deigned to give J. Cheminf. its first impact factor with a value of 3.42. Let's say that 65 papers have been taken into account, so that's about 222 citations in total. Now let's enter an outlier into the mix, say the Open Babel paper published in Oct of last year. I would expect about 30 to 60 citations a year once it gets going (based on prior citations of the software, as well as experience with the GaussSum paper) - let's just say 50 for a round number, so 100 citations in the 2-year period included in an impact factor. This means that all else being equal, in one year's time the journal's Impact Factor will rise to 4.1, and in two years to 4.9.
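For anyone who wants to replay the estimate, here is the same back-of-the-envelope calculation as a Python sketch. It assumes, as in the text, that everything else stays equal: the 65 existing papers keep their roughly 222 citations and the only change is one extra paper with 50 and then 100 citations. The function name is made up for illustration.

```python
def projected_impact_factor(current_if, n_papers, extra_citations, extra_papers=1):
    """Impact Factor after adding extra papers and citations, all else equal."""
    total_citations = current_if * n_papers  # about 222 for 3.42 * 65
    return (total_citations + extra_citations) / (n_papers + extra_papers)


# One extra paper picking up 50 citations a year
after_one_year = projected_impact_factor(3.42, 65, extra_citations=50)
after_two_years = projected_impact_factor(3.42, 65, extra_citations=100)
print(round(after_one_year, 1), round(after_two_years, 1))
```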

I just hope those Avogadro guys don't publish another outlier. :-)

Monday, 3 September 2012

2012 - The Year of Open Access

This year will be remembered as the year that Open Access went mainstream. This whole movement represents a significant change in the field of science. It has led to a widespread realisation that closed access publishers by definition do not have scientists' interests at heart (maximum dissemination of their work), along with a recognition that scientists need to engage with the wider community, not just their peers in ivory towers. In the end, it all came down to the funders.

For my own part, over the past number of years I have become increasingly convinced that all scientific work should be freely available and preferably Open Access, CC-BY* and copyright me (hey - if I'm going to pay, I want it AALLLLLL!!). I want people to read my work and I want people to be free to remix and reuse it in any way they want - this is the essence of this thing we call science. Burying work in non-OA journals, or worse still in books that few will have access to let alone read, seems to me to be a baaad idea, and especially so now that the writing is on the wall (not literally of course, it's usually on the web).

Others have recounted the events so far this year but here they are again:
Jan - The year started at a low point: the Research Works Act, a bill in the US that would have repealed the NIH's public access policy. It turned out that Elsevier were behind this.
Feb - After a major outcry, and Tim Gowers' announcement that he would boycott Elsevier (subsequently supported by 12K others), Elsevier dropped its support for the bill.
Jun - The Wellcome Trust starts cracking the whip on its OA compliance (only 55% of funded publications were compliant). In future, non-compliance will mean grant money will be withheld, and non-compliant publications will be ignored for the purposes of applying for further funding. Furthermore, publications must be effectively CC-BY.
Jul - The Research Councils of the UK (RCUK) mandate OA, and specifically CC-BY.
Jul - The EU is talking about mandating OA for the 2014-2020 framework funding.
Aug - Wiley announces that its OA journals will now adopt CC-BY.

Historic times. For more, see this link.

*Note: CC-BY is a copyright license developed by the Creative Commons. It's very simple. It means you can legally do whatever you want with the article, so long as you acknowledge the copyright holder. For more info, see the license.

Bibliometricking J Cheminf

Just for fun, a week or two ago I decided to check out who are the most prolific authors in J. Cheminf. I downloaded the TOC in Endnote format from the journal website, and analysed it with the short script below.

There were 295 authors in total (after merging what I think was a single duplicate), and all of the authors with 2 or more papers are as follows (by no. of publications, then reverse alphabetical by surname):

14 Peter Murray-Rust
8 Stephen Bryant
8 Evan Bolton
7 Sam Adams
6 Egon Willighagen
6 Joe Townsend
6 Sunghwan Kim
5 Andreas Zell
5 Antony Williams
5 Andreas Jahn
5 Georg Hinselmann
4 David Wild
4 Christoph Steinbeck
4 Noel O'Boyle
4 Geoffrey Hutchison
3 Tim Vandermeersch
3 Henry Rzepa
3 Lars Rosenbaum
3 Matthias Rarey
3 Stefan Kramer
3 David Jessop
3 Nina Jeliazkova
3 Marcus Hanwell
3 Nikolas Fechner
3 Peter Ertl
2 Erik van Mulligen
2 Peter Willett
2 Valery Tkachenko
2 Jens Thomas
2 Ola Spjuth
2 Christopher Southan
2 Weerapong Phadungsukanan
2 Ben O'Steen
2 Sorel Muresan
2 David Lonie
2 Andrew Lang
2 Jan Kors
2 Jos Kleinjans
2 Andreas Karwath
2 Jochen Junker
2 Vedrin Jeliazkov
2 Craig James
2 Jonathan Hirst
2 Kristina Hettne
2 Lezan Hawizy
2 Martin Gutlein
2 Rajarshi Guha
2 Mikhail Elyashberg
2 Michel Dumontier
2 Ying Ding
2 Open Source Drug Discovery Consortium
2 Leonid Chepelev
2 Fabian Buchwald
2 Jean-Claude Bradley
2 Kirill Blinov

...and here's the script used to calculate this: