Tuesday 18 December 2012

My year in review - Reviewing reviewing

Like most scientists, from time to time I get asked to review papers, and a few years ago I decided to keep track of the reviews I did for various journals. In 2010, 2011, and 2012 I was asked to review 7, 6 and 7 times respectively.

I've been trying to figure out is this a reasonable level of reviewing? I guess the question is, am I doing more reviewing for the chemistry community than they are doing for me? Not that I mind those freeloading slackers dumping it all on me - I'm just curious.

So what's the other side of the equation? In the same three years I've had my name attached to 8 peer-reviewed publications. If each was reviewed 2.5 times (a reasonable guesstimate), that's also 20 reviews.

And so the balance is maintained.

Note: I know what you're thinking...(eerie isn't it!)...you want to adjust the figures for multiple authors. But both of the values would have to be adjusted in the same way and so it just cancels out. (I think.)

Update (02/01/2013): The previous note is a load of rubbish. As Felix points out in the comments, I should be correcting for multiple authors. Oh well...

Image credit: After the Edit by Laura Ritchie (LMRitchie on Flickr)

Tuesday 11 December 2012

Cinfony 1.2 released

Cinfony presents a common API to several cheminformatics toolkits. It uses the Python programming language, and builds on top of Open Babel, the RDKit, the CDK, Indigo, OPSIN, JChem and cheminformatics webservices.

Cinfony 1.2 is now available for download.

The two major additions in this release are support for the JChem commercial cheminformatics toolkit, and the ability to specify options (via an 'opt' dictionary) for format conversion and some other operations. There were also some under-the-hood changes to consolidate source files for ease of maintenance.

These additions were principally contributed by Adrià Cereto Massagué (of Universitat Rovira i Virgili and DecoyFinder) who joined the Cinfony team (i.e. me) earlier this year.

As usual, Cinfony has been updated to use the latest stable releases of each toolkit: Open Babel 2.3.2, CDK 1.4.15, RDKit 2012.09, Indigo 1.1, JChem 5.11 and OPSIN 1.3.

The Cinfony website has a somewhat condensed example showing the use of many of these resources in a dozen lines of Python. Here's a smaller example showing part of the new functionality:
>>> from cinfony import pybel, jchem
>>> mol = pybel.readstring("smi", "CC(=O)Cl")
>>> print mol.write("smi", opt={"f": 2, "l": 1}) # Make atom 2 the first and atom 1 the last
>>> fp = jchem.Molecule(mol).calcfp("ECFP")
>>> fp.bits
[39, 47, 55, 246, 397, 429, 700, 908]

To support Cinfony, please cite:
N.M. O'Boyle, G.R. Hutchison, Chem. Cent. J., 2008, 2, 24. [link]

Tuesday 4 December 2012

Intro to Open Babel

Recently I was asked (for the first time!) to provide introductory training for Open Babel. Here are the slides I put together:
If you're interested in this, you should follow up with the hands-on tutorial in the docs which I've mentioned previously. For further cheminformatics teaching material, see this post.

Tuesday 20 November 2012

What's taking so long? - Profiling Open Babel

Profiling code shows where all the runtime is spent. In the case of Open Babel, profiling is a bit awkward due to its use of dynamically-loaded libraries (the format plugins, etc.). So here's how you do it on Linux...
[openbabel/build]$ rm CMakeCache.txt 
[openbabel/build]$ make
This should successfully compile the library and plugins with profiling information, but will fail when it comes to linking one of the executables:
[openbabel/build]$ make obabel
[100%] Built target openbabel
Scanning dependencies of target obabel
[100%] Building CXX object tools/CMakeFiles/obabel.dir/obabel.o
Linking CXX executable ../bin/obabel
/usr/bin/ld: dynamic STT_GNU_IFUNC symbol `strcmp' with pointer equality in `/usr/lib/../lib64/libc.a(strcmp.o)' can not be used when making an executable; recompile with -fPIE and relink with -pie
collect2: error: ld returned 1 exit status
make[3]: *** [bin/obabel] Error 1
make[2]: *** [tools/CMakeFiles/obabel.dir/all] Error 2
make[1]: *** [tools/CMakeFiles/obabel.dir/rule] Error 2
make: *** [obabel] Error 2
Yikes! Using VERBOSE=1 we can see the offending command:
[openbabel/build]$ VERBOSE=1 make obabel 
cd /home/noel/Tools/openbabel/profile/build/tools && /usr/local/bin/cmake -E cmake_link_script CMakeFiles/obabel.dir/link.txt --verbose=1
/usr/local/bin/c++   -static -pg  -g -g3 -fno-inline   -pg CMakeFiles/obabel.dir/obabel.o  -o ../bin/obabel -rdynamic ../src/libopenbabel.a -Wl,-Bstatic -lpthread -Wl,-Bdynamic -lm -lz -Wl,-Bstatic 
/usr/bin/ld: dynamic STT_GNU_IFUNC symbol `strcmp' with pointer equality in `/usr/lib/../lib64/libc.a(strcmp.o)' can not be used when making an executable; recompile with -fPIE and relink with -pie
collect2: error: ld returned 1 exit status
With Roger's help, I was able to change this to something simpler which will compile:
[build/tools]$ /usr/local/bin/c++ -pg  -g -g3 -fno-inline   -pg CMakeFiles/obabel.dir/obabel.o  -o ../bin/obabel ../src/libopenbabel.a -lpthread -lm -lz
Success is mine AT LAST!! Now let's profile it:
[build/bin]$ export BABEL_DATADIR=wherever
[build/bin]$./obabel bigfile.smi -onul
[build/bin]$ gprof ./obabel > gprof.out
Now time to read the gprof manual.

We can design molecular wires for you wholesale Part II

As described previously, last year I published a paper in J. Phys. Chem. C with Geoff Hutchison on Computational Design and Selection of Optimal Organic Photovoltaic Materials.

A week or two ago someone emailed me for a copy of the paper. Fortunately, after a one year embargo, if you have an open access mandate you can request permission from the editor of an ACS journal to deposit the PDF in an institutional repository.

I went through this process a little while ago, and I am pleased to say that a copy of the PDF can now be found in University College Cork's institutional repository at http://hdl.handle.net/10468/748.

I'm not sure how people are going to find this though - it doesn't seem to be very prominent in Google. In particular, the PDF is not indexed (google search with filetype:pdf).

Friday 9 November 2012

Tricks with SMILES and SMARTS Part II

A much-underused feature of SMILES is the ability to apply 'atom classes' to atoms using a colon (inside square brackets). So, for example, CC and C[CH3:6] both represent ethane, but in the latter case one of the carbons is labelled as being a member of atom class 6.

So what's the meaning of atom class 6? Well, it's whatever you want - it's simply a label that you use to indicate some related information. For example, you might want to record reaction locations, or locations of common substitutions, or mappings between different molecules (reactant/product, or sub/superstructures).

Anyhoo, here's how you access the atom class information in Open Babel from Python:
>>> import pybel
>>> ob = pybel.ob
>>> mol = pybel.readstring("smi", "C[CH3:6]")
>>> print mol.write("smi")

>>> print mol.write("smi", opt={"a":True})

>>> data = ob.toAtomClassData(mol.OBMol.GetData("Atom Class"))
>>> data.HasClass(1)
>>> data.GetClassString(1)
>>> data.HasClass(2)
>>> data.GetClassString(2)

Thursday 8 November 2012

Plotting accesses on the axis Part III

Following on from Parts I and II, this is the last in a series of posts exploring how journal access statistics provided by Open Access journals such as Journal of Cheminformatics give an insight into relative impact of papers.

It's now over a year since the Blue Obelisk and Open Babel papers were published so let's look at the accesses over that period: ...close to straight lines with a defined slope.

And here's an update of the view over the first month, this time including the Universal SMILES paper: ...it seems like the accesses to this paper mirror almost exactly those of the Blue Obelisk paper.

In conclusion, it would be nice if journals provided these sorts of graphs, or if some third-party website (e.g. one of altmetrics ones) did it. All of the data is on the website; it just needs to be collated as I've done here.

Wednesday 7 November 2012

Tricks with SMILES and SMARTS

Because of the relationship between SMILES and SMARTS, there are some fun tricks you can do (for some value of fun). For example, over at the NextMove blog I have written about creating a substructure hierarchy (here and here).

Here's another example I came up with in response to a recent question on the OB mailing list from Pascal Muller:
Having a molecule (let's say ethylpyridine CCc1cccnc1) and its scaffold (pyridine c1ccncc1), I would like to create a generic scaffold (smarts) for substructure searches: considered atoms become "any" atom (*), and bonds becomes "any" bond (~).
I.e., the smarts should be  CC*~1~*~*~*~*~*~1 (parts not belonging to the scaffold don't change).

Is there a way in Pybel to mutate atom / bond into "*~" apart from string replacement in the smiles? I anticipate problems with brackets doing so.
And my reply:
Atoms with atomic number 0 are written as *. Bonds with BO 4 are
written as $. So...the following hack may work for you in most cases.

>>> import pybel
>>> mol = pybel.readstring("smi", "CC(=O)Cl")
>>> mol.atoms[0].OBAtom.SetAtomicNum(0)
>>> print mol
>>> mol.atoms[2].OBAtom.SetAtomicNum(0)
>>> print mol
>>> bond = mol.OBMol.GetBond(2, 3)
>>> bond.GetBO()
>>> bond.SetBO(4)
>>> print mol
>>> print mol.write("smi").replace("$", "~")
It probably isn't a general solution, but tricks like these can go a long way to solving a problem in many cases.

Wednesday 31 October 2012

A non-random method to improve your QSAR results - works every time! Part II

In an earlier post, I argued that when developing a predictive QSAR model that non-random division of a QSAR dataset into training and test is a bad idea, and in particular that diversity selection should be avoided.

Not everyone agrees with me (surprise!). See for example this excellent review by Scior et al in 2009 on "How to Recognize and Workaround Pitfalls in QSAR Studies: A Critical Review". Well, the paper is excellent except for the bit about training/test set selection which explicitly pours cold water on random selection in favour of, for example, diversity selection, hand selection or *cough* rational selection.

I had vague thoughts about writing a paper on this topic, but now there's no need. A paper by Martin et al has just appeared in JCIM: "Does Rational Selection of Training and Test Sets Improve the Outcome of QSAR Modeling?"

And the answer? No.

Combining their discussion with my own randomly-selected thoughts, the reasons it's a bad idea can be summarised as follows:
  1. You are violating the first rule of statistical testing - the training and test sets must be chosen in the same way (this was pointed out to me by a bioinformatician on FF - sorry I can't recall whom).
  2. All the weird molecules are sucked into your training set
  3. Every item in the test set is going to be close to an item in your training set, and your internal predictions are going to be overoptimistic compared to reality. (You do care about reality, don't you?)
Dr Kennard-Stone has a lot to answer for, but hopefully this is the final nail in the coffin (it being Halloween after all) for diversity selection in QSAR.

Update (7/11/12): In the comments, George Papadatos points out that "There's also related evidence that sophisticated diversity selection methods do not actually perform better than random picking".

Thursday 25 October 2012

Open Access and Ireland

It's Open Access week, so let's see how Ireland (ROI) is shaping up.

The good news first. The Government has just this week announced a national OA mandate, presumably following on the recent happenings in the UK. Indeed, most of the national funding agencies have had their own OA mandates for several years, but this formalises this policy at a national level.

So let's check out the OA output of the various universities and see whether taxpayers are getting their money's worth. Each university has its own institutional repository, and these are collected in a national repository, Rian. I generated the following bar charts using the statistics on deposited journal articles in Rian, combined with academic staff numbers taken from the Wikipedia pages for each university.

Are academics in TCD really 9.5 times more productive than their peers in UCC? Possibly not. I think it likely that OA deposition rates across most Irish universities could do with improvement.

Sunday 21 October 2012

Learn scikit and never use R again

Hallelujah, my brothers and sisters. Free yourselves from the <- and the $, the "how do I select a row again? I only did this last week!", and the arcane differences between a table and a data.frame. Join with me in using scikit-learn, and never use R again.

Could it be true? No need to ever use R again? Well, that's how it looks to me. Scikit-learn is a Python module for machine learning which seems to replicate almost all of the multivariate analysis modules I used to use in R. Thanks to Nikolas Fechner at the RDKit UGM for tuning me into this.

Let's see it in action for a simple example that uses SVM to classify irises (not the eyeball type). First, the R:

mysvm <- svm(Species ~ ., iris)
mysvm.pred <- predict(mysvm, iris)
# mysvm.pred   setosa versicolor virginica
#   setosa     50      0          0
#   versicolor  0     48          2
#   virginica   0      2         48

And now the Python:
from sklearn import svm, datasets
from sklearn.metrics import confusion_matrix
iris = datasets.load_iris()

mysvm = svm.SVC().fit(iris.data, iris.target)
mysvm_pred = mysvm.predict(iris.data)
print confusion_matrix(mysvm_pred, iris.target)
# [[50  0  0]
#  [ 0 48  2]
#  [ 0  0 50]]
This library is quite new, but there seems to be quite a bit of momentum in the data processing space right now in Python. See also Statsmodels and Pandas. These videos from PyData 2012 give an overview of some of these projects.

Saturday 20 October 2012

Old software never dies, it just...

Ah, the old ones are the best. Or so some think.

During my PhD, I wrote software for linking the output of comp chem calculations to experimental results (e.g. IR and UV-Vis spectra, and so forth). This software, GaussSum, has now been around for quite some time (Aug 2003 in fact). For the most part I no longer add features, beyond updating the parsing library (from cclib) to handle new file formats. At this point most of the bugs have been identified and removed...but does everyone use the latest version? Let's find out...

If you search Google Scholar for papers published in 2012 with the word "GaussSum" (but not "Gauss") you get about 121 citations (well, mentions, at least). In 45 of these cases, the extract in the Google Scholar results unambiguously lists the version number used.

The majority of the citations, a total of 27, are for the current version 2.2, released in Nov 2009 with several updates since. Four specify the precise version: 2.2.0 (Nov 2009), 2.2.2 (Nov 2009), 2.2.4 (July 2010) and 2.2.5 (the latest, in Jan 2011).

Of the remainder, the majority, a total of 13, are for version 2.1, released first in June 2007 and replaced by version 2.2 in Nov 2009. Three specify the precise version: 2.1.4 twice (Dec 2007) and 2.1.6 once (Apr 2009, the final update).

No one cites version 2.0 (Oct 2006 to June 2007). (Sad face.) But that still leaves five citations unaccounted for.

Version 1.0 was released in Oct 2005 and held sway until Oct 2006. It has been cited 4 times this year. In three of those cases they cite the precise version, version 1.0.5 (the final update, Aug 2006).

But that is not all. Oh no. I still have a loyal user of GaussSum 0.8 out there. Although released in Jan 2004 and superseded in Apr 2005 by version 0.9, this particular version with its quaint bugs and lovable lack of features, still has its adherents (well, adherent, anyway).

And the moral of the story is..."don't ever use Google Scholar to check what versions of software people are using". No that's not it - I mean "update your scientific software or some of your results are going to be dubious." At least check for a new version every 8 years.

Indeed, it seems old software never dies...it just gets fewer citations.

Wednesday 17 October 2012

Wherein I get Roger to blog

My cunning plan to get Roger Sayle to write down some of his findings has come to fruition. NextMove Software now has its own blog with posts from Daniel Lowe, myself and Roger on ChemSketch to Mol conversion, visualising a hierarchy of substructures, improving InChIs in Wikipedia, and how using mmap can speed up accessing binary files.

I haven't yet worked out how to decide where I post a particular article but I guess some things are more appropriate for the one blog over the other (e.g. work stuff for work blog, non-work stuff for non-work blog).

I've also taken the opportunity to add a few links to the sidebar. If you're interested in the kind of things I post here, you may want to check out my Google Plus page where I highlight articles in other blogs that I find interesting.

And finally, since this is turning into a meta-post, I encourage you to write your own blog. It takes about 30 seconds to set one up on Blogger (for example), and if it's about cheminformatics, I will read it.

Tuesday 16 October 2012

Pretty pictures of conformers

For most of this afternoon, Roger has been explaining to me how the byte-code compiler in Open Babel's protein perception works (he wrote it). Well, first of all he had to explain why there was a byte-code compiler in there. No, that's not true - actually first he had to explain what a byte-code compiler was.

Now I need some quiet time while it all sinks in.

So here are some pretty pictures of Confab-generated conformers courtesy of Jean-Paul Ebejer of InhibOx which were generated for his recent paper comparing conformer generators (see also here).

JP was also kind enough to provide the PyMOL scripts used to generate these if anyone wants to do so themselves: [cartoon] [raytraced].

Thursday 11 October 2012

Nothing added but time - ABI and API stability

Open Babel has a version numbering system that requires us to maintain API and ABI stability between bugfix releases (2.3.x), API stability between minor releases (2.x) although the API can be extended, and all bets are off for major releases (x).

Very exciting stuff altogether, I'm sure you'll agree.

What it all means is that just before release we need to make sure we hold to this promise. In the past, we've done this with varying degrees of success. API stability is fairly easy to hold to, but ABI stability is trickier - knowing the rules for what makes or breaks the ABI is half the battle. To be fair, the reason it has been so difficult is because there was no automated tool for sorting this all out for us.

Finally, a couple of years ago, the ABI Compliance Checker appeared which makes the whole thing a doddle. So today I compiled OB 2.3.1 and OB 2.3.2 and installed locally on a Linux system. Then, for each, I made an XML file like so:



And finally, I just ran the checker:
$abi-compliance-checker-1.98.3/abi-compliance-checker.pl -lib openbabel -old 2.3.1.xml -new 2.3.2.xml
preparation, please wait ...
Using GCC 4.8.0 (x86_64-unknown-linux-gnu)
checking header(s) 2.3.1 ...
ERROR: some errors occurred when compiling headers
ERROR: see log for details:

checking header(s) 2.3.2 ...
ERROR: some errors occurred when compiling headers
ERROR: see log for details:

comparing ABIs ...
comparing APIs ...
creating compatibility report ...
total "Binary" compatibility problems: 0, warnings: 4
total "Source" compatibility problems: 0, warnings: 8
see detailed report:

You can check out the detailed report here.

(1) I also had to delete openbabel/math/align.h as I had compiled without Eigen in each case. This header file probably shouldn't have been installed in that case.
(2) The error messages in the log files were related to the use of abs(double). Not sure why.

Monday 8 October 2012

1st RDKit User Group Meeting

Just back from the first ever RDKit User Group Meeting, organised by Nathan "Broon" Brown of the Institute for Cancer Research in London, and presided over by Greg Landrum, the lead developer of RDKit.

Really great meeting - a mixture of round-table discussions, scientific presentations, tutorials and RDKit on Raspberry Pi compilation tips. Got to meet a lot of new people. Check out Nathan's tweets for an overview.

The talks and other material will appear in a few weeks so there's no point me talking about them too much now. One thing I will note is that Greg kicked off his history of RDKit by blaming me for prodding him into action with the one of the first emails to the rdkit-discuss mailing list. It seems that I am responsible for collapsing that wavefunction. Also, look out for Andrew Dalke's Roche-sponsored MCS code in the next release.

As an aside, it was interesting to note the background of the 30 or so participants. Apart from the ICR attendees, there were only a few academics, with the remainder from pharma or pharma consulting or software vendors (i.e. me now) or George Papadatos (are you an academic too George?).

And talking of George, the next RDKit UGM will be hosted by the EBI, so I'll have even less to travel next time.

Monday 24 September 2012

Crowd-sourced malaria drug discovery

Here's a quick shout out for the FightMalaria@Home project run by Anthony Chubb in UCD, Dublin, which is currently looking for more people to lend compute time to the project.

The goal of the project is to identify the drug targets of the Novartis/GSK malaria datasets (about 19K molecules) through large-scale distributed protein-ligand docking experiments. The details are on the project website, but involve identification of hits using Autodock Vina running on BOINC. Experimental follow-up of the top hits will be carried out.

Consider donating some compute time to this project. The dockings are run when your computer is idle so it shouldn't affect your normal use. The results of the study will be integrated into ChEMBL and be available to everyone.

To get involved, head over to www.fight-malaria.org. Details on the BOINC server can be found here.

Wednesday 19 September 2012

Using the InChI to canonicalise SMILES

I believe that the Open chemistry community will wish to move towards InChI as the definitive approach for all canonicalisation in their codes. We have found that "unique SMILES" is not precisely defined and there is no accepted reference implementation that is freely available. For example a given molecule (e.g. caffeine) has at least 9 representations on the public Web.
- Peter Murray-Rust, Feb 2005, Open Babel mailing list

Different software generates different canonical SMILES. The reason for this is simple; no-one has described a canonicalisation scheme for SMILES that includes stereochemistry. Even if we wanted to generate the same SMILES, we cannot do so. Back in 2005, PMR pointed out that the InChI could be used for this purpose. As ever, PMR was way ahead of the times, and to my knowledge no one took up this idea until...

A paper of mine has just been published in J. Cheminf.:
Towards a Universal SMILES representation - A standard method to generate canonical SMILES based on the InChI
NM O'Boyle, Journal of Cheminformatics 2012, 4:22. doi:10.1186/1758-2946-4-22 

I describe two approaches to generate a canonical SMILES, one based on roundtripping through the InChI (and so it incorporates the InChI normalisation as a side-effect), and one that just takes the canonical labels from the InChI (so the structure is unchanged). These approaches are available in the development version of Open Babel as options to SMILES output, and should soon be available in Open Babel 2.3.2.

I'm hoping that other toolkits will see merit in this approach and add similar capability. This will allow, for the first time, different toolkits to generate the same SMILES, and for the first time, it will finally be clear how different toolkits disagree on aspects of their chemical model. Only then we will have some progress on sorting out standard algorithms for stereocentre detection, aromatic models and so forth. And all this will be good for toolkits, and good for users.

Monday 17 September 2012

A bit of a SMILES - Canonical fragments

A well-hidden feature of OB's SMILES writer is support for writing SMILES strings that represent fragments of a molecule. For example, if we read the SMILES string "CC(=O)Cl" but on writing specify the fragment containing the first two atoms, we get just "CC".

In OB 2.3.2 (coming soon), this can be done with the "F" SMILES output option:
obabel -:"CC(=O)Cl" -osmi -xF "1 2"
...but is a bit more awkward with OB 2.3.1:
obabel -:"CC(=O)Cl" -osmi
       --property SMILES_Fragment "[ 1 2 ]"
If you specify atoms that are not connected, you get a dot-disconnected representation:
> obabel -:"CC(=O)Cl" -osmi -xF "1 4"
So far that's pretty much as expected. But now, let's push it a bit. How about fragments that involve an "aromatic" atom?
> obabel -:"c1ccccc1F" -osmi -xF "6 7"
Mmmm....interesting. Clearly this isn't a valid SMILES string. In fact, none of these "fragment SMILES" are proper SMILES strings - well, they may be valid SMILES but those SMILES do not have the same meaning. In short, the SMILES format does not support fragments.

So what's the point of these? Well, let's consider the canonicalised version, e.g.
>obabel -:O=C(Cl)C
        --property SMILES_Fragment "[ 1 2 ]" -ocan
Now imagine that you want to create a fragment-based fingerprint; all you need to do is generate the corresponding canonical fragment SMILES and hash them. Job done.

Another potential use would be to...oh oh...dinner time...you'll have to use your imagination. Before I go, just to note that credit for this feature, and most of the SMILES writer indeed, goes to Craig James.

Thursday 13 September 2012

Plotting accesses on the axis Part II

In an earlier post, I showed the accesses in the first month for the Blue Obelisk and Open Babel papers from late last year.

I should have stopped there.

Instead I decided to see how the recent Avogadro paper compared (Update 15/09/12: thanks to Geoff for filling in the missing points):

Hmphhh. BTW, Avogadro 1.1.0 was just released yesterday. Check out the new features.

Monday 10 September 2012

Moving to pastures new, but still in the same field Part II

Following on from my previous post on the topic, last Thursday was my last day at University College Cork, and indeed my last day in academia.

I've just moved back to Cambridge (UK) where I have joined the growing team at Roger Sayle's cheminformatics company, NextMove Software. I've mentioned the company before - it has several software products in the area of chemical text mining and name recognition.

I guess some of you are wondering what this means for my involvement in the various Open Source projects to which I contribute. No? Oh well, and I had a whole spiel prepared too...:-) Anyway, the good news is that I'll now be able to attend some conferences again, so hopefully you'll see me around some time in the future.

Tuesday 4 September 2012

The Imp Act Factor strikes again

Here's a quick question, for what shape distribution does the mean convey the least useful information? Well, there are many answers, but a prime candidate is a one-sided long-tail distribution of the type exhibited by journal citations. The mean, standard deviation and Pearson correlation, are all summary statistics developed for the two-sided normal distribution (exercise for the reader: what are their non-parametric equivalents?). Applying them to anything else is like putting lipstick on a pig (ok, a poor analogy, but it sounds funny :-), but this porcine paintjob is exactly the method used to calculate a journal's Impact Factor.

So, what's the problem? In the context of a one-sided long-tail distribution, the mean is highly sensitive to outliers, and thus almost useless. Let's take an example. Let's suppose there were 99 papers published and each was cited once giving an Impact Factor of 1.0 (99*1/99). Now let's suppose a single additional paper was published which garnered 100 citations. The Impact Factor of the journal is now (99*1 + 1*100)/100 = 2.0. So a single paper, an outlier, has doubled the Impact Factor.

But that wouldn't happen in practice, right? No - you don't get it. The distribution of journal citations has a shape that guarantees this to happen; all those impact factors you read are just measures of outliers. How about instead "the Outlier Factor", or better still the "Extreme Value Factor"?

Still don't believe me? Well, let's take a concrete example. Thomson ISI has just deigned to give J. Cheminf. its first impact factor with a value of 3.42. Let's say that 65 papers have been taken into account, so that's about 222 citations in total. Now let's enter an outlier into the mix, say the Open Babel paper published in Oct of last year. I would expect about 30 to 60 citations a year once it gets going (based on prior citations of the software, as well as experience with the GaussSum paper) - let's just say 50 for a round number, so 100 citations in the 2-year period included in an impact factor. This means that all else being equal, in one year's time the journal's Impact Factor will rise to 4.1, and in two years to 4.9.

I just hope those Avogadro guys don't publish another outlier. :-)

Monday 3 September 2012

2012 - The Year of Open Access

This year will be remembered as the year that Open Access went mainstream. This whole movement represents a significant change in the field of science. It has led to a widespread realisation that closed access publishers by definition do not have scientists' interests at heart (maximum dissemination of their work) along with a recognition that scientists needs to engage with the wider community, not just their peers in ivory towers. In the end, it all came down to the funders.

For my own part, over the past number of years I have become increasingly convinced that all scientific work should be freely available and preferably Open Access, CC-BY* and copyright me (hey - if I'm going to pay, I want it AALLLLLL!!). I want people to read my work and I want people to be free to remix and reuse it in any way they want - this is the essence of this thing we call science. Burying work in non-OA journals, or worse still in books that few will have access to let alone read, seems to me to be a baaad idea, and especially so now that the writing is on the wall (not literally of course, it's usually on the web).

Others have recounted the events so far this year but here they are again:
Jan - The year started at a low point, a motion in the US to repeal the NIH's public access policy, the Research Works Act. It turned out that Elsevier were behind this.
Feb - After a major outcry, and Tim Gower's announcement that he would boycott Elsevier (subsequently supported by 12K others), Elsevier dropped the bill.
Jun - The Wellcome Trust starts cracking the whip on its OA compliance (only 55% of funded publications were compliant). In future, non-compliance will mean grant money will be withheld, and non-compliant publications will be ignored for the purposes of applying for further funding. Furthermore, publications must be effectively CC-BY.
Jul - The Research Councils of the UK (RCUK) mandate OA, and specifically CC-BY.
Jul - The EU is talking about mandating OA for the 2014-2020 framework funding.
Aug - Wiley announces that its OA journals will now adopt CC-BY.

Historic times. For more, see this link.

*Note: CC-BY is a copyright license developed by the Creative Commons. It's very simple. It means you can legally do whatever you want with the article, so long as you acknowledge the copyright holder. For more info, see the license.

Bibliometricking J Cheminf

Just for fun, a week or two ago I decided to check out who are the most prolific authors in J. Cheminf. I downloaded the TOC in Endnote format from the journal website, and analysed it with the short script below.

There were 295 authors in total (after merging of I think a single duplicate), and all of the authors with 2 or more papers are as follows (by no. of publications, then reverse alphabetical by surname):
14 Peter Murray-Rust
8 Stephen Bryant
8 Evan Bolton
7 Sam Adams
6 Egon Willighagen
6 Joe Townsend
6 Sunghwan Kim
5 Andreas Zell
5 Antony Williams
5 Andreas Jahn
5 Georg Hinselmann
4 David Wild
4 Christoph Steinbeck
4 Noel O'Boyle
4 Geoffrey Hutchison
3 Tim Vandermeersch
3 Henry Rzepa
3 Lars Rosenbaum
3 Matthias Rarey
3 Stefan Kramer
3 David Jessop
3 Nina Jeliazkova
3 Marcus Hanwell
3 Nikolas Fechner
3 Peter Ertl
2 Erik van Mulligen
2 Peter Willett
2 Valery Tkachenko
2 Jens Thomas
2 Ola Spjuth
2 Christopher Southan
2 Weerapong Phadungsukanan
2 Ben O'Steen
2 Sorel Muresan
2 David Lonie
2 Andrew Lang
2 Jan Kors
2 Jos Kleinjans
2 Andreas Karwath
2 Jochen Junker
2 Vedrin Jeliazkov
2 Craig James
2 Jonathan Hirst
2 Kristina Hettne
2 Lezan Hawizy
2 Martin Gutlein
2 Rajarshi Guha
2 Mikhail Elyashberg
2 Michel Dumontier
2 Ying Ding
2 Open Source Drug Discovery Consortium
2 Leonid Chepelev
2 Fabian Buchwald
2 Jean-Claude Bradley
2 Kirill Blinov
...and here's the script used to calculate this:

Wednesday 22 August 2012

Transforming molecules into...well...other molecules

It's fairly straightforward to filter structures by SMARTS patterns using Open Babel, but how about if you want to transform all occurences of a particular substructure into something else? This may be useful for structure cleanup, tautomer normalisation, or even to carry out a reaction in silico.

Anyhoo, that's enough background. Let's suppose we want to hydrogenate all instances of C=C and C=O. Just copy and paste the following into the end of plugindefines.txt:
hydrogenate      # ID used for commandline option
*                # Asterisk means "no datafile specified"
Hydrogenate C=C and C=O double bonds
TRANSFORM [C:1]=[C:2] >> [C:1][C:2]
TRANSFORM [C:1]=[O:2] >> [C:1][O:2]
This gives a new obabel option, --hydrogenate, that will do the job:
C:\Tools\tmp>obabel -L ops
hydrogenate    Hydrogenate C=C and C=O double bonds
C:\Tools\tmp>obabel -:C=C -osmi --hydrogenate
C:\Tools\tmp>obabel -:O=CSC=CC=S -osmi --hydrogenate
(1) This works best in the latest SVN as Chris sorted out some longstanding bugs.
(2) I've just enabled this in Python where it works as follows:
import pybel
transform = pybel.ob.OBChemTsfm()
success = transform.Init("[C:1]=[C:2]", "[C:1][C:2]")
assert success
mol = pybel.readstring("smi", "C=C")
assert mol.write("smi").rstrip() == "CC"

Wednesday 15 August 2012

Using cheminformatics to guide your career path

Leaving aside the fact that the term scientific career is a bit of an oxymoron, I propose to show how considering the distribution of alpine flora can help guide your career choices. When Jaccard first headed for the alps to count edelweiss, little did he know he would make a stunning discovery that would change the face of cheminformatics forever - the Tanimoto coefficient. Here I show how to extend this coefficient to career path planning.

Simply put, you should choose your next institution based on whether it has a high Tanimoto coefficient with your current. Let's take my career as an example and show how this works:

1997-2001 University College Galway, UCG
2001-2004 Dublin City University, DCU (Tanimoto coefficient of 2/4 = 0.5)
2004-2005 University College Dublin, UCD (3/3 = 1.0)
2005-2006 Unilever Centre Cambridge, UCC (2/4 = 0.5)
2006-2009 Cambridge Crystallographic Data Centre, CCDC (2/5 = 0.4)
2009-         University College Cork, UCC (2/5 = 0.4)

As with any career there were some highs (Tanimoto of 1.0) and some lows (values of 0.4). Now the question is, what about my next move?

Notes: UCG is now NUIG (National University of Ireland, Galway).

Tuesday 14 August 2012

Grant me that - My non-funding history

A successful PI here in UCC made the point to me (after a grant application of mine was bounced) that only about 1 in 6 applications of his are funded. This may have been a white lie, but I thought it'd be interesting to look back on all of my applications since I finished my PhD and see how it's gone for me (items in bold were funded):

 2005 - Marie Curie Intra-European Fellowship
 2006 - Royal Commission for the Exhibition of 1851
 2006 - EPSRC Project Grant (minor contribution)
 2007 - President of Ireland Young Researcher Award
 2008 - SFI Principal Investigator
 2008 - Wellcome Trust Research Career Development Fellowship
 2008 - Health Research Board Career Development Fellowship
 2009 - CSA Trust Jacques-Émile Dubois Grant
 2009 - SFI Starting Investigator Research Grant
 2009 - Wellcome Trust Research Career Development Fellowship
 2010 - CSA Trust Jacques-Émile Dubois Grant
 2011 - European Research Council Starting Grant
 2011 - IRCSET Enterprise Partnership Scheme
 2011 - SFI Starting Investigator Research Grant

Note that a couple of researchers out there are going a bit further than me, and actually posting up their grant applications. Check them out (via +Jan Jensen).

Image credit: The above image of Grant's Tomb (geddit?) is by ScottOldham (CC-BY-SA).

It all adds up to a new descriptor Part II

As described in an earlier post, it's pretty easy to implement a new group contribution descriptor with Open Babel just by editing a text file. Well, that post inspired Andy Lang to use this feature to develop a melting point model which you can download right now and apply using Open Babel.

You can read the announcement of the model on the OB mailing list or get the model and read all the details including exactly how it was developed over at Jean-Claude Bradley's Open Notebook Science wiki. The image above by Andy Lang shows the performance of the model on a test set ([Update 02/2018]: see image at Figshare instead).

Wednesday 8 August 2012

Presentations and Practicals from Resources for Computational Drug Discovery

As discussed previously, I recently presented on the topics of cheminformatics and protein-ligand docking. The talks were slightly updated for the occasion, and are included below.

I also supervised a practical on cheminformatics involving the Open Babel GUI. This will be available as part of the regular OB documentation, but in the meanwhile, you can access it at ReadTheDocs. It covers file format conversion (surprise!), depiction, filtering, and similarity and substructure searching.

Here's my cheminformatics talk:

...and my protein-ligand docking talk:

Monday 9 July 2012

Rocking with ChEMBL at Resources for Computational Drug Discovery

Last week I was one of the trainers at the Joint EMBL-EBI/Wellcome Trust Course: Resources for Computational Drug Discovery organised by the ChEMBL team. I covered the topics of cheminformatics and protein-ligand docking (see next post).

As with previous occasions, I really enjoy participating in this course as it's an excellent opportunity to meet a really diverse bunch of people from all over the world. And that's just the ChEMBL group (just kiddin'). I also found the talks on early-early stage drug discovery very interesting (Mike Barnes - target validation, Anna Gaulton - druggability) as the holistic approach used in industry couldn't be more different from the approach often seen in academia, where a researcher just falls in love with a particular target (as Bissan Al-Lazikani put it) and will pursue it in the face of evidence that it may be a non-runner. Andreas Bender and John Irwin spoke about contrasting methods of target prediction for a particular ligand, one based on physicochemical properties of the binding site, and the other completely ligand-based (SEA) - this is an area that seems to have only developed in recent years. Another interesting session was when John Overington led the participants in a review of recently proposed repurposed drugs for neglected disease - it seems that half the battle is identifying suitable candidates (i.e. not just those with efficacy).

On the last day we all received a parting gift from the organisers, a stick of rock candy. John Overington recommended checking with customs before bringing it through as some of the compounds may be illegal in some jurisdictions:

Thursday 28 June 2012

Drawing vector diagrams with Python

What's the easiest way to draw nice vector diagrams with Python? I was asked this question some time ago by a bioinformatician who was moving from Perl to Python, and previously used the gd library. gd was widely used back in the day, but looks a bit clunky now.

Here's the solution I came up with, to use SVG through PySVG. It's pure-Python and so doesn't require any extra libraries, which is always a pain when dealing with graphics.

I used the code below to generate an SVG file, which was converted to the above PNG with Inkscape:
Anyone else got any suggestions?

Wednesday 27 June 2012

How to store stereochemistry in Mol files II

Following on from earlier work, I decided to resurrect Open Babel's support for storing stereo in 0D Mol files. I always liked this idea, but at the time, I removed it before release because no-one else did. Well, now Craig James tells me he likes it too, so it's back in and will be available in OB 2.3.2.

Simply put, in OB 2.3.2 tetrahedral and cis/trans stereo can be roundtripped through a 0D Mol file by using an extension to the defacto standard, as follows:

obabel -:"I/C=C/C[C@](F)(Br)Cl" -omol | obabel -imol -osmi

With the current release, cis/trans stereo is only preserved if you generate 2D coordinates ("--gen2D") although tet stereo can be read from the chiral flags for 0D files if you specify "-as".

How does the extension work? Well, for tet stereo it just uses the chiral flags (in defiance of the spec which says to ignore chiral flags on reading - take that spec!!). For cis/trans stereo it uses Up/Down markings equivalent to those used by SMILES; all of the (at most) 4 stereobonds are given Up/Down markings; if two bonds at either end are both Up or both Down this implies cis.

A better way to do it would have been to use a double bond flag equivalent to the chiral flag - e.g. 1 means that the first two bonds that appear in the bond section are cis - this would be both easier to compute and to interpret; however it would push the spec a bit too far.

Thursday 7 June 2012

Holy moley - A blessed ordering of atoms

I know I shouldn't cast the first stone but sometimes only a saint could turn the other cheek.

Here's some background from Wikipedia:
In computer science, canonicalization (...also sometimes standardization or normalization) is a process for converting data that has more than one possible representation into a "standard", "normal", or canonical form.
Not to be confused with Canonization...
Canonization (or canonisation) is the act by which a Christian church declares a deceased person to be a saint, upon which declaration the person is included in the canon, or list, of recognized saints.
...and here are some papers from JCICS:
MOLGEN-CID - A Canonizer for Molecules and Graphs Accessible through the Internet 
The Signature Molecular Descriptor. 4. Canonizing Molecules Using Extended Valence Sequences

A search of ACS publications shows seven references to canonizer in total, and 32 references to canonization.

Okay, so I'm a native speaker, and yes, I mispell words too. I still think it's funny. :-)

Wednesday 6 June 2012

Arrr, I be at the EBI

I am currently visiting the EMBL-EBI (courtesy of Christoph Steinbeck), and listening to the sound of an English summer pelting the roof like no-one's business. If you're around (or even a square) and interested in meeting up to discuss something cheminformaticky, feel free to drop me a line. (Update: I'm back home now)

Wednesday 23 May 2012

When Mol files go wrong III

Let's play spot the difference. Are the following structures the same? (Mol files from CHEMBL186139 and CHEMBL1180158.)
But if they're the same, then how come there are two distinct entries for this in the database? Well guess what - they don't have the same InChI:
The nitrogen attached to the ring is treated as a C=N once the two protons are added to neutralise the charge. The InChI code then considers the stereochemistry across that double bond to be defined in one case (177.2°) but undefined in the other (179.1°). Here are the pictures from winchi (click to enlarge):
I'm not quite sure where the problem is. Is the InChI correct to make the distinction? Any thoughts?

Monday 21 May 2012

When Mol files go wrong II

With time I've become more convinced that the SMILES format is more capable of faithfully storing stereochemistry than a 2D format such as Mol. Here is another tale of woe, related to tetrahedral stereocentres with one implicit bond, illustrated and annotated by Symyx Draw (am I the only one who thinks this is better than ChemDraw?).

Did you know that the stereochemistry of the wedge is interpreted differently in the two following cases?
Easy peasy, eh? But what about the in-between case where the angle between the two plane bonds is close to 180 (see below on left)? Guess what - you're in trouble if you do this. Some software will regard this as undefined, some will continue on regardless. If you look at the InChI string you can see that it regards the stereo as undefined, whereas the SMILES string does contain a stereocentre. In short, you've got a problem; if you're in charge of a database, you should identify such cases and fix them (manually), for example as shown on the bottom right (if that is the correct stereo) or by adding the implicit hydrogen.
In the course of other work, I've come across some instances of this problem in ChEMBL and will be talking to the team about sorting it out. Does anyone have other examples of potential stereo problems in Mol files and how to identify them?

Wednesday 2 May 2012

Speedup repeated calls to Python functions

If your Python script has repeated calls to a function with the same parameters each time, you can speed things up by caching the result. This is called memoization. It's not rocket science.

What is rocket science is that with a little bit of Python magic (see the code here), you can simply add memoization to any function with the @memoized decorator, e.g.
def calcHOMO(smiles):
   # Generate Gaussian input file with Open Babel
   # and run Gaussian to find the HOMO.
   return homo
Calling this function with a SMILES string the first time would return the HOMO after 10 minutes. Calling it a second time would return the result instantly.

Update (11/05/2012): This feature is available in the Python standard library as of Python 3.2. See Andrew's comment below

Tuesday 17 April 2012

Painting molecules your way - Introducing the paint format

If the range of depiction output formats (PNG, SVG and ASCII currently) provided by Open Babel is not enough for you, you can easily draw molecules yourself using the information provided by the new paint utility format.

The paint format simply describes lists of the actions required to generate a depiction of the molecule. For example, I used this information to prototype the ASCII output format in Python. Here it is in action:
>obabel -:c1ccccc1C(=O)Cl -opaint

NewCanvas 218.6 200.0
SetPenColor 0.0 0.0 0.0 1.0 (rgba)
SetPenColor 0.0 0.0 0.0 1.0 (rgba)
SetPenColor 0.0 0.0 0.0 1.0 (rgba)
SetPenColor 0.0 0.0 0.0 1.0 (rgba)
SetPenColor 0.0 0.0 0.0 1.0 (rgba)
SetPenColor 0.0 0.0 0.0 1.0 (rgba)
SetPenColor 0.0 0.0 0.0 1.0 (rgba)
DrawLine 109.3 100.0 to 143.9 120.0
SetPenColor 0.0 0.0 0.0 1.0 (rgba)
DrawLine 146.9 120.0 to 146.9 147.0
DrawLine 140.9 120.0 to 140.9 147.0
SetPenColor 0.0 0.0 0.0 1.0 (rgba)
DrawLine 143.9 120.0 to 167.3 106.5
SetPenColor 0.0 0.0 0.0 1.0 (rgba)
DrawLine 74.6 120.0 to 40.0 100.0
DrawLine 73.0 110.8 to 48.8 96.8
SetPenColor 0.0 0.0 0.0 1.0 (rgba)
DrawLine 40.0 100.0 to 40.0 60.0
SetPenColor 0.0 0.0 0.0 1.0 (rgba)
DrawLine 40.0 60.0 to 74.6 40.0
DrawLine 48.8 63.2 to 73.0 49.2
SetPenColor 0.0 0.0 0.0 1.0 (rgba)
DrawLine 74.6 40.0 to 109.3 60.0
SetPenColor 0.0 0.0 0.0 1.0 (rgba)
DrawLine 109.3 60.0 to 109.3 100.0
DrawLine 102.1 66.0 to 102.1 94.0
SetPenColor 0.0 0.0 0.0 1.0 (rgba)
DrawLine 109.3 100.0 to 74.6 120.0
SetPenColor 0.4 0.4 0.4 1.0 (rgba)
SetPenColor 0.4 0.4 0.4 1.0 (rgba)
SetPenColor 0.4 0.4 0.4 1.0 (rgba)
SetPenColor 0.4 0.4 0.4 1.0 (rgba)
SetPenColor 0.4 0.4 0.4 1.0 (rgba)
SetPenColor 0.4 0.4 0.4 1.0 (rgba)
SetPenColor 0.4 0.4 0.4 1.0 (rgba)
SetPenColor 1.0 0.1 0.1 1.0 (rgba)
SetFontSize 16
SetFontSize 16
SetFontSize 16
DrawText 143.9 160.0 "O"
SetPenColor 0.1 0.9 0.1 1.0 (rgba)
SetFontSize 16
SetFontSize 16
SetFontSize 16
SetFontSize 16
DrawText 178.6 100.0 "Cl"
To create an example depiction I naturally turned to bananas. Why bananas? Well, they're yellow, sweet if properly ripe, and of course are full of potassium. Using the information above and the magic of SVG, we can generate the following image (click to access zoomable SVG):

Other fruit are available, and if you would like me to put together a similar image for you, contact me and maybe we can work something out.

As an exercise for the reader, it would be cool to see the same thing done in 3D using say Blender or Povray.

The banana SVG image is from OpenClipArt. It was modified, then I ran the Python script below, and finally trivially modified the output to give the final SVG. Here's that Python script:

Tuesday 10 April 2012

Depict a chemical structure...without graphics Part III

Following on from earlier posts (here and here), I got to thinking again about molecular depiction using text. The solution I arrived at in Part II had a couple of drawbacks:
  1. It relied on an external library, aalib
  2. aalib doesn't seem to be available on Windows (at least not in a way I can use with MSVC)
  3. aalib is really aimed at bitmap depiction as ASCII, and I'm interested in vectors (lines)
So obviously there was nothing for it but to write some code myself for depicting lines as ASCII, which is essentially what I've done with ASCIIPainter, now part of Open Babel. You could use this to draw an arbitary image, but let's check out how it works for molecules:
It didn't work quite so well for text pasted directly into this blog as the aspect ratio was quite high (i.e. low resolution in the y direction), but once I hit upon adding style="line-height:100%" to the pre tag it was much improved:
obabel -:c1cc(C(=O)Cl)ccc1 -oascii -xw60 -xa1.6


             | |
             | |
             | |
             | |
             | |
             |_|                         __
            _/  \__                   __/  \__
          _/       \__             __/        \_
        _/            \_         _/     __      \__
     __/                \__   __/    __/           \__
Cl                         \_/    __/                 \__
                            |   _/                      |
                            |                           |
                            |                        |  |
                            |                        |  |
                            |                        |  |
                            |                        |  |
                            |                        |  |
                            |                        |  |
                            |                        |  |
                            |   __                      |
                            |_    \__                   _
                              \__    \__             __/
                                 \__    \_         _/
                                    \_          __/
                                      \__    __/
I've added an output option to help tune the aspect ratio (-xs). Also, multimolecule output is supported, and a fun pastime is to watch ASCII depictions of large libraries fly by at the command line.

Wednesday 4 April 2012

Getting your double bonds in a twist - How to depict unspecified stereo

I've just been adding depiction support for double bonds with unspecified stereo, and thinking about how this should be done: a squiggly bond for a substituent on the double bond, or make the double bond itself twisted? Actually, I didn't have to think too much as Rich and others (also mcule) have already worked through these same issues. In short, the IUPAC recommendation (from 2006) is best avoided, and a twisted bond should be used instead.

So, here's the result after implementing the twisted bond:
obabel -:"Cl/C=C/Br" -:"Cl/C=C\Br" -:"ClC=CBr"
       -O tmp.svg -xr 1

...and with an asymmetric double bond for extra pizazz:
obabel -:"Cl/C=C/Br" -:"Cl/C=C\Br" -:"ClC=CBr"
       -O tmp.svg -xr 1 -xs

Credits: Twisted double bond by me. Everything else of depiction by Chris Morley and Tim Vandermeersch. Structure layout by Sergei Trepalin.

Monday 2 April 2012

Cheer up your LaTeX with SMILES support II

In Part I, I showed how to embed PNGs, automatically generated from SMILES by obabel, into LaTeX documents. An alternative approach is to use the SVG output from obabel.

In a comment to my earlier post, Billy suggests running the SVG through Inkscape or rsvg-convert:
You can also embed the material as a vector graphic, of course. Inkscape doesn't seem to support pipes, and rsvg-convert gives ugly output, but I'm sure there's other options.
\immediate\write18{obabel -:'#1' -osvg -p | rsvg-convert -f pdf -o smilesimg\arabic{smilescounter}.pdf}
\immediate\write18{obabel -:'#1' -O smilesimg\arabic{smilescounter}.svg -p ; inkscape -f smilesimg\arabic{smilescounter}.svg -A smilesimg\arabic{smilescounter}.pdf}
Also, if you don't want to call these applications when the graphic files aren't out of date, then use the code snippet found at the top of the 3rd page from this article.
I found a third approach on the interwebs soon after writing the initial post. It's by Jakob Lykke Andersen who converts the SVG to PDF with ImageMagick's convert. If you download the file graphviz.tex from his website, you can just include it and use it as in the following example:
The resulting PDF looks better than the original (though it could be because I didn't handle the PNGs properly in the first PDF). A nice little touch in Jakob's version is that an error box appears in the PDF if there is a problem generating the image.

Exercise for the reader:
A bit more polish is needed before these methods can be used wholesale by others. If you know a bit about LaTeX, have a go at an obabel package for CTAN.

Friday 30 March 2012

Molecular eye candy for IPython

IPython is an enhanced Python shell that does magic and is wonderful. Apart from me, everyone else uses it. You can install it with "easy_install IPython" on Windows, Linux, etc. Recent versions have a graphical console (start with "ipython-qtconsole") to which we can add some molecular eye candy. Greg Landrum has already done this for RDKit. Here's how to do it for Pybel.

First of all, here's the before picture:
Now, copy and paste the following into the console:
pybel.Molecule._repr_png_ = lambda x: x.write("png")
And here's how it looks after:
Greg has gone a bit further with RDKit, and has enhanced substructure matches, but that's where I'll draw the line for now.

Thursday 29 March 2012

Reduce, recycle and reuse bond closure symbols?

Time for a new poll. I'd like to know what you think about reusing bond closure symbols in SMILES strings. Simply put, when writing SMILES there is a choice between reusing the bond closure symbols or using a new number each time. That is, the following molecule with two rings could be written as:
C1CC1c1ccccc1 or

Here are the pros and cons of choosing not to reuse bond closure symbols (anything missing?):

  • Easier to count up the number of rings
  • Easier for a human to find corresponding bond openings and closures
  • Easier to implement (but not a lot)
  • Reuse of bond symbols can cause confusion

  • Use of % notation for two-digit bond closures can cause confusion (people are not so familiar with it)
  • Leads to syntax like C%1145, meaning that bonds 11, 4 and 5 close here (not very intuitive)
  • Limits bond closures to 99 (but more than 99 is unlikely)
  • Not as concise as alternative (for molecules with at least 10 bond closures)

See if you can answer this poll before checking what does software A do, or standard B suggest. What do you think should be the standard approach?

Update (17/04/2012): Poll results: 8 for reuse, 3 against

Tuesday 27 March 2012

Cheer up your LaTeX with SMILES support

No-one likes an unhappy document preparation system, so let's add support for SMILES to LaTeX. LaTeX has already seen a lot of love from chemists - check out the list of chemistry packages at CTAN. Perhaps even more surprising is the fact that a couple of these packages are being actively maintained and updated (e.g. chemmacros and chemfig).

So, let's get down to business. This is a simple hack that will insert a 2D diagram of a molecule if you use the \smiles command. Here's the necessary code (at the top of the .tex file) and an example of use in the main body. The resulting PDF is here.

So how does this work? Well, it simply calls obabel at the command-line to create a .PNG file, and then adds an \includegraphics command (the idea is taken from texments). The only catch is that command-line calling of executables is only allowed if you run pdflatex with the -shell-escape option ("pdflatex -shell-escape myfile.tex"). (For the record, I did this all on Windows, with pdflatex provided by Cygwin.)

While this is, um, neat, to be honest you'd probably be better off creating the PNGs in advance if you have a lot of SMILES strings. A much cooler way to embed chemistry in LaTeX would be to write a QSAR paper using Sweave/R/CDK via Rajarshi's CDK/R packages. This is the idea of literate programming, where the document functions as a complete description of the analysis because it is the analysis; it is the ultimate in reproducible research. Has this been done? Any takers?

Tuesday 20 March 2012

Share ami

Some time ago Google shut down the Share feature of Google Reader to push everyone onto Google Plus. Unfortunately, Google Plus is a bit of a walled garden (and also it sucks compared to FriendFeed, but that's not the issue here). Unlike Google Reader, it doesn't provide an RSS feed for shared items so there's no way I can list them in the sidebar of this blog (as I used to). Instead, this post contains a selection of items I shared on Google Plus since I joined in Nov last year.

I don't know if this is going to be a regular feature, so to follow me on Google Plus, check out my profile page.

The following items are listed somewhat chronologically, youngest first:

    Monday 12 March 2012

    Sunday 11 March 2012

    Accessing the CDK from Jython in Knime

    I've been trying to find out how to use the CDK and/or Open Babel from Jython in Knime. Unfortunately, there's no documentation beyond a single example script. It's a bit strange, as having this working would enable a whole host of developers to write nodes, particularly in the area of chemistry where Python has made considerable inroads.

    After asking and not receiving any answer, I've tried to hack something together. I haven't yet figured out how to do it properly, but at least I've learnt a bit, and the end result was that I was able to calculate the molecular weights with the CDK as in the following diagram:

    Here's the code for the node. It was debugged by keeping the console log file open in gvim, and reloading (:edit) after every run. Also I had to configure the node by selecting the "Script Output" tab, choosing "Append columns to input table spec", and adding an Output column of type Double:

    The way I did this was by hacking the source code for the Jython node itself. First I figured out how to compile it (YMMV). The source is available, but for some reason various files are missing that prevent it compiling out of the box (e.g. the plugin.xml). So I ran the Create Node Wizard (manipulator), copied over various files from the source, copied over various files from the binary (the files in lib and schema), added the lib/jython.jar (as described here), and installed the org.python.plugin node (not forgetting to remove the duplicate org.knime.ext.jython).

    To add the CDK jar to the Classpath, I placed it in the lib directory, repeated the procedure described above for lib/jython.jar, and then added the following hack to PythonScriptNodeModel.execute:
    If there's someone from Knime reading this, please let me know the correct way of doing this. I know that there's an extension point for extending the Classpath of the Jython node, but how is this used?