Wednesday 31 October 2012

A non-random method to improve your QSAR results - works every time! Part II

In an earlier post, I argued that, when developing a predictive QSAR model, non-random division of a QSAR dataset into training and test sets is a bad idea, and in particular that diversity selection should be avoided.

Not everyone agrees with me (surprise!). See for example this excellent review by Scior et al in 2009 on "How to Recognize and Workaround Pitfalls in QSAR Studies: A Critical Review". Well, the paper is excellent except for the bit about training/test set selection which explicitly pours cold water on random selection in favour of, for example, diversity selection, hand selection or *cough* rational selection.

I had vague thoughts about writing a paper on this topic, but now there's no need. A paper by Martin et al has just appeared in JCIM: "Does Rational Selection of Training and Test Sets Improve the Outcome of QSAR Modeling?"

And the answer? No.

Combining their discussion with my own randomly-selected thoughts, the reasons it's a bad idea can be summarised as follows:
  1. You are violating the first rule of statistical testing - the training and test sets must be chosen in the same way (this was pointed out to me by a bioinformatician on FF - sorry I can't recall whom).
  2. All the weird molecules are sucked into your training set.
  3. Every item in the test set is going to be close to an item in your training set, and your internal predictions are going to be overoptimistic compared to reality. (You do care about reality, don't you?)
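To put some numbers on point 3, here's a small sketch. It uses synthetic data and a simple greedy max-min picker standing in for Kennard-Stone-style diversity selection (everything here is illustrative, not from any real QSAR dataset), and compares the distance from each test point to its nearest training-set neighbour under the two splitting schemes:

```python
import numpy as np

def pairwise_dist(A, B):
    # Euclidean distance matrix between rows of A and rows of B
    return np.sqrt(((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1))

def maxmin_select(X, n_train):
    # Greedy max-min ("diversity") selection: start from the two most
    # distant points, then repeatedly add the point whose nearest
    # already-chosen neighbour is furthest away
    D = pairwise_dist(X, X)
    i, j = np.unravel_index(D.argmax(), D.shape)
    chosen = [int(i), int(j)]
    while len(chosen) < n_train:
        remaining = [k for k in range(len(X)) if k not in chosen]
        nearest = D[np.ix_(remaining, chosen)].min(axis=1)
        chosen.append(remaining[int(nearest.argmax())])
    return np.array(chosen)

def mean_nn_dist(X, test_idx, train_idx):
    # Average distance from each test point to its nearest training point
    return pairwise_dist(X[test_idx], X[train_idx]).min(axis=1).mean()

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 4))
all_idx = np.arange(120)

# "Rational" split: diversity-picked training set, leftovers as test set
div_train = maxmin_select(X, 80)
div_test = np.setdiff1d(all_idx, div_train)

# Random split of the same size
perm = rng.permutation(120)
rand_train, rand_test = perm[:80], perm[80:]

print("diversity split:", mean_nn_dist(X, div_test, div_train))
print("random split:   ", mean_nn_dist(X, rand_test, rand_train))
```

On data like this, the diversity split leaves every test point snuggled up against a training point (the weird outliers having all been sucked into the training set), which is exactly why its internal validation statistics look so rosy compared to reality.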
Dr Kennard-Stone has a lot to answer for, but hopefully this is the final nail in the coffin (it being Halloween after all) for diversity selection in QSAR.

Update (7/11/12): In the comments, George Papadatos points out that "There's also related evidence that sophisticated diversity selection methods do not actually perform better than random picking".

Thursday 25 October 2012

Open Access and Ireland

It's Open Access week, so let's see how Ireland (ROI) is shaping up.

The good news first. The Government has just this week announced a national OA mandate, presumably following on from the recent happenings in the UK. Indeed, most of the national funding agencies have had their own OA mandates for several years, but this formalises the policy at a national level.

So let's check out the OA output of the various universities and see whether taxpayers are getting their money's worth. Each university has its own institutional repository, and these are collected in a national repository, Rian. I generated the following bar charts using the statistics on deposited journal articles in Rian, combined with academic staff numbers taken from the Wikipedia pages for each university.

Are academics in TCD really 9.5 times more productive than their peers in UCC? Possibly not. I think it likely that OA deposition rates across most Irish universities could do with improvement.

Sunday 21 October 2012

Learn scikit and never use R again

Hallelujah, my brothers and sisters. Free yourselves from the <- and the $, the "how do I select a row again? I only did this last week!", and the arcane differences between a table and a data.frame. Join with me in using scikit-learn, and never use R again.

Could it be true? No need to ever use R again? Well, that's how it looks to me. Scikit-learn is a Python module for machine learning which seems to replicate almost all of the multivariate analysis modules I used to use in R. Thanks to Nikolas Fechner at the RDKit UGM for putting me on to this.

Let's see it in action for a simple example that uses SVM to classify irises (not the eyeball type). First, the R:

library(e1071)
mysvm <- svm(Species ~ ., iris)
mysvm.pred <- predict(mysvm, iris)
table(mysvm.pred, iris$Species)
# mysvm.pred   setosa versicolor virginica
#   setosa     50      0          0
#   versicolor  0     48          2
#   virginica   0      2         48

And now the Python:
from sklearn import svm, datasets
from sklearn.metrics import confusion_matrix
iris = datasets.load_iris()

mysvm = svm.SVC().fit(iris.data, iris.target)
mysvm_pred = mysvm.predict(iris.data)
print confusion_matrix(mysvm_pred, iris.target)
# [[50  0  0]
#  [ 0 48  2]
#  [ 0  0 50]]
This library is fairly new, but there seems to be quite a bit of momentum in the data processing space right now in Python. See also Statsmodels and Pandas. These videos from PyData 2012 give an overview of some of these projects.
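And to answer my own jibe above about selecting rows, here is roughly how it looks in Pandas (a toy table, not any real dataset):

```python
import pandas as pd

# A toy table: three rows, two columns
df = pd.DataFrame({"species": ["setosa", "versicolor", "virginica"],
                   "petal_length": [1.4, 4.7, 6.0]})

first_row = df.iloc[0]                       # select a row by position
long_petals = df[df["petal_length"] > 4.0]   # ...or rows by a condition
print(long_petals)
```

No `<-`, no `$`, and I can actually remember the syntax the following week.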

Saturday 20 October 2012

Old software never dies, it just...

Ah, the old ones are the best. Or so some think.

During my PhD, I wrote software for linking the output of comp chem calculations to experimental results (IR and UV-Vis spectra, and so forth). This software, GaussSum, has now been around for quite some time (Aug 2003 in fact). For the most part I no longer add features, beyond updating the parsing library (from cclib) to handle new file formats. At this point most of the bugs have been identified and removed...but does everyone use the latest version? Let's find out...

If you search Google Scholar for papers published in 2012 with the word "GaussSum" (but not "Gauss") you get about 121 citations (well, mentions, at least). In 45 of these cases, the extract in the Google Scholar results unambiguously lists the version number used.

The majority of the citations, a total of 27, are for the current version 2.2, released in Nov 2009 with several updates since. Four specify the precise version: 2.2.0 (Nov 2009), 2.2.2 (Nov 2009), 2.2.4 (July 2010) and 2.2.5 (the latest, in Jan 2011).

Of the remainder, the majority, a total of 13, are for version 2.1, released first in June 2007 and replaced by version 2.2 in Nov 2009. Three specify the precise version: 2.1.4 twice (Dec 2007) and 2.1.6 once (Apr 2009, the final update).

No one cites version 2.0 (Oct 2006 to June 2007). (Sad face.) But that still leaves five citations unaccounted for.

Version 1.0 was released in Oct 2005 and held sway until Oct 2006. It has been cited 4 times this year. In three of those cases they cite the precise version, version 1.0.5 (the final update, Aug 2006).

But that is not all. Oh no. I still have a loyal user of GaussSum 0.8 out there. Although released in Jan 2004 and superseded in Apr 2005 by version 0.9, this particular version with its quaint bugs and lovable lack of features, still has its adherents (well, adherent, anyway).

And the moral of the story is..."don't ever use Google Scholar to check what versions of software people are using". No that's not it - I mean "update your scientific software or some of your results are going to be dubious." At least check for a new version every 8 years.

Indeed, it seems old software never dies, it just gets fewer citations.

Wednesday 17 October 2012

Wherein I get Roger to blog

My cunning plan to get Roger Sayle to write down some of his findings has come to fruition. NextMove Software now has its own blog with posts from Daniel Lowe, myself and Roger on ChemSketch to Mol conversion, visualising a hierarchy of substructures, improving InChIs in Wikipedia, and how using mmap can speed up accessing binary files.

I haven't yet worked out how to decide where I post a particular article but I guess some things are more appropriate for the one blog over the other (e.g. work stuff for work blog, non-work stuff for non-work blog).

I've also taken the opportunity to add a few links to the sidebar. If you're interested in the kind of things I post here, you may want to check out my Google Plus page where I highlight articles in other blogs that I find interesting.

And finally, since this is turning into a meta-post, I encourage you to write your own blog. It takes about 30 seconds to set one up on Blogger (for example), and if it's about cheminformatics, I will read it.

Tuesday 16 October 2012

Pretty pictures of conformers

For most of this afternoon, Roger has been explaining to me how the byte-code compiler in Open Babel's protein perception works (he wrote it). Well, first of all he had to explain why there was a byte-code compiler in there. No, that's not true - actually first he had to explain what a byte-code compiler was.

Now I need some quiet time while it all sinks in.

So here are some pretty pictures of Confab-generated conformers courtesy of Jean-Paul Ebejer of InhibOx which were generated for his recent paper comparing conformer generators (see also here).

JP was also kind enough to provide the PyMOL scripts used to generate these if anyone wants to do so themselves: [cartoon] [raytraced].

Thursday 11 October 2012

Nothing added but time - ABI and API stability

Open Babel has a version numbering system that requires us to maintain API and ABI stability between bugfix releases (2.3.x) and API stability between minor releases (2.x), although the API can be extended; all bets are off for major releases (x).

Very exciting stuff altogether, I'm sure you'll agree.

What it all means is that just before release we need to make sure we hold to this promise. In the past, we've done this with varying degrees of success. API stability is fairly easy to hold to, but ABI stability is trickier - knowing the rules for what makes or breaks the ABI is half the battle. To be fair, the reason it has been so difficult is that there was no automated tool for sorting this all out for us.

Finally, a couple of years ago, the ABI Compliance Checker appeared which makes the whole thing a doddle. So today I compiled OB 2.3.1 and OB 2.3.2 and installed locally on a Linux system. Then, for each, I made an XML file like so:



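Each XML file is an ABI Compliance Checker descriptor along these lines (the install paths below are placeholders, not the ones I actually used): a version number plus the locations of the installed headers and libraries, with a second file doing the same for 2.3.2.

```xml
<version>
    2.3.1
</version>

<headers>
    /home/user/tools/openbabel-2.3.1/include
</headers>

<libs>
    /home/user/tools/openbabel-2.3.1/lib
</libs>
```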
And finally, I just ran the checker:
$abi-compliance-checker-1.98.3/ -lib openbabel -old 2.3.1.xml -new 2.3.2.xml
preparation, please wait ...
Using GCC 4.8.0 (x86_64-unknown-linux-gnu)
checking header(s) 2.3.1 ...
ERROR: some errors occurred when compiling headers
ERROR: see log for details:

checking header(s) 2.3.2 ...
ERROR: some errors occurred when compiling headers
ERROR: see log for details:

comparing ABIs ...
comparing APIs ...
creating compatibility report ...
total "Binary" compatibility problems: 0, warnings: 4
total "Source" compatibility problems: 0, warnings: 8
see detailed report:

You can check out the detailed report here.

(1) I also had to delete openbabel/math/align.h as I had compiled without Eigen in each case. This header file probably shouldn't have been installed in that case.
(2) The error messages in the log files were related to the use of abs(double). Not sure why.

Monday 8 October 2012

1st RDKit User Group Meeting

Just back from the first ever RDKit User Group Meeting, organised by Nathan "Broon" Brown of the Institute for Cancer Research in London, and presided over by Greg Landrum, the lead developer of RDKit.

Really great meeting - a mixture of round-table discussions, scientific presentations, tutorials and RDKit on Raspberry Pi compilation tips. Got to meet a lot of new people. Check out Nathan's tweets for an overview.

The talks and other material will appear in a few weeks so there's no point me talking about them too much now. One thing I will note is that Greg kicked off his history of RDKit by blaming me for prodding him into action with one of the first emails to the rdkit-discuss mailing list. It seems that I am responsible for collapsing that wavefunction. Also, look out for Andrew Dalke's Roche-sponsored MCS code in the next release.

As an aside, it was interesting to note the background of the 30 or so participants. Apart from the ICR attendees, there were only a few academics, with the remainder from pharma or pharma consulting or software vendors (i.e. me now) or George Papadatos (are you an academic too George?).

And talking of George, the next RDKit UGM will be hosted by the EBI, so I'll have even less far to travel next time.