Wednesday 31 October 2012

A non-random method to improve your QSAR results - works every time! Part II

In an earlier post, I argued that when developing a predictive QSAR model, non-random division of the dataset into training and test sets is a bad idea, and in particular that diversity selection should be avoided.

Not everyone agrees with me (surprise!). See for example this excellent review by Scior et al in 2009 on "How to Recognize and Workaround Pitfalls in QSAR Studies: A Critical Review". Well, the paper is excellent except for the bit about training/test set selection which explicitly pours cold water on random selection in favour of, for example, diversity selection, hand selection or *cough* rational selection.

I had vague thoughts about writing a paper on this topic, but now there's no need. A paper by Martin et al has just appeared in JCIM: "Does Rational Selection of Training and Test Sets Improve the Outcome of QSAR Modeling?"

And the answer? No.

Combining their discussion with my own randomly-selected thoughts, the reasons it's a bad idea can be summarised as follows:
  1. You are violating the first rule of statistical testing - the training and test sets must be chosen in the same way (this was pointed out to me by a bioinformatician on FF - sorry I can't recall whom).
  2. All the weird molecules are sucked into your training set.
  3. Every item in the test set is going to be close to an item in your training set, and your internal predictions are going to be overoptimistic compared to reality. (You do care about reality, don't you?) A toy sketch of this effect follows below.
Dr Kennard-Stone has a lot to answer for, but hopefully this is the final nail in the coffin (it being Halloween after all) for diversity selection in QSAR.
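
To make point 3 concrete, here is a minimal sketch on purely synthetic data rather than a real QSAR set: a greedy MaxMin picker stands in for Kennard-Stone-style selection, a random forest stands in for whatever model you prefer, and numpy/scikit-learn are assumed. The thing to compare is the internal test-set RMSE against the RMSE on a fresh sample from the same distribution (standing in for "reality").

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)

def make_data(n):
    # Made-up descriptors and activities - not a real QSAR dataset.
    X = rng.normal(size=(n, 10))
    y = X[:, 0] - 0.5 * X[:, 1] + 0.3 * X[:, 2] ** 2 + rng.normal(scale=0.3, size=n)
    return X, y

def maxmin_select(X, n_pick):
    """Greedy MaxMin diversity picking - a crude stand-in for Kennard-Stone."""
    chosen = [int(rng.integers(len(X)))]
    dists = np.linalg.norm(X - X[chosen[0]], axis=1)
    while len(chosen) < n_pick:
        nxt = int(np.argmax(dists))       # farthest point from the current picks
        chosen.append(nxt)
        dists = np.minimum(dists, np.linalg.norm(X - X[nxt], axis=1))
    return np.array(chosen)

def internal_vs_external_rmse(train_idx, X, y, X_ext, y_ext):
    test_idx = np.setdiff1d(np.arange(len(X)), train_idx)
    model = RandomForestRegressor(n_estimators=200, random_state=0)
    model.fit(X[train_idx], y[train_idx])
    internal = mean_squared_error(y[test_idx], model.predict(X[test_idx])) ** 0.5
    external = mean_squared_error(y_ext, model.predict(X_ext)) ** 0.5
    return internal, external

X, y = make_data(200)          # the dataset we get to split
X_ext, y_ext = make_data(500)  # fresh data standing in for reality
n_train = 150

random_idx = rng.choice(len(X), n_train, replace=False)
diverse_idx = maxmin_select(X, n_train)

print("random split    (internal, external RMSE):",
      internal_vs_external_rmse(random_idx, X, y, X_ext, y_ext))
print("diversity split (internal, external RMSE):",
      internal_vs_external_rmse(diverse_idx, X, y, X_ext, y_ext))
```

The question to ask of the output is whether the internal RMSE understates the external one more for the diversity-picked split than for the random one, which is exactly what reason 3 predicts.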

Update (7/11/12): In the comments, George Papadatos points out that "There's also related evidence that sophisticated diversity selection methods do not actually perform better than random picking".

7 comments:

Egon Willighagen said...

I always found bootstrapping a nice way to estimate model prediction error due to data set composition. Using that, you can easily show that below some 100 molecules, the effect of 'selecting' a training set becomes dangerously large.
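
A minimal sketch of the kind of bootstrap estimate described here, on made-up data and assuming numpy/scikit-learn; the model, descriptor count and dataset sizes are arbitrary, and the "below some 100 molecules" threshold is Egon's observation, not something this toy establishes.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(1)

def oob_rmse_spread(n_mols, n_boot=200):
    # Synthetic descriptors/activities standing in for a set of n_mols molecules.
    X = rng.normal(size=(n_mols, 10))
    y = X @ rng.normal(size=10) + rng.normal(scale=0.5, size=n_mols)
    rmses = []
    for _ in range(n_boot):
        idx = rng.integers(0, n_mols, n_mols)       # bootstrap resample (with replacement)
        oob = np.setdiff1d(np.arange(n_mols), idx)  # the molecules left out this round
        model = Ridge().fit(X[idx], y[idx])
        rmses.append(mean_squared_error(y[oob], model.predict(X[oob])) ** 0.5)
    return np.std(rmses)  # spread of prediction error due to dataset composition

for n in (50, 100, 500):
    print(n, "molecules: spread of out-of-bag RMSE =", round(oob_rmse_spread(n), 3))
```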

Noel O'Boyle said...

I certainly agree. What I had in mind by way of a paper would be to try all possible training/test set splits, make an RMSD histogram, and then highlight where on the histogram a "rationally-selected" data set would appear.
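
Enumerating every possible split is combinatorially out of reach for any realistic dataset, but the same histogram can be sketched by sampling random splits. A rough sketch on synthetic data, assuming numpy, scikit-learn and matplotlib; the greedy MaxMin loop stands in for a "rationally-selected" split.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(2)
X = rng.normal(size=(150, 8))  # made-up descriptors
y = X[:, 0] + 0.5 * X[:, 1] ** 2 + rng.normal(scale=0.3, size=150)
n_train = 110

def test_rmse(train_idx):
    test_idx = np.setdiff1d(np.arange(len(X)), train_idx)
    model = RandomForestRegressor(n_estimators=100, random_state=0)
    model.fit(X[train_idx], y[train_idx])
    return mean_squared_error(y[test_idx], model.predict(X[test_idx])) ** 0.5

# Test-set RMSE over many random splits (a sample of "all possible splits")
random_rmses = [test_rmse(rng.choice(len(X), n_train, replace=False))
                for _ in range(200)]

# One "rationally-selected" split: greedy MaxMin picking into the training set
chosen = [0]
dists = np.linalg.norm(X - X[0], axis=1)
while len(chosen) < n_train:
    nxt = int(np.argmax(dists))
    chosen.append(nxt)
    dists = np.minimum(dists, np.linalg.norm(X - X[nxt], axis=1))
rational_rmse = test_rmse(np.array(chosen))

plt.hist(random_rmses, bins=30)
plt.axvline(rational_rmse, color="red", label="diversity-selected split")
plt.xlabel("test-set RMSE")
plt.ylabel("number of random splits")
plt.legend()
plt.show()
```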

George Papadatos said...

I couldn't agree more with your post, Noel.
There's also related evidence that sophisticated diversity selection methods do not actually perform better than random picking:
http://www.ncbi.nlm.nih.gov/pubmed/16562980

Noel O'Boyle said...

Thanks for that George. The evidence is mounting...

Vincent said...

There are some other studies regarding diversity picking for HTS that state otherwise (the funny part is that it is also a Novartis study!): http://www.ncbi.nlm.nih.gov/pubmed/16562980

Difficult to compare the two purposes though... (QSAR training / test set VS diversity for screening)

Noel O'Boyle said...

Don't get me wrong - diversity selection has its place. Selecting a diverse set of molecules, for example! My comments are specifically regarding unbiased testing of a predictive QSAR model.

Vincent said...

Yep, my comment was more targeted at George's post :)