Not everyone agrees with me (surprise!). See, for example, this excellent 2009 review by Scior et al., "How to Recognize and Workaround Pitfalls in QSAR Studies: A Critical Review". Well, the paper is excellent except for the bit about training/test set selection, which explicitly pours cold water on random selection in favour of, for example, diversity selection, hand selection or *cough* rational selection.
I had vague thoughts about writing a paper on this topic, but now there's no need. A paper by Martin et al has just appeared in JCIM: "Does Rational Selection of Training and Test Sets Improve the Outcome of QSAR Modeling?"
And the answer? No.
Combining their discussion with my own randomly-selected thoughts, the reasons it's a bad idea can be summarised as follows:
- You are violating the first rule of statistical testing: the training and test sets must be chosen in the same way (this was pointed out to me by a bioinformatician on FF - sorry, I can't recall who).
- All the weird molecules are sucked into your training set.
- Every item in the test set is going to be close to an item in your training set, and your internal predictions are going to be overoptimistic compared to reality. (You do care about reality, don't you?)
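
To make that last point a bit more concrete, here's a toy sketch (mine, not taken from either paper). It assumes uniformly random 2D points as stand-ins for molecules in descriptor space, and uses a greedy MaxMin picker as one example of a diversity selection method; the function names, the 150/50 split and the seed are all arbitrary choices for illustration. It compares how far, on average, each test item sits from its nearest training item under a random split versus a diversity-selected training set.

```python
import math
import random


def nearest_distance(point, others):
    """Distance from `point` to the closest point in `others`."""
    return min(math.dist(point, other) for other in others)


def maxmin_pick(points, n_pick, rng):
    """Greedy MaxMin diversity selection; returns indices of the picked points."""
    picked = [rng.randrange(len(points))]  # arbitrary starting point
    # min_dist[i] = distance from point i to its nearest already-picked point
    min_dist = [math.dist(p, points[picked[0]]) for p in points]
    while len(picked) < n_pick:
        # Pick the point that is furthest from everything picked so far
        nxt = max(range(len(points)), key=lambda i: min_dist[i])
        picked.append(nxt)
        for i, p in enumerate(points):
            min_dist[i] = min(min_dist[i], math.dist(p, points[nxt]))
    return picked


def mean_test_to_train_distance(points, train_idx):
    """Mean distance from each test point to its nearest training point."""
    train_set = set(train_idx)
    train = [points[i] for i in train_idx]
    test = [p for i, p in enumerate(points) if i not in train_set]
    return sum(nearest_distance(p, train) for p in test) / len(test)


rng = random.Random(0)
# Toy "descriptor space": 200 random points in the unit square (an assumption)
points = [(rng.random(), rng.random()) for _ in range(200)]
n_train = 150

random_train = rng.sample(range(len(points)), n_train)
diverse_train = maxmin_pick(points, n_train, rng)

random_gap = mean_test_to_train_distance(points, random_train)
diverse_gap = mean_test_to_train_distance(points, diverse_train)

print(f"random split: mean test-to-train distance = {random_gap:.4f}")
print(f"MaxMin split: mean test-to-train distance = {diverse_gap:.4f}")
```

On toy data like this, the MaxMin picker grabs the outlying points for the training set and leaves behind test items that each sit close to something already picked, so the test set looks easier than a randomly chosen one would. Real descriptor spaces are clustered and high-dimensional, so treat this as an illustration of the mechanism rather than evidence in itself.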
Update (7/11/12): In the comments, George Papadatos points out that "There's also related evidence that sophisticated diversity selection methods do not actually perform better than random picking".