The safest procedure when splitting a dataset into training and test sets (two-thirds/one-third, or whatever ratio you use) is to choose the split randomly. As soon as you do anything else, the performance of your model on the test set will no longer reflect its real-life predictive performance.

Now and then a paper comes out where the training set is chosen to be a diverse set (whether by a Kennard-Stone type procedure, gridding the points, or initial clustering). At first this seems quite reasonable: you've made sure that you have chosen points that span the entire space of your dataset. In actual fact, what you've just done is to ensure that all of your test set points are close to a point in your training set. This means that the predictions on the test set are now woefully optimistic and unrepresentative of predictions on a true test set (which probably won't have the courtesy to be close to points in your training set, nor be distributed evenly across the whole space).
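To make the "close to a point in your training set" effect concrete, here is a minimal sketch on synthetic 2D data. Everything in it is an assumption for illustration: the Gaussian "descriptors", the 300-point dataset, and the greedy max-min selection, which is only in the spirit of Kennard-Stone rather than any particular published variant. It compares how far test points sit from their nearest training point with how far freshly drawn points sit from it.

```python
import math
import random


def nearest_distance(point, others):
    """Euclidean distance from `point` to its closest neighbour in `others`."""
    return min(math.dist(point, o) for o in others)


def random_split(points, n_train, rng):
    """Shuffle a copy of the data and cut it into training and test sets."""
    shuffled = points[:]
    rng.shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]


def maxmin_split(points, n_train):
    """Greedy max-min ("diverse") selection of the training set.

    Starts from the point farthest from the centroid, then repeatedly adds the
    remaining point whose nearest selected neighbour is farthest away.
    """
    dim = len(points[0])
    centroid = tuple(sum(p[i] for p in points) / len(points) for i in range(dim))
    remaining = points[:]
    first = max(remaining, key=lambda p: math.dist(p, centroid))
    remaining.remove(first)
    train = [first]
    # Distance from each remaining point to the selected set so far.
    min_dist = [math.dist(p, first) for p in remaining]
    while len(train) < n_train:
        idx = max(range(len(remaining)), key=min_dist.__getitem__)
        chosen = remaining.pop(idx)
        min_dist.pop(idx)
        train.append(chosen)
        min_dist = [min(d, math.dist(p, chosen)) for d, p in zip(min_dist, remaining)]
    return train, remaining


def mean_nn_distance(queries, train):
    """Average distance from each query point to its nearest training point."""
    return sum(nearest_distance(q, train) for q in queries) / len(queries)


def draw(n, rng):
    """Synthetic stand-in for descriptor vectors: n points from a 2D Gaussian."""
    return [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(n)]


rng = random.Random(0)
dataset = draw(300, rng)   # the data we have in hand
future = draw(300, rng)    # stand-in for molecules the model will meet later

splits = {
    "random": random_split(dataset, 200, rng),
    "diverse": maxmin_split(dataset, 200),
}
for name, (train, test) in splits.items():
    print(name,
          "test->train:", round(mean_nn_distance(test, train), 3),
          "future->train:", round(mean_nn_distance(future, train), 3))
```

One property holds by construction for the max-min split: every left-over test point is at least as close to the training set as the last training point was when it was picked, so the test set cannot contain anything "far away". Points drawn afresh are under no such constraint.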

Here's another way of thinking about it. Let's imagine that our predictive model is meant to work on a particular chemical space of molecules (the domain of applicability or so). These molecules have a particular distribution of descriptor variables (or whatever). And both the training and test sets are supposed to be representative samples of this distribution. However, once you use a non-random method to split into training and test sets, both the training and test sets become less and less representative of the population and hence you will have little idea of the real-life predictive ability of your model.

I've never seen this idea written down anywhere, so I'd be interested to know whether anyone thinks there is justification for choosing a non-random training set in certain circumstances. My own advice though is to keep it real and keep it random.

**Image credit:** Arenamontanus

## 8 comments:

Another way would be to use non-random splitting that keeps similar compounds together in one set, so that the training set and test set are normally different.

It can be used to see the performance of the model on a different set (without the neighbour artefact making the model look better than it is).

One can also use 3 sets: one very different set as the test set (chosen using clusters, or a grid…), with the second set then split randomly into a training set and a validation set.

This could help reduce the optimism of a model; however, the results on the test set may not be very impressive, and people may not publish/use/trust the results.

In my thesis I used bootstrapping to 'visualize' the effect of different test set selections (repeated random selections, btw). The point there was not to estimate the effect of those selections, but to show that your quality measures (R^2, Q^2, RMSEP) are intrinsically variable; in short, a Q^2 of 0.92 is *not* better than one of 0.89. Not for the typical sizes of QSAR data sets. Below 100 compounds the variance is even larger and can easily go up to 0.05.

People indeed very often underestimate how their assumptions affect their statistical models.
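A rough sketch of the kind of experiment described in this comment, on made-up data rather than anything from the thesis: 60 synthetic "compounds" with one descriptor, a least-squares line as the model, and 200 repeated random 40/20 splits. The dataset size, noise level and model are all assumptions chosen for illustration; the point is only to look at how much the test-set R^2 moves from one random split to the next.

```python
import random
import statistics


def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for one descriptor."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx


def r_squared(ys, preds):
    """Coefficient of determination of predictions against observed values."""
    my = statistics.fmean(ys)
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, preds))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


def split_scores(data, n_train, n_repeats, rng):
    """Test-set R^2 for each of n_repeats random training/test splits."""
    scores = []
    for _ in range(n_repeats):
        shuffled = data[:]
        rng.shuffle(shuffled)
        train, test = shuffled[:n_train], shuffled[n_train:]
        slope, intercept = fit_line(*zip(*train))
        test_x, test_y = zip(*test)
        preds = [slope * x + intercept for x in test_x]
        scores.append(r_squared(test_y, preds))
    return scores


rng = random.Random(42)
xs = [rng.gauss(0, 1) for _ in range(60)]
data = [(x, x + rng.gauss(0, 0.5)) for x in xs]   # synthetic activity values

scores = split_scores(data, n_train=40, n_repeats=200, rng=rng)
print("min R^2: ", round(min(scores), 3))
print("max R^2: ", round(max(scores), 3))
print("std dev: ", round(statistics.stdev(scores), 3))
```

If the spread between the minimum and maximum is wider than the difference between two models being compared, that difference says little by itself.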

I agree with you. A comment, though... random division into subsets should also give even spacing in your chemical space, at least for very large sets.

If the training set is small, it seems to me that helping it to span the entire relevant space is a good thing. Is not the point of QSAR is to approximately model a specific, closed set of structures? I don't think you expect QSAR to work universally. In any case, the set will always be insignificant in size compared to the entire possible chemical space. The compounds you want QSAR to work for (drug-like) constitute a tiny fraction of it, so too much diversity will be counterproductive for prediction. A well-defined training set, biased towards the sort of chemicals you want to model, seems to be a good thing. I've seen this argumentation in the context of HTS and docking libraries.

@Jeremy: I understand, but I don't know if it's very useful to see how well a model trained on apples can predict oranges. A separate model trained on oranges might be a good idea.

@Egon: I agree. Accurate assessment of predictive ability is not straightforward, and there are some subtle traps.

@Karol: It depends what you mean by "even spacing". Let's say you have only one descriptor, and it has a normal distribution about the mean. A random subset of molecules will also have a normal distribution; however, a diverse subset will be 'evenly spaced' from the minimum value to the maximum, and have a flat distribution. Any model derived from this will give equal weight to all points on the range, instead of focusing on the central part of the range, where most of the points are. As a result, it will perform worse on a true test set (as far as I can see).
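The one-descriptor picture can be checked numerically. The sketch below is an illustration under assumptions: a synthetic standard-normal descriptor for 1000 molecules, and a "diverse" subset built by picking the values closest to an even grid between the minimum and maximum (one simple way of gridding, not the only one). It reports what fraction of each 100-molecule subset falls within one standard deviation of the mean, where roughly two-thirds of the population lies.

```python
import random


def evenly_spaced_subset(values, k):
    """Pick k distinct values lying closest to an even grid from min to max."""
    pool = sorted(values)
    lo, hi = pool[0], pool[-1]
    chosen = []
    for i in range(k):
        target = lo + (hi - lo) * i / (k - 1)
        nearest = min(pool, key=lambda v: abs(v - target))
        pool.remove(nearest)   # each molecule can be picked only once
        chosen.append(nearest)
    return chosen


def central_fraction(values, width=1.0):
    """Fraction of values within `width` of zero (the population mean here)."""
    return sum(abs(v) <= width for v in values) / len(values)


rng = random.Random(1)
descriptor = [rng.gauss(0, 1) for _ in range(1000)]

random_subset = rng.sample(descriptor, 100)
diverse_subset = evenly_spaced_subset(descriptor, 100)

print("population:    ", round(central_fraction(descriptor), 2))
print("random subset: ", round(central_fraction(random_subset), 2))
print("diverse subset:", round(central_fraction(diverse_subset), 2))
```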

If the training set is small, it is all the more important that you don't intervene or you will have an even greater effect. One solution could be to generate 100 random models and get the mean of the predictions found.
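One possible reading of the "100 random models" suggestion, sketched on synthetic data: fit the same simple model to 100 random training subsets and average the predictions for a new point. The data (y = 2x + 1 plus noise), the subset size of 20 out of 30, and the linear model are assumptions made for the example.

```python
import random
import statistics


def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for one descriptor."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx


def averaged_prediction(data, x_new, n_models, n_train, rng):
    """Mean and spread of predictions at x_new over random training subsets."""
    preds = []
    for _ in range(n_models):
        subset = rng.sample(data, n_train)       # a fresh random training set
        slope, intercept = fit_line(*zip(*subset))
        preds.append(slope * x_new + intercept)
    return statistics.fmean(preds), statistics.stdev(preds)


rng = random.Random(7)
xs = [rng.gauss(0, 1) for _ in range(30)]
data = [(x, 2 * x + 1 + rng.gauss(0, 0.3)) for x in xs]   # small synthetic dataset

mean_pred, spread = averaged_prediction(data, x_new=0.5, n_models=100, n_train=20, rng=rng)
print("mean prediction at x=0.5:", round(mean_pred, 3))
print("spread across models:    ", round(spread, 3))
```

The spread across the 100 models comes for free and gives a feel for how sensitive a prediction is to which molecules happened to land in the training set.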

Regarding "Is not the point of QSAR is to approximately model a specific, closed set of structures?" Indeed it is - this is the concept of the domain of applicability; in other words, the model won't work for all molecules.

The choice of what structures to model is slightly different than the topic of the blog post, which is rather how to accurately assess the quality of a predictive model given a dataset representative of "the sort of chemicals you want to model".

Noel, I see what you mean. Are the distributions of descriptors in QSAR usually normal?

@baoilleach

Hi, well, the words I used may have been too strong. The molecules will come from the same dataset, so they should all be oranges.

However, using only random splitting can leave molecules which are very similar (like Cl -> Br) in 2 different sets. Then when you look at the performance of the model, it appears to perform very well.

Whereas if you cluster the molecules and then use the clusters to split into training/test sets, you avoid this effect.
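A minimal sketch of such a cluster-based split, assuming the clustering has already been done and each molecule carries a cluster label. The labels below are hypothetical, and the function name `cluster_split` is made up for this example. Whole clusters are assigned to one side or the other, so close analogues never straddle the training/test boundary.

```python
import random
from collections import defaultdict


def cluster_split(cluster_ids, test_fraction, rng):
    """Return (train_idx, test_idx) with every cluster kept in a single set."""
    members = defaultdict(list)
    for index, cluster in enumerate(cluster_ids):
        members[cluster].append(index)

    clusters = list(members)
    rng.shuffle(clusters)                 # which clusters go to test is random

    target = test_fraction * len(cluster_ids)
    train_idx, test_idx = [], []
    for cluster in clusters:
        # Fill the test set with whole clusters until it reaches the target size.
        bucket = test_idx if len(test_idx) < target else train_idx
        bucket.extend(members[cluster])
    return sorted(train_idx), sorted(test_idx)


# Hypothetical cluster labels, e.g. one label per chemical series.
cluster_ids = ["A", "A", "A", "B", "B", "C", "C", "C", "C", "D", "E", "E"]

train_idx, test_idx = cluster_split(cluster_ids, test_fraction=1 / 3, rng=random.Random(0))
print("train:", [(i, cluster_ids[i]) for i in train_idx])
print("test: ", [(i, cluster_ids[i]) for i in test_idx])
```

Because clusters are moved as a block, the test set can overshoot the requested fraction when a large cluster is picked last. Note that this measures something different from a random split: performance on chemistry the model has not seen neighbours of, rather than on a representative sample of the dataset.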

After that, it also depends on what you are trying to do: work on a well-defined space (or try to explain existing results), or try to find a model which can be applied to slightly different chemistry (not too different, but more than just aromatic substitution), in which case it is important to know how the model performs on something different.

@Jeremy: Ok, forget the oranges :-)

I definitely agree that clustering could be useful for selecting the dataset in the first place.
