Comments on Noel O'Blog: "A non-random method to improve your QSAR results - works every time!"

Noel O'Boyle (2010-06-17 11:57):
@Jeremy: Ok, forget the oranges :-)
I definitely agree that clustering could be useful for selecting the dataset in the first place.

Jeremy (2010-06-17 11:37):
@baoilleach
Hi. Perhaps the words I used were too strong. The molecules will come from the same dataset, so they should all be oranges.
However, random splitting alone can leave molecules which are very similar (like a Cl -> Br substitution) in the two different sets. When you then look at the performance of the model, it appears to perform very well. If instead you cluster the molecules and use the clusters to split into training/test sets, you avoid this effect.
It also depends on what you are trying to do: work within a well-defined space (or try to explain existing results), or try to find a model which can be applied to slightly different chemistry (not too different, but more than just aromatic substitution), in which case it is important to know how the model performs on something different.

Karol (2010-06-16 16:49):
Noel, I see what you mean. Are the distributions of descriptors in QSAR usually normal?

Noel O'Boyle (2010-06-16 15:09):
@Karol: It depends what you mean by "even spacing". Let's say you have only one descriptor, and it has a normal distribution about the mean. A random subset of molecules will also have a normal distribution; a diverse subset, however, will be evenly spaced from the minimum value to the maximum and have a flat distribution. Any model derived from this will give equal weight to all points in the range, instead of focusing on the central part of the range, where most of the points are.
As a result, it will perform worse on a true test set (as far as I can see).

If the training set is small, it is all the more important that you don't intervene, or the effect will be even greater. One solution could be to generate 100 random models and take the mean of their predictions.

Regarding "Isn't the point of QSAR to approximately model a specific, closed set of structures?": indeed it is. This is the concept of the domain of applicability; in other words, the model won't work for all molecules.

The choice of which structures to model is slightly different from the topic of the blog post, which is rather how to accurately assess the quality of a predictive model given a dataset representative of "the sort of chemicals you want to model".

Noel O'Boyle (2010-06-16 14:54):
@Jeremy: I understand, but I don't know if it's very useful to see how well a model trained on apples can predict oranges. A separate model trained on oranges might be a good idea.

@Egon: I agree. Accurate assessment of predictive ability is not straightforward, and there are some subtle traps.

Karol (2010-06-15 21:11):
I agree with you. A comment, though... random division into subsets should also give even spacing in your chemical space, at least for very large sets.

If the training set is small, it seems to me that helping it to span the entire relevant space is a good thing.
Isn't the point of QSAR to approximately model a *specific, closed* set of structures? I don't think anyone expects QSAR to work universally.

In any case, the set will always be insignificant in size compared to the entire possible chemical space. The compounds you want QSAR to work for (drug-like) constitute a tiny fraction of it, so too much diversity will be counterproductive for prediction. A well-defined training set, biased towards the sort of chemicals you want to model, seems to be a good thing. I've seen this argument made in the context of HTS and docking libraries (http://dx.doi.org/10.1038/nchembio.180).

Anonymous (2010-06-15 16:37):
In my thesis I used bootstrapping to 'visualize' the effect of different test set selections (repeated random selections, by the way). The point there was not to estimate the effect of those selections, but to show that the quality measures (R^2, Q^2, RMSEP) are intrinsically variable; in short, a Q^2 of 0.92 is *not* better than one of 0.89, not for the typical sizes of QSAR data sets. Below 100 compounds the variance is even larger and can easily go up to 0.05.

People very often underestimate how their assumptions affect their statistical models.

Jeremy (2010-06-15 16:25):
Another way would be non-random splitting that keeps similar compounds together in one set.
The training and test sets are then genuinely different, which lets you see the performance of the model on an unfamiliar set (without near-neighbour artefacts making the model look better than it is).
One can also use three sets: one very different set as the test set (chosen by clustering, a grid, ...), with the remainder split randomly into a training and a validation set.

This could help reduce the optimism of a model; however, the results on such test sets may not be very impressive, and people may then not publish/use/trust the results.
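The cluster-based split Jeremy describes can be sketched in a few lines. This is a minimal illustration, not anyone's published protocol: it assumes each molecule is represented as a fingerprint bit set, uses a toy greedy leader clustering with Tanimoto similarity, and the function names, 0.7 threshold, and 20% test fraction are all made up for the example.

```python
import random


def tanimoto(a, b):
    """Tanimoto similarity between two fingerprint bit sets."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def leader_cluster(fps, threshold=0.7):
    """Greedy leader clustering: each fingerprint joins the first
    cluster whose leader is at least `threshold` similar to it,
    otherwise it starts a new cluster.  Returns index lists."""
    leaders, clusters = [], []
    for i, fp in enumerate(fps):
        for leader_idx, members in zip(leaders, clusters):
            if tanimoto(fp, fps[leader_idx]) >= threshold:
                members.append(i)
                break
        else:
            leaders.append(i)
            clusters.append([i])
    return clusters


def cluster_split(fps, test_fraction=0.2, threshold=0.7, seed=42):
    """Assign whole clusters to the test set until it holds roughly
    `test_fraction` of the data, so near neighbours (e.g. a Cl -> Br
    pair) never straddle the train/test boundary."""
    rng = random.Random(seed)
    clusters = leader_cluster(fps, threshold)
    rng.shuffle(clusters)
    train, test = [], []
    target = test_fraction * len(fps)
    for members in clusters:
        (test if len(test) < target else train).extend(members)
    return train, test
```

In practice one would compute real fingerprints (e.g. Morgan fingerprints with RDKit) and use a proper clustering routine such as Butina clustering rather than this toy leader algorithm; the whole-cluster assignment step stays the same.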