In a recent blog post, Pat Walters describes using visualisation of chemical space to answer questions about the overlap and complementarity of different chemical datasets. His introductory paragraph gives two examples of where this is useful: considering the purchase of an additional set of screening compounds, and secondly comparing training and test sets.
Now usually I'd be the first to reach for a visualisation technique to answer a question, but here I'm going to argue that at least for the first example, where the within-set diversity is high, this might not be the best approach. I would be worried that the layout of molecules in the visualisation would largely be determined by low-similarity values, and that appearing close together in the depiction would not be meaningful.
The approach I used recently to tackle this problem involved clustering with ChemFP's Taylor-Butina (TB) implementation. I used the provided TB script to calculate the number of clusters for each of the two datasets on its own. Then I concatenated the fingerprint files and using the same settings, I generated clusters for the two datasets together. Finally I processed the output of the combined clustering to work out how many clusters only contained dataset A, only dataset B or contained a mixture.
The result is conceptually a Venn diagram of A and B, describing (a) the number of entries in dataset B that are within the similarity threshold to entries in dataset A (and whose purchase is potentially a waste of money) and (b) the number that are outside this threshold. For these latter, you also know how many clusters they form, which is important as you are looking for a diverse set that is distinct from dataset A.
And I'll finish with a shout-out to Andrew Dalke, whose ChemFP made this analysis possible.
1. It's not strictly necessary to cluster each dataset on its own, but it's a useful sanity check. The number of clusters found for each dataset on its own was roughly the same as those containing that dataset in the combined clustering.
2. I can't recall the similarity threshold I used, but likely something in the 0.30-0.40 ECFP4 region.
3. I considered true singletons to be clusters of their own in this analysis, and ignored false singletons. After the fact I realised that duplicates by fingerprint (or better still perhaps by InChI) should probably be eliminated before doing the TB clustering, as their presence may convert a false singleton into a true cluster of size two. It's kind of arbitrary, but duplicating a dataset should not change the results of a clustering algorithm.