My motivating example is fluoro and chloro - are these R groups similar? Well, a chemical fingerprint would say they have no bits in common and hence have zero similarity. But we know they're similar. Not just because they are nearby in the periodic table, but because chemists tend to make molecules that have Cl replaced with F.
In other words, we can use medicinal chemistry project data to infer a measure of R group similarity by looking at what R groups co-occur in medicinal chemistry projects as matched pairs. This measure can be used to suggest changes, form the basis of an enumeration strategy, or search for similar molecules. Because this similarity measure is derived from medicinal chemistry data, the results should be reasonable and interpretable.
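To make the co-occurrence idea concrete, here is a minimal sketch in Python. It is an illustration only, not the implementation from the talk. The `projects` data is made up, the function names `cooccurrence_counts` and `neighbours` are hypothetical, and it assumes the matched-pair analysis has already reduced each project to the set of R groups seen at one position of one scaffold.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy data: each set holds the R groups (as SMILES with an
# attachment point) found at one position of one scaffold in one project.
projects = [
    {"*F", "*Cl", "*C"},
    {"*F", "*Cl", "*OC"},
    {"*C", "*OC", "*Cl"},
    {"*F", "*Cl"},
]


def cooccurrence_counts(series):
    """Count, for each pair of R groups, how many series contain both."""
    counts = Counter()
    for rgroups in series:
        # Sort so that each pair is always stored in the same order.
        for a, b in combinations(sorted(rgroups), 2):
            counts[(a, b)] += 1
    return counts


def neighbours(rgroup, counts):
    """Rank the other R groups by how often they co-occur with `rgroup`."""
    hits = Counter()
    for (a, b), n in counts.items():
        if a == rgroup:
            hits[b] += n
        elif b == rgroup:
            hits[a] += n
    return hits.most_common()


counts = cooccurrence_counts(projects)
print(neighbours("*F", counts))
```

On this toy data, chloro comes out as the closest neighbour of fluoro simply because the two turn up together in the most series, even though a fingerprint sees nothing in common between them. Raw counts favour very common R groups, so a real analysis would probably need some form of normalisation.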
At the recent ACS meeting in San Diego, I described my work on a measure of R group similarity:
Measuring R group similarity using medicinal chemistry data
I found it difficult to jam everything into my 20+5min talk, so let me add a bit of discussion here on how it compares to previous work...
There certainly has been some interesting work on R group similarity from Sheffield (e.g. Holliday et al), and more recently from this year's Skolnik Award winner Kimito Funatsu (Tamura et al), among others. But perhaps the most similar work is that on large-scale identification of bioisosteric replacements from the Bajorath group (e.g. Wassermann and Bajorath) and SwissBioisostere (Wirth et al). Others have used 3D data from the PDB to further refine predictions, e.g. Seddon et al and Lešnik et al.
The thing is, most of the time in a med chem project (correct me if I'm wrong) you're not looking for bioisosteric replacements - you want to improve potency, or you just want to probe the binding site, e.g. to see whether you can grow in this direction or that direction. The changes people make are often to R groups of around the same size (maybe a bit bigger), but not always to ones with the same properties. For example, if an electron-withdrawing group doesn't work, it's time to test an electron-donating group. Of course, towards the end, you might have potency nailed and be trying to work around some physchem properties - at this point, bioisosteric replacements come into play.
Something else that I feel previous work has missed is the time dimension. The path of R group replacements can be envisaged as starting with a hydrogen or methyl, and through an iterative process ending up somewhere a whole lot more complicated. Although methyl co-occurs with the vast majority of R groups (as a matched molecular pair), after a certain point in a project that's not a useful suggestion; it's already been done, or already been ruled out. What you want to know is what R groups are typically made before versus after. You can easily work this out if you have time series information from med chem projects, but as I show in the talk, even in the absence of explicit time series data, it's possible to infer the approximate order in which R groups are made if you have a large number of sets of med chem project data.
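As a rough illustration of how an ordering might be inferred without timestamps, here is one possible heuristic in Python. This is an assumption for the sake of example and not necessarily the approach described in the talk. It treats each dataset as a snapshot of a project that stopped at some stage: if R group A often appears in series that lack B, but B rarely appears without A, then A is probably made before B. The data and the names `before_score` and `rank_by_earliness` are hypothetical.

```python
# Hypothetical toy data: R groups seen at one position in four series.
projects = [
    {"*[H]", "*C"},
    {"*[H]", "*C", "*F"},
    {"*[H]", "*C", "*F", "*C(F)(F)F"},
    {"*[H]", "*F"},
]


def before_score(a, b, series):
    """Series with a but not b, minus series with b but not a.

    A positive value suggests (heuristically) that a tends to be made
    before b; a negative value suggests the opposite.
    """
    a_only = sum(1 for s in series if a in s and b not in s)
    b_only = sum(1 for s in series if b in s and a not in s)
    return a_only - b_only


def rank_by_earliness(series):
    """Order R groups from 'typically made early' to 'typically made late'."""
    rgroups = sorted(set().union(*series))
    totals = {
        a: sum(before_score(a, b, series) for b in rgroups if b != a)
        for a in rgroups
    }
    return sorted(rgroups, key=lambda r: totals[r], reverse=True)


order = rank_by_earliness(projects)
print(order)
```

On this toy data hydrogen ranks earliest and trifluoromethyl latest, with methyl and fluoro tied in between. With many real datasets the pairwise scores could be used directly, e.g. to suggest only those R groups that are typically made after the ones already in hand.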
I had a couple of questions about how to turn this method of finding similar R groups into a similarity metric. I turned this question back on the questioners: "Why do you want that? My guess is to use it to find similar molecules." But you can find similar molecules without turning the similarity into a number between 0 and 1; the talk describes one way to do this.
Oh, and here's the poster on the same topic I showed earlier this year at Sheffield and subsequently at the ACS.