Following on from #1 and #2, here's some work I began (maybe twice) but never completed.
Back in 2016 I published a paper entitled "Which fingerprint is best?" - actually, no, someone talked me down into the title "Comparing structural fingerprints using a literature-based similarity benchmark", a title which accurately describes what we did, but not why we did it. Anyway, it's work I'm very proud of - for example, it shows there is a detectable performance difference between 16384 and 4096 bit ECFP fingerprints, and that the fingerprints most appropriate for finding very close analogs are different than for those more distant.
The obvious follow-up to this work would be compare similarity metrics, for some particular fingerprint, e.g. the 16K ECFP4 fingerprint. There's a great paper by Todeschini et al that compares 51 similarity metrics and has a nice table where the equations are all given in one place. So I took these as the starting point for my own work.
After coding them all up and running them against the benchmark, I found that a number of the metrics gave identical rank orders across the entire dataset. That part wasn't surprising - the paper itself notes that several metrics are correlated, and the Riniker and Landrum paper has an appendix with a worked proof that shows that Tanimoto and Dice are correlated. What was surprising is that the ones I was finding as correlated were not necessarily the same as those in the paper (there was some overlap, but...).
Regarding the main reason for doing the paper, Tanimoto surprisingly (or not) did turn out to be one of the best (if not the best, I can't remember exactly). Perhaps more interesting were the results I got from looking at the effect of changing the weighting of the Tversky similarity; I can't remember the details but the best results were not where I expected them to be, and I never got to the bottom of why.
Tuesday, 17 September 2019
Monday, 16 September 2019
Mutation testing partialsmiles
Many Python programmers will be familiar with the concept of achieving high test coverage (e.g. via an analysis tool such as coverage.py), but they should also know that high coverage does not mean that everything is bug free. So where can you go from there?
One option is to use mutation testing. This is based around the concept that at least one test should fail if the code is "mutated". Now, mutated is a bit vague but consider changing a plus to a minus, or a "break" to a "continue" in a Python script. If your tests still all pass, it means that they don't sufficiently cover all scenarios.
Let's take a look at the codebase I just released, partialsmiles 1.0. I'm going to use cosmic-ray to generate and test mutants. To begin with I've installed it into a virtualenv. The Scripts folder contains the executables I'm going to use:
1. Create the config file: cosmic-ray new-config config.toml
2. Edit it to only test smiparser.py: module-path = "partialsmiles/smiparser.py"
3. Edit it to run the test suite: test-command = "python suite.py"
4. Analyse the code and create the mutants: cosmic-ray init config.toml session.sqlite
5. Generate the mutated code and test the mutants: cosmic-ray exec session.sqlite
6. Generate a HTML report: cr-html session.sqlite > cosmic-ray.html
Note that on Windows I've found that it doesn't always stop all of the running Python processes so I needed to kill them directly through the Process Explorer.
Looking at the report, 1514 mutants were generated and 25% survived. Yikes. Here's an example:
The plus was changed into a minus and all of the tests still passed. The line itself is covered by the tests - I have a test for a two-digit isotope - but I am not testing the returned value.
So there you go. Another tool for the toolbox. And some work for me to do.
One option is to use mutation testing. This is based around the concept that at least one test should fail if the code is "mutated". Now, mutated is a bit vague but consider changing a plus to a minus, or a "break" to a "continue" in a Python script. If your tests still all pass, it means that they don't sufficiently cover all scenarios.
Let's take a look at the codebase I just released, partialsmiles 1.0. I'm going to use cosmic-ray to generate and test mutants. To begin with I've installed it into a virtualenv. The Scripts folder contains the executables I'm going to use:
1. Create the config file: cosmic-ray new-config config.toml
2. Edit it to only test smiparser.py: module-path = "partialsmiles/smiparser.py"
3. Edit it to run the test suite: test-command = "python suite.py"
4. Analyse the code and create the mutants: cosmic-ray init config.toml session.sqlite
5. Generate the mutated code and test the mutants: cosmic-ray exec session.sqlite
6. Generate a HTML report: cr-html session.sqlite > cosmic-ray.html
Note that on Windows I've found that it doesn't always stop all of the running Python processes so I needed to kill them directly through the Process Explorer.
Looking at the report, 1514 mutants were generated and 25% survived. Yikes. Here's an example:
The plus was changed into a minus and all of the tests still passed. The line itself is covered by the tests - I have a test for a two-digit isotope - but I am not testing the returned value.
So there you go. Another tool for the toolbox. And some work for me to do.
Sunday, 15 September 2019
partialsmiles - A validating parser for partial SMILES
I've just released version 1.0 of partialsmiles, which is available via pip. Documentation can be found here and the project is hosted by GitHub.
partialsmiles is a Python library that provides a validating SMILES parser with support for partial SMILES. The parser checks syntax, ability to kekulize aromatic system, and checks whether atoms’ valences appear on a list of allowed valences.
The main use of the library is to accept or reject SMILES strings while they are being generated character-by-character by a machine-learning method. Previously the only way to check such strings was after the entire string was generated, a process that could take 10s of seconds. Using partialsmiles, SMILES strings with invalid syntax or problematic semantics can be detected while they are being generated, allowing them to be discarded or for alternate tokens to be suggested.
Some other toolkits (such as OEChem and the CDK) can read partial SMILES; should you use these instead? Certainly, if they meet your needs. Their goal is to return a molecule object that captures bonding information so far (useful for ‘depiction as you type’, for example), perhaps converting partial aromatic rings to single bonds and unmatched ring connections to dangling bonds. This is in contrast to the goal of partialsmiles which is to work out whether the partial SMILES string could be the prefix of a complete SMILES string that would be accepted by the parser. For example, "CC(C)(C)(" may be read as isobutane by other readers, but partialsmiles will reject it with a valence error as completion of the string will lead to a five or more valent carbon (see the docs).
A secondary goal of this project is to make available a permissively-licensed SMILES parser that correctly reads all aromatic systems in the SMILES reading benchmark. This is with the hope to encourage improvements in existing parsers.
And finally, I just want to thank Xuhan Liu for providing the suggestion that such a project would be useful.
Image credit: half by MandoBarista (licensed CC BY-SA)
partialsmiles is a Python library that provides a validating SMILES parser with support for partial SMILES. The parser checks syntax, ability to kekulize aromatic system, and checks whether atoms’ valences appear on a list of allowed valences.
The main use of the library is to accept or reject SMILES strings while they are being generated character-by-character by a machine-learning method. Previously the only way to check such strings was after the entire string was generated, a process that could take 10s of seconds. Using partialsmiles, SMILES strings with invalid syntax or problematic semantics can be detected while they are being generated, allowing them to be discarded or for alternate tokens to be suggested.
Some other toolkits (such as OEChem and the CDK) can read partial SMILES; should you use these instead? Certainly, if they meet your needs. Their goal is to return a molecule object that captures bonding information so far (useful for ‘depiction as you type’, for example), perhaps converting partial aromatic rings to single bonds and unmatched ring connections to dangling bonds. This is in contrast to the goal of partialsmiles which is to work out whether the partial SMILES string could be the prefix of a complete SMILES string that would be accepted by the parser. For example, "CC(C)(C)(" may be read as isobutane by other readers, but partialsmiles will reject it with a valence error as completion of the string will lead to a five or more valent carbon (see the docs).
A secondary goal of this project is to make available a permissively-licensed SMILES parser that correctly reads all aromatic systems in the SMILES reading benchmark. This is with the hope to encourage improvements in existing parsers.
And finally, I just want to thank Xuhan Liu for providing the suggestion that such a project would be useful.
Image credit: half by MandoBarista (licensed CC BY-SA)
Saturday, 14 September 2019
Moving to pastures new, but still in the same field Part III
Following on from my previous post on the topic, 30th August was my last day at NextMove Software.
Monday week I'll be joining the Computational Chemistry team at Sosei Heptares which is lead by Chris de Graaf. If you're not familiar with the company, the R&D site is based just outside Cambridge (UK) and works on structure-based drug discovery with a focus on GPCRs. I'm very excited about this move as I'm really looking forward to having a chance to apply my skills more directly to drug discovery.
Here's a high-level overview of what the company does, which I found quite useful (jump to 1:36 for comp chem):
Monday week I'll be joining the Computational Chemistry team at Sosei Heptares which is lead by Chris de Graaf. If you're not familiar with the company, the R&D site is based just outside Cambridge (UK) and works on structure-based drug discovery with a focus on GPCRs. I'm very excited about this move as I'm really looking forward to having a chance to apply my skills more directly to drug discovery.
Here's a high-level overview of what the company does, which I found quite useful (jump to 1:36 for comp chem):
Friday, 13 September 2019
R you similar? Inferring R group similarity from medicinal chemistry data
While molecular similarity is widely-studied (even by me!), the question of how to measure R group similarity is much less so, despite the fact that R group replacement is very commonly applied in medicinal chemistry projects.
My motivating example is fluoro and chloro - are these R groups similar? Well, a chemical fingerprint would say they have no bits in common and hence have zero similarity. But we know they're similar. Not just because they are nearby in the periodic table, but because chemists tend to make molecules that have Cl replaced with F.
In other words, we can use medicinal chemistry project data to infer a measure of R group similarity by looking at what R groups co-occur in medicinal chemistry projects as matched pairs. This measure can be used to suggest changes, form the basis of an enumeration strategy, or search for similar molecules. Because this similarity measure is derived from medicinal chemistry data, the results should be reasonable and interpretable.
At the recent ACS meeting in San Diego, I described my work on a measure of R group similarity:
I found it difficult to jam everything into my 20+5min talk, so let me add a bit of discussion here on how it compares to previous work...
There certainly has been some interesting work on R group similarity from Sheffield (e.g. Holliday et al), and more recently from this year's Skolnik Award winner Kimito Funatsu (Tamura et al), among others. But perhaps the most similar work is that on large-scale identification of bioisosteric replacements from the Bajorath group (e.g. Wassermann and Bajorath) and Swiss-Bioisotere (Wirth et al). Others have used 3D data from the PDB to further refine predictions, e.g. Seddon et al and Lešnik et al.
The thing is, most of the time in a med chem project (correct me if wrong) you're not looking for bioisosteric replacements - you want to improve potency, or you want just want to probe the binding site, e.g. see can you grow in this direction or that direction. The changes people make are often to R groups around the same size (maybe a bit bigger), but not always to ones with the same properties. For example, if an electron-withdrawing group doesn't work, time to test an electron-donating group. Of course, towards the end, you might have potency nailed and are trying to work around some physchem properties - at this point, bioisosteric replacements come into play.
Something else that I feel previous work has missed is the time dimension. The path of R group replacements can be envisaged as starting with a hydrogen or methyl, and through an iterative process ending up somewhere a whole lot more complicated. Although methyl co-occurs with the vast majority of R groups (as a matched molecular pair), after a certain point in a project that's not a useful suggestion; it's already been done, or already been ruled out. What you want to know is what R groups are typically made before versus after. You can easily work this out if you have time series information from med chem projects, but as I show in the talk, even in the absence of explicit time series data, it's possible to infer the approximate order in which R groups are made if you have a large number of sets of med chem project data.
I had a couple of questions about how to turn this method of finding similar R groups into a similarity metric. I turned this question back on the questioners: "Why do you want that? My guess is to use it to find similar molecules." But you can find similar molecules without turning the similarity into a number between 0 and 1; the talk describes one way to do this.
Oh, and here's the poster on the same topic I showed earlier this year at Sheffield and subsequently at the ACS.
My motivating example is fluoro and chloro - are these R groups similar? Well, a chemical fingerprint would say they have no bits in common and hence have zero similarity. But we know they're similar. Not just because they are nearby in the periodic table, but because chemists tend to make molecules that have Cl replaced with F.
In other words, we can use medicinal chemistry project data to infer a measure of R group similarity by looking at what R groups co-occur in medicinal chemistry projects as matched pairs. This measure can be used to suggest changes, form the basis of an enumeration strategy, or search for similar molecules. Because this similarity measure is derived from medicinal chemistry data, the results should be reasonable and interpretable.
At the recent ACS meeting in San Diego, I described my work on a measure of R group similarity:
Measuring R group similarity using medicinal chemistry data
I found it difficult to jam everything into my 20+5min talk, so let me add a bit of discussion here on how it compares to previous work...
There certainly has been some interesting work on R group similarity from Sheffield (e.g. Holliday et al), and more recently from this year's Skolnik Award winner Kimito Funatsu (Tamura et al), among others. But perhaps the most similar work is that on large-scale identification of bioisosteric replacements from the Bajorath group (e.g. Wassermann and Bajorath) and Swiss-Bioisotere (Wirth et al). Others have used 3D data from the PDB to further refine predictions, e.g. Seddon et al and Lešnik et al.
The thing is, most of the time in a med chem project (correct me if wrong) you're not looking for bioisosteric replacements - you want to improve potency, or you want just want to probe the binding site, e.g. see can you grow in this direction or that direction. The changes people make are often to R groups around the same size (maybe a bit bigger), but not always to ones with the same properties. For example, if an electron-withdrawing group doesn't work, time to test an electron-donating group. Of course, towards the end, you might have potency nailed and are trying to work around some physchem properties - at this point, bioisosteric replacements come into play.
Something else that I feel previous work has missed is the time dimension. The path of R group replacements can be envisaged as starting with a hydrogen or methyl, and through an iterative process ending up somewhere a whole lot more complicated. Although methyl co-occurs with the vast majority of R groups (as a matched molecular pair), after a certain point in a project that's not a useful suggestion; it's already been done, or already been ruled out. What you want to know is what R groups are typically made before versus after. You can easily work this out if you have time series information from med chem projects, but as I show in the talk, even in the absence of explicit time series data, it's possible to infer the approximate order in which R groups are made if you have a large number of sets of med chem project data.
I had a couple of questions about how to turn this method of finding similar R groups into a similarity metric. I turned this question back on the questioners: "Why do you want that? My guess is to use it to find similar molecules." But you can find similar molecules without turning the similarity into a number between 0 and 1; the talk describes one way to do this.
Oh, and here's the poster on the same topic I showed earlier this year at Sheffield and subsequently at the ACS.
Thursday, 12 September 2019
Talk about making a hash of a molecule
At the recent ACS meeting in San Diego, I presented a talk entitled:
Firstly, this is an overview of a general technique for solving a wide range of cheminformatics problems; namely, the calculation and comparison of hashes of molecules to find duplicates with respect to some structural property of interest.
It also introduces MolHash to the world, a command-line application from NextMove Software which is freely available from GitHub (*).
* If you don't fancy compiling it yourself, you may find Matt Swain's conda version useful.
Making a hash of it: The advantage of selectively leaving out structural information
Firstly, this is an overview of a general technique for solving a wide range of cheminformatics problems; namely, the calculation and comparison of hashes of molecules to find duplicates with respect to some structural property of interest.
It also introduces MolHash to the world, a command-line application from NextMove Software which is freely available from GitHub (*).
* If you don't fancy compiling it yourself, you may find Matt Swain's conda version useful.