Friday, 3 June 2022

Diagnosing problems with SMILES

For my poster at the upcoming ICCS, I wanted to categorise any problems with the SMILES strings generated by a recurrent neural network. I did this using the partialsmiles library, a validating SMILES parser I wrote a little while ago.

The speciality of this library is dealing with partial SMILES strings as they are being generated - this potentially allows you to choose an alternative token if the original token causes a problem. However, it can equally well be used with full SMILES strings. Reported errors are broken down into three categories: valence errors, kekulisation failures and syntax errors. The error message describes the specific problem, and the index of the relevant point in the SMILES string is available. As the docs state, errors associated with the semantics of cis/trans stereo symbols are not currently handled, but that's not a problem here.

I mentioned valence errors; these are detected by a check against a table of allowed valences. I edited the defaults to allow hypervalent nitrogen (i.e. valence 5), as it may be present in the training data. Here's a typical output:
smiles_syntax=20 smiles_valence=1 smiles_kek=22
Total errors: 43   %conversion: 95.7
 22 cases of Aromatic system cannot be kekulized
  4 cases of Unmatched close parenthesis
  3 cases of 1 branch has not been closed
  5 cases of 2 ring openings have not been closed
  6 cases of 1 ring opening has not been closed
  1 cases of 3 ring openings have not been closed
  1 cases of Uncommon valence or charge state
  1 cases of Cannot have a bond opening and closing on the same atom
...and here's the code. Note the use of defaultdict, a hidden gem of the Python standard library which appears in almost all of my scripts:
from collections import defaultdict

import partialsmiles as ps

if __name__ == "__main__":
    verbose = False
    fname = "Regular_SMILES_1K.smi"

    smiles_syntax = smiles_valence = smiles_kek = 0
    msgs = defaultdict(int)
    N = 0
    with open(fname) as inp:
        for line in inp:
            N += 1
            smi = line.rstrip()
            try:
                mol = ps.ParseSmiles(smi, partial=False)
            except ps.SMILESSyntaxError as e:
                if verbose:
                    print(f"SMILESSyntaxError: {e}")
                smiles_syntax += 1
                msgs[e.message] += 1
            except ps.ValenceError as e:
                if verbose:
                    print(f"ValenceError: {e}")
                smiles_valence += 1
                msgs[e.message] += 1
            except ps.KekulizationFailure as e:
                if verbose:
                    print(f"KekulizationFailure: {e}")
                smiles_kek += 1
                msgs[e.message] += 1

    print(f"{smiles_syntax=} {smiles_valence=} {smiles_kek=}")

    tot_errors = smiles_syntax + smiles_valence + smiles_kek
    print(f"Total errors: {tot_errors}   %conversion: {(N-tot_errors)*100/N:0.1f}")

    for x, y in msgs.items():
        print(f"{y:3d} cases of {x}")

Tuesday, 26 April 2022

Threading time through Vortex

Vortex (a chemical spreadsheet/visualisation software from Dotmatics) has a plugin system built around Jython. Simply drop a .vpy file into a specific scripts folder, and a menu item immediately appears in the application. Here are some notes on using this to communicate with a webserver.

Code organisation

I found it best to separate Vortex-specific code (in the .vpy files) from supporting code that could be written and tested independently. This also naturally enables reuse of code across plugins. This supporting code I put in a folder adjacent to the scripts folder, and accessed it as follows:

import os
import sys
sys.path.append(os.path.join(vortex.getVortexFolder(), "MYFOLDERNAME"))

Something to note is that the application needs to be restarted to pick up on edits made to the supporting Python codebase. This is in contrast to edits made to the .vpy files, which can be tested immediately.

Access to a JSON parser

Communicating with a webservice is most easily done with JSON. Unfortunately, the 'json' module is missing in the bundled Jython. To install your own, you can go down the Java route and download an appropriate .jar file (see for example Chris Swain here or here). As an alternative, I prefer to use Python and the same module I used back in the days of ye olde Python, simplejson, a library which is still supported. I downloaded and extracted this into the MYFOLDERNAME folder mentioned earlier, so that it was available as:

import simplejson as json
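When developing and testing the supporting code outside Jython (in plain CPython, where the stdlib 'json' module does exist), the import can be sketched with a fallback so the same module works in both environments. A minimal sketch - the payload shape is a made-up example, not a real webservice response:

```python
# Prefer simplejson (extracted into the supporting-code folder for Jython),
# falling back to the stdlib json module where it is available.
try:
    import simplejson as json
except ImportError:
    import json

# Parse a small payload of the kind a property-calculating webservice
# might return (hypothetical field names, for illustration only)
payload = json.loads('{"smiles": "c1ccccc1", "logP": 1.69}')
print(payload["smiles"])
```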
Call a webservice without blocking

Passing several hundred or more molecules to a webservice that is calculating some property can take several seconds or indeed minutes. The simple approach to this looks something like the following. Note that here we use time.sleep() as a stand-in for the webservice call:

import time

def main():
    time.sleep(5) # pretend to contact a webservice

if __name__ == "__main__":
    main()

Unfortunately, this causes the entire application to become unresponsive until the time.sleep() is complete. This is because the main thread of a GUI application is supposed to spend its time listening out for events like you clicking on something; if it's busy doing some other work, then it can't respond to those events. The solution is to run your code on a separate thread:

import java.lang

class MyRunnable(java.lang.Runnable):
    def run(self):
        main() # the slow work shown in the previous snippet

if __name__ == "__main__":
    t = java.lang.Thread(MyRunnable())
    t.start()

This seems to work perfectly. The only problem I found was that it prevented any useful error messages from appearing in the console (accessed via Help/Console - diagnostic), apart from "Exception in Thread - 1" or somesuch. If this happened, I temporarily changed the code to call main() directly.

Create and close a dialog box

Given that the calculation might take a while, it's a good idea to indicate to the user that the calculation has started; for example to avoid the user starting the calculation multiple times while waiting for the result. One solution is to pop up an info box immediately to let the user know that the calculation has started, and then close it if still present at the end of the calculation. To do this, we need to create the dialog box ourselves:

import javax.swing.JOptionPane as JOptionPane

pane = JOptionPane("This might take a while...", JOptionPane.INFORMATION_MESSAGE)
dialog = pane.createDialog(None, "The calculation has begun")
dialog.setModal(False) # a modal dialog would block the calling thread
dialog.setVisible(True)

# ...do the calculation...

dialog.setVisible(False) # close it if still present

If you find this useful or have any additional tips/tricks feel free to leave a comment.

[Update 27/04/2022] On Twitter, John Mayfield adds "From a few years ago now but I remember the jython version of the Python http request was really slow and was much much faster to use Java’s libs (still via jython)". Chris Swain pointed to his Cambridge MedChem Consulting website which has a large number of useful scripts.

Sunday, 30 May 2021

Combining protein structure with deep generative models for ligands

Journal of Cheminformatics has just published the first result from a collaboration between ourselves at Sosei Heptares and the Andreas Bender group. Morgan Thomas, the PhD student who did all the work, has presented early versions of this at various AI/Chemistry meetings but it's finally out there:

Morgan Thomas, Robert T. Smith, Noel M. O'Boyle, Chris de Graaf, Andreas Bender. Comparison of structure- and ligand-based scoring functions for deep generative models: a GPCR case study. J. Cheminform. 2021, 13, 39.

Deep generative models have shown the ability to devise both valid and novel chemistry, which could significantly accelerate the identification of bioactive compounds. Many current models, however, use molecular descriptors or ligand-based predictive methods to guide molecule generation towards a desirable property space. This restricts their application to relatively data-rich targets, neglecting those where little data is available to sufficiently train a predictor. Moreover, ligand-based approaches often bias molecule generation towards previously established chemical space, thereby limiting their ability to identify truly novel chemotypes.

In this work, we assess the ability of using molecular docking via Glide—a structure-based approach—as a scoring function to guide the deep generative model REINVENT and compare model performance and behaviour to a ligand-based scoring function. Additionally, we modify the previously published MOSES benchmarking dataset to remove any induced bias towards non-protonatable groups. We also propose a new metric to measure dataset diversity, which is less confounded by the distribution of heavy atom count than the commonly used internal diversity metric. 
With respect to the main findings, we found that when optimizing the docking score against DRD2, the model improves predicted ligand affinity beyond that of known DRD2 active molecules. In addition, generated molecules occupy complementary chemical and physicochemical space compared to the ligand-based approach, and novel physicochemical space compared to known DRD2 active molecules. Furthermore, the structure-based approach learns to generate molecules that satisfy crucial residue interactions, which is information only available when taking protein structure into account.

Overall, this work demonstrates the advantage of using molecular docking to guide de novo molecule generation over ligand-based predictors with respect to predicted affinity, novelty, and the ability to identify key interactions between ligand and protein target. Practically, this approach has applications in early hit generation campaigns to enrich a virtual library towards a particular target, and also in novelty-focused projects, where de novo molecule generation either has no prior ligand knowledge available or should not be biased by it.

For further background, a Q&A with Morgan appears over on Andreas's blog.

Monday, 18 January 2021

Data/cheminf/compchem openings at Sosei Heptares

A year and a half into my new life in pharma, and I'm really enjoying it. And now we're looking for new members of the team at Sosei Heptares in Cambridge (UK), with the advertised posts covering everything from data management, cheminformatics through to computational chemistry, both junior and senior.

I've pasted in the basic details of the posts below, but there are more details if you follow the links. Feel free to reach out to me if you have questions, or contact Chris de Graaf who heads the Computational Chemistry team.

Computational Chemist – 3 positions at Sosei Heptares (Cambridge, UK) (Research Scientist, Senior Scientist, Principal Scientist)

We are growing our Computer-Aided Drug Design and Cheminformatics/AI capabilities by extending the Sosei Heptares Computational Chemistry team with three additional positions:

Link to advertised positions

These cover all experience levels, from recent PhD to an experienced senior computational chemist in drug discovery. The positions are flexible, so different combinations of skills and/or experience are acceptable for the right candidate. Please forward this to anyone you feel would be passionate about joining the Sosei Heptares CompChem team, where scientific excellence and passion combine in a friendly, fun environment to impact drug discovery projects and create new cutting-edge approaches.

Discovery Data Manager at Sosei Heptares (Cambridge, UK)

In addition, Sosei Heptares is looking to recruit an experienced Discovery Data Manager to support our Research team:

Link to advertised position

This position is an exciting opportunity to work at the interface between Computational Chemistry, Medicinal Chemistry, Molecular Pharmacology, Translational Sciences, and Platform groups to streamline the GPCR structure-based drug discovery process in an industry-leading biotech company.

The closing date for all applications is 14th March.

Saturday, 24 October 2020

The SMILES reading benchmark - two years on

In August 2017, after attending an InChI meeting at the NIH in Bethesda, I had the idea of putting together a SMILES reading benchmark. I already had the bones of one to test my rewrite of Open Babel's reading of aromatic SMILES, but after attending a workshop led by Greg Landrum on Open File Formats for Chemical Information I decided to tidy it up and broaden the scope.

My goals were to identify issues affecting interoperability, to resolve those issues by working with developers, and to provide a resource to help future implementations avoid problems. This last goal has recently been realised through Rich Apodaca's work on a Rust-based SMILES parser where he gives an extensive write-up on the role of the SMILES benchmark. The benchmark has also been of use to the IUPAC SMILES+ project, which grew out of Greg's workshop at the NIH and is led by Vin Scalfani.

Results and progress were described in a poster at the ICCS in June 2018, and subsequently (with updates) at the ACS in Aug of that year in "A de facto standard or a free-for-all? A benchmark for reading SMILES". I've thought about writing up a paper but I was never really keen - the point wasn't to write a paper, or point out software that had problems, but to improve SMILES. 

Back in the heady days of 2017-18, my approach with the benchmark was to work with, or at least nudge, various software vendors/developers towards improved interoperability. A tricky task when I worked for a software vendor myself, was a developer of a cheminformatics toolkit, and was sometimes neither a customer nor a user. Despite this, the benchmark was reasonably successful...but not completely, and two years down the line I find myself in a different environment relying on different tools, and wondering if some more nudging in the right direction might help.

In this spirit, let's take a look at an example from the ChemDraw results in the benchmark (to be found here), illustrate the problem and work out the solution by hand.

Figure 1 (left) shows entry 26359 in the benchmark. The CDK generates the following aromatic SMILES for this structure: c1(=O)c2c(c3=c1n(nco3)C)cccc2. However, when this SMILES is pasted into ChemDraw, the depiction in Figure 1 (middle) is obtained, which resolves to the structure on the right on hitting Alt+K. No error or warning appeared that might indicate problems when reading the SMILES.

Figure 1

Now let's do this by hand. Figure 2 shows the structure as described by the SMILES string. A key point to note/remember is that a SMILES string exactly describes the hydrogen count on every atom - we 'just' need to work out the bond orders of the aromatic bonds making sure that every atom that needs a double bond gets exactly one.

Figure 2

For the actual details of the algorithm, check out the source code of Open Babel or my partialsmiles project (also the CDK, but that's a different algorithm than described here). But you can think of it like solving Minesweeper - to begin with we tackle the bits we are sure about, before we have to start guessing. The two bonds to the carbonyl carbon must be single bonds; ditto for the bonds to NMe, and to the O in the ring (see here for some details). The remaining bonds to be kekulized are shown in black in Figure 3 (left):

Figure 3

We'll call this point A. Each of the remaining black atoms needs to have a double bond. But which to start with? If we put the first double bond in the wrong place, we might end up having to start over. Again, you should start with those you are certain about - and those are the black atoms that have a single black bond. That bond must be a double bond. Once you've placed those, set the other neighbouring bonds to single and update the list of atoms that need a double bond; your structure will look like Figure 3 (middle).

At this point, there are no black atoms with just a single black bond, so it's time to guess: just choose one and place a double bond. Now update the list of atoms that need a double bond, and go back to point A. Keep repeating until all the bonds are kekulized...or there are no bonds left to choose.
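The loop above can be sketched in Python. This is a toy version (the function and variable names are mine, not partialsmiles internals): given the atoms that each still need a double bond and the candidate bonds between them, it repeatedly places forced double bonds before resorting to a guess.

```python
def greedy_kekulize(atoms, bonds):
    """Toy sketch of the degree-1-first kekulization loop.

    atoms: atom indices that each still need exactly one double bond.
    bonds: 2-tuples of atom indices (the black bonds in Figure 3).
    Returns a set of frozenset bonds chosen as double, or None if the
    greedy pass gets stuck (the augmenting-path repair is not shown).
    """
    needs = set(atoms)
    double = set()
    while needs:
        # Candidate bonds: both endpoints still need a double bond
        cand = sorted(tuple(sorted(b)) for b in bonds if set(b) <= needs)
        if not cand:
            return None  # stuck - some atoms are left without a double bond
        # Count how many candidate bonds touch each needy atom
        degree = {a: 0 for a in needs}
        for a, b in cand:
            degree[a] += 1
            degree[b] += 1
        # Forced placements first: a bond whose endpoint has no other choice
        forced = [bond for bond in cand if 1 in (degree[bond[0]], degree[bond[1]])]
        chosen = (forced or cand)[0]  # otherwise guess the first candidate
        double.add(frozenset(chosen))
        needs -= set(chosen)  # both endpoints are now satisfied
    return double

# Benzene: six aromatic atoms joined in a ring
ring = [(i, (i + 1) % 6) for i in range(6)]
result = greedy_kekulize(range(6), ring)
# -> three alternating double bonds: {0-1, 2-3, 4-5}
```

Each pass through the loop either places a forced bond or makes one guess, exactly as in the Minesweeper analogy.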

For more than 95% of the cases in the benchmark this will result in a kekulized structure. For the remaining cases, you instead end up with a pair of black atoms that don't have a double bond. To fix this, do a DFS to find an alternating path ('augmenting path') that joins them, and then flip the bond orders along the path. For example, consider the situation below, where I started by placing the double bond along the bond joining the 6-membered rings. To fix, just flip the bond orders from C-C=C-C to C=C-C=C.

Figure 4
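The repair itself can be sketched as a simple flip along the alternating path, representing double bonds as a set of frozensets (a toy helper of my own, not partialsmiles code; finding the path via DFS is not shown):

```python
def flip_path(double, path):
    """Flip bond orders along an alternating path of atoms.

    double: set of frozensets marking the double bonds (modified in place).
    path: atom indices a0..an joining the two unmatched atoms, where the
    bonds along the way alternate single/double.
    """
    for a, b in zip(path, path[1:]):
        bond = frozenset((a, b))
        # single becomes double, double becomes single
        if bond in double:
            double.remove(bond)
        else:
            double.add(bond)
    return double

# The C-C=C-C example from the text: only the middle bond is double...
double = {frozenset((1, 2))}
flip_path(double, [0, 1, 2, 3])
# ...and after the flip the pattern is C=C-C=C
```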

The described procedure will successfully kekulize any structure that can be kekulized. Feel free to reach out if you have any questions.

Sunday, 11 October 2020

Finding matched pairs of a peptide at the RDKit UGM

The recent RDKit UGM was a masterclass in how to organise a conference virtually, successfully replicating at least some of the in-person experience. This was due to the extensive use of Discord (best known as a chat server for gamerz) to manage questions, answers, discussion and networking, but also the technical support for Discord (thanks to Floriane Montanari) and Zoom (thanks to Christiane from Knime). With previous virtual meetings I have attended, the meeting only had an existence while someone was speaking; here discussions filled the interims between, and indeed the duration of, the talks.

I contributed a lightning talk to the meeting entitled "An efficient algorithm to find matched pairs of a peptide". Somehow I managed to give a talk on peptides without showing any peptide structures, which I'll blame on the 5 minute time limit and not on a perverse sense of humour.

Friday, 9 October 2020

Comparing methods two-by-two

It is common to compare different methods using results from N distinct datasets. My earlier blogpost described why the mean rank is not a good measure of performance in these cases. Essentially, the relative performance of two methods (e.g. A and B) can be altered based on the performance of other methods (e.g. C, D and E).

But it's not just the mean rank that's the problem. It's the use of any performance measure where the assessment of the pairwise performance (e.g. between methods A and B) can be altered by the performance of other methods.

At the recent (virtual) AI in Chemistry Meeting organised by the RSC, one of the speakers showed an assessment of different methods based on how frequently that method came first relative to the other methods. Is this a reasonable way to assess performance? Let's look at an example...

Consider two methods A and B assessed using this metric on 10 datasets, where A comes first 9 times and B comes first once. Clearly A is better than B, and this is reflected by this metric.

Now let's add a method C to this comparison. It turns out that C does better than A on every dataset but still fails to beat B on the 10th. This means that A never comes first, but B still comes first once. In other words, by adding method C to the comparison, the relative performance of A and B has been inverted according to this metric. Which can't be right - A is still better than B - other methods have nothing to say about this.

So what's the solution? Well, one possibility is to read my previous blog post starting from "So what's the solution?"

Having done so, let's apply that solution. The key point is that it only makes sense to compare the methods pairwise. So let's do so by giving each dataset a vote on which method is best. This is a paired comparison, with a greater ability to resolve differences. 10 say C>A, 8 (net, see note 1 below) say C>B, and 8 again say A>B. These results are depicted above (see note 2 below). We can summarise this (but lose some information in the general case) via transitive reduction, by removing the C--B edge.
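The pairwise tally can be sketched with toy scores constructed to reproduce the counts above (the score values themselves are made up; only the win/loss pattern across the 10 datasets matters):

```python
from itertools import combinations

# Toy per-dataset scores (higher is better) across 10 datasets, arranged
# so that C beats A everywhere but B still comes first on the last dataset
scores = {
    "A": [2] * 9 + [1],
    "B": [1] * 9 + [3],
    "C": [3] * 9 + [2],
}

def net_votes(scores, m1, m2):
    """Net number of datasets voting m1 over m2 (ties cancel out)."""
    return sum((x > y) - (x < y) for x, y in zip(scores[m1], scores[m2]))

for m1, m2 in combinations(scores, 2):
    n = net_votes(scores, m1, m2)
    winner, loser = (m1, m2) if n >= 0 else (m2, m1)
    print(f"{winner} > {loser}: net {abs(n)}")
# prints:
# A > B: net 8
# C > A: net 10
# C > B: net 8
```

Note that each pairwise result depends only on the two methods involved, so adding or removing a third method cannot invert it.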

Will this approach catch on? It's tricky: the obvious solution seems quite reasonable, the problem is quite subtle, and I have never seen it discussed in the field (or any field). Despite this, I will continue to pipe my thoughts directly to /dev/noel here.


1. If you're wondering why 9 x C>B and 1 x B>C leads to a net difference of 8, this is to handle the case of C=B. If it were 9 x C > B and 1 x B = C, the net difference would be 9.

2. This was generated from the following graphviz file using "dot -Tpng -o myfile.png":

digraph D {
C -> A [label="10"]
C -> B [label="8"]
A -> B [label="8"]
}