Sunday, 30 May 2021

Combining protein structure with deep generative models for ligands

Journal of Cheminformatics has just published the first result from a collaboration between ourselves at Sosei Heptares and the Andreas Bender group. Morgan Thomas, the PhD student who did all the work, has presented early versions of this at various AI/Chemistry meetings but it's finally out there:

Morgan Thomas, Robert T. Smith, Noel M. O'Boyle, Chris de Graaf, Andreas Bender. Comparison of structure- and ligand-based scoring functions for deep generative models: a GPCR case study. J. Cheminform. 2021, 13, 39.

Deep generative models have shown the ability to devise both valid and novel chemistry, which could significantly accelerate the identification of bioactive compounds. Many current models, however, use molecular descriptors or ligand-based predictive methods to guide molecule generation towards a desirable property space. This restricts their application to relatively data-rich targets, neglecting those where little data is available to sufficiently train a predictor. Moreover, ligand-based approaches often bias molecule generation towards previously established chemical space, thereby limiting their ability to identify truly novel chemotypes.

In this work, we assess the ability of using molecular docking via Glide—a structure-based approach—as a scoring function to guide the deep generative model REINVENT and compare model performance and behaviour to a ligand-based scoring function. Additionally, we modify the previously published MOSES benchmarking dataset to remove any induced bias towards non-protonatable groups. We also propose a new metric to measure dataset diversity, which is less confounded by the distribution of heavy atom count than the commonly used internal diversity metric. 
With respect to the main findings, we found that when optimizing the docking score against DRD2, the model improves predicted ligand affinity beyond that of known DRD2 active molecules. In addition, generated molecules occupy complementary chemical and physicochemical space compared to the ligand-based approach, and novel physicochemical space compared to known DRD2 active molecules. Furthermore, the structure-based approach learns to generate molecules that satisfy crucial residue interactions, which is information only available when taking protein structure into account.

Overall, this work demonstrates the advantage of using molecular docking to guide de novo molecule generation over ligand-based predictors with respect to predicted affinity, novelty, and the ability to identify key interactions between ligand and protein target. Practically, this approach has applications in early hit generation campaigns to enrich a virtual library towards a particular target, and also in novelty-focused projects, where de novo molecule generation either has no prior ligand knowledge available or should not be biased by it.

For further background, a Q&A with Morgan appears over on Andreas's blog.

Monday, 18 January 2021

Data/cheminf/compchem openings at Sosei Heptares

A year and a half into my new life in pharma, and I'm really enjoying it. And now we're looking for new members of the team at Sosei Heptares in Cambridge (UK), with the advertised posts covering everything from data management, cheminformatics through to computational chemistry, both junior and senior.

I've pasted in the basic details of the posts below, but there are more details if you follow the links. Feel free to reach out to me if you have questions (noel.oboyle@soseiheptares.com), or contact Chris de Graaf who heads the Computational Chemistry team.

Computational Chemist – 3 positions at Sosei Heptares (Cambridge, UK) (Research Scientist, Senior Scientist, Principal Scientist)

We are growing our Computer-Aided Drug Design and Cheminformatics/AI capabilities by extending the Sosei Heptares Computational Chemistry team with three additional positions:

Link to advertised positions

These cover all experience levels, from recent PhD graduates to highly experienced senior computational chemists in drug discovery. The positions are flexible, so different combinations of skills and/or experience are acceptable for the right candidate. Please forward this to anyone you feel would be passionate about joining the Sosei Heptares CompChem team, where scientific excellence and passion combine in a friendly, fun environment to impact drug discovery projects and create new cutting-edge approaches.

Discovery Data Manager at Sosei Heptares (Cambridge, UK)

In addition, Sosei Heptares is looking to recruit an experienced Discovery Data Manager to support our Research team:

Link to advertised position

This position is an exciting opportunity to work at the interface between Computational Chemistry, Medicinal Chemistry, Molecular Pharmacology, Translational Sciences, and Platform groups to streamline the GPCR structure-based drug discovery process in an industry-leading biotech company.

The closing date for all applications is 14th March.

Saturday, 24 October 2020

The SMILES reading benchmark - two years on

In August 2017, after attending an InChI meeting at the NIH in Bethesda, I had the idea of putting together a SMILES reading benchmark. I already had the bones of one to test my rewrite of Open Babel's reading of aromatic SMILES, but after attending a workshop led by Greg Landrum on Open File Formats for Chemical Information I decided to tidy it up and broaden the scope.

My goals were to identify issues affecting interoperability, to resolve those issues by working with developers, and to provide a resource to help future implementations avoid problems. This last goal has recently been realised through Rich Apodaca's work on a Rust-based SMILES parser where he gives an extensive write-up on the role of the SMILES benchmark. The benchmark has also been of use to the IUPAC SMILES+ project, which grew out of Greg's workshop at the NIH and is led by Vin Scalfani.

Results and progress were described in a poster at the ICCS in June 2018, and subsequently (with updates) at the ACS in Aug of that year in "A de facto standard or a free-for-all? A benchmark for reading SMILES". I've thought about writing up a paper but I was never really keen - the point wasn't to write a paper, or point out software that had problems, but to improve SMILES. 


Back in the heady days of 2017-18, my approach with the benchmark was to work with, or at least nudge, various software vendors/developers towards improved interoperability. A tricky task when I worked for a software vendor myself, was a developer of a cheminformatics toolkit, and was sometimes neither a customer nor a user. Despite this, the benchmark was reasonably successful...but not completely, and two years down the line I find myself in a different environment relying on different tools, and wondering if some more nudging in the right direction might help.

In this spirit, let's take a look at an example from the ChemDraw results in the benchmark (to be found here), illustrate the problem and work out the solution by hand.

Figure 1 (left) shows entry 26359 in the benchmark. The CDK generates the following aromatic SMILES for this structure: c1(=O)c2c(c3=c1n(nco3)C)cccc2. However, when this SMILES is pasted into ChemDraw, the depiction in Figure 1 (middle) is obtained, which resolves to the structure on the right on hitting Alt+K. No error or warning appeared that might indicate problems when reading the SMILES.

Figure 1

Now let's do this by hand. Figure 2 shows the structure as described by the SMILES string. A key point to remember is that a SMILES string exactly describes the hydrogen count on every atom - we 'just' need to work out the bond orders of the aromatic bonds, making sure that every atom that needs a double bond gets exactly one.

Figure 2

For the actual details of the algorithm, check out the source code of Open Babel or my partialsmiles project (also the CDK, but that uses a different algorithm from the one described here). But you can think of it like solving Minesweeper - to begin with we tackle the bits we are sure about, before we have to start guessing. The two bonds to the carbonyl carbon must be single bonds; ditto for the bonds to NMe, and to the O in the ring (see here for some details). The remaining bonds to be kekulized are shown in black in Figure 3 (left):

Figure 3

We'll call this point A. Each of the remaining black atoms needs to have a double bond. But which to start with? If we put the first double bond in the wrong place we might end up having to start over. Again, you should start with those you are certain about - and that's those black atoms that have a single black bond. This must be a double bond. Once you've placed those, set the other neighbouring bonds to single, and updated the list of atoms that need a double bond, your structure will look like Figure 3 (middle). 

At this point, there are no black atoms with just a single black bond, so it's time to guess: just choose one and place a double bond. Now update the list of atoms that need a double bond, and go back to point A. Keep repeating until all the bonds are kekulized...or there are no bonds left to choose.
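
As a rough illustration, here is a minimal Python sketch of this greedy procedure. This is not the Open Babel or partialsmiles code: the aromatic subsystem is reduced to a plain dictionary mapping each atom to its not-yet-kekulized neighbours, and the fix-up for the stuck cases described below is omitted.

def greedy_kekulize(bonds):
    """bonds: dict mapping each atom to the set of atoms it is joined to by a
    not-yet-kekulized bond. Every atom in 'bonds' needs exactly one double bond.
    Returns the list of bonds chosen to be double, or None if the greedy
    procedure gets stuck."""
    needs_double = set(bonds)
    double_bonds = []
    while needs_double:
        # Start with the atoms we are certain about: those with only one
        # remaining candidate bond...
        certain = [a for a in needs_double if len(bonds[a] & needs_double) == 1]
        if certain:
            atom = certain[0]
        else:
            # ...otherwise guess any atom that still has a candidate bond.
            candidates = [a for a in needs_double if bonds[a] & needs_double]
            if not candidates:
                return None  # stuck - this is where the fix-up described below is needed
            atom = candidates[0]
        nbr = next(iter(bonds[atom] & needs_double))
        double_bonds.append((atom, nbr))
        # Both atoms now have their double bond; their remaining bonds become single.
        needs_double -= {atom, nbr}
    return double_bonds

# For example, benzene as a six-membered ring of atoms 0-5:
benzene = {i: {(i - 1) % 6, (i + 1) % 6} for i in range(6)}
print(greedy_kekulize(benzene))  # one of the two Kekule structures, e.g. [(0, 1), (2, 3), (4, 5)]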

For more than 95% of the cases in the benchmark this will result in a kekulized structure. For the remaining cases, you instead end up with a pair of black atoms that don't have a double bond. To fix this, do a DFS to find an alternating path ('augmenting path') that joins them, and then flip the bond orders along the path. For example, consider the situation below, where I started by placing the double bond along the bond joining the 6-membered rings. To fix, just flip the bond orders from C-C=C-C to C=C-C=C.

Figure 4

The described procedure will successfully kekulize any structure that can be kekulized. Feel free to reach out if you have any questions.

Sunday, 11 October 2020

Finding matched pairs of a peptide at the RDKit UGM

The recent RDKit UGM was a masterclass in how to organise a conference virtually, successfully replicating at least some of the in-person experience. This was due to the extensive use of Discord (best known as a chat server for gamerz) to manage questions, answers, discussion and networking, but also to the technical support for Discord (thanks to Floriane Montanari) and Zoom (thanks to Christiane from Knime). At previous virtual meetings I have attended, the meeting only existed while someone was speaking; here, discussions filled the intervals between, and indeed the duration of, the talks.

I contributed a lightning talk to the meeting entitled "An efficient algorithm to find matched pairs of a peptide". Somehow I managed to give a talk on peptides without showing any peptide structures, which I'll blame on the 5 minute time limit and not on a perverse sense of humour.



Friday, 9 October 2020

Comparing methods two-by-two

It is common to compare different methods using results from N distinct datasets. My earlier blogpost described why the mean rank is not a good measure of performance in these cases. Essentially, the relative performance of two methods (e.g. A and B) can be altered based on the performance of other methods (e.g. C, D and E).

But it's not just the mean rank that's the problem. It's the use of any performance measure where the assessment of the pairwise performance (e.g. between methods A and B) can be altered by the performance of other methods.

At the recent (virtual) AI in Chemistry Meeting organised by the RSC, one of the speakers showed an assessment of different methods based on how frequently each method came first relative to the others. Is this a reasonable way to assess performance? Let's look at an example...

Consider two methods A and B assessed using this metric on 10 datasets, where A comes first 9 times and B comes first once. Clearly A is better than B, and this is reflected by this metric.

Now let's add a method C to this comparison. It turns out that C does better than A on every dataset but still fails to beat B on the 10th. This means that A never comes first, but B still comes first once. In other words, by adding method C to the comparison, the relative performance of A and B has been inverted according to this metric. This can't be right - A is still better than B, and other methods have nothing to say about this.

So what's the solution? Well, one possibility is to read my previous blog post starting from "So what's the solution?"

Having done so, let's apply that solution. The key point is that it only makes sense to compare the methods pairwise. So let's do so by giving each dataset a vote on which method is best. This is a paired comparison (which gives greater ability to resolve differences). 10 say C>A, 8 (net, see note 1 below) say C>B, and 8 again say A>B. These results can be depicted as a directed graph (see note 2 below). We can summarise this (but lose some information in the general case) by transitive reduction, removing the C--B edge.
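
Here is a minimal Python sketch of this pairwise voting. The scores are made up, but chosen so that they reproduce the numbers above:

from itertools import combinations

# Hypothetical scores for methods A, B and C on 10 datasets (higher is better),
# chosen to match the example: A beats B on 9 datasets, C beats A on all 10,
# and C beats B on all but the 10th.
scores = {
    "A": [9, 8, 9, 7, 8, 9, 8, 7, 9, 2],
    "B": [5, 6, 5, 6, 5, 6, 5, 6, 5, 8],
    "C": [10, 9, 10, 8, 9, 10, 9, 8, 10, 3],
}

for first, second in combinations(scores, 2):
    wins = sum(x > y for x, y in zip(scores[first], scores[second]))
    losses = sum(x < y for x, y in zip(scores[first], scores[second]))
    if wins >= losses:
        print("%s > %s (net %d votes)" % (first, second, wins - losses))
    else:
        print("%s > %s (net %d votes)" % (second, first, losses - wins))

This prints "A > B (net 8 votes)", "C > A (net 10 votes)" and "C > B (net 8 votes)".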

Will this approach catch on? It's tricky, because this is one of those areas where the obvious solution seems quite reasonable while the problem with it is quite subtle, and I have never seen it discussed in the field (or any field). Despite this, I will continue to pipe my thoughts directly to /dev/noel here.

Notes:

1. If you're wondering why 9 x C>B and 1 x B>C leads to a net difference of 8, this is to handle the case of C=B. If it were 9 x C > B and 1 x B = C, the net difference would be 9.

2. This was generated from the following graphviz file using "dot myfile.dot -Tpng -o myfile.png":

digraph D {
C -> A [label="10"]
C -> B [label="8"]
A -> B [label="8"]
}

Sunday, 5 July 2020

SMI to 2D MOL: Parallel processing with the CDK from Jython

At some point in a computational chemistry workflow, if you started with SMILES you need to transition to the world of 3D. A slight niggle in the works is that some tools may not correctly read SMILES, or indeed not read them at all. For this reason, it is useful to convert to 2D MOL files. In this respect, the CDK has established itself as one of the go-to tools thanks to efforts by John Mayfield (see for example, CDK depict and his RDKit talk on the topic).

The best way to use the CDK is from Java. However, I find the energy barrier to writing a Java program to be high, and so here I'll use Jython. Once installed, you just add the cdk-2.3.jar to the environment variable CLASSPATH and you are good to go.

Serial approach
I'll be writing about timings for reading 100K SMILES strings from a CSV file and converting them to an SDF. The baseline is the serial implementation, which takes about 60s.
# Python
import csv
import time

# Java
import java
import org.openscience.cdk as cdk

sp = cdk.smiles.SmilesParser(cdk.silent.SilentChemObjectBuilder.getInstance())
TITLE = cdk.CDKConstants.TITLE

def calculate(smi, title):
    # Read SMILES
    mol = sp.parseSmiles(smi)

    # Set the title
    mol.setProperty(TITLE, title)

    # Do the SDG
    sdg = cdk.layout.StructureDiagramGenerator()
    sdg.generateCoordinates(mol)

    # Write SDF file
    writer = java.io.StringWriter()
    molwriter = cdk.io.SDFWriter(writer)
    molwriter.write(mol)
    molwriter.close() # flushes

    return writer.toString()

if __name__ == "__main__":
    INPUTFILE = "100000.csv"
    OUTPUTFILE = "out.sdf"

    t = time.time()
    with open(OUTPUTFILE, "w") as out:
        with open(INPUTFILE) as inp:
            reader = csv.reader(inp)
            for smi, _, title in reader:
                out.write(calculate(smi, title))
    print(time.time() - t)
If we have millions of SMILES strings, a parallel approach can help. Unfortunately, Jython does not provide an implementation of the multiprocessing library so we need to do this the Java way...

Approach 1 - Using streams
The script below reads SMILES strings from a CSV file as a stream and passes them one-at-a-time to multiple threads running in parallel, where each is converted to an SDF entry. The API doesn't provide any way (as far as I can tell) to control the number of threads. The SDF entries are written to the output file in the original order if ".forEachOrdered" is used instead of ".forEach". There was a 4.5X speed-up, from 60s to 13s. This was on a machine with 12 physical cores (24 logical, due to hyperthreading). Timings for forEach() instead of forEachOrdered() were about the same (surprisingly).
# Python
import csv

# Java
import java
import org.openscience.cdk as cdk
from java.nio.file import Files, Paths
from java.util.function import Function, Consumer

sp = cdk.smiles.SmilesParser(cdk.silent.SilentChemObjectBuilder.getInstance())
TITLE = cdk.CDKConstants.TITLE

def smi2sdf(line):
    smi, _, title = next(csv.reader([line]))

    # Read SMILES
    mol = sp.parseSmiles(smi)

    # Set the title
    mol.setProperty(TITLE, title)

    # Do the SDG
    sdg = cdk.layout.StructureDiagramGenerator()
    sdg.generateCoordinates(mol)

    # Write SDF file
    writer = java.io.StringWriter()
    molwriter = cdk.io.SDFWriter(writer)
    molwriter.write(mol)
    molwriter.close() # flushes

    return writer.toString()

class Calculate(Function):
    def apply(self, text):
        return smi2sdf(text)

class Write(Consumer):
    def __init__(self, filename):
        self.mfile = open(filename, "w")
    def accept(self, text):
        self.mfile.write(text)
    def __del__(self):
        self.mfile.close()

if __name__ == "__main__":
    INPUTFILE = "100000.csv"
    OUTPUTFILE = "out.sdf"

    calculate = Calculate()
    write = Write(OUTPUTFILE)

    Files.lines(Paths.get(INPUTFILE)).parallel().map(calculate).forEach(write)
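As noted above, if the output order matters, the only change needed should be to swap forEach for forEachOrdered on the final line:
    Files.lines(Paths.get(INPUTFILE)).parallel().map(calculate).forEachOrdered(write)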
Approach 2 - Using a ThreadPool
Daniel Lowe suggested using a ThreadPool and provided example Java code showing that it ran faster than the streams approach. This was also the case in Jython, where a timing of 9.6s was obtained for 12 threads, a 6X speedup over the serial implementation. The upside of using a ThreadPool is that the number of threads can be controlled explicitly; it's worth noting that using 24 threads actually slowed things down to 10.2s - a reminder that "hyperthreading" is marketing BS. A potential downside is that there is no way (with this implementation at least) to order the output.
# Python
import csv

# Java
import java
import org.openscience.cdk as cdk
import java.util.concurrent as conc

sp = cdk.smiles.SmilesParser(cdk.silent.SilentChemObjectBuilder.getInstance())
TITLE = cdk.CDKConstants.TITLE

def calculate(smi, title):

    # Read SMILES
    mol = sp.parseSmiles(smi)

    # Set the title
    mol.setProperty(TITLE, title)

    # Do the SDG
    sdg = cdk.layout.StructureDiagramGenerator()
    sdg.generateCoordinates(mol)

    # Write SDF file
    writer = java.io.StringWriter()
    molwriter = cdk.io.SDFWriter(writer)
    molwriter.write(mol)
    molwriter.close() # flushes

    return writer.toString()

class SmiToMol(java.lang.Runnable):

    def __init__(self, smi, title, writer):
        self.smi = smi
        self.title = title
        self.writer = writer

    def run(self):
        self.writer.write(calculate(self.smi, self.title))

class LimitedQueue(conc.LinkedBlockingQueue):
    serialVersionUID = 1

    def __init__(self, maxSize):
        conc.LinkedBlockingQueue.__init__(self, maxSize)

    def offer(self, e):
        # convert offer to 'put' to make it blocking
        try:
            self.put(e)
            return True
        except java.lang.InterruptedException:
            java.lang.Thread.currentThread().interrupt()
        return False

if __name__ == "__main__":
    INPUTFILE = "100000.csv"
    OUTPUTFILE = "out.threadpool.sdf"

    THREADS = 12
    executor = conc.ThreadPoolExecutor(THREADS, THREADS, 0, conc.TimeUnit.MILLISECONDS, LimitedQueue(THREADS * 2))
    with open(OUTPUTFILE, "w") as out:
        with open(INPUTFILE) as inp:
            reader = csv.reader(inp)
            for smi, _, title in reader:
                executor.execute(SmiToMol(smi, title, out))
        executor.shutdown()
        executor.awaitTermination(10000, conc.TimeUnit.SECONDS)
Credits
Thanks to Daniel Lowe and John Mayfield for an interesting discussion about various approaches and what's going on under-the-hood.

Sunday, 31 May 2020

Python patterns for processing large SDF files

Running large SDF files through a processing pipeline is not an uncommon task in cheminformatics/computational chemistry. Here are some patterns I have been using to manage this task while making maximum use of available processing power.

An SDF iterator
Cheminformatics toolkits are great, but have you ever not used one? If you are handing off an entry in an SDF file to third-party software for processing, then there is no need to make your processor do the busy-work of reading the SDF file, building a molecule, and then writing it out again. Here's a lazy iterator over a file-like object that returns SDF entries one-at-a-time.
def sdf_iterator(fileobj):
    data = []
    for line in fileobj:
        data.append(line)
        if line.startswith("$$$$"):
            yield "".join(data)
            data = []

Handle gzipped files
As all chemists know, an atom is mostly empty space. It is fitting therefore that SDF files mirror this property and exhibit excellent compressibility. In practice, one ends up having to deal with a mixture of compressed and uncompressed SDF files, and it's tedious adapting code to handle them both. Here's a helper function one could use for this:
import gzip

def myopen(filename):
    return gzip.open(filename, "rt") if filename.endswith(".gz") else open(filename)

if __name__ == "__main__":
    # assign inputfile as either an .sdf or .sdf.gz
    with myopen(inputfile) as inp:
        for sdfentry in sdf_iterator(inp):
            # do something with sdfentry

Calling a command-line application from Python to process an SDF entry
For some reason, not everything is wrapped up nicely as a Python library, and often it's necessary to pass an SDF file to a command-line application for processing and get the result back. For example, the result could be another SDF file containing docked poses and results, or protonated molecules, or shape-matched hits. In the best-case scenario, the command-line application can read from standard input and write to standard out. This is the situation I describe below; it's the most efficient way of communicating as disk-access will slow things down. The alternative is to use temporary files - this can be more tricky than you'd think, and is best avoided if at all possible.
import subprocess

def calculate(sdf):
    proc = subprocess.Popen(["noelina", "-option1", "option1value", "-option2", "option2value"], stdin=subprocess.PIPE, stdout=subprocess.PIPE, encoding="utf-8")
    stdout, stderr = proc.communicate(sdf)
    return stdout

if __name__ == "__main__":
    with open("outputfile.sdf") as out:
        with myopen("inputfile.sdf") as inp:
            for sdfentry in sdf_iterator(inp):
                output = calculate(sdfentry)
                out.write(output)
Note that you may wish to inspect and handle error messages. My advice is to pass any errors back to the caller (as a string, that is, not an Exception), and let it handle them (e.g. write them to a logfile). This separates the processing of the results from their generation. This is particularly important when we move to parallel processing (see below), where we want the processing of results to be handled by a single process, the primary process.
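
As a minimal sketch of this idea (the calculate_with_errors name and errors.log file are hypothetical, and "noelina" is the same placeholder application as above), the application's stderr can be captured and returned alongside its stdout:

import subprocess

def calculate_with_errors(sdf):
    # Capture stderr as well as stdout, and return both to the caller
    proc = subprocess.Popen(["noelina", "-option1", "option1value", "-option2", "option2value"],
                            stdin=subprocess.PIPE, stdout=subprocess.PIPE,
                            stderr=subprocess.PIPE, encoding="utf-8")
    stdout, stderr = proc.communicate(sdf)
    return stdout, stderr

if __name__ == "__main__":
    with open("errors.log", "w") as log, open("outputfile.sdf", "w") as out:
        with myopen("inputfile.sdf") as inp:
            for sdfentry in sdf_iterator(inp):
                output, errors = calculate_with_errors(sdfentry)
                out.write(output)
                if errors:
                    log.write(errors)  # the caller decides what to do with errors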

Parallel processing I
It's all very well doing this from Python, but it would have been faster and simpler to call the command-line application directly using the command-line. The advantage of using Python comes into play when we consider parallel processing.

There are several approaches you can use here. One is to split the SDF file into N chunks, one for each processor, and then run the command-line application N times in the background. And then you wait as you watch all of the chunks finish, except for that last one, which just keeps going seemingly for ever. To avoid this problem, a more sophisticated approach is to split the file into M chunks, where M is a few times larger than N, and use GNU parallel to manage a queue over N processors.

I prefer using Python for this as it feels much tidier. Everything is contained in a single fairly-simple script. It avoids the need for manual splitting and collation. The script finishes when everything is finished (so I don't need to keep checking 'top'...but I do anyway :-) ), which means I can queue things up at a higher-level. The magic is handled by the multiprocessing library (see my earlier post for more details):
import multiprocessing

if __name__ == "__main__":
    INPUTFILE = "myinput.sdf"
    OUTPUTFILE = "myoutput.sdf"

    POOLSIZE = 12 # the number of CPUs to use
    CHUNKSIZE = 250

    pool = multiprocessing.Pool(POOLSIZE)
    with open(OUTPUTFILE, "w") as out:
        with myopen(INPUTFILE) as inp:
            miter = sdf_iterator(inp)
            for data in pool.imap_unordered(calculate, miter, CHUNKSIZE):
                out.write(data)

You may question whether 'calculate' and 'data' are appropriate function/variable names in this case. The thing is, I use this pattern over and over again for different applications, and I just keep the same variable names. Regarding CHUNKSIZE, I try to choose a value that keeps each process running for 10 seconds or so (check 'top' immediately after starting the script); this tends to ensure that all processes are running at 100% instead of waiting to communicate their results to the primary process.

Chunked parallel processing II
The code above will work fine, but it's not as efficient as I would like for this use-case. The command-line application is being called with a single SDF entry each time, and there is a price to pay in terms of process overhead. What about the CHUNKSIZE in the code above? Well, that's solving a different but related problem: minimising the communication between the N worker processes and the primary process. Each worker is given a chunk of 250 SDF entries, and these are passed one-at-a-time to the calculate function. What we'd really like to do is give the command-line application the whole chunk of 250 in one go. Enter chunked_sdf_iterator(), which lazily returns chunks of num_mols molecules at a time:
def chunked_sdf_iterator(fileobj, num_mols):
    miter = sdf_iterator(fileobj)
    tmp = []
    i = 0
    for sdf in miter:
        if i==num_mols:
            yield "".join(tmp)
            tmp = []
            i = 0
        i += 1
        tmp.append(sdf)
    yield "".join(tmp) # NOTE!!

Note the final line of the function. It's quite easy to accidentally omit the final chunk when writing iterators like this, so always test by checking that the output includes the final entry.
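
For example, here is a quick sanity check along those lines (a minimal sketch using three fake entries rather than real SDF records):

from io import StringIO

# Three fake 'entries', chunked two at a time, should give two chunks,
# with the final chunk containing the last entry.
fake_sdf = "first\n$$$$\nsecond\n$$$$\nthird\n$$$$\n"
chunks = list(chunked_sdf_iterator(StringIO(fake_sdf), 2))
assert len(chunks) == 2
assert chunks[-1] == "third\n$$$$\n"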

Given this function, only small changes are needed to __main__ to use it:
if __name__ == "__main__":
    INPUTFILE = "myinput.sdf"
    OUTPUTFILE = "myoutput.sdf"

    CHUNK_MOLS = 250
    POOLSIZE = 12 # the number of CPUs to use
    CHUNKSIZE = 1

    pool = multiprocessing.Pool(POOLSIZE)

    with open(OUTPUTFILE, "w") as out:
        with myopen(INPUTFILE) as inp:
            miter = chunked_sdf_iterator(inp, CHUNK_MOLS)
            for data in pool.imap_unordered(calculate, miter, CHUNKSIZE):
                out.write(data)

Note that it's possible to keep using a value of CHUNKSIZE > 1, but there's no real benefit and it's simpler this way. Just ensure that each process runs for around 10 seconds, as before, by increasing CHUNK_MOLS.