Sunday, 5 July 2020

SMI to 2D MOL: Parallel processing with the CDK from Jython

At some point in a computational chemistry workflow, if you started with SMILES, you need to transition to the world of 3D. A slight niggle in the works is that some tools may not correctly read SMILES, or indeed may not read them at all. For this reason, it is useful to convert to 2D MOL files. In this respect, the CDK has established itself as one of the go-to tools thanks to efforts by John Mayfield (see for example, CDK depict and his RDKit talk on the topic).

The best way to use the CDK is from Java. However, I find the energy barrier to writing a Java program to be high, and so here I'll use Jython. Once installed, you just add cdk-2.3.jar to the CLASSPATH environment variable and you are good to go.

Serial approach
I'll be writing about timings for reading 100K SMILES strings from a CSV file and converting them to an SDF. The baseline is the serial implementation, which takes about 60s.
# Python
import csv
import time

# Java
import java
import org.openscience.cdk as cdk

sp = cdk.smiles.SmilesParser(cdk.silent.SilentChemObjectBuilder.getInstance())
TITLE = cdk.CDKConstants.TITLE

def calculate(smi, title):
    # Read SMILES
    mol = sp.parseSmiles(smi)

    # Set the title
    mol.setProperty(TITLE, title)

    # Do the SDG
    sdg = cdk.layout.StructureDiagramGenerator()
    sdg.generateCoordinates(mol)

    # Write SDF file
    writer = java.io.StringWriter()
    molwriter = cdk.io.SDFWriter(writer)
    molwriter.write(mol)
    molwriter.close() # flushes

    return writer.toString()

if __name__ == "__main__":
    INPUTFILE = "100000.csv"
    OUTPUTFILE = "out.sdf"

    t = time.time()
    with open(OUTPUTFILE, "w") as out:
        with open(INPUTFILE) as inp:
            reader = csv.reader(inp)
            for smi, _, title in reader:
                out.write(calculate(smi, title))
    print(time.time() - t)
If we have millions of SMILES strings, a parallel approach can help. Unfortunately, Jython does not provide an implementation of the multiprocessing library so we need to do this the Java way...

Approach 1 - Using streams
The script below reads SMILES strings from a CSV file as a stream and passes them one at a time to multiple threads running in parallel, where each is converted to an SDF entry. The API doesn't allow any access (as far as I can tell) to control the number of threads. The SDF entries are written to the output file in the original order if ".forEachOrdered" is used rather than ".forEach". There was a 4.5X speed-up, from 60s to 13s. This was on a machine with 12 physical cores (24 logical, due to hyperthreading). Timings for forEach() instead of forEachOrdered() were about the same (surprisingly).
# Python
import csv

# Java
import java
import org.openscience.cdk as cdk
from java.nio.file import Files, Paths
from java.util.function import Function, Consumer

sp = cdk.smiles.SmilesParser(cdk.silent.SilentChemObjectBuilder.getInstance())
TITLE = cdk.CDKConstants.TITLE

def smi2sdf(line):
    smi, _, title = next(csv.reader([line]))

    # Read SMILES
    mol = sp.parseSmiles(smi)

    # Set the title
    mol.setProperty(TITLE, title)

    # Do the SDG
    sdg = cdk.layout.StructureDiagramGenerator()
    sdg.generateCoordinates(mol)

    # Write SDF file
    writer = java.io.StringWriter()
    molwriter = cdk.io.SDFWriter(writer)
    molwriter.write(mol)
    molwriter.close() # flushes

    return writer.toString()

class Calculate(Function):
    def apply(self, text):
        return smi2sdf(text)

class Write(Consumer):
    def __init__(self, filename):
        self.mfile = open(filename, "w")
    def accept(self, text):
        self.mfile.write(text)
    def __del__(self):
        self.mfile.close()

if __name__ == "__main__":
    INPUTFILE = "100000.csv"
    OUTPUTFILE = "out.sdf"

    calculate = Calculate()
    write = Write(OUTPUTFILE)

    Files.lines(Paths.get(INPUTFILE)).parallel().map(calculate).forEach(write)
Approach 2 - Using a ThreadPool
Daniel Lowe suggested using a ThreadPool and provided example Java code showing that it ran faster than the streams approach. This was also the case in Jython, where a timing of 9.6s was obtained for 12 threads, a 6X speedup over the serial implementation. The upside of using a ThreadPool is that the number of threads can be controlled explicitly, and it's worth noting that using 24 actually slowed things down to 10.2s - a reminder that "hyperthreading" is marketing BS. A potential downside is that there's no possibility (with this implementation at least) of ordering the output.
# Python
import csv

# Java
import java
import org.openscience.cdk as cdk
import java.util.concurrent as conc

sp = cdk.smiles.SmilesParser(cdk.silent.SilentChemObjectBuilder.getInstance())
TITLE = cdk.CDKConstants.TITLE

def calculate(smi, title):

    # Read SMILES
    mol = sp.parseSmiles(smi)

    # Set the title
    mol.setProperty(TITLE, title)

    # Do the SDG
    sdg = cdk.layout.StructureDiagramGenerator()
    sdg.generateCoordinates(mol)

    # Write SDF file
    writer = java.io.StringWriter()
    molwriter = cdk.io.SDFWriter(writer)
    molwriter.write(mol)
    molwriter.close() # flushes

    return writer.toString()

class SmiToMol(java.lang.Runnable):

    def __init__(self, smi, title, writer):
        self.smi = smi
        self.title = title
        self.writer = writer

    def run(self):
        self.writer.write(calculate(self.smi, self.title))

class LimitedQueue(conc.LinkedBlockingQueue):
    serialVersionUID = 1

    def __init__(self, maxSize):
        conc.LinkedBlockingQueue.__init__(self, maxSize)

    def offer(self, e):
        # convert offer to a blocking 'put' so that submission waits when the queue is full
        try:
            self.put(e)
            return True
        except java.lang.InterruptedException:
            java.lang.Thread.currentThread().interrupt()
        return False

if __name__ == "__main__":
    INPUTFILE = "100000.csv"
    OUTPUTFILE = "out.threadpool.sdf"

    THREADS = 12
    executor = conc.ThreadPoolExecutor(THREADS, THREADS, 0, conc.TimeUnit.MILLISECONDS, LimitedQueue(THREADS * 2))
    with open(OUTPUTFILE, "w") as out:
        with open(INPUTFILE) as inp:
            reader = csv.reader(inp)
            for smi, _, title in reader:
                executor.execute(SmiToMol(smi, title, out))
        executor.shutdown()
        executor.awaitTermination(10000, conc.TimeUnit.SECONDS)
Credits
Thanks to Daniel Lowe and John Mayfield for an interesting discussion about various approaches and what's going on under-the-hood.

Sunday, 31 May 2020

Python patterns for processing large SDF files

Running large SDF files through a processing pipeline is not an uncommon task in cheminformatics/computational chemistry. Here are some patterns I have been using to manage this task while making maximum use of available processing power.

An SDF iterator
Cheminformatics toolkits are great, but have you ever not used one? If you are handing off an entry in an SDF file to third-party software for processing, then there is no need to make your processor do the busy-work of reading the SDF file, building a molecule, and then writing it out again. Here's a lazy iterator over a file-like object that returns SDF entries one-at-a-time.
def sdf_iterator(fileobj):
    data = []
    for line in fileobj:
        data.append(line)
        if line.startswith("$$$$"):
            yield "".join(data)
            data = []

Handle gzipped files
As all chemists know, an atom is mostly empty space. It is fitting therefore that SDF files mirror this property and exhibit excellent compressibility. In practice, one ends up having to deal with a mixture of compressed and uncompressed SDF files, and it's tedious adapting code to handle them both. Here's a helper function one could use for this:
import gzip

def myopen(filename):
    return gzip.open(filename, "rt") if filename.endswith(".gz") else open(filename)

if __name__ == "__main__":
    # assign inputfile as either an .sdf or .sdf.gz
    with myopen(inputfile) as inp:
        for sdfentry in sdf_iterator(inp):
            pass # do something with sdfentry

Calling a command-line application from Python to process an SDF entry
For some reason, not everything is wrapped up nicely as a Python library, and often it's necessary to pass an SDF file to a command-line application for processing and get the result back. For example, the result could be another SDF file containing docked poses and results, or protonated molecules, or shape-matched hits. In the best-case scenario, the command-line application can read from standard input and write to standard output. This is the situation I describe below; it's the most efficient way of communicating, as disk access will slow things down. The alternative is to use temporary files - this can be trickier than you'd think, and is best avoided if at all possible.
import subprocess

def calculate(sdf):
    proc = subprocess.Popen(["noelina", "-option1", "option1value", "-option2", "option2value"],
                            stdin=subprocess.PIPE, stdout=subprocess.PIPE, encoding="utf-8")
    stdout, stderr = proc.communicate(sdf)
    return stdout

if __name__ == "__main__":
    with open("outputfile.sdf", "w") as out:
        with myopen("inputfile.sdf") as inp:
            for sdfentry in sdf_iterator(inp):
                output = calculate(sdfentry)
                out.write(output)
Note that you may wish to inspect and handle error messages. My advice is to pass any errors back to the caller (as a string that is, not an Exception), and let it handle them (e.g. write them to a logfile). This separates the processing of the results from their generation. This is particularly important when we move to parallel processing (see below), where we want the processing of results to be handled by a single process, the primary process.
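As a sketch of what that might look like (the "noelina" executable and its options are placeholders as above, and "errors.log" is a hypothetical log file), a variant of calculate() that also captures standard error and hands it back to the caller:
import subprocess

def calculate_with_errors(sdf):
    # As calculate() above, but also capture stderr so the caller can decide what to log
    proc = subprocess.Popen(["noelina", "-option1", "option1value", "-option2", "option2value"],
                            stdin=subprocess.PIPE, stdout=subprocess.PIPE,
                            stderr=subprocess.PIPE, encoding="utf-8")
    stdout, stderr = proc.communicate(sdf)
    return stdout, stderr

if __name__ == "__main__":
    with open("outputfile.sdf", "w") as out, open("errors.log", "w") as log:
        with myopen("inputfile.sdf") as inp:
            for sdfentry in sdf_iterator(inp):
                output, errors = calculate_with_errors(sdfentry)
                out.write(output)
                if errors:
                    log.write(errors)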

Parallel processing I
It's all very well doing this from Python, but it would have been faster and simpler to call the command-line application directly from the command line. The advantage of using Python comes into play when we consider parallel processing.

There are several approaches you can use here. One is to split the SDF file into N chunks, one for each processor, and then run the command-line application N times in the background. And you then wait as you watch all of the chunks finish, except for that last one, which just keeps going seemingly for ever. To avoid this problem, a more sophisticated approach is to split into M chunks where M is a few multiples larger than N, and use GNU parallel to manage a queue over N processors.

I prefer using Python for this as it feels much tidier. Everything is contained in a single fairly simple script. It avoids the need for manual splitting and collation. The script finishes when everything is finished (so I don't need to keep checking 'top'...but I do anyway :-) ), which means I can queue things up at a higher level. The magic is handled by the multiprocessing library (see my earlier post for more details):
import multiprocessing

if __name__ == "__main__":
    INPUTFILE = "myinput.sdf"
    OUTPUTFILE = "myoutput.sdf"

    POOLSIZE = 12 # the number of CPUs to use
    CHUNKSIZE = 250

    pool = multiprocessing.Pool(POOLSIZE)
    with open(OUTPUTFILE, "w") as out:
        with myopen(INPUTFILE) as inp:
            miter = sdf_iterator(inp)
            for data in pool.imap_unordered(calculate, miter, CHUNKSIZE):
                out.write(data)

You may question whether 'calculate' and 'data' are appropriate function/variable names in this case. The thing is, I use this pattern over and over again for different applications, and I just keep the same variable names. Regarding CHUNKSIZE, I try to choose a value that keeps each process running for 10 seconds or so (check 'top' immediately after starting the script); this tends to ensure that all processes are running at 100% instead of waiting to communicate their results to the primary process.

Chunked parallel processing II
The code above will work fine, but it's not as efficient as I would like for this use-case. The command-line application is being called with a single SDF entry each time, and there is a price to pay in terms of process overhead. What about the CHUNKSIZE in the code above? Well, that's solving a different but similar problem: minimising the communication between the N worker processes and the primary process. Each worker is given a chunk of 250 SDF entries, and these are passed one at a time to the calculate function. What we'd really like to do is give the command-line application the whole chunk of 250 in one go. Enter the chunked_sdf_iterator(), which lazily returns chunks of num_mols molecules at a time:
def chunked_sdf_iterator(fileobj, num_mols):
    miter = sdf_iterator(fileobj)
    tmp = []
    i = 0
    for sdf in miter:
        if i==num_mols:
            yield "".join(tmp)
            tmp = []
            i = 0
        i += 1
        tmp.append(sdf)
    yield "".join(tmp) # NOTE!!

Note the final line of the function. It's quite easy to accidentally omit the final instance when writing iterators like this, so always test by checking that the output includes the final case.
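For example, a quick sanity check along those lines, using a handful of fake entries so that no toolkit or input file is needed:
# Three fake SDF entries, chunked two at a time; the final, partially-filled
# chunk must still appear in the output.
fake_lines = ["first\n", "$$$$\n", "second\n", "$$$$\n", "third\n", "$$$$\n"]
chunks = list(chunked_sdf_iterator(fake_lines, 2))
assert len(chunks) == 2
assert chunks[-1] == "third\n$$$$\n"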

Given this function, only small changes are needed to __main__ to use it:
if __name__ == "__main__":
    INPUTFILE = "myinput.sdf"
    OUTPUTFILE = "myoutput.sdf"

    CHUNK_MOLS = 250
    POOLSIZE = 12 # the number of CPUs to use
    CHUNKSIZE = 1

    pool = multiprocessing.Pool(POOLSIZE)

    with open(OUTPUTFILE, "w") as out:
        with myopen(INPUTFILE) as inp:
            miter = chunked_sdf_iterator(inp, CHUNK_MOLS)
            for data in pool.imap_unordered(calculate, miter, CHUNKSIZE):
                out.write(data)

Note that it's possible to keep using a value of CHUNKSIZE > 1 but there's no real benefit and it's simpler this way. Just ensure that the process takes around 10 seconds as before by increasing CHUNK_MOLS.

Tuesday, 18 February 2020

Reflecting on stereochemistry

Stereochemistry is a tricky concept. And writing programs to handle it is equally tricky. Take tetrahedral stereochemistry, for example; it's 'interesting' to note that none of the following markers of parity use the same system: R/S in IUPAC names, D/L in biochemical nomenclature, @/@@ in SMILES, atom parity flag 1/2 in MOL files, or InChI's +/-. Which is why a sane cheminformatics library abstracts the representation of stereochemistry away from any one format to a form that can be more easily reasoned about and interconverted to all of them.

Here's the introduction to a description I wrote up last week of how stereochemistry is handled in Open Babel. In particular, it tries to make clear how the different aspects of stereochemistry information (e.g. coordinates versus stereochemical configurations perceived from those coordinates) interact with each other from the point of view of the toolkit.

Open Babel stores stereochemistry as the relative arrangement of a set of atoms in space. For example, for a tetrahedral stereocenter, we store information like “looking from atom 2, atoms 4, 5 and 6 are arranged clockwise around atom 3”. This section describes how a user can work with or manipulate this information. This might be useful to invert a particular center, replace a substituent at a stereocenter, enumerate stereoisomers or determine the number of unspecified stereocenters.

Although Open Babel has data structures to support a variety of forms of stereochemistry, currently little use is made of any stereochemistry other than tetrahedral and cis/trans (and square planar to a certain degree).

We will look first of all at how stereochemistry information is stored, accessed, and modified. Then we describe how this information is deduced from the chemical structure. This chapter should be read in combination with the API documentation (see the Stereochemistry overview page found under “Modules”)...
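To give a flavour of what this looks like from Python, here is a minimal sketch of my own (not taken from the docs excerpt above) that prints the stored tetrahedral configurations via OBStereoFacade; I'm assuming the Open Babel 3.x Python bindings here, so see the Stereochemistry overview in the API documentation for the definitive accessor names:
from openbabel import pybel
ob = pybel.ob

mol = pybel.readstring("smi", "N[C@@H](C)C(=O)O")  # one enantiomer of alanine
facade = ob.OBStereoFacade(mol.OBMol)
for atom in ob.OBMolAtomIter(mol.OBMol):
    atomid = atom.GetId()
    if facade.HasTetrahedralStereo(atomid):
        config = facade.GetTetrahedralStereo(atomid).GetConfig()
        # config.center is the id of the stereocenter, config.refs the three
        # reference atoms, and config.winding their clockwise/anticlockwise arrangement
        print(config.center, list(config.refs),
              config.winding == ob.OBStereo.Clockwise)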

The full version can be found in the development docs. Thanks to Geoff Hutchison and Stefano Forli for feedback on an early draft.

Image credit: Monument to Paul Walden by Anita Gould (licensed CC-BY-NC)

Monday, 30 December 2019

No charge - A simple approach to neutralising charged molecules

There are several reasons you might want to convert a molecule with charges to its neutral form. For example, as a step in standardising a representation for registration, to generate a parent form during registration, to deduplicate vendor-supplied databases, or to simplify SMARTS pattern searches for a particular ionizable group. Here I describe a simple procedure to generate the neutral form of a molecule, and briefly compare it to existing approaches. While a more nuanced approach may be more appropriate to prepare molecules for registration, the described method appears to be suitable for the other applications listed above.

Algorithm
To a first approximation the procedure is simply to identify all atoms with either a +1 or -1 charge, set their charges to zero and adjust their hydrogen counts by -1 or +1 (i.e. we are removing/adding H+). The first minor wrinkle is that +1 charged atoms must have a hydrogen, as otherwise we can't remove H+. The second wrinkle is that we must avoid altering charge-separated representations of multiple bonds, such as nitro which is often represented as [N+](=O)[O-]. I do this by checking whether a neighbor atom has a negative charge (for the +1 atoms) or a positive charge (for the -1 atoms).

Test set
For the purposes of timing implementations of the algorithm and comparing results to another algorithm, I created a test set from a ChEMBL 23 SMILES file I had to hand:
> obabel chembl_23.smi -r -O singlecomp.smi
> obabel singlecomp.smi -aa -s "[!+0]" -O charged.smi
> C:\cygwin\bin\gawk '{print length($0), $1, $2 }' < charged.smi | C:\cygwin\bin\sort -n > tmp.txt
> C:\cygwin\bin\cut -f 2,3 -d " " tmp.txt > charged.sorted.smi

Preamble
First, my usual boilerplate. The code below does not actually use Pybel, but I find it convenient to import both Pybel and openbabel as follows:
from openbabel import pybel
ob = pybel.ob
Implementations
I will describe four implementations of the algorithm above, which show my progress in optimising the code for speed. Each implementation gives identical results on the test set, but each runs faster than the one before. Note that hydrogens are assumed to be implicit (this will be the case if SMILES written by Open Babel are used as input, as in the test set above); if not, then they must either be converted to implicit, or the algorithm adjusted to take this into account.

Version 1
This would be the fastest version, if this were written in C++. When applied to the 125,536 molecules in the test set, this version spends 11.3s in the neutralize_v1 function.
def NoNegativelyChargedNbr(atom):
    for nbr in ob.OBAtomAtomIter(atom):
        if nbr.GetFormalCharge() < 0:
            return False
    return True

def NoPositivelyChargedNbr(atom):
    for nbr in ob.OBAtomAtomIter(atom):
        if nbr.GetFormalCharge() > 0:
            return False
    return True

def neutralize_v1(mol):
    """Neutralize charges of +1 or -1"""
    for atom in ob.OBMolAtomIter(mol):
        chg = atom.GetFormalCharge()
        if chg == 1:
            hcount = atom.GetImplicitHCount()
            if hcount >= 1 and NoNegativelyChargedNbr(atom):
                atom.SetFormalCharge(0)
                atom.SetImplicitHCount(hcount - 1)
        elif chg == -1: # -ve charge
            hcount = atom.GetImplicitHCount()
            if NoPositivelyChargedNbr(atom): # avoid nitro, etc.
                atom.SetFormalCharge(0)
                atom.SetImplicitHCount(hcount + 1)

Version 2
Iterating over all of the atoms of all of the molecules in Python takes some time due to the memory allocation for each atom on the Python side. We can avoid this by using a SMARTS pattern to find just those atoms that have a +1/-1 charge; now the Python code will only have to iterate over the matched atoms, usually zero or a small number.

With this change, only 5.3s is spent in the neutralize function.
pat_v2 = ob.OBSmartsPattern()
pat_v2.Init("[+1,-1]")

def neutralize_v2(mol):
    """Neutralize charges of +1 or -1"""
    pat_v2.Match(mol)
    matches = pat_v2.GetMapList()
    for match in matches:
        atom = mol.GetAtom(match[0])
        chg = atom.GetFormalCharge()
        if chg == 1: # +1 charge
            hcount = atom.GetImplicitHCount()
            if hcount >= 1 and NoNegativelyChargedNbr(atom):
                atom.SetFormalCharge(0)
                atom.SetImplicitHCount(hcount - 1)
        else: # -1 charge
            hcount = atom.GetImplicitHCount()
            if NoPositivelyChargedNbr(atom):
                atom.SetFormalCharge(0)
                atom.SetImplicitHCount(hcount + 1)

Version 3
Now that we're using a SMARTS pattern, we might as well jam more of the logic into it. The first is the hydrogen check for +1 charged atoms; the second is the test to ensure that there isn't a neighboring atom with a negative charge (for +1 atoms) or positive charge (for -1 atoms). Unfortunately SMARTS does not provide a direct test for an atom bearing positive or negative charge, so I just end up listing a bunch of charges.

With these changes, the helper functions for checking charges on neighboring atoms are no longer needed, and in addition the 'chg' branches in the code can be unified. As a result, Version 3 is considerably shorter than earlier versions and faster as well, taking 3.5s.
pat_v3 = ob.OBSmartsPattern()
pat_v3.Init("[+1!h0!$([*]~[-1,-2,-3,-4]),-1!$([*]~[+1,+2,+3,+4])]")

def neutralize_v3(mol):
    """Neutralize charges of +1 or -1"""
    pat_v3.Match(mol)
    for match in pat_v3.GetMapList():
        atom = mol.GetAtom(match[0])
        chg = atom.GetFormalCharge()
        hcount = atom.GetImplicitHCount()
        atom.SetFormalCharge(0)
        atom.SetImplicitHCount(hcount - chg)

Version 4
I was going to leave it there, but on checking the OBSmartsPattern API I noticed a method that enables re-use of the result vector (rather than returning a new one each time). This is useful, because it means that it's possible to have a single memory allocation for the vector up-front, and by re-using it, avoid memory allocations for all subsequent molecules (strictly speaking, there may be a handful of reallocations to make it bigger if necessary, but you're talking ~10 memory allocations versus 125K). I also realised that it's probably a good idea to check the Boolean returned by the Match method.

With these changes, Version 4 spends 1.5s in neutralise, a considerable improvement compared to the original.
matchvector_v4 = ob.vectorvInt()

def neutralize_v4(mol):
    """Neutralize charges of +1 or -1"""
    match_found = pat_v3.Match(mol, matchvector_v4)
    if match_found:
        for match in matchvector_v4:
            atom = mol.GetAtom(match[0])
            chg = atom.GetFormalCharge()
            hcount = atom.GetImplicitHCount()
            atom.SetFormalCharge(0)
            atom.SetImplicitHCount(hcount - chg)

Comparison to other algorithms
I wrote this code as a baseline implementation of neutralisation with which to think about the problem, identify corner cases, and consider existing implementations. However, it seems to be good enough as is to be ready for use. Let's look at the alternatives...

Both Francis Atkinson's Standardiser and Matt Swain's MolVS appear to be aimed at preparing a standard form for registering, and think about charge balancing and so forth. This is not the goal here, and this method is not suitable for that purpose. The form of neutralisation described above could be considered a "per-atom" neutralisation, rather than "whole-molecule"; I want all carboxylates present to be turned into COOH regardless of what else is going on in the molecule; in fact, by the time I run the neutralisation I have already chopped off all but the single largest component, and so it's likely to be unbalanced. I should look into those methods further though, and try to understand the rationale...or maybe just ask the authors at the next Cambridge Cheminformatics Network Meeting. :-) I did compare my results to MolVS and apart from charge balancing differences (e.g. for "C[S+](C)CCCC(=O)[O-]"), I note that MolVS handles [N+]([O-])[O-] better (e.g. "c1cc(cc(c1)F)NC(=O)C(=[N+]([O-])[O-])C#N" derived from CHEMBL1651804); I leave it alone because it appears to be a charge-separated form, but MolVS protonates one of the oxygens.

In terms of its intended goal, the code from Hans de Winter included in the RDKit Cookbook is the most similar. One difference is that that code sticks to a specific list of ionizable groups (and thus misses at least some, e.g. hydroxamic acids) whereas the code above is completely general. Comparing the results, I note that the HdW code incorrectly affects some charge-separated forms (e.g. "CC[S+](C)[O-]" to "CC[S+](O)C", "[O-][O+]=O" (ozone) to "O[O+]=O") but some tweaking of the SMARTS patterns will fix this, and there are some oddities (e.g. "[O-][Cl]=O" converted to "O[Cl+]O").

In short, for my purposes, I find that the simple code above does what I need. If I need something more sophisticated down the line, I'll look more closely at what problems the Standardiser and MolVS codes are attempting to solve.

Using this algorithm
If you are interested in applying this algorithm to your own data, you can find the code in a more convenient form at http://github.com/baoilleach/nocharge. If applying it to millions of molecules, you may wish to combine it with my multiprocessing recipe to speed things up N times (where N is the number of cores you have).
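For example, a minimal driver sketch (the output filename is a placeholder, and neutralize_v4 plus its SMARTS pattern and match vector are assumed to be defined as above) using Pybel to read and write SMILES:
from openbabel import pybel

if __name__ == "__main__":
    with open("neutralized.smi", "w") as out:
        for pybelmol in pybel.readfile("smi", "charged.sorted.smi"):
            neutralize_v4(pybelmol.OBMol)   # modifies the underlying OBMol in place
            out.write(pybelmol.write("smi"))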

[Update 02/01/2020]....or...you can wait until it arrives at an Open Babel near you. I've just added a "--neutralize" operation to my development version. When called from Python (see neutralize_v5 at the Github link above) this takes ~0.5s. Note that the C++ code is the same as in the Python v1.

[Update 06/01/2020] Changed H0 to h0 in the SMARTS pattern (doh!) to avoid an error when handling, for example, deuterated N+.

Monday, 25 November 2019

Comparing chemical spaces

In a recent blog post, Pat Walters describes using visualisation of chemical space to answer questions about the overlap and complementarity of different chemical datasets. His introductory paragraph gives two examples of where this is useful: considering the purchase of an additional set of screening compounds, and secondly comparing training and test sets.

Now usually I'd be the first to reach for a visualisation technique to answer a question, but here I'm going to argue that at least for the first example, where the within-set diversity is high, this might not be the best approach. I would be worried that the layout of molecules in the visualisation would largely be determined by low-similarity values, and that appearing close together in the depiction would not be meaningful.

The approach I used recently to tackle this problem involved clustering with ChemFP's Taylor-Butina (TB) implementation. I used the provided TB script to calculate the number of clusters for each of the two datasets on its own. Then I concatenated the fingerprint files and using the same settings, I generated clusters for the two datasets together. Finally I processed the output of the combined clustering to work out how many clusters only contained dataset A, only dataset B or contained a mixture.
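The last step is just bookkeeping. As a sketch of what it involves (assuming the combined clustering output has already been parsed into a list of clusters, each a list of identifiers, and that the identifiers were prefixed with their dataset name before concatenating the fingerprint files):
def classify_clusters(clusters):
    # 'clusters' is a list of lists of identifiers such as "A:CHEMBL25";
    # count how many clusters contain only dataset A, only dataset B, or both
    counts = {"A only": 0, "B only": 0, "mixed": 0}
    for cluster in clusters:
        datasets = set(identifier.split(":", 1)[0] for identifier in cluster)
        if datasets == set(["A"]):
            counts["A only"] += 1
        elif datasets == set(["B"]):
            counts["B only"] += 1
        else:
            counts["mixed"] += 1
    return counts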

The result is conceptually a Venn diagram of A and B, describing (a) the number of entries in dataset B that are within the similarity threshold of entries in dataset A (and whose purchase is potentially a waste of money) and (b) the number that are outside this threshold. For the latter, you also know how many clusters they form, which is important as you are looking for a diverse set that is distinct from dataset A.

And I'll finish with a shout-out to Andrew Dalke, whose ChemFP made this analysis possible.

Notes:
1. It's not strictly necessary to cluster each dataset on its own, but it's a useful sanity check. The number of clusters found for each dataset on its own was roughly the same as those containing that dataset in the combined clustering.
2. I can't recall the similarity threshold I used, but likely something in the 0.30-0.40 ECFP4 region.
3. I considered true singletons to be clusters of their own in this analysis, and ignored false singletons. After the fact I realised that duplicates by fingerprint (or better still perhaps by InChI) should probably be eliminated before doing the TB clustering, as their presence may convert a false singleton into a true cluster of size two. It's kind of arbitrary, but duplicating a dataset should not change the results of a clustering algorithm.

Thursday, 24 October 2019

Open Babel 3.0 released

As announced by Geoff on the mailing list, Open Babel 3.0 is now available for download:

I'm pleased to announce the release of Open Babel 3.0.0 [finally]:

This release represents a major update and is strongly recommended for all users.

It also removes deprecated components and breaks the API in a few places. For information on migrating from the previous version, please see:
https://open-babel.readthedocs.io/en/latest/UseTheLibrary/migration.html#migrating-to-3-0

We intend to move to semi-annual releases in Spring and Fall, with bug fix releases as needed.

A sample of major new features:
  • Code for handling implicit hydrogens and kekulization has been entirely replaced. As well as being accurate, the new approach is much faster.
  • Speed of reading and writing SMILES has been improved by more than 50-fold.
  • Removal of the old 'babel' binary in favor of the newer 'obabel' command-line tool.
  • New improved fragment-based 3D coordinate generation code as part of Google Summer of code 2018/2019. Significantly faster and more accurate: https://doi.org/10.1186/s13321-019-0372-5
  • (Please cite J. Cheminf. (2019) v11, article 49 if you use the new 3D coordinate generation.)
  • New API for handling reactions stored as molecules (e.g. Reaction InChI, etc.)
  • New API for copying part of an OBMol as a substructure
  • Support for Maestro file format, contributed by Patrick Lorton of Schrodinger

There are an incredible number of improvements, minor features and many bug fixes.

For a full list of changes and to download source packages:
https://open-babel.readthedocs.io/en/latest/ReleaseNotes/ob300.html
https://github.com/openbabel/openbabel/releases

Thanks to a cast of many for this release, particularly including Noel O'Boyle:
aandi, adalke (Andrew Dalke), adamjstewart (Adam J. Stewart), afonari (Alexandr Fonari), artoria2e5 (Mingye Wang), baoilleach (Noel O’Boyle), barrymoo (Barry Moore), bbucior (Ben Bucior), boryszef (Borys Szefczyk), camannguyen (An Nguyen), cmanion (Charles A. Manion), cowsandmilk (David Hall), cstein (Casper Steinmann), derekharmon (Derek Harmon), djhogan (Daniel Hogan), dkoes (David Koes), e-kwsm (Eisuke Kawashima), eloyfelix (Eloy Felix), fredrikw (Fredrik Wallner), ghutchis (Geoff Hutchison), hille721 (Christoph Hille), hseara (Hector Martinez-Seara), jasonychuang (Jason Huang), jeffjanes (Jeff Janes), johnmay (John Mayfield), katrinleinweber (Katrin Leinweber), keipertk (Kristopher Keipert), kyle-roberts-arzeda, langner (Karol M. Langner), lorton (Pat Lorton), mcs07 (Matt Swain), merkys (Andrius Merkys), mkrykunov, mmghahremanpour (Mohammad Ghahremanpour), mwojcikowski (Maciej Wójcikowski), n-yoshikawa (Naruki Yoshikawa), nakatamaho (Nakata Maho), nsoranzo (Nicola Soranzo), oititov (Titov Oleg), orex (Kirill Okhotnikov), pbecherer (Paul Becherer), peawagon (Jen), philthiel (Philipp Thiel), psavery (Patrick Avery), rmeli (Rocco Meli), serval2412 (Julien Nabet), sunoru, susilehtola (Susi Lehtola), tgaudin (Théophile Gaudin), theavey (Thomas Heavey), timvdm (Tim Vandermeersch), torcolvin (Tor Colvin), wojdyr (Marcin Wojdyr), xomachine (Dmitriy Fomichev), yishutu (Yi-Shu Tu)
If you haven't already been using the development version, now's the time.

Tuesday, 17 September 2019

Least Publishable Unit #3 - Which similarity metric is best?

Following on from #1 and #2, here's some work I began (maybe twice) but never completed.

Back in 2016 I published a paper entitled "Which fingerprint is best?" - actually, no, someone talked me down into the title "Comparing structural fingerprints using a literature-based similarity benchmark", a title which accurately describes what we did, but not why we did it. Anyway, it's work I'm very proud of - for example, it shows there is a detectable performance difference between 16384 and 4096 bit ECFP fingerprints, and that the fingerprints most appropriate for finding very close analogs are different than for those more distant.

The obvious follow-up to this work would be to compare similarity metrics, for some particular fingerprint, e.g. the 16K ECFP4 fingerprint. There's a great paper by Todeschini et al. that compares 51 similarity metrics and has a nice table where the equations are all given in one place. So I took these as the starting point for my own work.

After coding them all up and running them against the benchmark, I found that a number of the metrics gave identical rank orders across the entire dataset. That part wasn't surprising - the paper itself notes that several metrics are correlated, and the Riniker and Landrum paper has an appendix with a worked proof that shows that Tanimoto and Dice are correlated. What was surprising is that the ones I was finding as correlated were not necessarily the same as those in the paper (there was some overlap, but...).
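As an aside, the Tanimoto/Dice case is easy to see: with a and b bits set in the two fingerprints and c in common, Dice = 2c/(a+b) = 2T/(1+T), a monotonically increasing function of the Tanimoto value T = c/(a+b-c), so the two must rank any set of pairs identically. A toy check with made-up bit counts:
def tanimoto(a, b, c):
    return c / float(a + b - c)

def dice(a, b, c):
    return 2 * c / float(a + b)

# made-up (bits in A, bits in B, bits in common) triples
counts = [(512, 480, 200), (512, 480, 300), (400, 450, 150), (300, 300, 299)]
assert sorted(counts, key=lambda x: tanimoto(*x)) == sorted(counts, key=lambda x: dice(*x))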

Regarding the main reason for doing the paper, Tanimoto surprisingly (or not) did turn out to be one of the best (if not the best, I can't remember exactly). Perhaps more interesting were the results I got from looking at the effect of changing the weighting of the Tversky similarity; I can't remember the details but the best results were not where I expected them to be, and I never got to the bottom of why.