Tuesday, 24 April 2018

Running CUDA samples with Visual Studio 2017

I've been installing the CUDA drivers on a Windows 10 box with Visual Studio 2017, and trying to get the CUDA samples to compile. Although solution files are provided for VS2017 (among other Visual Studio versions), you will get something similar to the following error when you attempt to compile:
error MSB8036: The Windows SDK version 10.0.15063.0 was not found. Install the required version of Windows SDK or change the SDK version in the project property pages or by right-clicking the solution and selecting "Retarget solution".

Right-clicking on the solution and retargeting gets you a bit further:
fatal error C1189: #error:  -- unsupported Microsoft Visual Studio version! Only the versions 2012, 2013, 2015 and 2017 are supported!
...which is funny, because I am using VS2017. If you dig into it, the problem is the specific compiler version (the CUDA headers don't accept the newer VS2017 toolset), and there doesn't seem to be an easy fix.

However, a nice feature (finally!) of VS2017 is that you can optionally install other compiler toolchains. If you rerun your VS2017 Installer, and find the Modify option (under More), you will see a whole bunch of extra features you can install under "Individual components". The one of interest here is "VC++ 2015.3 v140 toolset for desktop". Once installed, you can instead open the Visual Studio 2015 solutions, and the good news is that these successfully compile.

Saturday, 14 April 2018

Generating multiple SMILES

While sometimes presented as a negative, the ability to generate multiple SMILES strings for the same molecule can also be a positive, particularly when you want to avoid bias (e.g. machine learning from SMILES - see here and here) or check that an algorithm is atom-order invariant.

Here are two different ways to generate multiple SMILES strings for the same molecule using Open Babel (without introducing dot disconnections). As an example, let's consider my favourite molecule: c1ccccc1C(=O)Cl.

The first approach is to use canonical SMILES...except that the canonical labels are generated randomly. You can do this directly at the command line (see "obabel -Hsmi" for more info):
>obabel -:c1ccccc1C(=O)Cl -osmi -xC
O=C(c1ccccc1)Cl

Each time you do it, a different random SMILES string will be generated [1], up to a total of 16 variants (in this case):
C(=O)(Cl)c1ccccc1
C(=O)(c1ccccc1)Cl
ClC(=O)c1ccccc1
O=C(Cl)c1ccccc1
O=C(c1ccccc1)Cl
c1(C(=O)Cl)ccccc1
c1(ccccc1)C(=O)Cl
c1c(C(=O)Cl)cccc1
c1c(cccc1)C(=O)Cl
c1cc(C(=O)Cl)ccc1
c1cc(ccc1)C(=O)Cl
c1ccc(C(=O)Cl)cc1
c1ccc(cc1)C(=O)Cl
c1cccc(C(=O)Cl)c1
c1cccc(c1)C(=O)Cl
c1ccccc1C(=O)Cl

We can generate even more variants by specifying the output order directly - this overrides some decisions that are usually left to the SMILES writer and allows us, for example, to force single bonds to be followed before double bonds:
>obabel -:c1ccccc1C(=O)Cl -osmi -xo 1-2-3-4-5-6-7-9-8
c1ccccc1C(Cl)=O

Using this approach, 32 variants can be generated:
C(=O)(Cl)c1ccccc1
C(=O)(c1ccccc1)Cl
C(Cl)(=O)c1ccccc1
C(Cl)(c1ccccc1)=O
C(c1ccccc1)(=O)Cl
C(c1ccccc1)(Cl)=O
ClC(=O)c1ccccc1
ClC(c1ccccc1)=O
O=C(Cl)c1ccccc1
O=C(c1ccccc1)Cl
c1(C(=O)Cl)ccccc1
c1(C(Cl)=O)ccccc1
c1(ccccc1)C(=O)Cl
c1(ccccc1)C(Cl)=O
c1c(C(=O)Cl)cccc1
c1c(C(Cl)=O)cccc1
c1c(cccc1)C(=O)Cl
c1c(cccc1)C(Cl)=O
c1cc(C(=O)Cl)ccc1
c1cc(C(Cl)=O)ccc1
c1cc(ccc1)C(=O)Cl
c1cc(ccc1)C(Cl)=O
c1ccc(C(=O)Cl)cc1
c1ccc(C(Cl)=O)cc1
c1ccc(cc1)C(=O)Cl
c1ccc(cc1)C(Cl)=O
c1cccc(C(=O)Cl)c1
c1cccc(C(Cl)=O)c1
c1cccc(c1)C(=O)Cl
c1cccc(c1)C(Cl)=O
c1ccccc1C(=O)Cl
c1ccccc1C(Cl)=O

In summary, these approaches allow you to generate all possible SMILES strings consistent with a depth-first ordering of atoms [2], starting from different points and choosing different routes at each branch point. For machine learning, I'd imagine that the first approach would be preferred, as the second will generate SMILES strings containing substrings that would never normally be observed (in Open Babel SMILES).

Python code

import random
random.seed(1)
import pybel

def randomlabels(mol, N):
    """Write the molecule N times as canonical SMILES where the
    canonical labels are generated randomly (the -xC output option)"""
    ans = set()
    for i in range(N):
        ans.add(mol.write("smi", opt={"C":True}).rstrip())
    return sorted(list(ans))

def randomorder(mol, N):
    """Write the molecule N times using a randomly-shuffled
    output order (the -xo output option)"""
    ans = set()
    numatoms = mol.OBMol.NumAtoms()
    for i in range(N):
        idxs = list(range(1, numatoms+1))
        random.shuffle(idxs)
        optval = "-".join(str(x) for x in idxs)
        ans.add(mol.write("smi", opt={"o": optval}).rstrip())
    return sorted(list(ans))

if __name__ == "__main__":
    mol = pybel.readstring("smi", "c1ccccc1C(=O)Cl")

    print("Random canonical labels")
    randomsmis = randomlabels(mol, 500)
    print(len(randomsmis))
    for smi in randomsmis:
        print(smi)
    print()
    print("Random output order")
    randomsmis = randomorder(mol, 500)
    print(len(randomsmis))
    for smi in randomsmis:
        print(smi)
    print()

Notes:
1. An alternative (but slower) way to generate these same SMILES would be to shuffle the atoms in the OBMol and then write it out as a SMILES string.
2. If dot disconnections are tolerated, then see Andrew Dalke's approach.

Tuesday, 27 February 2018

Calling all students: Google Summer of Code and CSA Trust Grant both happening now!

Deadlines are fast approaching for the following:

Open Chemistry Google Summer of Code: If you're a student and interested in hacking on open source chemistry software (and getting paid a bit for the privilege), then get on over to the Google Summer of Code (GSoC) page of the Open Chemistry project. A whole bunch of Open Source chemistry tools have gathered together (including Open Babel, natch) and come up with project ideas that hopefully will spark interest. If you've always wanted to get involved with Open Source but didn't know how, this is a good chance to do so.

CSA Trust Grant: I was a recipient of a CSA Trust Grant myself, and funny story, I'm now a CSA Trustee and on the Grant Committee. Applications are now invited - it's pretty straightforward. If you look at the details of previous recipients you can see the sorts of things they applied for. The success rate is pretty high, so I really do encourage you to apply. And if you don't get it this year, well, try again next year (it worked for me).

Saturday, 3 February 2018

Using those other processors

This year, I've decided it's time to make more use of those other processors that my PC has. At some point, it gets embarrassing to see my single CPU job trundling along at its max speed, while 80 or 90% of the processing power is just sitting idle.

My typical input is a SMILES file such as ChEMBL or PubChem, or a file I've created by processing these, or a set of reaction SMILES extracted from patents. In any case, the problem is the same - a large number of inputs that can be processed in parallel. Sure, you can split the file (see the 'split' command, though be careful to split on lines) but I tend to avoid that as it's a bit clunky. If on Linux, there's the magic "GNU parallel". But what I really should be doing is making use of multiple CPUs directly in my Python scripts.

And actually, that's not too hard. The multiprocessing module makes it pretty easy once you've figured it out. Here's my go-to template for these calculations. I try to keep it as slimmed down as possible, because there's some magic going on behind the scenes and I don't want to have to debug any complex problems. (Note that there's also the threading module, but CPython cannot parallelize CPU-bound calculations using threads due to its Global Interpreter Lock.)

import multiprocessing as mp
import pybel

def calculate(data):
    # 'data' is a single line of the input file (a SMILES string plus title)
    return pybel.readstring("smi", data).molwt

if __name__ == "__main__":
    POOLSIZE = 4 # the number of worker processes (typically the number of CPUs)
    CHUNKSIZE = 1000 # the number of input lines handed to a worker at a time
    pool = mp.Pool(POOLSIZE)
    with open("output.txt", "w") as out:
        with open(r"C:\Tools\LargeData\chembl_23.smi", "r") as inp:
            for result in pool.imap(calculate, inp, CHUNKSIZE):
            # for result in pool.imap_unordered(calculate, inp, CHUNKSIZE):
            # for result in map(calculate, inp): # no multiprocessing
                out.write("%f\n" % result)

Notes:
1. This blog is now Python 3.
2. By editing the commented-out code, you can choose one of three variations. During development, you can avoid multiprocessing entirely by just using a regular map (equivalent to itertools.imap in Python 2). If using multiprocessing, you can choose to have the output in the same order as the input, or just have it as it comes; the latter is presumably faster.
3. If you need to kill a multiprocessing job, CTRL+C just won't do it, as the pool simply spawns a replacement for any worker that dies (if anyone knows of a cleaner way to handle this, please let me know). You need to open the Task Manager, choose the master Python process and kill that one.
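
One possible workaround (just a sketch - I haven't tried to make it bullet-proof, and "input.smi" is a placeholder filename) is to have the workers ignore SIGINT via a pool initializer, and have the parent terminate the pool when it catches KeyboardInterrupt:

import multiprocessing as mp
import signal
import pybel

def init_worker():
    # each worker ignores Ctrl+C so that only the parent process receives it
    signal.signal(signal.SIGINT, signal.SIG_IGN)

def calculate(data):
    return pybel.readstring("smi", data).molwt

if __name__ == "__main__":
    pool = mp.Pool(4, initializer=init_worker)
    try:
        with open("output.txt", "w") as out:
            with open("input.smi") as inp:
                for result in pool.imap(calculate, inp, 1000):
                    out.write("%f\n" % result)
    except KeyboardInterrupt:
        # Ctrl+C arrived in the parent: shut the pool down rather than
        # letting it replace the workers
        pool.terminate()
        pool.join()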

Tuesday, 23 January 2018

Which came first, the atom or the bond?

Let's suppose you need to iterate over the neighbours of an atom, and you want to know both the corresponding bond order and the atomic number of the neighbour. Given that toolkits typically provide iterators over the attached bonds or the attached atoms, but not both simultaneously, how exactly should you do this?

Should you:
  • (a) iterate over the neighbouring atoms, and then request the bond joining the two atoms?

        for nbr in ob.OBAtomAtomIter(atom):
            bond = atom.GetBond(nbr)

  • or (b) iterate over the attached bonds, and ask for the atom at the other end?

        for bond in ob.OBAtomBondIter(atom):
            nbr = bond.GetNbrAtom(atom)

Obviously, either way you get your atom and bond. But which is more efficient? Clearly, the answer to this depends on the internals of the toolkit. But if you assume that each atom knows its attached atoms and bonds, then it's only the second step that determines the relative efficiency. That is:
  • (a) given two atoms find the bond that joins them, versus
  • (b) given an atom and a bond find the atom at the other end

Since the implementation of (a) will probably involve the same test that (b) is doing plus additional work, it follows that (b) must be more efficient. I never really thought about this until I was writing the kekulization code for Open Babel. It's the sort of thing that's useful to work out once and then apply in future without thinking. Sure, the speed difference may be minimal, but given that you have to choose, you might as well write it the more efficient way.
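
To make this concrete, here's a minimal sketch of approach (b) using the Python bindings (benzoyl chloride and atom 7, the carbonyl carbon, are chosen arbitrarily as the example - atom indices are 1-based):

import pybel
ob = pybel.ob

mol = pybel.readstring("smi", "c1ccccc1C(=O)Cl").OBMol
atom = mol.GetAtom(7) # the carbonyl carbon
for bond in ob.OBAtomBondIter(atom):
    nbr = bond.GetNbrAtom(atom)
    print(nbr.GetAtomicNum(), bond.GetBondOrder())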

Thursday, 18 January 2018

Implementing the Sayle tautomer hash with Open Babel

One of the consequences of last year's overhaul of handling of kekulization, aromaticity and implicit hydrogens is the ability to easily calculate structure hashes such as Roger Sayle's tautomer hash, which I wrote up on the NextMove blog a while ago.

I routinely use this hash to handle tautomers, particularly when dealing with R groups. It doesn't solve all tautomer issues (e.g. ones that involve carbon) but it can quickly bring you from having no support for tautomers at all to 95% of the way there. In fact, I've been thinking about adding this (and some of the other hashes that Roger has come up with) as cansmi output options. Anyhoo, here's an implementation in Python:
import pybel
ob = pybel.ob

def tautomerhash(smi):
    """Take a SMILES and return the Sayle tautomer hash:
    https://nextmovesoftware.com/blog/2016/06/22/fishing-for-matched-series-in-a-sea-of-structure-representations/
    """
    mol = pybel.readstring("smi", smi).OBMol
    mol.DeleteHydrogens() # just in case
    formalcharges = 0
    hcount = 0
    for atom in ob.OBMolAtomIter(mol):
        formalcharges += atom.GetFormalCharge()
        atom.SetFormalCharge(0)
        if atom.GetAtomicNum() != 6: # non-carbon
            hcount += atom.GetImplicitHCount()
        atom.SetImplicitHCount(0)
        atom.UnsetAromatic()
    for bond in ob.OBMolBondIter(mol):
        bond.SetBondOrder(1)
        bond.UnsetAromatic()
    mol.SetAromaticPerceived() # no point triggering perception
    return "%s_%d" % (pybel.Molecule(mol).write("can").rstrip(), hcount-formalcharges)

if __name__ == "__main__":
    smis = ["*c1c(c(C(=N)O)cc2nc([nH]c12)C(=O)[O-])N(=O)=O",
            "*c1c(c(C(=O)N)cc2[nH]c(nc12)C(=O)O)[N+](=O)[O-]"]
    for smi in smis:
        print(tautomerhash(smi))

The two SMILES in the example code above are those from the original blog post. Here they give the following two identical hashes (note to self: 'fix' the bracketed asterisk):
[O][C]([C]1[N][C]2[C]([N]1)[C]([*])[C]([C]([C]2)[C]([O])[N])N([O])[O])[O]_4
[O][C]([C]1[N][C]2[C]([N]1)[C]([*])[C]([C]([C]2)[C]([O])[N])N([O])[O])[O]_4
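
As a quick usage sketch (reusing the tautomerhash function and the smis list from the code above), you can bucket SMILES by this hash so that tautomers of the same parent structure end up in the same group:

from collections import defaultdict

groups = defaultdict(list)
for smi in smis: # e.g. the two SMILES from the example above
    groups[tautomerhash(smi)].append(smi)
for taut_hash, members in groups.items():
    print(len(members), taut_hash)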

Sunday, 17 December 2017

Faster toolkit, faster! Part II

A while ago I described some work I was doing to improve the overall speed of the Open Babel toolkit. Here I want to focus on a particular use-case and where I've got to.

This use case is SMILES to SMILES conversion. Now, while this particular transformation might not sound very interesting, it does encompass both SMILES reading and SMILES writing in one handy package, and both of these operations are often important when dealing with databases or datasets of chemical structures. It also exercises several areas of the toolkit such as kekulization, handling of aromaticity, and stereo perception (or not, as we'll see). Canonicalization is also relevant here, but I didn't do any work on that (and it needs some).

To begin with, some timings. To convert 100K ChEMBL molecules from smi to smi took 10m7s with OB 2.4.1. With the current development version it takes 31s. One change to the defaults is that the dev version does not reperceive the stereo. If you turn on stereo perception (-aS), it takes 1m13s. You can speed things up if you also avoid reperceiving the aromaticity (-aa) and read it as provided in the input; then the conversion only takes 19s.

[21 days later] So I was originally going to describe the results from the Visual Studio profiler for this conversion. But then I said, hey, I might as well fix that one, and that one there, and, well, you know how it goes - this part is actually quite fun, when you make a small change and see the speed go up. Anyway, the conversion that used to take 19s now takes 11.0s. If you're interested, the speedups included things like replacing std::endl with "\n", caching option values, avoiding string copies, avoiding use of stringstream, avoiding SSSR calculation, and using reserve() on vectors. It was often surprising what things appeared high on the list in the profiler. I can see a few more things that could be improved, but I'm going to leave it there for the moment.

So, in summary, this particular conversion has gone from slow to fast, with a speedup of 55x. There's always more that could be done, but it's respectable.