Tuesday, 24 April 2018

Running CUDA samples with Visual Studio 2017

I've been installing the CUDA drivers on a Windows 10 box with Visual Studio 2017, and trying to get the CUDA samples to compile. Although solution files are provided for VS2017 (among other VSs), you will get something similar to the following error when you attempt to compile:
error MSB8036: The Windows SDK version 10.0.15063.0 was not found. Install the required version of Windows SDK or change the SDK version in the project property pages or by right-clicking the solution and selecting "Retarget solution".

Right-clicking on the solution and retargeting gets you a bit further:
fatal error C1189: #error:  -- unsupported Microsoft Visual Studio version! Only the versions 2012, 2013, 2015 and 2017 are supported!
...which is funny, because I am using VS2017. If you dig into it, it's the specific version that's the problem, and there doesn't seem to be an easy fix.

However, a nice feature (finally!) of VS2017 is that you can optionally install other compiler toolchains. If you rerun your VS2017 Installer, and find the Modify option (under More), you will see a whole bunch of extra features you can install under "Individual components". The one of interest here is "VC++ 2015.3 v140 toolset for desktop". Once installed, you can instead open the Visual Studio 2015 solutions, and the good news is that these successfully compile.

Saturday, 14 April 2018

Generating multiple SMILES

While sometimes presented as a negative, the ability to generate multiple SMILES strings for the same molecule can also be a positive, particularly when you want to avoid bias (e.g. machine learning from SMILES - see here and here) or check that an algorithm is atom-order invariant.

Here are two different ways to generate multiple SMILES strings for the same molecule using Open Babel (without introducing dot disconnections). As an example, let's consider my favourite molecule: c1ccccc1C(=O)Cl.

The first approach is to use canonical SMILES...except that the canonical labels are generated randomly. You can do this directly at the commandline (see "obabel -Hsmi" for more info):
>obabel -:c1ccccc1C(=O)Cl -osmi -xC
O=C(c1ccccc1)Cl

Each time you do it, a different random SMILES string will be generated [1], up to a total of 16 variants (in this case):
C(=O)(Cl)c1ccccc1
C(=O)(c1ccccc1)Cl
ClC(=O)c1ccccc1
O=C(Cl)c1ccccc1
O=C(c1ccccc1)Cl
c1(C(=O)Cl)ccccc1
c1(ccccc1)C(=O)Cl
c1c(C(=O)Cl)cccc1
c1c(cccc1)C(=O)Cl
c1cc(C(=O)Cl)ccc1
c1cc(ccc1)C(=O)Cl
c1ccc(C(=O)Cl)cc1
c1ccc(cc1)C(=O)Cl
c1cccc(C(=O)Cl)c1
c1cccc(c1)C(=O)Cl
c1ccccc1C(=O)Cl

We can generate even more variants by specifying the output order directly - this overrides some decisions that are usually left to the SMILES writer and allows us, for example, to force single bonds to be followed before double bonds:
>obabel -:c1ccccc1C(=O)Cl -osmi -xo 1-2-3-4-5-6-7-9-8
c1ccccc1C(Cl)=O

Using this approach, 32 variants can be generated:
C(=O)(Cl)c1ccccc1
C(=O)(c1ccccc1)Cl
C(Cl)(=O)c1ccccc1
C(Cl)(c1ccccc1)=O
C(c1ccccc1)(=O)Cl
C(c1ccccc1)(Cl)=O
ClC(=O)c1ccccc1
ClC(c1ccccc1)=O
O=C(Cl)c1ccccc1
O=C(c1ccccc1)Cl
c1(C(=O)Cl)ccccc1
c1(C(Cl)=O)ccccc1
c1(ccccc1)C(=O)Cl
c1(ccccc1)C(Cl)=O
c1c(C(=O)Cl)cccc1
c1c(C(Cl)=O)cccc1
c1c(cccc1)C(=O)Cl
c1c(cccc1)C(Cl)=O
c1cc(C(=O)Cl)ccc1
c1cc(C(Cl)=O)ccc1
c1cc(ccc1)C(=O)Cl
c1cc(ccc1)C(Cl)=O
c1ccc(C(=O)Cl)cc1
c1ccc(C(Cl)=O)cc1
c1ccc(cc1)C(=O)Cl
c1ccc(cc1)C(Cl)=O
c1cccc(C(=O)Cl)c1
c1cccc(C(Cl)=O)c1
c1cccc(c1)C(=O)Cl
c1cccc(c1)C(Cl)=O
c1ccccc1C(=O)Cl
c1ccccc1C(Cl)=O

In summary, these approaches allow you to generate all possible SMILES strings consistent with a depth-first ordering of atoms [2], starting from different points and choosing different routes at each branch point. For machine learning, I'd imagine that the first approach would be preferred as the second approach will generate SMILES strings that will contain substrings that would never be observed normally (in Open Babel SMILES).

Python code

import random
random.seed(1)
import pybel

def randomlabels(mol, N):
    ans = set()
    for i in range(N):
        ans.add(mol.write("smi", opt={"C":True}).rstrip())
    return sorted(list(ans))

def randomorder(mol, N):
    ans = set()
    numatoms = mol.OBMol.NumAtoms()
    for i in range(N):
        idxs = list(range(1, numatoms+1))
        random.shuffle(idxs)
        optval = "-".join(str(x) for x in idxs)
        ans.add(mol.write("smi", opt={"o": optval}).rstrip())
    return sorted(list(ans))

if __name__ == "__main__":
    mol = pybel.readstring("smi", "c1ccccc1C(=O)Cl")

    print("Random canonical labels")
    randomsmis = randomlabels(mol, 500)
    print(len(randomsmis))
    for smi in randomsmis:
        print(smi)
    print()
    print("Random output order")
    randomsmis = randomorder(mol, 500)
    print(len(randomsmis))
    for smi in randomsmis:
        print(smi)
    print()

Notes:
1. An alternative (but slower) way to generate these same SMILES would be to shuffle the atoms in the OBMol and then write it out as a SMILES string.
2. If dot disconnections are tolerated, then see Andrew Dalke's approach.