Thursday 24 May 2018

Reader's challenge: beat my SMILES genetic algorithm

I'm a big fan of genetic algorithms. I used one back in the day for inverse design of molecular wires. They are deceptively simple, so much so that when you see one in action, churning out molecules that seem to magically improve generation on generation, it's a real eye-opener.

A key part of a genetic algorithm is the application of the crossover and mutation operators. The idea is to take a population of existing molecules, somehow combine pairs (crossover) while changing the structure slightly (mutation). When dealing with chemical structures, you can imagine operators that work directly on the chemical graph, or those that work on a proxy such as a SMILES string. Applying a genetic algorithm to molecules isn't a new idea (there was a patent back in 1993 from Dave Weininger on this topic) but that doesn't mean it's not useful still. And in contrast to generative models for DNNs, which have recently taken the limelight, using GAs is fast, doesn't require GPUs, and is easy to understand and improve. Are the results worse then using DNNs? I note that it is so claimed.

If the idea of using a GA sounds interesting, you may find the following code useful as a starting point. I've tried to make it as simple as possible while still doing something useful: given N random molecules from ChEMBL, chosen to be around the size of abilify but with ECFP4 of less than 0.2 to abilify, can we optimise to a structure with ECFP4 close to abilify? In my implementation, the mutation operator is very basic - it simply swaps adjacent characters in the SMILES string.

The challenge for the reader, if you choose to accept it, is to change/improve this genetic algorithm to get better performance. If you succeed, post the code publicly and leave a comment below. My recommended place to start would be to write a more sophisticated mutation operator (e.g. one that can introduce and break rings, and introduce new genetic material).

  1. The SMILES strings generated by the GA can look a bit iffy. However, remember that the goal of the GA is not to create nice SMILES strings, but to optimise a chemical structure - how it gets there is somewhat irrelevant. To tidy up the SMILES for consumption by other programs, just run them through Open Babel.
  2. Requires the development version of Open Babel for ECFP support
  3. The code illustrates again the use of a valence check. Note that the valence check differs depending on the use case. Here I only allow valences present in abilify, whereas in the previous blog post I was trying to quality control a chemical database.
  4. The input database should consist of "anti-canonical" SMILES, e.g. generated as described in a previous blog post. Originally I did this on-the-fly but there's currently no way to control the random seed used in Open Babel (and I like reproducibility).
  5. The code above is not slow, but you could make it go faster if you used multitple processors to create the children.
  6. I realise it's something of a cop-out to use N SMILES from ChEMBL to seed the procedure. A prefect genetic algorithm would be able to start with a bunch of methane, water and ammonia and evolve to the desired optimum. The one I've implemented just manipulates the original genetic material, and so a cop-out is required.

Tuesday 8 May 2018

When all you want is a ring: Part II

The previous post used a fairly strict definition of a ring system; a set of atoms, all of which are in a ring, and which are connected by bonds, all of which are in a ring. But this ignores any attached atoms, and a case can be made that the removal of these attached atoms completely alters the identity of a ring system.

As a first step towards considering these atoms, a value of 2 can be passed to CopySubstructure as the fourth parameter. This sprouts dummy atoms to replace missing bonds, and so all para-substituted phenyl rings (for example) become *c1ccc(*)cc1.

To go beyond this and actually recover the identity of the attached atoms, well, that's a little bit trickier. To begin with, we need to think about which atoms we want to copy. Depending on your application, you might want to focus on atoms that influence a particular aromaticity model, or have tautomeric potential. Here I'm going to include any atom that's doubly bonded, and any non-C atoms that are singly-bonded.

The most obvious way to do this is to add these atoms to the atomsToCopy variable before doing the copy. Unfortunately, that approach quickly runs into difficulties as the atom might be attached to two rings (e.g. C1CC1OC2CCCC2), or worse still, the atom might itself be in a ring (e.g. C1CC1=C2CCCC2). In fact, the only way to do this "straightforward approach" is to first count the ring systems and label the atoms belonging to each (e.g. with a flood-fill algorithm), and then one-at-a-time copy each ring system and the attached atoms. It works, but elegant it ain't.

So let's do it a different way, a way that turns out to be quite a bit simpler and handles all of the ring systems in a single copy. Remember above when I mentioned para-substituted phenyl rings becoming *c1ccc(*)cc1? No? Well, I did. Anyway, let's suppose that we know the original atom to which each dummy atom corresponds; we could then change the dummy atom to have the same atomic number and charge as the original. Well, what d'yaknow - the fifth parameter to CopySubstructure can be used to find exactly this correspondence.

Oh yeah, there's one catch. Any stereo involving the attached atoms will be missing (I didn't implement this in CopyStructure). If this is important, then you'll need to use the first approach I described above. Anyway, here's the code, which should be used in combination with the code samples in the original post. For the two examples listed above, C1CC1OC2CCCC2 and C1CC1=C2CCCC2, it will produce C1CC1O.C1(CCCC1)O and C1CC1=C.C1(=C)CCCC1.
atomorder = ob.vectorUnsignedInt()

def CopyRingSystem_3(orig, copy):
    for atom in ob.OBMolAtomIter(orig):
        if atom.IsInRing():
    for bond in ob.OBMolBondIter(orig):
        if not bond.IsInRing():
    ok = orig.CopySubstructure(copy, atomsToCopy, bondsToExclude, 2, atomorder)
    assert ok

    mark_for_deletion = []
    for atom in ob.OBMolAtomIter(copy):
        if atom.GetAtomicNum() != 0:
        bond = next(ob.OBAtomBondIter(atom))
        bondorder = bond.GetBondOrder()
        nbr = bond.GetNbrAtom(atom)
        origatom = orig.GetAtom(atomorder[atom.GetIdx()-1])
        if bondorder == 2 or origatom.GetAtomicNum() != 6:
            # Turn the asterisk into the 'original atom'
            valence = origatom.BOSum() + origatom.GetImplicitHCount()
            atom.SetImplicitHCount(valence - bondorder)
            nbr.SetImplicitHCount(nbr.GetImplicitHCount() + bondorder)
    for atom in reversed(mark_for_deletion):

    return ok

Saturday 5 May 2018

When all you want is a ring

I've just added a method called CopySubstructure() to Open Babel. It's something I've missed for a while, the ability to copy part of a molecule into a new one. The basic usage is where the user gives a list of atoms to copy. When you copy the substructure, all bonds between atoms in the substructure are retained (and only those). By default, hydrogen counts are adjusted to account for missing bonds.

As an example of use, if you copy all atoms that are in rings, you will find all of the ring systems:
import pybel
ob = pybel.ob

atomsToCopy = ob.OBBitVec()
bondsToExclude = ob.OBBitVec()

def CopyRingSystem(orig, copy):
    for atom in ob.OBMolAtomIter(orig):
        if atom.IsInRing():
    return orig.CopySubstructure(copy, atomsToCopy)
So Clc1ccccc1OC2CCC2 will become c1ccccc1.C2CCC2. If you do this for all of ChEMBL, split on the dots, and collate using the canonical SMILES, you can find the frequencies of all ring systems. Well, almost...

Even for this 'simple' use case, there are some subtleties that may trap the unwary when copying substructures of a molecule. For example, aromaticity is copied as present in the original molecule, and this may or may not be what one wants - it's up to the user to decide, and to mark aromaticity as unperceived if they wish. In this context, it's worth bearing in mind that the Daylight aromaticity model includes contributions from some exocyclic double bonds (and these won't be present with the code above).

Another quirk is that while it may seem that the code above creates SMILES strings for isolated ring systems, that's not quite true. If two ring systems are connected by a bond, then that bond will be retained even where it itself is not in a ring (because all bonds between atoms in the substructure are retained - see the very first paragraph above). For example, c1ccccc1C2CCC2 will stay c1ccccc1C2CCC2.

Which brings us to the second argument to the method, a list of bonds to exclude. To get the behaviour we might expect, all we need to do is add all bonds that are not in a ring to this list of bonds to exclude:
...continued from above
def CopyRingSystem_2(orig, copy):
    for atom in ob.OBMolAtomIter(orig):
        if atom.IsInRing():
    for bond in ob.OBMolBondIter(orig):
        if not bond.IsInRing():
    return orig.CopySubstructure(copy, atomsToCopy, bondsToExclude)

Here is a main() that could be used with the code above:
if __name__ == "__main__":
    copied = ob.OBMol()
    with open("outputC.smi", "w") as out:
        for mol in pybel.readfile("smi", r"C:\Tools\LargeData\sortedbylength.smi"):
            ok = CopyRingSystem_2(mol.OBMol, copied)
            assert ok
            if copied.NumAtoms() > 0:
                smi = pybel.Molecule(copied).write("smi").rstrip()