Thursday, 24 May 2018

Reader's challenge: beat my SMILES genetic algorithm

I'm a big fan of genetic algorithms. I used one back in the day for inverse design of molecular wires. They are deceptively simple, so much so that when you see one in action, churning out molecules that seem to magically improve generation on generation, it's a real eye-opener.

A key part of a genetic algorithm is the application of the crossover and mutation operators. The idea is to take a population of existing molecules, somehow combine pairs (crossover) while changing the structure slightly (mutation). When dealing with chemical structures, you can imagine operators that work directly on the chemical graph, or those that work on a proxy such as a SMILES string. Applying a genetic algorithm to molecules isn't a new idea (there was a patent back in 1993 from Dave Weininger on this topic) but that doesn't mean it's not useful still. And in contrast to generative models for DNNs, which have recently taken the limelight, using GAs is fast, doesn't require GPUs, and is easy to understand and improve. Are the results worse then using DNNs? I note that it is so claimed.

If the idea of using a GA sounds interesting, you may find the following code useful as a starting point. I've tried to make it as simple as possible while still doing something useful: given N random molecules from ChEMBL, chosen to be around the size of abilify but with ECFP4 of less than 0.2 to abilify, can we optimise to a structure with ECFP4 close to abilify? In my implementation, the mutation operator is very basic - it simply swaps adjacent characters in the SMILES string.

The challenge for the reader, if you choose to accept it, is to change/improve this genetic algorithm to get better performance. If you succeed, post the code publicly and leave a comment below. My recommended place to start would be to write a more sophisticated mutation operator (e.g. one that can introduce and break rings, and introduce new genetic material).

import random
random.seed(1)
from collections import defaultdict
import pybel
ob = pybel.ob
ob.obErrorLog.StopLogging()
def CreateValencyTable(mol):
common_valencies = defaultdict(lambda:defaultdict(set))
for atom in ob.OBMolAtomIter(mol.OBMol):
elem = atom.GetAtomicNum()
chg = atom.GetFormalCharge()
totalbonds = atom.BOSum() + atom.GetImplicitHCount()
common_valencies[elem][chg].add(totalbonds)
# Convert from defaultdict to dict
ans = {}
for x, y in common_valencies.items():
ans[x] = dict(y)
return ans
abilifysmi = "Clc4cccc(N3CCN(CCCCOc2ccc1c(NC(=O)CC1)c2)CC3)c4Cl"
abilify = pybel.readstring("smi", abilifysmi)
common_valencies = CreateValencyTable(abilify)
print("The allowed elements, charge states and valencies are")
print(common_valencies)
def HasCommonValence(mol):
for atom in ob.OBMolAtomIter(mol):
elem = atom.GetAtomicNum()
if elem not in common_valencies:
return False # unusual elem
chg = atom.GetFormalCharge()
data = common_valencies[elem]
if chg not in data:
return False # unusual charge state
totalbonds = atom.BOSum() + atom.GetImplicitHCount()
if totalbonds not in data[chg]:
return False # unusual valence
return True
def create_mutants(A, B):
# Let's randomly choose a cross-over point in both A and B
# and generate four possible combinations
c1 = random.randint(0, len(A))
c2 = random.randint(0, len(B))
startA, endA = A[:c1], A[c1:]
startB, endB = B[:c2], B[c2:]
children = [
startA+endB, startB+endA, # somewhat sensible
endA+startB, endB+startA, # less sensible
]
# Let's mutate a few characters by swapping nbors randomly
mutant_children = []
for child in children:
mutant = ""
i = 0
N = len(child)
while i < N:
if i+1 < N and random.random() > 0.66: # 1 in 3 chance
mutant += child[i+1]
mutant += child[i]
i += 1 # extra increment
else:
mutant += child[i]
i += 1
mutant_children.append(mutant)
random.shuffle(mutant_children) # don't favour any of them
return mutant_children
def get_mutant(smiA, smiB):
for N in range(50): # try combining these 50 times
mutantsmis = create_mutants(smiA, smiB)
for mutantsmi in mutantsmis:
try:
mol = pybel.readstring("smi", mutantsmi)
except IOError:
continue # bad syntax
if HasCommonValence(mol.OBMol):
return mutantsmi, mol
return "", None
def CreateObjectiveFn(targetfp):
def objectivefn(smi):
return pybel.readstring("smi", smi).calcfp("ecfp4") | targetfp
return objectivefn
class GA:
def __init__(self, objectivefn, N):
self.objectivefn = objectivefn
self.N = N
def createInitialPop(self, db, targetlensmi):
# The population will always be in sorted order by decreasing
# objectivefn() (i.e. most similar first). This will simplify
# top-slicing and tournament selection.
pop = []
while len(pop) < self.N:
smi = random.choice(db)
mol = pybel.readstring("smi", smi)
# Select molecules with a similar SMILES length but with a low value of the objectivefn
if HasCommonValence(mol.OBMol) and abs(targetlensmi - len(smi)) < 10 and self.objectivefn(smi) < 0.2:
mol.OBMol.SetTitle("")
pop.append(mol.write("smi", opt={"i":True}).rstrip()) # random order, leave out stereo
self.pop = sorted(pop, key=lambda x:self.objectivefn(x), reverse=True)
def createChildren(self):
# tournament selection of 2 smis
children = []
mrange = range(self.N)
while len(children) < self.N:
chosenA = sorted(random.sample(mrange, 3))[0]
chosenB = chosenA
while chosenB == chosenA:
chosenB = sorted(random.sample(mrange, 3))[0]
# unleash the mutants
mutantsmi, mol = get_mutant(self.pop[chosenA], self.pop[chosenB])
if not mol:
continue
children.append(mutantsmi)
self.children = sorted(children, key=lambda x:self.objectivefn(x), reverse=True)
def createNextGen(self):
# top-slice the existing population and the children
self.pop = sorted(self.pop[:int(self.N/2)] + self.children[:int(self.N/2)],
key=lambda x:self.objectivefn(x), reverse=True)
def report(self):
for p in self.pop[:10]:
print(p, self.objectivefn(p))
print()
if __name__ == "__main__":
targetfp = abilify.calcfp("ecfp4")
objfn = CreateObjectiveFn(targetfp)
with open(r"D:\LargeData\ChEMBL\chembl_23.smi") as inp:
allchembl = inp.readlines()
N = 100 # no. of chromosomes
ga = GA(objfn, N)
ga.createInitialPop(allchembl, len(abilifysmi))
ga.report()
iter = 0
while True:
iter += 1
print("Iter =", iter)
ga.createChildren()
ga.createNextGen()
ga.report()
view raw geneticalgo.py hosted with ❤ by GitHub


Notes:
  1. The SMILES strings generated by the GA can look a bit iffy. However, remember that the goal of the GA is not to create nice SMILES strings, but to optimise a chemical structure - how it gets there is somewhat irrelevant. To tidy up the SMILES for consumption by other programs, just run them through Open Babel.
  2. Requires the development version of Open Babel for ECFP support
  3. The code illustrates again the use of a valence check. Note that the valence check differs depending on the use case. Here I only allow valences present in abilify, whereas in the previous blog post I was trying to quality control a chemical database.
  4. The input database should consist of "anti-canonical" SMILES, e.g. generated as described in a previous blog post. Originally I did this on-the-fly but there's currently no way to control the random seed used in Open Babel (and I like reproducibility).
  5. The code above is not slow, but you could make it go faster if you used multitple processors to create the children.
  6. I realise it's something of a cop-out to use N SMILES from ChEMBL to seed the procedure. A prefect genetic algorithm would be able to start with a bunch of methane, water and ammonia and evolve to the desired optimum. The one I've implemented just manipulates the original genetic material, and so a cop-out is required.

2 comments:

  1. Thanks for this very interesting post! In your code, you make reference to a file, "D:\LargeData\ChEMBL\chembl_23.smi". Would you be able to share the contents of this file? Also, what was the highest objective function score you achieved? how many generations did it take? and what was the best molecule?

    ReplyDelete
  2. I can't share that file I'm afraid, but you can download it from ChEMBL or convert from the SDF. The idea of providing the code is so that you can do a like-for-like comparison.

    ReplyDelete