Thursday, 24 May 2018

Reader's challenge: beat my SMILES genetic algorithm

I'm a big fan of genetic algorithms. I used one back in the day for inverse design of molecular wires. They are deceptively simple, so much so that when you see one in action, churning out molecules that seem to magically improve generation on generation, it's a real eye-opener.

A key part of a genetic algorithm is the application of the crossover and mutation operators. The idea is to take a population of existing molecules, somehow combine pairs (crossover) while changing the structure slightly (mutation). When dealing with chemical structures, you can imagine operators that work directly on the chemical graph, or those that work on a proxy such as a SMILES string. Applying a genetic algorithm to molecules isn't a new idea (there was a patent back in 1993 from Dave Weininger on this topic) but that doesn't mean it's not useful still. And in contrast to generative models for DNNs, which have recently taken the limelight, using GAs is fast, doesn't require GPUs, and is easy to understand and improve. Are the results worse then using DNNs? I note that it is so claimed.

If the idea of using a GA sounds interesting, you may find the following code useful as a starting point. I've tried to make it as simple as possible while still doing something useful: given N random molecules from ChEMBL, chosen to be around the size of abilify but with ECFP4 of less than 0.2 to abilify, can we optimise to a structure with ECFP4 close to abilify? In my implementation, the mutation operator is very basic - it simply swaps adjacent characters in the SMILES string.

The challenge for the reader, if you choose to accept it, is to change/improve this genetic algorithm to get better performance. If you succeed, post the code publicly and leave a comment below. My recommended place to start would be to write a more sophisticated mutation operator (e.g. one that can introduce and break rings, and introduce new genetic material).



Notes:
  1. The SMILES strings generated by the GA can look a bit iffy. However, remember that the goal of the GA is not to create nice SMILES strings, but to optimise a chemical structure - how it gets there is somewhat irrelevant. To tidy up the SMILES for consumption by other programs, just run them through Open Babel.
  2. Requires the development version of Open Babel for ECFP support
  3. The code illustrates again the use of a valence check. Note that the valence check differs depending on the use case. Here I only allow valences present in abilify, whereas in the previous blog post I was trying to quality control a chemical database.
  4. The input database should consist of "anti-canonical" SMILES, e.g. generated as described in a previous blog post. Originally I did this on-the-fly but there's currently no way to control the random seed used in Open Babel (and I like reproducibility).
  5. The code above is not slow, but you could make it go faster if you used multitple processors to create the children.
  6. I realise it's something of a cop-out to use N SMILES from ChEMBL to seed the procedure. A prefect genetic algorithm would be able to start with a bunch of methane, water and ammonia and evolve to the desired optimum. The one I've implemented just manipulates the original genetic material, and so a cop-out is required.

2 comments:

  1. Thanks for this very interesting post! In your code, you make reference to a file, "D:\LargeData\ChEMBL\chembl_23.smi". Would you be able to share the contents of this file? Also, what was the highest objective function score you achieved? how many generations did it take? and what was the best molecule?

    ReplyDelete
  2. I can't share that file I'm afraid, but you can download it from ChEMBL or convert from the SDF. The idea of providing the code is so that you can do a like-for-like comparison.

    ReplyDelete