Tuesday, 2 October 2018
The problematic areas of SMILES syntax involve paired ring closure symbols for cycles, and parentheses for branches. These particular aspects of the syntax are difficult to reproduce correctly when generating SMILES strings using machine-learning methods, and so a certain percentage of generated SMILES tend to fail basic syntax checks. While there have been a variety of approaches aimed at improving the SMILES generation (with quite some success), it is reasonable to assume that the syntax also causes difficulties during the learning phase.
Our approach is not to use SMILES, but instead an alternative syntax that does not have these problems. Paired ring closures are replaced by a single digit, the ring size; paired parentheses are replaced by close parentheses indicating branch size. See the talk below, the preprint, or the GitHub site for more information. Feedback (positive or negative) is welcome, either here or on GitHub.
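To make the two failure modes concrete, here is a minimal sketch of the kind of basic syntax check that generated SMILES strings tend to fail: balanced parentheses and paired ring closure digits. It is an illustration only, not the check used by any particular toolkit; the function name `basic_syntax_ok` is made up for this example, and it assumes single-digit ring closures (no `%nn`) and skips over bracket atoms without validating them.

```python
def basic_syntax_ok(smiles):
    """Return True if parentheses balance and every ring closure digit is paired.

    Assumptions of this sketch: only single-digit ring closures are handled,
    and the contents of bracket atoms such as [NH4+] are skipped, not checked.
    """
    depth = 0            # current parenthesis nesting depth
    open_rings = set()   # ring closure digits seen once, still awaiting a partner
    in_bracket = False   # True while inside [...]
    for ch in smiles:
        if in_bracket:
            if ch == "]":
                in_bracket = False
            continue
        if ch == "[":
            in_bracket = True
        elif ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:        # a close parenthesis with no matching open
                return False
        elif ch.isdigit():
            open_rings ^= {ch}   # first sighting opens the ring, second closes it
    return depth == 0 and not open_rings and not in_bracket


for candidate in ["c1ccccc1", "c1ccccc", "CC(C)(C", "CC(=O)O"]:
    print(candidate, basic_syntax_ok(candidate))
```

Both checks depend on matching a symbol with a partner that may be many characters away, which is the long-range pairing that the alternative syntax is designed to remove.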
Friday, 20 July 2018
Take the case of aromaticity handling. Just this part of the toolkit on its own raises many questions. Should aromaticity be lazily perceived or require an explicit function call? What happens when the user modifies a molecule? What about copying a molecule? Or just copying a subset of the atoms to another molecule? What should happen when you add two molecules together? If you read aromaticity from a SMILES string, what if it's a different model than the toolkit uses internally? Should the SMILES writer reperceive aromaticity, or just use it as presented?
Often the easiest solution to these problems is to always do the maximum amount of work possible (i.e. throwing away perceived aromaticity information at every opportunity), which I wanted to avoid at all costs. So I went through removing bits and simplifying things, making sure that aromaticity information was copied where possible, and hoping that in the end all of the complex behaviour that I wanted to maintain would still be possible without adding back fixes or kludges. And fortunately it was. I even managed to add additional behaviour, an option to keep the aromaticity information as read from a SMILES string.
I'm a firm believer that there's no point adding features or improving things if you don't write documentation explaining why, how and when people should use them. What's the point doing this work if no-one knows how they can take advantage of it? So, I've just written up documentation on how Open Babel handles aromaticity. The following (exclusive!) excerpt describes how aromaticity information is stored by Open Babel.
Handling of aromaticity
The purpose of this section is to give an overview of how Open Babel handles aromaticity. Given that atoms can be aromatic, bonds can be aromatic, and that molecules have a flag for aromaticity perceived, it’s important to understand how these all work together.
How is aromaticity information stored?
Like many other toolkits, Open Babel stores aromaticity information separate from bond order information. This means that there isn’t a special bond order to indicate an aromatic bond. Instead, aromaticity is stored as a flag on an atom as well as a flag on a bond. You can access and set this information using the following API functions:
- OBAtom::IsAromatic(), OBAtom::SetAromatic(), OBAtom::UnsetAromatic()
- OBBond::IsAromatic(), OBBond::SetAromatic(), OBBond::UnsetAromatic()
There is a catch though, or rather a key point to note. OBMols have a flag to indicate whether aromaticity has been perceived. This is set via the following API functions:
- OBMol::SetAromaticPerceived(), OBMol::UnsetAromaticPerceived()
The value of this flag determines the behaviour of the OBAtom and OBBond IsAromatic() functions.
- If the flag is set, then IsAromatic() simply returns the corresponding value of the atom or bond flag.
- If unset, then IsAromatic() triggers aromaticity perception (from scratch), and then returns the value of the flag.
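The interplay between the molecule-level flag and the per-atom flags can be illustrated with a toy model. To be clear, this is not Open Babel code: `ToyMol`, its attributes and the `perceive` callback are all hypothetical names invented for this sketch, and the assumption that perception sets the flag afterwards is part of the toy.

```python
class ToyMol:
    """Toy stand-in for a molecule with lazy aromaticity perception."""

    def __init__(self, n_atoms, perceive):
        self.atom_flags = [False] * n_atoms   # per-atom aromatic flags
        self.aromatic_perceived = False       # molecule-level flag
        self._perceive = perceive             # callable returning fresh flags
        self.perception_runs = 0              # how often perception was triggered

    def is_aromatic(self, idx):
        if not self.aromatic_perceived:
            # Flag unset: perceive from scratch, overwriting any stored flags
            self.atom_flags = list(self._perceive(self))
            self.aromatic_perceived = True    # assumption of this toy model
            self.perception_runs += 1
        # Flag set: simply report the stored value
        return self.atom_flags[idx]


mol = ToyMol(3, perceive=lambda m: [True, True, False])
mol.atom_flags[2] = True         # e.g. aromaticity as read from a file
mol.aromatic_perceived = True    # flag set, so the stored value is trusted
print(mol.is_aromatic(2), mol.perception_runs)
mol.aromatic_perceived = False   # flag unset, so the next query reperceives
print(mol.is_aromatic(2), mol.perception_runs)
```

The practical consequence is that flags set by hand only survive for as long as the molecule is marked as perceived.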
See https://open-babel.readthedocs.io/en/latest/Aromaticity/Aromaticity.html for the nail-biting conclusion to this thrilling exposition.
Tuesday, 10 July 2018
The CIP system is a series of rules that describe how to assign a stereodescriptor (e.g. R/S, E/Z) to a stereocentre. When Bob Hanson decided to add support for CIP to Jmol, rather than simply read the rules and implement it according to his interpretation as others have done, he decided to work with three other implementations to challenge each other on disagreements and clarify the wording of the rules.
The result is described in:
Algorithmic Analysis of Cahn-Ingold-Prelog Rules of Stereochemistry: Proposals for Revised Rules and a Guide for Machine Implementation. Robert M. Hanson, John W. Mayfield, Mikko J. Vainio, Andrey Yerin, Dmitry Redkin, and Sophia
Essentially, the issue that the authors are addressing is the fact that existing implementations even in "highly respected software packages" disagree with each other (see John Mayfield's presentation). By comparing the implementations in Jmol, Centres, Balloon and ChemSketch they were able to identify cases where:
"(a) the disagreement was due to different interpretations of CIP rules among software developers, (b) there was a problem with an algorithm or its implementation in code, or (c) the CIP rules themselves were flawed."
In all cases, however, they were able to come to a consensus, which led to "the discovery of a small number of errors in the Blue Book, two minor flaws in the CIP rules, and a proposal for a new rule".
The paper walks through their discussions of each rule in turn, looking at any issues arising and clarifying any ambiguities. It also includes a validation suite (browse it here) that covers all aspects of the rules and will allow future CIP implementations to avoid the pitfalls that have beset the field in the past.
Thursday, 14 June 2018
This is one of those cases where something just feels like a mistake to me, and I'm going to do my best to articulate my concerns. I'm not familiar enough with the mechanics of DNNs to be sure, but I hope those in the field will consider the points I raise.
So, canonical SMILES. Behind the scenes, the canonicalisation procedure gives a unique label to each atom (typically 1 to N). These labels are then used to generate a canonical SMILES, typically by starting at the lowest label (but not necessarily). The canonicalisation procedure is based upon the attributes of the graph, with the result that the first atom in a canonical SMILES tends to favor particular atom types and avoid others.
For example, if you look at the atom types for the first atom in the canonical SMILES generated by RDKit for ChEMBL molecules, you will find that the second most common atom type in ChEMBL (namely, *-C(-*)=*) never appears as the first atom in a canonical SMILES string. This is by design and you'll see similar behaviour with other toolkits - SMILES strings tend to start with degree 1 atoms.
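A rough way to look at this for your own dataset is to tally what each SMILES string starts with. The sketch below is a simplification: it works on element symbols rather than full atom types such as *-C(-*)=* (which would need a toolkit), the function names are made up for the example, and the four input strings are placeholders rather than real canonical output.

```python
from collections import Counter


def first_atom(smiles):
    """Return the first atom token of a SMILES string (element level only)."""
    if smiles.startswith("["):
        return smiles[: smiles.index("]") + 1]   # bracket atom, e.g. [NH4+]
    if smiles[:2] in ("Cl", "Br"):
        return smiles[:2]                        # two-letter organic-subset atoms
    return smiles[:1]


def first_atom_distribution(smiles_list):
    """Fraction of strings that start with each atom token."""
    counts = Counter(first_atom(s) for s in smiles_list if s)
    total = sum(counts.values())
    return {token: n / total for token, n in counts.items()}


# Placeholder input; in practice this would be the canonical SMILES file
example = ["CCO", "CC(=O)O", "OCC", "Fc1ccccc1"]
print(first_atom_distribution(example))
```

Comparing this distribution with the overall frequency of each element in the dataset gives a crude measure of how skewed the starting atom is.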
So what if the distribution of atom types is different for the first atom?
Well, firstly, I predict that as a result these atom types will be over-represented in structures generated by DNNs (and others under-represented). If you train on canonical SMILES, then the probabilities for the first atom will be determined by the atom types favored as starting atoms by canonical SMILES. Consider the extreme example where fluorine always occurs as the first atom in any canonical SMILES that contains it; you should see an increased number of fluorines as the first atom in the generated molecules. Now you could argue that the probabilities for the remaining atoms will be adjusted accordingly, but I believe that there is a strong edge effect and that any correction will attenuate as the SMILES string becomes longer.
Secondly, this bias makes the DNN's job harder. Instead of a relatively even distribution of atom types at all points in the SMILES string, the distribution will depend on the distance from the starting atom. It's now trying to learn something about the properties of canonical SMILES instead of concentrating on the task at hand...
...which brings me nicely to the third point. Predictive models attempt to deduce a property value from the structure, and a SMILES string is used by DNNs as a proxy for the structure. Using a canonical SMILES string is another step removed. What about a molecule with a very similar structure but very different canonical SMILES? Surely the goal of a robust model is to handle this well. Restricting good predictive power to only those structures that are both similar and have similar canonical SMILES is to develop a model with a reduced applicability. A fun task is to measure the degree to which this fitting to canonical SMILES occurs; this is left to the reader.
The solution is simple. Use random SMILES. A single one, or multiple. The use of multiple random SMILES has already been described by Thomas Bergwinkl and subsequently by Esben Jannik Bjerrum as a 'data augmentation technique', but I see it as just avoiding the inherent bias of using canonical SMILES. But either way, I like this quote from Thomas:
The output for alternative representations of a molecule should be the same, if you understand SMILES. Using alternative representations in the test data allows to verify if the neural network understands SMILES. Spoiler: After a while the output becomes very close for the alternatives!
So why do people use canonical SMILES in the first place? I have my theory on this.
I believe it's because the generative models more quickly converge to generation of syntactically valid SMILES strings when they train on canonical SMILES. And for some reason, the percentage of syntactically valid SMILES strings generated by the model has become a figure of merit.
But this makes no sense - who cares what this percentage is? Sure, we can all overfit to canonical SMILES and get high percentages quickly. But how is this a good thing? You know that feeling you get when you see a QSAR model with very high R² on training data - that's how I feel when I see a high value for this percentage. If it's actually doing what it's supposed to be doing (i.e. learning the underlying structure rather than the training set of canonical SMILES), then the percentage should really be lower. What do I care if the percentage of syntactically valid SMILES is 1%? So long as that 1% solves my problem, it's irrelevant - these structures are spewed out of these models thousands per second (I presume, but even so).
Please let him stop talking now
Okay, okay - I'm done. What do you think?
Wednesday, 6 June 2018
Here are my notes on the scientific program, which I have just extracted from Twitter. A big thank-you to ThreadReader, without which I couldn't have done the extraction. Naturally, all errors are my own - I may have misunderstood something or lost the thread of the talk. Also, just to note, I didn't take notes on all of the talks.
If you are interested in a particular talk, and paste the provided link into a browser, you can see if anyone on Twitter added a comment on the tweet.
If you want to follow the entire Twitter conversation on the meeting, which used the hashtag #11thICCS, go to this link. Again, if a particular Tweet has replies, you need to click on it to see them.
As well as taking notes, I also presented a poster entitled "Can we agree on the structure represented by a SMILES string? A benchmark dataset". For more info, follow the link over to the NextMove blog.
Note to self:
Next time include the hashtag with every tweet. Otherwise it's hard to extract, and hard for attendees to follow automatically.
Image courtesy of Jason Cole.
Thursday, 24 May 2018
A key part of a genetic algorithm is the application of the crossover and mutation operators. The idea is to take a population of existing molecules and somehow combine pairs (crossover) while changing the structure slightly (mutation). When dealing with chemical structures, you can imagine operators that work directly on the chemical graph, or those that work on a proxy such as a SMILES string. Applying a genetic algorithm to molecules isn't a new idea (there was a patent back in 1993 from Dave Weininger on this topic) but that doesn't mean it's not useful still. And in contrast to generative models for DNNs, which have recently taken the limelight, using GAs is fast, doesn't require GPUs, and is easy to understand and improve. Are the results worse than using DNNs? I note that it is so claimed.
If the idea of using a GA sounds interesting, you may find the following code useful as a starting point. I've tried to make it as simple as possible while still doing something useful: given N random molecules from ChEMBL, chosen to be around the size of abilify but with an ECFP4 similarity of less than 0.2 to abilify, can we optimise to a structure with an ECFP4 fingerprint close to that of abilify? In my implementation, the mutation operator is very basic - it simply swaps adjacent characters in the SMILES string.
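For illustration, a mutation operator of the kind just described (swap two adjacent characters) might look something like the sketch below. This is a standalone reconstruction of the idea rather than the implementation from the post; the name `swap_adjacent` and the choice to pass in a seeded `random.Random` are assumptions made for the example.

```python
import random


def swap_adjacent(smiles, rng):
    """Return a copy of `smiles` with one randomly chosen adjacent pair swapped."""
    if len(smiles) < 2:
        return smiles                      # nothing to swap
    i = rng.randrange(len(smiles) - 1)     # left-hand position of the pair
    chars = list(smiles)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)


rng = random.Random(42)                    # fixed seed for reproducibility
parent = "CC(=O)Nc1ccccc1"
child = swap_adjacent(parent, rng)
print(parent, "->", child)
```

Many children produced this way will not be valid SMILES or sensible structures, so an operator like this only makes sense when each child is parsed and filtered (for example with a valence check) before it joins the population.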
The challenge for the reader, if you choose to accept it, is to change/improve this genetic algorithm to get better performance. If you succeed, post the code publicly and leave a comment below. My recommended place to start would be to write a more sophisticated mutation operator (e.g. one that can introduce and break rings, and introduce new genetic material).
- The SMILES strings generated by the GA can look a bit iffy. However, remember that the goal of the GA is not to create nice SMILES strings, but to optimise a chemical structure - how it gets there is somewhat irrelevant. To tidy up the SMILES for consumption by other programs, just run them through Open Babel.
- Requires the development version of Open Babel for ECFP support
- The code illustrates again the use of a valence check. Note that the valence check differs depending on the use case. Here I only allow valences present in abilify, whereas in the previous blog post I was trying to quality control a chemical database.
- The input database should consist of "anti-canonical" SMILES, e.g. generated as described in a previous blog post. Originally I did this on-the-fly but there's currently no way to control the random seed used in Open Babel (and I like reproducibility).
- The code is not slow, but you could make it go faster if you used multiple processors to create the children.
- I realise it's something of a cop-out to use N SMILES from ChEMBL to seed the procedure. A perfect genetic algorithm would be able to start with a bunch of methane, water and ammonia and evolve to the desired optimum. The one I've implemented just manipulates the original genetic material, and so a cop-out is required.
Tuesday, 8 May 2018
As a first step towards considering these atoms, a value of 2 can be passed to CopySubstructure as the fourth parameter. This sprouts dummy atoms to replace missing bonds, and so all para-substituted phenyl rings (for example) become *c1ccc(*)cc1.
To go beyond this and actually recover the identity of the attached atoms, well, that's a little bit trickier. To begin with, we need to think about which atoms we want to copy. Depending on your application, you might want to focus on atoms that influence a particular aromaticity model, or have tautomeric potential. Here I'm going to include any atom that's doubly bonded, and any non-C atoms that are singly-bonded.
The most obvious way to do this is to add these atoms to the atomsToCopy variable before doing the copy. Unfortunately, that approach quickly runs into difficulties as the atom might be attached to two rings (e.g. C1CC1OC2CCCC2), or worse still, the atom might itself be in a ring (e.g. C1CC1=C2CCCC2). In fact, the only way to do this "straightforward approach" is to first count the ring systems and label the atoms belonging to each (e.g. with a flood-fill algorithm), and then one-at-a-time copy each ring system and the attached atoms. It works, but elegant it ain't.
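To give a flavour of the flood-fill step, here is a toolkit-free sketch that labels ring systems given the ring atoms and ring bonds as plain indices. The function name and the input format are assumptions made for this example; with Open Babel you would first collect these indices from the atoms and bonds that report being in a ring.

```python
from collections import defaultdict


def label_ring_systems(ring_atoms, ring_bonds):
    """Assign a ring-system label to each ring atom via flood fill.

    ring_atoms: iterable of indices of atoms that are in a ring
    ring_bonds: iterable of (a, b) index pairs for bonds that are in a ring
    Returns {atom_index: label}, with labels numbered from 1.
    """
    neighbours = defaultdict(list)
    for a, b in ring_bonds:
        neighbours[a].append(b)
        neighbours[b].append(a)

    labels = {}
    current = 0
    for start in sorted(ring_atoms):
        if start in labels:
            continue                       # already reached from an earlier seed
        current += 1
        labels[start] = current
        stack = [start]
        while stack:                       # spread the label across ring bonds only
            atom = stack.pop()
            for nbr in neighbours[atom]:
                if nbr not in labels:
                    labels[nbr] = current
                    stack.append(nbr)
    return labels


# C1CC1OC2CCCC2: atoms 1-3 form one ring, atom 4 is the O, atoms 5-9 the other ring
ring_atoms = [1, 2, 3, 5, 6, 7, 8, 9]
ring_bonds = [(1, 2), (2, 3), (1, 3), (5, 6), (6, 7), (7, 8), (8, 9), (5, 9)]
print(label_ring_systems(ring_atoms, ring_bonds))
```

Because the fill only follows ring bonds, the two rings in C1CC1=C2CCCC2 also end up with different labels, since the double bond joining them is not itself in a ring.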
So let's do it a different way, a way that turns out to be quite a bit simpler and handles all of the ring systems in a single copy. Remember above when I mentioned para-substituted phenyl rings becoming *c1ccc(*)cc1? No? Well, I did. Anyway, let's suppose that we know the original atom to which each dummy atom corresponds; we could then change the dummy atom to have the same atomic number and charge as the original. Well, what d'yaknow - the fifth parameter to CopySubstructure can be used to find exactly this correspondence.
Oh yeah, there's one catch. Any stereo involving the attached atoms will be missing (I didn't implement this in CopySubstructure). If this is important, then you'll need to use the first approach I described above. Anyway, here's the code, which should be used in combination with the code samples in the original post. For the two examples listed above, C1CC1OC2CCCC2 and C1CC1=C2CCCC2, it will produce C1CC1O.C1(CCCC1)O and C1CC1=C.C1(=C)CCCC1.

# Relies on ob, atomsToCopy and bondsToExclude from the code samples in the original post
atomorder = ob.vectorUnsignedInt()

def CopyRingSystem_3(orig, copy):
    # Copy every ring atom...
    atomsToCopy.Clear()
    for atom in ob.OBMolAtomIter(orig):
        if atom.IsInRing():
            atomsToCopy.SetBitOn(atom.GetIdx())
    # ...but leave out every bond that is not in a ring
    bondsToExclude.Clear()
    for bond in ob.OBMolBondIter(orig):
        if not bond.IsInRing():
            bondsToExclude.SetBitOn(bond.GetIdx())
    atomorder.clear()
    # 2 = sprout dummy atoms in place of missing bonds
    ok = orig.CopySubstructure(copy, atomsToCopy, bondsToExclude, 2, atomorder)
    assert ok

    mark_for_deletion = []
    for atom in ob.OBMolAtomIter(copy):
        if atom.GetAtomicNum() != 0:
            continue
        # Each dummy atom has exactly one bond, to the ring atom it sprouted from
        bond = next(ob.OBAtomBondIter(atom))
        bondorder = bond.GetBondOrder()
        nbr = bond.GetNbrAtom(atom)
        origatom = orig.GetAtom(atomorder[atom.GetIdx()-1])
        if bondorder == 2 or origatom.GetAtomicNum() != 6:
            # Turn the asterisk into the 'original atom'
            atom.SetAtomicNum(origatom.GetAtomicNum())
            atom.SetFormalCharge(origatom.GetFormalCharge())
            valence = origatom.BOSum() + origatom.GetImplicitHCount()
            atom.SetImplicitHCount(valence - bondorder)
        else:
            # Singly-bonded carbon: drop the dummy and restore hydrogens on the ring atom
            nbr.SetImplicitHCount(nbr.GetImplicitHCount() + bondorder)
            mark_for_deletion.append(atom)

    for atom in reversed(mark_for_deletion):
        copy.DeleteAtom(atom)

    return ok