Sunday 25 November 2018

SMARTS for dummies

One of Roger's observations came to mind recently when I was looking at some SMARTS patterns in Open Babel that were causing some problems: "People cannot write SMARTS correctly." Instead, he argues, they should use tools to create SMARTS for them. The problem is not so much that patterns don't match molecules of interest, but that they match additional molecules (false positives) and may miss others (false negatives). False positives are a particular problem when transformations (e.g. SMIRKS) are applied to matching molecules, as they can result in exotic valence and charge states that were unintended.

The following is a description of a tool I put together to help craft SMARTS patterns. It is based upon the idea that a SMARTS pattern is completely defined by the molecules it matches. Thus, by looking at the matches in a database of sufficient size, we can see whether there are any mismatches.

So far, so obvious. What's novel is that the tool only returns distinct matches based upon simple atom types of the matching atoms, and sorts them by probability of those atom types (most unusual first). Thus, even for common substructures the number of hits is small and information-rich. Furthermore, by using my favourite trick of sorting the reference database by size (ChEMBL here), this means that the first hit found for each distinct match is likely to be small in size.

An example will illustrate: let's suppose we want to create a SMARTS pattern to match alcohols. I know that "CO" in SMILES indicates this, and so let's start with that. Only 28 hits are returned (from a 10% subset of ChEMBL), of which the first three are:
Ok, obviously we don't want a charged oxygen, so let's try "C[O+0]" and look at the first three hits:
And the oxygen should have a hydrogen, "C[Oh1+0]":
Now, about that carbon. I only had sp3-hybridised carbons in mind, so how about "[CX4][Oh1+0]"? Here is the full list of hits at this point:
At this point the hits are what I expected to see. Does this mean I should stop here? No, but it's a reasonable start.

After using this tool for a while, you begin to appreciate the importance of nailing down exactly what you want to see: if you only want to match neutral carbons, you should say so; if you don't want the oxygen to have additional neighbours, you should say so. Once every degree of freedom is nailed down...then it's finished. You might think this is going too far, but the molecules that exist in databases don't necessarily conform to typical valence rules - they may be unusual (consider TEMPO) or downright incorrect. Your SMARTS pattern should match exactly what you had in mind and not leave any wiggle room to match anything more: "[Cv4X4+0]-[OD1h1+0]"

This tool is by no means a panacea; you still need to think for yourself, and consider for example what to do about aromaticity, or the presence of explicit hydrogens or hydrogen isotopes, or whether to reorder the pattern for speed. But one could imagine a tool that could ease the burden further and do some of this for the user, or for example allow the user to mark "match" or "don't match" for the set of hits returned to automatically refine the query.

Tuesday 2 October 2018

DeepSMILES - a SMILES-like syntax for use in machine-learning

At the last ICCS, Andrew Dalke and I got talking. We had both been around the poster session, and had heard first-hand the problems of "invalid SMILES" that were affecting the generation of SMILES strings using deep neural networks. I remember Xuhan Liu in particular asking me whether I had a solution to this. Well, I had been turning over an idea in my mind for a little while, and by the time I got talking to Andrew, I had decided to do something about it. Now Andrew knows a thing or two about SMILES himself, so I shouldn't have been surprised that he had also come up with something (along with two references from 1964 to back it up). We put the two ideas together and called it DeepSMILES.

The problematic areas of SMILES syntax involve paired ring closure symbols for cycles, and parentheses for branches. These particular aspects of the syntax are difficult to reproduce correctly when generating SMILES strings using machine-learning methods, and so a certain percentage of generated SMILES tend to fail basic syntax checks. While there have been a variety of approaches aimed at improving the SMILES generation (with quite some success), it is reasonable to assume that the syntax also causes difficulties during the learning phase.

Our approach is not to use SMILES, but instead an alternative syntax that does not have these problems. Paired ring closures are replaced by a single digit, the ring size; paired parentheses are replaced by close parentheses indicating branch size. See the talk below, the preprint, or the GitHub site for more information. Feedback (positive or negative) is welcome, either here or on GitHub.

Friday 20 July 2018

Streamlining the handling of aromaticity

I've done a bit of work over the last year or two stripping back some accumulated cruft in Open Babel. One question that's been on my mind during this process is how much can you simplify the toolkit's core but yet leave it capable of complex behaviour?

Take the case of aromaticity handling. Just this part of the toolkit on it's own raises many questions. Should aromaticity be lazily perceived or require an explicit function call? What happens when the user modifies a molecule? What about copying a molecule? Or just copying a subset of the atoms to another molecule? What should happen when you add two molecules together? If you read aromaticity from a SMILES string, what if it's a different model than the toolkit uses internally? Should the SMILES writer reperceive aromaticity, or just use it as presented?

Often the easiest solution to these problems is to always do the maximum amount of work possible (i.e. throwing away perceived aromaticity information at every opportunity), which I wanted to avoid at all costs. So I went through removing bits and simplifying things, making sure that aromaticity information was copied where possible, and hoping that in the end all of the complex behaviour that I wanted to maintain would still be possible without adding back fixes or kludges. And fortunately it was. I even managed to add additional behaviour, an option to keep the aromaticity information as read from a SMILES string.

I'm a firm believer that there's no point adding features or improving things if you don't write documentation explaining why/how/when they should use it. What's the point doing this work if no-one knows how they can take advantage of it? So, I've just written up documentation on how Open Babel handles aromaticity. The following (exclusive!) excerpt describes how aromaticity information is stored by Open Babel.

Handling of aromaticity

The purpose of this section is to give an overview of how Open Babel handles aromaticity. Given that atoms can be aromatic, bonds can be aromatic, and that molecules have a flag for aromaticity perceived, it’s important to understand how these all work together.

How is aromaticity information stored?

Like many other toolkits, Open Babel stores aromaticity information separate from bond order information. This means that there isn’t a special bond order to indicate aromatic bond. Instead, aromaticity is stored as a flag on an atom as well as a flag on a bond. You can access and set this information using the following API functions:

  • OBAtom::IsAromatic(), OBAtom::SetAromatic(), OBBond::UnsetAromatic()
  • OBBond::IsAromatic(), OBBond::SetAromatic(), OBBond::UnsetAromatic()

There is a catch though, or rather a key point to note. OBMols have a flag to indicate whether aromaticity has been perceived. This is set via the following API functions:

  • OBMol::SetAromaticPerceived(), OBMol::UnsetAromaticPerceived()

The value of this flag determines the behaviour of the OBAtom and OBBond IsAromatic() functions.

  • If the flag is set, then IsAromatic() simply returns the corresponding value of the atom or bond flag.
  • If unset, then IsAromatic() triggers aromaticity perception (from scratch), and then returns the value of the flag.

See for the nail-biting conclusion to this thrilling exposition.

Tuesday 10 July 2018

Clarifying the Cahn-Ingold-Prelog rules

What happens when four software developers get together and compare their implementations of the CIP rules? This is the background to a recent preprint deposited in ChemRxiv.

The CIP system is a series of rules that describe how to assign a stereodescriptor (e.g. R/S, E/Z) to a stereocentre. When Bob Hanson decided to add support for CIP to Jmol, rather than simply read the rules and implement it according to his interpretation as others have done, he decided to work with three other implementations to challenge each other on disagreements and clarify the wording of the rules.

The result is described in:
Algorithmic Analysis of Cahn-Ingold-Prelog Rules of Stereochemistry: Proposals for Revised Rules and a Guide for Machine Implementation. Robert M. Hanson, John W. Mayfield, Mikko J. Vainio, Andrey Yerin, Dmitry Redkin, and Sophia
Musacchio []

Essentially, the issue that the authors are addressing is the fact that existing implementations even in "highly respected software packages" disagree with each other (see John Mayfield's presentation). By comparing the implementations in Jmol, Centres, Balloon and ChemSketch they were able to identify cases where:
"(a) the disagreement was due to different interpretations of CIP rules among software developers, (b) there was a problem with an algorithm or its implementation in code, or (c) the CIP rules themselves were flawed."

In all cases, however, they were able to come to a consensus, which led to "the discovery of a small number of errors in the Blue Book, two minor flaws in the CIP rules, and a proposal for a new rule".

The paper walks through their discussions of each rule in turn, looking at any issues arising and clarifying any ambiguities. It also includes a validation suite (browse it here) that covers all aspects of the rules and will allow future CIP implementations to avoid the pitfalls that have beset the field in the past.

Thursday 14 June 2018

Cheminformatics for deep learners: Canonical SMILES and why you should avoid them

I've already gotten the "valid SMILES" issue off my chest, but there's something else that's been bugging me about deep learning papers in chemistry, and that's the use of canonical SMILES for training. Why would you use canonical SMILES? And if you did, wouldn't that introduce some problems?

This is one of those cases where this just feels like a mistake to me - and I'm going to do my best to articulate my concerns. However, I'm not familiar enough with the mechanics of DNNs to be sure, but I hope those in the field will consider the points I raise.

The problem

So, canonical SMILES. Behind the scenes, the canonicalisation procedure gives a unique label to each atom (typically 1 to N). These labels are then used to generate a canonical SMILES, typically by starting at the lowest label (but not necessarily). The canonicalisation procedure is based upon the attributes of the graph, with the result that the first atom in a canonical SMILES tends to favor particular atom types and avoid others.

For example, if you look at the atom types for the first atom in the canonical SMILES generated by RDKit for ChEMBL molecules, you will find that the second most common atom type in ChEMBL (namely, *-C(-*)=*) never appears as the first atom in a canonical SMILES string. This is by design and you'll see similar behaviour with other toolkits - SMILES strings tend to start with degree 1 atoms.

So what if the distribution of atom types is different for the first atom?

Well, firstly, I predict that as a result these atom types will be over-represented in structures generated by DNNs (and others under-represented). If you train on canonical SMILES, then the probabilities for the first atom will be determined by the atom types favored as starting atoms by canonical SMILES. Consider the extreme example where fluorine always occurs as the first atom in any canonical SMILES that contains it; you should see an increased number of fluorines as the first atom in the generated molecules. Now you could argue that the probabilities for the remaining atoms will be adjusted accordingly, but I believe that there is a strong edge affect and that any correction will attenuate as the SMILES string becomes longer.

Secondly, this bias makes the DNNs job harder. Instead of a relatively even distribution of atom types at all points in the SMILES string, the distribution will depend on the distance from the starting atom. It's now trying to learn something about the properties of canonical SMILES instead on concentrating on the task at hand...

...which bring me nicely to the third point. Predictive models attempt to deduce a property value from the structure, and a SMILES string is used by DNNs as a proxy for the structure. Using a canonical SMILES string is another step removed. What about a molecule with a very similar structure but very different canonical SMILES? Surely the goal of a robust model is to handle this well. Restricting good predictive power to only those structures that are both similar and have similar canonical SMILES is to develop a model with a reduced applicability. A fun task is do is to measure the degree to which this fitting to canonical SMILES occurs; this is left to the reader.

The solution

The solution is simple. Use random SMILES. A single one, or multiple. The use of multiple random SMILES has already been described by Thomas Bergwinkl and subsequently by Esben Jannik Bjerrum as a 'data augmentation technique', but I see it as just avoiding the inherent bias of using canonical SMILES. But either way, I like this quote from Thomas:
The output for alternative representations of a molecule should be the same, if you understand SMILES. Using alternative representations in the test data allows to verify if the neural network understands SMILES. Spoiler: After a while the output becomes very close for the alternatives!

So why do people use canonical SMILES in the first place? I have my theory on this.

I believe it's because the generative models more quickly converge to generation of syntactically valid SMILES strings when they train on canonical SMILES. And for some reason, the percentage of syntactically valid SMILES strings generated by the model has become a figure of merit.

But this makes no sense - who cares what this percentage is? Sure, we can all overfit to canonical SMILES and get high percentages quickly. But how is this a good thing? You know that feeling you get when you see a QSAR model with very high R2 on training data - that's how I feel when I see a high value for this percentage. If it's actually doing what it's supposed to be doing (i.e. learning the underlying structure rather than the training set of canonical SMILES), then the percentage should really be lower. What do I care if the percentage of syntactically valid SMILES is 1%? So long as that 1% solves my problem, it's irrelevant - these structures are spewed out of these models thousands per second (I presume, but even so).

Please let him stop talking now

Okay, okay - I'm done. What do you think?

Wednesday 6 June 2018

Notes from the 11th International Conference on Chemical Structures

Just back from the recent ICCS in Noordwijkerhout, the Netherlands. I really enjoyed it although, as you can see from the picture, it was hard work at times.

Here are my notes on the scientific program, which I have just extracted from Twitter. A big thank-you to ThreadReader, without which I couldn't have done the extraction. Naturally, all errors are my own - I may have misunderstood something or lost the thread of the talk. Also, just to note, I didn't take notes on all of the talks.

If you are interested in a particular talk, and paste the provided link into a browser, you can see if anyone on Twitter added a comment on the tweet.

If you want to follow the entire Twitter conversation on the meeting, which used the hashtag #11thICCS, go to this link. Again, if a particular Tweet has replies, you need to click on it to see them.

As well as taking notes, I also presented a poster entitled "Can we agree on the structure represented by a SMILES string? A benchmark dataset". For more info, follow the link over to the NextMove blog.

Note to self:
Next time include the hashtag with every tweet. Otherwise it's hard to extract, and hard for attendees to follow automatically.

Image credit:
Image courtesy of Jason Cole.

Thursday 24 May 2018

Reader's challenge: beat my SMILES genetic algorithm

I'm a big fan of genetic algorithms. I used one back in the day for inverse design of molecular wires. They are deceptively simple, so much so that when you see one in action, churning out molecules that seem to magically improve generation on generation, it's a real eye-opener.

A key part of a genetic algorithm is the application of the crossover and mutation operators. The idea is to take a population of existing molecules, somehow combine pairs (crossover) while changing the structure slightly (mutation). When dealing with chemical structures, you can imagine operators that work directly on the chemical graph, or those that work on a proxy such as a SMILES string. Applying a genetic algorithm to molecules isn't a new idea (there was a patent back in 1993 from Dave Weininger on this topic) but that doesn't mean it's not useful still. And in contrast to generative models for DNNs, which have recently taken the limelight, using GAs is fast, doesn't require GPUs, and is easy to understand and improve. Are the results worse then using DNNs? I note that it is so claimed.

If the idea of using a GA sounds interesting, you may find the following code useful as a starting point. I've tried to make it as simple as possible while still doing something useful: given N random molecules from ChEMBL, chosen to be around the size of abilify but with ECFP4 of less than 0.2 to abilify, can we optimise to a structure with ECFP4 close to abilify? In my implementation, the mutation operator is very basic - it simply swaps adjacent characters in the SMILES string.

The challenge for the reader, if you choose to accept it, is to change/improve this genetic algorithm to get better performance. If you succeed, post the code publicly and leave a comment below. My recommended place to start would be to write a more sophisticated mutation operator (e.g. one that can introduce and break rings, and introduce new genetic material).

  1. The SMILES strings generated by the GA can look a bit iffy. However, remember that the goal of the GA is not to create nice SMILES strings, but to optimise a chemical structure - how it gets there is somewhat irrelevant. To tidy up the SMILES for consumption by other programs, just run them through Open Babel.
  2. Requires the development version of Open Babel for ECFP support
  3. The code illustrates again the use of a valence check. Note that the valence check differs depending on the use case. Here I only allow valences present in abilify, whereas in the previous blog post I was trying to quality control a chemical database.
  4. The input database should consist of "anti-canonical" SMILES, e.g. generated as described in a previous blog post. Originally I did this on-the-fly but there's currently no way to control the random seed used in Open Babel (and I like reproducibility).
  5. The code above is not slow, but you could make it go faster if you used multitple processors to create the children.
  6. I realise it's something of a cop-out to use N SMILES from ChEMBL to seed the procedure. A prefect genetic algorithm would be able to start with a bunch of methane, water and ammonia and evolve to the desired optimum. The one I've implemented just manipulates the original genetic material, and so a cop-out is required.

Tuesday 8 May 2018

When all you want is a ring: Part II

The previous post used a fairly strict definition of a ring system; a set of atoms, all of which are in a ring, and which are connected by bonds, all of which are in a ring. But this ignores any attached atoms, and a case can be made that the removal of these attached atoms completely alters the identity of a ring system.

As a first step towards considering these atoms, a value of 2 can be passed to CopySubstructure as the fourth parameter. This sprouts dummy atoms to replace missing bonds, and so all para-substituted phenyl rings (for example) become *c1ccc(*)cc1.

To go beyond this and actually recover the identity of the attached atoms, well, that's a little bit trickier. To begin with, we need to think about which atoms we want to copy. Depending on your application, you might want to focus on atoms that influence a particular aromaticity model, or have tautomeric potential. Here I'm going to include any atom that's doubly bonded, and any non-C atoms that are singly-bonded.

The most obvious way to do this is to add these atoms to the atomsToCopy variable before doing the copy. Unfortunately, that approach quickly runs into difficulties as the atom might be attached to two rings (e.g. C1CC1OC2CCCC2), or worse still, the atom might itself be in a ring (e.g. C1CC1=C2CCCC2). In fact, the only way to do this "straightforward approach" is to first count the ring systems and label the atoms belonging to each (e.g. with a flood-fill algorithm), and then one-at-a-time copy each ring system and the attached atoms. It works, but elegant it ain't.

So let's do it a different way, a way that turns out to be quite a bit simpler and handles all of the ring systems in a single copy. Remember above when I mentioned para-substituted phenyl rings becoming *c1ccc(*)cc1? No? Well, I did. Anyway, let's suppose that we know the original atom to which each dummy atom corresponds; we could then change the dummy atom to have the same atomic number and charge as the original. Well, what d'yaknow - the fifth parameter to CopySubstructure can be used to find exactly this correspondence.

Oh yeah, there's one catch. Any stereo involving the attached atoms will be missing (I didn't implement this in CopyStructure). If this is important, then you'll need to use the first approach I described above. Anyway, here's the code, which should be used in combination with the code samples in the original post. For the two examples listed above, C1CC1OC2CCCC2 and C1CC1=C2CCCC2, it will produce C1CC1O.C1(CCCC1)O and C1CC1=C.C1(=C)CCCC1.
atomorder = ob.vectorUnsignedInt()

def CopyRingSystem_3(orig, copy):
    for atom in ob.OBMolAtomIter(orig):
        if atom.IsInRing():
    for bond in ob.OBMolBondIter(orig):
        if not bond.IsInRing():
    ok = orig.CopySubstructure(copy, atomsToCopy, bondsToExclude, 2, atomorder)
    assert ok

    mark_for_deletion = []
    for atom in ob.OBMolAtomIter(copy):
        if atom.GetAtomicNum() != 0:
        bond = next(ob.OBAtomBondIter(atom))
        bondorder = bond.GetBondOrder()
        nbr = bond.GetNbrAtom(atom)
        origatom = orig.GetAtom(atomorder[atom.GetIdx()-1])
        if bondorder == 2 or origatom.GetAtomicNum() != 6:
            # Turn the asterisk into the 'original atom'
            valence = origatom.BOSum() + origatom.GetImplicitHCount()
            atom.SetImplicitHCount(valence - bondorder)
            nbr.SetImplicitHCount(nbr.GetImplicitHCount() + bondorder)
    for atom in reversed(mark_for_deletion):

    return ok

Saturday 5 May 2018

When all you want is a ring

I've just added a method called CopySubstructure() to Open Babel. It's something I've missed for a while, the ability to copy part of a molecule into a new one. The basic usage is where the user gives a list of atoms to copy. When you copy the substructure, all bonds between atoms in the substructure are retained (and only those). By default, hydrogen counts are adjusted to account for missing bonds.

As an example of use, if you copy all atoms that are in rings, you will find all of the ring systems:
import pybel
ob = pybel.ob

atomsToCopy = ob.OBBitVec()
bondsToExclude = ob.OBBitVec()

def CopyRingSystem(orig, copy):
    for atom in ob.OBMolAtomIter(orig):
        if atom.IsInRing():
    return orig.CopySubstructure(copy, atomsToCopy)
So Clc1ccccc1OC2CCC2 will become c1ccccc1.C2CCC2. If you do this for all of ChEMBL, split on the dots, and collate using the canonical SMILES, you can find the frequencies of all ring systems. Well, almost...

Even for this 'simple' use case, there are some subtleties that may trap the unwary when copying substructures of a molecule. For example, aromaticity is copied as present in the original molecule, and this may or may not be what one wants - it's up to the user to decide, and to mark aromaticity as unperceived if they wish. In this context, it's worth bearing in mind that the Daylight aromaticity model includes contributions from some exocyclic double bonds (and these won't be present with the code above).

Another quirk is that while it may seem that the code above creates SMILES strings for isolated ring systems, that's not quite true. If two ring systems are connected by a bond, then that bond will be retained even where it itself is not in a ring (because all bonds between atoms in the substructure are retained - see the very first paragraph above). For example, c1ccccc1C2CCC2 will stay c1ccccc1C2CCC2.

Which brings us to the second argument to the method, a list of bonds to exclude. To get the behaviour we might expect, all we need to do is add all bonds that are not in a ring to this list of bonds to exclude:
...continued from above
def CopyRingSystem_2(orig, copy):
    for atom in ob.OBMolAtomIter(orig):
        if atom.IsInRing():
    for bond in ob.OBMolBondIter(orig):
        if not bond.IsInRing():
    return orig.CopySubstructure(copy, atomsToCopy, bondsToExclude)

Here is a main() that could be used with the code above:
if __name__ == "__main__":
    copied = ob.OBMol()
    with open("outputC.smi", "w") as out:
        for mol in pybel.readfile("smi", r"C:\Tools\LargeData\sortedbylength.smi"):
            ok = CopyRingSystem_2(mol.OBMol, copied)
            assert ok
            if copied.NumAtoms() > 0:
                smi = pybel.Molecule(copied).write("smi").rstrip()

Friday 27 April 2018

Finding Texas carbons and other unusual valencies

So, having disposed of the phrase "invalid molecule" in the previous post, let's find some :-). Step one is to define what I mean by "invalid molecule": I'm going to use this to mean any molecule that contains an atom that is not on a list of common charge states and valencies. Rather conveniently, I have a table to hand provided by Daniel Lowe which I can use for this purpose. If an element is not in the table, it's fine; but if it is, then it has to be in one of the listed charge states and valencies.

The code is shown below, but first, the results for ChEMBL 23 (you can copy the whole lot and paste into John Mayfield's CDKdepict to view). There are a couple of Texas carbons, oxygen radicals, and TEMPO-like nitrogens that should be neutral. That's not to say that everything found is dodgy - nitrogen monoxide is rumoured to be stable, for example - but they are definitely unusual and warrant further inspection.
F[Al-3](F)(F)(F)(F)F 181624
C[C@H]([C@@H](CO)NC(=O)[C@@H]1CSSC[C@@H](C(=O)N[C@H](C(=O)N[C@@H](C(=O)N[C@H](C(=O)N[C@H](C(=O)N1)C(C)O)CCCCN)CC2=CN=C3=CC=CC=C32)Cc4ccccc4)NC(=O)[C@@H](Cc5ccccc5)N)O 3349005
CC[C@@]1(C[C@@H]2C[C@@]([C]3C(=c4ccccc4=N3)CCN(C2)C1)(c5cc6c(cc5OC)N([C@@H]7[C@]68CCN9[C@H]8[C@@](C=CC9)([C@H]([C@@]7(C(=O)N)O)O)CC)C)C(=O)OC)O.OS(=O)(=O)O 3349007
CC(C)(C)OC(=O)NCCC(=O)N[C@@H](CC1=CN=C2=CC=CC=C21)C(=O)N[C@@H](CCSC)C(=O)N[C@@H](CC(=O)O)C(=O)N[C@@H](Cc3ccccc3)C(=O)N 3348969
CC(C)(C)c1cc(cc(c1[O])C(C)(C)C)CCNc2c3c(ncn2)n(cn3)[C@H]4[C@@H]([C@@H]([C@H](O4)CO)O)O 1098520
CC(C)(C)c1cc(cc(c1[O])C(C)(C)C)CCNc2c3c(ncn2)n(cn3)[C@H]4[C@@H]([C@@H]([C@H](O4)C(=O)NC)O)O 1098715
CC(C)(C)c1cc(cc(c1[O])C(C)(C)C)C(=O)NCCc2ccc(cc2)Nc3c4c(ncn3)n(cn4)[C@H]5[C@@H]([C@@H]([C@H](O5)C(=O)NC)O)O 1094994
[N]=O 1200689
c1cc(ccc1C(=N)N)OCCCCCOc2ccc(cc2)C(=N)N.C(CO[S](=O)=O)O.C(CO[S](=O)=O)O 1405011
OP1(=O)OP(=O)(OP(=O)(O1)O)O.[Al] 1433450
C1[C@@H]2[C@@H]([C@H](O1)[C@H]([CH]O2)O)C[C@H]3[C@@H]([C@H]([C@H]([C@H](O3)CO)OS(=O)(=O)O)O[C@@H]4[C@@H]([C@@H]5[C@H]([C@H](O4)CO5)O[C@H]6[C@@H]([C@H]([C@H]([C@H](O6)CO)O)[O])O)OS(=O)(=O)O)O 1525673
c1ccc(cc1)N(c2ccccc2)[N]c3c(cc(cc3[N+](=O)[O-])[N+](=O)[O-])[N+](=O)[O-] 1668332
[B].c1ccc(cc1)[Si](c2ccccc2)(c3ccccc3)OCCN4CC4 1985326
[B].CS(=O)(=O)O[C@H]1CN2C[C@H]([C@H]([C@H]2[C@H]1CO)O)O 1974383
[B]C(=O)NC(CC(C)C)C(=O)NC1c2ccsc2C(=O)C1O.CN(C)C 1974647
C(C(C(=O)O)S)(C(=O)O)S.[As] 1991929
[B]C(=O)NC(CC(C)C)C(=O)NCC(=O)NC1c2ccsc2C(=O)C1O.CN(C)C 1986958
[B].CN(C)C/C(=C(\c1ccc(c(c1)OC)OC)/Cl)/c2ccc(c(c2)OC)OC 2005771
CC[n+]1c2ccccc2sc1/C=C(/C)\C#C/C=C\3/N(c4ccccc4S3)CC.[F-][P+5]([F-])([F-])([F-])([F-])[F-] 1992520
CN(C)c1ccc(c(c1)[O-])N=O.CN(C)c1ccc(c(c1)[O-])N=O.[Si+4].Cl.[Cl-].[Cl-] 2146183
CC#N.C1CC[C@@H]([C@@H](C1)O)[O-].C1CC[C@@H]([C@@H](C1)O)[O-].[Cl-].[Cl-].[Te+4] 2146182
[H+].C1C[O-][Te+4][O-]1.N.[Cl-].[Cl-].[Cl-] 2146259
CCCCC1C[O-][Te+4][O-]1.N.[Cl-].[Cl-].[Cl-] 2146289
CCCCCCC1C[O-][Te+4][O-]1.N.[Cl-].[Cl-].[Cl-] 2146290
[C] 2106049
[S] 2105487
[F-].[F-].[F-].[F-].[F-].[F-].[Si+4] 2310952
CC[C@H](C)[C@@H]([C@@H](CC(=O)N1CCC[C@H]1[C@@H]([C@@H](C)C(=O)N[C@H](C)[C@H](c2ccccc2)O)OC)OC)N(C)C(=O)[C@H](C(C)C)NC(=O)[C@H](C(C)C)N(C)C(=O)OCc3ccc(cc3)NC(=O)[C@H](CCCNC(=O)N)NC(=O)[C@H](C(C)C)NC(=O)CCCCCN4C(=O)C[CH]C4=O 2364667
c1ccc(cc1)C(=O)CC(=O)c2ccccc2.c1ccc(cc1)C(=O)CC(=O)c2ccccc2.c1ccc(cc1)C(=O)CC(=O)c2ccccc2.[Si+4].Cl.Cl 2374292
c1ccc(c(=O)cc1)[O-].c1ccc(c(=O)cc1)[O-].c1ccc(c(=O)cc1)[O-].[Si+4].[Cl-] 2374293
CC(C)(C)C(=O)CC(=O)C(C)(C)C.CC(C)(C)C(=O)CC(=O)C(C)(C)C.[Si+4].Cl.Cl 2374299
c1cc2cccnc2c(c1)[O-].c1cc2cccnc2c(c1)[O-].c1cc2cccnc2c(c1)[O-].c1cc2cccnc2c(c1)[O-].[Si+4].Cl 2374305
CN1C=NNC1(=S)c2c(c3c(c(nnc3s2)c4ccccc4)c5ccccc5)O 2299271
[Al+3].[P-3] 2272784
Cc1ccc(c2c1oc-3c(c(=O)c(c(c3n2)C(=O)N[C@H]4[C@H](OC(=O)[C@@H](N(C(=O)CN(C(=O)[C@@H]5CCCN5C(=O)[C@H](NC4=O)C(C)C)C)C)C(C)C)C)NCCNC6CC([N+](C(C6)(C)C)[O-])(C)C)C)C(=O)N[C@H]7[C@H](OC(=O)[C@@H](N(C(=O)CN(C(=O)[C@@H]8CCCN8C(=O)[C@H](NC7=O)C(C)C)C)C)C(C)C)C 3228736
Cc1ccc(c2c1oc-3c(c(=O)c(c(c3n2)C(=O)N[C@H]4[C@H](OC(=O)[C@@H](N(C(=O)CN(C(=O)[C@@H]5CCCN5C(=O)[C@H](NC4=O)C(C)C)C)C)C(C)C)C)NC6CC([N+](C(C6)(C)C)[O-])(C)C)C)C(=O)N[C@H]7[C@H](OC(=O)[C@@H](N(C(=O)CN(C(=O)[C@@H]8CCCN8C(=O)[C@H](NC7=O)C(C)C)C)C)C(C)C)C 3228735
Cc1ccc(c2c1oc-3c(c(=O)c(c(c3n2)C(=O)N[C@H]4[C@H](OC(=O)[C@@H](N(C(=O)CN(C(=O)[C@@H]5CCCN5C(=O)[C@H](NC4=O)C(C)C)C)C)C(C)C)C)NCCCNC6CC([N+](C(C6)(C)C)[O-])(C)C)C)C(=O)N[C@H]7[C@H](OC(=O)[C@@H](N(C(=O)CN(C(=O)[C@@H]8CCCN8C(=O)[C@H](NC7=O)C(C)C)C)C)C(C)C)C 3228737
[NH4+].[NH4+].F[Si-2](F)(F)(F)(F)F 3182693
Cc1cn(c(=O)[nH]c1=O)C2C(C(C(O2)C(COC(=O)C)NC(=O)C(CC3=C4=CC=CC=C4N=C3)NC(=O)OC(C)(C)C)OC(=O)C)OC(=O)C 3187332
C1CCN2CC3CC(C2C1)CN4C3CCCC4=O.O[Cl+3]([O-])([O-])[O-] 3186825
F[Si-2](F)(F)(F)(F)F.[Na+].[Na+] 3184182
C[Si](C)O[Si](C)(C)O[Si](C)(C)O[Si](C)C 3184420
CN(C)[N](=NOc1cc(c(cc1[N+](=O)[O-])[N+](=O)[O-])ON=[N+](N2CCN(CC2)C(=O)c3cc(ccc3F)Cc4c5ccccc5c(=O)[nH]n4)[O-])O 3188868
CNc1ccc(cc1)C(=O)Oc2cc(c(cc2C#N)[N+](=O)[O-])ON=[N](N(C)C)O 3187972
C[Si](O[Si](C)(C)C)O[Si](C)(C)C 3189129
c1cc(cc(c1)NC(=O)C(=O)O)C2=NN=N[N]2 3188982
CC(C)N1c2c(c(ncn2)N)C(=C3C=c4cc(ccc4=N3)O)[N]1 3209955
c1ccn2c(c1)c(c(=O)n(c2=O)CCCCN3CCC(CC3)c4c[nH]c5c4F=C(C=C5)F)c6ccc(cc6)F 3397072
CC[C@@]1(C[C@@H]2C[C@@]([C]3C(=c4ccccc4=N3)CCN(C2)C1)(c5cc6c(cc5OC)N([C@@H]7[C@]68CCN9[C@H]8[C@@](C=CC9)([C@H]([C@@]7(C(=O)N)O)O)CC)C)C(=O)OC)O 3545878
c1ccc2c(c1)C(=O)O[Mg]3(O2)Oc4ccccc4C(=O)O3.O.O.O.O 3561635
[H]1C[C@@H](C[C@@H]2[C@]1([C@H]3C[C@H]([C@@]4([C@H](CC[C@@]4([C@@H]3CC2)O)C\5=CC(=O)O/C5=C\c6ccc(cc6)N(C)C)C)O)C)O[C@H]7C[C@@H]([C@@H]([C@H](O7)C)O[C@H]8C[C@@H]([C@@H]([C@H](O8)C)O[C@H]9C[C@@H]([C@@H]([C@H](O9)C)O)O)O)O 3594279
[CH]=CCc1c[nH]c2c1cccc2 3623239
[CH2]c1c[nH]c2c1cccc2 3623240
[CH]=CCc1c[nH]c2c1cc(cc2)F 3623241
c1ccc2c(c1)C(=O)O[Mg]3(O2)Oc4ccccc4C(=O)O3 3580437
C1C[O-][Te+4][O-]1 3558859
CCCCC1C[O-][Te+4][O-]1 3558860
CCCCCCC1C[O-][Te+4][O-]1 3558861
c1ccc(cc1)C(=O)CC(=O)c2ccccc2.c1ccc(cc1)C(=O)CC(=O)c2ccccc2.c1ccc(cc1)C(=O)CC(=O)c2ccccc2.[Si+4] 3559378
CC(C)(C)C(=O)CC(=O)C(C)(C)C.CC(C)(C)C(=O)CC(=O)C(C)(C)C.[Si+4] 3559379
c1cc2cccnc2c(c1)[O-].c1cc2cccnc2c(c1)[O-].c1cc2cccnc2c(c1)[O-].c1cc2cccnc2c(c1)[O-].[Si+4] 3559380
CC#N.C1CC[C@@H]([C@@H](C1)O)O.C1CC[C@@H]([C@@H](C1)O)O.[Te+4] 3559382
CN(C)c1ccc(c(c1)O)N=O.CN(C)c1ccc(c(c1)O)N=O.[Si+4] 3559383
c1ccc(c(=O)cc1)O.c1ccc(c(=O)cc1)O.c1ccc(c(=O)cc1)O.[Si+4] 3559385
import sys
import pybel
ob = pybel.ob

import multiprocessing as mp

common_valencies = {1: {0: [1], 1: [0]},
 2: {0: [0]},
 3: {0: [1], 1: [0]},
 4: {0: [2], 1: [1], 2: [0]},
 5: {-2: [3], -1: [4], 0: [3], 1: [2], 2: [1]},
 6: {-2: [2], -1: [3], 0: [4], 1: [3], 2: [2]},
 7: {-2: [1], -1: [2], 0: [3, 5], 1: [4], 2: [3]},
 8: {-2: [0], -1: [1], 0: [2], 1: [3, 5]},
 9: {-1: [0], 0: [1], 1: [2], 2: [3, 5]},
 10: {0: [0]},
 11: {-1: [0], 0: [1], 1: [0]},
 12: {0: [2], 2: [0]},
 13: {-2: [3, 5], -1: [4], 0: [3], 1: [2], 2: [1], 3: [0]},
 14: {-2: [2], -1: [3, 5], 0: [4], 1: [3], 2: [2]},
 15: {-2: [1, 3, 5, 7], -1: [2, 4, 6], 0: [3, 5], 1: [4], 2: [3]},
 16: {-2: [0], -1: [1, 3, 5, 7], 0: [2, 4, 6], 1: [3, 5], 2: [4]},
 17: {-1: [0], 0: [1, 3, 5, 7], 1: [2, 4, 6], 2: [3, 5]},
 18: {0: [0]},
 19: {-1: [0], 0: [1], 1: [0]},
 20: {0: [2], 1: [1], 2: [0]},
 31: {-2: [3, 5], -1: [4], 0: [3], 1: [0], 2: [1], 3: [0]},
 32: {-2: [2, 4, 6], -1: [3, 5], 0: [4], 1: [3], 4: [0]},
 33: {-3: [0], -2: [1, 3, 5, 7], -1: [2, 4, 6], 0: [3, 5], 1: [4], 2: [3]},
 34: {-2: [0], -1: [1, 3, 5, 7], 0: [2, 4, 6], 1: [3, 5], 2: [4]},
 35: {-1: [0], 0: [1, 3, 5, 7], 1: [2, 4, 6], 2: [3, 5]},
 36: {0: [0, 2]},
 37: {-1: [0], 0: [1], 1: [0]},
 38: {0: [2], 1: [1], 2: [0]},
 49: {-2: [3, 5], -1: [2, 4], 0: [3], 1: [0], 2: [1], 3: [0]},
 50: {-2: [2, 4, 6], -1: [3, 5], 0: [2, 4], 1: [3], 2: [0], 4: [0]},
 51: {-2: [1, 3, 5, 7], -1: [2, 4, 6], 0: [3, 5], 1: [2, 4], 2: [3], 3: [0]},
 52: {-2: [0], -1: [1, 3, 5, 7], 0: [2, 4, 6], 1: [3, 5], 2: [2, 4]},
 53: {-1: [0], 0: [1, 3, 5, 7], 1: [2, 4, 6], 2: [3, 5]},
 54: {0: [0, 2, 4, 6, 8]},
 55: {-1: [0], 0: [1], 1: [0]},
 56: {0: [2], 1: [1], 2: [0]},
 81: {0: [1, 3]},
 82: {-2: [2, 4, 6], -1: [3, 5], 0: [2, 4], 1: [3], 2: [0]},
 83: {-2: [1, 3, 5, 7], -1: [2, 4, 6], 0: [3, 5], 1: [2, 4], 2: [3], 3: [0]},
 84: {0: [2, 4, 6]},
 85: {-1: [0], 0: [1, 3, 5, 7], 1: [2, 4, 6], 2: [3, 5]},
 86: {0: [0, 2, 4, 6, 8]},
 87: {0: [1], 1: [0]},
 88: {0: [2], 1: [1], 2: [0]}}

def IsAttachedToNitrogen(atom):
    nbr = next(ob.OBAtomAtomIter(atom))
    return nbr.GetAtomicNum() == 7

def HasCommonValence(mol):
    for atom in ob.OBMolAtomIter(mol):
        elem = atom.GetAtomicNum()
        if elem not in common_valencies:
            continue # just skip unusual elements
            # return False # unusual elem
        chg = atom.GetFormalCharge()
        data = common_valencies[elem]
        if chg not in data:
            return False # unusual charge state
        totalbonds = atom.BOSum() + atom.GetImplicitHCount()
        if totalbonds not in data[chg]:
            if not(elem==8 and chg==0 and totalbonds==1 and IsAttachedToNitrogen(atom)): # TEMPO-like
                return False # unusual valence (and not TEMPO-like)
    return True

def calculate(smi):
    mol = pybel.readstring("smi", smi).OBMol
    if not HasCommonValence(mol):
        return smi
    return None

if __name__ == "__main__":
    POOLSIZE = 6 # the number of CPUs
    CHUNKSIZE = 1000
    pool = mp.Pool(POOLSIZE)
    with open("output.txt", "w") as out:
        with open(r"D:\LargeData\ChEMBL\chembl_23.ism", "r") as inp:
            #for result in pool.imap(calculate, inp, CHUNKSIZE):
            for result in pool.imap_unordered(calculate, inp, CHUNKSIZE):
            # for result in map(calculate, inp): # no multiprocessing
                if result:

Wednesday 25 April 2018

Cheminformatics for deep learners: Valid SMILES and valid molecules

Recent papers on generating SMILES strings via deep-learning approaches have awoken the cheminformatics pedant in me when I see references to "valid SMILES" and/or "valid molecules". Do these terms make any sense? Do they mean what the authors think they mean?

As a concrete example here's a line from Rafael Gómez-Bombarelli et al (a ground-breaking paper in the field, which appeared on Arxiv back in 2016): "This increased the accuracy of generated SMILES strings, which resulted in higher fractions of valid SMILES strings."

"Valid SMILES", eh? Here's a nice example of a valid SMILES: C#C(#C)(#C)(#C)(#C), a carbon connected via triple bonds to five other carbons. It's a syntactically valid SMILES string that is happily read by many toolkits; for example, we can use Open Babel to calculate the molecular weight of the corresponding molecule::
> obabel -:"C#C(#C)(#C)(#C)(#C)" -otxt --append mw

An invalid SMILES might be missing a parenthesis or ring closure, or begin with a bond symbol, or contain the element Zz or any number of things. The problem here is that the terms "in/valid SMILES" are used by the authors above with some other meaning, presumably related to the likelihood of the existence of the corresponding molecule. As I hope I have demonstrated, the validity of a SMILES string has nothing to do with whether the corresponding molecule might exist or not.

What I'm really talking about here is the difference between syntax and semantics: the meaning of the SMILES string versus its symbolic construction. Elsewhere Rafael Gómez-Bombarelli et al refers to "the fragility of [SMILES] syntax (opening and closing cycles and branches, allowed valences, etc.)". They should have stopped at "branches" - the "allowed valences" (whatever this may mean - it's undefined) is nothing to do with the syntax of a SMILES string.

So maybe "valid molecule" is a better term for that? Or at least that seems to be what people think. But "valid molecule" is an even more nebulous term - what is a valid molecule? One that's not a radical? One that might exist at standard temperatures and pressures without decomposing in a millisecond? Who knows. I think that what people actually mean is that the atoms in the molecule are all in common valences and charge states, or perhaps they just mean that it is rejected by RDKit (which might be for a number of reasons unrelated to the validity of the molecule, e.g. kekulization failure). If that's what they mean, they should just say that.

So please, a bit more clarity and a bit less woolly language. Think of the children cheminformaticians.

Tuesday 24 April 2018

Running CUDA samples with Visual Studio 2017

I've been installing the CUDA drivers on a Windows 10 box with Visual Studio 2017, and trying to get the CUDA samples to compile. Although solution files are provided for VS2017 (among other VSs), you will get something similar to the following error when you attempt to compile:
error MSB8036: The Windows SDK version 10.0.15063.0 was not found. Install the required version of Windows SDK or change the SDK version in the project property pages or by right-clicking the solution and selecting "Retarget solution".

Right-clicking on the solution and retargeting gets you a bit further:
fatal error C1189: #error:  -- unsupported Microsoft Visual Studio version! Only the versions 2012, 2013, 2015 and 2017 are supported!
...which is funny, because I am using VS2017. If you dig into it, it's the specific version that's the problem, and there doesn't seem to be an easy fix.

However, a nice feature (finally!) of VS2017 is that you can optionally install other compiler toolchains. If you rerun your VS2017 Installer, and find the Modify option (under More), you will see a whole bunch of extra features you can install under "Individual components". The one of interest here is "VC++ 2015.3 v140 toolset for desktop". Once installed, you can instead open the Visual Studio 2015 solutions, and the good news is that these successfully compile.

Saturday 14 April 2018

Generating multiple SMILES

While sometimes presented as a negative, the ability to generate multiple SMILES strings for the same molecule can also be a positive, particularly when you want to avoid bias (e.g. machine learning from SMILES - see here and here) or check that an algorithm is atom-order invariant.

Here are two different ways to generate multiple SMILES strings for the same molecule using Open Babel (without introducing dot disconnections). As an example, let's consider my favourite molecule: c1ccccc1C(=O)Cl.

The first approach is to use canonical SMILES...except that the canonical labels are generated randomly. You can do this directly at the commandline (see "obabel -Hsmi" for more info):
>obabel -:c1ccccc1C(=O)Cl -osmi -xC

Each time you do it, a different random SMILES string will be generated [1], up to a total of 16 variants (in this case):

We can generate even more variants by specifying the output order directly - this overrides some decisions that are usually left to the SMILES writer and allows us, for example, to force single bonds to be followed before double bonds:
>obabel -:c1ccccc1C(=O)Cl -osmi -xo 1-2-3-4-5-6-7-9-8

Using this approach, 32 variants can be generated:

In summary, these approaches allow you to generate all possible SMILES strings consistent with a depth-first ordering of atoms [2], starting from different points and choosing different routes at each branch point. For machine learning, I'd imagine that the first approach would be preferred as the second approach will generate SMILES strings that will contain substrings that would never be observed normally (in Open Babel SMILES).

Python code

import random
import pybel

def randomlabels(mol, N):
    ans = set()
    for i in range(N):
        ans.add(mol.write("smi", opt={"C":True}).rstrip())
    return sorted(list(ans))

def randomorder(mol, N):
    ans = set()
    numatoms = mol.OBMol.NumAtoms()
    for i in range(N):
        idxs = list(range(1, numatoms+1))
        optval = "-".join(str(x) for x in idxs)
        ans.add(mol.write("smi", opt={"o": optval}).rstrip())
    return sorted(list(ans))

if __name__ == "__main__":
    mol = pybel.readstring("smi", "c1ccccc1C(=O)Cl")

    print("Random canonical labels")
    randomsmis = randomlabels(mol, 500)
    for smi in randomsmis:
    print("Random output order")
    randomsmis = randomorder(mol, 500)
    for smi in randomsmis:

1. An alternative (but slower) way to generate these same SMILES would be to shuffle the atoms in the OBMol and then write it out as a SMILES string.
2. If dot disconnections are tolerated, then see Andrew Dalke's approach.

Tuesday 27 February 2018

Calling all students: Google Summer of Code and CSA Trust Grant both happening now!

Deadlines are fast approaching for the following:

Open Chemistry Google Summer of Code: If you're a student and interested in hacking on open source chemistry software (and get paid a bit for the privilege), then get on over to the Google Summer of Code (GSoC) page of the Open Chemistry project. A whole bunch of Open Source chemistry tools have gathered together (including Open Babel, natch) and come up with project ideas that hopefully will spark interest. If you've always wanted to get involved with Open Source but didn't know how, this is a good chance to do so.

CSA Trust Grant: I was a recipient of a CSA Trust Grant myself, and funny story, I'm now a CSA Trustee and on the Grant Committee. Applications are now invited - it's pretty straightforward. If you look at the details of previous recipients you can see the sorts of things they applied for. The success rate is pretty high, so I really do encourage you to apply. And if you don't get it this year, well, try again next year (it worked for me).

Saturday 3 February 2018

Using those other processors

This year, I've decided it's time to make more use of those other processors that my PC has. At some point, it gets embarrassing to see my single CPU job trundling along at its max speed, while 80 or 90% of the processing power is just sitting idle.

My typical input is a SMILES file such as ChEMBL or PubChem, or a file I've created by processing these, or a set of reaction SMILES extracted from patents. In any case, the problem is the same - a large number of inputs that can be processed in parallel. Sure, you can split the file (see the 'split' command, though be careful to split on lines) but I tend to avoid that as it's a bit clunky. If on Linux, there's the magic "GNU parallel". But what I really should be doing is making use of multiple CPUs directly in my Python scripts.

And actually, that's not too hard. The multiprocessing module makes it pretty easy once you've figured it out. Here's my goto template for these calculations. I try to keep it as slimmed down as possible, because there's some magic going on behind the scenes and I don't want to have to debug any complex problems. (Note that there's also a multithreading module, but CPython cannot parallelize CPU-bound calculations using threads due to its Global Interpreter Lock.)

import multiprocessing as mp
import pybel

def calculate(data):
    return pybel.readstring("smi", data).molwt

if __name__ == "__main__":
    POOLSIZE = 4 # the number of CPUs
    CHUNKSIZE = 1000
    pool = mp.Pool(POOLSIZE)
    with open("output.txt", "w") as out:
        with open(r"C:\Tools\LargeData\chembl_23.smi", "r") as inp:
            for result in pool.imap(calculate, inp, CHUNKSIZE):
            # for result in pool.imap_unordered(calculate, inp, CHUNKSIZE):
            # for result in map(calculate, inp): # no multiprocessing
                out.write("%f\n" % result)

1. This blog is now Python 3.
2. By editing the commented-out code, you can choose one of three variations. During development, you can avoid multiprocessing entirely by just using a regular map (equivalent to itertools.imap in Python 2). If using multiprocessing, you can choose to have the output in the same order as the input, or just have it as it comes; the latter is presumably faster.
3. If you need to kill a multiprocessing job, CTRL+C just won't do it, as it spawns an additional process to replace it (if anyone knows of a way to make this work please let me know). You need to use the Process Manager, choose the master Python process and kill that one.

Tuesday 23 January 2018

Which came first, the atom or the bond?

Let's suppose you need to iterate over the neighbours of an atom, and you want to know both the corresponding bond order and the atomic number of the neighbour. Given that toolkits typically provide iterators over the attached bonds or the attached atoms, but not both simultaneously, how exactly should you do this?

Should you:
  • (a) iterate over the neighbouring atoms, and then request the bond joining the two atoms?
  •     for nbr in ob.OBAtomAtomIter(atom):
            bond = atom.GetBond(nbr)
  • or (b) iterate over the attached bonds, and ask for the atom at the other end?
  •     for bond in ob.OBAtomBondIter(atom):
            nbr = bond.GetNbrAtom(atom)

Obviously, either way you get your atom and bond. But which is more efficient? Clearly, the answer to this depends on the internals of the toolkit. But if you assume that each atom knows its attached atoms and bonds, then it's only the second step that determines the relative efficiency. That is:
  • (a) given two atoms find the bond that joins them, versus
  • (b) given an atom and a bond find the atom at the other end

Since the implementation of (a) will probably involve the same test that (b) is doing plus additional work, it follows that (b) must be more efficient. I never really thought about this before until I was writing the kekulization code for Open Babel. It's the sort of thing that's useful to work out once and then apply in future without thinking. Sure, the speed difference may be minimal but given that you have to choose, you might as well write it the more efficient way.

Thursday 18 January 2018

Implementing the Sayle tautomer hash with Open Babel

One of the consequences of last year's overhaul of handling of kekulization, aromaticity and implicit hydrogens is the ability to easily calculate structure hashes such as Roger Sayle's tautomer hash, which I wrote up on the NextMove blog a while ago.

I routinely use this hash to handle tautomers, particularly when dealing with R groups. It doesn't solve all tautomer issues (e.g. ones that involve carbon) but it can quickly bring you from having no support for handling tautomers to getting you 95% of the way. In fact, I've been thinking about adding this (and some of the other hashes that Roger has come up with) as cansmi output options. Anyhoo, here's an implementation in Python:
import pybel
ob = pybel.ob

def tautomerhash(smi):
    """Take a SMILES and return the Sayle tautomer hash:
    mol = pybel.readstring("smi", smi).OBMol
    mol.DeleteHydrogens() # just in case
    formalcharges = 0
    hcount = 0
    for atom in ob.OBMolAtomIter(mol):
        formalcharges += atom.GetFormalCharge()
        if atom.GetAtomicNum() != 6: # non-carbon
            hcount += atom.GetImplicitHCount()
    for bond in ob.OBMolBondIter(mol):
    mol.SetAromaticPerceived() # no point triggering perception
    return "%s_%d" % (pybel.Molecule(mol).write("can").rstrip(), hcount-formalcharges)

if __name__ == "__main__":
    smis = ["*c1c(c(C(=N)O)cc2nc([nH]c12)C(=O)[O-])N(=O)=O",
    for smi in smis:
The two SMILES in the example code above are those from the original blog post. Here they give the following two identical hashes (note to self: 'fix' the bracketed asterisk):