Monday, 3 June 2019

Building a PubChem query

Evan Bolton was visiting recently and I got to watch over his shoulder as he built up a PubChem query on the website step-by-step. I helpfully added comments like "wait a second - how did you bring that up?" and "you can do that?", not to mention "ah - so that's how you do that".

The general idea is that any search you can do, you can combine with other searches with Boolean operators. It's a little bit confusing that the page that you use for combining them is not necessarily the same page you are using for doing the search, but once you're happy with that, it makes sense.

As an example, I am interested in those entries in the peptide dataset I uploaded to PubChem that do not have a biological annotation. This is a little bit complicated in that the biological annotations are a feature of PubChem Compound, but the datasets are all in PubChem Substance, so we are going to have to link from one to the other.

Here is a step-by-step guide which you may find useful to work through:

  1. First of all let's find all molecules in PubChem Compound that have a biological annotation. Go to, click "Browse Data", select classification "PubChem/PubChem Compound TOC". In the tree that appears at the bottom of the page click on the number next to "Biologic Description" to do the search. (May have to reload the page if it gives an error first time.)
  2. Next, let's find all molecules in PubChem Compound that came from my dataset. Go to, click "Explore Data Sources" towards the bottom right, type "NextMove" into the "Search Sources" box, and then "8,699 Live Substances" to do the search.
  3. The previous search was of PubChem Substance, which is where datasets live. We want to find the corresponding molecules from PubChem Compound, so on the right under "Find related data" select "PubChem Compound", and then "PubChem Same Compound" in the Option box that appears. Click "Find Items" to do the search.
  4. Finally, we need to get the intersection between the search described in 3 and all those not in 1. For this go to and click "Advanced". You will see under "History" the searches you have done. In my case, the relevant searches are #1 (step 1 above) and #4 (step 3 above), so I type into the builder box "#4", change "AND" to "NOT" on the next line and enter #1. Finally, I clicked "Search" to do the search. (261 results)

Sunday, 17 February 2019

Transitive reduction of large graphs

There are a number of graph algorithms with not-so-obvious names, and "transitive reduction" is one of those. When I first needed to implement this algorithm, I couldn't figure out whether it had a name, and I can tell you that it's hard to google for something when you don't know what it's called.

Anyway, what transitive reduction means is to remove the maximum number of edges from a directed acylic graph while keeping the 'reachability'; that is, you can remove any edge A->B in the graph if there's some other way to get from A to B. In practice, this is super-useful in tidying up a graph representing pairwise relationships, e.g. a partially-ordered set. For example, suppose you know that A>B, B>C and A>C. The graph showing this would have three edges, but you can reduce it to just two, A->B->C, that summarises the entire ordering.

I used this algorithm in my fingerprint paper, but more recently I've been working on another problem that also requires it, but this time the graphs are much bigger. At first, I used the tred program that comes with graphviz, but the graphs became too large for tred to finish within a reasonable amount of time. Next I used NetworkX, first from CPython and then from PyPy. The latter could just about handle it but required 30GB of memory and ran overnight, and so I moved onto C++. First I tried with igraph. This was painfuly slow - slower than the Python. It all comes down to the data structures that the library is using; I'm deleting 100s of millions of edges and though I batched up the deletions, it was just too slow.

Finally, I ended up at Boost Graph Library. This library is unusual in that you get to choose the data structures that the graph uses. If you need fast edge removal, then you can choose a linked list in which to store the edges. For the graphs of interest, the code ran in a matter of minutes (after a bit of profiling and optimisation) using a single GB of RAM.
#include <iostream>

#include <boost/graph/adjacency_list.hpp>
#include <boost/graph/graphviz.hpp>

using namespace boost;

typedef adjacency_list<listS, vecS, directedS> Graph;
typedef typename graph_traits<Graph>::vertex_descriptor Vertex;
typedef typename graph_traits<Graph>::edge_descriptor Edge;

class DFS
  DFS(Graph &graph): m_graph(graph)
    seen = (long*) calloc(num_vertices(m_graph), sizeof(long)); // sets to 0
    children = (long*) calloc(num_vertices(m_graph), sizeof(long)); // sets to 0
    child_edges = (Edge*) malloc(num_vertices(m_graph) * sizeof(Edge));
  void visit(Vertex root)
  void setParent(Vertex parent)

    typename graph_traits<Graph>::out_edge_iterator ei, ei_end;
    for (tie(ei, ei_end) = out_edges(parent, m_graph); ei != ei_end; ++ei) {
      Edge edge = *ei;
      Vertex src = source(edge, m_graph);
      if (src == parent) {
        Vertex tget = target(edge, m_graph);
        children[tget] = CHILD;
        child_edges[tget] = edge;
      else {
        children[src] = CHILD;
        child_edges[src] = edge;


  void dfs_visit(Vertex v)
    seen[v] = SEEN;
    if (children[v] == CHILD)
      remove_edge(child_edges[v], m_graph);

    std::pair<graph_traits<Graph>::adjacency_iterator, graph_traits<Graph>::adjacency_iterator> adj_vertices_it;
    for (adj_vertices_it = adjacent_vertices(v, m_graph); adj_vertices_it.first != adj_vertices_it.second; ++adj_vertices_it.first) {
      Vertex n2 = *(adj_vertices_it.first);
      if (seen[n2] != SEEN)

  Graph &m_graph;
  long *seen;
  long SEEN = 0; // flag used to mark visited vertices
  long *children;
  long CHILD = 0; // flag used to mark children of parent
  Edge *child_edges;

void help()
  std::cout << "Usage: tred input.graph\n"
            << "\n"
            << "Where the input.graph is a DAG described with the following format:\n"
            << "   NUM_VERTICES\n"
            << "   SRC_1 DST_1\n"
            << "   SRC_2 DST_2\n"
            << "   ...\n"
            << "\nVertices are assumed to be numbered from 0 to NUM_VERTICES-1\n";

int main(int argc, char* argv[])
  if (argc != 3)

  unsigned VERTEX_COUNT;

  std::cout << "Reading...\n";
  std::ifstream inputfile(argv[1]);
  if (!inputfile) {
    std::cout << "ERROR: Cannot open " << argv[1] << " for reading\n";

  inputfile >> VERTEX_COUNT;
  Graph graph(VERTEX_COUNT);
  int a, b;
  while (inputfile >> a >> b)
    add_edge(a, b, graph);

  DFS dfs(graph);

  typename graph_traits<Graph>::adjacency_iterator ai, ai_end, bi, bi_end;
  typename graph_traits<Graph>::vertex_iterator vi, vi_end;

  std::cout << "Reducing...\n";
  for (tie(vi, vi_end) = vertices(graph); vi != vi_end; ++vi) {
    Vertex n1 = *vi;
    if (n1 % 100 == 0)
      std::cout << n1 << "\n";
    for (tie(ai, ai_end) = adjacent_vertices(n1, graph); ai != ai_end; ++ai) {
      Vertex n2 = *ai;
      for (tie(bi, bi_end) = adjacent_vertices(n2, graph); bi != bi_end; ++bi) {
        Vertex n3 = *bi;

  std::cout << "Writing...\n";
  std::ofstream outputfile(argv[2]);
  if (!outputfile) {
    std::cout << "ERROR: Cannot open " << argv[2] << " for writing\n";
  write_graphviz(outputfile, graph);

  return 0;

Sunday, 25 November 2018

SMARTS for dummies

One of Roger's observations came to mind recently when I was looking at some SMARTS patterns in Open Babel that were causing some problems: "People cannot write SMARTS correctly." Instead, he argues, they should use tools to create SMARTS for them. The problem is not so much that patterns don't match molecules of interest, but that they match additional molecules (false positives) and may miss others (false negatives). False positives are a particular problem when transformations (e.g. SMIRKS) are applied to matching molecules, as they can result in exotic valence and charge states that were unintended.

The following is a description of a tool I put together to help craft SMARTS patterns. It is based upon the idea that a SMARTS pattern is completely defined by the molecules it matches. Thus, by looking at the matches in a database of sufficient size, we can see whether there are any mismatches.

So far, so obvious. What's novel is that the tool only returns distinct matches based upon simple atom types of the matching atoms, and sorts them by probability of those atom types (most unusual first). Thus, even for common substructures the number of hits is small and information-rich. Furthermore, by using my favourite trick of sorting the reference database by size (ChEMBL here), this means that the first hit found for each distinct match is likely to be small in size.

An example will illustrate: let's suppose we want to create a SMARTS pattern to match alcohols. I know that "CO" in SMILES indicates this, and so let's start with that. Only 28 hits are returned (from a 10% subset of ChEMBL), of which the first three are:
Ok, obviously we don't want a charged oxygen, so let's try "C[O+0]" and look at the first three hits:
And the oxygen should have a hydrogen, "C[Oh1+0]":
Now, about that carbon. I only had sp3-hybridised carbons in mind, so how about "[CX4][Oh1+0]"? Here is the full list of hits at this point:
At this point the hits are what I expected to see. Does this mean I should stop here? No, but it's a reasonable start.

After using this tool for a while, you begin to appreciate the importance of nailing down exactly what you want to see: if you only want to match neutral carbons, you should say so; if you don't want the oxygen to have additional neighbours, you should say so. Once every degree of freedom is nailed down...then it's finished. You might think this is going too far, but the molecules that exist in databases don't necessarily conform to typical valence rules - they may be unusual (consider TEMPO) or downright incorrect. Your SMARTS pattern should match exactly what you had in mind and not leave any wiggle room to match anything more: "[Cv4X4+0]-[OD1h1+0]"

This tool is by no means a panacea; you still need to think for yourself, and consider for example what to do about aromaticity, or the presence of explicit hydrogens or hydrogen isotopes, or whether to reorder the pattern for speed. But one could imagine a tool that could ease the burden further and do some of this for the user, or for example allow the user to mark "match" or "don't match" for the set of hits returned to automatically refine the query.

Tuesday, 2 October 2018

DeepSMILES - a SMILES-like syntax for use in machine-learning

At the last ICCS, Andrew Dalke and I got talking. We had both been around the poster session, and had heard first-hand the problems of "invalid SMILES" that were affecting the generation of SMILES strings using deep neural networks. I remember Xuhan Liu in particular asking me whether I had a solution to this. Well, I had been turning over an idea in my mind for a little while, and by the time I got talking to Andrew, I had decided to do something about it. Now Andrew knows a thing or two about SMILES himself, so I shouldn't have been surprised that he had also come up with something (along with two references from 1964 to back it up). We put the two ideas together and called it DeepSMILES.

The problematic areas of SMILES syntax involve paired ring closure symbols for cycles, and parentheses for branches. These particular aspects of the syntax are difficult to reproduce correctly when generating SMILES strings using machine-learning methods, and so a certain percentage of generated SMILES tend to fail basic syntax checks. While there have been a variety of approaches aimed at improving the SMILES generation (with quite some success), it is reasonable to assume that the syntax also causes difficulties during the learning phase.

Our approach is not to use SMILES, but instead an alternative syntax that does not have these problems. Paired ring closures are replaced by a single digit, the ring size; paired parentheses are replaced by close parentheses indicating branch size. See the talk below, the preprint, or the GitHub site for more information. Feedback (positive or negative) is welcome, either here or on GitHub.

Friday, 20 July 2018

Streamlining the handling of aromaticity

I've done a bit of work over the last year or two stripping back some accumulated cruft in Open Babel. One question that's been on my mind during this process is how much can you simplify the toolkit's core but yet leave it capable of complex behaviour?

Take the case of aromaticity handling. Just this part of the toolkit on it's own raises many questions. Should aromaticity be lazily perceived or require an explicit function call? What happens when the user modifies a molecule? What about copying a molecule? Or just copying a subset of the atoms to another molecule? What should happen when you add two molecules together? If you read aromaticity from a SMILES string, what if it's a different model than the toolkit uses internally? Should the SMILES writer reperceive aromaticity, or just use it as presented?

Often the easiest solution to these problems is to always do the maximum amount of work possible (i.e. throwing away perceived aromaticity information at every opportunity), which I wanted to avoid at all costs. So I went through removing bits and simplifying things, making sure that aromaticity information was copied where possible, and hoping that in the end all of the complex behaviour that I wanted to maintain would still be possible without adding back fixes or kludges. And fortunately it was. I even managed to add additional behaviour, an option to keep the aromaticity information as read from a SMILES string.

I'm a firm believer that there's no point adding features or improving things if you don't write documentation explaining why/how/when they should use it. What's the point doing this work if no-one knows how they can take advantage of it? So, I've just written up documentation on how Open Babel handles aromaticity. The following (exclusive!) excerpt describes how aromaticity information is stored by Open Babel.

Handling of aromaticity

The purpose of this section is to give an overview of how Open Babel handles aromaticity. Given that atoms can be aromatic, bonds can be aromatic, and that molecules have a flag for aromaticity perceived, it’s important to understand how these all work together.

How is aromaticity information stored?

Like many other toolkits, Open Babel stores aromaticity information separate from bond order information. This means that there isn’t a special bond order to indicate aromatic bond. Instead, aromaticity is stored as a flag on an atom as well as a flag on a bond. You can access and set this information using the following API functions:

  • OBAtom::IsAromatic(), OBAtom::SetAromatic(), OBBond::UnsetAromatic()
  • OBBond::IsAromatic(), OBBond::SetAromatic(), OBBond::UnsetAromatic()

There is a catch though, or rather a key point to note. OBMols have a flag to indicate whether aromaticity has been perceived. This is set via the following API functions:

  • OBMol::SetAromaticPerceived(), OBMol::UnsetAromaticPerceived()

The value of this flag determines the behaviour of the OBAtom and OBBond IsAromatic() functions.

  • If the flag is set, then IsAromatic() simply returns the corresponding value of the atom or bond flag.
  • If unset, then IsAromatic() triggers aromaticity perception (from scratch), and then returns the value of the flag.

See for the nail-biting conclusion to this thrilling exposition.

Tuesday, 10 July 2018

Clarifying the Cahn-Ingold-Prelog rules

What happens when four software developers get together and compare their implementations of the CIP rules? This is the background to a recent preprint deposited in ChemRxiv.

The CIP system is a series of rules that describe how to assign a stereodescriptor (e.g. R/S, E/Z) to a stereocentre. When Bob Hanson decided to add support for CIP to Jmol, rather than simply read the rules and implement it according to his interpretation as others have done, he decided to work with three other implementations to challenge each other on disagreements and clarify the wording of the rules.

The result is described in:
Algorithmic Analysis of Cahn-Ingold-Prelog Rules of Stereochemistry: Proposals for Revised Rules and a Guide for Machine Implementation. Robert M. Hanson, John W. Mayfield, Mikko J. Vainio, Andrey Yerin, Dmitry Redkin, and Sophia
Musacchio []

Essentially, the issue that the authors are addressing is the fact that existing implementations even in "highly respected software packages" disagree with each other (see John Mayfield's presentation). By comparing the implementations in Jmol, Centres, Balloon and ChemSketch they were able to identify cases where:
"(a) the disagreement was due to different interpretations of CIP rules among software developers, (b) there was a problem with an algorithm or its implementation in code, or (c) the CIP rules themselves were flawed."

In all cases, however, they were able to come to a consensus, which led to "the discovery of a small number of errors in the Blue Book, two minor flaws in the CIP rules, and a proposal for a new rule".

The paper walks through their discussions of each rule in turn, looking at any issues arising and clarifying any ambiguities. It also includes a validation suite (browse it here) that covers all aspects of the rules and will allow future CIP implementations to avoid the pitfalls that have beset the field in the past.

Thursday, 14 June 2018

Cheminformatics for deep learners: Canonical SMILES and why you should avoid them

I've already gotten the "valid SMILES" issue off my chest, but there's something else that's been bugging me about deep learning papers in chemistry, and that's the use of canonical SMILES for training. Why would you use canonical SMILES? And if you did, wouldn't that introduce some problems?

This is one of those cases where this just feels like a mistake to me - and I'm going to do my best to articulate my concerns. However, I'm not familiar enough with the mechanics of DNNs to be sure, but I hope those in the field will consider the points I raise.

The problem

So, canonical SMILES. Behind the scenes, the canonicalisation procedure gives a unique label to each atom (typically 1 to N). These labels are then used to generate a canonical SMILES, typically by starting at the lowest label (but not necessarily). The canonicalisation procedure is based upon the attributes of the graph, with the result that the first atom in a canonical SMILES tends to favor particular atom types and avoid others.

For example, if you look at the atom types for the first atom in the canonical SMILES generated by RDKit for ChEMBL molecules, you will find that the second most common atom type in ChEMBL (namely, *-C(-*)=*) never appears as the first atom in a canonical SMILES string. This is by design and you'll see similar behaviour with other toolkits - SMILES strings tend to start with degree 1 atoms.

So what if the distribution of atom types is different for the first atom?

Well, firstly, I predict that as a result these atom types will be over-represented in structures generated by DNNs (and others under-represented). If you train on canonical SMILES, then the probabilities for the first atom will be determined by the atom types favored as starting atoms by canonical SMILES. Consider the extreme example where fluorine always occurs as the first atom in any canonical SMILES that contains it; you should see an increased number of fluorines as the first atom in the generated molecules. Now you could argue that the probabilities for the remaining atoms will be adjusted accordingly, but I believe that there is a strong edge affect and that any correction will attenuate as the SMILES string becomes longer.

Secondly, this bias makes the DNNs job harder. Instead of a relatively even distribution of atom types at all points in the SMILES string, the distribution will depend on the distance from the starting atom. It's now trying to learn something about the properties of canonical SMILES instead on concentrating on the task at hand...

...which bring me nicely to the third point. Predictive models attempt to deduce a property value from the structure, and a SMILES string is used by DNNs as a proxy for the structure. Using a canonical SMILES string is another step removed. What about a molecule with a very similar structure but very different canonical SMILES? Surely the goal of a robust model is to handle this well. Restricting good predictive power to only those structures that are both similar and have similar canonical SMILES is to develop a model with a reduced applicability. A fun task is do is to measure the degree to which this fitting to canonical SMILES occurs; this is left to the reader.

The solution

The solution is simple. Use random SMILES. A single one, or multiple. The use of multiple random SMILES has already been described by Thomas Bergwinkl and subsequently by Esben Jannik Bjerrum as a 'data augmentation technique', but I see it as just avoiding the inherent bias of using canonical SMILES. But either way, I like this quote from Thomas:
The output for alternative representations of a molecule should be the same, if you understand SMILES. Using alternative representations in the test data allows to verify if the neural network understands SMILES. Spoiler: After a while the output becomes very close for the alternatives!

So why do people use canonical SMILES in the first place? I have my theory on this.

I believe it's because the generative models more quickly converge to generation of syntactically valid SMILES strings when they train on canonical SMILES. And for some reason, the percentage of syntactically valid SMILES strings generated by the model has become a figure of merit.

But this makes no sense - who cares what this percentage is? Sure, we can all overfit to canonical SMILES and get high percentages quickly. But how is this a good thing? You know that feeling you get when you see a QSAR model with very high R2 on training data - that's how I feel when I see a high value for this percentage. If it's actually doing what it's supposed to be doing (i.e. learning the underlying structure rather than the training set of canonical SMILES), then the percentage should really be lower. What do I care if the percentage of syntactically valid SMILES is 1%? So long as that 1% solves my problem, it's irrelevant - these structures are spewed out of these models thousands per second (I presume, but even so).

Please let him stop talking now

Okay, okay - I'm done. What do you think?