Sunday, 15 September 2019

partialsmiles - A validating parser for partial SMILES

I've just released version 1.0 of partialsmiles, which is available via pip. Documentation can be found here and the project is hosted by GitHub.

partialsmiles is a Python library that provides a validating SMILES parser with support for partial SMILES. The parser checks syntax, ability to kekulize aromatic system, and checks whether atoms’ valences appear on a list of allowed valences.

The main use of the library is to accept or reject SMILES strings while they are being generated character-by-character by a machine-learning method. Previously the only way to check such strings was after the entire string was generated, a process that could take 10s of seconds. Using partialsmiles, SMILES strings with invalid syntax or problematic semantics can be detected while they are being generated, allowing them to be discarded or for alternate tokens to be suggested.

Some other toolkits (such as OEChem and the CDK) can read partial SMILES; should you use these instead? Certainly, if they meet your needs. Their goal is to return a molecule object that captures bonding information so far (useful for ‘depiction as you type’, for example), perhaps converting partial aromatic rings to single bonds and unmatched ring connections to dangling bonds. This is in contrast to the goal of partialsmiles which is to work out whether the partial SMILES string could be the prefix of a complete SMILES string that would be accepted by the parser. For example, "CC(C)(" may be read as propane by other readers, but partialsmiles will reject it with a valence error as completion of the string will lead to a five or more valent carbon (see the docs).

A secondary goal of this project is to make available a permissively-licensed SMILES parser that correctly reads all aromatic systems in the SMILES reading benchmark. This is with the hope to encourage improvements in existing parsers.

And finally, I just want to thank Xuhan Liu for providing the suggestion that such a project would be useful.

Image credit: half by MandoBarista (licensed CC BY-SA)

Saturday, 14 September 2019

Moving to pastures new, but still in the same field Part III

Following on from my previous post on the topic, 30th August was my last day at NextMove Software.

Monday week I'll be joining the Computational Chemistry team at Sosei Heptares which is lead by Chris de Graaf. If you're not familiar with the company, the R&D site is based just outside Cambridge (UK) and works on structure-based drug discovery with a focus on GPCRs. I'm very excited about this move as I'm really looking forward to having a chance to apply my skills more directly to drug discovery.

Here's a high-level overview of what the company does, which I found quite useful (jump to 1:36 for comp chem):

Friday, 13 September 2019

R you similar? Inferring R group similarity from medicinal chemistry data

While molecular similarity is widely-studied (even by me!), the question of how to measure R group similarity is much less so, despite the fact that R group replacement is very commonly applied in medicinal chemistry projects.

My motivating example is fluoro and chloro - are these R groups similar? Well, a chemical fingerprint would say they have no bits in common and hence have zero similarity. But we know they're similar. Not just because they are nearby in the periodic table, but because chemists tend to make molecules that have Cl replaced with F.

In other words, we can use medicinal chemistry project data to infer a measure of R group similarity by looking at what R groups co-occur in medicinal chemistry projects as matched pairs. This measure can be used to suggest changes, form the basis of an enumeration strategy, or search for similar molecules. Because this similarity measure is derived from medicinal chemistry data, the results should be reasonable and interpretable.

At the recent ACS meeting in San Diego, I described my work on a measure of R group similarity:
Measuring R group similarity using medicinal chemistry data

I found it difficult to jam everything into my 20+5min talk, so let me add a bit of discussion here on how it compares to previous work...

There certainly has been some interesting work on R group similarity from Sheffield (e.g. Holliday et al), and more recently from this year's Skolnik Award winner Kimito Funatsu (Tamura et al), among others. But perhaps the most similar work is that on large-scale identification of bioisosteric replacements from the Bajorath group (e.g. Wassermann and Bajorath) and Swiss-Bioisotere (Wirth et al). Others have used 3D data from the PDB to further refine predictions, e.g. Seddon et al and LeŇ°nik et al.

The thing is, most of the time in a med chem project (correct me if wrong) you're not looking for bioisosteric replacements - you want to improve potency, or you want just want to probe the binding site, e.g. see can you grow in this direction or that direction. The changes people make are often to R groups around the same size (maybe a bit bigger), but not always to ones with the same properties. For example, if an electron-withdrawing group doesn't work, time to test an electron-donating group. Of course, towards the end, you might have potency nailed and are trying to work around some physchem properties - at this point, bioisosteric replacements come into play.

Something else that I feel previous work has missed is the time dimension. The path of R group replacements can be envisaged as starting with a hydrogen or methyl, and through an iterative process ending up somewhere a whole lot more complicated. Although methyl co-occurs with the vast majority of R groups (as a matched molecular pair), after a certain point in a project that's not a useful suggestion; it's already been done, or already been ruled out. What you want to know is what R groups are typically made before versus after. You can easily work this out if you have time series information from med chem projects, but as I show in the talk, even in the absence of explicit time series data, it's possible to infer the approximate order in which R groups are made if you have a large number of sets of med chem project data.

I had a couple of questions about how to turn this method of finding similar R groups into a similarity metric. I turned this question back on the questioners: "Why do you want that? My guess is to use it to find similar molecules." But you can find similar molecules without turning the similarity into a number between 0 and 1; the talk describes one way to do this.

Oh, and here's the poster on the same topic I showed earlier this year at Sheffield and subsequently at the ACS.

Thursday, 12 September 2019

Talk about making a hash of a molecule

At the recent ACS meeting in San Diego, I presented a talk entitled:
Making a hash of it: The advantage of selectively leaving out structural information

Firstly, this is an overview of a general technique for solving a wide range of cheminformatics problems; namely, the calculation and comparison of hashes of molecules to find duplicates with respect to some structural property of interest.

It also introduces MolHash to the world, a command-line application from NextMove Software which is freely available from GitHub (*).

* If you don't fancy compiling it yourself, you may find Matt Swain's conda version useful.

Monday, 3 June 2019

Building a PubChem query

Evan Bolton was visiting recently and I got to watch over his shoulder as he built up a PubChem query on the website step-by-step. I helpfully added comments like "wait a second - how did you bring that up?" and "you can do that?", not to mention "ah - so that's how you do that".

The general idea is that any search you can do, you can combine with other searches with Boolean operators. It's a little bit confusing that the page that you use for combining them is not necessarily the same page you are using for doing the search, but once you're happy with that, it makes sense.

As an example, I am interested in those entries in the peptide dataset I uploaded to PubChem that do not have a biological annotation. This is a little bit complicated in that the biological annotations are a feature of PubChem Compound, but the datasets are all in PubChem Substance, so we are going to have to link from one to the other.

Here is a step-by-step guide which you may find useful to work through:

  1. First of all let's find all molecules in PubChem Compound that have a biological annotation. Go to, click "Browse Data", select classification "PubChem/PubChem Compound TOC". In the tree that appears at the bottom of the page click on the number next to "Biologic Description" to do the search. (May have to reload the page if it gives an error first time.)
  2. Next, let's find all molecules in PubChem Compound that came from my dataset. Go to, click "Explore Data Sources" towards the bottom right, type "NextMove" into the "Search Sources" box, and then "8,699 Live Substances" to do the search.
  3. The previous search was of PubChem Substance, which is where datasets live. We want to find the corresponding molecules from PubChem Compound, so on the right under "Find related data" select "PubChem Compound", and then "PubChem Same Compound" in the Option box that appears. Click "Find Items" to do the search.
  4. Finally, we need to get the intersection between the search described in 3 and all those not in 1. For this go to and click "Advanced". You will see under "History" the searches you have done. In my case, the relevant searches are #1 (step 1 above) and #4 (step 3 above), so I type into the builder box "#4", change "AND" to "NOT" on the next line and enter #1. Finally, I clicked "Search" to do the search. (261 results)

Sunday, 17 February 2019

Transitive reduction of large graphs

There are a number of graph algorithms with not-so-obvious names, and "transitive reduction" is one of those. When I first needed to implement this algorithm, I couldn't figure out whether it had a name, and I can tell you that it's hard to google for something when you don't know what it's called.

Anyway, what transitive reduction means is to remove the maximum number of edges from a directed acylic graph while keeping the 'reachability'; that is, you can remove any edge A->B in the graph if there's some other way to get from A to B. In practice, this is super-useful in tidying up a graph representing pairwise relationships, e.g. a partially-ordered set. For example, suppose you know that A>B, B>C and A>C. The graph showing this would have three edges, but you can reduce it to just two, A->B->C, that summarises the entire ordering.

I used this algorithm in my fingerprint paper, but more recently I've been working on another problem that also requires it, but this time the graphs are much bigger. At first, I used the tred program that comes with graphviz, but the graphs became too large for tred to finish within a reasonable amount of time. Next I used NetworkX, first from CPython and then from PyPy. The latter could just about handle it but required 30GB of memory and ran overnight, and so I moved onto C++. First I tried with igraph. This was painfuly slow - slower than the Python. It all comes down to the data structures that the library is using; I'm deleting 100s of millions of edges and though I batched up the deletions, it was just too slow.

Finally, I ended up at Boost Graph Library. This library is unusual in that you get to choose the data structures that the graph uses. If you need fast edge removal, then you can choose a linked list in which to store the edges. For the graphs of interest, the code ran in a matter of minutes (after a bit of profiling and optimisation) using a single GB of RAM.
#include <iostream>

#include <boost/graph/adjacency_list.hpp>
#include <boost/graph/graphviz.hpp>

using namespace boost;

typedef adjacency_list<listS, vecS, directedS> Graph;
typedef typename graph_traits<Graph>::vertex_descriptor Vertex;
typedef typename graph_traits<Graph>::edge_descriptor Edge;

class DFS
  DFS(Graph &graph): m_graph(graph)
    seen = (long*) calloc(num_vertices(m_graph), sizeof(long)); // sets to 0
    children = (long*) calloc(num_vertices(m_graph), sizeof(long)); // sets to 0
    child_edges = (Edge*) malloc(num_vertices(m_graph) * sizeof(Edge));
  void visit(Vertex root)
  void setParent(Vertex parent)

    typename graph_traits<Graph>::out_edge_iterator ei, ei_end;
    for (tie(ei, ei_end) = out_edges(parent, m_graph); ei != ei_end; ++ei) {
      Edge edge = *ei;
      Vertex src = source(edge, m_graph);
      if (src == parent) {
        Vertex tget = target(edge, m_graph);
        children[tget] = CHILD;
        child_edges[tget] = edge;
      else {
        children[src] = CHILD;
        child_edges[src] = edge;


  void dfs_visit(Vertex v)
    seen[v] = SEEN;
    if (children[v] == CHILD)
      remove_edge(child_edges[v], m_graph);

    std::pair<graph_traits<Graph>::adjacency_iterator, graph_traits<Graph>::adjacency_iterator> adj_vertices_it;
    for (adj_vertices_it = adjacent_vertices(v, m_graph); adj_vertices_it.first != adj_vertices_it.second; ++adj_vertices_it.first) {
      Vertex n2 = *(adj_vertices_it.first);
      if (seen[n2] != SEEN)

  Graph &m_graph;
  long *seen;
  long SEEN = 0; // flag used to mark visited vertices
  long *children;
  long CHILD = 0; // flag used to mark children of parent
  Edge *child_edges;

void help()
  std::cout << "Usage: tred input.graph\n"
            << "\n"
            << "Where the input.graph is a DAG described with the following format:\n"
            << "   NUM_VERTICES\n"
            << "   SRC_1 DST_1\n"
            << "   SRC_2 DST_2\n"
            << "   ...\n"
            << "\nVertices are assumed to be numbered from 0 to NUM_VERTICES-1\n";

int main(int argc, char* argv[])
  if (argc != 3)

  unsigned VERTEX_COUNT;

  std::cout << "Reading...\n";
  std::ifstream inputfile(argv[1]);
  if (!inputfile) {
    std::cout << "ERROR: Cannot open " << argv[1] << " for reading\n";

  inputfile >> VERTEX_COUNT;
  Graph graph(VERTEX_COUNT);
  int a, b;
  while (inputfile >> a >> b)
    add_edge(a, b, graph);

  DFS dfs(graph);

  typename graph_traits<Graph>::adjacency_iterator ai, ai_end, bi, bi_end;
  typename graph_traits<Graph>::vertex_iterator vi, vi_end;

  std::cout << "Reducing...\n";
  for (tie(vi, vi_end) = vertices(graph); vi != vi_end; ++vi) {
    Vertex n1 = *vi;
    if (n1 % 100 == 0)
      std::cout << n1 << "\n";
    for (tie(ai, ai_end) = adjacent_vertices(n1, graph); ai != ai_end; ++ai) {
      Vertex n2 = *ai;
      for (tie(bi, bi_end) = adjacent_vertices(n2, graph); bi != bi_end; ++bi) {
        Vertex n3 = *bi;

  std::cout << "Writing...\n";
  std::ofstream outputfile(argv[2]);
  if (!outputfile) {
    std::cout << "ERROR: Cannot open " << argv[2] << " for writing\n";
  write_graphviz(outputfile, graph);

  return 0;

Sunday, 25 November 2018

SMARTS for dummies

One of Roger's observations came to mind recently when I was looking at some SMARTS patterns in Open Babel that were causing some problems: "People cannot write SMARTS correctly." Instead, he argues, they should use tools to create SMARTS for them. The problem is not so much that patterns don't match molecules of interest, but that they match additional molecules (false positives) and may miss others (false negatives). False positives are a particular problem when transformations (e.g. SMIRKS) are applied to matching molecules, as they can result in exotic valence and charge states that were unintended.

The following is a description of a tool I put together to help craft SMARTS patterns. It is based upon the idea that a SMARTS pattern is completely defined by the molecules it matches. Thus, by looking at the matches in a database of sufficient size, we can see whether there are any mismatches.

So far, so obvious. What's novel is that the tool only returns distinct matches based upon simple atom types of the matching atoms, and sorts them by probability of those atom types (most unusual first). Thus, even for common substructures the number of hits is small and information-rich. Furthermore, by using my favourite trick of sorting the reference database by size (ChEMBL here), this means that the first hit found for each distinct match is likely to be small in size.

An example will illustrate: let's suppose we want to create a SMARTS pattern to match alcohols. I know that "CO" in SMILES indicates this, and so let's start with that. Only 28 hits are returned (from a 10% subset of ChEMBL), of which the first three are:
Ok, obviously we don't want a charged oxygen, so let's try "C[O+0]" and look at the first three hits:
And the oxygen should have a hydrogen, "C[Oh1+0]":
Now, about that carbon. I only had sp3-hybridised carbons in mind, so how about "[CX4][Oh1+0]"? Here is the full list of hits at this point:
At this point the hits are what I expected to see. Does this mean I should stop here? No, but it's a reasonable start.

After using this tool for a while, you begin to appreciate the importance of nailing down exactly what you want to see: if you only want to match neutral carbons, you should say so; if you don't want the oxygen to have additional neighbours, you should say so. Once every degree of freedom is nailed down...then it's finished. You might think this is going too far, but the molecules that exist in databases don't necessarily conform to typical valence rules - they may be unusual (consider TEMPO) or downright incorrect. Your SMARTS pattern should match exactly what you had in mind and not leave any wiggle room to match anything more: "[Cv4X4+0]-[OD1h1+0]"

This tool is by no means a panacea; you still need to think for yourself, and consider for example what to do about aromaticity, or the presence of explicit hydrogens or hydrogen isotopes, or whether to reorder the pattern for speed. But one could imagine a tool that could ease the burden further and do some of this for the user, or for example allow the user to mark "match" or "don't match" for the set of hits returned to automatically refine the query.