Monday 30 December 2019

No charge - A simple approach to neutralising charged molecules

There are several reasons you might want to convert a molecule with charges to its neutral form. For example, as a step in standardising a representation for registration, to generate a parent form during registration, to deduplicate vendor-supplied databases, or to simplify SMARTS pattern searches for a particular ionizable group. Here I describe a simple procedure to generate the neutral form of a molecule, and briefly compare it to existing approaches. While a more nuanced approach may be more appropriate to prepare molecules for registration, the described method appears to be suitable for the other applications listed above.

Algorithm
To a first approximation the procedure is simply to identify all atoms with either a +1 or -1 charge, set their charges to zero and adjust their hydrogen counts by -1 or +1 (i.e. we are adding/removing H+). The first minor wrinkle is that +1 charged atoms must have a hydrogen, or otherwise we can't remove H+. The second wrinkle is that we must avoid altering charge-separated representations of multiple bonds, such as nitro which is often represented as [N+](=O)[O-]. I do this by checking whether a neighbor atom has a negative charge (for the +1 atoms) or a positive charge (for the -1 atoms).

Test set
For the purposes of timing implementations of the algorithm and comparing results to another algorithm, I created a test set from a ChEMBL 23 SMILES file I had to hand:
> obabel chembl_23.smi -r -O singlecomp.smi
> obabel singlecomp.smi -aa -s "[!+0]" -O charged.smi
> C:\cygwin\bin\gawk '{print length($0), $1, $2 }' < charged.smi | C:\cygwin\bin\sort -n > tmp.txt
> C:\cygwin\bin\cut -f 2,3 -d " " tmp.txt > charged.sorted.smi

Preamble
First, my usual boilerplate. The code below does not actually use Pybel, but I find it convenient to import both Pybel and openbabel as follows:
from openbabel import pybel
ob = pybel.ob
Implementations
I will describe 4 implementations of the algorithm above which show my progress in optimising the code for speed. Each implementation gives identical results on the test set, but runs faster. Note that hydrogens are assumed to be implicit (this will be the case if SMILES written by Open Babel are used as input, as in the test set above); if not, then they must either be converted to implicit, or the algorithm adjusted to take this into account.

Version 1
This would be the fastest version, if this were written in C++. When applied to the 125,536 molecules in the test set, this version spends 11.3s in the neutralize_v1 function.
def NoNegativelyChargedNbr(atom):
    for nbr in ob.OBAtomAtomIter(atom):
        if nbr.GetFormalCharge() < 0:
            return False
    return True

def NoPositivelyChargedNbr(atom):
    for nbr in ob.OBAtomAtomIter(atom):
        if nbr.GetFormalCharge() > 0:
            return False
    return True

def neutralize_v1(mol):
    """Neutralize charges of +1 or -1"""
    for atom in ob.OBMolAtomIter(mol):
        chg = atom.GetFormalCharge()
        if chg == 1:
            hcount = atom.GetImplicitHCount()
            if hcount >= 1 and NoNegativelyChargedNbr(atom):
                atom.SetFormalCharge(0)
                atom.SetImplicitHCount(hcount - 1)
        elif chg == -1: # -ve charge
            hcount = atom.GetImplicitHCount()
            if NoPositivelyChargedNbr(atom): # avoid nitro, etc.
                atom.SetFormalCharge(0)
                atom.SetImplicitHCount(hcount + 1)

Version 2
Iterating over all of the atoms of all of the molecules in Python takes some time due to the memory allocation for each atom on the Python side. We can avoid this by using a SMARTS pattern to find just those atoms that have a +1/-1 charge; now the Python code will only have to iterate over the matched atoms, usually zero or a small number.

With this change, only 5.3s is spent in the neutralize function.
pat_v2 = ob.OBSmartsPattern()
pat_v2.Init("[+1,-1]")

def neutralize_v2(mol):
    """Neutralize charges of +1 or -1"""
    pat_v2.Match(mol)
    matches = pat_v2.GetMapList()
    for match in matches:
        atom = mol.GetAtom(match[0])
        chg = atom.GetFormalCharge()
        if chg == 1: # +1 charge
            hcount = atom.GetImplicitHCount()
            if hcount >= 1 and NoNegativelyChargedNbr(atom):
                atom.SetFormalCharge(0)
                atom.SetImplicitHCount(hcount - 1)
        else: # -1 charge
            hcount = atom.GetImplicitHCount()
            if NoPositivelyChargedNbr(atom):
                atom.SetFormalCharge(0)
                atom.SetImplicitHCount(hcount + 1)

Version 3
Now that we're using a SMARTS pattern, we might as well jam more of the logic into it. The first is the hydrogen check for +1 charged atoms; the second is the test to ensure that there isn't a neighboring atom with a negative charge (for +1 atoms) or positive charge (for -1 atoms). Unfortunately SMARTS does not provide a direct test for an atom bearing positive or negative charge, so I just end up listing a bunch of charges.

With these changes, the helper functions for checking charges on neighboring atoms are no longer needed, and in addition the 'chg' branches in the code can be unified. As a result, Version 3 is considerably shorter than earlier versions and faster as well, taking 3.5s.
pat_v3 = ob.OBSmartsPattern()
pat_v3.Init("[+1!h0!$([*]~[-1,-2,-3,-4]),-1!$([*]~[+1,+2,+3,+4])]")

def neutralize_v3(mol):
    """Neutralize charges of +1 or -1"""
    pat_v3.Match(mol)
    for match in pat_v3.GetMapList():
        atom = mol.GetAtom(match[0])
        chg = atom.GetFormalCharge()
        hcount = atom.GetImplicitHCount()
        atom.SetFormalCharge(0)
        atom.SetImplicitHCount(hcount - chg)

Version 4
I was going to leave it there, but on checking the OBSmartsPattern API I noticed a method that enables re-use of the result vector (rather than returning a new one each time). This is useful, because it means that it's possible to have a single memory allocation for the vector up-front, and by re-using it, avoid memory allocations for all subsequent molecules (strictly speaking, there may be a handful of reallocations to make it bigger if necessary, but you're talking ~10 memory allocations versus 125K). I also realised that it's probably a good idea to check the Boolean returned by the Match method.

With these changes, Version 4 spends 1.5s in neutralise, a considerable improvement compared to the original.
matchvector_v4 = ob.vectorvInt()

def neutralize_v4(mol):
    """Neutralize charges of +1 or -1"""
    match_found = pat_v3.Match(mol, matchvector_v4)
    if match_found:
        for match in matchvector_v4:
            atom = mol.GetAtom(match[0])
            chg = atom.GetFormalCharge()
            hcount = atom.GetImplicitHCount()
            atom.SetFormalCharge(0)
            atom.SetImplicitHCount(hcount - chg)

Comparison to other algorithms
I wrote this code as a baseline implementation of neutralisation with which to think about the problem, identify corner cases, and consider existing implementations. However, it seems to be good enough as is to be ready for use. Let's look at the alternatives...

Both Francis Atkinson's Standardiser and Matt Swain's MolVS appear to be aimed at preparing a standard form for registering, and think about charge balancing and so forth. This is not the goal here, and this method is not suitable for that purpose. The form of neutralisation described above could be considered a "per-atom" neutralisation, rather than "whole-molecule"; I want all carboxylates present to be turned into COOH regardless of what else is going on in the molecule; in fact, by the time I run the neutralisation I have already chopped off all but the single largest component, and so it's likely to be unbalanced. I should look into those methods further though, and try to understand the rationale...or maybe just ask the authors at the next Cambridge Cheminformatics Network Meeting. :-) I did compare my results to MolVS and apart from charge balancing differences (e.g. for "C[S+](C)CCCC(=O)[O-]"), I note that MolVS handles [N+]([O-])[O-] better (e.g. "c1cc(cc(c1)F)NC(=O)C(=[N+]([O-])[O-])C#N" derived from CHEMBL1651804); I leave it alone because it appears to be a charge-separated form, but MolVS protonates one of the oxygens.

In terms of its intended goal, the code from Hans de Winter included in the RDKit Cookbook is the most similar. One difference is that that code sticks to a specific list of ionizable groups (and thus misses at least some, e.g. hydroxamic acids) whereas the code above is completely general. Comparing the results, I note that the HdW code incorrectly affects some charge-separated forms (e.g. "CC[S+](C)[O-]" to "CC[S+](O)C", "[O-][O+]=O" (ozone) to "O[O+]=O") but some tweaking of the SMARTS patterns will fix this, and there are some oddities (e.g. "[O-][Cl]=O" converted to "O[Cl+]O").

In short, for my purposes, I find that the simple code above does what I need. If I need something more sophisticated down the line, I'll look more closely at what problems the Standardiser and MolVS codes are attempting to solve.

Using this algorithm
If you are interested in applying this algorithm to your own data, you can find the code in a more convenient form at http://github.com/baoilleach/nocharge. If applying to millions of molecules, you may wish to combine this with my multiprocessing receipe to speed things up N times (where N is the number of cores you have).

[Update 02/01/2020]....or...you can wait until it arrives at an Open Babel near you. I've just added a "--neutralize" operation to my development version. When called from Python (see neutralize_v5 at the Github link above) this takes ~0.5s. Note that the C++ code is the same as in the Python v1.

[Update 06/01/2020] Changed H0 to h0 in the SMARTS pattern (doh!) to avoid an error when handling, for example, deuterated N+.

Monday 25 November 2019

Comparing chemical spaces

In a recent blog post, Pat Walters describes using visualisation of chemical space to answer questions about the overlap and complementarity of different chemical datasets. His introductory paragraph gives two examples of where this is useful: considering the purchase of an additional set of screening compounds, and secondly comparing training and test sets.

Now usually I'd be the first to reach for a visualisation technique to answer a question, but here I'm going to argue that at least for the first example, where the within-set diversity is high, this might not be the best approach. I would be worried that the layout of molecules in the visualisation would largely be determined by low-similarity values, and that appearing close together in the depiction would not be meaningful.

The approach I used recently to tackle this problem involved clustering with ChemFP's Taylor-Butina (TB) implementation. I used the provided TB script to calculate the number of clusters for each of the two datasets on its own. Then I concatenated the fingerprint files and using the same settings, I generated clusters for the two datasets together. Finally I processed the output of the combined clustering to work out how many clusters only contained dataset A, only dataset B or contained a mixture.

The result is conceptually a Venn diagram of A and B, describing (a) the number of entries in dataset B that are within the similarity threshold to entries in dataset A (and whose purchase is potentially a waste of money) and (b) the number that are outside this threshold. For these latter, you also know how many clusters they form, which is important as you are looking for a diverse set that is distinct from dataset A.

And I'll finish with a shout-out to Andrew Dalke, whose ChemFP made this analysis possible.

Notes:
1. It's not strictly necessary to cluster each dataset on its own, but it's a useful sanity check. The number of clusters found for each dataset on its own was roughly the same as those containing that dataset in the combined clustering.
2. I can't recall the similarity threshold I used, but likely something in the 0.30-0.40 ECFP4 region.
3. I considered true singletons to be clusters of their own in this analysis, and ignored false singletons. After the fact I realised that duplicates by fingerprint (or better still perhaps by InChI) should probably be eliminated before doing the TB clustering, as their presence may convert a false singleton into a true cluster of size two. It's kind of arbitrary, but duplicating a dataset should not change the results of a clustering algorithm.

Thursday 24 October 2019

Open Babel 3.0 released

As announced by Geoff on the mailing list, Open Babel 3.0 is now available for download:

I'm pleased to announce the release of Open Babel 3.0.0 [finally]:

This release represents a major update and is strongly recommended for all users.

It also removes deprecated components and breaks the API in a few places. For information on migrating from the previous version, please see:
https://open-babel.readthedocs.io/en/latest/UseTheLibrary/migration.html#migrating-to-3-0

We intend to move to semi-annual releases in Spring and Fall, with bug fix releases as needed.

A sample of major new features:
  • Code for handling implicit hydrogens and kekulization has been entirely replaced. As well as being accurate, the new approach is much faster.
  • Speed of reading and writing SMILES has been improved by more than 50-fold.
  • Removal of the old 'babel' binary in favor of the newer 'obabel' command-line tool.
  • New improved fragment-based 3D coordinate generation code as part of Google Summer of code 2018/2019. Significantly faster and more accurate: https://doi.org/10.1186/s13321-019-0372-5
  • (Please cite J. Cheminf. (2019) v11, article 49 if you use the new 3D coordinate generation.)
  • New API for handling reactions stored as molecules (e.g. Reaction InChI, etc.)
  • New API for copying part of an OBMol as a substructure
  • Support for Maestro file format, contributed by Patrick Lorton of Schrodinger

There are an incredible number of improvements, minor features and many bug fixes.

For a full list of changes and to download source packages:
https://open-babel.readthedocs.io/en/latest/ReleaseNotes/ob300.html
https://github.com/openbabel/openbabel/releases

Thanks to a cast of many for this release, particularly including Noel O'Boyle:
aandi, adalke (Andrew Dalke), adamjstewart (Adam J. Stewart), afonari (Alexandr Fonari), artoria2e5 (Mingye Wang), baoilleach (Noel O’Boyle), barrymoo (Barry Moore), bbucior (Ben Bucior), boryszef (Borys Szefczyk), camannguyen (An Nguyen), cmanion (Charles A. Manion), cowsandmilk (David Hall), cstein (Casper Steinmann), derekharmon (Derek Harmon), djhogan (Daniel Hogan), dkoes (David Koes), e-kwsm (Eisuke Kawashima), eloyfelix (Eloy Felix), fredrikw (Fredrik Wallner), ghutchis (Geoff Hutchison), hille721 (Christoph Hille), hseara (Hector Martinez-Seara), jasonychuang (Jason Huang), jeffjanes (Jeff Janes), johnmay (John Mayfield), katrinleinweber (Katrin Leinweber), keipertk (Kristopher Keipert), kyle-roberts-arzeda, langner (Karol M. Langner), lorton (Pat Lorton), mcs07 (Matt Swain), merkys (Andrius Merkys), mkrykunov, mmghahremanpour (Mohammad Ghahremanpour), mwojcikowski (Maciej Wójcikowski), n-yoshikawa (Naruki Yoshikawa), nakatamaho (Nakata Maho), nsoranzo (Nicola Soranzo), oititov (Titov Oleg), orex (Kirill Okhotnikov), pbecherer (Paul Becherer), peawagon (Jen), philthiel (Philipp Thiel), psavery (Patrick Avery), rmeli (Rocco Meli), serval2412 (Julien Nabet), sunoru, susilehtola (Susi Lehtola), tgaudin (Théophile Gaudin), theavey (Thomas Heavey), timvdm (Tim Vandermeersch), torcolvin (Tor Colvin), wojdyr (Marcin Wojdyr), xomachine (Dmitriy Fomichev), yishutu (Yi-Shu Tu)
If you haven't already been using the development version, now's the time.

Tuesday 17 September 2019

Least Publishable Unit #3 - Which similarity metric is best?

Following on from #1 and #2, here's some work I began (maybe twice) but never completed.

Back in 2016 I published a paper entitled "Which fingerprint is best?" - actually, no, someone talked me down into the title "Comparing structural fingerprints using a literature-based similarity benchmark", a title which accurately describes what we did, but not why we did it. Anyway, it's work I'm very proud of - for example, it shows there is a detectable performance difference between 16384 and 4096 bit ECFP fingerprints, and that the fingerprints most appropriate for finding very close analogs are different than for those more distant.

The obvious follow-up to this work would be compare similarity metrics, for some particular fingerprint, e.g. the 16K ECFP4 fingerprint. There's a great paper by Todeschini et al that compares 51 similarity metrics and has a nice table where the equations are all given in one place. So I took these as the starting point for my own work.

After coding them all up and running them against the benchmark, I found that a number of the metrics gave identical rank orders across the entire dataset. That part wasn't surprising - the paper itself notes that several metrics are correlated, and the Riniker and Landrum paper has an appendix with a worked proof that shows that Tanimoto and Dice are correlated. What was surprising is that the ones I was finding as correlated were not necessarily the same as those in the paper (there was some overlap, but...).

Regarding the main reason for doing the paper, Tanimoto surprisingly (or not) did turn out to be one of the best (if not the best, I can't remember exactly). Perhaps more interesting were the results I got from looking at the effect of changing the weighting of the Tversky similarity; I can't remember the details but the best results were not where I expected them to be, and I never got to the bottom of why.

Monday 16 September 2019

Mutation testing partialsmiles

Many Python programmers will be familiar with the concept of achieving high test coverage (e.g. via an analysis tool such as coverage.py), but they should also know that high coverage does not mean that everything is bug free. So where can you go from there?

One option is to use mutation testing. This is based around the concept that at least one test should fail if the code is "mutated". Now, mutated is a bit vague but consider changing a plus to a minus, or a "break" to a "continue" in a Python script. If your tests still all pass, it means that they don't sufficiently cover all scenarios.

Let's take a look at the codebase I just released, partialsmiles 1.0. I'm going to use cosmic-ray to generate and test mutants. To begin with I've installed it into a virtualenv. The Scripts folder contains the executables I'm going to use:

1. Create the config file: cosmic-ray new-config config.toml
2. Edit it to only test smiparser.py: module-path = "partialsmiles/smiparser.py"
3. Edit it to run the test suite: test-command = "python suite.py"
4. Analyse the code and create the mutants: cosmic-ray init config.toml session.sqlite
5. Generate the mutated code and test the mutants: cosmic-ray exec session.sqlite
6. Generate a HTML report: cr-html session.sqlite > cosmic-ray.html

Note that on Windows I've found that it doesn't always stop all of the running Python processes so I needed to kill them directly through the Process Explorer.

Looking at the report, 1514 mutants were generated and 25% survived. Yikes. Here's an example:

The plus was changed into a minus and all of the tests still passed. The line itself is covered by the tests - I have a test for a two-digit isotope - but I am not testing the returned value.

So there you go. Another tool for the toolbox. And some work for me to do.

Sunday 15 September 2019

partialsmiles - A validating parser for partial SMILES

I've just released version 1.0 of partialsmiles, which is available via pip. Documentation can be found here and the project is hosted by GitHub.

partialsmiles is a Python library that provides a validating SMILES parser with support for partial SMILES. The parser checks syntax, ability to kekulize aromatic system, and checks whether atoms’ valences appear on a list of allowed valences.

The main use of the library is to accept or reject SMILES strings while they are being generated character-by-character by a machine-learning method. Previously the only way to check such strings was after the entire string was generated, a process that could take 10s of seconds. Using partialsmiles, SMILES strings with invalid syntax or problematic semantics can be detected while they are being generated, allowing them to be discarded or for alternate tokens to be suggested.

Some other toolkits (such as OEChem and the CDK) can read partial SMILES; should you use these instead? Certainly, if they meet your needs. Their goal is to return a molecule object that captures bonding information so far (useful for ‘depiction as you type’, for example), perhaps converting partial aromatic rings to single bonds and unmatched ring connections to dangling bonds. This is in contrast to the goal of partialsmiles which is to work out whether the partial SMILES string could be the prefix of a complete SMILES string that would be accepted by the parser. For example, "CC(C)(C)(" may be read as isobutane by other readers, but partialsmiles will reject it with a valence error as completion of the string will lead to a five or more valent carbon (see the docs).

A secondary goal of this project is to make available a permissively-licensed SMILES parser that correctly reads all aromatic systems in the SMILES reading benchmark. This is with the hope to encourage improvements in existing parsers.

And finally, I just want to thank Xuhan Liu for providing the suggestion that such a project would be useful.

Image credit: half by MandoBarista (licensed CC BY-SA)

Saturday 14 September 2019

Moving to pastures new, but still in the same field Part III

Following on from my previous post on the topic, 30th August was my last day at NextMove Software.

Monday week I'll be joining the Computational Chemistry team at Sosei Heptares which is lead by Chris de Graaf. If you're not familiar with the company, the R&D site is based just outside Cambridge (UK) and works on structure-based drug discovery with a focus on GPCRs. I'm very excited about this move as I'm really looking forward to having a chance to apply my skills more directly to drug discovery.

Here's a high-level overview of what the company does, which I found quite useful (jump to 1:36 for comp chem):

Friday 13 September 2019

R you similar? Inferring R group similarity from medicinal chemistry data

While molecular similarity is widely-studied (even by me!), the question of how to measure R group similarity is much less so, despite the fact that R group replacement is very commonly applied in medicinal chemistry projects.

My motivating example is fluoro and chloro - are these R groups similar? Well, a chemical fingerprint would say they have no bits in common and hence have zero similarity. But we know they're similar. Not just because they are nearby in the periodic table, but because chemists tend to make molecules that have Cl replaced with F.

In other words, we can use medicinal chemistry project data to infer a measure of R group similarity by looking at what R groups co-occur in medicinal chemistry projects as matched pairs. This measure can be used to suggest changes, form the basis of an enumeration strategy, or search for similar molecules. Because this similarity measure is derived from medicinal chemistry data, the results should be reasonable and interpretable.

At the recent ACS meeting in San Diego, I described my work on a measure of R group similarity:
Measuring R group similarity using medicinal chemistry data

I found it difficult to jam everything into my 20+5min talk, so let me add a bit of discussion here on how it compares to previous work...

There certainly has been some interesting work on R group similarity from Sheffield (e.g. Holliday et al), and more recently from this year's Skolnik Award winner Kimito Funatsu (Tamura et al), among others. But perhaps the most similar work is that on large-scale identification of bioisosteric replacements from the Bajorath group (e.g. Wassermann and Bajorath) and Swiss-Bioisotere (Wirth et al). Others have used 3D data from the PDB to further refine predictions, e.g. Seddon et al and Lešnik et al.

The thing is, most of the time in a med chem project (correct me if wrong) you're not looking for bioisosteric replacements - you want to improve potency, or you want just want to probe the binding site, e.g. see can you grow in this direction or that direction. The changes people make are often to R groups around the same size (maybe a bit bigger), but not always to ones with the same properties. For example, if an electron-withdrawing group doesn't work, time to test an electron-donating group. Of course, towards the end, you might have potency nailed and are trying to work around some physchem properties - at this point, bioisosteric replacements come into play.

Something else that I feel previous work has missed is the time dimension. The path of R group replacements can be envisaged as starting with a hydrogen or methyl, and through an iterative process ending up somewhere a whole lot more complicated. Although methyl co-occurs with the vast majority of R groups (as a matched molecular pair), after a certain point in a project that's not a useful suggestion; it's already been done, or already been ruled out. What you want to know is what R groups are typically made before versus after. You can easily work this out if you have time series information from med chem projects, but as I show in the talk, even in the absence of explicit time series data, it's possible to infer the approximate order in which R groups are made if you have a large number of sets of med chem project data.

I had a couple of questions about how to turn this method of finding similar R groups into a similarity metric. I turned this question back on the questioners: "Why do you want that? My guess is to use it to find similar molecules." But you can find similar molecules without turning the similarity into a number between 0 and 1; the talk describes one way to do this.

Oh, and here's the poster on the same topic I showed earlier this year at Sheffield and subsequently at the ACS.

Thursday 12 September 2019

Talk about making a hash of a molecule

At the recent ACS meeting in San Diego, I presented a talk entitled:
Making a hash of it: The advantage of selectively leaving out structural information

Firstly, this is an overview of a general technique for solving a wide range of cheminformatics problems; namely, the calculation and comparison of hashes of molecules to find duplicates with respect to some structural property of interest.

It also introduces MolHash to the world, a command-line application from NextMove Software which is freely available from GitHub (*).


* If you don't fancy compiling it yourself, you may find Matt Swain's conda version useful.

Monday 3 June 2019

Building a PubChem query

Evan Bolton was visiting recently and I got to watch over his shoulder as he built up a PubChem query on the website step-by-step. I helpfully added comments like "wait a second - how did you bring that up?" and "you can do that?", not to mention "ah - so that's how you do that".

The general idea is that any search you can do, you can combine with other searches with Boolean operators. It's a little bit confusing that the page that you use for combining them is not necessarily the same page you are using for doing the search, but once you're happy with that, it makes sense.

As an example, I am interested in those entries in the peptide dataset I uploaded to PubChem that do not have a biological annotation. This is a little bit complicated in that the biological annotations are a feature of PubChem Compound, but the datasets are all in PubChem Substance, so we are going to have to link from one to the other.

Here is a step-by-step guide which you may find useful to work through:

  1. First of all let's find all molecules in PubChem Compound that have a biological annotation. Go to https://pubchem.ncbi.nlm.nih.gov, click "Browse Data", select classification "PubChem/PubChem Compound TOC". In the tree that appears at the bottom of the page click on the number next to "Biologic Description" to do the search. (May have to reload the page if it gives an error first time.)
  2. Next, let's find all molecules in PubChem Compound that came from my dataset. Go to https://pubchem.ncbi.nlm.nih.gov/, click "Explore Data Sources" towards the bottom right, type "NextMove" into the "Search Sources" box, and then "8,699 Live Substances" to do the search.
  3. The previous search was of PubChem Substance, which is where datasets live. We want to find the corresponding molecules from PubChem Compound, so on the right under "Find related data" select "PubChem Compound", and then "PubChem Same Compound" in the Option box that appears. Click "Find Items" to do the search.
  4. Finally, we need to get the intersection between the search described in 3 and all those not in 1. For this go to https://www.ncbi.nlm.nih.gov/pccompound and click "Advanced". You will see under "History" the searches you have done. In my case, the relevant searches are #1 (step 1 above) and #4 (step 3 above), so I type into the builder box "#4", change "AND" to "NOT" on the next line and enter #1. Finally, I clicked "Search" to do the search. (261 results)

Sunday 17 February 2019

Transitive reduction of large graphs

There are a number of graph algorithms with not-so-obvious names, and "transitive reduction" is one of those. When I first needed to implement this algorithm, I couldn't figure out whether it had a name, and I can tell you that it's hard to google for something when you don't know what it's called.

Anyway, what transitive reduction means is to remove the maximum number of edges from a directed acylic graph while keeping the 'reachability'; that is, you can remove any edge A->B in the graph if there's some other way to get from A to B. In practice, this is super-useful in tidying up a graph representing pairwise relationships, e.g. a partially-ordered set. For example, suppose you know that A>B, B>C and A>C. The graph showing this would have three edges, but you can reduce it to just two, A->B->C, that summarises the entire ordering.

I used this algorithm in my fingerprint paper, but more recently I've been working on another problem that also requires it, but this time the graphs are much bigger. At first, I used the tred program that comes with graphviz, but the graphs became too large for tred to finish within a reasonable amount of time. Next I used NetworkX, first from CPython and then from PyPy. The latter could just about handle it but required 30GB of memory and ran overnight, and so I moved onto C++. First I tried with igraph. This was painfuly slow - slower than the Python. It all comes down to the data structures that the library is using; I'm deleting 100s of millions of edges and though I batched up the deletions, it was just too slow.

Finally, I ended up at Boost Graph Library. This library is unusual in that you get to choose the data structures that the graph uses. If you need fast edge removal, then you can choose a linked list in which to store the edges. For the graphs of interest, the code ran in a matter of minutes (after a bit of profiling and optimisation) using a single GB of RAM.
#include <iostream>

#include <boost/graph/adjacency_list.hpp>
#include <boost/graph/graphviz.hpp>

using namespace boost;

typedef adjacency_list<listS, vecS, directedS> Graph;
typedef typename graph_traits<Graph>::vertex_descriptor Vertex;
typedef typename graph_traits<Graph>::edge_descriptor Edge;

class DFS
{
public:
  DFS(Graph &graph): m_graph(graph)
  {
    seen = (long*) calloc(num_vertices(m_graph), sizeof(long)); // sets to 0
    children = (long*) calloc(num_vertices(m_graph), sizeof(long)); // sets to 0
    child_edges = (Edge*) malloc(num_vertices(m_graph) * sizeof(Edge));
  }
  ~DFS()
  {
    free(seen);
    free(children);
    free(child_edges);
  }
  void visit(Vertex root)
  {
    SEEN++;
    dfs_visit(root);
  }
  void setParent(Vertex parent)
  {
    CHILD++;

    typename graph_traits<Graph>::out_edge_iterator ei, ei_end;
    for (tie(ei, ei_end) = out_edges(parent, m_graph); ei != ei_end; ++ei) {
      Edge edge = *ei;
      Vertex src = source(edge, m_graph);
      if (src == parent) {
        Vertex tget = target(edge, m_graph);
        children[tget] = CHILD;
        child_edges[tget] = edge;
      }
      else {
        children[src] = CHILD;
        child_edges[src] = edge;
      }
    }

  }

private:
  void dfs_visit(Vertex v)
  {
    seen[v] = SEEN;
    if (children[v] == CHILD)
      remove_edge(child_edges[v], m_graph);

    std::pair<graph_traits<Graph>::adjacency_iterator, graph_traits<Graph>::adjacency_iterator> adj_vertices_it;
    for (adj_vertices_it = adjacent_vertices(v, m_graph); adj_vertices_it.first != adj_vertices_it.second; ++adj_vertices_it.first) {
      Vertex n2 = *(adj_vertices_it.first);
      if (seen[n2] != SEEN)
        dfs_visit(n2);
    }
  }

  Graph &m_graph;
  long *seen;
  long SEEN = 0; // flag used to mark visited vertices
  long *children;
  long CHILD = 0; // flag used to mark children of parent
  Edge *child_edges;
};

void help()
{
  std::cout << "Usage: tred input.graph output.dot\n"
            << "\n"
            << "Where the input.graph is a DAG described with the following format:\n"
            << "   NUM_VERTICES\n"
            << "   SRC_1 DST_1\n"
            << "   SRC_2 DST_2\n"
            << "   ...\n"
            << "\nVertices are assumed to be numbered from 0 to NUM_VERTICES-1\n";
  exit(1);
}

int main(int argc, char* argv[])
{
  if (argc != 3)
    help();

  unsigned VERTEX_COUNT;

  std::cout << "Reading...\n";
  std::ifstream inputfile(argv[1]);
  if (!inputfile) {
    std::cout << "ERROR: Cannot open " << argv[1] << " for reading\n";
    exit(1);
  }

  inputfile >> VERTEX_COUNT;
  Graph graph(VERTEX_COUNT);
  int a, b;
  while (inputfile >> a >> b)
    add_edge(a, b, graph);

  DFS dfs(graph);

  typename graph_traits<Graph>::adjacency_iterator ai, ai_end, bi, bi_end;
  typename graph_traits<Graph>::vertex_iterator vi, vi_end;

  std::cout << "Reducing...\n";
  for (tie(vi, vi_end) = vertices(graph); vi != vi_end; ++vi) {
    Vertex n1 = *vi;
    if (n1 % 100 == 0)
      std::cout << n1 << "\n";
    dfs.setParent(n1);
    for (tie(ai, ai_end) = adjacent_vertices(n1, graph); ai != ai_end; ++ai) {
      Vertex n2 = *ai;
      for (tie(bi, bi_end) = adjacent_vertices(n2, graph); bi != bi_end; ++bi) {
        Vertex n3 = *bi;
        dfs.visit(n3);
      }
    }
  }

  std::cout << "Writing...\n";
  std::ofstream outputfile(argv[2]);
  if (!outputfile) {
    std::cout << "ERROR: Cannot open " << argv[2] << " for writing\n";
    exit(1);
  }
  write_graphviz(outputfile, graph);

  return 0;
}