Monday, 3 June 2019

Building a PubChem query

Evan Bolton was visiting recently and I got to watch over his shoulder as he built up a PubChem query on the website step-by-step. I helpfully added comments like "wait a second - how did you bring that up?" and "you can do that?", not to mention "ah - so that's how you do that".

The general idea is that any search you can do, you can combine with other searches with Boolean operators. It's a little bit confusing that the page that you use for combining them is not necessarily the same page you are using for doing the search, but once you're happy with that, it makes sense.

As an example, I am interested in those entries in the peptide dataset I uploaded to PubChem that do not have a biological annotation. This is a little bit complicated in that the biological annotations are a feature of PubChem Compound, but the datasets are all in PubChem Substance, so we are going to have to link from one to the other.

Here is a step-by-step guide which you may find useful to work through:

  1. First of all let's find all molecules in PubChem Compound that have a biological annotation. Go to https://pubchem.ncbi.nlm.nih.gov, click "Browse Data", select classification "PubChem/PubChem Compound TOC". In the tree that appears at the bottom of the page click on the number next to "Biologic Description" to do the search. (May have to reload the page if it gives an error first time.)
  2. Next, let's find all molecules in PubChem Compound that came from my dataset. Go to https://pubchem.ncbi.nlm.nih.gov/, click "Explore Data Sources" towards the bottom right, type "NextMove" into the "Search Sources" box, and then "8,699 Live Substances" to do the search.
  3. The previous search was of PubChem Substance, which is where datasets live. We want to find the corresponding molecules from PubChem Compound, so on the right under "Find related data" select "PubChem Compound", and then "PubChem Same Compound" in the Option box that appears. Click "Find Items" to do the search.
  4. Finally, we need to get the intersection between the search described in 3 and all those not in 1. For this go to https://www.ncbi.nlm.nih.gov/pccompound and click "Advanced". You will see under "History" the searches you have done. In my case, the relevant searches are #1 (step 1 above) and #4 (step 3 above), so I type into the builder box "#4", change "AND" to "NOT" on the next line and enter #1. Finally, I clicked "Search" to do the search. (261 results)

Sunday, 17 February 2019

Transitive reduction of large graphs

There are a number of graph algorithms with not-so-obvious names, and "transitive reduction" is one of those. When I first needed to implement this algorithm, I couldn't figure out whether it had a name, and I can tell you that it's hard to google for something when you don't know what it's called.

Anyway, what transitive reduction means is to remove the maximum number of edges from a directed acylic graph while keeping the 'reachability'; that is, you can remove any edge A->B in the graph if there's some other way to get from A to B. In practice, this is super-useful in tidying up a graph representing pairwise relationships, e.g. a partially-ordered set. For example, suppose you know that A>B, B>C and A>C. The graph showing this would have three edges, but you can reduce it to just two, A->B->C, that summarises the entire ordering.

I used this algorithm in my fingerprint paper, but more recently I've been working on another problem that also requires it, but this time the graphs are much bigger. At first, I used the tred program that comes with graphviz, but the graphs became too large for tred to finish within a reasonable amount of time. Next I used NetworkX, first from CPython and then from PyPy. The latter could just about handle it but required 30GB of memory and ran overnight, and so I moved onto C++. First I tried with igraph. This was painfuly slow - slower than the Python. It all comes down to the data structures that the library is using; I'm deleting 100s of millions of edges and though I batched up the deletions, it was just too slow.

Finally, I ended up at Boost Graph Library. This library is unusual in that you get to choose the data structures that the graph uses. If you need fast edge removal, then you can choose a linked list in which to store the edges. For the graphs of interest, the code ran in a matter of minutes (after a bit of profiling and optimisation) using a single GB of RAM.
#include <iostream>

#include <boost/graph/adjacency_list.hpp>
#include <boost/graph/graphviz.hpp>

using namespace boost;

typedef adjacency_list<listS, vecS, directedS> Graph;
typedef typename graph_traits<Graph>::vertex_descriptor Vertex;
typedef typename graph_traits<Graph>::edge_descriptor Edge;

class DFS
{
public:
  DFS(Graph &graph): m_graph(graph)
  {
    seen = (long*) calloc(num_vertices(m_graph), sizeof(long)); // sets to 0
    children = (long*) calloc(num_vertices(m_graph), sizeof(long)); // sets to 0
    child_edges = (Edge*) malloc(num_vertices(m_graph) * sizeof(Edge));
  }
  ~DFS()
  {
    free(seen);
    free(children);
    free(child_edges);
  }
  void visit(Vertex root)
  {
    SEEN++;
    dfs_visit(root);
  }
  void setParent(Vertex parent)
  {
    CHILD++;

    typename graph_traits<Graph>::out_edge_iterator ei, ei_end;
    for (tie(ei, ei_end) = out_edges(parent, m_graph); ei != ei_end; ++ei) {
      Edge edge = *ei;
      Vertex src = source(edge, m_graph);
      if (src == parent) {
        Vertex tget = target(edge, m_graph);
        children[tget] = CHILD;
        child_edges[tget] = edge;
      }
      else {
        children[src] = CHILD;
        child_edges[src] = edge;
      }
    }

  }

private:
  void dfs_visit(Vertex v)
  {
    seen[v] = SEEN;
    if (children[v] == CHILD)
      remove_edge(child_edges[v], m_graph);

    std::pair<graph_traits<Graph>::adjacency_iterator, graph_traits<Graph>::adjacency_iterator> adj_vertices_it;
    for (adj_vertices_it = adjacent_vertices(v, m_graph); adj_vertices_it.first != adj_vertices_it.second; ++adj_vertices_it.first) {
      Vertex n2 = *(adj_vertices_it.first);
      if (seen[n2] != SEEN)
        dfs_visit(n2);
    }
  }

  Graph &m_graph;
  long *seen;
  long SEEN = 0; // flag used to mark visited vertices
  long *children;
  long CHILD = 0; // flag used to mark children of parent
  Edge *child_edges;
};

void help()
{
  std::cout << "Usage: tred input.graph output.dot\n"
            << "\n"
            << "Where the input.graph is a DAG described with the following format:\n"
            << "   NUM_VERTICES\n"
            << "   SRC_1 DST_1\n"
            << "   SRC_2 DST_2\n"
            << "   ...\n"
            << "\nVertices are assumed to be numbered from 0 to NUM_VERTICES-1\n";
  exit(1);
}

int main(int argc, char* argv[])
{
  if (argc != 3)
    help();

  unsigned VERTEX_COUNT;

  std::cout << "Reading...\n";
  std::ifstream inputfile(argv[1]);
  if (!inputfile) {
    std::cout << "ERROR: Cannot open " << argv[1] << " for reading\n";
    exit(1);
  }

  inputfile >> VERTEX_COUNT;
  Graph graph(VERTEX_COUNT);
  int a, b;
  while (inputfile >> a >> b)
    add_edge(a, b, graph);

  DFS dfs(graph);

  typename graph_traits<Graph>::adjacency_iterator ai, ai_end, bi, bi_end;
  typename graph_traits<Graph>::vertex_iterator vi, vi_end;

  std::cout << "Reducing...\n";
  for (tie(vi, vi_end) = vertices(graph); vi != vi_end; ++vi) {
    Vertex n1 = *vi;
    if (n1 % 100 == 0)
      std::cout << n1 << "\n";
    dfs.setParent(n1);
    for (tie(ai, ai_end) = adjacent_vertices(n1, graph); ai != ai_end; ++ai) {
      Vertex n2 = *ai;
      for (tie(bi, bi_end) = adjacent_vertices(n2, graph); bi != bi_end; ++bi) {
        Vertex n3 = *bi;
        dfs.visit(n3);
      }
    }
  }

  std::cout << "Writing...\n";
  std::ofstream outputfile(argv[2]);
  if (!outputfile) {
    std::cout << "ERROR: Cannot open " << argv[2] << " for writing\n";
    exit(1);
  }
  write_graphviz(outputfile, graph);

  return 0;
}