Sunday 17 February 2019

Transitive reduction of large graphs

There are a number of graph algorithms with not-so-obvious names, and "transitive reduction" is one of those. When I first needed to implement this algorithm, I couldn't figure out whether it had a name, and I can tell you that it's hard to google for something when you don't know what it's called.

Anyway, what transitive reduction means is to remove the maximum number of edges from a directed acylic graph while keeping the 'reachability'; that is, you can remove any edge A->B in the graph if there's some other way to get from A to B. In practice, this is super-useful in tidying up a graph representing pairwise relationships, e.g. a partially-ordered set. For example, suppose you know that A>B, B>C and A>C. The graph showing this would have three edges, but you can reduce it to just two, A->B->C, that summarises the entire ordering.

I used this algorithm in my fingerprint paper, but more recently I've been working on another problem that also requires it, but this time the graphs are much bigger. At first, I used the tred program that comes with graphviz, but the graphs became too large for tred to finish within a reasonable amount of time. Next I used NetworkX, first from CPython and then from PyPy. The latter could just about handle it but required 30GB of memory and ran overnight, and so I moved onto C++. First I tried with igraph. This was painfuly slow - slower than the Python. It all comes down to the data structures that the library is using; I'm deleting 100s of millions of edges and though I batched up the deletions, it was just too slow.

Finally, I ended up at Boost Graph Library. This library is unusual in that you get to choose the data structures that the graph uses. If you need fast edge removal, then you can choose a linked list in which to store the edges. For the graphs of interest, the code ran in a matter of minutes (after a bit of profiling and optimisation) using a single GB of RAM.
#include <iostream>

#include <boost/graph/adjacency_list.hpp>
#include <boost/graph/graphviz.hpp>

using namespace boost;

typedef adjacency_list<listS, vecS, directedS> Graph;
typedef typename graph_traits<Graph>::vertex_descriptor Vertex;
typedef typename graph_traits<Graph>::edge_descriptor Edge;

class DFS
{
public:
  DFS(Graph &graph): m_graph(graph)
  {
    seen = (long*) calloc(num_vertices(m_graph), sizeof(long)); // sets to 0
    children = (long*) calloc(num_vertices(m_graph), sizeof(long)); // sets to 0
    child_edges = (Edge*) malloc(num_vertices(m_graph) * sizeof(Edge));
  }
  ~DFS()
  {
    free(seen);
    free(children);
    free(child_edges);
  }
  void visit(Vertex root)
  {
    SEEN++;
    dfs_visit(root);
  }
  void setParent(Vertex parent)
  {
    CHILD++;

    typename graph_traits<Graph>::out_edge_iterator ei, ei_end;
    for (tie(ei, ei_end) = out_edges(parent, m_graph); ei != ei_end; ++ei) {
      Edge edge = *ei;
      Vertex src = source(edge, m_graph);
      if (src == parent) {
        Vertex tget = target(edge, m_graph);
        children[tget] = CHILD;
        child_edges[tget] = edge;
      }
      else {
        children[src] = CHILD;
        child_edges[src] = edge;
      }
    }

  }

private:
  void dfs_visit(Vertex v)
  {
    seen[v] = SEEN;
    if (children[v] == CHILD)
      remove_edge(child_edges[v], m_graph);

    std::pair<graph_traits<Graph>::adjacency_iterator, graph_traits<Graph>::adjacency_iterator> adj_vertices_it;
    for (adj_vertices_it = adjacent_vertices(v, m_graph); adj_vertices_it.first != adj_vertices_it.second; ++adj_vertices_it.first) {
      Vertex n2 = *(adj_vertices_it.first);
      if (seen[n2] != SEEN)
        dfs_visit(n2);
    }
  }

  Graph &m_graph;
  long *seen;
  long SEEN = 0; // flag used to mark visited vertices
  long *children;
  long CHILD = 0; // flag used to mark children of parent
  Edge *child_edges;
};

void help()
{
  std::cout << "Usage: tred input.graph output.dot\n"
            << "\n"
            << "Where the input.graph is a DAG described with the following format:\n"
            << "   NUM_VERTICES\n"
            << "   SRC_1 DST_1\n"
            << "   SRC_2 DST_2\n"
            << "   ...\n"
            << "\nVertices are assumed to be numbered from 0 to NUM_VERTICES-1\n";
  exit(1);
}

int main(int argc, char* argv[])
{
  if (argc != 3)
    help();

  unsigned VERTEX_COUNT;

  std::cout << "Reading...\n";
  std::ifstream inputfile(argv[1]);
  if (!inputfile) {
    std::cout << "ERROR: Cannot open " << argv[1] << " for reading\n";
    exit(1);
  }

  inputfile >> VERTEX_COUNT;
  Graph graph(VERTEX_COUNT);
  int a, b;
  while (inputfile >> a >> b)
    add_edge(a, b, graph);

  DFS dfs(graph);

  typename graph_traits<Graph>::adjacency_iterator ai, ai_end, bi, bi_end;
  typename graph_traits<Graph>::vertex_iterator vi, vi_end;

  std::cout << "Reducing...\n";
  for (tie(vi, vi_end) = vertices(graph); vi != vi_end; ++vi) {
    Vertex n1 = *vi;
    if (n1 % 100 == 0)
      std::cout << n1 << "\n";
    dfs.setParent(n1);
    for (tie(ai, ai_end) = adjacent_vertices(n1, graph); ai != ai_end; ++ai) {
      Vertex n2 = *ai;
      for (tie(bi, bi_end) = adjacent_vertices(n2, graph); bi != bi_end; ++bi) {
        Vertex n3 = *bi;
        dfs.visit(n3);
      }
    }
  }

  std::cout << "Writing...\n";
  std::ofstream outputfile(argv[2]);
  if (!outputfile) {
    std::cout << "ERROR: Cannot open " << argv[2] << " for writing\n";
    exit(1);
  }
  write_graphviz(outputfile, graph);

  return 0;
}