Saturday 23 February 2008

Calculate circular fingerprints with Pybel

I've just realised that a friend of mine, Florian Nigsch, has started blogging about cheminformatics. In one of his posts, he describes the calculation of circular fingerprints with Python.

In his program, he parses the structure of a MOL2 file himself to work out the connectivity and so on. I wondered whether it would be easier with Pybel. I was hoping that the breadth-first iterator that comes with OpenBabel could be used to simplify things, but in the end I needed to do the breadth-first search myself (here I use a queue).
import os
import sys
from collections import deque # Better than list.pop(0)

import pybel
import openbabel as ob

# Set up a translator to Sybyl atom types
ttab = ob.OBTypeTable()
ttab.SetFromType("INT")
ttab.SetToType("SYB")

def bfs(mol, startatom):
visited = [False] * (mol.OBMol.NumAtoms() + 1)
ans = []
queue = deque([(startatom, 0)])
while queue:
atom, depth = queue.popleft()
idx = atom.GetIdx()
visited[idx] = True
atomtype = ttab.Translate(atom.GetType())
ans.append("%s%s" % (["", "%d-" % depth] [depth>0],
atomtype))
if depth < D:
for atom in ob.OBAtomAtomIter(atom):
if not visited[atom.GetIdx()]:
queue.append((atom, depth+1))
return ";".join(ans)

if __name__ == "__main__":
D = 2 # default max depth
if len(sys.argv)>1:
inputfile = sys.argv[1]
if len(sys.argv)>2:
D = int(sys.argv[2])
else:
sys.exit("Usage: BFS.py inputfile [depth=2]")

for mol in pybel.readfile(inputfile.split(".")[-1],
inputfile):
fp = []
for atom in ob.OBMolAtomIter(mol.OBMol):
fp.append(bfs(mol, atom))
print "\t".join([mol.title] + fp)
The advantage of using a library like Pybel is that this program will work for any of several file formats including SMILES. Also, it's trivial to use other atom types apart from Sybyl atom types if desired. Note that the output is not exactly identical to that of Florian's program. In particular, I don't fold together multiple environments that occur at the same depth.

Update (25/02/08): Improved the code after review by Andrew Dalke (see below for original and comments).

3 comments:

Andrew Dalke said...

As self-appointed code reviewer :) here's a few things to think about.

list.pop(0) is an O(N) operation because it copies the 1...N-1 PyObject pointers downwards by one. For small molecules this isn't a problem but if you're concerned about theoretical cases you can use the collections.deque which has O(1) insertion and deletion on the front and back.

You should pass the "visited" flags as a parameter to the bfs function, instead of using a module variable.

I would have done

if depth:
ans.append("%s-%s" % (depth, atomtype))
else:
ans.append(atomtype) # I think atomtype is a string

With 2.5 you would use the new if-else ternary expression

ans.append( ("%s-%s" % (depth, atomtype)) if depth else atomtype)

Change the bfs so it yields the values rather than appending to 'ans' then returning ans at the end. This eliminates the need to create an intermediate list.

You can do sys.exit("Usage: BFS.py inputfile [depth=2]") which gives almost the same result as the print/sys.exit, including using an exit code of 1, except the text is written to stderr. Stderr is better for this case.

Noel O'Boyle said...

Reviewer 1 raises some interestings points. To take each in turn:
(1) list.pop(0): Good to know
(2) visited: you're right, but in my defence I hastily adapted bfs from a recursive dfs function after I posted it by accident. visited needed to be set outside the recursive function.
(3) "if depth...": I assume you mean for clarity. I guess so - actually I don't usually use this type of expression. Regarding ternary expressions, I avoid Python2.5isms but that's good to know.
(4) "Change the bfs...": Yes, I realised this after posting...
(5) Usage of sys.exit: Good to know.

Thanks, Andrew. This a good way for me to learn a bit more Python actually. I'll make the above changes when I get a chance...

Noel O'Boyle said...

I've updated the code above. The original code is left here for posterity (unfortunately indentation and colour is lost):


import os
import sys

import pybel
import openbabel as ob

# Set up a translator to Sybyl atom types
ttab = ob.OBTypeTable()
ttab.SetFromType("INT")
ttab.SetToType("SYB")

def bfs(root):
ans = []
queue = [(root, 0)]
while queue:
atom, depth = queue.pop(0)
idx = atom.GetIdx()
visited[idx] = True
atomtype = ttab.Translate(atom.GetType())
ans.append("%s%s" % (["", "%d-" % depth] [depth>0],
atomtype))
if depth < D:
for atom in ob.OBAtomAtomIter(atom):
if not visited[atom.GetIdx()]:
queue.append((atom, depth+1))
return ans

if __name__ == "__main__":
D = 2 # max depth
if len(sys.argv)>1:
inputfile = sys.argv[1]
if len(sys.argv)>2:
D = int(sys.argv[2])
else:
print "Usage: BFS.py inputfile [depth=2]"
sys.exit(1)

for mol in pybel.readfile(inputfile.split(".")[-1],
inputfile):
visited = [False] * (mol.OBMol.NumAtoms() + 1)
fp = []
for atom in ob.OBMolAtomIter(mol.OBMol):
ans = bfs(atom)
fp.append(";".join(ans))
print mol.title + "\t" + "\t".join(fp)