Workflow packages such as Pipeline Pilot, Taverna and KNIME allow the user to graphically create a pipeline to process molecular data. A downside of these packages is that the units of the workflow, the nodes, process data sequentially; that is, no data gets to Node 2 until Node 1 has finished processing all of it.
Correction (thanks Egon): the previous sentence is plain incorrect. Both KNIME and Taverna 2, at least, pass on partially processed data as soon as it's available.
Wouldn't it be nicer if they worked more like Unix pipes? That is, as soon as some data comes out of Node 1, it gets passed on to the next node, and so on. This would have three advantages: (1) you get the first result sooner, (2) you don't use up loads of memory storing all of the intermediate results, and (3) you can run things in parallel; for example, Node 2 could start processing the data from Node 1 immediately, perhaps even on a different computer.
Luckily, there is a neat feature in Python called a generator that allows you to create a pipeline that processes data lazily, one item at a time, just like a Unix pipe. Generators are functions that return a sequence of values. However, unlike a function that simply returns a list, a generator only calculates and returns the next item in the sequence when it is requested. One reason this is useful is that the sequence of items could be very large, or even infinite in length. (For a more serious introduction, see
David Beazley's talk at PyCon'08, which is the inspiration for this blog post.)
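As a minimal illustration (my own example, not from the talk), here's a generator that yields an endless stream of even numbers; no value is computed until one is requested:
def evens():
    # Yield the even numbers 0, 2, 4, ... indefinitely
    n = 0
    while True:
        yield n
        n += 2

gen = evens()     # nothing has been computed yet
print(next(gen))  # 0 -- computed on demand
print(next(gen))  # 2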
Let's create a three-node pipeline for processing an SDF file: (1) a filter node that looks for the word "ZINC00" in the title of the molecule, (2) a filter node for Tanimoto similarity to a target molecule, and (3) an output node that returns the molecule title. (The full program is presented at the end of this post.)
# Pipeline Python!
pipeline = createpipeline((titlematches, "ZINC00"),
                          (similarto, targetmol, 0.50),
                          (moltotitle,))
# Create an input source
dataset = pybel.readfile("sdf", inputfile)
# Feed the pipeline
results = pipeline(dataset)
The variable 'results' is a generator, so nothing actually happens until we request the values it yields; a single next(results), for example, would pull just one title through the entire pipeline...
# Print out each answer as it comes
for title in results:
    print(title)
The titles of the matching molecules will appear on the screen one by one as they are found, just like in a Unix pipe. Note how easy it is to combine nodes into a pipeline.
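Under the hood, createpipeline simply chains the generator functions together, so the pipeline above is equivalent to nesting the calls by hand:
# Equivalent to the pipeline above: each node wraps the one before it
results = moltotitle(similarto(titlematches(dataset, "ZINC00"),
                               targetmol, 0.50))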
Here's the full program:
import re
import os

import pybel  # or: from cinfony import pybel
def createpipeline(*filters):
    # Chain the filter nodes together; each filter is a tuple of
    # (generator function, extra arguments)
    def pipeline(dataset):
        piped_data = dataset
        for filt in filters:
            piped_data = filt[0](piped_data, *filt[1:])
        return piped_data
    return pipeline
def titlematches(mols, patt):
    # Pass through only those molecules whose title matches the pattern
    p = re.compile(patt)
    return (mol for mol in mols if p.search(mol.title))
def similarto(mols, target, cutoff=0.7):
    # Pass through only those molecules with a Tanimoto similarity
    # to the target of at least the cutoff
    target_fp = target.calcfp()
    return (mol for mol in mols if (mol.calcfp() | target_fp) >= cutoff)
def moltotitle(mols):
    # Convert a stream of molecules into a stream of their titles
    return (mol.title for mol in mols)
if __name__ == "__main__":
    inputfile = os.path.join("..", "face-off", "timing", "3_p0.0.sdf")

    # Find the target molecule for the similarity filter
    dataset = pybel.readfile("sdf", inputfile)
    findtargetmol = createpipeline((titlematches, "ZINC00002647"),)
    targetmol = next(findtargetmol(dataset))

    # Pipeline Python!
    pipeline = createpipeline((titlematches, "ZINC00"),
                              (similarto, targetmol, 0.50),
                              (moltotitle,))

    # Create an input source
    dataset = pybel.readfile("sdf", inputfile)

    # Feed the pipeline
    results = pipeline(dataset)

    # Print out each answer as it comes through the pipeline
    for title in results:
        print(title)
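Adding another node to the pipeline is just a matter of writing one more generator function. As a sketch (not part of the program above, and assuming pybel's Molecule.molwt attribute), here's a molecular-weight filter that could be dropped in anywhere in the chain:
def lighterthan(mols, maxwt):
    # Pass through only those molecules below the given molecular weight
    # (hypothetical node; relies on pybel's mol.molwt)
    return (mol for mol in mols if mol.molwt < maxwt)

pipeline = createpipeline((titlematches, "ZINC00"),
                          (lighterthan, 300.0),
                          (similarto, targetmol, 0.50),
                          (moltotitle,))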
So, if in future someone tells you that Python generators can be used to make a workflow, don't say "I never node that".
Image: Pipeline by Travis S. (CC BY-NC 2.0)