Wouldn't it be nicer if they worked more like Unix pipes? That is, as soon as some data comes out of Node 1, it gets passed on to the next node, and so on. This would have three advantages: (1) you get the first result more quickly, (2) you don't use up lots of memory storing all of the intermediate results, and (3) you can run things in parallel; for example, Node 2 could start processing the data from Node 1 immediately, perhaps even on a different computer.
Luckily, there is a neat feature in Python called a generator that allows you to create a pipeline that processes data in parallel. Generators are functions that return a sequence of values. However, unlike just returning a list of values, they only calculate and return the next item in the sequence when requested. One reason this is useful is because the sequence of items could be very large, or even infinite in length. (For a more serious introduction, see David Beazley's talk at PyCon'08, which is the inspiration for this blog post.)
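To make the idea concrete, here is a minimal sketch of a generator (written in modern Python 3 syntax; the `countdown` function is just an illustration, not part of the pipeline below). Nothing inside the function body runs until a value is requested:

```python
def countdown(n):
    # A generator function: each yield hands back one value,
    # then execution pauses until the next value is requested.
    while n > 0:
        yield n
        n -= 1

gen = countdown(3)       # no code has run yet
print(next(gen))         # -> 3 (runs up to the first yield)
print(list(gen))         # -> [2, 1] (drains the remaining values)
```

Because values are produced one at a time, a generator over a million molecules costs no more memory than one over ten.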
Let's create a pipeline for processing an SDF file that has three nodes: (1) a filter node that looks for the word "ZINC00" in the title of the molecule, (2) a filter node for Tanimoto similarity to a target molecule, (3) an output node that returns the molecule title. (The full program is presented at the end of this post.)
```python
# Pipeline Python!
pipeline = createpipeline((titlematches, "ZINC00"),
                          (similarto, targetmol, 0.50),
                          (moltotitle,))

# Create an input source
dataset = pybel.readfile("sdf", inputfile)

# Feed the pipeline
results = pipeline(dataset)
```

The variable 'results' is a generator, so nothing actually happens until we request the values returned by the generator...
```python
# Print out each answer as it comes
for title in results:
    print title
```

The titles of the molecules found will appear on the screen one by one as they are found, just like in a Unix pipe. Note how easy it is to combine nodes into a pipeline.
Here's the full program:
```python
import re
import os
import itertools

# from cinfony import pybel
import pybel

def createpipeline(*filters):
    def pipeline(dataset):
        piped_data = dataset
        for filter in filters:
            # Each filter is a tuple of (function, extra args...)
            piped_data = filter[0](piped_data, *filter[1:])
        return piped_data
    return pipeline

def titlematches(mols, patt):
    p = re.compile(patt)
    return (mol for mol in mols if p.search(mol.title))

def similarto(mols, target, cutoff=0.7):
    target_fp = target.calcfp()
    return (mol for mol in mols if (mol.calcfp() | target_fp) >= cutoff)

def moltotitle(mols):
    return (mol.title for mol in mols)

if __name__ == "__main__":
    inputfile = os.path.join("..", "face-off", "timing", "3_p0.0.sdf")
    dataset = pybel.readfile("sdf", inputfile)

    findtargetmol = createpipeline((titlematches, "ZINC00002647"),)
    targetmol = findtargetmol(dataset).next()

    # Pipeline Python!
    pipeline = createpipeline((titlematches, "ZINC00"),
                              (similarto, targetmol, 0.50),
                              (moltotitle,))

    # Create an input source
    dataset = pybel.readfile("sdf", inputfile)

    # Feed the pipeline
    results = pipeline(dataset)

    # Print out each answer as it comes through the pipeline
    for title in results:
        print title
```
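If you don't have pybel installed, the same `createpipeline` pattern can be tried on plain strings. This is just a sketch (in Python 3 syntax) with made-up filter names, `startswith` and `uppercased`, standing in for the molecule filters:

```python
def createpipeline(*filters):
    # Same idea as above: each filter is a (function, extra args...) tuple,
    # and each stage's generator lazily consumes the previous stage.
    def pipeline(dataset):
        piped_data = dataset
        for filt in filters:
            piped_data = filt[0](piped_data, *filt[1:])
        return piped_data
    return pipeline

def startswith(items, prefix):
    # Filter node: keep only items beginning with the prefix
    return (s for s in items if s.startswith(prefix))

def uppercased(items):
    # Transform node: convert each surviving item to upper case
    return (s.upper() for s in items)

pipeline = createpipeline((startswith, "zinc"), (uppercased,))
print(list(pipeline(["zinc01", "acme", "zinc02"])))  # -> ['ZINC01', 'ZINC02']
```

Each stage only does work when the stage after it asks for the next item, so the first result pops out before the input has been fully read.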
So, if in future someone tells you that Python generators can be used to make a workflow, don't say "I never node that".
Image: Pipeline by Travis S. (CC BY-NC 2.0)