My typical input is a SMILES file such as ChEMBL or PubChem, or a file I've created by processing these, or a set of reaction SMILES extracted from patents. In any case, the problem is the same - a large number of inputs that can be processed in parallel. Sure, you can split the file (see the 'split' command, though be careful to split on lines) but I tend to avoid that as it's a bit clunky. If on Linux, there's the magic "GNU parallel". But what I really should be doing is making use of multiple CPUs directly in my Python scripts.
And actually, that's not too hard. The multiprocessing module makes it pretty easy once you've figured it out. Here's my go-to template for these calculations. I try to keep it as slimmed down as possible, because there's some magic going on behind the scenes and I don't want to have to debug any complex problems. (Note that there's also a threading module, but CPython cannot parallelize CPU-bound calculations using threads due to its Global Interpreter Lock.)
import multiprocessing as mp
import pybel

def calculate(data):
    return pybel.readstring("smi", data).molwt

if __name__ == "__main__":
    POOLSIZE = 4 # the number of CPUs
    CHUNKSIZE = 1000
    pool = mp.Pool(POOLSIZE)
    with open("output.txt", "w") as out:
        with open(r"C:\Tools\LargeData\chembl_23.smi", "r") as inp:
            for result in pool.imap(calculate, inp, CHUNKSIZE):
            # for result in pool.imap_unordered(calculate, inp, CHUNKSIZE):
            # for result in map(calculate, inp): # no multiprocessing
                out.write("%f\n" % result)
1. This blog is now Python 3.
2. By editing the commented-out code, you can choose one of three variations. During development, you can avoid multiprocessing entirely by just using a regular map (equivalent to itertools.imap in Python 2). If using multiprocessing, you can choose to have the output in the same order as the input, or just have it as it comes; the latter is presumably faster.
3. If you need to kill a multiprocessing job, CTRL+C just won't do it: the pool simply spawns a new worker process to replace the one that was killed (if anyone knows of a way to make this work, please let me know). Instead, you need to open the Process Manager, choose the master Python process, and kill that one.
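One workaround that is often suggested for the CTRL+C problem is to have the worker processes ignore the interrupt signal, so that only the parent process sees it and can shut the pool down itself. The following is a minimal sketch of that idea, not a tested fix for the setup above: the helper names init_worker and run are made up for illustration, calculate is a stand-in that just measures the line length instead of calling pybel, and I have not verified how it behaves on Windows.

```python
import multiprocessing as mp
import signal


def init_worker():
    # Runs once in each worker: ignore SIGINT so that CTRL+C is
    # handled by the parent process only.
    signal.signal(signal.SIGINT, signal.SIG_IGN)


def calculate(line):
    # Hypothetical stand-in for the real per-molecule calculation.
    return len(line.strip())


def run(lines, poolsize=2, chunksize=10):
    # init_worker is passed via Pool's 'initializer' argument.
    pool = mp.Pool(poolsize, initializer=init_worker)
    results = []
    try:
        for result in pool.imap(calculate, lines, chunksize):
            results.append(result)
        pool.close()  # normal completion: no more tasks
    except BaseException:  # includes KeyboardInterrupt (CTRL+C)
        pool.terminate()  # stop the workers immediately
        raise
    finally:
        pool.join()  # wait for the worker processes to exit
    return results


if __name__ == "__main__":
    print(run(["CCO\n", "c1ccccc1\n", "C\n"]))
```

If this works on your platform, pressing CTRL+C should raise KeyboardInterrupt in the parent, which then terminates the workers rather than leaving the pool to replace them.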