Saturday 3 February 2018

Using those other processors

This year, I've decided it's time to make more use of those other processors that my PC has. At some point, it gets embarrassing to see my single-CPU job trundling along at its maximum speed while 80 or 90% of the machine's processing power sits idle.

My typical input is a SMILES file from ChEMBL or PubChem, a file I've created by processing these, or a set of reaction SMILES extracted from patents. In any case, the problem is the same: a large number of independent inputs that can be processed in parallel. Sure, you can split the file into chunks (see the 'split' command, though be careful to split on line boundaries, e.g. with 'split -l') and process each chunk separately, but I tend to avoid that as it's a bit clunky. On Linux, there's also the magic of GNU Parallel. But what I really should be doing is making use of multiple CPUs directly in my Python scripts.

And actually, that's not too hard. The multiprocessing module makes it pretty easy once you've figured it out. Here's my go-to template for these calculations. I try to keep it as slimmed down as possible, because there's some magic going on behind the scenes and I don't want to have to debug any complex problems. (Note that there's also a threading module, but CPython cannot parallelize CPU-bound calculations using threads because of its Global Interpreter Lock.)

import multiprocessing as mp
import pybel

def calculate(data):
    # The work to be done on each input line: parse the SMILES
    # and return the molecular weight
    return pybel.readstring("smi", data).molwt

if __name__ == "__main__":
    POOLSIZE = 4 # the number of CPUs
    CHUNKSIZE = 1000 # the number of input lines handed to a worker at a time
    pool = mp.Pool(POOLSIZE)
    with open("output.txt", "w") as out:
        with open(r"C:\Tools\LargeData\chembl_23.smi", "r") as inp:
            for result in pool.imap(calculate, inp, CHUNKSIZE):
            # for result in pool.imap_unordered(calculate, inp, CHUNKSIZE):
            # for result in map(calculate, inp): # no multiprocessing
                out.write("%f\n" % result)

Notes:
1. This blog is now Python 3.
2. By editing the commented-out lines, you can choose one of three variations. During development, you can avoid multiprocessing entirely by using the regular built-in map (which is lazy in Python 3, equivalent to itertools.imap in Python 2). If using multiprocessing, you can choose to have the results in the same order as the input, or just as they arrive; the latter is presumably faster, but you lose track of which result belongs to which input (see the first sketch after these notes for one way to handle that).
3. If you need to kill a multiprocessing job, CTRL+C just won't do it, as the pool spawns an additional process to replace the one you kill (if anyone knows of a way to make this work, please let me know; one commonly suggested workaround is sketched below). Otherwise, you need to use the Task Manager: choose the master Python process and kill that one.
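Regarding note 2, here's a minimal sketch (not part of the template above) of how you might keep the input-output correspondence when using imap_unordered: tag each input line with its position via enumerate, and have the worker return that index alongside the result. The input filename here is just a placeholder.

import multiprocessing as mp
import pybel

def calculate(job):
    idx, smiles = job # each job is an (index, line) pair
    return idx, pybel.readstring("smi", smiles).molwt

if __name__ == "__main__":
    pool = mp.Pool(4)
    with open("output.txt", "w") as out:
        with open("input.smi", "r") as inp: # placeholder filename
            # enumerate() pairs each line with its position in the file
            for idx, molwt in pool.imap_unordered(calculate, enumerate(inp), 1000):
                out.write("%d %f\n" % (idx, molwt))

You could then sort the output file by the index afterwards if the original order matters.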
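And regarding note 3, one workaround that is often suggested (I haven't verified it myself, particularly on Windows) is to give the pool an initializer that makes the workers ignore SIGINT; CTRL+C then raises KeyboardInterrupt only in the master process, which can terminate the pool cleanly:

import multiprocessing as mp
import signal
import pybel

def init_worker():
    # Workers ignore CTRL+C; only the master process sees the KeyboardInterrupt
    signal.signal(signal.SIGINT, signal.SIG_IGN)

def calculate(data):
    return pybel.readstring("smi", data).molwt

if __name__ == "__main__":
    pool = mp.Pool(4, init_worker)
    with open("output.txt", "w") as out:
        with open("input.smi", "r") as inp: # placeholder filename
            try:
                for result in pool.imap(calculate, inp, 1000):
                    out.write("%f\n" % result)
            except KeyboardInterrupt:
                pool.terminate() # stop the workers without spawning replacements
                pool.join()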
