Thursday, 28 November 2013

Got CPUs to burn? Put 'em to work with GNU parallel

I've just been using GNU parallel for the first time. It makes running jobs over multiple CPUs trivial.

In the past, if I had a large number of single-CPU computationally intensive jobs and multiple CPUs to run them over, I would create separate bash scripts for each CPU with a line for each calculation, e.g. ./runthis input1.smi > output1.txt. This is not super-ideal as different jobs take different lengths of time and so any CPU that finishes its bash script ahead of schedule just sits there idle. It also involves making N separate bash scripts.

Enter GNU parallel. It ships with several Linux distributions, but on CentOS I just quickly installed it from source. Once that's done, you just need to put all of the jobs in a single script and pipe it through parallel:
cat | parallel -j7 # e.g. for 7 CPUs
There are a whole load of complicated ways of using parallel. I'm very happy with this one simple way.
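For concreteness, here is a minimal sketch of that one simple way. The file name jobs.txt is made up for illustration, and ./runthis and the input*.smi files are stand-ins for whatever program and inputs you actually have. It writes one command per line and, if GNU parallel is installed, previews the job list with --dry-run, which prints the commands rather than running them.

```shell
#!/usr/bin/env bash
# Sketch only. jobs.txt, ./runthis and input*.smi are placeholder names.
jobs_file=jobs.txt

# One complete command per line; each line is an independent single-CPU job.
for i in 1 2 3 4; do
    echo "./runthis input$i.smi > output$i.txt"
done > "$jobs_file"

# Preview the commands with --dry-run, which prints them without running
# anything; drop --dry-run to actually execute up to 7 jobs at a time.
if parallel --version 2>/dev/null | grep -q 'GNU parallel'; then
    parallel -j7 --dry-run < "$jobs_file"
else
    echo "GNU parallel not found; job list written to $jobs_file" >&2
fi
```

Because parallel starts the next line as soon as any of the 7 slots frees up, a short job no longer leaves a CPU idle while a long one finishes elsewhere.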


Vladimir Chupakhin said...

Parallel is nice. I am also using the trick described here. With it you can specify the number of processes, and in the end just run: doParallel_babel *.sdf.

Unknown said...

There is also PPSS, the (Distributed) Parallel Processing Shell Script.

Noel O'Boyle said...

Thanks for the pointers.

Note to self: here's how to run gzip in parallel:
ls *.txt | parallel -j8 gzip
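A possible variant, sketched here under a few assumptions: GNU parallel also accepts its arguments after ":::" on the command line, which sidesteps piping the output of ls. The demo_txt directory and its files are invented for the example, and gzip -f is used only so the sketch can be re-run without complaining about existing .gz files.

```shell
#!/usr/bin/env bash
# Sketch only: demo_txt/ and its file names are made up for the example.
mkdir -p demo_txt
for i in 1 2 3; do
    echo "sample text $i" > "demo_txt/file$i.txt"
done

if parallel --version 2>/dev/null | grep -q 'GNU parallel'; then
    # ::: passes the file names as arguments, one gzip job per file,
    # with at most 8 jobs running at once.
    parallel -j8 gzip -f ::: demo_txt/*.txt
else
    # Fallback without GNU parallel: compress the files one after another.
    gzip -f demo_txt/*.txt
fi
```

Either way, each .txt file ends up replaced by a .txt.gz file; the parallel version just spreads the work over the available CPUs.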