In the past, if I had a large number of single-CPU computationally intensive jobs and multiple CPUs to run them on, I would create a separate bash script for each CPU with a line for each calculation, e.g. ./runthis input1.smi > output1.txt. This is not super-ideal: different jobs take different lengths of time, so any CPU that finishes its bash script early just sits there idle. It also involves making N separate bash scripts.
Enter GNU parallel. This comes with several Linux distributions, but on CentOS I just quickly installed it from source. Once done, you just need to put all of the jobs in a single script and pipe it through parallel:
cat myjobs.sh | parallel -j7 # e.g. for 7 CPUs
There are a whole load of complicated ways of using parallel. I'm very happy with this one simple way.
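For example, a jobs file with one independent job per line, reusing the illustrative runthis command from above:

# myjobs.sh: one independent job per line
./runthis input1.smi > output1.txt
./runthis input2.smi > output2.txt
./runthis input3.smi > output3.txt

parallel keeps 7 of these running at once and starts the next line as soon as a CPU frees up, so nothing sits idle.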
Parallel is nice. I am also using the trick described here: http://www.linux-magazin.de/Ausgaben/2009/02/Parallelarbeit/. You can provide the number of processes, and in the end just run something like: doParallel_babel *.sdf.
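For anyone who doesn't read German: the article's trick boils down to capping the number of concurrent background jobs. A minimal sketch of what a doParallel_babel helper could look like, assuming Open Babel's babel command and an SDF-to-SMILES conversion (the article's actual code may well differ):

# Hypothetical sketch; only the function name is taken from the comment above.
doParallel_babel () {
    local nproc=4   # number of concurrent processes (assumption)
    local f
    for f in "$@"; do
        # throttle: wait while nproc background jobs are still running
        while [ "$(jobs -rp | wc -l)" -ge "$nproc" ]; do
            sleep 1
        done
        babel "$f" "${f%.sdf}.smi" &   # convert one file in the background
    done
    wait   # block until the remaining jobs finish
}
# Usage: doParallel_babel *.sdf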
There is also PPSS, the (Distributed) Parallel Processing Shell Script: https://code.google.com/p/ppss/
Filip
Thanks for the pointers.
Note to self: here's how to run gzip in parallel:
ls *.txt | parallel -j8 gzip
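If I'm reading the parallel man page right, an equivalent that avoids parsing ls output uses the ::: argument separator:

parallel -j8 gzip ::: *.txt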