Saturday 20 October 2007

At last - merge, split and create PDFs with open source tools

I recently had to manipulate some PDFs, and I was pleasantly surprised that things had improved so much since the last time I had to do this, a couple of years ago.

Whatever you feel about PDFs (for example, you may believe they are a very effective way of destroying all scientific data in the literature), your opinion will suddenly take a nosedive the first time you have to, for example, extract a single page from a PDF, or merge two PDFs together. At this point, you will suddenly realise that Adobe own you, and that you will need to buy Adobe's software if you want to perform this trivial task.

But help is at hand in the form of Open Source software. I found that I was able to manipulate PDFs using only Open Source tools, even on Windows. The first thing to do is to install PDFCreator (GPL). Once it's installed, when you print from any application (for example, Word) you just choose the PDFCreator 'printer' and click OK. After a couple of seconds, a dialog box will pop up where you can just click "Save" (or you might want to adjust the page size to A4 or Letter), and it will make a PDF with the same name as your original document.

A couple of years ago, I used PDFCreator and it worked 96% of the time. That is, Word documents sometimes had strange symbols inserted instead of Greek letters or bullet points; also, extra spacing was sometimes inserted in lines with some text in superscript. This time, it worked perfectly. Well done, PDFCreator creators.

Scanned documents are now often provided as PDFs. I needed to merge some pages of one scanned document with my new PDF. For this, I needed Pdftk (GPL). The blurb says it all:
"If PDF is electronic paper, then pdftk is an electronic staple-remover, hole-punch, binder, secret-decoder-ring, and X-Ray-glasses. Pdftk is a command-line tool for doing everyday things with PDF documents."
Here's an example command line which is similar to the one I actually used (again, I did this on Windows). It creates a combined PDF which consists of pages 1-7 from one.pdf, then pages 1-5 from two.pdf, and ends with page 8 from one.pdf:
pdftk A=one.pdf B=two.pdf cat A1-7 B1-5 A8 output combined.pdf
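The two tasks I grumbled about at the start (pulling a single page out of a PDF, and joining two PDFs end to end) can be done with the same "cat" operation. Here is a rough sketch as a pair of shell functions. The function names extract_page and merge_all are my own invention, not part of pdftk, and the file names in the comments are placeholders. It assumes a POSIX-style shell with pdftk on the PATH; on a plain Windows command prompt you would type the pdftk lines directly instead.

```shell
# Hypothetical helpers wrapping pdftk's "cat" operation.
# These names are illustrative only; pdftk itself provides just the
# "pdftk <inputs> cat <pages> output <file>" command line.

# extract_page INPUT PAGE OUTPUT
# Copy a single page from INPUT into a new file OUTPUT.
# e.g. extract_page one.pdf 3 page3.pdf
extract_page() {
    pdftk "$1" cat "$2" output "$3"
}

# merge_all OUTPUT INPUT...
# Join every page of each INPUT, in the order given, into OUTPUT.
# e.g. merge_all merged.pdf one.pdf two.pdf
merge_all() {
    merged_out="$1"
    shift
    pdftk "$@" cat output "$merged_out"
}
```

With no page ranges after "cat", pdftk takes all pages of every input file in order, which is why merge_all needs no handles like A= and B=.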
If you ever need to deal with PDFs, hopefully these tools can help reduce the pain.
