Keeping it Small and Simple

2007.05.18

Batch merging PDF documents

Filed under: Document Processing, PDF, Perl Programming, Utilities — Lorenzo E. Danielsson @ 18:59

In an earlier post I mentioned that I had discovered pdftk, a handy utility for manipulating PDF documents.

I had a set of PDF files that had been scanned. Each file contained several actual documents, and the PDF files needed to be separated so that each document was a separate PDF file. Splitting the documents was simple, using pdftk’s burst command. Now I was left with one PDF file per page. Some of the original documents were just a page long, but some spanned two or more pages.

I needed to merge several PDF documents into a single one, and of course pdftk can handle that as well. But I soon discovered that the whole procedure was time-consuming and error-prone. So I needed a way to speed things up.

I started out by finding the page numbers of each document within the original file. For this I was able to enlist the help of a few friends since it doesn’t require knowledge of pdftk. The only thing they needed to know was how to use a PDF reader such as xpdf or evince. We wrote down the title, start and end pages of each document we found. This did take some time as several of the PDF files were several hundred pages long.

Once we had the list on paper I created a simple text file in each directory holding bursted (split) PDF documents. Each line in the text file had the format:


Title :: start page [- end page]

The next thing I did was quickly hack together a script that could work on this text file. What I came up with was the following:

#!/usr/bin/perl

use strict;
use warnings;
use File::Copy;
use File::Path;
use File::Temp qw(tempdir);    # import tempdir explicitly

# Parses a list file and returns an array of documents. Each document entry
# contains the document name, the first and the last page of the document.
sub parse_list {
    my $file = shift;
    my @documents = ();

    open my $in, '<', $file or die "Cannot open $file: $!";
    while (<$in>) {
        chomp;
        # Format: Title :: start page [- end page]. Skip lines that don't match.
        next unless /^\s*(.*?)\s*::\s*(\d+)(\s*-\s*(\d+))?\s*$/;
        my $doc = { 'title' => $1, 'start' => $2 };
        $doc->{'end'} = $4 if defined($4);
        push @documents, $doc;
    }
    close $in;

    return @documents;
}

# Converts the page number passed in to a file name of the type generated by
# pdftk burst, i.e. pg_0032.pdf
sub num_to_file {
    my $num = shift;
    return sprintf("pg_%04d.pdf", $num);
}

# Merges the pdf files from $start to $end and saves the merged document as
# $title.pdf.
sub merge {
    my ($title, $start, $end) = @_;
    my $dir = tempdir();
    for my $page ($start .. $end) {
        copy(num_to_file($page), $dir);
    }

    # Note: a title containing a single quote would break the shell quoting.
    print "pdftk $dir/*.pdf cat output '$title.pdf'\n";
    `pdftk $dir/*.pdf cat output '$title.pdf'`;
    rmtree($dir);
}

my @list = parse_list($ARGV[0]);
for my $doc (@list) {
    if (!defined($doc->{'end'})) {
        # Single-page document: just copy the page under its new name.
        copy(num_to_file($doc->{'start'}), $doc->{'title'} . ".pdf");
    } else {
        merge($doc->{'title'}, $doc->{'start'}, $doc->{'end'});
    }
}

Now this is obviously very ugly, but it actually works, which was what mattered to me. That is, it works as long as you don’t try to do anything out of the ordinary with it. At some point when I have some spare time I hope I can clean it up, and look at possible use cases. Maybe it could be slightly extended but for now it does exactly what I needed it to do, no more, no less.
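One cheap safeguard against "out of the ordinary" input is to check the list file before merging anything. The following is only a sketch, not part of the script above: check_line is a hypothetical helper name, the example titles are made up, and it assumes the same "Title :: start page [- end page]" line format.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Check one line of the list file ("Title :: start page [- end page]").
# Returns an empty string if the line looks fine, otherwise a short
# description of the problem. check_line is a hypothetical helper name.
sub check_line {
    my ($line) = @_;
    my ($title, $start, $end) =
        $line =~ /^\s*(.*?)\s*::\s*(\d+)(?:\s*-\s*(\d+))?\s*$/
        or return "does not match 'Title :: start page [- end page]'";
    return "empty title" if $title eq '';
    return "end page $end is before start page $start"
        if defined $end && $end < $start;
    return '';
}

# Example lines; the titles are made up for illustration.
for my $line ('Invoice 17 :: 3', 'Contract :: 12 - 15',
              'Backwards :: 9 - 4', 'no separator here') {
    my $problem = check_line($line);
    printf "%-22s %s\n", $line, $problem eq '' ? 'ok' : "PROBLEM: $problem";
}
```

Running something like this over each list file first would catch typos made while copying page numbers from paper, before any PDF files get written.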

I’m putting it up here simply so that if anybody else is in a similar situation, you may find the above useful. Feel free to improve it if you wish. I’m hoping I can sort out some web space soon so that I can start publishing my code again, but until then I can at least list smaller scripts here on my blog.

It may be that there is a tool already for batch merging PDF documents that I’m not aware of. I was at home, had no connection to the Internet and needed a solution to the problem right there and then. If you know of any tool that does the job, feel free to add a comment. You could also post improvements to my code there.


2007.05.10

A tool for manipulating PDF documents: pdftk

Filed under: Utilities — Lorenzo E. Danielsson @ 19:04

I recently needed to split and merge some PDF documents. Each PDF consisted of several actual documents that had been scanned and saved together as a larger PDF file. The client I was doing this for wanted each document to be a separate PDF. In most cases the documents were a single page, but some documents were slightly larger.

Since I hadn’t done anything like this before, I wasn’t sure how to solve the problem. But I do have two reliable friends: Google and aptitude search/show. When I started my search I was at home, without an Internet connection, so my only hope was aptitude. Doing aptitude search pdf gave me a lot of packages, one of them being pdftk. I did an aptitude show pdftk and read through the description. Pdftk seemed very interesting and, most importantly, it said it could split and merge PDF files.

I quickly headed off to Accra where I could get some connectivity and installed pdftk with aptitude. Soon thereafter I had split up the PDF files into separate documents, just as my client had wanted.

Pdftk is capable of much more than just splitting and merging documents, as its man page demonstrates. My own needs were very simple. First of all I needed to split a document. Assuming that you want every page to be a new document, you would use burst as follows:


% pdftk archive.pdf burst

This would take the PDF file archive.pdf and split it into separate pages. Note, however, that the original file will still exist. If, after issuing the command above, you issue ls, you will see a number of pg_*.pdf files in your directory. These are the split PDF documents. Each file name begins with pg_, followed by the page number in the original document. Each one of these is now exactly one page long.

If archive.pdf was 247 pages, then you should now have 247 new files in your directory, ranging from pg_0001.pdf up to pg_0247.pdf. Actually, there is one additional file created, doc_data.txt. As the extension reveals, the file is a regular text file. You can cat it to see the information it contains, if this sort of thing amuses you (and it should).
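If you want to confirm that the burst really produced every page, a few lines of Perl can do the counting for you. This is only a sketch: page_file and missing_pages are hypothetical helper names, and it assumes the default pg_0001.pdf style of naming described above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Name that pdftk burst gives to page $n by default, e.g. pg_0032.pdf.
sub page_file {
    my ($n) = @_;
    return sprintf 'pg_%04d.pdf', $n;
}

# Return the page numbers in 1..$pages whose file is missing from $dir.
sub missing_pages {
    my ($dir, $pages) = @_;
    return grep { my $f = "$dir/" . page_file($_); !-e $f } 1 .. $pages;
}

# Usage: perl check_burst.pl DIRECTORY PAGECOUNT (the script name is made up)
my ($dir, $pages) = @ARGV;
if (defined $pages) {
    my @missing = missing_pages($dir, $pages);
    print @missing ? "Missing pages: @missing\n" : "All $pages pages present\n";
}
```

For the 247-page example you would pass the directory holding the burst files and 247 as the two arguments.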

Now, let’s suppose some of the pages were upside down (as they in fact were, in my case). You could then issue this command to rotate the document 180 degrees (that is, flip it):


% pdftk upsidedown.pdf cat 1S output fixed.pdf

This works well if the document is a single page and will create a file fixed.pdf which is a copy of upsidedown.pdf but the page has been rotated 180 degrees. If you had a multi-page document and wanted to rotate the entire document, the following would do that:


% pdftk upsidedown.pdf cat 1-endS output fixed.pdf

In the above, I also hinted at how to merge documents. Let’s suppose that you have already split a larger document and want to merge the pages 40-45 into a new document called pages.pdf. This is done with:


% pdftk pg_004[0-5].pdf cat output pages.pdf
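The shell pattern above works because pages 40 to 45 differ only in the last digit. A range such as 38-45 cannot be written as a single character class, so the file names have to be listed one by one. Here is a small sketch that builds that list. The helper name merge_command is made up for illustration, and it again assumes the default burst file names.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Build the pdftk argument list that merges pages $first..$last of a
# burst document into $output. merge_command is a hypothetical helper.
sub merge_command {
    my ($first, $last, $output) = @_;
    my @pages = map { sprintf 'pg_%04d.pdf', $_ } $first .. $last;
    return ('pdftk', @pages, 'cat', 'output', $output);
}

my @cmd = merge_command(38, 45, 'pages.pdf');
print "@cmd\n";

# To actually run it, pass the list to system so that no shell is involved:
# system(@cmd) == 0 or die "pdftk failed: $?";
```

Passing the command to system as a list also sidesteps quoting problems with spaces in the output file name.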

These are just a few examples of things you can achieve with pdftk. As I mentioned earlier, you can do much more than this with this handy tool. The pdftk manual page comes with a few good examples that you can use to learn how to use the tool.

If you are somebody who works a lot with PDF documents, and needs a tool to manipulate them, then pdftk is definitely something you should look into. It’s not the only tool for manipulating PDF files, not even the only tool in the free software realm, but it is very capable. (No, I’m not implying that other tools are not capable.) At least I can say that I recently discovered it, and it helped me solve a particular problem. I hope it can do so for others as well.
