Keeping it Small and Simple

2007.05.18

Batch merging PDF documents

Filed under: Document Processing, PDF, Perl Programming, Utilities — Lorenzo E. Danielsson @ 18:59

In an earlier post I mentioned that I had discovered pdftk, a handy utility for manipulating PDF documents.

I had a set of scanned PDF files. Each file contained several actual documents, and the files needed to be split so that each document ended up in its own PDF file. Splitting was simple using pdftk’s burst command, which left me with one PDF file per page. Some of the original documents were just a page long, but others spanned two or more pages.

For the longer documents I needed to merge several single-page PDF files into one, and of course pdftk can handle that as well. But I soon discovered that doing this by hand was time-consuming and error-prone, so I needed a way to speed things up.
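For a single document, the manual steps look roughly like the sketch below. It only builds and prints the two pdftk command lines (burst to split, cat ... output to merge) rather than running them. The file names scan.pdf and Invoice.pdf and the two helper names are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: command that splits $input into pg_0001.pdf, pg_0002.pdf, ...
sub burst_command {
    my ($input) = @_;
    return ('pdftk', $input, 'burst');
}

# Hypothetical helper: command that joins the given page files into $output.
sub merge_command {
    my ($output, @pages) = @_;
    return ('pdftk', @pages, 'cat', 'output', $output);
}

# scan.pdf and Invoice.pdf are made-up names for this illustration.
print join(' ', burst_command('scan.pdf')), "\n";
print join(' ', merge_command('Invoice.pdf', 'pg_0012.pdf', 'pg_0013.pdf')), "\n";
```

Typing the second command once is no trouble. Typing it for every document in a file of several hundred pages is where mistakes creep in.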

I started out by finding the page numbers of each document within the original file. For this I was able to enlist the help of a few friends, since it doesn’t require any knowledge of pdftk. The only thing they needed to know was how to use a PDF reader such as xpdf or evince. We wrote down the title, start page and end page of each document we found. This did take some time, as several of the PDF files were several hundred pages long.

Once we had the list on paper I created a simple text file in each directory holding bursted (split) PDF documents. Each line in the text file had the format:


Title :: start page [- end page]
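As an illustration, a list file might contain lines like the ones in the sketch below (the titles and page numbers are invented). The hypothetical parse_line helper shows how one such line breaks down into a title, a start page and an optional end page.

```perl
use strict;
use warnings;

# Hypothetical helper: split one "Title :: start [- end]" line into its parts.
# Returns an empty list if the line does not match the format.
sub parse_line {
    my ($line) = @_;
    return unless $line =~ /^\s*(.*?)\s*::\s*(\d+)(?:\s*-\s*(\d+))?\s*$/;
    my %doc = (title => $1, start => $2);
    $doc{end} = $3 if defined $3;
    return %doc;
}

# Invented sample entries: a three-page document and a single-page one.
for my $line ('Annual report :: 12 - 14', 'Cover letter :: 3') {
    my %doc = parse_line($line);
    my $last = exists $doc{end} ? $doc{end} : $doc{start};
    print "$doc{title}: pages $doc{start} to $last\n";
}
```

Leaving the end page out marks a single-page document, which only needs to be copied, not merged.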

The next thing I did was quickly hack together a script that could work on this text file. What I came up with was the following:

#!/usr/bin/perl

use strict;
use warnings;
use File::Copy;
use File::Path;
use File::Temp qw(tempdir);   # tempdir is not exported by default

# Parses a list file and returns an array of documents. Each document entry
# contains the document name, the first and (optionally) the last page of the
# document.
sub parse_list {
    my $file = shift;
    my @documents = ();

    open my $in, '<', $file or die "Cannot open $file: $!";
    while (<$in>) {
        chomp;
        # Skip lines that do not match "Title :: start [- end]".
        next unless /^\s*(.*?)\s*::\s*(\d+)(\s*-\s*(\d+))?\s*$/;
        my $doc = { 'title' => $1, 'start' => $2 };
        $doc->{'end'} = $4 if defined($4);
        push @documents, $doc;
    }
    close $in;

    return @documents;
}

# Converts the page number passed in to a file name of the type generated by
# pdftk burst, i.e. pg_0032.pdf
sub num_to_file {
    my $num = shift;
    return sprintf("pg_%04d.pdf", $num);
}

# Merges the pdf files from $start to $end and saves the merged document as
# $title.pdf.
sub merge {
    my ($title, $start, $end) = @_;
    my $dir = tempdir();
    for my $page ($start .. $end) {
        copy(num_to_file($page), $dir) or warn "Cannot copy page $page: $!";
    }

    # The zero-padded file names make the sorted glob come out in page order.
    # Passing a list to system avoids shell quoting problems in the title.
    my @cmd = ('pdftk', sort(glob("$dir/*.pdf")), 'cat', 'output', "$title.pdf");
    print "@cmd\n";
    system(@cmd) == 0 or warn "pdftk failed for '$title'";
    rmtree($dir);
}

die "Usage: $0 listfile\n" unless @ARGV;
for my $doc (parse_list($ARGV[0])) {
    if (!defined($doc->{'end'})) {
        # Single-page document: just copy the page under its new name.
        copy(num_to_file($doc->{'start'}), $doc->{'title'} . ".pdf");
    } else {
        merge($doc->{'title'}, $doc->{'start'}, $doc->{'end'});
    }
}

Now this is obviously very ugly, but it actually works, which was what mattered to me. That is, it works as long as you don’t try to do anything out of the ordinary with it. At some point when I have some spare time I hope I can clean it up and look at possible use cases. Maybe it could be extended slightly, but for now it does exactly what I needed it to do, no more, no less.
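One cheap safeguard against the out-of-the-ordinary cases is to sanity-check the list before merging anything. The sketch below is not part of the script above. The check_entries helper and its messages are assumptions made for illustration. It flags entries whose end page comes before the start page and entries that overlap an earlier one.

```perl
use strict;
use warnings;

# Hypothetical helper: takes hash refs with 'title', 'start' and optional
# 'end' keys (the shape parse_list returns) and returns a list of problems.
sub check_entries {
    my @docs = @_;
    my @problems;
    my $prev_end = 0;
    for my $doc (sort { $a->{start} <=> $b->{start} } @docs) {
        my $end = defined $doc->{end} ? $doc->{end} : $doc->{start};
        push @problems, "$doc->{title}: end page before start page"
            if $end < $doc->{start};
        push @problems, "$doc->{title}: overlaps the previous document"
            if $doc->{start} <= $prev_end;
        $prev_end = $end if $end > $prev_end;
    }
    return @problems;
}

# Invented sample data: Memo reuses page 3 and Report has a reversed range.
my @docs = (
    { title => 'Letter', start => 1, end => 3 },
    { title => 'Memo',   start => 3 },
    { title => 'Report', start => 9, end => 7 },
);
print "$_\n" for check_entries(@docs);
```

With this sample data the check reports Memo and Report and leaves Letter alone.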

I’m putting it up here simply so that if anybody else is in a similar situation, you may find the above useful. Feel free to improve it if you wish. I’m hoping I can sort out some web space soon so that I can start publishing my code again, but until then I can at least list smaller scripts here on my blog.

It may be that there is a tool already for batch merging PDF documents that I’m not aware of. I was at home, had no connection to the Internet and needed a solution to the problem right there and then. If you know of any tool that does the job, feel free to add a comment. You could also post improvements to my code there.
