Fun with PDFs

17 May 2007

#lucene #pdf #hacks

I've been working with a lot of PDF files lately for a few different projects (see The FlatHat and Card Catalog). With our special collections cards, when you got a result back, Acrobat viewer would blow up the image to around 600%, making for a rather ugly image. For the FlatHat, I really wanted to be able to open a PDF and have the search terms highlighted, so I started hunting for ways to actually do this. I've been using PDFBox to extract text from our PDFs to index with Lucene, so I started there, and they clued me in to Adobe's PDF Open Parameters. This really killed a few birds with one stone.

When I was working on the Flat Hat newspaper, I was originally returning only the page that the search result was on. I had some misgivings about this (like what if the story was on more than one page), but being able to pass the search query from the engine into the PDF is really nice, since the user doesn't have to search through the entire issue to find the context they are looking for (e.g. whistle bait; when I saw that term, I cracked up. Definitely a different era).

Basically, the PDF Open Parameters let you pass commands into a PDF to control how it is opened. They're passed with a "#" after the filename (e.g. filename.pdf#zoom=100). You can string commands together with an ampersand (&), with a few caveats:
  1. only one digit after a decimal point is retained
  2. each parameter, together with its value, can be at most 32 characters long
  3. you can't use the reserved characters (=, #, and &) in a value, and there is no way to escape them
  4. if you turn bookmarks off for a PDF that had bookmarks showing, they won't go away until the PDF has finished rendering
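The first three caveats are easy to check before you ever hand a link to the viewer. Here is a minimal sketch of a link builder that does so. The function names are made up for illustration, and a couple of details are my assumptions rather than anything spelled out above: that the 32-character limit counts the name, the "=", and the value together, and that rounding to one decimal is an acceptable stand-in for whatever the viewer does with extra digits. The `page` parameter in the example comes from Adobe's parameter list, not from this post.

```python
RESERVED = set("=#&")      # caveat 3: these cannot appear and cannot be escaped
MAX_PARAM_LENGTH = 32      # caveat 2: assumed to cover "name=value" as a whole


def format_value(value):
    """Render a parameter value as text.

    Floats are rounded to one decimal (caveat 1). Rounding is an
    assumption; the viewer may simply drop the extra digits instead.
    """
    if isinstance(value, float):
        return f"{value:.1f}"
    return str(value)


def pdf_open_url(filename, params):
    """Append open parameters to a PDF filename or URL.

    params is a list of (name, value) pairs; their order is preserved.
    Raises ValueError if a pair breaks caveat 2 or 3.
    """
    parts = []
    for name, value in params:
        text = format_value(value)
        if RESERVED & set(name + text):
            raise ValueError(f"reserved character in parameter {name!r}")
        part = f"{name}={text}"
        if len(part) > MAX_PARAM_LENGTH:
            raise ValueError(f"parameter {name!r} is longer than {MAX_PARAM_LENGTH} characters")
        parts.append(part)
    if not parts:
        return filename
    # "#" introduces the parameters, "&" separates them
    return filename + "#" + "&".join(parts)


print(pdf_open_url("filename.pdf", [("page", 3), ("zoom", 100)]))
```

Raising an error instead of silently trimming a value seemed like the safer choice, since a truncated parameter would open the PDF in some state you didn't ask for.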
Anyway, here are some examples of what you can do: