Tuesday, October 5, 2010

on PDF

Nowadays, on the Web or through e-mail, you receive more and more documents as PDF files instead of on paper.

Roughly speaking, PDF documents are expected to display the same way on every computer platform (as opposed to documents created by the usual word processing software). This is regarded as a major advantage of PDF.

PDF vs. fonts vs. platform (in)dependence vs. resizability/scalability

Whenever a PDF document uses outline fonts or stroke fonts as opposed to bitmap fonts (see the Wikipedia article on computer fonts!), you can rescale your document to different sizes without the fonts losing quality. This is generally considered another major advantage.
But computer fonts are usually not in the public domain, so different computer platforms come with different fonts. If a PDF document only refers to its fonts instead of carrying them along, each platform displays it with whatever fonts it has available.

So what can we do about the platform dependence that stems from fonts?
  • Embed the fonts in the document: that's the approach PDF/A requires.
    PDF/A is employed especially where documents need to remain usable even after many years, as in document archives.
    The major downside of this approach: PDF/A documents can be much bigger than usual PDF documents, because storing the fonts within them takes a lot of space.
  • Another approach is to render text and fonts into ready-made bitmaps.
    Of course, documents of this kind display best when the pixels in your document map 1:1 to the pixels on your screen or in your printer output.
    Any rescaling results in poor quality.
    And I think you understand this very well: there is no text (as text) left at all in such a PDF document, and you will not be able to extract any text from it.
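
To make the difference between bitmaps and outlines a bit more tangible, here is a minimal sketch in Python. It is not how any real font renderer works; the 3x3 "glyph", the two-point outline, and all function names are made up for illustration. It only shows that enlarging a bitmap repeats the pixels you already have, while enlarging an outline just multiplies coordinates.

```python
# A hypothetical 3x3 bitmap "glyph" (1 = ink, 0 = paper): a diagonal stroke.
BITMAP = [
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1],
]

# The same stroke described as an outline: the two end points of a line.
OUTLINE = [(0.0, 0.0), (3.0, 3.0)]


def scale_bitmap(bitmap, factor):
    """Nearest-neighbour upscaling: each pixel becomes a factor x factor block."""
    return [
        [pixel for pixel in row for _ in range(factor)]
        for row in bitmap
        for _ in range(factor)
    ]


def scale_outline(points, factor):
    """Scaling an outline only multiplies its coordinates."""
    return [(x * factor, y * factor) for x, y in points]


def render(bitmap):
    """Draw a bitmap as text, '#' for ink and '.' for paper."""
    return "\n".join("".join("#" if p else "." for p in row) for row in bitmap)


if __name__ == "__main__":
    # The enlarged bitmap is a staircase of 3x3 blocks, not a smooth diagonal.
    print(render(scale_bitmap(BITMAP, 3)))
    # The enlarged outline is still one exact line segment.
    print(scale_outline(OUTLINE, 3))
```

The enlarged bitmap carries no more detail than the small one, which is why rescaled bitmap pages look blocky, whereas the outline can be drawn fresh at whatever resolution the screen or printer offers.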

Now you know: different kinds of PDF documents come with different advantages and disadvantages.

I am interested here in PDF documents that are not rendered into "one bitmap per page", but rather contain the source document's text. Extracting that text simply as text is more or less a piece of cake, and software for this purpose already exists.
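
Just to give a rough impression of why this is feasible, here is a small Python sketch. It assumes a hand-written, uncompressed page content stream (the sample data below is made up), in which text is shown by the `Tj` operator with a string in parentheses in front of it. Real PDF files usually compress their streams and also use other text operators, hexadecimal strings, and font-specific encodings, so please take this as a toy and not as a text extractor; `extract_text` is just a name I picked.

```python
import re

# A hand-written, uncompressed page content stream (hypothetical sample data).
# BT/ET bracket a text object, Tf selects a font and size, Td moves the text
# position, and Tj shows the string operand that precedes it.
SAMPLE_STREAM = rb"""
BT
/F1 12 Tf
72 720 Td
(Hello, PDF!) Tj
0 -14 Td
(Text \(as text\) survives.) Tj
ET
"""

# A literal string in parentheses (backslash escapes allowed), followed by Tj.
SHOW_TEXT = re.compile(rb"\(((?:\\.|[^\\()])*)\)\s*Tj")
ESCAPE = re.compile(rb"\\(.)")


def extract_text(stream: bytes) -> list[str]:
    """Collect the string operands of all Tj operators in the stream.

    Simplification: every backslash escape is replaced by the character that
    follows it, which is only right for escaped parentheses and backslashes.
    """
    lines = []
    for match in SHOW_TEXT.finditer(stream):
        raw = ESCAPE.sub(rb"\1", match.group(1))
        lines.append(raw.decode("latin-1"))
    return lines


if __name__ == "__main__":
    for line in extract_text(SAMPLE_STREAM):
        print(line)
```

For real documents you would rather use one of the existing tools than a regular expression like this one.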

PDF basics

Before I dive with you into what information we want to extract from PDF files, I want to explain PDF a little.

I honestly don't know PDF very deeply, but I understand it as an advanced and optimized version of PostScript. Here is the little I know about PostScript (the Wikipedia article has a slightly lengthier version!):
  • It's a stack-based programming language like Forth, using reverse Polish notation.
  • It has data structures like arrays and dictionaries, but nothing more abstract than that.
  • Subprograms are called like, and regarded as, operators of the stack machine.
  • Some relevant details may be encoded in operator names.
  • Some other relevant details (like page numbers) are encoded in comment lines; see the article on PostScript Document Structuring Conventions. I have no clue what corresponds to that in PDF. Maybe there are language elements for it.
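
The stack-machine idea from the list above can be sketched in a few lines of Python. This is a toy evaluator and by no means a PostScript interpreter: the operator names `add`, `sub`, `mul`, `dup`, `exch` and `pop` are borrowed from PostScript, but `run`, `parse_number` and the `square` procedure are my own hypothetical names, and the dictionary of procedures only stands in for what PostScript does with `/square {dup mul} def`.

```python
def parse_number(token):
    """Turn a token into an int if possible, otherwise into a float."""
    try:
        return int(token)
    except ValueError:
        return float(token)


def run(program, procedures=None):
    """Evaluate whitespace-separated tokens in reverse Polish notation.

    `procedures` maps a name to a list of tokens; such a user-defined
    subprogram is then called exactly like a built-in operator.
    """
    stack = []
    procedures = dict(procedures or {})

    def binary(op):
        def apply():
            right = stack.pop()
            left = stack.pop()
            stack.append(op(left, right))
        return apply

    def exch():
        stack[-1], stack[-2] = stack[-2], stack[-1]

    builtins = {
        "add": binary(lambda a, b: a + b),
        "sub": binary(lambda a, b: a - b),
        "mul": binary(lambda a, b: a * b),
        "dup": lambda: stack.append(stack[-1]),
        "exch": exch,
        "pop": lambda: stack.pop(),
    }

    def execute(tokens):
        for token in tokens:
            if token in procedures:
                execute(procedures[token])   # a subprogram used as an operator
            elif token in builtins:
                builtins[token]()
            else:
                stack.append(parse_number(token))  # operands are just pushed

    execute(program.split())
    return stack


if __name__ == "__main__":
    print(run("3 4 add 2 mul"))                          # (3 + 4) * 2
    print(run("5 square", {"square": ["dup", "mul"]}))   # 5 * 5
```

Operands come first and are pushed onto the stack; an operator then takes what it needs from the top and pushes its result back. That is all reverse Polish notation means here.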
Now you have an idea of what PDF looks like, and you may have a vague idea of what is possible with PDF and what isn't.
