Friday, May 20, 2011

Extracting information from a rather detailed PDF (from a software developer's point of view)

When I need to access a PDF, I prefer to read the XML that "pdftohtml -xml" creates for the PDF file. Although there are features that I miss in XML::Simple, I find that module rather convenient.
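To make the input concrete, here is a minimal sketch of reading such XML. It is written in Python with the standard library rather than in Perl with XML::Simple, so it is not the code this post is about. The sample document is hand-written in the shape I would expect from "pdftohtml -xml" (page elements containing text elements with top, left, width and height attributes); all values in it are invented, and the names TextItem and read_text_items are made up for this illustration.

```python
import xml.etree.ElementTree as ET
from typing import NamedTuple


class TextItem(NamedTuple):
    """One <text> element: position, size and flattened content (hypothetical type)."""
    page: int
    left: int
    top: int
    width: int
    height: int
    text: str


# Hand-written sample in the assumed shape of "pdftohtml -xml" output.
# All coordinates and amounts are invented.
SAMPLE_XML = """<pdf2xml>
<page number="1" position="absolute" top="0" left="0" height="1263" width="892">
<text top="100" left="50" width="80" height="15" font="0">Gross pay</text>
<text top="100" left="400" width="60" height="15" font="0"><b>3,000.00</b></text>
<text top="130" left="50" width="80" height="15" font="0">Net pay</text>
<text top="130" left="400" width="60" height="15" font="0">2,100.00</text>
</page>
</pdf2xml>"""


def read_text_items(xml_source):
    """Return every <text> element of every <page> as a TextItem."""
    root = ET.fromstring(xml_source)
    items = []
    for page in root.iter("page"):
        page_no = int(page.get("number"))
        for elem in page.iter("text"):
            # itertext() flattens nested markup such as <b> or <i>.
            content = "".join(elem.itertext()).strip()
            items.append(TextItem(
                page_no,
                int(elem.get("left")),
                int(elem.get("top")),
                int(elem.get("width")),
                int(elem.get("height")),
                content,
            ))
    return items


items = read_text_items(SAMPLE_XML)
for item in items:
    print(item)
```

Each text element ends up as one record with its position, which is all the later label/value lookup needs.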

Think of a pay slip as a PDF. It has quite a regular structure. (Of course, you might also want to receive an XML representation of it directly from the salary software, but that's another issue. In this very case this looked rather hard to achieve.)
There are labels and there are values. I want to access values by their labels. Therefore I need a specification that describes where the value belonging to a specific label is located relative to that label. I do this by giving a relative rectangular region. All text strings provided by "pdftohtml -xml" (i.e. the text elements) get stored in a matrix (X×Y). So far there have been no big obstacles to accessing the value for a label by scanning the matrix within that relative rectangular region.
Usually I neither want nor need to specify where the label is located on the page. Why would you want to specify that, as long as it's not necessary?
But certain labels appear more than once. In that case I add the absolute rectangular region of the label. Of course, this spec is as terse as possible. In the XML from "pdftohtml -xml", a page has its origin at the upper left corner. So if the label is just above y=500, you need to give neither the upper left corner nor the lower right corner of the respective rectangular region. This keeps the label/value spec only as verbose as needed.
(Right, I know a picture would help: A picture is worth a thousand words.)
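In place of the picture, a small sketch of the lookup may help. It is Python, not my Perl code, and it makes several assumptions: the "matrix" is modelled as a sparse dictionary from (left, top) to the text string, the relative region is a tuple of offsets from the label position, and the optional absolute region of the label is a dictionary in which any bound may be left out. The names CELLS, in_bounds and find_value and all coordinates are invented for the example.

```python
# Sparse "matrix": (left, top) -> text string. Origin is the upper left corner.
# The label "Net pay" appears twice on purpose; all values are invented.
CELLS = {
    (50, 100): "Gross pay",
    (400, 100): "3,000.00",
    (50, 130): "Net pay",
    (400, 130): "2,100.00",
    (50, 600): "Net pay",
    (400, 600): "25,200.00",
}

INF = float("inf")


def in_bounds(x, y, bounds):
    """Check a point against a dict with any of x_min, x_max, y_min, y_max.

    A missing key means "unbounded" on that side, which keeps specs terse.
    """
    return (bounds.get("x_min", -INF) <= x <= bounds.get("x_max", INF)
            and bounds.get("y_min", -INF) <= y <= bounds.get("y_max", INF))


def reading_order(cells):
    """Cells sorted top to bottom, then left to right."""
    return sorted(cells.items(), key=lambda kv: (kv[0][1], kv[0][0]))


def find_value(cells, label, rel, label_at=None):
    """Return the first text inside `rel`, taken relative to the label position.

    rel      -- (dx_min, dy_min, dx_max, dy_max), offsets from the label
    label_at -- optional absolute bounds for the label itself
    """
    label_at = label_at or {}
    for (lx, ly), text in reading_order(cells):
        if text != label or not in_bounds(lx, ly, label_at):
            continue
        region = {
            "x_min": lx + rel[0], "y_min": ly + rel[1],
            "x_max": lx + rel[2], "y_max": ly + rel[3],
        }
        for (x, y), candidate in reading_order(cells):
            if (x, y) != (lx, ly) and in_bounds(x, y, region):
                return candidate
    return None


# Value expected 300 to 400 units to the right of the label, on the same line.
RIGHT_OF_LABEL = (300, -5, 400, 5)

gross = find_value(CELLS, "Gross pay", RIGHT_OF_LABEL)
net_upper = find_value(CELLS, "Net pay", RIGHT_OF_LABEL, label_at={"y_max": 500})
net_lower = find_value(CELLS, "Net pay", RIGHT_OF_LABEL, label_at={"y_min": 500})
print(gross, net_upper, net_lower)
```

For the unique label no absolute region is given at all. For the repeated label a single bound (y_max or y_min) is enough to pick the right occurrence, so no corner of the label's rectangle has to be spelled out.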

My software is implemented in Perl, and so far the label/value specs are done programmatically. Of course, I would like to have a spec as XML or as a DSL, but I am not there yet.

To be continued …
