Jochen Hayek's Blog: PDF harvesting – automatic extraction of information from PDF files

Wednesday, April 27, 2011

PDF harvesting – automatic extraction of information from PDF files

Why and what for ...

I assume there is a lot of structured information available in PDF files, just as in web pages, and I have been thinking about making use of that information as well.
Of course I owe you a discussion of which kinds of structured information we might wish to get hold of, and maybe I am going to add that here one day, but for the time being I will focus on tables only.

For web pages, i.e. HTML pages, there are a couple of harvesting techniques out there.
I am focusing on table extraction here, and for HTML table extraction I want to refer you to another blog article of mine.
HTML table extraction looks solved to me; PDF table extraction doesn't.
I think there isn't as much meta-information left in PDF files as there is in HTML. I would appreciate it if somebody proved me wrong there, but I suspect that's the way it is. I am still a little optimistic that tables can be recognized in PDF files.

Update 2011-07-03:

all these approaches have already cost me far too much effort
I don't regard any of them as production-ready
of course it's possible to extract tabular information from PDF files
if you do want to get at it yourself (with specific software, i.e. your own code), make use of "pdftohtml -xml"
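
To give an idea of what working with "pdftohtml -xml" can look like, here is a minimal Python sketch. It is not the code I actually use. It assumes that the XML consists of page elements containing text elements with top, left and width attributes, as the pdftohtml output I have seen does. The names Fragment, pdf_to_xml, parse_fragments and SAMPLE_XML are made up for this illustration, and the sample data is invented.

```python
import subprocess
import xml.etree.ElementTree as ET
from typing import List, NamedTuple


class Fragment(NamedTuple):
    """One piece of text with its position on the page (hypothetical type)."""
    page: int
    top: int
    left: int
    width: int
    text: str


# Invented sample in the shape "pdftohtml -xml" typically produces.
SAMPLE_XML = b"""<pdf2xml>
<page number="1" position="absolute" top="0" left="0" height="1188" width="918">
<text top="100" left="50" width="40" height="12" font="0">Date</text>
<text top="100" left="200" width="60" height="12" font="0"><b>Amount</b></text>
<text top="120" left="50" width="70" height="12" font="0">2011-07-03</text>
<text top="120" left="200" width="40" height="12" font="0">12.50</text>
</page>
</pdf2xml>
"""


def pdf_to_xml(pdf_path: str, out_base: str) -> bytes:
    """Run "pdftohtml -xml" and return the XML it wrote.

    Assumption: the tool writes its result to out_base + ".xml".
    """
    subprocess.run(["pdftohtml", "-xml", pdf_path, out_base], check=True)
    with open(out_base + ".xml", "rb") as handle:
        return handle.read()


def parse_fragments(xml_bytes: bytes) -> List[Fragment]:
    """Collect every text element together with its coordinates."""
    root = ET.fromstring(xml_bytes)
    fragments = []
    for page in root.iter("page"):
        page_no = int(page.get("number"))
        for el in page.iter("text"):
            # itertext() also picks up text nested in <b>, <i> or <a>.
            text = "".join(el.itertext()).strip()
            fragments.append(Fragment(page_no, int(el.get("top")),
                                      int(el.get("left")),
                                      int(el.get("width")), text))
    return fragments


if __name__ == "__main__":
    for fragment in parse_fragments(SAMPLE_XML):
        print(fragment)
```

Once you have the text fragments together with their coordinates, recognising a table is "only" a matter of grouping them into rows and columns.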

Update 2012-01-09:

read my latest (2012-01) article(s) tagged pdftohtml!
I created some software that extracts tables from PDF files and saves them as CSV
the heuristics for recognizing the tables are not in the software; instead the user has to specify the physical x-positions, but that's not too hard, and the software supplies the user with all the necessary details
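
The idea of user-supplied x-positions can be sketched roughly as follows. This is an illustration in Python, not the software mentioned above. It assumes that each text fragment comes as a (top, left, text) tuple, that the x-positions are the left edges of the columns, and that fragments whose top values differ by at most a small tolerance belong to the same row. The names fragments_to_rows, rows_to_csv and row_tolerance are hypothetical.

```python
import csv
import io
from bisect import bisect_right


def fragments_to_rows(fragments, column_starts, row_tolerance=2):
    """Group (top, left, text) fragments into table rows.

    column_starts: user-specified x-positions where each column begins.
    row_tolerance: maximum vertical distance (in page units) from the
    first fragment of a row for a fragment to count as the same row.
    """
    starts = sorted(column_starts)
    rows = []
    current_top = None
    for top, left, text in sorted(fragments):
        if current_top is None or top - current_top > row_tolerance:
            # Start a new row with one empty cell per column.
            rows.append([[] for _ in starts])
            current_top = top
        # Rightmost column whose start is <= left; anything left of the
        # first start falls into column 0 (an arbitrary choice).
        col = max(bisect_right(starts, left) - 1, 0)
        rows[-1][col].append(text)
    return [[" ".join(cell) for cell in row] for row in rows]


def rows_to_csv(rows):
    """Serialise the rows as CSV text."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    # Invented sample data: two columns starting at x=40 and x=180.
    sample = [(100, 50, "Date"), (100, 200, "Amount"),
              (120, 50, "2011-07-03"), (121, 200, "12.50")]
    print(rows_to_csv(fragments_to_rows(sample, [40, 180])), end="")
```

The nice thing about leaving the x-positions to the user is that the code stays trivial; all the hard guessing is done by a human looking at the coordinates.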

pdf2table

2010-10-10 :

in contact with the author
created Apache Ant build files, managed to build the software on Linux
managed to run the software
the software actually runs "pdftohtml -xml"
the software attempts to recognise tables using some heuristics and creates an XML file with the tables it recognised, but it fails far too often for my taste
created RNC for that XML file
corrected that XML file, optimized it slightly for further processing
created a simple shell script to wrap the Java class
…

This looked rather promising to me, but I actually encountered too many obstacles.
The only benefit from my attempts here: I got quite familiar with "pdftohtml -xml", which I have been using a lot lately.

tableseer

2010-10-10 :

in contact with the maintainer on sourceforge and also with the authors
the software comes with Apache Ant build files
but I haven't managed to run the software
…

Links and resources

http://ieg.ifs.tuwien.ac.at/projects/pdf2table/
(GNU license, apparently "sleeping" since 2006)
seems to apply heuristics to recognize tables and delivers XML for that;
I am going to try this out rather soon
http://tableseer.sourceforge.net/
…
http://www.convertzone.com/comparepdf2txt.htm
(commercial, apparently "sleeping" since 2003)
I will only spend my time and no money on these experiments, so I doubt I will be able to try this software out.
…