Wednesday, April 27, 2011

PDF harvesting – automatic extraction of information from PDF files

Why and what for ...

I assume that a PDF file may contain a lot of structured information, just as web pages do, and I have been thinking about making use of that information as well.
Of course I owe you a discussion of what kinds of structured information we might wish to get hold of, and maybe I will add that here one day, but for the time being I will focus on tables only.

For web pages, i.e. HTML pages, there are a couple of harvesting techniques out there.
I am focusing here on table extraction, and for HTML table extraction I want to refer you to another blog article of mine.
HTML table extraction looks solved to me; PDF table extraction does not.
I think there isn't as much meta-information left in PDF files as there is in HTML. I would appreciate it if somebody proved me wrong there, but I suspect that's the way it is. I am still a little optimistic that tables can be recognized in PDF files.

Update 2011-07-03:
  • all these approaches have cost me far too much effort already
  • I don't regard any of them as production-ready
  • of course it is possible to extract tabular information from PDF files
  • if you want to get at it yourself (with specific software, i.e. your own code), make use of "pdftohtml -xml"
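
For anyone starting from "pdftohtml -xml", here is a minimal Python sketch of reading its output. It assumes the usual layout of that XML, namely "page" elements containing "text" elements that carry "top", "left", "width" and "height" attributes. The function name read_text_boxes and the SAMPLE snippet are made up for illustration and are not taken from any real PDF.

```python
import xml.etree.ElementTree as ET


def read_text_boxes(xml_source):
    """Return (page, top, left, width, text) tuples from pdftohtml XML.

    xml_source is the XML document as a string (hypothetical helper,
    not part of pdftohtml itself).
    """
    root = ET.fromstring(xml_source)
    boxes = []
    for page in root.iter("page"):
        page_no = int(page.get("number", "0"))
        for node in page.iter("text"):
            # itertext() also collects text nested in <b>, <i> or <a>
            text = "".join(node.itertext()).strip()
            if not text:
                continue  # skip empty text boxes
            boxes.append((
                page_no,
                int(node.get("top")),
                int(node.get("left")),
                int(node.get("width")),
                text,
            ))
    return boxes


# Hand-written sample in the assumed format, for demonstration only.
SAMPLE = """<pdf2xml>
<page number="1" position="absolute" top="0" left="0" height="1188" width="918">
<text top="100" left="50" width="40" height="12" font="0">Name</text>
<text top="100" left="200" width="40" height="12" font="0"><b>Price</b></text>
<text top="120" left="50" width="40" height="12" font="0">Apple</text>
<text top="120" left="200" width="40" height="12" font="0">1.20</text>
</page>
</pdf2xml>"""

if __name__ == "__main__":
    for box in read_text_boxes(SAMPLE):
        print(box)
```

Every text fragment comes with its position on the page, and that is about all the structure there is to work with: whatever counts as a table has to be reconstructed from these coordinates.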
Update 2012-01-09:
  • read my latest (2012-01) article(s) tagged pdftohtml!
  • I created some software that extracts tables from PDF files and saves them as CSV
  • the heuristics for recognizing the tables are not in the software; instead the user has to specify the physical x-positions, but that's not too hard, and the software supplies the user with all the necessary details
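
To illustrate the idea of user-supplied x-positions, here is a small Python sketch. It is not the software mentioned above, just one way the approach could look: text boxes given as (top, left, text) tuples are assigned to columns by comparing "left" against a list of column start positions, grouped into rows by "top", and written out as CSV. All names (boxes_to_rows, rows_to_csv, column_starts, row_tolerance) and the tolerance value are assumptions for this example.

```python
import bisect
import csv
import io


def boxes_to_rows(boxes, column_starts, row_tolerance=2):
    """Group (top, left, text) boxes into table rows.

    column_starts: sorted x-positions where each column begins,
                   supplied by the user (no heuristics involved).
    row_tolerance: boxes whose 'top' is within this many units of the
                   row's first box share that row (assumed value).
    """
    rows = []  # list of (top_of_first_box, cells)
    for top, left, text in sorted(boxes):
        # Last column start that is <= left; clamp to the first column.
        col = max(bisect.bisect_right(column_starts, left) - 1, 0)
        if rows and top - rows[-1][0] <= row_tolerance:
            cells = rows[-1][1]
        else:
            cells = [""] * len(column_starts)
            rows.append((top, cells))
        # Several boxes may land in one cell; join them with a space.
        cells[col] = (cells[col] + " " + text).strip()
    return [cells for _, cells in rows]


def rows_to_csv(rows):
    """Serialize rows to a CSV string."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    demo_boxes = [
        (100, 50, "Name"), (100, 200, "Price"),
        (120, 50, "Apple"), (120, 90, "pie"), (121, 200, "1.20"),
    ]
    print(rows_to_csv(boxes_to_rows(demo_boxes, column_starts=[40, 180])))
```

The hard part, deciding where the columns are, is left to a human who can look at the page. The code only does the bookkeeping.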

pdf2table
2010-10-10 :
  • in contact with the author
  • created Apache Ant build files, managed to build the software on Linux
  • managed to run the software
  • the software actually runs "pdftohtml -xml"
  • the software attempts to recognise tables using some heuristics and creates an XML file with the tables it recognised, but it fails far too often for my taste
  • created RNC for that XML file
  • corrected that XML file, optimized it slightly for further processing
  • created a simple shell script to wrap the Java class
This looked rather promising to me, but I actually encountered too many obstacles.
The only benefit from my attempts here: I got quite familiar with "pdftohtml -xml", which I have been using a lot lately.

2010-10-10 :
  • in contact with the maintainer on sourceforge and also with the authors
  • the software comes with Apache Ant build files
  • but I did not manage to run the software

Links and resources