There may be a lot of structured information available in a PDF file (I assume), just as in web pages, and I was thinking about making use of that information as well.
Of course I owe you a discussion of what structured information we might wish to get hold of, and maybe I am going to add that here one day, but for the time being I will just focus on tables.
For web pages, i.e. HTML pages, there are a couple of harvesting techniques out there.
I am focusing here on table extraction, and for HTML table extraction I want to refer you to another blog article of mine.
HTML table extraction looks solved to me; PDF table extraction doesn't.
I think there isn't as much meta-information left in PDF files as there is in HTML. I would appreciate it if somebody proved me wrong there, but I rather suspect that's the way it is. I am still a little optimistic that tables can be recognized in PDF files.
Update 2011-07-03:
- all these approaches have already cost me far too much effort
- I don't regard them as production-ready
- of course it's possible to extract tabular information from PDF files
- if you do want to get at it yourself (with specific software, i.e. your own code), make use of "pdftohtml -xml"
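For readers who want to try that route, here is a minimal sketch in Python of reading the output of "pdftohtml -xml". It assumes the poppler version of pdftohtml, whose XML consists of `<page>` elements containing `<text>` elements with integer `top`, `left` and `width` attributes, and that the `-i` and `-stdout` flags behave as described in `pdftohtml -h`. The function names and the sample XML are made up for illustration.

```python
import subprocess
import xml.etree.ElementTree as ET


def run_pdftohtml(pdf_path):
    """Return the XML that `pdftohtml -xml` produces for pdf_path, as bytes."""
    # Assumption: -i skips image extraction and -stdout writes the result
    # to standard output; check `pdftohtml -h` on your system.
    result = subprocess.run(
        ["pdftohtml", "-xml", "-i", "-stdout", pdf_path],
        check=True,
        capture_output=True,
    )
    return result.stdout


def text_fragments(xml_data):
    """Yield (page, top, left, width, text) for every non-empty <text> element."""
    root = ET.fromstring(xml_data)
    for page in root.iter("page"):
        number = int(page.get("number"))
        for node in page.iter("text"):
            # itertext() also collects text nested in <b>, <i> or <a> children.
            text = "".join(node.itertext()).strip()
            if text:
                yield (
                    number,
                    int(node.get("top")),
                    int(node.get("left")),
                    int(node.get("width")),
                    text,
                )


# Hand-written sample in the shape pdftohtml -xml is assumed to produce.
SAMPLE = """<pdf2xml>
<page number="1" position="absolute" top="0" left="0" height="1263" width="892">
<fontspec id="0" size="12" family="Times" color="#000000"/>
<text top="100" left="50" width="40" height="12" font="0"><b>Name</b></text>
<text top="100" left="200" width="25" height="12" font="0">Qty</text>
</page>
</pdf2xml>"""

if __name__ == "__main__":
    for fragment in text_fragments(SAMPLE):
        print(fragment)
```

For a real file, `text_fragments(run_pdftohtml("some.pdf"))` would give the same kind of tuples; the positions are what any table recognition has to work from.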
Update 2012-01-09:
- read my latest (2012-01) article(s) tagged pdftohtml!
- I created some software that extracts tables from PDF and saves them as CSV
- the heuristics for recognizing the tables are not in the software; rather, the user has to specify the physical x-positions, but that's not too hard, and the software supplies the user with all the necessary details
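To give an idea of what "the user specifies the x-positions" can look like, here is a small Python sketch. It is not the software mentioned above, only an illustration under my own assumptions: the fragments are (top, left, text) tuples as they could be read from "pdftohtml -xml", `column_starts` is the user-supplied list of x-positions where each column begins, and `row_tolerance` is a made-up parameter for small vertical jitter.

```python
import csv
import io
from bisect import bisect_right


def fragments_to_rows(fragments, column_starts, row_tolerance=2):
    """Group (top, left, text) fragments into table rows.

    column_starts must be sorted ascending. A fragment goes into the last
    column whose start is <= its left position; fragments left of the first
    column start fall into column 0.
    """
    rows = []  # list of (top_of_first_fragment, cells)
    for top, left, text in sorted(fragments):
        col = max(bisect_right(column_starts, left) - 1, 0)
        if rows and abs(top - rows[-1][0]) <= row_tolerance:
            cells = rows[-1][1]  # same row as the previous fragment
        else:
            cells = [""] * len(column_starts)
            rows.append((top, cells))
        # Several fragments in one cell are joined with a space.
        cells[col] = (cells[col] + " " + text).strip()
    return [cells for _, cells in rows]


def rows_to_csv(rows):
    """Serialize rows as CSV text using the csv module's default dialect."""
    buffer = io.StringIO()
    csv.writer(buffer).writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    sample = [
        (100, 50, "Name"), (100, 200, "Qty"),
        (120, 50, "Apple"), (121, 205, "3"),
    ]
    print(rows_to_csv(fragments_to_rows(sample, column_starts=[40, 190])))
```

The design choice is deliberate: nothing is guessed, so wrong output can always be traced back to a wrong x-position rather than to a heuristic.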
pdf2table
2010-10-10 :
- in contact with the author
- created Apache Ant build files, managed to build the software on Linux
- managed to run the software
- the software actually runs "pdftohtml -xml"
- the software attempts to recognise tables using some heuristics and creates an XML file with the tables recognised, but it fails far too often for my taste
- created RNC for that XML file
- corrected that XML file, optimized it slightly for further processing
- created a simple shell script to wrap the Java class
- …
The only benefit from my attempts here: I got quite familiar with "pdftohtml -xml", which I have been using a lot lately.
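To illustrate what a table heuristic on top of "pdftohtml -xml" might look like, here is a deliberately naive Python sketch. It is not pdf2table's algorithm (I am not reproducing that here); it only assumes (top, left, text) fragments and proposes as column starts those left positions that recur on several different rows. The names `suggest_column_starts`, `min_rows` and `tolerance` are my own.

```python
def suggest_column_starts(fragments, min_rows=3, tolerance=3):
    """Return left positions shared by at least min_rows distinct rows.

    fragments: iterable of (top, left, text) tuples.
    Left positions at most `tolerance` units to the right of the first
    position in a cluster are merged into that cluster.
    """
    # Remember on which rows (tops) each left position occurs.
    tops_by_left = {}
    for top, left, _text in fragments:
        tops_by_left.setdefault(left, set()).add(top)

    # Merge neighbouring left positions into clusters.
    clusters = []  # each entry: [first_left_in_cluster, set_of_tops]
    for left in sorted(tops_by_left):
        if clusters and left - clusters[-1][0] <= tolerance:
            clusters[-1][1] |= tops_by_left[left]
        else:
            clusters.append([left, set(tops_by_left[left])])

    return [left for left, tops in clusters if len(tops) >= min_rows]


if __name__ == "__main__":
    sample = [
        (50, 60, "Report title"),
        (100, 50, "Name"), (100, 200, "Qty"),
        (120, 50, "Apple"), (120, 200, "3"),
        (140, 50, "Pear"), (140, 201, "12"),
        (300, 70, "Some closing paragraph"),
    ]
    print(suggest_column_starts(sample))
```

A heuristic like this breaks quickly, for instance on right-aligned numeric columns whose left positions differ from row to row, which matches my experience that automatic recognition fails too often.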
tableseer
2010-10-10 :
- in contact with the maintainer on sourceforge and also with the authors
- the software comes with Apache Ant build files
- but I did not manage to run the software
- …
- http://ieg.ifs.tuwien.ac.at/projects/pdf2table/
(GNU license, apparently "sleeping" since 2006)
seems to apply heuristics to recognize tables and delivers XML for that;
I am going to try this out rather soon
- http://tableseer.sourceforge.net/
…
- http://www.convertzone.com/comparepdf2txt.htm
(commercial, apparently "sleeping" since 2003)
I will only spend my time and no money on these experiments, so I doubt I will be able to try this software out.
- …
2 comments:
What about this effort?
Added an update today.