Showing posts with label PDF harvesting.

Monday, January 9, 2012

installing pdftohtml from sources – successfully using 0.40a

  • pdftohtml-0.40a.tar.gz (an experimental version from 2006-11-06 on SourceForge.net)
  • as opposed to 0.39, this one compiled out of the box for me on Linux and on Mac OS X Lion
True, this is not brand-new information, but I think it is still worth mentioning.

Sunday, January 8, 2012

table_pdf2csv.pl : extracting tables from PDF, saving them as CSV


  • I leave the PDF extraction bit to "pdftohtml -xml".
  • My Perl script tells you at which "physical columns" text is found within the PDF file.
  • You choose which of those "physical columns" really make sense to you as logical column starters.
  • Now you run my Perl script with those few relevant physical columns specified,
    and it creates a CSV file for you (see the sketch below).
  • Per logical row, a few physical rows may get created.
  • If you want, you can merge cells from neighboring rows into logical cells;
    LibreOffice Calc, OpenOffice Calc, or Excel will do for that step.
Does this sound interesting to you?
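
Just to give you an idea of step two, here is a rough sketch – not my actual table_pdf2csv.pl; the file name and the column starters are made up. Given the x-positions chosen as logical column starters, it buckets every text element from "pdftohtml -xml" into a column, groups the elements by their "top" coordinate, and prints CSV lines:

  #!/usr/bin/perl
  use strict;
  use warnings;
  use XML::Simple;

  # x-positions chosen as logical column starters (made-up values)
  my @col_start = (50, 180, 320, 450);

  # read the output of "pdftohtml -xml" (plain text content assumed, no nested <b>/<i>)
  my $doc = XMLin('table.xml', ForceArray => [ 'page', 'text' ], KeyAttr => []);

  my %rows;    # "top" coordinate => array of cell strings, one per logical column
  for my $page (@{ $doc->{page} }) {
      for my $t (@{ $page->{text} }) {
          # pick the right-most column starter at or left of this text element
          my $col = 0;
          for my $i (0 .. $#col_start) {
              $col = $i if $t->{left} >= $col_start[$i];
          }
          $rows{ $t->{top} }[$col] .= $t->{content} // '';
      }
  }

  # one physical row per distinct "top" value, top to bottom
  for my $top (sort { $a <=> $b } keys %rows) {
      my @cells = map { defined $_ ? $_ : '' } @{ $rows{$top} }[ 0 .. $#col_start ];
      s/"/""/g for @cells;    # naive CSV quoting
      print join(',', map { qq{"$_"} } @cells), "\n";
  }

Merging the resulting physical rows into logical rows is exactly the part I leave to the spreadsheet.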

Friday, May 20, 2011

extracting infos from a rather detailed PDF (from a software developer's point of view)

When I access a PDF, I prefer to read the XML created by "pdftohtml -xml" for that PDF file. Although there are features that I miss in XML::Simple, I find that module rather convenient.

Think of a pay slip as a PDF. It has quite a regular structure. (Of course, you might also want to receive an XML representation of it directly from the salary software, but that's another issue. In this very case that looked rather hard to achieve.)
There are labels and there are values. I want to access values by their labels. Therefore I need a specification describing where the value belonging to a specific label is located relative to that label. I do this by giving a relative rectangular range / region. All text strings provided by "pdftohtml -xml" (i.e. the text elements) get stored in a matrix (X×Y). So far there were no big obstacles to accessing the value for a label by scanning the matrix within that relative rectangular region.
Usually I neither want nor need to specify where the label itself is located on the page. Why would you specify that, as long as it's not necessary?
But certain labels appear more than once. For those I add an absolute rectangular region for the label. Of course, this spec is kept as terse as possible. A PDF page has its origin at the upper left corner (you do know that). So if it is enough to say that the label lies just above y=500, you don't need to spell out both the upper left and the lower right corner of the corresponding rectangular region. This keeps the label/value spec just as verbose as needed.
(Right, I know a picture would help: A picture is worth a thousand words.)
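
In lieu of that picture, a minimal sketch of the idea (hypothetical label, made-up numbers and file name; my actual code differs): every text element goes into a flat list together with its coordinates, and the value for a label is whatever text falls into the rectangular region specified relative to that label.

  #!/usr/bin/perl
  use strict;
  use warnings;
  use XML::Simple;

  # read the XML that "pdftohtml -xml" produced for the pay slip
  my $doc = XMLin('payslip.xml', ForceArray => [ 'page', 'text' ], KeyAttr => []);

  # flatten all text elements into one list of { x, y, string } entries
  my @texts;
  for my $page (@{ $doc->{page} }) {
      for my $t (@{ $page->{text} }) {
          push @texts, { x => $t->{left}, y => $t->{top}, s => $t->{content} // '' };
      }
  }

  # per label: a rectangular region *relative* to the label where the value is expected
  my %spec = (
      'Net pay' => { dx_min => 10, dx_max => 200, dy_min => -3, dy_max => 3 },
  );

  sub value_for_label {
      my ($label) = @_;
      my ($anchor) = grep { $_->{s} eq $label } @texts or return;
      my $r = $spec{$label};
      my @hits = grep {
              $_ != $anchor
          and $_->{x} - $anchor->{x} >= $r->{dx_min}
          and $_->{x} - $anchor->{x} <= $r->{dx_max}
          and $_->{y} - $anchor->{y} >= $r->{dy_min}
          and $_->{y} - $anchor->{y} <= $r->{dy_max}
      } @texts;
      return join ' ', map { $_->{s} } @hits;
  }

  print value_for_label('Net pay'), "\n";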

My software is implemented in Perl, and so far the label/value specs are done programmatically. Of course, I would like to have a spec as XML or as a DSL, but I am not there yet.

To be continued …

Wednesday, April 27, 2011

PDF harvesting – automatic extraction of information from PDF files

Why and what for ...

There may be a lot of structured information available in PDF files (I assume), just as in web pages, and I was thinking about making use of that information as well.
Of course I owe you a discussion of what structured information we might wish to get hold of, and maybe I am going to add that here one day, but for the time being I will just focus on tables.

For web pages, i.e. HTML pages, there are a couple of harvesting techniques out there.
I am focusing here on table extraction, and for HTML table extraction I want to refer you to another blog article of mine.
HTML table extraction looks solved to me; PDF table extraction doesn't.
I think there isn't as much meta-information left in PDF files as there is in HTML. I would appreciate it if somebody proved me wrong there, but I rather suspect that's the way it is. Still, I am a little optimistic that tables can be recognized in PDF files.

Update 2011-07-03:
  • all these approaches have cost me far too much effort already
  • I don't regard them as production-ready
  • of course it is possible to extract tabular information from PDF files
  • if you do want to get at it yourself (with specific software, i.e. your own code), make use of "pdftohtml -xml"
Update 2012-01-09:
  • read my latest (2012-01) article(s) tagged pdftohtml!
  • I created some software that extracts tables from PDF files and saves them as CSV
  • the heuristics for recognizing the tables are not in the software; rather, the user has to specify the physical x-positions – but that's not too hard, and the software supplies the user with all the necessary and valuable details (see the sketch below)
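
To give you an idea of the kind of details the software supplies, here is a tiny sketch (not my actual code) that simply counts how often each x-position (the "left" attribute of the text elements) occurs; the frequent positions are the natural candidates for logical column starters. You feed it the XML file written by "pdftohtml -xml".

  #!/usr/bin/perl
  use strict;
  use warnings;
  use XML::Simple;

  # expects the XML file written by "pdftohtml -xml" as its only argument
  my $doc = XMLin($ARGV[0], ForceArray => [ 'page', 'text' ], KeyAttr => []);

  my %count;    # x-position => number of text elements starting there
  for my $page (@{ $doc->{page} }) {
      $count{ $_->{left} }++ for @{ $page->{text} };
  }

  # frequently used x-positions are good candidates for logical column starters
  for my $x (sort { $count{$b} <=> $count{$a} or $a <=> $b } keys %count) {
      printf "%5d text elements start at x=%s\n", $count{$x}, $x;
  }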

pdf2table
2010-10-10 :
  • in contact with the author
  • created Apache Ant build files, managed to build the software on Linux
  • managed to run the software
  • the software actually runs "pdftohtml -xml"
  • the software attempts to recognise tables using some heuristics and creates an XML file with the tables it recognised, but it fails far too often for my taste
  • created RNC for that XML file
  • corrected that XML file, optimized it slightly for further processing
  • created a simple shell script to wrap the Java class
This looked rather promising to me. But actually I encountered too many obstacles.
The only benefit from my attempts here: I got quite familiar with "pdftohtml -xml", which I have been using a lot lately.

2010-10-10 :
  • in contact with the maintainer on sourceforge and also with the authors
  • the software comes with Apache Ant build files
  • but I did not manage to run the software

Links and resources

Wednesday, April 20, 2011

an INCOMPLETE story from my "PDF to JasperReports" migration project


My "current" (as of 2011-01) project is actually rather interesting and challenging and well-paying,
but it's only going to last for no longer than 2 months, I assume.

From my customer's point of view this migration project must be horror.

I don't really know how serious "they" were when they determined that this would be a three-month project.

Migrating 98% of the documents from PDF to JasperReports' "JRXML"

  http://en.wikipedia.org/wiki/JasperReports#JRXML

will take something like six months (I started in December).
But the remaining 2% of the documents may take another three to six months.

If they don't complete that project (I mean a true 100% of the documents that need to get migrated),
they cannot abandon the old software,
which was one of the main goals initially.

They had no realistic concept for migrating all these documents;
they did not even have a realistic approach for analyzing all the PDF documents.
We are talking about many hundreds of PDF documents, or rather pages, with form fields,
and these form fields are really rather "delicate" details.

The usual way to work on PDF form fields
is to load the file into Acrobat Pro and to display the form fields.
You can "obviously" create a hard copy of each page,
but that's tedious, and you still don't get hold of all the more "atomic" details of the PDF form fields.

There was no software available for dealing with PDF forms easily in a "batch way".
I created something myself around a PDF library from Perl's CPAN ("CAM::PDF"); see the sketch below.
Oh, wonderful CPAN!!!
In the meantime I contacted the developer of that PDF library,
because I needed more details about PDF document objects, such as the page number a form field is located on,
and "of course" he helped me out.
I was so happy.
Initially the page number was only "nice to have",
as 99.5% of the documents I had dealt with so far were one-page only.
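
A minimal sketch of the kind of "batch way" tool I mean, built around CAM::PDF's getFormFieldList() (the page-number detail only arrived later, so it is left out here):

  #!/usr/bin/perl
  use strict;
  use warnings;
  use CAM::PDF;

  # list every form field name in every PDF given on the command line
  for my $file (@ARGV) {
      my $pdf = CAM::PDF->new($file)
          or die "$file: $CAM::PDF::errstr\n";
      for my $name ($pdf->getFormFieldList()) {
          print "$file\t$name\n";
      }
  }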

In the meantime I found out that I would have to deal with something like 200 multi-page documents,
so it became a serious necessity to be able to extract the number of the page the form fields are located on.

When I decided to contact the developer of CAM::PDF, I was already close to despair,
and I was overly happy when we got that feature implemented within a couple of days, with just a few e-mails in both directions.

Tuesday, February 8, 2011

"pdftohtml" vs. DRM

A project of mine involves extracting strings and other details from PDF files using "pdftohtml -xml".

A plain "pdftohtml -xml" refuses to read PDF files with set copy-protection bits set. But if you add "-nodrm" on the command line, it reads them anyway, but it mentions the problem on STDERR.

Friday, December 10, 2010

form fields in PDF – how to retrieve their details?

This command line shows you a few details:

$ pdftk … dump_data_fields

Not enough details for me.

What about CAM::PDF?
It comes with a couple of nice sample utilities (in the bin/ subdirectory); one of them is called listpdffields.pl. It does not show me enough details either, but I think I will enhance that one.
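
A sketch of the direction such an enhancement could take – this is not the actual listpdffields.pl, just my reading of CAM::PDF's getFormField() and getValue():

  #!/usr/bin/perl
  use strict;
  use warnings;
  use CAM::PDF;

  my $pdf = CAM::PDF->new($ARGV[0])
      or die "$CAM::PDF::errstr\n";

  for my $name ($pdf->getFormFieldList()) {
      my $field = $pdf->getFormField($name) or next;
      # dereference the field node to get at its dictionary
      my $dict = $pdf->getValue($field);
      # show which entries the field dictionary carries (FT, Rect, V, ...)
      print "$name: ", join(' ', sort keys %{$dict}), "\n";
  }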

Update / 2010-12-11:
Yes, CAM::PDF works very well for me. I wrote another article on that.

Saturday, October 9, 2010

TableSeer.SourceForge.net

TableSeer | Download TableSeer software for free at SourceForge.net

TableSeer is a tool that automatically identifies tables in digital documents and extracts the contents in the cells of the tables as well as table metadata.
That software seems to apply more heuristics than pdf2table.