Sunday, January 8, 2012

table_pdf2csv.pl : extracting tables from PDF, saving them as CSV


  • I leave the PDF extraction bit to "pdftohtml -xml".
  • My Perl script tells you at which "physical columns" text is found within the PDF file.
  • You choose which of those "physical columns" make sense to you as the starts of logical columns.
  • Now you run my Perl script again with those few chosen physical columns specified,
    and it creates a CSV file for you.
  • Each logical row may end up as a few physical rows in the CSV file.
  • If you want, you can merge cells from neighboring rows into logical cells;
    LibreOffice Calc, OpenOffice Calc, or Excel can be used for this step.
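
The steps above can be sketched roughly as follows. This is not table_pdf2csv.pl itself, only a minimal Python illustration of the idea. It assumes the usual "pdftohtml -xml" layout, in which each text fragment is a <text> element with "top" and "left" attributes inside a <page> element. The sample XML and the function names (read_fragments, physical_columns, to_csv) are made up for this example.

```python
import csv
import io
import xml.etree.ElementTree as ET
from bisect import bisect_right
from collections import Counter

# Tiny hand-written stand-in for "pdftohtml -xml" output (hypothetical data).
SAMPLE_XML = """<pdf2xml>
<page number="1" height="1188" width="918">
<text top="100" left="50" width="40" height="12" font="0">Item</text>
<text top="100" left="200" width="40" height="12" font="0">Price</text>
<text top="120" left="50" width="40" height="12" font="0">Apple</text>
<text top="120" left="205" width="40" height="12" font="0">1.20</text>
<text top="134" left="50" width="40" height="12" font="0">(red)</text>
</page>
</pdf2xml>"""


def read_fragments(xml_text):
    """Return a (page, top, left, text) tuple for every non-empty <text> element."""
    root = ET.fromstring(xml_text)
    frags = []
    for page in root.iter("page"):
        for el in page.iter("text"):
            # itertext() also picks up text nested in inline tags such as <b>.
            text = "".join(el.itertext()).strip()
            if text:
                frags.append((int(page.get("number")), int(el.get("top")),
                              int(el.get("left")), text))
    return frags


def physical_columns(frags):
    """Step 1: count how many fragments start at each left offset."""
    return sorted(Counter(left for _, _, left, _ in frags).items())


def to_csv(frags, column_starts):
    """Step 2: put each fragment into the logical column whose start lies
    at or to the left of it; one CSV row per distinct (page, top)."""
    starts = sorted(column_starts)
    rows = {}
    for page, top, left, text in frags:
        col = max(bisect_right(starts, left) - 1, 0)
        cells = rows.setdefault((page, top), [""] * len(starts))
        cells[col] = (cells[col] + " " + text).strip()
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for key in sorted(rows):
        writer.writerow(rows[key])
    return out.getvalue()


frags = read_fragments(SAMPLE_XML)
print(physical_columns(frags))   # left offsets and how often text starts there
print(to_csv(frags, [50, 200]))  # 50 and 200 chosen as logical column starts
```

In this sample, the fragment at left offset 205 falls into the logical column starting at 200, and "(red)" lands on a physical row of its own. That is the kind of continuation row you would merge by hand in a spreadsheet afterwards.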
Does this sound interesting to you?
