Tuesday, February 8, 2011
"pdftohtml" vs. DRM
A project of mine involves extracting strings and other details from PDF files using "pdftohtml -xml".
A plain "pdftohtml -xml" refuses to read PDF files that have their copy-protection bits set. If you add "-nodrm" on the command line, it reads them anyway and merely reports the problem on STDERR.
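For a harvesting script, one way to handle this is to try the plain call first and fall back to "-nodrm" only when needed. The following is a minimal Python sketch, not the code from my project. The names build_command and extract_xml are made up for illustration, "-stdout" is added so the XML can be captured directly, and the sketch assumes that a refused file shows up as a non-zero exit status or empty output, which you should verify against your pdftohtml version.

```python
import subprocess


def build_command(pdf_path, override_drm=False):
    """Assemble the pdftohtml argument list (hypothetical helper)."""
    cmd = ["pdftohtml", "-xml", "-stdout"]
    if override_drm:
        # Ask pdftohtml to read the file despite its copy-protection bits.
        cmd.append("-nodrm")
    cmd.append(pdf_path)
    return cmd


def extract_xml(pdf_path, run=subprocess.run):
    """Return (xml, stderr_text) for a PDF, retrying with -nodrm if needed.

    Assumption: a refused file yields a non-zero exit status or no output.
    The `run` parameter exists only so the function can be tested without
    pdftohtml being installed.
    """
    first = run(build_command(pdf_path), capture_output=True, text=True)
    if first.returncode == 0 and first.stdout.strip():
        return first.stdout, first.stderr

    retry = run(build_command(pdf_path, override_drm=True),
                capture_output=True, text=True)
    if retry.returncode != 0:
        raise RuntimeError("pdftohtml failed: " + retry.stderr.strip())
    # With -nodrm the complaint about the protection bits lands in stderr,
    # so it is returned to the caller for logging.
    return retry.stdout, retry.stderr
```

Calling extract_xml("some.pdf") hands back the XML together with whatever pdftohtml wrote to STDERR, so the script can record which files were only readable with "-nodrm".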
Labels: PDF, PDF harvesting, PDF scraping, pdftohtml
Location: Eugensplatz, Stuttgart, Germany