Wednesday, April 27, 2011
Unicode, UTF-8, Plan 9
Unicode - Wikipedia, the free encyclopedia
Did you know that "UTF-8 was originally developed for Plan 9"?!
Labels:
Plan 9 from Bell Labs,
Unicode
PDF harvesting – automatic extraction of information from PDF files
Why and what for ...
There may be a lot of structured information available in a PDF file (I assume), just as in web pages, and I was thinking about also making use of that information.
Of course I owe you a discussion here of what structured information we might wish to get hold of, and maybe I am going to add this here one day, but for the time being I will just focus on tables.
For web pages, i.e. HTML pages, there are a couple of harvesting techniques out there.
I am focusing here on table extraction, and for HTML table extraction I want to refer you to another blog article of mine.
HTML table extraction looks solved to me; PDF table extraction doesn't.
I think there isn't as much meta-information left in PDF files as there is in HTML. I would rather appreciate it if somebody proved me wrong there, but I rather suspect that's the way it is. I am still a little optimistic that tables can get recognized in PDF files.
Update 2011-07-03:
- all these approaches cost me far too much effort already
- I don't regard them production ready
- of course it's possible to extract tabular information from PDF files
- if you do want to get at it yourself (with specific software, i.e. your own code), make use of "pdftohtml -xml"
Update 2012-01-09:
- read my latest (2012-01) article(s) tagged pdftohtml!
- I created some software that extracts tables from PDF and saves them as CSV
- the heuristics for recognizing the tables are not in the software; rather, the user has to specify the physical x-positions – but that's not too hard, and the software supplies the user with all necessary and valuable details
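The idea of user-supplied x-positions can be illustrated with a minimal Python sketch. This is not the software mentioned above, only an illustration under assumptions: that the "pdftohtml -xml" output consists of `<page>` elements containing `<text>` elements with integer `top` and `left` attributes, and that all cells of one table row share exactly the same `top` value. The sample XML is hand-written, and the names `extract_rows`, `to_csv` and `column_starts` are made up for this example.

```python
import csv
import io
import xml.etree.ElementTree as ET
from bisect import bisect_right

# Hand-written sample in the style of "pdftohtml -xml" output (an assumption,
# not taken from a real PDF).
SAMPLE_XML = """<pdf2xml>
<page number="1" position="absolute" top="0" left="0" height="1263" width="892">
<text top="100" left="50" width="40" height="12" font="0">Name</text>
<text top="100" left="200" width="30" height="12" font="0">Qty</text>
<text top="120" left="50" width="40" height="12" font="0">Apples</text>
<text top="120" left="205" width="10" height="12" font="0">3</text>
</page>
</pdf2xml>"""


def extract_rows(xml_text, column_starts):
    """Group <text> elements into table rows.

    column_starts is the sorted list of x-positions (left edges) at which
    each column begins; the user supplies it, no heuristics are involved.
    """
    root = ET.fromstring(xml_text)
    rows = {}
    for page in root.iter("page"):
        page_no = int(page.get("number"))
        for text in page.iter("text"):
            left = int(text.get("left"))
            top = int(text.get("top"))
            # Index of the last column start that is <= left.
            col = bisect_right(column_starts, left) - 1
            if col < 0:
                continue  # text lies to the left of the first column
            content = "".join(text.itertext()).strip()
            # Rows are keyed by the exact "top" value; real files may need
            # a tolerance here.
            cells = rows.setdefault((page_no, top), [""] * len(column_starts))
            cells[col] = (cells[col] + " " + content).strip()
    return [rows[key] for key in sorted(rows)]


def to_csv(rows):
    """Serialize the rows as CSV text."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerows(rows)
    return buf.getvalue()


if __name__ == "__main__":
    print(to_csv(extract_rows(SAMPLE_XML, [40, 190])), end="")
```

With the column starts 40 and 190, the four sample text elements end up in a two-column table. Keying rows by the exact `top` value keeps the sketch short; text that is vertically offset by a pixel or two would need a tolerance.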
pdf2table
2010-10-10 :
- in contact with the author
- created Apache Ant build files, managed to build the software on Linux
- managed to run the software
- the software actually runs "pdftohtml -xml"
- the software attempts to recognise tables using some heuristics and creates an XML file with the tables recognised, but it fails far too often for my taste
- created RNC for that XML file
- corrected that XML file, optimized it slightly for further processing
- created a simple shell script to wrap the Java class
- …
The only benefit from my attempts here: I got quite familiar with "pdftohtml -xml", which I have been using a lot lately.
2010-10-10 :
- in contact with the maintainer on sourceforge and also with the authors
- the software comes with Apache Ant build files
- but I did not manage to run the software
- …
- http://ieg.ifs.tuwien.ac.at/projects/pdf2table/
  (GNU license, apparently "sleeping" since 2006)
  seems to apply heuristics to recognize tables and delivers XML for that;
  I am going to try this out rather soon
- http://tableseer.sourceforge.net/
  …
- http://www.convertzone.com/comparepdf2txt.htm
  (commercial, apparently "sleeping" since 2003)
  I will only spend my time and no money on these experiments, so I doubt I will be able to try this software out.
- …
Labels:
PDF,
PDF harvesting,
PDF scraping,
pdftohtml
Unicode and emoticons
Emoticon - Wikipedia, the free encyclopedia
"Emoticons are introduced in Unicode Standard version 6.0. It covers unicode range from 1F600 to 1F64F."
Labels:
Unicode
a blog of mine on blogspot.com got removed - I am 100% sure it did not contain SPAM
a blog of mine on blogspot.com got removed - I am 100% sure it did not contain SPAM - Blogger Help
There was no way of getting it back again; I wasn't even able to create an archive dump before the access got deleted.
Labels:
blogspot.com
openSUSE 11.4 – installed it from online repositories
software.opensuse.org: Download openSUSE 11.4
- I decided to install openSUSE once again from online repositories,
- I downloaded a CD image,
- burnt it on CD,
- started my 1st installation overnight (a fresh install on a VM),
- and I found it completed in the morning.
Of course I will not do a completely unattended installation on my most appreciated ASUS notebook. I will watch that then with anxious meditation and prayers.
(I attempted to insert Unicode U+1F629 ("weary face") here, but I could not find out how to achieve that.)
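For what it's worth, here is a minimal Python sketch of what that character looks like in various encodings. It only shows the code point arithmetic; whether a blog editor accepts the raw character or the HTML numeric character reference is an assumption I have not verified. The names `WEARY_FACE` and `describe` are made up for this example.

```python
import unicodedata

# U+1F629 lies outside the Basic Multilingual Plane, so it needs
# four bytes in UTF-8 and a surrogate pair in UTF-16.
WEARY_FACE = chr(0x1F629)


def describe(ch):
    """Return a few encodings and properties of a single character."""
    cp = ord(ch)
    raw16 = ch.encode("utf-16-be")
    # Split the UTF-16 bytes into 16-bit code units.
    units = [int.from_bytes(raw16[i:i + 2], "big") for i in range(0, len(raw16), 2)]
    return {
        "code_point": f"U+{cp:04X}",
        "name": unicodedata.name(ch, "<unknown>"),
        "utf8_hex": " ".join(f"{b:02X}" for b in ch.encode("utf-8")),
        "utf16_hex": " ".join(f"{u:04X}" for u in units),
        # Numeric character reference as used in HTML source.
        "html_reference": f"&#x{cp:X};",
        # The emoticon range quoted in the "Unicode and emoticons" post.
        "in_emoticons_range": 0x1F600 <= cp <= 0x1F64F,
    }


if __name__ == "__main__":
    for key, value in describe(WEARY_FACE).items():
        print(f"{key}: {value}")
```

Typing the reference `&#x1F629;` into the HTML source of a post would be one thing to try; displaying the result still depends on a font that contains the glyph.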
Labels:
Linux,
openSUSE,
openSUSE upgrade
ruby gem: Slop - Option gathering made easy
Slop - Option gathering made easy -> a DSL for command line options
Labels:
The Ruby Programming Language
Google Health, CCR, my health status and history
It's really a good thing to keep your own record "somewhere" for yourself.
You may doubt that it's a good thing to keep it at health.google.com. Well, for the time being I will keep mine there, and I will "frequently" download the CCR ("Continuity of Care Record") to my computer, let's say: each time I add a note up there.
The CCR XML is quite readable and useful.
Labels:
Google Health,
XML
Tuesday, April 26, 2011
how to supply more than one (prefix) argument to an emacs keyboard shortcut resp. function
- You cannot supply more than one argument to a keyboard shortcut; you can only supply them to the related function itself.
- You can find out which function is related to a keyboard shortcut by typing "C-h k" before the keyboard shortcut.
- Of course you would have to call the function "the Lisp way", but how and where? With "M-x eval-expression".
- After using emacs for about 20 years, today I asked myself this question seriously, because "No Gnus v0.15" started costing me too much time whenever I want to move an article from one group to another of my many, many groups (gnus-summary-move-article). It takes an "eternity" to prepare the list of groups I can move to, and that's really far too much time for me. Now I call it like this:
(gnus-summary-move-article 1 "nnfolder:persons.xyz")
It costs me far less time to switch to the *Group* buffer, find the target group there, copy the right name into a paste buffer, and use it through eval-expression.
Update 2011-06-22:
With "No Gnus v0.17" gnus-summary-move-article no longer goes through SPAM filtering again. That's what I want. But I only found that out after restarting my emacs after about 2 months, when I started to prepare the article for gnus-info-english in order to ask there how to get rid of that terrible delay.
Labels:
emacs,
keyboard shortcuts
Monday, April 25, 2011
The Last Train = "Der letzte Zug" (2006) - IMDb
The Last Train (2006) - IMDb
The German title of this movie: Der letzte Zug (nach Auschwitz).
I saw this movie last Thursday; it has a really, really touching story.
Labels:
films,
Germany,
IMDb,
the Holocaust
Thursday, April 21, 2011
JasperReports Sample Reference
JasperReports 4.0.1 - Sample Reference
I guess there is also such a directory for 3.x (my client still employs 3.x).
Labels:
JasperReports,
JasperSoftForge
blogger.com : they changed the new draft "dashboard" – they took a turn for the worse
I see much less than before. I hate that. …
Of course the non-draft dashboard is still alright: blogger.com.
Labels:
Blogger
Wednesday, April 20, 2011
an INCOMPLETE story from my "PDF to JasperReports" migration project
My "current" (as of 2011-01) project is actually rather interesting and challenging and well-paying,
but it's going to last no longer than 2 months, I assume.
From my customer's point of view this migration project must be a horror.
I don't really know how serious "they" were when they determined that this would be a 3-month project.
Migrating 98% of the documents from PDF to JasperReports' "JRXML"
http://en.wikipedia.org/wiki/JasperReports#JRXML
will take like 6 months (I started in December).
But the remaining 2% of the documents may take another 3 to 6 months.
If they don't complete that project (I mean a true 100% of the documents that need to get migrated),
they cannot abandon the old software,
which was one of the main goals initially.
They had no realistic concept for migrating all these documents,
they did not even have a realistic approach to analyzing all the PDF documents.
We are talking about many hundreds of PDF documents, or rather pages, with form fields,
and these form fields are really rather "delicate" details.
The usual way to work on PDF form fields
is to load the file into Acrobat Pro and to display the form fields.
You can "obviously" create a hard copy for each page,
but that's tedious and you still don't have hold of all the "more atomic details" of PDF form fields.
There was no software available for easy "batch-way" dealing with PDF forms.
I created something myself around a PDF library from perl's CPAN ("CAM::PDF").
Oh, wonderful CPAN!!!
In the meantime I contacted the developer of that PDF library,
because I needed more details for PDF document objects, such as the page number of a form field,
and "of course" he helped me out.
I was so happy.
Initially the page number was only "nice to have",
as 99.5% of the documents that I had dealt with so far were 1-page only.
In the meantime I found out that I would have to deal with about 200 multi-page documents,
so it was a serious necessity to be able to extract the number of the page the form-fields are located on.
When I decided to contact the developer of CAM::PDF, I was already close to despair,
and I was overjoyed when we got that feature implemented within a couple of days, with just a few e-mails in both directions.
Monday, April 18, 2011
pdfnup(1): n-up pages of pdf files - Linux man page
pdfnup(1): n-up pages of pdf files - Linux man page
This utility is a part of PDFjam.
Back in the old PostScript days I sometimes used psnup; recently I thought I would like to display 3x2 pages of a PDF file on just one page to get a quick overview of the entire file. That made me try this utility, which comes supplied with openSUSE.
Labels:
PDF
Saturday, April 16, 2011
Eden (2006) - IMDb
Eden (2006) - IMDb
With Charlotte Roche as Eden Drebb, really, really touching.
My Friday night movie.
Thursday, April 14, 2011
Tuesday, April 12, 2011
Monday, April 11, 2011
K3b - a CD and DVD authoring application for KDE
K3b - Wikipedia, the free encyclopedia
I used this utility to create backup copies of my kid's English course CDs. We were not able to play the original ones on a couple of CD players.
Location:
Charlottenburg, Berlin, Germany
Thursday, April 7, 2011
Wednesday, April 6, 2011
Perl 6 Modules Directory
Perl 6 Modules Directory
That's apparently the de facto CPAN6, the CPAN for perl6.
CSV, XML, DBD, …
Within "web" there is "Ratel", something like "eruby".
Labels:
CPAN,
The Perl6 Programming Language
Tuesday, April 5, 2011
Outta Control (TV 2008) - IMDb
Outta Control (TV 2008) - IMDb
In German: Ihr könnt euch niemals sicher sein.
I saw this movie on Friday night. Very well done!
Monday, April 4, 2011
Sunday, April 3, 2011
João Gabriel sits on a bike without training wheels and he rides it – for the very first time
That was rather overwhelming. I expected him to need help, but he didn't – he did not need any help at all. He only needs to be in a good mood to do it. And we can trust him: he stops before streets.
Labels:
family
Location:
Schöneberg, Berlin, Germany
Elisa di Rivombrosa (TV Series) - IMDb
Elisa di Rivombrosa (TV Series 2003) - IMDb
German public television "DasErste.de" shows this TV series on Tuesday nights.
Saturday, April 2, 2011