Wednesday, April 27, 2011

Amen = "Der Stellvertreter" (2002) - IMDb

Amen. (2002) - IMDb

The German title of this movie: Der Stellvertreter.

I saw this movie recently on DasErste.de.

Unicode, UTF-8, Plan 9

Unicode - Wikipedia, the free encyclopedia
Did you know that "UTF-8 was originally developed for Plan 9"?!

PDF harvesting – automatic extraction of information from PDF files

Why and what for ...

There may be a lot of structured information available in a PDF file (I assume), just as in web pages, and I was thinking about making use of that information as well.
Of course I owe you a discussion of which structured information we might want to get hold of, and maybe I am going to add it here one day, but for the time being I will just focus on tables.

For web pages, i.e. HTML pages, there are a couple of harvesting techniques out there.
I am focusing here on table extraction, and for HTML table extraction I want to refer you to another blog article of mine.
HTML table extraction looks solved to me; PDF table extraction does not.
I think there is not as much meta-information left in PDF files as there is in HTML. I would appreciate it if somebody proved me wrong there, but I suspect that is the way it is. I am still a little optimistic that tables can be recognized in PDF files.

Update 2011-07-03:
  • all these approaches have already cost me far too much effort
  • I don't regard them as production-ready
  • of course it's possible to extract tabular information from PDF files
  • if you do want to get at it yourself (with specific software, i.e. your own code), make use of "pdftohtml -xml"
Update 2012-01-09:
  • read my latest (2012-01) article(s) tagged pdftohtml!
  • I created some software that extracts tables from PDF and saves them as CSV
  • the heuristics for recognizing the tables are not in the software; rather, the user has to specify the physical x-positions – but that's not too hard, and the software supplies the user with all the necessary details
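The idea of user-supplied x-positions can be sketched roughly as follows. This is not the software mentioned above, only a minimal illustration in Python: the sample XML is made up in the shape that "pdftohtml -xml" emits (text fragments carrying "top" and "left" attributes), and the names extract_rows, to_csv and column_starts are hypothetical. It assumes that fragments of one table row share exactly the same "top" value, which real documents do not always guarantee.

```python
import csv
import io
import xml.etree.ElementTree as ET
from bisect import bisect_right

# Made-up sample in the shape of "pdftohtml -xml" output: one <text> element
# per text fragment, positioned by its "top" and "left" attributes.
SAMPLE_XML = """<pdf2xml>
<page number="1" position="absolute" top="0" left="0" height="1188" width="918">
<text top="100" left="50" width="40" height="12" font="0">Item</text>
<text top="100" left="200" width="40" height="12" font="0">Qty</text>
<text top="100" left="300" width="40" height="12" font="0">Price</text>
<text top="120" left="50" width="40" height="12" font="0">Apple</text>
<text top="120" left="205" width="10" height="12" font="0">3</text>
<text top="120" left="300" width="30" height="12" font="0">1.20</text>
</page>
</pdf2xml>"""


def extract_rows(xml_text, column_starts):
    """Group text fragments into rows (by page and 'top') and columns (by 'left').

    column_starts holds the user-specified left edge of each column.
    """
    starts = sorted(column_starts)
    rows = {}
    for page_index, page in enumerate(ET.fromstring(xml_text).iter("page")):
        for text in page.iter("text"):
            left = int(text.get("left"))
            top = int(text.get("top"))
            # Index of the last column whose left edge is <= this fragment.
            col = bisect_right(starts, left) - 1
            if col < 0:
                continue  # fragment lies left of the first column
            content = "".join(text.itertext()).strip()
            row = rows.setdefault((page_index, top), [""] * len(starts))
            row[col] = f"{row[col]} {content}".strip()
    # Sort by page, then by vertical position on the page.
    return [rows[key] for key in sorted(rows)]


def to_csv(rows):
    """Serialize the rows as CSV text."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


rows = extract_rows(SAMPLE_XML, column_starts=[40, 190, 290])
print(to_csv(rows), end="")
```

With real files, the XML would come from running "pdftohtml -xml" on the PDF first, and the column positions would be read off the "left" values that appear in that output.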

pdf2table
2010-10-10 :
  • in contact with the author
  • created Apache Ant build files, managed to build the software on Linux
  • managed to run the software
  • the software actually runs "pdftohtml -xml"
  • the software attempts to recognise tables using some heuristics and creates an XML file with the tables recognised, but it fails far too often for my taste
  • created RNC for that XML file
  • corrected that XML file, optimized it slightly for further processing
  • created a simple shell script to wrap the Java class
This looked rather promising to me. But I actually encountered too many obstacles.
The only benefit from my attempts here: I got quite familiar with "pdftohtml -xml", which I have been using a lot lately.

2010-10-10 :
  • in contact with the maintainer on sourceforge and also with the authors
  • the software comes with Apache Ant build files
  • but I did not manage to run the software

Links and resources

Unicode and emoticons

Emoticon - Wikipedia, the free encyclopedia
"Emoticons are introduced in Unicode Standard version 6.0. It covers unicode range from 1F600 to 1F64F."

a blog of mine on blogspot.com got removed - I am 100% sure it did not contain SPAM

a blog of mine on blogspot.com got removed - I am 100% sure it did not contain SPAM - Blogger Help

There was no way of getting it back again; I wasn't even able to create an archive dump before my access got deleted.

openSUSE Medical 0.0.6 released - The H Open Source: News and Features

Ruby on Rails update with faster Active Record - The H Open Source: News and Features

Version 2.10.0 of the Parrot virtual machine released - The H Open Source: News and Features

Oracle asks Apache to reconsider its position on Java - The H Open Source: News and Features

Google releases Refine 2.0 data sifting tool - The H Open Source: News and Features

Wikis for Individuals, Groups, and Organizations - Wikispaces

Wikis for Everyone - Wikispaces

Efrat Alony im Hallenbad

Efrat Alony - Lights On/Off

openSUSE 11.4 – installed it from online repositories

software.opensuse.org: Download openSUSE 11.4


  • I decided to install openSUSE once again from online repositories,
  • I downloaded a CD image,
  • burnt it on CD,
  • started my 1st installation overnight (a fresh install on a VM),
  • and I found it completed in the morning.
Of course I will not do a completely unattended installation on my most appreciated ASUS notebook. I will watch that then with anxious meditation and prayers.
(I attempted to insert here Unicode 1F629 ("weary face"), but I could not find out how to achieve that.)
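For what it is worth, here is a small Python sketch of how that code point can be produced and inspected. Only the code point 1F629 and the range 1F600 to 1F64F come from the posts above; the variable names are made up, and whether a given blog editor accepts the resulting character or the numeric character reference is an assumption that would need trying out.

```python
import unicodedata

# U+1F629 lies outside the Basic Multilingual Plane, so it needs the
# eight-digit "\U" escape in a string literal; chr() works as well.
weary_face = chr(0x1F629)

name = unicodedata.name(weary_face)        # official Unicode character name
utf8_bytes = weary_face.encode("utf-8")    # four bytes in UTF-8
html_reference = f"&#x{ord(weary_face):X};"  # numeric character reference for HTML

# The range quoted from Wikipedia for the emoticons added in Unicode 6.0.
in_emoticons_range = 0x1F600 <= ord(weary_face) <= 0x1F64F

print(name, utf8_bytes.hex(" "), html_reference, in_emoticons_range)
```

Typing the numeric character reference into the HTML view of a post is probably the simplest route, provided the font on the reader's side has a glyph for it.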

Tiny Core Linux 3.3 released - The H Open Source: News and Features

Linux tools support iOS 4.2.1 - The H Open Source: News and Features

GUIdancer test tool to become Eclipse project - The H Open Source: News and Features

O'Reilly Media book: Perl Hacks

Perl Hacks - O'Reilly Media

ruby gem: Slop - Option gathering made easy

Slop - Option gathering made easy -> a DSL for command line options

Google loses Linux patent suit - The H Open Source: News and Features

The U.S. patent system is generally sick.

Google Health, CCR, my health status and history

It's really a good thing to keep your own record "somewhere" for yourself.
You may doubt that it's a good thing to keep it at health.google.com. Well, for the time being I will keep mine there, and I will "frequently" download the CCR ("Continuity of Care Record") to my computer, let's say: each time I add a note up there.
The CCR XML is quite readable and useful.

Tuesday, April 26, 2011

how to supply more than one (prefix) argument to an emacs keyboard shortcut or rather its function

  • You cannot supply more than one argument to a keyboard shortcut; you can only supply them to the related function itself.
  • You can find out which function is bound to a keyboard shortcut by typing "C-h k" before the keyboard shortcut.
  • Of course you would have to call the function "the Lisp way", but how and where?
    "M-x eval-expression".
  • After using emacs for about 20 years, today I asked myself this question seriously, because "No Gnus v0.15" started costing me too much time whenever I wanted to move an article from one group to another of my many, many groups (gnus-summary-move-article). It takes an "eternity" to prepare the list of groups I can move to, and that's really far too much time for me. Now I call it like this:
    (gnus-summary-move-article 1 "nnfolder:persons.xyz")
    It costs me far less time to switch to the *Group* buffer, to find the target group there, to copy the right name into a paste buffer there, and to use it through eval-expression.
Update 2011-06-22:
With "No Gnus v0.17" gnus-summary-move-article no longer went through SPAM filtering again. That's what I want. But I only found that out when I started my emacs again after about 2 months, while preparing an article for gnus-info-english in order to ask there how to get rid of that terrible delay.

Monday, April 25, 2011

The Last Train = "Der letzte Zug" (2006) - IMDb

The Last Train (2006) - IMDb

The German title of this movie: Der letzte Zug (nach Auschwitz).

I saw this movie last Thursday; it has a really, really touching story.

Wednesday, April 20, 2011

JasperReports and its JRXML - Wikipedia, the free encyclopedia

JasperReports - Wikipedia, the free encyclopedia

an INCOMPLETE story from my "PDF to JasperReports" migration project


My "current" (as of 2011-01) project is actually rather interesting, challenging and well-paying,
but I assume it is going to last no longer than 2 months.

From my customer's point of view this migration project must be a horror.

I don't really know how serious "they" were when they determined that this would be a 3-month project.

Migrating 98% of the documents from PDF to JasperReports' "JRXML"

  http://en.wikipedia.org/wiki/JasperReports#JRXML

will take about 6 months (I started in December).
But the remaining 2% of the documents may take another 3 to 6 months.

If they don't complete that project (I mean a true 100% of the documents that need to be migrated),
they cannot abandon the old software,
which was one of the main goals initially.

They had no realistic concept for migrating all these documents;
they did not even have a realistic approach to analyzing all the PDF documents.
We are talking about many hundreds of PDF documents, or rather pages, with form fields,
and these form fields are really rather "delicate" details.

The usual way to work on PDF form fields
is to load the file into Acrobat Pro and to display the form fields.
You can "obviously" create a hard copy of each page,
but that's tedious and you still don't have hold of all the "more atomic details" of PDF form fields.

There was no software available for dealing with PDF forms easily in batch mode.
I created something myself around a PDF library from Perl's CPAN ("CAM::PDF").
Oh, wonderful CPAN!!!
In the meantime I contacted the developer of that PDF library,
because I needed more details on PDF document objects, such as the page number of a form field,
and "of course" he helped me out.
I was so happy.
Initially the page number was only "nice to have",
as 99.5% of the documents that I had dealt with so far were 1-page only.

In the meantime I found out that I would have to deal with about 200 multi-page documents,
so it became a serious necessity to be able to extract the number of the page the form fields are located on.

When I decided to contact the developer of CAM::PDF, I was already close to despair,
and I was overjoyed when we got that feature implemented within a couple of days, with just a few e-mails in both directions.

Monday, April 18, 2011

pdfnup(1): n-up pages of pdf files - Linux man page

This utility is a part of PDFjam.

Back in the old PostScript days I sometimes used psnup; recently I thought I would like to display 3x2 pages of a PDF file on just one page to get a quick overview of the entire file. That made me try this utility, which ships with openSUSE.
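A 3x2 overview like that could be scripted along these lines. This is only a sketch in Python: the file names input.pdf and overview.pdf and the helper pdfnup_command are made up for illustration, and it assumes a PDFjam installation whose pdfnup wrapper accepts the "--nup" and "--outfile" options (newer PDFjam releases may only ship the plain "pdfjam" command, which takes the same options).

```python
import os
import shutil
import subprocess


def pdfnup_command(source, outfile, columns=3, rows=2):
    """Build the argument list for an n-up call (columns x rows pages per sheet)."""
    return ["pdfnup", "--nup", f"{columns}x{rows}", "--outfile", outfile, source]


command = pdfnup_command("input.pdf", "overview.pdf")
print(" ".join(command))

# Only run the tool when it is installed and the (hypothetical) input file exists.
if shutil.which("pdfnup") and os.path.exists("input.pdf"):
    subprocess.run(command, check=True)
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.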

PDFjam – a collection of shell scripts dealing with PDF files

PDFjam-README.html

Tuesday, April 12, 2011

Monday, April 11, 2011

K3b - a CD and DVD authoring application for KDE

K3b - Wikipedia, the free encyclopedia

I used this utility to create backup copies of my kid's English course CDs. We were not able to play the original ones on a couple of CD players.

Tuesday, April 5, 2011

Outta Control (TV 2008) - IMDb

In German: Ihr könnt euch niemals sicher sein.

I saw this movie on Friday night. Very well done!

Sunday, April 3, 2011

João Gabriel sits on a bike without training wheels and he rides it – for the very first time

That was rather overwhelming. I expected him to need help, but he did not need any help at all. He only needs to be in a good mood to do it. And we can trust him: he stops before streets.

Elisa di Rivombrosa (TV Series) - IMDb

Elisa di Rivombrosa (TV Series 2003) - IMDb

German public television "DasErste.de" shows this TV series on Tuesday nights.