Saturday, October 30, 2010

Web crawler - Wikipedia, the free encyclopedia

Web scraping - Wikipedia, the free encyclopedia

screen-scraper.com

Google group scrubyt is gone

Once in a while I am curious to see what goes on in the scrubyt area.
I have a few Atom and RSS feed URLs stored in my feed reader (Firefox Sage), but I don't reach them any more. The group just does not exist any more:
Cannot find scrubyt
The group named scrubyt has been removed because it violated Google's Terms Of Service.
I assume it got spammed too much, the moderators didn't care, and finally the responsible Google Groups supervisors had to take that step.

I still wonder, what the future of scrubyt is. Visit this previous link, even if only for the list of competing software (ruby only, I assume).

hpricot | RubyGems.org | a swift, liberal HTML parser with a fantastic library

hpricot | RubyGems.org | your community gem host

Nokogiri – an HTML, XML, SAX, & Reader parser with the ability to search documents via XPath or CSS3 selectors… and much more

Nokogiri

No Javascript support.

web scraping afternoon

This wasn't meant to be yet another web scraping afternoon.

This afternoon started with me trying to recover a little from a hard time.
I had two probation days for a web-site testing job with Selenium, I am in the middle of a couple of recruitment processes, and I don't want to tell you about the real trouble.


  • I got intrigued to search oreilly.com for literature on Selenium and found a "Short Cut" document.
  • I found something.
  • I had a few looks over the chapter on "twill".
  • Before I really dived into the chapter on Selenium, I summed up, what I really liked and disliked about Selenium.
  • Of course, being able to use XPath is great.
  • With Selenium you somehow aren't aware at all, that there is Javascript being made use of on a web-site, but you just leave this to the browser engine, initially to Firefox and to the Selenium IDE.
  • I actually hate it if your HTTP scripting depends on desktop computers running a browser, plus some remote control software to connect the server where your "HTTP scripts" actually run with the web browser(s) that you make use of.
  • I did a little superficial research on: perl/ruby + mechanize + xpath.
  • Yes, there is still scrubyt around, but isn't that vaporware now itself?
  • Found perl's WWW::Scraper::TidyXML - "TidyXML and XPath support for Scraper". Not bad. But then it's from around 2003, and it seems to be vaporware. My e-mail to the author could not get delivered ("over quota"), so I guess, it's seriously no longer maintained.
  • WWW::Mechanize::Firefox seems to be nice, have a look at WWW::Mechanize::Firefox::Cookbook!

Friday, October 29, 2010

Scrappy: All Powerful Web Harvester, Spider, Scraper fully automated - search.cpan.org

Scrappy - search.cpan.org

EDI for Ruby (edi4r)

Actually they refer to EDIFACT here.

You can use this software to output JSON, which you can then process in any other software.

WWW::Mechanize::Firefox - search.cpan.org

Support for Javascript and XPath.

What about recording resp. capturing such a script?

perl, cpan: WWW::Scripter

WWW::Scripter - search.cpan.org

From the POD there:
DESCRIPTION 
This is a subclass of WWW::Mechanize that uses the W3C DOM and provides support for scripting.
No actual scripting engines are provided with WWW::Scripter, but are available as separate plugins. (See also the "SEE ALSO" section below.)

So it supports DOM, but no XPath expressions yet.
And there is Javascript support through plugins.

An Introduction to Testing Web Applications with twill and Selenium - O'Reilly Media

Too cheap not to own it – I thought a little, now I am reading it.

HSDD = hypoactive sexual desire disorder

A link to the abstract of the conference article / press release.

From that abstract:
CONCLUSION: Cerebral activation patterns in women with HSDD differs from those in women with normal sexual function and may reflect differences in how they interpret sexual stimuli.
In other words: Women with low libidos ‘have different brains’.

Have a good laugh!!!

Here is a lengthy discussion of the "miserable" approach in that article.

Thursday, October 28, 2010

Selenium+XPather: e.g. verifyTextPresent vs. verifyElementPresent

Selenium usually records string-based clicks and tests instead of truly native-language-independent XPath expressions. But you can always find the right XPath expression yourself (or with the help of XPather, a Firefox extension), and make use of it in your Selenium code.

Caveat: the XPath expression, that XPather tells you, needs yet another '/' in the beginning to be useful in your Selenium code.

Yes, these XPath expressions are lengthy, and you may think they overspecify the location in question, but then: when will that lengthy XPath expression ever fail? If your HTML programmer changes his code. And that's exactly what you should insist on being informed about in the first place. Track your HTML programmer! If you don't, he will screw you w/o any mercy. You don't want to screw him, but you need to know the consequences of what he is doing. Actually not in every detail, but more details are better than no details at all.

We replaced verifyTextPresent with verifyElementPresent, and it worked "out of the box". We gained native language independence immediately.
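The difference between the two checks can be sketched outside Selenium. This is only an illustration in Python using the standard library's limited XPath support, not Selenium's own API: the two page snippets, the element ids and the helper names are all made up for this example. Note also that ElementTree wants a relative path starting with "." where Selenium takes the absolute form.

```python
import xml.etree.ElementTree as ET

# Two hypothetical renderings of the same page: English and German.
PAGE_EN = "<html><body><div id='login'><button id='submit'>Log in</button></div></body></html>"
PAGE_DE = "<html><body><div id='login'><button id='submit'>Anmelden</button></div></body></html>"

# Hypothetical locator; the structure is the same in both languages.
SUBMIT_XPATH = ".//div[@id='login']/button[@id='submit']"


def text_present(page: str, text: str) -> bool:
    """Rough analogue of verifyTextPresent: is the text anywhere on the page?"""
    root = ET.fromstring(page)
    return any(text in chunk for chunk in root.itertext())


def element_present(page: str, xpath: str) -> bool:
    """Rough analogue of verifyElementPresent: does the XPath match an element?"""
    root = ET.fromstring(page)
    return root.find(xpath) is not None


# The text check depends on the page language, the element check does not.
print(text_present(PAGE_EN, "Log in"), text_present(PAGE_DE, "Log in"))
print(element_present(PAGE_EN, SUBMIT_XPATH), element_present(PAGE_DE, SUBMIT_XPATH))
```

The text check breaks as soon as the page is served in another language, while the structural check only breaks when the markup itself changes.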

Wall Street 2: Money Never Sleeps (2010)

My Wednesday (2010-10-27) night movie.

Very nice. Good entertainment.

Now I know, how to pronounce the surname "Schwartz" in English. That's one of the main characters in that movie.

Thanks to my movie night sponsor EN!!!!

Selenium: strftime, sprintf

I would like to see a strftime or an sprintf in Selenium.

Javascript has a printf, but only for files, not for strings.
(Maybe there is a way to regard strings as files, but my Javascript competence is not good enough for that.)

I found Javascript code, that implements sprintf.

(I only introduced "Selenium" in the title, because the Tweet created from a title w/o it looks stupid.)

misc. Selenium links


Selenium Training - The Automated Tester:

Un-enumerable thanks to DS, who provided me with these links above and much more!!!

Here are a few more links, that I found helpful:

Monday, October 25, 2010

Exodus 34:30 Moses' face was so radiant, that they couldn't bear it

29: When Moses came down from Mount Sinai with the two tablets of the Testimony in his hands, he was not aware that his face was radiant because he had spoken with the LORD.
30: When Aaron and all the Israelites saw Moses, his face was radiant, and they were afraid to come near him.

In German from Martin Buber's translation:
Es geschah,
als Mosche vom Berge Ssinai herabstieg
die zwei Tafeln der Vergegenwärtigung in Mosches Hand,
als er vom Berg herabstieg
– Mosche wußte aber nicht, daß von seinem Reden mit ihm die Haut seines Antlitzes strahlte –,
sah Aharon und alle Söhne Jissraels Mosche an:
da, die Haut seines Antlitzes strahlte,
und sie fürchteten sich, zu ihm zu treten.

Saturday Night Fever (1977)

Where were you, when they showed this movie? What was your life like?

yet another posting of mine on the FRITZ!Boxes

Somebody asked me a few questions regarding the FRITZ!Boxes, and I think, it makes sense to answer them here on this blog.

No, I haven't been digging into the FRITZ!Boxes "at levels well below the standard web UI".

There is no SSH server that comes with the FRITZ!Boxes. There are private web-sites where you can find a ready-made dropbear SSH server. I got mine that way. And a gkrellmd as well.

There should be a way to create a development system targeted at the "busybox" on your particular FRITZ!Box, but there are various slightly different processors being employed. I wasn't successful there when I tried. There is ip-phone-forum.de, where the FRITZ!Boxes get discussed a lot. The preferred language there is German and the threads are mostly in German, although people there would most certainly try to answer questions asked in English. Be aware that the main community for the FRITZ!Boxes is located in Germany, and I guess also in Austria and in Switzerland. You may also find a HOWTO that explains how to set up a development environment.
But actually ... - see above!

The "international" 7390 "speaks" English, but I don't think there are frequent firmware updates for it. They only provide frequent firmware updates for their respective flagship, and that is "certainly" not the *international* version of the 7390 but the "German" version -- but how do I actually know? I don't own an international 7390, I only brought one to Martinique and saw it working for like 2 weeks.

I assume the firewall on the FRITZ!Boxes is homegrown.

What tools and languages are available?
Well, basically they use the "busybox" software and what is built into that.
Nowadays there is also a "lua" interpreter provided. AVM seem to use that for CGI purposes.
perl etc. are regarded as far too resource-hungry.

Saturday, October 23, 2010

The Road (2009)

Post-apocalyptic …

It made me cry, it almost made me throw up. It's definitely not going to cheer you up.

This was my Saturday night movie.

I dislike Facebook apps pulling list of friends

I never let them do that, and I hope you (as my Facebook friend) don't either – if I notice you let them throw their propaganda at me, I will have to remove you as my Facebook friend.
Why isn't there a Facebook flag to generally opt out of that kind of thing? I guess that's against their business model. How bad.

"Fear, uncertainty and doubt" - Wikipedia, the free encyclopedia

Fear, uncertainty and doubt - Wikipedia, the free encyclopedia

"Embrace, extend and extinguish" - Wikipedia, the free encyclopedia

Embrace, extend and extinguish - Wikipedia, the free encyclopedia

Apple: iChat vs. FaceTime (XMPP?)

In his recent keynote Steve Jobs announced FaceTime for OS X as something new. But didn't we all wonder why they didn't just extend iChat to allow us to talk to somebody using FaceTime on the iPad or iPhone? I would prefer that.

Actually I would like to know whether FaceTime and iChat are XMPP-based or whether Apple prefers to invent proprietary protocols. Even “embrace and extend” would be better than that. Hmm, I don't really mean “embrace, extend and extinguish”, but that's what “embrace and extend” sadly redirects to on Wikipedia. I mean: does the world really need yet another messaging standard? There is SIP around, and it should actually get replaced by XMPP, but that's not really going to happen for the next 15 years, I assume. Google and Facebook employ XMPP for Google Talk and the Facebook chat facility respectively, and I like that.

Thursday, October 21, 2010

the coming App Store for Mac OS X: the next Apple jail?

First it sounds like the App Store will help the users to have a single point, where to find and update apps for OS X. But then: look what it means on iOS, i.e. the iPhone and the iPad: it means: apps will only find their way to your device through Apple's "placet", and finally through the App Store, and Apple will always earn money through 3rd parties' contributions.

Deutsches Historisches Museum Berlin: Exhibition: Hitler and the Germans. Nation and Crime

Deutsches Historisches Museum Berlin - Hitler and the Germans. Nation and Crime - Exhibition

Location: Deutsches Historisches Museum, "Ausstellungshalle von I. M. Pei".

Opening hours: 7 days a week 10:00 through 18:00, on Friday through 21:00.
2010-10-15 .. 2011-02-06.

I found the exhibition far too overcrowded. The captions made very bad use of 3 different fonts, occasionally making the text rather unreadable. I never saw any exhibition as badly made as this one.

Date format(s) - Wikipedia, the free encyclopedia

Date format(s) - Wikipedia, the free encyclopedia

I really love ISO 8601 as date format.
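One reason to like it, sketched in Python with the standard `datetime` module: ISO 8601 strings sort chronologically even when they are compared as plain text. The two dates are the exhibition dates from the post above; the variable names are just for illustration.

```python
from datetime import date, datetime

# Exhibition dates from the post above, parsed from ISO 8601 strings.
opening = date.fromisoformat("2010-10-15")
closing = date.fromisoformat("2011-02-06")

# Sorting the strings as plain text gives the chronological order.
as_text = sorted([closing.isoformat(), opening.isoformat()])

# A combined date and time uses "T" as the separator.
friday_closing_time = datetime(2010, 10, 15, 21, 0).isoformat()

print(as_text)
print(friday_closing_time)
```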

Mad Max Beyond Thunderdome (1985)

with Tina Turner as Aunty Entity

my KDE GNOME zigzag

Pls read this article for more information!

Oracle issues first OpenOffice.org 3.3.0 release candidate - The H Open Source: News and Features

The Road (2009) - IMDb

From the storyline on IMDB:
… They have nothing: just a pistol to defend themselves …
OMG! Did I really suggest this movie for my Saturday night? I know … – it was one of the movies I listed, because I disliked the others or I already saw them.

Ondine (2009) - IMDb

Google releases Chrome 7.0 stable - The H Security: News and Features

weather.com - Northern-Europe Satellite Map

weather.com - Map Room - Satellite Map, Weather Map, Doppler Radar Europe: Northern Satellite

Set default values with the defined-or operator. | The Effective Perler

Amazon EC2 Running SUSE Linux

Amazon Elastic Compute Cloud (EC2) Running SUSE Linux Enterprise Server

AWS: Host Your Web Site in the Cloud – Amazon Web Services Made Easy – by Jeff Barr

Host Your Web Site in the Cloud by Jeff Barr

Wednesday, October 20, 2010

upgrading my ASUS notebook to openSUSE 11.3

This was a day (or at least a couple of hours) for finally updating my Linux notebook again.
I had screwed the packaging system or yast like 2 months ago, and I haven't been able since to do any updates through yast.
This afternoon I did not really have the nerve for anything else than this – well, I could have gone to the gym, but I wanted to be a little productive at least.

I booted from a CD-ROM, installed ("upgraded") over the network, and after like 2.5 hours it looked as if I could log into the new system straight away. But I couldn't. After the log in there wasn't really any progress, it kept showing me the wallpaper. I started a yast remotely over "ssh -X", created a new user, logged into that user, but creating virtual desktops wasn't really successful. My diagnosis was, that the openSUSE guys finally had screwed, what they long intended to screw: gnome on openSUSE. I gave KDE a chance. I installed KDM and the KDE Desktop packages, restarted xdm, logged into my account, created my 4*5 virtual desktops, and started starting up my work environment as usual. KDE feels a little strange, but then: I actually live within emacs and a couple of xterm-s. Why did I actually switch from KDE to gnome like 12 years ago during my London period? Because KDE only allowed me 4*2 virtual desktops or so. That restriction got lifted apparently, so no big deal, I now go with KDE – not as a disciple, just as a user.
This is what I call my KDE GNOME zigzag.

And now back to work, resp. answering a couple of e-mails.

<off_topic>
The reason why I was quite nervous during this afternoon "dissolved". I received rather good news. Maybe I just interpret it a little too positively. Maybe I am going to recover economically a little (or even more than that) in a short while. Now I have good reason for some optimism at least. I am a lot calmer now. Maybe I should not have written private e-mails during the last couple of days. They weren't really the most sensitive ones for quite some time. But they were honest. And I had good reasons to write them the way I did. Yes, this paragraph does not belong under this title.
</off_topic>

Update 2010-10-21 #1
The upgrade screwed my mailing system as well.
The messages sounded, as if the SMTP server "out there" did not want to relay my messages any longer. The support staff there told me, my software would not attempt the required authentication (any longer). Remember: all this had worked before the upgrade!
The mailing system set up makes me believe, they push you into setting up an LDAP server, and that they want the local SMTP server to talk to that LDAP server. Alright, easy dialogs, set up seems to complete successfully.
But then: still no success WRT the mailing system.
A couple of years ago I had switched from using sendmail to postfix, as the set up had no longer worked easily with sendmail, but with postfix it did then. So now I give sendmail another try. And? Success!!!
sendmail rants something like this:
Authentication-Warning: MY_BOX.fritz.box: MY_USER set sender to USER@SURNAME.name using -f
That was really easy to solve. I had seen that message a couple of times before in my life. I always got it solved. Searched the web for this message, leaving out the private bits, keeping "using -f" together by quoting it in the enquiry. Found the resp. documentation on the sendmail.org website. Added MY_USER to /etc/mail/trusted-users, just as that manual page said. Great! My mailing system works again.
This is what I call my sendmail postfix zigzag.


Update 2010-10-21 #2
Yippee!!!! My Samba set up seems to still work – my Linux computers function as file servers on my LAN, i.e. also to my Mac OS X machine. If that didn't work any longer, that would be bad. But it does work.

do not work on your Google Mail address book in parallel!

I worked on my address book in two different tabs of my browser. That was actually Google Chrome 8, but I think, that doesn't matter. If you modify and save an address book entry in tab "B", you will not see these changes in tab "A". They are not coordinated, and that's not optimal. I am quite sure, the changes in tab "B" are still updated on your persistent database at Google's sites, and you can expect to see the changes, when you load your address book anew.

Thursday, October 14, 2010

Petrocelli is an American legal drama which ran for two seasons on NBC from September 11, 1974 to March 31, 1976

Petrocelli - Wikipedia, the free encyclopedia

Petrocelli on IMDb

In Germany it was shown starting February 1976 on ZDF. I loved watching Petrocelli then, and I wanted to study law then, but then I felt my writing capabilities were not sufficient for this kind of profession.

"double taxation treaty", that's not the only term for it

In English they call it
  • double tax ...
  • double taxation ...
  • double ... agreement
  • double ... convention
  • double ... treaty
So there are abbreviations like DTA and DTC.

In German the word is Doppelbesteuerungsabkommen, and it is abbreviated as DBA.

Wednesday, October 13, 2010

yet another UMTS/HSDPA modem for my FRITZ!Boxes: ZTE INCORPORATED K3565-Z

Vodafone had a campaign with my gym (FitnessFirst), they gave away free UMTS / HSDPA modems, and you won't believe that: they are SIM lock and net-lock free.

The modem presents itself to my 7270 as
  • ZTE INCORPORATED K3565-Z.
It works fine with my 2 SIM cards (with dedicated tariffs) for use with UMTS modems:
  • one from T-Mobile
  • one from simyo
This is my very 1st HSDPA modem, and it works with my FRITZ!Boxes – that's great!

Now I got one modem per SIM card, and I don't need to swap SIM cards on my 4G UMTS modem any longer.

Thank you, Vodafone!

Tuesday, October 12, 2010

my blogs, Facebook, Twitter, "NetworkedBlogs"

Most things I write and that I want to share, I usually write and publish on one of my blogs.
But I like to see them on my Facebook wall, in my Twitter stream, and in my Buzz stream as well.

NetworkedBlogs.com offer a Facebook application, that picks up the (new) articles on my blogs and creates resp. entries on my Facebook wall and also on my Twitter stream.

Now the tweets they generate also appear on my Facebook wall. That is a little annoying. And I don't know how to stop this. Is it NetworkedBlogs.com that pulls my Twitter stream onto my Facebook wall, or is it Facebook itself? For sure, it's me who configured that somewhere at some stage in the past. But as I said: I don't know how to stop this.

It's pretty similar on my Buzz stream, but that's not directly because of NetworkedBlogs.com. I want to have my blogs articles included there and also my tweets. But then most tweets got created from blog articles. It's rather rare, that I write a tweet manually. I got my Buzz stream to include my blogs and also my Twitter stream. That is easy to handle.

Sometimes I remove duplicate Facebook wall and Buzz entries, but not very often.

I am a software developer, that's true, but so far I have been a little lazy, and I did not create an application myself, that would smartly derive Facebook wall and Buzz entries and also tweets, avoiding redundancy. I am sure my readers get pretty annoyed by that redundancy. It's simply, that I have no time to fight this redundancy. This is rather sad.

Bible (World English) - Wikisource

Another good online resource for quoting.

Old English Hexateuch - Wikipedia, the free encyclopedia

At first I got a little confused, as so far I have only been familiar with the term Pentateuch (Greek for five books), somehow synonymous with Torah.

PDF::Extract - search.cpan.org


A Perl module, that you can find on CPAN.

I have not used it yet; this article here is only my bookmark for it.

PDF::Burst - search.cpan.org

A Perl module, that you can find on CPAN.

I have not used it yet; this article here is only my bookmark for it.

PDFsam – PDF Split and Merge

PDF Split and Merge

Software implemented in Java; it comes as a GUI that calls misc. utilities. The author's latest goal is to implement a web-hosted version of this software.

I have been using this software for a couple of years.

Monday, October 11, 2010

MP3 tags and file names


  • IMHO file names should never ever contain anything else but only "printable ASCII" characters.
  • letting music file names reflect the contents is sometimes just too much work, so sometimes I abbreviate them to just the track number
I just came across some Glenn Gould albums, and now the file names all look like 99.mp3. I first tried something better than that, but that was just PITA and I stopped it. I did improve and clean the track names themselves though. Maybe they are too long for the iPod now, but they display well in iTunes.
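Both rules are easy to check or apply with a few lines of script. A minimal sketch in Python; the function names and the two-digit padding are my own assumptions for this example, not anything iTunes or the iPod requires.

```python
def is_printable_ascii(name: str) -> bool:
    """True if every character is printable ASCII (space through tilde)."""
    return all(32 <= ord(ch) < 127 for ch in name)


def track_file_name(track_number: int, extension: str = "mp3") -> str:
    """Abbreviate a file name to just the zero-padded track number."""
    return f"{track_number:02d}.{extension}"


print(is_printable_ascii("99.mp3"))
print(is_printable_ascii("Goldberg-Variationen_Ä.mp3"))
print(track_file_name(7))
```

The padding keeps the files in track order when a file manager sorts them by name.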

RCS vs. Apache Ant – **/*.java vs. RCS/*.java

Looks like the exclude tag also works well within the javac tag.

I like having a look at Roseanne Zhang's "questions and answers on Apache Ant" now and then.

Links:

Microsoft filed an action against Motorola - The H Open Source: News and Features

Is Microsoft running out of steam? - The H Open Source: News and Features

Saturday, October 9, 2010

The Social Network (2010) - IMDb

That's the movie about Mark Zuckerberg, the "inventor" and creator of Facebook.

From Trivia:
  • The opening breakup scene with Jesse Eisenberg and Rooney Mara ran eight script pages and took 99 takes. (link)
  • "Who was the movie star?" – "Does it matter?" – the movie star was, in fact, Natalie Portman (Born: Natalie Hershlag), who was enrolled at Harvard from 1999 to 2003 and helped screenwriter Aaron Sorkin by providing him insider information about goings-on at Harvard at the time Facebook first appeared there.
  • The Winklevoss twins were both played by actor Armie Hammer. However, Ralph Lauren model Josh Pence played one of them strictly from the neck down. His face was digitally replaced with Hammer's to make them appear identical, as the two men are unrelated and look nothing alike. The two spent 10 months in twin boot camp to match one another's subtle movements and rapport.
From Quotes:
  • "As if every thought that tumbles through your head was so clever it would be a crime for it not to be shared."
  • "You're not an asshole, Mark. You're just trying so hard to be one."
Update after watching the movie:

I think the Winklevoss twins should not have gotten any money. My impression is that they just discussed a business idea with Zuckerberg, and discussing such an idea without a non-disclosure agreement isn't really worth anything. They should not have gotten money or shares from Zuckerberg or Facebook; that was wrong.

I think, it's very sad, how the friendship between Mark Zuckerberg and Eduardo Saverin evolved, but then that's how it goes with "business partnerships". You have to make sure to stay in very, very close contact with your partners, otherwise you run the risk to get catapulted out of the game. Of course it's esp. very sad to see, that "Zuckerberg dropped Saverin's 30% ownership share of Facebook down to 0.03%" (from en.wikipedia.org/wiki/Eduardo_Saverin#Personal_life_and_Facebook).

Justin Timberlake … plays Napster creator Shawn Fanning as a slightly delusional, paranoid entrepreneur (from technofunkie's review (on IMDb) on this movie).

I am very, very grateful to my supporter friend, who allowed me to watch this movie.

TableSeer.SourceForge.net

TableSeer | Download TableSeer software for free at SourceForge.net

TableSeer is a tool that automatically identifies tables in digital documents and extracts the contents in the cells of the tables as well as table metadata.
That software seems to apply more heuristics than pdf2table.

my wifi access point TP-LINK TP-WA901ND

I am employing another wifi access point (that's the device whose name I am using here in the title) and a different WPA encrypted wifi network (SSID) for my neighbours, and today I thought I should have a look at it again.

I got intrigued to do a firmware upgrade on the device, not really the latest one, but one of this year, and after rebooting
  1. I noticed, the "System" LED kept blinking slowly,
  2. and I also couldn't access it any more.
I feared, these 2 symptoms would be related, and that made me a little nervous.

I enforced a reset by pressing the reset button for 5 seconds, and applied the settings again that I had applied several months ago. Everything is fine now.

The "System" LED is still blinking slowly, but the manual doesn't handle this case and I also can't notice any real problem, that this might indicate, so I simply ignore that.

Just to make sure that my lovely neighbours do not successfully invite anybody to my wifi network by themselves, today I enabled MAC address whitelisting on that wifi network, and I added the MAC addresses of all of their computers that I am aware of. MAC address whitelisting is not the last security feature I apply to protect myself and my own computers. I also route their IP packets through a different VLAN on a managed switch, a Netgear FS526T, but that actually belongs in a different blog article.

Actually this device is not only a wifi access point. You can also operate it in several other modes, but right now I am not in the mood to describe that.

Tuesday, October 5, 2010

on PDF

Nowadays on the Web or through e-mail you are getting more and more PDF files as electronic documents instead of documents on paper.

Roughly speaking, PDF documents are expected to display the same way on every computer platform (as opposed to documents created by usual word processing software). This is regarded as a major advantage of PDF.

PDF vs. fonts vs. platform (in)dependence vs. resizability/scalability

Whenever a PDF document makes use of outline fonts and stroke fonts as opposed to bitmap fonts (see the Wikipedia article on computer fonts!), you are able to resize or rescale your document to different sizes without any loss of quality in the fonts used. This is in general considered another major advantage.
But computer fonts are not in the public domain, so on every computer platform, different available fonts are used for PDF documents.

So what can we do against platform dependency stemming from fonts?
  • Include the fonts: that's the approach used by PDF/A.
    PDF/A is especially employed, where documents need to be available even after many years in the context of document archives.
    The major downside of this approach: PDF/A documents are much, much bigger than usual PDF documents, as storing the fonts within them takes a lot of space.
  • Another approach is to render text and fonts into ready-made bitmaps.
    Of course documents of this kind display best with a 1:1 relationship of the pixels in your documents to the pixels on your screen resp. on your printer output.
    Any resizing / rescaling results in poor quality.
    And I think you understand this very well: there is no text (as text) at all left in your PDF document, and you will not be able to extract any text from such a document.

Now you know: different kinds of PDF documents come with different advantages and also disadvantages.

I am interested here in PDF documents that are not rendered into "one bitmap per page", but which rather contain the source document's text. Extracting that text simply as text is more or less a piece of cake, and there already exists software for this purpose.

PDF basics

Before I dive with you into what information we want to extract from PDF files, I want to explain PDF a little.

I am honestly not too deep into PDF, but I understand it as an advanced and optimized version of PostScript. My little knowledge of PostScript is (please find a slightly lengthier version here in the Wikipedia article!):
  • It's a stack-based programming language like Forth using reverse Polish notation.
  • It has data structures like arrays and dictionaries, but nothing more abstract than that.
  • Subprograms are called resp. regarded as operators of the stack machine.
  • Some relevant information details may be coded into operator names.
  • Some other relevant information details (like page numbers) are coded into comment lines, see the article on PostScript Document Structuring Conventions. I have no clue, what corresponds to that in PDF. Maybe there are language elements for that.
Now you have an idea of what PDF looks like, and you may have a vague idea of what is possible with PDF and what isn't.
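To make the "stack machine with reverse Polish notation" point a little more concrete, here is a toy interpreter in Python. It is only a sketch of the idea, not PostScript or PDF: it borrows a few PostScript operator names (add, sub, mul, dup, exch), treats every other token as a number, and leaves out everything else the real language has.

```python
def run(program: str) -> list:
    """Evaluate a whitespace-separated RPN program and return the final stack."""
    stack = []

    def binary(fn):
        # Build an operator that pops two operands and pushes the result.
        def op():
            right = stack.pop()
            left = stack.pop()
            stack.append(fn(left, right))
        return op

    def dup():
        # Duplicate the top element.
        stack.append(stack[-1])

    def exch():
        # Exchange the two topmost elements.
        stack[-1], stack[-2] = stack[-2], stack[-1]

    operators = {
        "add": binary(lambda a, b: a + b),
        "sub": binary(lambda a, b: a - b),
        "mul": binary(lambda a, b: a * b),
        "dup": dup,
        "exch": exch,
    }

    for token in program.split():
        if token in operators:
            operators[token]()          # operators act on the stack
        else:
            stack.append(float(token))  # everything else is pushed as a number
    return stack


print(run("3 4 add 2 mul"))
```

Operands come first and the operator follows, so "3 4 add 2 mul" means (3 + 4) * 2 and no parentheses are needed.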

Sunday, October 3, 2010

Google URL shortener Goo.gl Goes Public

Goo.gl Goes Public

Borderline personality disorder (BPD) - Wikipedia, the free encyclopedia

Borderline personality disorder - Wikipedia, the free encyclopedia

Eat Pray Love (2010) - IMDb

Eat Pray Love (2010) - IMDb:

A married woman realizes how unhappy her marriage really is, and that her life needs to go in a different direction. After a painful divorce, she takes off on a round-the-world journey to "find herself".
The married woman is portrayed by Julia Roberts, so even if there was far too much of that self-finding-thing in that movie for me, I always enjoy looking at her - apart from when she looks sad, because I find her ugly then - but I really like her smile.

Javier Bardem played her Brazilian lover (although he actually is a Spaniard), he even spoke some Portuguese there, and he did a good and serious job.

The nicest music in the movie (IMHO) is actually also Brazilian, and I loved it (you can of course also find it on YouTube, but no nice one with Bebel Gilberto performing):


There were a few scenes, that really got me crying, e.g. the farewell scene between the Brazilian father and his son.

This was my Saturday night movie at the CineStar Original movie theatre at the Sony Center in Berlin. I really enjoyed it - but only for the pictures and the music.
The story and the main character are truly sick, and this is how one of the reviewers on IMDb ended his text:
Do not see this movie and encourage others to avoid it like the plague!
He titled it "American Films Continue to Glorify Female Borderline Personality Disorder", and I think I agree with him.

Them (2006) - IMDb

Horror | Mystery | Thriller

Watched this French-Rumanian scary movie on Saturday / Sunday night. It really took hold of me.

Aura Dione - I Will Love You Monday

Aura Dione - Song for Sophie [Official Video HD]

Saturday, October 2, 2010

e-mail addresses and "plussing"

e-mail messages addressing John.Doe+MailingListName@gmail.com are meant to actually go to johndoe@gmail.com, in other words:

  • “.” characters actually get removed for computing the real mail box
  • everything starting with the “+” character and going up to the “@” character (not including the latter) gets removed entirely
On the recipient side, software can check on plussing and may come to decisions based on the string between the “+” and the “@”.
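The two rules can be sketched in a few lines of Python. The function names are made up for this example, and removing the dots (and lowercasing) is the Gmail behaviour described above, which I would not assume for other mail providers.

```python
def split_plussed(address: str) -> tuple:
    """Split an address into (mailbox, tag, domain); tag is "" without a plus."""
    local, _, domain = address.rpartition("@")
    mailbox, _, tag = local.partition("+")
    return mailbox, tag, domain


def gmail_mailbox(address: str) -> tuple:
    """Return (real mailbox address, tag) following the Gmail rules above."""
    mailbox, tag, domain = split_plussed(address)
    # Gmail-specific assumption: dots in the local part are ignored.
    real = mailbox.replace(".", "").lower() + "@" + domain.lower()
    return real, tag


print(gmail_mailbox("John.Doe+MailingListName@gmail.com"))
```

The tag is the part that recipient-side software can base its decisions on, e.g. to sort mailing list traffic into its own folder.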

Yes, gmail does support plussing.

On my domains I have a catch-all rule for e-mail forwarding aliases, and on those computers where I receive e-mail using fetchmail, procmail rules help me with the checks.

When I get around to it, I will write here under "e-mail" how I make use of IMAP, procmail, and fetchmail.