Wednesday, June 30, 2010

some * domains

All of a sudden today I thought I should register a couple of domains.
Here they are:
Quite a good start -- even e-mail addresses are set up!
If you want to have any such e-mail address -- talk to me!

working through

what do you think about working through as a "provider"?

Spidering Hacks - O'Reilly Media


Hack #2 – Best Practices for You and Your Spider    

Be Liberal in What You Accept
… This is an inexact science, to put it mildly. …

Monitor your spider’s output on a regular basis to make sure it’s working as expected [Hack #31], make the appropriate adjustments as soon as possible to avoid losing ground with your data gathering, and design your spider to be as adaptive to site redesigns [Hack #32] as possible.

Don’t Reinvent the Wheel
  • Best Practices for You
If you must scrape HTML, do so sparingly. If the information you want is available only embedded in an HTML page, try to find a “Text Only” or “Print this Page” variant; these usually have far less complicated HTML and a higher content-to-presentation markup quotient, and they don’t tend to change all that much (by comparison) during site redesigns.
Hack #4 – Registering Your Spider
By the way, you might think that your spider is minimal or low-key enough that nobody’s going to notice it. That’s probably not the case. In fact, sites like Webmaster World have entire forums devoted to identifying and discussing spiders. Don’t think that your spider is going to get ignored just because you’re not using a thousand online servers and spidering millions of pages a day.
Naming Your Spider
… There are web sites devoted to tracking IP addresses of legitimate spiders. …
Hack #5 – Preempting Discovery
No matter how gentle and polite your spider is, sooner or later you’re going to be noticed. Some webmaster’s going to see what your spider is up to, and they’re going to want some answers.

Hack #6 – Keeping Your Spider Out of Sticky Situations
Bad Spider, No Biscuit!
… There is nothing stopping a disgruntled site from revising its TOS to deny a spider’s access, and then sending you a “cease and desist” letter. … Spidering another site’s content and reappropriating it into your own framed pages is bad. Don’t do it. …
Competitive Intelligence
Some sites complain because their competitors access and spider their data -- data that’s publicly available to any browser -- and use it in their competitive activities. You might agree with them and you might not, but the fact is that such scraping has been the object of legal action in the past. Bidder’s Edge was sued by eBay for such a spider. …
Possible Consequences of Misbehaving Spiders
… But considering lawyer’s fees, the time it’ll take out of your life, and the monetary penalties that might be imposed on you, a lawsuit is bad enough, and it’s a good enough reason to make sure that your spiders are behaving and your intent is fair.
Assembling a Toolbox

Hacks #8–32

Chapter 4 Gleaning Data from Databases

Hack #69 – Aggregating RSS and Posting Changes
-> meta feeds, aggregating feeds, …

Tuesday, June 29, 2010

needs a couple of changes -- let's start brainstorming!

For me as a freelancer it's very clear:
  • There must be separate feeds for freelance and salaried staff.
  • There should be an opportunity of commenting on the job postings, e.g. if the original poster doesn't close the job, it makes sense to get that information from somebody else, maybe from somebody who was somehow involved. Yes, that cannot happen anonymously.
What else?

Yes, I tried to contact Ask Bjørn Hansen at ask(AT) before I started this here, but without success.

"Senior Software Engineer - Perl" / "Germany, Karlsruhe" / "Pay rate: 70,00 €/h" / CLOSED

"Closed", so the recruiter says.
What a pity, that comments on job postings on are not possible.

"Tour de Babel" by Steve Yegge

My whirlwind tour will cover C, C++, Lisp, Java, Perl, (all languages we use at Amazon), Ruby (which I just plain like), and Python, which is in there because — well, no sense getting ahead of ourselves, now.

"A Quick Tour of Ruby" by Steve Yegge

Very nice to read.

Ruby used to annoy me simply by existing. I first heard about Ruby years ago, in maybe 1997 or 1998, and folks said it was kind of like Perl, but "cleaner", whatever that meant. Ruby fans back then seemed like a tiny minority of rebels and fringe separatists.
Ruby irked me primarily because we already had Perl, which was working just fine thank you very much. And if for some strange reason you didn't like Perl, we had Python. If Perl fans were dog owners, and Python fans were cat owners, then Ruby fans seemed like ferret owners. They could go on and on about how much they adored their beady-eyed albino stretch-limo rats, and how cute they were, but we all knew they were just looking for attention. Nobody really wants a pet rat. (Ferret owners will correct me and say they're not rodents; they're more closely related to weasels and skunks. As if that helps.) Regardless, I didn't want to have anything to do with Ruby.
Last year, though, I was looking at a bunch of different languages in the hopes of finding one to replace Perl for small- to medium-sized tasks. One day my magic Perl dust had worn off rather suddenly, and I'd joined the growing ranks of people who were beginning to notice the emperor was a wee bit underdressed. But all the alternatives to Perl looked pretty bad themselves, and I started judging languages by how far I'd get into the reference manual before throwing it across the room.
I eventually picked up a Ruby book -- ...
Steve ...'s home page.

I personally keep loving both of them. I can afford that in the comp.lang.* area and in some others as well, but that doesn't concern my girl-friend, of course.

I actually came across Steve, when I searched for elisp.

Monday, June 28, 2010

iPhone apps, that I need sooner or later

how to avoid accidentally quitting Firefox?

Is there any config. variable?
There is a checkbox labeled "Warn me when closing multiple tabs". That does the job.

first steps in IRC with pidgin

  1. "Add Account" for each IRC server/user pair that you want to use, within pidgin, with IRC as protocol
  2. "Join a Chat" (below Buddies), select the right Account (i.e. one of your IRC protocol/server accounts), enter the Channel (including the '#'), leave the Password blank! Here we are!
Did I mention recently, how much I love my pidgin?
I did all this with a (fink) pidgin on my MacBook running Snow Leopard (OS X), but I don't doubt, it will also run on my openSUSE Samsung notebook.

networks and logos

Where do you get to the personalised logos resp. badges of misc. networks:
To be continued …

Sunday, June 27, 2010

how to structure my stuff on

I need help. Does anybody want to instruct me, how to set up my stuff resp. repositories on My profile there is this. I created a RELAX-NG schema earlier today, and that's the first to upload.

bulk upload of events in XML at XING

At this link XING tells us that they accept XML files in a certain format for uploading events to their site. They supply some nice but rather informal documentation in PDF, but there is no schema. That made me a little curious today, so I downloaded their template file, created a few variants of the sample entry, and ran trang over them to create a RELAX-NG schema file (anybody interested?). trang does not seem to recognise the date-time values of tags, so that always needs a little manual post-processing.
O, yes, I proudly uploaded one such event that way. (Actually on the first attempt I got the year wrong, that was quite embarrassing.)

Saturday, June 26, 2010

Erekat to Meridor: Without two-state solution by year's end, 'you will sweat'

my approach to HTTP scripting, web harvesting, page scraping, and all that

  • use LiveHTTPheaders in Firefox
  • resp. ieHTTPHeaders in IE for extracting the relevant HTTP traffic;
  • run my script on that and create a raw perl script (raw because e.g. it doesn't know, that it includes session IDs etc. as constant literals),
  • that makes use of my JHwis toolkit around libcurl, it also includes a truly working cookie machine.
  • on the Curl web-site you can also find, that helps a lot in finding all the "input" tags.
  • now you have a raw perl script, that you can enrich with all necessary condition handling – there is all kinds of stuff in that code (like in any code generated from visiting a web-site), that you need to replace by dedicated handling, like session IDs and similar fields and URL ingredients; associate coordinate names with the resp. image names.
  • on CPAN lives a module by the name of HTML::TableExtract for extracting HTML tables;
    I wrapped it up a little, so that I can make use of it on a command line.
  • that command line utility supplies me with all necessary options for related tasks regarding navigating through HTML and all its tables, just what HTML::TableExtract actually does.
  • I love writing "small" utilities in perl resp. ruby, that I wrap up in bash resp. zsh scripts.
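The "raw script" step above leaves session IDs behind as constant literals. Here is a minimal Ruby sketch of the dedicated handling that has to replace them. It is not taken from JHwis; the `sid` parameter name, the URLs and the page content are all invented for illustration.

```ruby
# Hypothetical sketch: pull a fresh session ID out of a fetched page and
# substitute it for the literal that was recorded during the browser session.

# Returns the value of the (assumed) "sid" URL parameter found in the HTML,
# or nil if the page does not contain one.
def extract_session_id(html)
  match = html.match(/[?&;]sid=([0-9A-Za-z]+)/)
  match && match[1]
end

# Replaces the recorded session ID in a URL with the fresh one.
def with_session_id(url, recorded_sid, fresh_sid)
  url.sub("sid=#{recorded_sid}", "sid=#{fresh_sid}")
end

login_page   = '<a href="/account/statements?sid=abc123XYZ&page=1">Statements</a>'
recorded_url = 'https://bank.example/account/statements?sid=OLD0000&page=1'

fresh = extract_session_id(login_page)
puts with_session_id(recorded_url, 'OLD0000', fresh)
```

The same pattern applies to any other field that the recording froze into the script: fetch, extract, substitute, and only then issue the next request.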
Years ago (I guess the situation is not so much different nowadays) libcurl was just so much more powerful than LWP, that I simply had to go for libcurl. Read "Using cURL to automate HTTP jobs" aka "The Art Of Scripting HTTP Requests Using Curl", that's IMHO the major evergreen in that area. libcurl does a lot more than just simple PUT, GET and that sort of thing.
All that makes my swiss army knife of web harvesting and page scraping.
Reality is seriously more challenging than text book examples, trust me!
Right, I could make all of that open source. I just recently started my open sourcerer career, after SzabGab had stayed in my place for a couple of days around at Berlin.

I should also mention Daniel Stenberg, the father of curl. IMO without his great work the art of HTTP scripting would not stand, where it stands today.

Right: last but not least: no, I am not into dealing with AJAX and all that. For the last couple of years my approach has been: with the toolset I described above I can still manage "all" tasks without caring for AJAX. It does not matter enough.

Wednesday, June 23, 2010

shared synchronised editing with GNU Emacs

I continuously edit my ~/diary on my openSUSE Linux notebook and on my MacBook.
Emacs helps me there quite a lot. In case I changed something on that file on some other computer than the one I am currently sitting in front of, it warns me and also offers me to revert to the last version saved to disk.
I really love that.

Timer Utility - i use this on mac os x

Alarm clock, countdown timer, and stopwatch.
great stuff!!!
almost like the iPhone utility

MacMetronome - i use this on mac os x

Facebook does (not) import my Google Buzz public feed

First it didn't, what surprise ;-)

Update / 2010-06-23 17:00:
Works now.

I actually first tried that through "Import an External Blog", and you can only name there a single one. But they seem to reject that because of some details in the XML, that they don't like ;-)
But you can always use NetworkedBlogs for letting other streams like blogs and whatever RSS feeds flow into "your" Facebook (wall).

Tuesday, June 22, 2010

Photo Booth on OS X

It was time to change the pics on Facebook, Skype etc. again.
Wondered how to take a few pics using the MacBook.
Finally found Photo Booth.
Move the utility's window to that corner of the screen, where you want your face to look at, otherwise your eyes look into the "wrong" direction. Sounds obvious, I know.

we remember the end of WW I and its armistice in Compiègne Forest

  • article on
  • the armistice of 22 June 1940 between Germany and France, early in WW II, was also signed in Compiègne Forest
  • today is still a bank holiday in Belgium

Monday, June 21, 2010

GoogleCL: Command-line tool for Google services

CGI::IDS - PerlIDS - Perl Website Intrusion Detection System (XSS, CSRF, SQLI, LFI etc.)

some old funny Python anti-Perl and Ruby anti-Python propaganda

Have you ever come across this piece of Python anti-Perl propaganda? The article that referred to it is actually Ruby propaganda and dated 2004-07, so that anti-Perl propaganda is even older than that. The Ruby one has now found a follow-up. Look and enjoy!

By any chance: isn't there any gifted artist around, that the Perl community can make use of in order to create some Perl-minded stuff?

Update / 2010-08-05:
In response to my re-share here, more sweet perl vs. ruby vs. python propaganda got created by Mark Keating:

debugging my .procmailrc

It's usually not really big fun, if you find a message like this in your procmail log file: "procmail: Missing action".

Well, I don't yet know of a procmail syntax checker, which is rather a pity, so the debugging is of a dynamic kind.

Just add "VERBOSE=on" at the beginning of your .procmailrc, enjoy the flood of "procmail: No match on ...,", and sooner or later find your "procmail: Missing action" in the middle of it. You found the corrupted rule then. The rule in question might actually be a very rarely used one resp. one pretty much at the end of your .procmailrc, so it may take a while resp. a couple of incoming mails, until it gets used and shows up.
Once you found and corrected the offender, I am quite sure, you will remove "VERBOSE=on" very quickly.
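Finding the one "Missing action" in that flood by eye is tedious, so here is a small Ruby sketch that prints each hit together with the log lines just before it. The sample log is invented; only the two message texts are the ones quoted above, and a real log will contain more kinds of lines.

```ruby
# Sketch: show the log lines leading up to each "procmail: Missing action".

# Returns one array of lines per hit: up to lines_before lines of context,
# followed by the "Missing action" line itself.
def missing_action_context(log_text, lines_before = 3)
  lines = log_text.lines.map(&:chomp)
  hits = []
  lines.each_with_index do |line, i|
    next unless line.include?('procmail: Missing action')
    first = [i - lines_before, 0].max
    hits << lines[first..i]
  end
  hits
end

# Invented sample; the regular expressions in it are placeholders.
sample_log = <<~LOG
  procmail: No match on "^List-Id:.*perl"
  procmail: No match on "^From:.*newsletter"
  procmail: Missing action
  procmail: No match on "^Subject:.*offer"
LOG

missing_action_context(sample_log, 2).each do |context|
  puts context
  puts '--'
end
```

For real use, replace the sample with `File.read` on whatever file your LOGFILE setting in .procmailrc points to. The last condition printed before the hit tells you which rule to look at.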

"Mock the Web Service" - O'Reilly Broadcast

spoilt files from bittorrent

I recently saw "a friend" ;-)  who complained, that he had downloaded huge movie files from a bittorrent, and when he tried to watch them, it turned out, that a special viewer is needed, and that that special viewer even makes a monthly subscription necessary. Well, under these circumstances downloading movies from bittorrents doesn't make sense.

FIFA worldcup on my Google Calendar

I just wanted to add the games to my calendar, now there are so many game events in my calendar, that I don't really see anything any more. And as there is one calendar per team (and I only selected a couple of teams), it's not even easy to see the calendar w/o FIFA worldcup. What a mess! ;-)

correcting red eyes on pictures

Picasa 3 for Windows seems to be the appropriate tool for doing that. Free and easy to use -- these reasons are good enough for me.
Update / 2010-06-22:
Alright, alright, I can simply use iPhoto on OS X for that. Silly me!
Did quite a few snapshots, uploaded them to the relevant places …
Life seems to be far easier with the right tools, i.e. on a Mac, esp. if it's running a Unix derivative …

2010 LinuxTag and Perl::Staff on Picasa

Sunday, June 20, 2010

to proselytize

For those, that don't trust me (again), that there is no such word: . For some reason I prefer this word over the similar . Right, they don't actually show the same progress of the process (ROTFL!!!), but they go in the same direction.

@szabgab: I had to think of you and p-e-r-l, when I wrote this. (Using the dashes, so this article does not get picked up by the "respective" blog article grabber, as szabgab occasionally is soooo embarrassed over what I say.) In German we also have the phrase "Proselyten machen", which in this context sounds really, really funny – therefore ROTFL.

a new wonderful book on DocBook by Norman Walsh: "DocBook 5: The Definitive Guide"

Norman is an excellent writer, and it's good fun reading his books. If you are interested in DocBook, then get this book: = DocBook 5: The Definitive Guide - O'Reilly Media

I purchased the PDF recently from O'Reilly, and I just printed a single page for a friend, who looks like being my newest DocBook proselyte.

Saturday, June 19, 2010

duplicating a tab in Firefox

Came across this recently whilst browsing one of those magazines.
What would you need this for?
Well, if I have my Google Mail Contacts open and I want to write a message to one of my contacts, but I also still want to keep … Contacts open – you never know, how long it takes to complete a message, but you still want to be able to look up contact details – then at least I need it.
On OS X you drag the tab to another place with the mouse or whatever and the Alt key pressed. On Linux and Windows it's the Control key instead.

Friday, June 18, 2010

keyboard shortcuts on all the major operating systems

For the last couple of months I have been struggling with the keyboard of my beloved MacBook Pro. How often did I click on "Show Keyboard Viewer"? And it still didn't really help in the long run. Today I did some respective "research", and here are the links:
Some of those shortcuts, that I really love now on my Snow Leopard MacBook:
  • "cycle through open … windows of the current desktop" resp. "switch focus to the next/previous window (without dialog)": Ctrl+F4 or Cmd+` (that's the Grave accent key).
  •  "show / hide desktop": F11
Today I also screen-dumped the Keyboard Viewer window, printed it a couple of times, and scribbled all the other uses of a key (together with fn, Control, Alt, Command) on it. This will seriously help me learn far sooner how to find brackets, curly braces (that I need for perl and ruby), and the tilde (that I need for Portuguese and in a shell command line).

Update / 2010-06-22:
Yes, on OS X at System Preferences / Keyboard / Keyboard Shortcuts, this is where a lot of nice shortcuts get listed.

Wednesday, June 16, 2010

using curl for "streaming" Flash movies into a file on your hard disk

Flash movies are usually thought of as being safe against being stored by the viewer. Well, that changed when rtmpdump and other tools appeared on the scene. Actually the rtmpdump development resp. repository got hidden in a safe harbour. I wonder how long curl's feature will survive, as the movie industry will not like that feature to be easily available. With one of the world's best downloaders (I actually think the best), the only bits you need are the values in your HTML that correspond to the host, "the app", and the playpath, and all three go into a single command line argument, which is a little special. I assume the assignment to the shell variable playpath flows into a second line here. It surely shouldn't do that within your shell script.

You need a version of curl  ≥ 7.21.0 for having that feature on board.

These are sample command lines for this job:

$ playpath='mp4:videoportal/mediathek/Polizeiruf+110/c_120000/126955/format130594.f4v'
$ curl "rtmp:// app=ardfs/ playpath=${playpath}" \
    -o "Polizeiruf_110--Aquarius.mp4.flv"

Update / 2010-07-10:
My commentator apparently experienced problems with that. I have to assume, he didn't download 7.21.0 from its usual place.

activating the Meta key for the Terminal app under Mac OS X Snow Leopard

I just came across a description on how to do this, which seems far outdated and just not working. So I thought, I would let you know, how it really works with Snow Leopard.
Here you see the window, that pops up for Terminal's Preferences menu entry:
You can see the Use option as meta key switch – use it, if you want!

"rake", the ruby DSL, improved

Well, I started using rake instead of make in 2007, when I got ruby infected. I noticed then, that rake's output is a little "dis-arranged" (the entire "command" came printed in  one single line). That changed in the meantime, at least now with 0.8.7 it is just the way I like it. Thank you to the developers!

Google Mail Contacts is my personal killer app

Once again: it syncs with the iPhone address book.
I use it for reverse lookup (phone number to address book entry) together with my telephone system in my home office, and that's built around an AVM "FRITZ!Box". The glue software got implemented by myself in ruby (w/o Rails).
Why ruby? I thought, I could make it run with Cocoa Ruby on the iPhone. But you know yourself: Apple fights those kinds of things. And just for running it with a GUI on a usual Mac OS X? No, that's not worth my effort.
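For the curious, the core of such a reverse lookup can be sketched in a few lines of Ruby. This is not my actual glue code: the contacts are invented, the German country code is an assumed default, and how the number arrives from the telephone system is left out entirely.

```ruby
# Sketch of a reverse lookup: phone number -> address book entry.

# Reduces a number to digits only and maps the national format (leading 0)
# and the "00" international prefix to one common "49..." form.
def normalize(number, country_code = '49')
  digits = number.gsub(/\D/, '')
  if number.strip.start_with?('+')
    digits
  elsif digits.start_with?('00')
    digits[2..-1]
  elsif digits.start_with?('0')
    country_code + digits[1..-1]
  else
    digits
  end
end

# Builds a hash from normalized number to contact name.
def build_index(contacts)
  index = {}
  contacts.each do |name, numbers|
    numbers.each { |n| index[normalize(n)] = name }
  end
  index
end

# Invented address book entries, written in three different notations.
CONTACTS = {
  'Alice Example' => ['+49 30 1234567'],
  'Bob Example'   => ['0170 / 555 01 23', '0049 89 7654321']
}.freeze

INDEX = build_index(CONTACTS)

def who_is_calling(number, index = INDEX)
  index.fetch(normalize(number), 'unknown caller')
end

puts who_is_calling('030 1234567')
```

The normalisation step is the whole trick: the address book and the caller ID rarely agree on how a number is written, so both sides get reduced to the same form before the hash lookup.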

Update / 2010-06-16 11:45:
I forgot mentioning here. Downloading vCards from xing and adding them here – that's just awesome!

my own "tiny" Facebook security leak

I started using a Facebook "Profile Badge" here on this blog quite a while ago. Today I noticed, that all my Facebook status updates got shown here on that Profile Badge in the right column of this blog. Believe me: I seriously hurried removing that field "Status updates" from my Facebook Profile Badge.

To be very honest with you: no, this security leak was a giant one, not a tiny one.

I really love using vCards and profile pictures from Xing

I "always" also "copy" them for my Google Mail address book, which gets synced with my iPhone's address book. This really makes my life so much easier  – recognising resp. remembering people by their faces is so much easier.
Update / 2010-06-16:
I forgot to mention, that I always use Xing's vCards for new contacts to create resp. complete their entries in my Google Mail resp. iPhone address book.

perl events on a public and shared calendar

Gábor Szabó is my hero!

I came across the TPF's Perl 5 Wiki entry on events, suggested to him and Renée Bäcker to also maintain that calendar on a public and shared calendar, and Gábor just pointed me to that Google calendar that had already gotten set up. Hey guys, you are doing a really great job.

Of course I added that calendar immediately to my list.

option and configuration processing

Once again I came across this very nice and certainly very helpful article on O'Reilly's

I personally really have been loving the Art of Command Line Processing for a very, very long time. During one of my last projects (it was actually mainly using p*th*n as programming language because of some rather weird and esoteric guy, who made it his personal mission to reinvent all the software spread over the bank from perl to p*th*n, especially the one he had written himself in quite unreadable perl a couple of years ago) I faced the task of prepping up a Java program from, and the major value I added to that utility was to add command line processing using libraries from Apache. I actually offered Teodor Danciu of JasperSoft, JasperReports' inventor and main developer, to contribute my version of that utility to JasperForge, but he wasn't interested. I was rather sad. Maybe he will rethink that decision.

Update / 2010-09-19:

I think, I sort of "lied" :-) here. I didn't "offer" Teodor that extension of his own OSS software, that he had developed himself and then published under LGPL, I rather asked him this:
Is there any interest in a "'-D' command line / getProperty" version of TextApp?
I would really love to even write and contribute a new and better version of my old approach, which was 100% based on OSS.

Can you imagine, that I get legally threatened for the above question, because an organisation thinks to possess rights on these minimal modifications I applied? We are talking literally about a couple of lines.

comparison of web application frameworks on

I was cleaning up my Google bookmarks, came across Maypole, tried to look it up, and came across a comparison of web application frameworks on, that includes the perl approaches.

My humble suggestion:
members of the respective perl communities add resp. maintain their entries to / within that table. mojolicious came to my mind at first. I personally wouldn't be too shy changing anything on wikipedias, but I can't promise to keep the entries updated – but I might be of help setting up the row in that table, if that's of any use.

Tuesday, June 15, 2010

me at

This is me and Ignacio Correas Usón (the "men in red"), "Mr. ebox", at the Perl Booth. He is seriously seeking business partners in Germany for the ebox.

my first scrubyt extractor

Followed this wiki article.
Had to install XPather and DOM Inspector in Firefox (quite common plugins), and also FireWatir, as described on this blog.
Back-quotes and forward-quotes within that article are just simple-quotes instead.
To be continued …

installing FireWatir for Firefox

Followed this wiki article.
The list of "files attached" is truly broken, but the official installation guide helps.

Monday, June 14, 2010

installing scrubyt from Github on openSUSE-11.2

 Followed this blog article.
Had to install rpm libxml2-devel, not just libxml2.
Had to install rpm libxslt-devel, not just libxslt.

Sunday, June 13, 2010

set an ACL on /var/run and locked myself out a little

I wanted to run a script of mine with output in /var/log and its PID captured in a file in /var/run. I must have gotten something wrong with the ACL I set on them, as GDM did not succeed starting my next session afterwards. Took me a while to find out how that came about, but .xsession-errors was my friend there.

prepped up my 1st contribution to CPAN

Now I am waiting for my PAUSE ID on CPAN. My first script to contribute is named, a utility, that tries to handle XLS files a little like TAR treats its archives.

Update / 2010-06-13 14:00:
Got my account on PAUSE resp. CPAN acknowledged ("johayek"), so my home directory there is
Uploaded xls-tar.tar, now waiting for it to show up.
Do scripts (as opposed to modules) get listed within a user's directory?

Update / 2010-06-13 14:25:
xls-tar_1_32 first showed up on (actually as $CPAN/authors/id/J/JO/JOHAYEK/). Sooooo … – now I am quite proud.
More scripts to appear occasionally …
Yes, now I expect getting tarred and feathered  – pls have mercy!!!

contributing on CPAN

This week with the perl community around made me rethink my status of "involvedness" with the perl community a little. Yes, I am going to really share things from now on, e.g. on CPAN. So this morning after Gábor Szabó's departure, I not only applied for getting this blog merged into, but I also applied for an account on, and right now I virtually stand in the middle of that article "The [Perl programming] Authors Upload Server", where I got to from "How do I contribute modules to CPAN?" in the CPAN FAQ. Currently I rather have a few scripts in mind, that I want to share, so I am also in the middle of "How to submit a script to CPAN", another entry in the CPAN FAQ. I am really excited about this right now. More soon here!

applied for "Planet Perl Iron Man"

Gábor Szabó encouraged me to a couple of things during this week with him around. One of them was to "join the program" at, which seems to be a Grand Unified Perl Blog. I am rather excited about this right now.

Wednesday, June 9, 2010

Parrot, the virtual machine

I am rather astonished:  Parrot has a rather lengthy list of existing client languages in its Wikipedia article, and there is an even longer one on I wonder, how lively they really are.

Ragel State Machine Compiler

Ragel compiles executable finite state machines from regular languages. Ragel targets C, C++, Objective-C, D, Java and Ruby. […]

Screen Capture in Snow Leopard

Helpful article!!
To take a screen capture press Cmd+Shift+4 on your Snow Leopard. […]

ELPA = Emacs Lisp Package Archive

This article describes how to install the installer.
[...] Once you have installed the package manager, type M-x package-list-packages. Type r in the package menu buffer to update the list of packages available from the server. If you want a particular package, type i next to its name to mark it for installation, and then x to download and install it.[...]

"A weblog client for emacs"

weblogger.el implements Blogger, MetaWeblog and (hopefully soon) the Atom API weblogging APIs.

TextMate style "snippets" for Emacs

An article talking about YASnippet. So maybe after having learnt, how powerful TextMate is, I will reconvert back to Emacs, using YASnippet "of course". A few words on snippets ...

Monday, June 7, 2010

copying from OS X to a Samba server fails, using Cyberduck instead

The operation can't be completed, because you don't have permission to access some of the items.
Did you come across that error message before? I did, more than once.
Today I started using Cyberduck to copy over to the other side, but now as SFTP server. That's almost as nice, at least it looks like it's not going to fail that silly.

Saturday, June 5, 2010

more on web harvesting

Update 2010-06-05/06:
One night later I am still very impressed by scrubyt, and I rather want to try it on a real life example quite soon.
Actually in a way scrubyt does, what I also do with my JHwis toolkit, but of course, it looks, as if it goes far (?!?) beyond that. JHwis navigates in a programmed way through web-sites, and it downloads certain HTML files to the disk for further processing. Those HTML files contain HTML tables, and there is already a nice Perl library, that I wrap into a command line utility, that extracts HTML tables into CSV files. These CSV files are actually not really of a kind, that you can directly load into a spreadsheet GUI utility like OpenOffice Calc or whatever. They need further mechanical processing and refinement, before they can get loaded into database tables.
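To give an idea of that table-to-CSV step, here is a deliberately naive Ruby sketch. It is not my Perl utility and it uses no HTML parser at all: a couple of regular expressions handle one simple, flat table, and the sample data is invented. Nested tables, which are the real-life challenge, are exactly what it cannot do.

```ruby
# Naive sketch: turn one simple, flat HTML table into CSV text.

# Removes markup inside a cell and trims the result.
def strip_tags(fragment)
  fragment.gsub(/<[^>]+>/, '').gsub('&amp;', '&').strip
end

# Quotes a field only if it contains a comma, a quote or a line break.
def csv_field(value)
  return value unless value.match?(/[",\n]/)
  '"' + value.gsub('"', '""') + '"'
end

# Returns one CSV line per <tr>, with <th> and <td> cells treated alike.
def table_to_csv(html)
  rows = html.scan(%r{<tr\b[^>]*>(.*?)</tr>}mi).map(&:first)
  rows.map do |row|
    cells = row.scan(%r{<t[dh]\b[^>]*>(.*?)</t[dh]>}mi).map(&:first)
    cells.map { |cell| csv_field(strip_tags(cell)) }.join(',')
  end.join("\n")
end

# Invented sample table.
html = <<~HTML
  <table>
    <tr><th>Date</th><th>Text</th><th>Amount</th></tr>
    <tr><td>2010-06-01</td><td>Rent, June</td><td><b>-500.00</b></td></tr>
  </table>
HTML

puts table_to_csv(html)
```

Anything beyond such a flat table calls for a real parser, which is what the Perl library mentioned above brings along.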
With scrubyt's help (apparently) you extract an XML file from the quite nested HTML table structures of a web page.
Years ago, when I started my project I created CSV files. A couple of years later, I also created XML files. But I never adapted the entire tool chain to make use of these XML files.
My XML files only reflect exactly the data, that I want to make use of.
scrubyt's XML files reflect (I think) the entire table structure.
Nowadays with XSLT processors you "easily" develop an XSL script (aka "stylesheet"), that extracts the portion, that you are really interested in.
To be continued ...

Friday, June 4, 2010

jsvi = VI in Javascript

I will have to look how to make use of it.
Here is the link.

web harvesting and my toolkit JHwis

I implemented a toolkit years ago, that I call JHwis. Now and then I think, I should do more advertising for it. I have been using software created by that toolkit for downloading bank account statements and other stuff for years now. I would like to prove to you, it's also very well suited for web harvesting, according to the English wikipedia "a focused form of a web crawler search". Who is going to give me the opportunity of proving that?

phonebook on Facebook

Did you know, you have a phonebook on Facebook? Now you do. But nowadays more and more Facebook friends remove their phone numbers from their Facebook profile.

a rather impressive "sand drawing movie" showing Germany's invasion and occupation of Ukraine during WWII

Thursday, June 3, 2010

fictitious world cup letter to the wife

Dear Sweetheart,

  1. Between 11 June and 11 July 2010, you should read the sports section of the newspaper so that you are aware of what is going on regarding the South African World Cup, and that way you will be able to join in the conversations. If you fail to do this, then you will be looked at in a bad way, or you will be totally ignored. DO NOT complain about not receiving any attention.
  2. During the World Cup, the television is mine, at all times,  without any exceptions. If you  even take a glimpse of the remote control, you will lose it (your eye).
  3. If you have to pass by in front of the TV during a game, I don't mind, as long as you do it crawling on the floor and without distracting me.
  4. During the games I will be blind, deaf and mute, unless I require a refill of my drink or something to eat. You are out of your mind if you expect me to listen to you, open the door, answer the telephone, or pick up the baby that just fell on the floor. It won't happen.
  5. It would be a good idea for you to keep at least 2 six packs in the fridge at all times, as well as plenty of things to nibble on (excluding your body parts), and please do not make any funny faces to my friends when they come over to watch the games. In return, you will be allowed to use the TV between 12am and 6am, unless they replay a good game that I missed during the day.
  6. Please, please, please!! If you see me upset because one of my teams is losing, DO NOT say "get over it, it's only a game", or "don't worry, they'll win next time ". If you say these things, you will only make me angrier and I will love you less. Remember, you will never ever know more about football than me and your so called "words of encouragement" will only lead to a break up or divorce.
  7. You are welcome to sit with me to watch one game and you can talk to me during halftime but only when the commercials are on, and only if the half time score is pleasing me. In addition, please note I am saying "one" game; hence do not use the World Cup as a nice cheesy excuse to "spend time together".
  8. The replays of the goals are very important. I don't care if I have seen them or I haven't seen them, I want to see them again. Many times.
  9. Tell your friends NOT to have any babies, or any other child related parties or gatherings that require my attendance because:
    a ) I will not go,
    b ) I will not go, and
    c) I will not go.
  10. But, if a friend of mine invites us to his house on a Sunday to watch a game, we will be there in a flash.
  11. The daily World Cup highlights show on TV every night is just as important as the games themselves. Do not even think about saying "but you have already seen this, why don't you change the channel to something we can all watch?" because, the reply will be, "Refer to Rule #2 of this list".
  12. And finally, please save your expressions such as "Thank God the World Cup is only every 4 years". I am immune to these words, because before and after this comes the Champions League, Premier League, Italian League, Spanish League, KPL, FA Cup, Euro Cup, etc.
By the way if you get stuck on the road call the Police or AA.
Thank you for your co-operation.

Tuesday, June 1, 2010

ruby and curly braces around statements

"In memory" of perl, ruby allows curly braces in "almost every place", where perl allows it. But it does not allow it for function resp. method definitions, and not for if/elsif/else. There it enforces the keyword "end". Correct me, if I'm wrong.
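To make that concrete, here is a small Ruby sketch. The method names are made up for the example, and the `parses?` helper is just my way of asking the interpreter whether a snippet is valid syntax without running it.

```ruby
# Braces are fine for blocks ...
def squares(list)
  list.map { |n| n * n }
end

# ... and do/end is the keyword spelling of the very same block.
def doubled(list)
  list.map do |n|
    n * 2
  end
end

# Method definitions and if/elsif/else want keywords and "end".
def sign(n)
  if n > 0
    'positive'
  elsif n < 0
    'negative'
  else
    'zero'
  end
end

# Wraps the source in a proc so that it only gets compiled, never executed.
def parses?(source)
  eval("proc {\n#{source}\n}")
  true
rescue SyntaxError
  false
end

puts squares([1, 2, 3]).inspect
puts sign(-5)
puts parses?('def f(n); n; end')   # keyword style
puts parses?('def f(n) { n }')     # perl style braces around the body
```

The last call is the perl habit that ruby refuses: braces around a method body do not parse.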