Saturday, October 30, 2010

web scraping afternoon

This wasn't meant to be yet another web scraping afternoon.

This afternoon started with me trying to recover a little from a hard time.
I had two probationary days at a web-site testing job using Selenium, I am in the middle of a couple of recruitment processes, and I don't want to tell you about the real trouble.


  • I got curious enough to search oreilly.com for literature on Selenium and found a "Short Cut" document.
  • I found something.
  • I skimmed the chapter on "twill".
  • Before I really dove into the chapter on Selenium, I summed up what I really liked and disliked about Selenium.
  • Of course, being able to use XPath is great.
  • With Selenium you hardly need to be aware that a web-site makes use of JavaScript at all; you just leave this to the browser engine, initially to Firefox and the Selenium IDE.
  • What I actually hate is HTTP scripting that depends on desktop computers running a browser, plus some remote control software to connect the server, where your "HTTP scripts" actually run, with the web browser(s) that you make use of.
  • I did a little superficial research on: Perl/Ruby + Mechanize + XPath.
  • Yes, scrubyt is still around, but hasn't that become vaporware itself by now?
  • Found Perl's WWW::Scraper::TidyXML - "TidyXML and XPath support for Scraper". Not bad. But then it's from around 2003, and it seems to be vaporware. My e-mail to the author could not be delivered ("over quota"), so I guess it's seriously no longer maintained.
  • WWW::Mechanize::Firefox seems to be nice; have a look at WWW::Mechanize::Firefox::Cookbook!
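
To make the "XPath without a remote-controlled browser" idea from the list a bit more concrete, here is a minimal sketch in Python rather than Perl or Ruby. It assumes the page has already been fetched and tidied into well-formed XML; the sample markup, the class names, and the helper extract_links are all made up for illustration and come from none of the tools mentioned above. It also uses only the limited XPath subset that Python's standard library understands, so treat it as a rough idea, not a recipe.

```python
import xml.etree.ElementTree as ET

# Hypothetical, already-tidied page (assumption: well-formed XML).
# A real page would first have to be fetched and cleaned up.
SAMPLE_PAGE = """\
<html>
  <body>
    <div class="book">
      <a href="/example-a">Example A</a>
    </div>
    <div class="book">
      <a href="/example-b">Example B</a>
    </div>
    <div class="ad">
      <a href="/ads/1">Advertisement</a>
    </div>
  </body>
</html>
"""


def extract_links(xml_text, css_class):
    """Return (title, href) pairs for links directly inside matching <div>s.

    Illustrative helper only; the name and behaviour are assumptions.
    """
    root = ET.fromstring(xml_text)
    # ElementTree supports only a subset of XPath, but attribute
    # predicates such as [@class='book'] are part of that subset.
    path = f".//div[@class='{css_class}']/a"
    return [(a.text, a.get("href")) for a in root.findall(path)]


if __name__ == "__main__":
    for title, href in extract_links(SAMPLE_PAGE, "book"):
        print(title, href)
```

No browser and no remote control software is involved here, which is the part I like; the price is that any JavaScript on the page is simply never executed.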
