Saturday, June 26, 2010

my approach to HTTP scripting, web harvesting, page scraping, and all that

  • use LiveHTTPHeaders in Firefox (or ieHTTPHeaders in IE) to extract the relevant HTTP traffic;
  • run my script scan_HTTP_log_file.pl on that log to create a raw Perl script (raw because, e.g., it doesn't know that it contains session IDs and the like as constant literals);
  • the generated script makes use of my JHwis toolkit around libcurl, which also includes a truly working cookie machine;
  • on the curl website you can also find formfind.pl, which helps a lot in finding all the "input" tags;
  • now you have a raw Perl script that you can enrich with all the necessary condition handling. There are all kinds of things in that code (as in any code generated from visiting a website) that you need to replace with dedicated handling: session IDs and similar fields, URL ingredients, and associating coordinate names with the respective image names;
  • on CPAN there is a module named HTML::TableExtract for extracting HTML tables;
    I wrapped it up a little so that I can use it from the command line;
  • that command-line utility gives me all the options I need for navigating through HTML and all its tables, which is just what HTML::TableExtract actually does;
  • I love writing "small" utilities in Perl or Ruby, which I then wrap in bash or zsh scripts.
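The formfind.pl idea above can be sketched in a few lines of shell. To be clear: this is not Curl's formfind.pl itself, just a minimal stand-in, and the sample HTML and field names are made up for illustration.

```shell
#!/bin/sh
# Minimal formfind-style sketch: list the names of all <input> tags in a
# saved page. The HTML below is a made-up sample standing in for a page
# saved from the browser.

cat > /tmp/sample_form.html <<'EOF'
<form action="/login.cgi" method="post">
  <input type="text" name="user">
  <input type="password" name="pass">
  <input type="hidden" name="session_id" value="abc123">
  <input type="submit" value="Log in">
</form>
EOF

# One <input> tag per line, then pull out the name="..." attribute.
grep -o '<input[^>]*>' /tmp/sample_form.html |
  sed -n 's/.*name="\([^"]*\)".*/\1/p'
# prints: user, pass, session_id (the submit button has no name)
```

The hidden session_id field is exactly the kind of constant literal a raw generated script would contain and that you need to replace with dedicated handling.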
Years ago (I guess the situation is not much different nowadays) libcurl was just so much more powerful than LWP that I simply had to go for libcurl. Read "Using cURL to automate HTTP jobs", aka "The Art Of Scripting HTTP Requests Using Curl"; IMHO that's the major evergreen in this area. libcurl does a lot more than simple PUT, GET and that sort of thing.
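As a minimal sketch of the style of scripting that tutorial teaches, here is the kind of wrapper a raw generated script tends to grow into: one fetch function carrying the shared session state (cookie jar, user agent, redirect handling) for every step of a replayed session. The URL and field names in the usage comments are placeholders, not from the original post.

```shell
#!/bin/sh
# One wrapper function so every request in the session shares its state.
JAR=${JAR:-/tmp/session_cookies.txt}

fetch() {
  # -b/-c: read and update the cookie jar on every request
  # -L: follow redirects, as a browser would
  # -s: no progress meter; -A: a fixed user-agent string
  curl -s -L -b "$JAR" -c "$JAR" \
       -A "Mozilla/5.0 (scripted session)" \
       "$@"
}

# Usage (placeholder URL and fields):
#   fetch "https://example.com/login.cgi" -d "user=me" -d "pass=secret"
#   fetch "https://example.com/members/report.html"
```

The point of the jar is that the login's session cookie is written on the first call and sent back automatically on every later one, which is what the "truly working cookie machine" has to get right.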
All of that makes up my Swiss army knife of web harvesting and page scraping.
Reality is seriously more challenging than text book examples, trust me!
Right, I could make all of that open source. I just recently started my open sourcerer career, after SzabGab had stayed at my place for a couple of days around LinuxTag.org/2010 in Berlin.

I should also mention Daniel Stenberg, the father of curl. IMO, without his great work the art of HTTP scripting would not stand where it stands today.

Right, last but not least: no, I am not into dealing with AJAX and all that. For the last couple of years my approach has been that with the toolset described above I can still manage "all" tasks without caring about AJAX. It just does not matter enough.
