Saturday, June 5, 2010

more on web harvesting


Update 2010-06-05/06:
One night later I am still very impressed by scrubyt, and I would like to try it on a real-life example quite soon.
Actually, in a way scrubyt does what I also do with my JHwis toolkit, although it looks as if it goes far beyond that. JHwis navigates through web sites in a programmed way and downloads certain HTML files to disk for further processing. Those HTML files contain HTML tables, and there is already a nice Perl library, which I wrap into a command line utility, for extracting those tables into CSV files. The CSV files are not of a kind that you can load directly into a spreadsheet application like OpenOffice Calc; they need further mechanical processing and refinement before they can be loaded into database tables.
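Just to illustrate that extraction step: the Perl module in question is something like HTML::TableExtract, and a minimal command line version of the table-to-CSV conversion could look like the sketch below. The file name is made up, and this is not the actual JHwis code, only the general idea.

#!/usr/bin/perl
use strict;
use warnings;
use HTML::TableExtract;
use Text::CSV;

# Hypothetical input file; in the real toolkit the HTML comes from the
# programmed navigation step described above.
my $html_file = 'downloaded-page.html';

my $te = HTML::TableExtract->new();   # no header constraints: grab every table
$te->parse_file($html_file);

my $csv = Text::CSV->new({ binary => 1, eol => "\n" });

# Write one CSV line per table row; undefined cells become empty strings.
foreach my $table ($te->tables) {
    foreach my $row ($table->rows) {
        my @cells = map { defined $_ ? $_ : '' } @$row;
        $csv->print(\*STDOUT, \@cells);
    }
}

Even with proper CSV quoting like this, the individual cells still need the mechanical clean-up mentioned above before the rows are fit for a database load.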
With scrubyt's help you (apparently) extract an XML file from the deeply nested HTML table structures of a web page.
Years ago, when I started my project, I created CSV files. A couple of years later I also created XML files, but I never adapted the entire tool chain to make use of them.
My XML files reflect only the data that I actually want to use.
scrubyt's XML files reflect (I think) the entire table structure.
Nowadays, with XSLT processors, you "easily" develop an XSL script (aka "stylesheet") that extracts the portion you are really interested in.
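To make that last point a bit more concrete, here is a rough sketch of such an extraction using XML::LibXSLT from Perl. The file name and the element names in the little stylesheet (row, cell) are pure assumptions about what the scraped XML might look like, not scrubyt's actual output format.

#!/usr/bin/perl
use strict;
use warnings;
use XML::LibXML;
use XML::LibXSLT;

# Hypothetical name of the XML file produced by the scraping step.
my $xml_file = 'scraped-table.xml';

# A minimal stylesheet that keeps only the <row> elements whose first
# <cell> is non-empty; adjust the XPath to the portion you care about.
my $xsl_source = <<'XSL';
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent="yes"/>
  <xsl:template match="/">
    <rows>
      <xsl:copy-of select="//row[cell[1] != '']"/>
    </rows>
  </xsl:template>
</xsl:stylesheet>
XSL

my $parser     = XML::LibXML->new();
my $xslt       = XML::LibXSLT->new();
my $source     = $parser->parse_file($xml_file);
my $style_doc  = $parser->parse_string($xsl_source);
my $stylesheet = $xslt->parse_stylesheet($style_doc);
my $result     = $stylesheet->transform($source);
print $stylesheet->output_string($result);

The appealing part of this approach is that the stylesheet, not the surrounding program, decides which portion of the table survives, so adapting the tool chain to a new page layout would mostly mean editing XSL.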
To be continued ...
