Using Hpricot

Hpricot is an HTML parser for the Ruby programming language. With Hpricot you can scan and scrape an HTML document. To illustrate how to use Hpricot, I’ll list the code of a short script I recently wrote. The script grabs all the links for the past week from A Rubyist Railstastic Adventure, a tumblelog.

The general structure of the HTML used by the web page that I will be scraping is something like the following.

One thing to note about the HTML produced by the site we will scrape is that the date is optional in a post. The date is displayed only once per day, so some posts carry no date of their own. Also, there are several other types of posts, such as quotes and images; we are only interested in posts with links. Again, the Ruby/Hpricot script will only gather the links for the past week.
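The optional date is the fiddly part, so here is a rough sketch of one way to handle it once each post has been reduced to a date (possibly missing) and a URL (possibly missing). It is plain Ruby and independent of Hpricot. The record layout, the ISO date strings, the `links_from_past_week` name, and the reading of "past week" as the last seven days inclusive are all assumptions made for illustration, not details taken from the site or from the original script.

```ruby
require 'date'

# posts: array of hashes in page order, e.g.
#   { :date => "2008-05-22", :url => "http://example.com/a" }
# :date is nil when the page shows no date for that post, meaning it
# belongs to the same day as the nearest dated post before it.
# :url is nil for posts without a link (quotes, images, and so on).
def links_from_past_week(posts, today = Date.today)
  cutoff = today - 7
  current_date = nil

  posts.each_with_object([]) do |post, urls|
    # Carry the most recent date forward to undated posts.
    current_date = Date.parse(post[:date]) if post[:date]

    next if post[:url].nil?      # not a link post
    next if current_date.nil?    # no date seen yet; skip rather than guess

    urls << post[:url] if current_date >= cutoff && current_date <= today
  end
end

posts = [
  { :date => "2008-05-22", :url => "http://example.com/a" },
  { :date => nil,          :url => "http://example.com/b" },   # same day as above
  { :date => nil,          :url => nil },                      # e.g. a quote post
  { :date => "2008-05-10", :url => "http://example.com/old" }  # older than a week
]

puts links_from_past_week(posts, Date.new(2008, 5, 23))
# prints:
# http://example.com/a
# http://example.com/b
```

Passing `today` as a parameter keeps the function easy to check against a fixed date instead of whatever day the script happens to run.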


2 Comments so far

  1. David Madden on May 23rd, 2008

    I really love Hpricot. It’s a really useful library.

    Good post; a working example can only help in getting more people using it.

  2. Hakeem on May 26th, 2008

    How do I submit button or click links with hpricot if I was setting a pluggable parser for mechanize?
