Using Hpricot

19 May

Hpricot is an HTML parser for the Ruby programming language. With Hpricot you can scan and scrape an HTML document. To illustrate how to use Hpricot, I’ll list the code of a short script I recently wrote. The script grabs all the links for the past week from A Rubyist Railstastic Adventure, a tumblelog.

The general structure of the HTML used by the web page that I will be scraping is something like the following.

One thing to note about the HTML produced by the site we will scrape is that the date is optional in a post. The date is displayed only once per day, so some posts don’t carry a date of their own. There are also several other types of posts, such as quotes and images, but we are only interested in posts with links. Again, the Ruby/Hpricot script will only gather the links for the past week.
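The date handling described above can be sketched independently of Hpricot. The following is a minimal sketch, not the original script: it assumes the posts have already been pulled out of the page, in page order, into simple records. The `Post` struct, its field names, and the helper names `fill_dates` and `links_for_past_week` are all hypothetical and chosen only for illustration.

```ruby
require 'date'

# Hypothetical record for one scraped post. The field names are assumptions
# made for this sketch; they do not come from the original script.
Post = Struct.new(:kind, :date, :url)

# Give every post a date: a post without its own date inherits the date of
# the closest preceding post that had one (the page shows a date only once
# per day).
def fill_dates(posts)
  current = nil
  posts.map do |post|
    current = post.date || current
    Post.new(post.kind, current, post.url)
  end
end

# Keep only link posts dated within the seven days up to and including today,
# and return their URLs.
def links_for_past_week(posts, today = Date.today)
  cutoff = today - 7
  fill_dates(posts)
    .select { |p| p.kind == :link && p.date && p.date > cutoff && p.date <= today }
    .map(&:url)
end

# Example with made-up data: the third post has no date of its own and
# inherits the date of the first one.
sample = [
  Post.new(:link,  Date.new(2008, 5, 19), 'http://example.com/a'),
  Post.new(:quote, nil,                   nil),
  Post.new(:link,  nil,                   'http://example.com/b'),
  Post.new(:link,  Date.new(2008, 5, 10), 'http://example.com/c')
]
puts links_for_past_week(sample, Date.new(2008, 5, 19))
```

Passing `today` as a parameter keeps the filter easy to check against fixed dates; in a real run the default of `Date.today` would be used.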

Filed under: HTML/XML, Ruby, TechKnow
3 Comments


  1. David Madden
    3:44 am on May 23rd, 2008

    I really love Hpricot. It’s a really useful library.

    Good post; a working example can only help get more people using it.

  2. Hakeem
    6:04 am on May 26th, 2008

    How do I press a submit button or click links with Hpricot if I set it up as a pluggable parser for Mechanize?

  3. cw
    8:49 pm on October 19th, 2008

    The line:

    date_day = (post_date_elem/"big").text

    fails saying undefined ‘text’

    Any suggestions?

    Thanks
