Using Hpricot

Hpricot is an HTML parser for the Ruby programming language. With Hpricot you can scan and scrape an HTML document. To illustrate how to use Hpricot, I'll list the code of a short script I recently wrote. The script grabs all the links posted in the past week on A Rubyist Railstastic Adventure, a tumblelog.

The general structure of the HTML used by the web page that I will be scraping is something like the following.

[source:html]
<div class="post">
  <div class="date">
    Sun
    <em>
      May
      <big>18</big>
    </em>
  </div>

  <div class="link">
    <a href="http://www.juixe.com" class="link">Juixe TechKnow</a>
  </div>
</div>

[/source]

One thing to note about the HTML produced by the site we will scrape is that the date is optional in a post. The date is displayed only once per day, so a post without a date belongs to the most recent date shown above it. There are also several other types of posts, such as quotes and images, but we are only interested in posts with links. The Ruby/Hpricot script gathers only the links from the past week.
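
The date handling described above can be sketched without any HTML parsing. The example below is only a rough illustration: it assumes each post has already been reduced to a hash with the made-up keys `:date` (a `Time`, or `nil` when the page omits the date), `:type`, and `:url`, and the helper name `links_since` is hypothetical rather than anything Hpricot provides.

```ruby
# Hypothetical helper: collect link URLs from posts newer than `cutoff`.
# Posts are assumed to be ordered newest first, as on the tumblelog page.
def links_since(posts, cutoff, now = Time.now)
  current_date = now  # posts before the first date marker count as "now"
  links = []
  posts.each do |post|
    # A post without a date belongs to the most recent date seen so far
    current_date = post[:date] if post[:date]
    # Everything from here on is older than the cutoff
    break if current_date < cutoff
    # Skip quotes, images, and other non-link posts
    links << post[:url] if post[:type] == :link
  end
  links
end

# Made-up sample data for illustration
now    = Time.local(2008, 5, 19, 13, 25)
cutoff = now - 7 * 24 * 60 * 60
posts  = [
  { :date => Time.local(2008, 5, 18), :type => :link,  :url => "http://www.juixe.com" },
  { :date => nil,                     :type => :quote, :url => nil },
  { :date => nil,                     :type => :link,  :url => "http://example.com/a" },
  { :date => Time.local(2008, 5, 10), :type => :link,  :url => "http://example.com/old" }
]
p links_since(posts, cutoff, now)
```

The second link has no date of its own, so it inherits May 18 and is kept. The May 10 post falls before the cutoff and ends the loop.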

[source:ruby]
require 'rubygems'
require 'hpricot'
require 'open-uri'
require 'parsedate'

# Convert a number of days to seconds
def days_to_sec(days)
  days.to_i * 24 * 60 * 60
end

# Pretty print the link as a list item
def print_link(link)
  print "  <li>",
        "<a href='#{link.attributes['href']}'>",
        "#{link.inner_html.strip}",
        "</a>",
        "</li>\n"
end

# Walk the posts on a page, newest first, printing every link until
# a post older than one week is reached
def get_links(doc, page = 1, curr_date = Time.new)
  posts = doc/"div.post"
  # Stop when a page has no posts, so we never page forever
  return if posts.empty?

  posts.each do |post|
    post_date_elem = (post/"div.date/em")
    date = post_date_elem.inner_html.strip

    # Parse the date of the post, when it has one
    if date != ""
      date_day = (post_date_elem/"big").text
      # The month name is the first line inside the <em> element
      date_mon = date.split("\n").first.strip
      date_str = "#{date_mon} #{date_day}, #{$week_ago.year}"
      data = ParseDate.parsedate date_str
      curr_date = Time.local data[0], data[1], data[2]
    end

    # Stop if already looking past one week
    break if curr_date < $week_ago

    # Handle all links in post
    (post/"a.link").each do |link|
      print_link link
    end
  end

  # Still within the week: continue on the next page
  if curr_date > $week_ago
    next_page = Hpricot(open("#{$rubyist_home}page/#{page + 1}"))
    get_links next_page, page + 1, curr_date
  end
end

# The Rubyist home page to be scraped; older posts are assumed
# to follow at page/2, page/3, and so on
$rubyist_home = "http://rubyist.tumblr.com/"
# Scrape one week's worth of links
$week_ago = Time.new - days_to_sec(7)

# Run the script
doc = Hpricot(open($rubyist_home))
get_links doc
[/source]
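
One caveat about the date handling: the script pairs each post's month and day with the year of `$week_ago`. When the past week spans New Year, a post dated "Jan 2" gets the previous year, looks almost a year old, and stops the script early. The `parsedate` library is also no longer part of the standard library as of Ruby 1.9. Below is a minimal sketch of one way around both, using `Date.strptime` from the standard `date` library. The helper name `post_date` and its arguments are hypothetical, not part of the original script, and it assumes abbreviated month names such as "May" or "Dec".

```ruby
require 'date'

# Hypothetical helper: turn a month name and day number into a Date,
# trying the current year first and falling back to the previous year.
def post_date(month_name, day, today = Date.today)
  date = Date.strptime("#{month_name} #{day} #{today.year}", "%b %d %Y")
  # A date after today must belong to the previous year
  date > today ? date.prev_year : date
end

puts post_date("May", "18", Date.new(2008, 5, 19))
puts post_date("Dec", "30", Date.new(2009, 1, 3))
```

The result is a `Date`. Calling `to_time` on it gives a `Time` that can be compared with `$week_ago`.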

This entry was posted on Monday, May 19th, 2008 at 1:25 pm
