I was looking for a good Ruby utility to help parse Microformats out of webpages. There are currently three projects on RubyForge:
- uformatparser - looks flexible and mature; developed by LabNotes
- uformats - pretty good, only supports a couple of formats
- mfTools - nothing released yet, unknown status
While talking on #microformats, I was pointed to LabNotes' newer incarnation of a parser: scrAPI. It's a much more generic HTML parser/scraper that can pull data out of HTML by structure, CSS class, or id. Here is Assaf's presentation from Mashup Camp II, where he gives some good tutorials and discussion of the API.
Down and dirty with the code
To illustrate scrAPI, I'll show you the code needed to parse geo location data from a webpage.
First we bring in the necessary libraries and fetch an example HTML page:
require 'rubygems'
require 'scrapi'
require 'net/http'
# Net::HTTP.get returns the response body as a String
data = Net::HTTP.get('code.highearthorbit.com', '/greaseroute/index.php')
Then we define our scrapers. The geo microformat marks up a coordinate pair using latitude and longitude classes inside a geo container, along the lines of:
<div class="geo">
  <span class="latitude">...</span>
  <span class="longitude">-80.764</span>
</div>
The process method of the Geo class takes an HTML structure path, CSS class, or id, followed by the attribute in which to store the result. Our general Location scraper then looks for all geo-class tags in the HTML and fills the geos array using the Geo scraper class.
class Geo < Scraper::Base
process ".latitude", :latitude => :text
process ".longitude", :longitude => :text
end
class Location < Scraper::Base
array :geos
process ".geo", :geos => Geo
result :geos
end
Finally, now that we've built up our "tools", we can scrape the data, and output all the found locations.
locations = Location.scrape(data)
locations.each {|loc| puts "[#{loc.latitude} x #{loc.longitude}]" }
That was really easy, and effective. Additionally, because Microformats are standardized, we can be fairly confident that changes to the original site's markup won't break our parsing.
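As a plain-stdlib illustration of why that holds (this sketch uses Ruby's bundled REXML rather than scrAPI, and the coordinate values are made up), selecting by the standardized class names means the extraction keeps working even when the surrounding elements change:

```ruby
require 'rexml/document'

# Pull [latitude, longitude] pairs out of every element with class "geo",
# matching by class name rather than by position in the document.
def geo_pairs(html)
  doc = REXML::Document.new(html)
  doc.get_elements("//*[@class='geo']").map do |geo|
    lat = REXML::XPath.first(geo, ".//*[@class='latitude']").text
    lon = REXML::XPath.first(geo, ".//*[@class='longitude']").text
    [lat, lon]
  end
end

# Two structurally different markups carrying the same Microformat classes.
original   = '<div class="geo"><span class="latitude">35.2</span><span class="longitude">-80.764</span></div>'
redesigned = '<p class="geo">Here: <b class="latitude">35.2</b>, <b class="longitude">-80.764</b></p>'

puts(geo_pairs(original) == geo_pairs(redesigned))  # class-based matching survives the redesign
```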