scrAPI - Microformat Parsing in Ruby

I was looking for a nice Ruby utility to help parse Microformats out of webpages. There are 3 projects currently on RubyForge.

Talking on #microformats, I was pointed to LabNotes' newer incarnation of a parser: scrAPI. It's a much more generic HTML parser/scraper that can pull data out of HTML by structure, class, or id. Here is Assaf's presentation at Mashup Camp II, where he gives some good tutorials and discussion about the API.

Down and dirty with the code


To illustrate scrAPI, I'll show you the code needed to parse geo location data from a webpage.

First we just bring in the necessary libraries and fetch an example HTML page:


require 'scrapi/lib/scrapi'
require "net/http"

# Fetch the example page; the body string is what we hand to the scraper.
h = Net::HTTP.new("code.highearthorbit.com", 80)
resp = h.get("/greaseroute/index.php")
data = resp.body

Then we define our scrapers. The geo microformat looks like:

<div class="geo">
  <span class="latitude">35.126</span>,
  <span class="longitude">-80.764</span>
</div>

The process method takes a selector (an HTML structure path, CSS class, or id) and then the attribute to store; here, :text stores the text content of the matched element. Our general Location scraper looks for every element with the geo class in the HTML and fills out the geos array by running the Geo scraper on each one.


class Geo < Scraper::Base
  process ".latitude", :latitude => :text
  process ".longitude", :longitude => :text
end

class Location < Scraper::Base
  array :geos
  process ".geo", :geos => Geo
  result :geos
end
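
To make it concrete what these two scrapers pull out of a page, here is a rough equivalent that uses only REXML from Ruby's standard distribution. This is not scrAPI and not how scrAPI works internally; it is a minimal sketch that assumes well-formed markup, and GeoPoint, elements_with_class, text_of_class, extract_geos, and SAMPLE are all names made up for illustration.

```ruby
require 'rexml/document'

# Hypothetical stand-in for the objects the Geo scraper returns.
GeoPoint = Struct.new(:latitude, :longitude)

# Every descendant of +node+ whose class attribute lists +name+ as a whole token.
def elements_with_class(node, name)
  REXML::XPath.match(node, './/*[@class]').select do |el|
    el.attributes['class'].split.include?(name)
  end
end

# Text of the first descendant carrying class +name+, or nil if there is none.
def text_of_class(node, name)
  el = elements_with_class(node, name).first
  el && el.text.to_s.strip
end

# Rough analogue of Location.scrape: one GeoPoint per element with class "geo".
def extract_geos(markup)
  doc = REXML::Document.new(markup)
  elements_with_class(doc, 'geo').map do |geo|
    GeoPoint.new(text_of_class(geo, 'latitude'), text_of_class(geo, 'longitude'))
  end
end

# Assumed sample markup; the geo element is nested inside unrelated structure.
SAMPLE = <<~HTML
  <div>
    <p>Home: <span class="geo"><span class="latitude">35.126</span>,
    <span class="longitude">-80.764</span></span></p>
  </div>
HTML

extract_geos(SAMPLE).each { |g| puts "[#{g.latitude} x #{g.longitude}]" }
```

The selection depends only on the class names, not on which tags carry them or how deeply they are nested, which is the same property the ".geo", ".latitude", and ".longitude" selectors rely on.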

Finally, now that we've built up our "tools", we can scrape the data, and output all the found locations.


locations = Location.scrape(data)

locations.each {|loc| puts "[#{loc.latitude} x #{loc.longitude}]" }
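
Because the scrapers store :text, the latitude and longitude come back as strings. If you want numbers, a conversion step along these lines may help. It is only a sketch: ScrapedGeo is a hypothetical stand-in for the scraped objects, to_coordinates is a made-up helper, and the range check assumes ordinary decimal degrees.

```ruby
# Hypothetical stand-in for a scraped Geo result; both fields hold strings.
ScrapedGeo = Struct.new(:latitude, :longitude)

# Convert a scraped pair to [lat, lon] Floats.
# Returns nil when a value is missing, not numeric, or out of range.
def to_coordinates(geo)
  lat = Float(geo.latitude.to_s.strip)
  lon = Float(geo.longitude.to_s.strip)
  return nil unless lat.between?(-90.0, 90.0) && lon.between?(-180.0, 180.0)
  [lat, lon]
rescue ArgumentError
  nil
end

scraped = [
  ScrapedGeo.new("35.126", "-80.764"),  # valid
  ScrapedGeo.new("north", "-80.764"),   # not a number
  ScrapedGeo.new("95.0", "10.0")        # latitude out of range
]

# Keep only the pairs that converted cleanly.
coords = scraped.map { |g| to_coordinates(g) }.compact
coords.each { |lat, lon| puts format("%.3f, %.3f", lat, lon) }
```

Dropping bad pairs silently is a design choice made here for brevity; logging or raising may suit a real scraper better.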

That was really easy, and effective. Additionally, because the data is marked up according to the Microformats standards, we can be fairly confident that changes to the rest of the original site's markup won't break our parsing.


About the Author

Andrew Turner is an advocate of open standards and open data. He is actively involved in many organizations developing and supporting open standards, including OpenStreetMap, Open Geospatial Consortium, Open Web Foundation, OSGeo, and the World Wide Web Consortium. He co-founded CrisisCommons, a community of volunteers that, in coordination with government agencies and disaster response groups, build technology tools to help people in need during and after a crisis such as an earthquake, tsunami, tornado, hurricane, flood, or wildfire.