

scrAPI – Microformat Parsing in Ruby

Published in Programming, Project, Ruby  |  1 Comment


I was looking for a good Ruby utility for parsing Microformats out of web pages. There are currently three projects on RubyForge:

  • uformatparser – looks very flexible and mature; developed by LabNotes
  • uformats – pretty good, but only supports a couple of formats
  • mfTools – nothing released yet; status unknown

While talking on #microformats, I was pointed to LabNotes' newer incarnation of a parser: scrAPI. It’s a much more generic HTML parser/scraper that can pull data out of HTML by structure, class, or id. Here is Assaf’s presentation at Mashup Camp II, where he gives some good tutorials and discussion about the API.

Down and dirty with the code

To illustrate scrAPI, I’ll show you the code needed to parse geo location data from a webpage.

First we bring in the necessary libraries and fetch an example HTML page:


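# Loads scrAPI from a local copy; adjust the path to match your setup
require 'scrapi/lib/scrapi'
require "net/http"

h = Net::HTTP.new("code.highearthorbit.com", 80)
# On current Ruby versions, get returns only the response object,
# so the page body is read from it explicitly
resp = h.get("/greaseroute/index.php")
data = resp.body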
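
Before handing the page to a scraper, it can be worth confirming that the request actually succeeded. Here is a minimal sketch using only Net::HTTP from the standard library. The helper names `ensure_success!` and `fetch_html` are my own inventions for illustration and are not part of scrAPI:

```ruby
require "net/http"

# Hypothetical helper (not part of scrAPI): return the response if it is
# a 2xx success, otherwise raise with the status code and message.
def ensure_success!(resp)
  return resp if resp.is_a?(Net::HTTPSuccess)
  raise "Request failed: HTTP #{resp.code} #{resp.message}"
end

# Hypothetical helper: fetch a page and return its body as a String.
def fetch_html(host, path, port = 80)
  resp = Net::HTTP.start(host, port) { |http| http.get(path) }
  ensure_success!(resp).body
end

# Usage (performs a real network request, so it is left commented out):
# data = fetch_html("code.highearthorbit.com", "/greaseroute/index.php")
```

Failing loudly here means a 404 or 500 page never reaches the scraper, where it would otherwise just produce an empty result.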

Then we define our scrapers. The geo microformat looks like:
<div class="geo">
<span class="latitude">35.126</span>,
<span class="longitude">-80.764</span>
</div>

The process method used in the Geo class takes a selector (an HTML structure path, a CSS class, or an id), followed by the attribute to store and how to extract it; here, :text stores the text of the matched element. Our general Location scraper looks for every element with the geo class in the HTML and fills the geos array using the Geo scraper.


class Geo < Scraper::Base
  process ".latitude", :latitude => :text
  process ".longitude", :longitude => :text
end

class Location < Scraper::Base
  array :geos
  process ".geo", :geos => Geo
  result :geos
end
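
To make the declarative style a little less magical, here is a toy sketch of how class-level `process` declarations could accumulate extraction rules. This is not scrAPI's implementation: `MiniScraper`, `MiniGeo`, the `rules` list, and the regex-based matching are all hypothetical, and the regex only copes with flat `<span class="...">` elements like the ones in the geo sample above.

```ruby
# Toy illustration only: NOT how scrAPI works internally.
class MiniScraper
  # Each subclass keeps its own list of [css_class, attribute] rules.
  def self.rules
    @rules ||= []
  end

  # Mimics the shape of `process ".latitude", :latitude => :text`.
  # The extractor value (:text) is ignored; this sketch always takes text.
  def self.process(selector, mapping)
    css_class = selector.delete_prefix(".")
    mapping.each_key do |attribute|
      attr_accessor attribute
      rules << [css_class, attribute]
    end
  end

  # Apply every rule to an HTML fragment and return a populated instance.
  def self.scrape(html)
    scraper = new
    rules.each do |css_class, attribute|
      pattern = /<span\s+class="#{Regexp.escape(css_class)}"\s*>(.*?)<\/span>/m
      match = html.match(pattern)
      scraper.public_send("#{attribute}=", match[1].strip) if match
    end
    scraper
  end
end

class MiniGeo < MiniScraper
  process ".latitude", :latitude => :text
  process ".longitude", :longitude => :text
end

fragment = '<div class="geo"><span class="latitude">35.126</span>, ' \
           '<span class="longitude">-80.764</span></div>'
geo = MiniGeo.scrape(fragment)
puts "[#{geo.latitude} x #{geo.longitude}]"
```

The point is only that each `process` line registers a rule on the class, and `scrape` replays those rules against the input. A real scraper would use a proper HTML parser and full selector support rather than a regex.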

Finally, now that we’ve built up our “tools”, we can scrape the data, and output all the found locations.


locations = Location.scrape(data)

locations.each {|loc| puts "[#{loc.latitude} x #{loc.longitude}]" }
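
The scraped values arrive as strings, so anything that wants to do math or mapping with them will need numbers. Here is a small sketch, assuming latitude and longitude text like the values printed above. The `Coordinate` struct and `to_coordinate` helper are illustrative names of my own, and the range check uses the usual bounds for decimal degrees:

```ruby
# Hypothetical value object for a parsed coordinate pair.
Coordinate = Struct.new(:latitude, :longitude)

# Convert scraped text to floats. Returns nil when either value is
# missing, not numeric, or outside the valid range for decimal degrees.
def to_coordinate(lat_text, lng_text)
  lat = Float(lat_text)
  lng = Float(lng_text)
  return nil unless lat.between?(-90.0, 90.0) && lng.between?(-180.0, 180.0)
  Coordinate.new(lat, lng)
rescue ArgumentError, TypeError
  nil
end

coord = to_coordinate("35.126", "-80.764")
puts "#{coord.latitude}, #{coord.longitude}" if coord
```

Using `Float()` rather than `to_f` is deliberate: `to_f` quietly turns junk into 0.0, which is a perfectly valid coordinate and an easy bug to miss.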

That was really easy, and effective. Additionally, because the data is marked up according to the Microformats standards, we can be fairly confident that changes to the rest of the original site’s markup won’t break our parsing.


Responses

  1. Chris Blow says:

    December 5th, 2008 at 7:46 pm (#)

    if you are still working on this, care to update it? I’m sure you are doing it completely differently now. Thanks c
