Web Crawling, Caching, and APIs

Several of the projects I'm currently working on involve harvesting (mashing, grazing, retrieving, etc.) data from other websites. The web has recently (in the last 2-3 years) made vast amounts of data readily available via APIs or simple spiders/harvesters. This has more recently been made apparent by the huge number of mashups that have shown up (including one of my own).

With so many mashups appearing, some even seeking venture capital funding, one has to wonder about the real value involved. As a ZDNet article, Mashups: who's really in control?, points out, the developers of these mashups haven't necessarily added unique value; rather, they have provided a path for the original data providers or other larger entities to move in and provide a better solution.

Of course, then you see the purchase of del.icio.us, the hiring of GeoBloggers developer Daniel Catt, Platial, and others, and you realize that mashups are actually viable. But, as pointed out on a recent Wired podcast as well as in a Joel Spolsky talk: "Do What You Love".

ProgrammableWeb is an excellent resource listing 192 available APIs (and reports that there are approximately 2.78 new mashups per day). Some of these services, such as Vast, are created purely as data providers (via TechCrunch). They provide great REST interfaces, and many have wrappers written in your favorite language (and if yours doesn't have one, perhaps it's time you got a new favorite language).
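To give a feel for how little code consuming one of these REST interfaces takes, here's a minimal Ruby sketch. The endpoint, query parameters, and XML fields below are entirely hypothetical (they aren't any particular service's real API); a real client would fetch the URL with Net::HTTP and parse whatever schema the provider documents.

```ruby
require 'rexml/document'
require 'cgi'

# Hypothetical REST endpoint -- a stand-in, not a real service's API.
BASE_URL = 'http://api.example.com/listings'

# Build the query URL a REST client would fetch (e.g. with Net::HTTP).
def search_url(query, page = 1)
  "#{BASE_URL}?q=#{CGI.escape(query)}&page=#{page}"
end

# Parse the kind of simple XML payload these services typically return
# into an array of hashes.
def parse_listings(xml)
  doc = REXML::Document.new(xml)
  doc.elements.to_a('//listing').map do |el|
    { :title => el.elements['title'].text,
      :price => el.elements['price'].text.to_f }
  end
end
```

The nice part is that the whole "wrapper" is two small functions: one to build the request, one to turn the response into native data structures.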

While many services provide excellent APIs, sometimes you just need to go old-fashioned and farm the data yourself; in that case, find yourself something like a Ruby Spider.
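If you'd rather see the idea than reach for a library, a toy spider is only a screenful of Ruby. This is a sketch, not production code: the regex link extraction is fragile compared to a real HTML parser, and a polite crawler would also honor robots.txt and rate-limit itself.

```ruby
require 'net/http'
require 'uri'

# Pull the href targets out of a page of HTML and resolve them
# against the page's URL. A regex is fragile compared to a real
# parser, but it's enough for a quick harvest.
def extract_links(html, base_url)
  html.scan(/href\s*=\s*["']([^"']+)["']/i).flatten.map do |href|
    URI.join(base_url, href).to_s rescue nil
  end.compact
end

# A minimal breadth-first spider: fetch each page, queue its links,
# and stop after max_pages so we stay polite.
def crawl(start_url, max_pages = 10)
  seen  = {}
  queue = [start_url]
  until queue.empty? || seen.size >= max_pages
    url = queue.shift
    next if seen[url]
    seen[url] = true
    begin
      html = Net::HTTP.get(URI.parse(url))
      queue.concat(extract_links(html, url))
    rescue StandardError
      next # skip pages that fail to fetch or parse
    end
  end
  seen.keys
end
```

From there it's just a matter of swapping `extract_links` for whatever scraping logic actually farms the data you're after.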

About this article

posted in Technology, Programming, Web

About the Author

Andrew Turner is an advocate of open standards and open data. He is actively involved in many organizations developing and supporting open standards, including OpenStreetMap, Open Geospatial Consortium, Open Web Foundation, OSGeo, and the World Wide Web Consortium. He co-founded CrisisCommons, a community of volunteers that, in coordination with government agencies and disaster response groups, builds technology tools to help people in need during and after a crisis such as an earthquake, tsunami, tornado, hurricane, flood, or wildfire.