Part 3 in looking at the current state of GeoWeb Standards. See the introduction here.
It's time to take a hard look across the board at where we're coming up short and issues that need to be addressed. One way to summarize:
GeoRSS, KML, and GeoJSON are the itching powder, squirting ink pen, and dribble cup of geodata formats.
- Sean Gillies
Sean is known for his candor, and his viewpoint has merit. Overall, the various formats and standards each fulfill particular needs, but they still don't cover all use cases, align well with best practices, or make sense to users and developers.
The simplest overall problem with many of these formats, and how they fit into the Web, is that they lack proper web-type descriptions. One primary mechanism by which Web clients know how to present data is the MIME type. MIME types provide a way for the server to notify clients that the data is in a format such as XML, text, a PNG image, and so on. Types are meant to be formally registered, but ad-hoc or vendor-specific types are also common.
In addition, MIME types allow crawlers and registries to easily record the type of the file in the metadata.
Looking over our list of various GeoWeb standards, it's very easy to identify which formats abide by this and which don't.
Atom, JSON, HTML, and SQLite all provide format-specific MIME types, allowing clients to easily employ the proper applications. However, none provides a special mechanism for signalling that the data includes geospatial markup. That is not necessarily a problem: geo shouldn't be that special.
KML is perhaps the only format that has a geospatial-specific MIME type. However, despite it now being an OGC standard, the MIME type is still the vendor-specific one:
application/vnd.google-earth.kml+xml. However, KML was particularly ingenious in also providing for the compressed, or zipped, format as a unique MIME type:
GML is just XML, so its MIME type is of no use in telling a client that it should try to pass the data on to a geo-enabled application. And Shapefiles are an agglomeration of multiple files, and even zipped up are only marked as compressed files.
More broadly in services, the OGC has a MIME type for service descriptions and responses:
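To make the difference concrete, here is a minimal sketch of how a client might choose what to do with a response based only on its Content-Type header. The handler names and the dispatch table are hypothetical, and the generic types used for GML and a zipped Shapefile are assumptions that follow the description above; only the KML type is quoted from the text.

```python
# KML's vendor-specific MIME type, as quoted in the text above.
KML_TYPE = "application/vnd.google-earth.kml+xml"

# Hypothetical dispatch table: the handler names are made up for illustration.
HANDLERS = {
    KML_TYPE: "geo viewer",
    "application/atom+xml": "feed reader",
}


def pick_handler(content_type: str) -> str:
    """Choose a handler from a Content-Type header value."""
    # Drop parameters such as "; charset=utf-8" before the lookup.
    media_type = content_type.split(";", 1)[0].strip().lower()
    return HANDLERS.get(media_type, "generic handler")


# KML announces itself, so it can be routed to a geo-enabled application.
print(pick_handler("application/vnd.google-earth.kml+xml; charset=utf-8"))
# Assumed here: GML served as generic XML and a zipped Shapefile as a
# generic archive, so nothing in the type hints at geodata.
print(pick_handler("application/xml"))
print(pick_handler("application/zip"))
```

The client never has to open the file to route KML correctly, while the other two fall through to whatever the default is.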
application/vnd.ogc.wms_xml, though errors have their own MIME-types:
OpenSearch has a special MIME Type, and obviously Tiles and Image files have MIME-types.
Doesn't matter if you can't download it
Another major issue facing many of the GeoWeb formats is their file size. Generally, the web bounces back and forth between disregarding sizes, on the assumption of ubiquitous, reliable high-speed connectivity, and trying to speed up pages. But even more important is the fact that many potential users don't have access to high-speed internet, so there is a huge difference between 10k and 100k or 1MB of data.
To compare the sizes, I took a relatively large dataset from GeoCommons (Statistics Canada, Land and freshwater area, Canada, 2005) and exported it in a variety of formats, both uncompressed and compressed via standard zip algorithms.
|Format||Uncompressed||Zipped|
|Shapefile||5.4 MB||3.6 MB|
|GeoRSS||3.3 MB||1.1 MB|
|KML||7.3 MB||2.4 MB|
|Spatialite||5.4 MB||3.6 MB|
|JSON||7.9 MB||2.3 MB|
CSV just includes latitude and longitude columns of the centroid - so obviously not fully representative. An option would be to include the EWKB in a column for the full geometry - but that is far from any kind of 'standard' that other tools would know how to interpret.
Perhaps most surprising in these results is that JSON is so large. Unfortunately, representing complex geometries requires a lot of structural syntax, and that adds up quickly for polygonal data.
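A rough way to see where the bytes go is to serialize a polygon as GeoJSON and separate the characters spent on the numbers themselves from everything else. The sketch below uses a made-up circular polygon, not the Statistics Canada dataset, and zlib's DEFLATE as a stand-in for zip, so the figures it prints are only illustrative.

```python
import json
import math
import zlib


def circle_feature(n_points, radius=1.0):
    """Build a GeoJSON Feature whose polygon approximates a circle (made-up data)."""
    ring = [
        [round(radius * math.cos(2 * math.pi * i / n_points), 6),
         round(radius * math.sin(2 * math.pi * i / n_points), 6)]
        for i in range(n_points)
    ]
    ring.append(ring[0])  # GeoJSON linear rings repeat the first position at the end
    return {
        "type": "Feature",
        "properties": {},
        "geometry": {"type": "Polygon", "coordinates": [ring]},
    }


def size_report(feature):
    """Split the serialized size into coordinate digits and everything else."""
    raw = json.dumps(feature).encode("utf-8")
    ring = feature["geometry"]["coordinates"][0]
    # Characters used by the coordinate values as json.dumps writes them.
    numbers = sum(len(json.dumps(value)) for position in ring for value in position)
    return {
        "raw": len(raw),
        "numbers": numbers,
        "syntax": len(raw) - numbers,  # brackets, commas, spaces, keys
        "zipped": len(zlib.compress(raw)),
    }


report = size_report(circle_feature(1000))
print(report)
```

Every position costs brackets, a comma, and separators on top of its digits, which is the overhead that adds up across thousands of vertices. Because that syntax is so repetitive it also compresses well, which would be consistent with JSON's zipped size in the table landing close to KML's.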
Linkability, Durability, and Discoverability
Moving past purely file format and data type specifications brings up the issue of discoverability and linkability in GeoWeb standards. The Web is more than a list of documents that mention resources; those documents can actually link to durable endpoints that can be resolved, queried, accessed, and parsed.
Non-web native formats have no concept of linking. CSV, Shapefiles, and SQLite contain data, but no links. By contrast, Atom, GML, and KML are chock-full of links, although not always used to great effect. JSON can contain links, but without a schema, who knows what the link means?
Obviously the best model to follow here is HTML, which provides automatic links to feeds, OpenSearch description documents, pages, media, styles, and scripts.
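As an illustration of that model, the sketch below collects the link elements from a page so a client can find the Atom feed and the OpenSearch description document. The sample page and its URLs are made up; the rel and type values are the ones commonly used for feed and OpenSearch autodiscovery.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (rel, type, href) for every <link> element that has an href."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attributes = dict(attrs)
        href = attributes.get("href")
        if href:
            self.links.append(
                (attributes.get("rel") or "", attributes.get("type") or "", href)
            )


def discover_links(html_text):
    """Return the link relations advertised by an HTML document."""
    collector = LinkCollector()
    collector.feed(html_text)
    return collector.links


# Made-up page for illustration.
SAMPLE_PAGE = """
<html><head>
  <title>Example</title>
  <link rel="alternate" type="application/atom+xml" href="/feed.atom">
  <link rel="search" type="application/opensearchdescription+xml" href="/opensearch.xml">
</head><body></body></html>
"""

for rel, media_type, href in discover_links(SAMPLE_PAGE):
    print(rel, media_type, href)
```

The rel value says what the target is for and the type says what format to expect, which is exactly the pairing a bare link inside a JSON document lacks.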
However, what happens when a resource disappears and is no longer resolvable? How do you know where else to get another version of the same data, and is it the same data? This is becoming a big problem in the larger web, made more problematic by the use of URL shorteners, but also especially disconcerting when it affects the provenance and accuracy of geospatial data.
But without Complexity
While linkability, durability, and discoverability are vital to GeoWeb standards, the cost of complexity inhibits adoption and lowers the probability of support.
This is a long-running argument in many circles - often made more difficult by practitioners who have been working in a field for years or decades and consider the most opaque formats or concepts commonplace. Look to the OWL/RDF/SemanticWeb space for an example of the mismatch between proponents and the general public.
A standard needs to have clear value to developers and users for it to even begin to be considered. No one is going to dive into a dense specification of a format without even knowing why they would want to use it or how it fits into workflows and architecture.
And complexity can also surface in small ways - inconsistent capitalization of element names (you know who you are, KML), or supporting a plethora of similar but different flavors, making it unclear which to use (GeoRSS).
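The capitalization point is easy to see in even the smallest KML document, where Placemark and Point are capitalized but name and coordinates are not. The sketch below parses a made-up placemark with Python's standard library; the sample content is illustrative, while the element names and namespace are those of KML 2.2.

```python
import xml.etree.ElementTree as ET

KML_NS = {"kml": "http://www.opengis.net/kml/2.2"}

# Made-up sample. Note the mix: <Placemark> and <Point> are capitalized,
# <name> and <coordinates> are lower case.
SAMPLE_KML = """<kml xmlns="http://www.opengis.net/kml/2.2">
  <Placemark>
    <name>Example</name>
    <Point><coordinates>-75.7,45.4,0</coordinates></Point>
  </Placemark>
</kml>"""


def placemarks(kml_text):
    """Yield (name, lon, lat) for each Placemark that has coordinates."""
    root = ET.fromstring(kml_text)
    for placemark in root.iterfind(".//kml:Placemark", KML_NS):
        name = placemark.findtext("kml:name", default="", namespaces=KML_NS)
        coordinates = placemark.findtext(".//kml:coordinates", namespaces=KML_NS)
        if coordinates is None:
            continue
        # KML writes each position as lon,lat[,alt]; take the first one.
        lon, lat = (float(part) for part in coordinates.split()[0].split(",")[:2])
        yield name, lon, lat


print(list(placemarks(SAMPLE_KML)))
```

XML names are case sensitive, so a lookup for kml:Name or kml:Coordinates would silently find nothing, which is the kind of small friction the paragraph above is pointing at.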
The last of the overall problems we're facing with GeoWeb standards, and both the most prevalent and the easiest to address, is the lack of tools that interact with and convert between these formats. Really, formats don't matter to users - they have data from a source such as their camera, PND, blog posts, or a government agency, and they want to do something with it, like understand what's going on around them, find their favorite restaurant, save the rainforest, provide services, get their car fixed, or just share stories with their family.
Easy-to-use, engaging, and data-agnostic tools are vital for the adoption of any format. Again, you only have to look as far as KML's meteoric rise from an application-specific format to perhaps the most ubiquitous, and still growing, GeoWeb standard, driven by the compelling reason of "I want to see my house and things going on around the world".
Why do none of the major RSS news readers really support GeoRSS? Every site should offer KML and Atom output of their data. Mobile devices should let me open any of my data, from any of my services, in whatever mapping interface or app I choose.
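To give a sense of how small a useful converter can be, here is a minimal sketch that turns a CSV with latitude and longitude columns into a GeoJSON FeatureCollection of points. The column names and the sample row are assumptions for illustration; real data would need handling for missing or malformed values.

```python
import csv
import io
import json


def csv_to_geojson(csv_text, lat_col="latitude", lon_col="longitude"):
    """Convert CSV rows with lat/lon columns into GeoJSON point features.

    The column names are assumptions; adjust them to match the data.
    """
    features = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        lat = float(row.pop(lat_col))
        lon = float(row.pop(lon_col))
        features.append({
            "type": "Feature",
            # GeoJSON positions are ordered longitude, latitude.
            "geometry": {"type": "Point", "coordinates": [lon, lat]},
            # Every remaining column becomes a property.
            "properties": dict(row),
        })
    return {"type": "FeatureCollection", "features": features}


# Made-up sample row.
SAMPLE_CSV = "name,latitude,longitude\nOttawa,45.42,-75.69\n"
collection = csv_to_geojson(SAMPLE_CSV)
print(json.dumps(collection, indent=2))
```

Going the other way, or out to KML or Atom, is the same shape of problem: read records in one vocabulary, write them in another.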
Missing Middle Ground
Amongst the plethora of formats, we're really missing some middle ground. Each of these formats is quite independent of the others, with little cross-pollination or linking occurring.
- Why can't my KML file link to Atom updates and also to other formats?
- Can OpenSearch describe my tile pyramid?
- How do I describe my path through life, media, events, places I've lived, worked, and people I've known?
We too easily get caught in either "this format must solve all possible problems" or "it's good enough, so why change it". In between, we need to converge on an understanding of the use cases, and of how these formats and specifications can cross various barriers - connecting the experts with the amateurs, the citizens and the authorities, one with another.