Apr 18 2009

Crowdsourcing the semantic web

Category: Ideas, Semantic Web
Aleksander Kmetec @ 4:08 am

This is getting a bit old, isn’t it? Even after years and years of hearing about the semantic web, actual semantic metadata is still extremely rare on the web. It’s obvious that our current approach to building the linked data cloud is just not working.

Currently, all attempts at providing semantic metadata require server-side changes, which means that we need to rely on page authors to implement them. This, of course, is a major obstacle. But what if we could change that? What if we could bypass page authors and have the crowd add semantic metadata to existing pages?

I believe that this is more than possible.

Semantic metadata would usually be added to web pages by adding attributes to HTML elements or creating new elements where needed. But as it turns out, existing web pages are already broken up into a surprising number of elements, so why not just use these?

Let’s take this search result from Amazon as an example:

Block tagging example

Relax, it's just a mock-up. Let's please resist arguing about the correctness of the labels.

Everything with a blue border in the image above is already a separate HTML block, addressable using an XPath expression. To attach a meaning to a block, we could take this XPath expression and associate it with a label, like this:

//div[@class="productData"]/div[@class="productTitle"]/a = “Book:Title”

Behind the scenes, URIs from an OWL-based ontology would be used instead of simple text labels. These mappings could then be uploaded to a server and made available to anyone.
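To make the idea more concrete, here is a minimal Python sketch of how such a mapping could be applied. Everything in it is an assumption for illustration: the ontology URI is a placeholder, the page is a simplified, well-formed stand-in for a real search result, and the standard library's ElementTree only supports a subset of XPath. Real-world HTML is rarely well-formed, so an actual tool would need a forgiving HTML parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical rule set: XPath expression -> ontology URI.
# The URI is a placeholder, not a real vocabulary.
RULES = {
    ".//div[@class='productData']/div[@class='productTitle']/a":
        "http://example.org/ontology/book#title",
}

# Simplified, well-formed stand-in for a search result page.
PAGE = """
<html><body>
  <div class="productData">
    <div class="productTitle"><a href="/item/1">Example Book One</a></div>
  </div>
  <div class="productData">
    <div class="productTitle"><a href="/item/2">Example Book Two</a></div>
  </div>
</body></html>
"""


def extract(page, rules):
    """Return (meaning URI, text) pairs for every block matched by a rule."""
    root = ET.fromstring(page)
    found = []
    for xpath, meaning in rules.items():
        for element in root.findall(xpath):
            text = "".join(element.itertext()).strip()
            found.append((meaning, text))
    return found


if __name__ == "__main__":
    for meaning, text in extract(PAGE, RULES):
        print(meaning, "=", text)
```

The rules live entirely outside the page, which is the point: anyone can write and share them without the page author changing a single line of markup.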

So let’s forget about convincing thousands of web developers to learn and start using semantic markup. Forget Microformats. Forget RDFa. Everything about the semantic web boils down to one thing: reusable data. And existing data already published on the web can be made reusable by the crowd, using simple tools. Within a year we could have thousands of reusable semantic data sources!

Possible uses

Once we have a large collection of rules for determining what the data inside a certain HTML block represents, what can we do with them?

We can begin by creating an application which knows how to extract data from web pages and transform it into various formats like the original HTML with RDFa mixed in, pure RDF, RSS and others. That’s right – we can still have RDF and therefore compatibility with existing and future data published in the same format! This would also create a business opportunity for hosted RDF services, similar to how Feedburner is hosting RSS feeds. A service could even be created for hosting SPARQL endpoints.
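As a rough illustration of the "pure RDF" output, the sketch below serializes extracted values as N-Triples using only the Python standard library. The subject naming scheme (`#item0`, `#item1`, ...), the property URI and the function names are all hypothetical, and the escaping covers only the most common cases; a real converter would more likely rely on an RDF library.

```python
def literal(text):
    """Escape a string as an N-Triples literal (common cases only)."""
    escaped = (
        text.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )
    return f'"{escaped}"'


def to_ntriples(page_url, records):
    """Turn extracted records into N-Triples.

    records is a list of dicts mapping a property URI to a text value.
    Each record gets an invented subject URI derived from the page URL.
    """
    lines = []
    for index, record in enumerate(records):
        subject = f"<{page_url}#item{index}>"
        for prop, value in record.items():
            lines.append(f"{subject} <{prop}> {literal(value)} .")
    return "\n".join(lines)


if __name__ == "__main__":
    # Placeholder data; the property URI is not a real vocabulary.
    records = [
        {"http://example.org/ontology/book#title": "Example Book One"},
        {"http://example.org/ontology/book#title": "Example Book Two"},
    ]
    print(to_ntriples("http://example.org/search", records))
```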

Diagram 1

What about some uses that would benefit the regular people instead of just linked data nerds?

I believe that one area which desperately needs semantic metadata is mobile web browsing. Limited screen sizes and tiny or even virtual keyboards make tasks which are trivial to perform using a desktop computer a real chore on mobile devices. With semantic metadata, mobile browsers could be much more context-aware and could offer a better browsing experience. Please see my experimental browser Mosembro for more information on that topic.

Diagram 2

Some other things which would suddenly become easy to implement:

  • Automatic browser-generated mashups. Put a map next to any address, a “find on Amazon” link next to any book, etc.
  • Ad-hoc personal search and comparison engines: search all international Amazon stores, search used car ads on multiple sites or find all the 26″ screens costing less than 400€ by pulling in data from several electronics retailers and then filtering it.
  • Autonomous agents that notify you when one of your favorite travel agencies posts a last minute offer that matches your criteria.
  • Site-level search integrated into the browser (by using semantically tagged search forms). I never want to use the browser’s “find in page” functionality again just to locate the search form! And since there would also be semantic metadata for search results available, local search results could then be combined with Google search results for that same domain.
  • Autofill for all kinds of forms.
  • Custom RSS feeds from any source, filtered by any set of criteria. (Show the Hacker News feed filtered by your favorite posters, or only posts with more than 5 comments.)
  • Backup tool for your data stored in web apps: store tagged content in a reusable format; then import it into another app (which can also be automated using tagged input forms).
  • Data portability
  • Pretty much anything else promised by Microformats and RDFa advocates. The list goes on and on.
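Once the data is tagged, most of the items above reduce to ordinary filtering. The sketch below shows the screen comparison example, assuming the offers have already been extracted from several retailers; the `Offer` fields, the shop names and the prices are all invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class Offer:
    source: str            # site the offer was extracted from
    name: str
    diagonal_inches: float
    price_eur: float


def cheap_screens(offers, diagonal=26, max_price=400):
    """Return offers with the given diagonal below max_price, cheapest first."""
    matches = [
        offer for offer in offers
        if offer.diagonal_inches == diagonal and offer.price_eur < max_price
    ]
    return sorted(matches, key=lambda offer: offer.price_eur)


if __name__ == "__main__":
    # Made-up data standing in for values pulled from tagged pages.
    offers = [
        Offer("shop-a.example", "Screen A", 26, 379.0),
        Offer("shop-b.example", "Screen B", 26, 449.0),
        Offer("shop-c.example", "Screen C", 24, 299.0),
        Offer("shop-b.example", "Screen D", 26, 349.0),
    ]
    for offer in cheap_screens(offers):
        print(offer.source, offer.name, offer.price_eur)
```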

I suggest you also have a look at demo videos for the Aurora browser concept, as they are full of great examples of what would be possible if lots of semantic metadata were available.

At this point you might be thinking: “This sounds nice and all, but it’s never going to work”, so here’s a closer look at two existing browser extensions which are based on similar approaches.

Extension 1: Autopager

Autopager is a Firefox extension which automatically loads the next page of a site inline when you reach the end of the current page. It’s interesting for us because it uses XPath expressions for addressing HTML blocks, user generated rules and a central server for sharing those rules.

In order for it to do its thing, Autopager only needs to know two things:

  • which page element is the link to the next page
  • which element represents the contents of the page.

This is done using XPath expressions, as shown in the image below:


Autopager is very popular and is an excellent proof that this approach works – at least for simple tasks such as sharing rules for locating the “next page” link.
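The two-rule setup described above can be sketched in a few lines of Python. This is not Autopager's actual rule format or code; the rule keys, the XPath expressions and the in-memory pages are assumptions made up for the example, and the pages are kept well-formed so the standard library parser can read them.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

# Hypothetical rule: one XPath for the "next page" link, one for the content.
RULE = {
    "next_link": ".//a[@class='next']",
    "content": ".//div[@id='results']",
}

# In-memory stand-ins for two pages of a paginated listing.
PAGES = {
    "http://example.org/list?page=1": (
        "<html><body><div id='results'><p>first</p></div>"
        "<a class='next' href='/list?page=2'>Next</a></body></html>"
    ),
    "http://example.org/list?page=2": (
        "<html><body><div id='results'><p>second</p></div></body></html>"
    ),
}


def read_page(page_url, markup, rule):
    """Return the content element and the absolute URL of the next page."""
    root = ET.fromstring(markup)
    content = root.find(rule["content"])
    link = root.find(rule["next_link"])
    next_url = urljoin(page_url, link.get("href")) if link is not None else None
    return content, next_url


def load_all(start_url, pages, rule):
    """Follow "next page" links and collect the text of every content block."""
    texts, url, seen = [], start_url, set()
    while url is not None and url in pages and url not in seen:
        seen.add(url)  # guard against pagination loops
        content, url = read_page(url, pages[url], rule)
        if content is not None:
            texts.append("".join(content.itertext()).strip())
    return texts


if __name__ == "__main__":
    print(load_all("http://example.org/list?page=1", PAGES, RULE))
```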

Extension 2: Intel Mash Maker


Intel’s Mash Maker is also a Firefox extension and it comes remarkably close to the idea described above. Poking at the JSON-encoded data returned by its web service reveals that it also uses XPath expressions for addressing HTML elements. It also makes it possible to further narrow down the selection using regular expressions to only capture a part of the element’s contents. The part where it falls short, though, is that it only uses plain text labels instead of URIs for defining the meaning of blocks. This unfortunately makes the definitions pretty much unusable outside of their mashup platform.
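The combination of an XPath expression with a regular expression might look something like the sketch below. This is not Mash Maker's actual data format; the rule structure, the markup and the ontology URI are assumptions for illustration. The rule carries a URI as its meaning, which is the piece the plain text labels leave out.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical rule: select elements with XPath, then keep only the part
# of their text captured by the regular expression.
RULE = {
    "xpath": ".//span[@class='price']",
    "pattern": r"(\d+(?:\.\d+)?)",
    "meaning": "http://example.org/ontology/offer#price",  # placeholder URI
}

MARKUP = (
    "<div>"
    "<span class='price'>Price: 19.99 EUR</span>"
    "<span class='price'>Price: 5 EUR</span>"
    "<span class='price'>Price on request</span>"
    "</div>"
)


def apply_rule(markup, rule):
    """Return (meaning URI, captured value) pairs for matching elements."""
    root = ET.fromstring(markup)
    values = []
    for element in root.findall(rule["xpath"]):
        text = "".join(element.itertext())
        match = re.search(rule["pattern"], text)
        if match:  # skip elements where the pattern finds nothing
            values.append((rule["meaning"], match.group(1)))
    return values


if __name__ == "__main__":
    for meaning, value in apply_rule(MARKUP, RULE):
        print(meaning, "=", value)
```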

Unfortunately, the project appears to be dead or on hold. The newest user-contributed mashup is more than two months old and the latest blog entry was posted more than a year ago. But it is still worth checking out, and it does prove that this approach can be implemented to handle more complex situations than those from the Autopager example.


There you have it, folks: a simple and straightforward way of solving the semantic web’s chicken and egg problem is perfectly within our reach.

The question now is who would be willing to sponsor such a project?
