Apr 18 2009

Crowdsourcing the semantic web

Category: Ideas, Semantic Web | Aleksander Kmetec @ 4:08 am

This is getting a bit old, isn’t it? Even after years and years of hearing about the semantic web, the actual semantic metadata is still an extremely rare occurrence on the web. It’s obvious that our current approach to building the linked data cloud is just not working.

Currently, all attempts at providing semantic metadata require server-side changes, which means that we need to rely on page authors to implement them. This, of course, is a major obstacle. But what if we could change that? What if we could bypass page authors and have the crowd add semantic metadata to existing pages?

I believe that this is more than possible.

Semantic metadata would usually be added to web pages by adding additional attributes to HTML elements or creating new elements where needed. But as it turns out, existing web pages are already broken up into a surprising number of elements, so why not just use these?

Let’s take this search result from Amazon as an example:

Block tagging example

Relax, it's just a mock-up. Let's please resist arguing about the correctness of the labels.

Everything with a blue border in the image above is already a separate HTML block, addressable using an XPath expression. To attach a meaning to a block, we could associate its XPath expression with a label, like this:

//div[@class="productData"]/div[@class="productTitle"]/a = “Book:Title”

Behind the scenes URIs from an OWL based ontology would be used instead of simple text labels. These mappings could then be uploaded to a server and made available to anyone.
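Here is a minimal sketch of how such a mapping could be applied, assuming a well-formed page and using only Python's standard library. The page snippet, the `RULES` table and the `tag_blocks` helper are made up for illustration. Real-world HTML is rarely well-formed XML, so an actual tool would need a forgiving HTML parser. ElementTree also only supports a subset of XPath and expects a relative path, hence the leading dot.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for a search results page.
PAGE = """
<html><body>
  <div class="productData">
    <div class="productTitle"><a href="/item/1">Example Book One</a></div>
  </div>
  <div class="productData">
    <div class="productTitle"><a href="/item/2">Example Book Two</a></div>
  </div>
</body></html>
"""

# Crowd-contributed rules: XPath expression -> label.
# In a real system the label would be a URI from an ontology.
RULES = {
    ".//div[@class='productData']/div[@class='productTitle']/a": "Book:Title",
}


def tag_blocks(page, rules):
    """Return (label, text) pairs for every block matched by a rule."""
    root = ET.fromstring(page.strip())
    tagged = []
    for xpath, label in rules.items():
        for element in root.findall(xpath):
            tagged.append((label, (element.text or "").strip()))
    return tagged


if __name__ == "__main__":
    for label, text in tag_blocks(PAGE, RULES):
        print(label, "=", text)
```

The rules live entirely outside the page, which is the whole point: nobody has to touch the server to make the data reusable.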

So let’s forget about convincing thousands of web developers to learn and start using semantic markup. Forget Microformats. Forget RDFa. Everything about the semantic web boils down to one thing: reusable data. And existing data already published on the web can be made reusable by the crowd, using simple tools. Within a year we could have thousands of reusable semantic data sources!

Possible uses

Once we have a large collection of rules for determining what the data inside a certain HTML block represents, what can we do with them?

We can begin by creating an application which knows how to extract data from web pages and transform it into various formats like the original HTML with RDFa mixed in, pure RDF, RSS and others. That’s right – we can still have RDF and therefore compatibility with existing and future data published in the same format! This would also create a business opportunity for hosted RDF services, similar to how Feedburner is hosting RSS feeds. A service could even be created for hosting SPARQL endpoints.
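As a rough illustration of the "pure RDF" output, the sketch below turns a few already extracted (label, value) pairs into N-Triples. The `to_ntriples` helper, the `example.org` subject URI and the label-to-URI table are assumptions made up for this example. The Dublin Core `title` property merely stands in for whatever ontology would really be used, and the escaping covers only the most common cases.

```python
def to_ntriples(subject_uri, tagged, label_to_uri):
    """Serialize (label, value) pairs as N-Triples with plain string literals."""
    lines = []
    for label, value in tagged:
        predicate = label_to_uri[label]
        # Escape the characters that would otherwise break a quoted literal.
        escaped = (
            value.replace("\\", "\\\\")
            .replace('"', '\\"')
            .replace("\n", "\\n")
            .replace("\r", "\\r")
        )
        lines.append(f'<{subject_uri}> <{predicate}> "{escaped}" .')
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical mapping from a simple text label to a property URI.
    label_to_uri = {"Book:Title": "http://purl.org/dc/terms/title"}
    tagged = [("Book:Title", "Example Book One")]
    print(to_ntriples("http://example.org/book/1", tagged, label_to_uri))
```

The same extracted pairs could just as easily be fed into an RSS or RDFa writer; only the last step changes.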

Diagram 1

What about some uses that would benefit the regular people instead of just linked data nerds?

I believe that one area which desperately needs semantic metadata is mobile web browsing. Limited screen sizes and tiny or even virtual keyboards make tasks which are trivial to perform using a desktop computer a real chore on mobile devices. With semantic metadata, mobile browsers could be much more context-aware and could offer a better browsing experience. Please see my experimental browser Mosembro for more information on that topic.

Diagram 2

Some other things which would suddenly become easy to implement:

  • Automatic browser-generated mashups. Put a map next to any address, a “find on Amazon” link next to any book, etc.
  • Ad-hoc personal search and comparison engines: search all international Amazon stores, search used car ads on multiple sites or find all the 26″ screens costing less than 400€ by pulling in data from several electronics retailers and then filtering it.
  • Autonomous agents that notify you when one of your favorite travel agencies posts a last minute offer that matches your criteria.
  • Site-level search integrated into the browser (by using semantically tagged search forms). I never want to use the browser’s “find in page” functionality again just to locate the search form! And since there would also be semantic metadata for search results available, local search results could then be combined with Google search results for that same domain.
  • Autofill for all kinds of forms.
  • Custom RSS feeds from any source, filtered by any set of criteria. (show Hacker news feed filtered by your favorite posters, or only posts with more than 5 comments)
  • Backup tool for your data stored in web apps: store tagged content in a reusable format; then import it into another app (which can also be automated using tagged input forms).
  • Data portability
  • Pretty much anything else promised by Microformats and RDFa advocates. The list goes on and on.
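To make the comparison engine idea from the list above a bit more concrete: once offers from several retailers have been extracted into a common shape, the "26″ screens under 400€" query is just a filter. The `Offer` record, its field names and the sample data below are all invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class Offer:
    source: str           # site the offer was extracted from
    title: str
    screen_inches: float
    price_eur: float


def find_offers(offers, inches, max_price_eur):
    """Return offers with the given screen size below the price limit, cheapest first."""
    matches = [
        o for o in offers
        if o.screen_inches == inches and o.price_eur < max_price_eur
    ]
    return sorted(matches, key=lambda o: o.price_eur)


if __name__ == "__main__":
    # Made-up data standing in for records extracted from three shops.
    offers = [
        Offer("shop-a.example", "Screen A", 26, 379.0),
        Offer("shop-b.example", "Screen B", 26, 429.0),
        Offer("shop-c.example", "Screen C", 24, 299.0),
        Offer("shop-c.example", "Screen D", 26, 349.0),
    ]
    for offer in find_offers(offers, inches=26, max_price_eur=400):
        print(offer.source, offer.title, offer.price_eur)
```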

I suggest you also have a look at demo videos for the Aurora browser concept, as they are full of great examples of what would be possible if lots of semantic metadata was available.

At this point you might be thinking: “This sounds nice and all, but it’s never going to work”, so here’s a closer look at two existing browser extensions which are based on similar approaches.

Extension 1: Autopager

Autopager is a Firefox extension which automatically loads the next page of a site inline when you reach the end of the current page. It’s interesting for us because it uses XPath expressions for addressing HTML blocks, user generated rules and a central server for sharing those rules.

In order for it to do its thing, Autopager only needs to know two things:

  • which page element is the link to the next page
  • which element represents the contents of the page.

This is done using XPath expressions, as shown in the image below:

Autopager

Autopager is very popular and is an excellent proof that this approach works – at least for simple tasks such as sharing rules for locating the “next page” link.
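A rule of this kind boils down to two XPath expressions per site. The sketch below is not Autopager's actual rule format, just an assumed minimal equivalent using Python's standard library. The page snippet, the `RULE` dictionary and the `apply_paging_rule` helper are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for one page of results.
PAGE = """<html><body>
  <div id="content"><p>Result 1</p><p>Result 2</p></div>
  <a class="next" href="/results?page=2">Next</a>
</body></html>"""

# A user-contributed rule: where the "next page" link is and where the content is.
RULE = {
    "next_link": ".//a[@class='next']",
    "content": ".//div[@id='content']",
}


def apply_paging_rule(page, rule):
    """Return the next page URL (or None) and the text of the content paragraphs."""
    root = ET.fromstring(page)
    link = root.find(rule["next_link"])
    content = root.find(rule["content"])
    next_url = link.get("href") if link is not None else None
    paragraphs = [p.text for p in content.findall("p")] if content is not None else []
    return next_url, paragraphs


if __name__ == "__main__":
    print(apply_paging_rule(PAGE, RULE))
```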

Extension 2: Intel Mash Maker

Intel’s Mash Maker is also a Firefox extension and it comes remarkably close to the idea described above. Poking at the JSON encoded data returned by its web service reveals that it also uses XPath expressions for addressing HTML elements. It also makes it possible to further narrow down the selection using regular expressions to only capture a part of the element’s contents. The part where it falls short, though, is that it only uses plain text labels instead of URIs for defining the meaning of blocks. This unfortunately makes the definitions pretty much unusable outside of their mashup platform.
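The combination of an XPath expression with a regular expression can be sketched as follows. This is not Mash Maker's actual data format. The rule layout, the `example.org` ontology URI and the `extract` helper are assumptions for illustration, with a URI used as the meaning instead of a plain text label.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for a product page fragment.
PAGE = """<html><body>
  <div class="price">Price: EUR 349.00 (incl. VAT)</div>
</body></html>"""

# The XPath selects the block, the pattern captures only the useful part,
# and the meaning is a (made-up) ontology URI rather than a text label.
RULE = {
    "xpath": ".//div[@class='price']",
    "pattern": r"(\d+(?:\.\d+)?)",
    "meaning": "http://example.org/ontology#price",
}


def extract(page, rule):
    """Return (meaning, captured text), or None if the rule does not match."""
    root = ET.fromstring(page)
    element = root.find(rule["xpath"])
    if element is None or element.text is None:
        return None
    match = re.search(rule["pattern"], element.text)
    if match is None:
        return None
    return rule["meaning"], match.group(1)


if __name__ == "__main__":
    print(extract(PAGE, RULE))
```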

Unfortunately, the project appears to be dead or on hold. The newest user contributed mashup is more than two months old and the latest blog entry was posted more than a year ago. But it still is worth checking out and it does prove that this approach can be implemented to handle more complex situations than those from the Autopager example.

Finally…

There you have it, folks: a simple and straightforward way of solving the semantic web’s chicken and egg problem is perfectly within our reach.

The question now is who would be willing to sponsor such a project?

19 Responses to “Crowdsourcing the semantic web”

  1. Ben Stein

    Very interesting post!

    For another view of semantic-web solutions, check out http://www.urlclassifier.com for an online demo of a web service that classifies web pages, using some of ContextIn’s Semantic Advertising algorithms.

  2. alex

    Interesting link. I completely forgot about making it possible for the crowd to create metadata about whole pages, not just about elements.

  3. Maciek Adwent

    Well-timed post Alex; We’re also working on something just like this and are trying to make it as pluggable, open, and useful as possible. Along the way we’ve discovered that there are a number of intermediate forms that such an application can take before even offering full RDF/OWL/SPARQL support, which is questionably useful without the widespread existence of tools (A chicken-egg problem, sounds familiar doesn’t it).

    Check it out at: http://scrapmetl.com/

    Give us a shout on Twitter ( @Maciek416 and @corban )

  4. Dan Woolley

    Good explanation. We use these techniques at Dwellicious – extracting property data from publicly available homes for sale pages and turning that into RSS, JSON, and reformatted HTML for social bookmarking.

  5. Vincent Murphy

    How does Solvent (http://simile.mit.edu/wiki/Solvent) fit into your scheme? I think all that Solvent is missing is a website where you can upload and share your scraper.

  6. tim finin

    How does this compare to the W3C’s Annotea project (http://www.w3.org/2001/Annotea/)? Somehow that never seemed to find widespread use and was, apparently, abandoned.

  7. Kyle Maxwell

    Have you seen parselets.com? It’s almost exactly this.

  8. Luis Pereira

    Stumpedia is a social semantic project and community effort that relies on human participation and folksonomies to index, organize, and review the world wide web. Their aim is to help build Natural Language Processing and the Semantic Web.

  9. alex

    @tim finin: Just had a quick look at Annotea. It appears to be similar from the technical point of view, but they didn’t take the idea beyond attaching notes to elements. Their main focus seems to be the notes you can attach, not the HTML fragments you’re referencing when you’re attaching the note. Maybe there’s some potential for “abusing” this by putting URIs in the notes, but that wouldn’t be exactly user friendly.

  10. alex

    @Vincent Murphy: Solvent looks very interesting. I never tried it out because it requires PiggyBank, which in turn requires an older version of Firefox… But yes, as far as the element tagging part goes it’s almost exactly like what I described in the post. It could be a very useful starting point for a project. Adding more than just DublinCore support to it might be a good place to start (at least the screencast makes it look like they only support DC).

  11. Stephen Arnold

    The April 16, 2009 Google patent US20090100036.

  12. alex

    I just love the patent application language. I could swear that individual words are in English, but sentences are somehow not. :)

  13. Oleksandr Shturmov

    Google. As they crowd-sourced their image search optimization, they can crowdsource, and thus sponsor something like this. However the semantics is only half of the NLP problem.

  14. Oleksandr Shturmov

    *they can crowdsource this*

  15. Dan Brickley

    A lot of RDF folk are more pragmatic than is sometimes assumed. RDFa is nice, but we also made GRDDL, a system that has much in common with your (perfectly sensible) suggestions, and which uses XSLT as the language for describing such extractors. You might also look at http://buzzword.org.uk/2008/rdf-ease/spec which does similar using CSS-based notation.

    Your positive case here stands alone, no need to beat up on strawmen (“It’s obvious that our current approach to building the linked data cloud is just not working.”) to make it. The “current approach” to growing the linked data cloud is that various of us do whatever it takes to get the data out there. Sometimes this is tweaking a PHP script to have an RDF/XML mode or add RDFa / microformats into SQL. Sometimes this is a massive download, clean and republish exercise like DBPedia, sometimes it is conducted by transforming from SQL (D2RQ etc) or XML (GRDDL) or JSON sources. Sometimes data is created afresh, or reworked from other systems (Semantic Mediawiki, MusicBrainz, …). The only thing that binds it all together is the shared use of standards. RDF for data model, RDFS/OWL for vocabulary description, URIs for identifiers, SPARQL for querying, SKOS for simple categories, …

  16. Dan Brickley

    s/SQL/HTML/ in my previous comment, ie. “add RDFa / microformats into HTML”

  17. Aleksander Kmetec

    Wow. RDF-EASE looks like something that would be easily understandable to anyone familiar with CSS and would cut down on the number of new acronyms that would need to be mastered by developers. But still… Remember the CSS “movement” from not that long ago? It took a dozen high profile bloggers and an active community something like 5 years, lots of hard work and some well crafted stories to push CSS into the mainstream. All that for one technology. The semantic community, on the other hand, doesn’t currently have any leaders and no infrastructure for spreading ideas at all. All we have is a big bowl of acronym soup and we’re completely baffled by the fact that (almost) nobody wants to eat it.

  18. Fuller

    The MetaSeeker toolkit and its web-based services are another example. MetaSeeker is an HTML wrapper factory. The generated HTML wrappers, also called scrapers, are coded with XML, XSLT and XPath, and are shared and collaboratively maintained on the web-based MetaSeeker server. Compared to RDF, XML is more lightweight and pragmatic from the point of view of defining semantic data structures of web pages.

    Please check it at: http://www.gooseeker.com

  19. Rachel

    I thought the post made some good points on web scrapers. I use Python for simple HTML web scrapers, but for larger projects like the web, files, or documents I tried web scrapers, which worked great. They build quick custom screen scrapers, web scrapers, and data parsing programs.