Today most people are overwhelmed with information. Not only is there an enormous amount to read across a variety of devices, but much of it links to other content that takes time to fetch, and users end up wasting time when that content is not relevant to them.
Wasting time is something we never want to do, and guess what? Our bosses don't want us to either. They want us to share and learn, but they definitely don't want us to waste time.
With that goal in mind, the Socialcast team embarked on a mission to provide more context for links shared in our application: a small description, a picture, a title and the topics the page covers. This practice is not new. Other consumer portals already parse HTML to try to extract this information.
This is a difficult job because, before HTML5, markup was not inherently semantic.
Most of the tags in HTML were only for layout.
Being the agile team that we are, we didn't want to spend a lot of time handling HTML with improperly closed tags or working out, based on size, which image is the most representative one to render. Too complex.
So we asked: what is the best technique for capturing the relevant data in a web page? Our research brought us to the following specifications:
- RDFa: http://www.w3.org/TR/rdfa-syntax
- Open Graph Protocol: http://opengraphprotocol.org/
- Microdata: See previous post http://montrics.blogspot.com/2010/10/html5-implementors-experience-with-ogp.html
- Microformats: http://microformats.org/wiki/Main_Page
- oEmbed: http://www.oembed.com/
- And we should not leave out HTML5
What do they have in common?
All of these are specifications detailing how to add semantic data to your web pages. They cover the syntax and concepts and, in some cases, a detailed vocabulary.
How many objects can you read out of an HTML page?
- RDFa: As many as you want. You can create new vocabularies and describe not only objects but also entire sentences with a subject, predicate and object. It also allows cross-referencing objects.
- Microformats: All the ones which map to a specific microformat.
- Open Graph Protocol: One main object with some predefined relations. The object can have one of a variety of types specified by Facebook.
- oEmbed: One, specified via link rel="alternate" type="text/xml+oembed"
- Microdata: As many as you want. It does not enforce global uniqueness or the use of namespaces for types. Objects can be defined ad hoc.
What is needed to use them?
- RDFa: The http://www.w3.org/MarkUp/DTD/xhtml-rdfa-1.dtd DTD in the XHTML doctype
- Microformats: Nothing
- Open Graph Protocol: Nothing, but it should be the same as RDFa
- oEmbed: Another document, which means performing a separate HTTP request.
- Microdata: The HTML5 doctype, but it doesn't break anything in practice
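To illustrate why oEmbed needs a second document, here is a minimal Python sketch of the discovery step: it scans a page for link elements with rel="alternate" and an oEmbed type, and returns the URLs that would then have to be fetched separately. The class name, function name and sample page are hypothetical, chosen only for this example; it is not the Socialcast code, and it does not perform the HTTP request itself.

```python
from html.parser import HTMLParser


class OEmbedLinkFinder(HTMLParser):
    """Collect oEmbed discovery links (hypothetical helper for illustration)."""

    def __init__(self):
        super().__init__()
        self.endpoints = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attributes = dict(attrs)
        rel = (attributes.get("rel") or "").lower().split()
        link_type = (attributes.get("type") or "").lower()
        href = attributes.get("href")
        # A discovery link is rel="alternate" with a type ending in "+oembed".
        if "alternate" in rel and link_type.endswith("+oembed") and href:
            self.endpoints.append((link_type, href))


def find_oembed_endpoints(html):
    """Return (type, href) pairs for every oEmbed discovery link in the page."""
    finder = OEmbedLinkFinder()
    finder.feed(html)
    return finder.endpoints


# Made-up sample page; example.com is a placeholder host.
SAMPLE_OEMBED_PAGE = """
<html><head>
<link rel="stylesheet" href="/site.css">
<link rel="alternate" type="text/xml+oembed"
      href="http://example.com/oembed?url=http%3A%2F%2Fexample.com%2Fvideo">
</head><body></body></html>
"""

print(find_oembed_endpoints(SAMPLE_OEMBED_PAGE))
```

Each returned URL points to the separate oEmbed document, so describing one shared link costs at least two requests: one for the page and one for its oEmbed data.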
Initially we just wanted to extract a single object, so we decided to use the Open Graph Protocol. It provides a short set of rules, which on one hand is great because the code to parse it is very short; on the other hand, it's not flexible enough, as described in my earlier post, which covers the issues with making it work in existing closed-source business systems.
This is why we turned to Microdata. We have posted a lot of detail on our wiki about how the parsing works and how to extend it, and we have built in the ability to use other vocabularies, such as Activity Streams or even the oEmbed vocabulary.
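To give a feel for how short an Open Graph Protocol parser can be, here is a minimal Python sketch that collects the og: meta tags from a page. It is an illustration only, not our production code: the class name, function name and sample page are made up for this example, and it keeps just the first value of each property.

```python
from html.parser import HTMLParser


class OpenGraphParser(HTMLParser):
    """Collect <meta property="og:..." content="..."> pairs (illustrative only)."""

    def __init__(self):
        super().__init__()
        self.properties = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        prop = attributes.get("property") or ""
        content = attributes.get("content")
        if prop.startswith("og:") and content is not None:
            # Drop the "og:" prefix and keep the first value seen per property.
            self.properties.setdefault(prop[3:], content)


def extract_open_graph(html):
    """Return a dict of Open Graph properties found in the page."""
    parser = OpenGraphParser()
    parser.feed(html)
    return parser.properties


# Made-up sample page; example.com is a placeholder host.
SAMPLE_OGP_PAGE = """
<html><head>
<meta property="og:title" content="Example article" />
<meta property="og:type" content="article" />
<meta property="og:image" content="http://example.com/cover.png" />
<meta property="og:description" content="A short summary." />
<meta name="keywords" content="ignored" />
</head><body></body></html>
"""

print(extract_open_graph(SAMPLE_OGP_PAGE))
```

The result is a single flat object (title, type, image, description), which matches the "one main object" model and also shows where the inflexibility comes from: anything that does not fit those predefined properties has nowhere to go.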
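As a rough illustration of what reading Microdata involves, the Python sketch below pulls items out of a page using the itemscope, itemtype and itemprop attributes. It is not the parser described on our wiki: all names and the sample vocabulary URL are hypothetical, it assumes well-formed markup, and it ignores nested items and itemref.

```python
from html.parser import HTMLParser

# Elements that never get a closing tag, so they must not change the depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}
# Elements whose property value comes from a URL attribute instead of text.
URL_ATTRIBUTE = {"a": "href", "link": "href", "img": "src"}


class MicrodataParser(HTMLParser):
    """Extract flat (non-nested) Microdata items; a simplified sketch."""

    def __init__(self):
        super().__init__()
        self.items = []
        self._item = None        # item currently being filled
        self._item_depth = 0     # depth of the element carrying itemscope
        self._depth = 0          # number of currently open non-void elements
        self._text = None        # (property name, depth, text chunks)

    def _add(self, name, value):
        self._item["properties"].setdefault(name, []).append(value)

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        is_void = tag in VOID_TAGS
        if not is_void:
            self._depth += 1
        if "itemscope" in attributes and self._item is None and not is_void:
            self._item = {"type": attributes.get("itemtype"), "properties": {}}
            self._item_depth = self._depth
            self.items.append(self._item)
            return
        name = attributes.get("itemprop")
        if self._item is None or not name:
            return
        if tag == "meta":
            self._add(name, attributes.get("content"))
        elif tag in URL_ATTRIBUTE:
            self._add(name, attributes.get(URL_ATTRIBUTE[tag]))
        elif not is_void and self._text is None:
            # The value is the element's text; collect it until the end tag.
            self._text = (name, self._depth, [])

    def handle_data(self, data):
        if self._text is not None:
            self._text[2].append(data)

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return  # e.g. the implicit end of a self-closing <meta />
        if self._text is not None and self._text[1] == self._depth:
            name, _, chunks = self._text
            self._add(name, "".join(chunks).strip())
            self._text = None
        if self._item is not None and self._item_depth == self._depth:
            self._item = None  # left the element that opened the item
        self._depth -= 1


def extract_microdata(html):
    parser = MicrodataParser()
    parser.feed(html)
    return parser.items


# Made-up sample; the vocabulary URL is a placeholder, not a real vocabulary.
SAMPLE_MICRODATA_PAGE = """
<div itemscope itemtype="http://example.com/vocab/Article">
  <h1 itemprop="title">Example article</h1>
  <img itemprop="image" src="http://example.com/cover.png">
  <a itemprop="url" href="http://example.com/article">Read more</a>
  <meta itemprop="topic" content="semantic web">
  <meta itemprop="topic" content="html5">
</div>
"""

print(extract_microdata(SAMPLE_MICRODATA_PAGE))
```

Because the item type is just a URL and property names are free-form, the same loop can read any vocabulary, which is the flexibility the Open Graph Protocol lacked for us.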
So what do the users get in return?
- Distributed discussions throughout their ecosystem
- Good sources of material curated by people they trust, their colleagues.
- Rapid deployment
- Easy to add to business systems