RDF, Big Data and The Semantic Web

24 April 2012

I’ve been meaning to write this post for a little while, but things have been busy. So, with this afternoon free I figured I’d write it now.

I’ve spent the last 7 years working intensively with data. Mostly not with RDBMSs, but with different Big Data and Linked Data tools. Over the past year things have changed enormously.

The Semantic Web

A lot has been talked about the Semantic Web for a long time now. In fact, I often advise people to search for Linked Data rather than Semantic Web as the usefulness of the results in a practical context is vast. The semantic web has been a rather unfortunately academic endeavour that has been very hard for many developers to get into. In contrast, Linked Data has seen explosive growth over the past five years. It hasn’t gone mainstream though.

What does show signs of going mainstream is the schema.org

initiative. This creates a positive feedback loop between sites putting structured data into their pages and search engines giving those sites more and better leads as a result.

Much has been said about Microdata killing RDF blah, blah but that’s not important. What is important is that publishing machine-understandable data on the web is going mainstream.

As an aside, as Microdata extends to solve the problems it currently has (global identifiers and meaningful links) it becomes just another way to write down the RDF model anyway. RDF is an abstract model, not a data format, and at the moment Microdata is a simplified subset of that model.

Big Data and NoSQL

In the meantime another data meme has also grown enormously. In fact, it has dwarfed Linked Data in the attention it has captured. That trend is Big Data and NoSQL.

In Planning for Big Data, Edd talks about the three Vs:

To clarify matters, the three Vs of volume, velocity and variety are commonly used to characterize different aspects of big data. They’re a helpful lens through which to view and understand the nature of the data and the software plat- forms available to exploit them. Most probably you will contend with each of the Vs to one degree or another.

Most Big Data projects are really focussed on volume. They have large quantities, terabytes or petabytes, of uniform data. Often this data is very simple in structure, such as tweets. Fewer projects are focussed on velocity, being able to handle data coming in quickly and even fewer on variety, having unknown or widely varied data.

You can see how the Hadoop toolset is tuned to this and also how the NoSQL communities focus mostly on denormalisation of data. This is a good way to focus resources if you have large volumes of relatively simple, highly uniform data and a specific use-case or queries.

Apart from Neo4J, which is the odd-one-out in the Big Data community this is the approach.

RDF

So, while we wait for the semantic web to evolve, what is RDF good for today?

That third V of the Big Data puzzle is where I’ve been helping people use graphs of data (and that’s what RDF is, a graph model). Graphs are great where you have a variety of data that you want to link up. Especially if you want to extend the data often and if you want to extend the data programmatically — i.e. you don’t want to commit to a complete, constraining schema up-front.

The other aspect of that variety in data that graphs help with is querying. As Jem Rayfield (BBC News & Sport) explains, using a graph makes the model simpler to develop and query.

Graph data models can reach higher levels of variety in the data before they become unwieldy. This allows more data to be mixed and queried together. Mixing in more data adds more context and more context adds allows for more insight. Insight is what we’re ultimately trying to get at with any data analysis. That’s why the intelligence communities have been using graphs for many years now.

What we’re seeing now, with the combination of Big Data and graph technologies, is the ability to add value inside the enterprise. Graphs are useful for data analysis even if you don’t intend to publish the data on the semantic web. Maybe even especially then.

Microsoft, Oracle and IBM are all playing in the Big Data space and have been for some time. What’s less well-known and less ready for mainstream is that they all have projects in the graph database space: DB2 NoSQL Graph Store; Oracle Database Semantic Technologies; Connected Services Framework 3.0.

Behind the scenes, in the enterprise, is probably the biggest space where graphs will be adding real value over the next few years; solving the variety problem in Big Data.