What people find hard about Linked Data
This post originally appeared on Talis Consulting Blog.
Following on from my last post about Linked Data training, I was asked what people find hard when learning about Linked Data for the first time. Delivering our training has given us a unique insight into that, across different roles, backgrounds and organisations, and in several countries. We've taught hundreds of people in all.
It's definitely true that people find Linked Data hard, but the learning curve is not really steep compared with other technologies. The main problem is there are a few steps along the way, certain things you have to grasp to be successful with this stuff.
I've broken those down into conceptual difficulties, to do with the way we think, and practical problems. These are our perceptions; there are specific tasks in the course that people find difficult, but I'm trying to get beyond the what and describe the why of these difficulties and how we might address them.
The main steps we find people have to climb (in no particular order) are Graph Thinking, URI/URL distinction, Open World Assumption, HTTP 303s, and Syntax...
Conceptual
Graph Thinking
The biggest conceptual problem learners seem to have is with what we call graph thinking. What I mean by graph thinking is the ability to think about data as a graph, a web, a network. We talk about it in the training material in terms of graphs, and start by explaining what a graph is (and that it's not a chart!).
Non-programmers seem to struggle with this, not with understanding the concept, but with putting themselves above the data. It seems to me that most non-programmers we train find it very easy to think about the data from one point of view or another, but hard to think about it outside of a specific use-case.
Take the idea of a simple social network: friend-to-friend connections. Everyone can understand the list of someone's friends, and on from there to friends-of-friends. The step up seems to be in understanding the network as a whole, the graph. Thinking about the social graph, that your friends have friends, and that your friends' friends may also be your friends, and that it all forms an intertwined web, seems to be the thing to grasp. If you're reading this, you may well be wondering what's hard about that, but I can tell you that for people starting to think about Linked Data, this is a step up they have to take.
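To make that concrete, here's a minimal sketch in Python using the rdflib library; the people and example.org URIs are made up for illustration. The point is that the friendships are just triples in one graph, and you can walk that graph from any starting point rather than from one fixed view of the data.

```python
# A minimal sketch of graph thinking, using rdflib.
# The people and example.org URIs are invented for illustration.
from rdflib import Graph, URIRef
from rdflib.namespace import FOAF

g = Graph()
alice = URIRef("http://example.org/id/alice")
bob = URIRef("http://example.org/id/bob")
carol = URIRef("http://example.org/id/carol")

# friend-to-friend connections: just triples, no table, no tree
g.add((alice, FOAF.knows, bob))
g.add((bob, FOAF.knows, carol))
g.add((alice, FOAF.knows, carol))  # a friend-of-a-friend who is also a friend

# walk the graph from any point of view: Alice's friends-of-friends
for friend in g.objects(alice, FOAF.knows):
    for foaf in g.objects(friend, FOAF.knows):
        overlap = (alice, FOAF.knows, foaf) in g
        print(f"{foaf} is a friend of {friend}; also Alice's friend? {overlap}")
```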
There's no reason anyone should find this easy; in everyday life we're always looking at information in a particular context, for a specific purpose and from an individual point of view.
For developers it can be even harder. Having worked with tables in the RDBMS for so long, many developers have adopted tables as their way of thinking about the problem. Even for those fluent in object-oriented design (a graph model), the practical implications of working with a graph of objects lead us to develop, predominantly, trees.
Don't get me wrong, people understand the concept. However, even with experience, we all seem to struggle to extract ourselves from our own specific view when modelling the data.
What can we do?
This will take time to change. As we see more and more data consumed in innovative ways we will start to grasp the importance of graph thinking and modelling outside of a single use-case. We can help this by really focussing on explaining the benefits of a graph model over trees and tables.
I hope we'll see colleges and universities start to teach graph models more fully, putting less focus on the tables of the RDBMS and the trees of XML.
Examples like BBC Wildlife Finder, and other Linked Data sites, show the potential of graph thinking and the way it changes user experience.
For developers, tools such as the RDF mapping tools in Drupal 7 and emerging object/RDF persistence layers will help hugely.
Using URIs to name real things
In Linked Data we use URIs to name things, not just to address documents, but as names to identify things that aren't on the web, like people, places and concepts. For people coming to Linked Data, knowing how to do this is another step they have to climb.
First they have to recognise that they need different URIs for the document and the thing the document describes; there's a short sketch of that split after the list below. It's a leap to understand:
- that they can just make these up
- that no meaning should be inferred from the words in it (and yet best practice is to make them readable)
- that they can say things about other people's URIs (though those statements won't be de-referenceable)
- that they can choose their own URIs and URI patterns to work to
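As a rough sketch of the first points, here's how the document/thing split might look in Python with rdflib; all the example.org URIs are invented, and the patterns shown aren't a recommendation:

```python
# A sketch of the document/thing split, with made-up example.org URIs.
from rdflib import Graph, URIRef, Literal
from rdflib.namespace import FOAF, RDFS

thing = URIRef("http://example.org/id/alice")  # names the person (not on the web)
doc = URIRef("http://example.org/doc/alice")   # addresses the document about her

g = Graph()
g.add((doc, FOAF.primaryTopic, thing))         # the document is about the thing
g.add((thing, FOAF.name, Literal("Alice")))

# saying something about somebody else's URI is allowed, too
dbpedia = URIRef("http://dbpedia.org/resource/Alice_(given_name)")
g.add((thing, RDFS.seeAlso, dbpedia))
```

Nothing stops you adding that last statement about a URI you don't own; you just can't promise what de-referencing it will return.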
What can we do?
There are already resources discussing URI patterns and the trade-offs that we can point people to. These will help. What I find helps a lot is simply pointing out that they own their URIs, and that they should reclaim them from .Net or Java or PHP or whatever technology has subverted them. More on that below in supporting custom URIs.
As a community we could focus more on our own URIs, talking more about why we made the decisions we did; why natural keys, why GUIDs, why readable, why opaque?
Non-Constraining Nature (Open World Assumption)
Linked Data follows the open-world assumption — that something you don't know may be said elsewhere. This is a sea-change for all developers and for most people working with data.
For developers, data storage is very often tied up with data validation. We use schema-validating parsers for XML and we put integrity constraints into our RDBMS schemas. We do this with the intention of making our lives easier in the application code, protecting ourselves from invalid data. Within the single context of an application this makes sense, but on the open web, where we remix data from different sources, expect some data to be missing, and want to use that same data in many different and unexpected ways, it doesn't.
Non-developers are often used to business rules, another way of describing constraints on what data is acceptable. It's also common for them to have particular uses of the data in mind, and to want to constrain for those uses, possibly preventing other uses.
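Here's a small sketch of what this looks like in practice, again in Python with rdflib; the two data sources and the URIs are invented. Two graphs are merged, and the application checks only for what it needs at the point of use, treating absence as "not stated here" rather than as an error:

```python
# A sketch of working under the open-world assumption: missing data is
# not invalid data. Sources and URIs are invented for illustration.
from rdflib import Graph, URIRef
from rdflib.namespace import FOAF

data_a = """@prefix foaf: <http://xmlns.com/foaf/0.1/> .
<http://example.org/id/alice> foaf:name "Alice" ."""

data_b = """@prefix foaf: <http://xmlns.com/foaf/0.1/> .
<http://example.org/id/alice> foaf:mbox <mailto:alice@example.org> ."""

g = Graph()
g.parse(data=data_a, format="turtle")
g.parse(data=data_b, format="turtle")  # remixing two sources into one graph

# validate only what this application needs, at the point of use
name = g.value(URIRef("http://example.org/id/alice"), FOAF.name)
if name is None:
    # absence means "not said here", not "false" -- fall back, don't reject
    name = "unknown"
```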
What can we do?
Tooling and application development patterns will help here, moving constraints out of storage and into the application's context. Jena Eyeball is one option here and we need others. We need to support developers better in finding, constraining, validating data that they can consume in their applications. Again, this will come with time.
We could also look for case-studies, where the relaxing of constraints in storage can allow different (possibly conflicting) applications to share data, removing duplication. This would be a good way to show how data independent of context has significant benefit.
Practical
HTTP, 303s and Supporting Custom URIs
Certainly for most data owners, curators and admins this stuff is an entirely different world, and one could argue a world they shouldn't need to know about. With Linked Data, URI design comes into the domain of the data manager, where historically it has always been the domain of the web developer.
Even putting that aside, development tools and default server configurations mean that many of the web developers out there have a hard time with this stuff. The default for almost all server-side web languages is to route requests to code using the filename in the URI: index.php, renderItem.aspx and so on. And when do we need to work with response codes? Most web devs today will have had no reason to experience more than 200, 404 and 302; some will understand 401 if they've done some work with logins, but even then most frameworks will hide that from you.
So, the need to route requests to code using a mechanism other than the filename in the URL is something that, while simple, most people haven't done before. Add to that the need to handle non-information resources, issue raw 303s and then handle the request for a very similar document URL, and you have a set of requirements that is out of the norm and that looks complicated.
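To show how little machinery is actually involved, here's a minimal sketch using Flask; the /id/ and /doc/ URI patterns are illustrative assumptions, not a recommendation:

```python
# A minimal sketch of 303 handling with Flask; URI patterns are invented.
from flask import Flask, redirect

app = Flask(__name__)

@app.route("/id/<name>")   # non-information resource: the thing itself
def thing(name):
    # we can't send the person over HTTP, so send "See Other" to a document
    return redirect(f"/doc/{name}", code=303)

@app.route("/doc/<name>")  # information resource: a document about the thing
def document(name):
    return f"A document describing http://example.org/id/{name}", 200
```

The 303 says "I can't give you the thing itself, see this other resource"; the very similar document URL then responds with a normal 200.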
What can we do?
Working with different frameworks and technologies to make custom URLs the norm and filename-based routing frowned upon would be good. This doesn't need to be a Linked Data specific thing either; the notion of Cool URIs would also benefit.
We could help different tools build in support for 303s as well, or we could look to drop the need for 303s (which would be my preference). Either way, they need to get easier.
Syntax
This is a tricky one. I nearly put this under the conceptual issues, as part of the learning curve is grasping that RDF has multiple syntaxes and that they are all equal. However, most people get that quite quickly, even if they do have problems with its implications.
Practically, though, people face quite a step with our two most prominent syntaxes, RDF/XML and Turtle. The specifics differ slightly for each, but the essence is common: identifying the statements.
Turtle is far easier to work with than RDF/XML in this regard, but even Turtle, when you apply all the semicolons and commas to arrive at a concise fragment, is still a step. The statements don't really stand out.
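To illustrate, here's a sketch (Python with rdflib, invented URIs) that parses a concise Turtle fragment, with its semicolons and commas, and re-serialises the very same statements as RDF/XML:

```python
# One graph, two equal serialisations: a sketch using rdflib.
from rdflib import Graph

turtle = """@prefix foaf: <http://xmlns.com/foaf/0.1/> .
<http://example.org/id/alice>
    foaf:name  "Alice" ;                       # semicolon: same subject continues
    foaf:knows <http://example.org/id/bob>,
               <http://example.org/id/carol> . # comma: same predicate continues
"""

g = Graph()
g.parse(data=turtle, format="turtle")
print(g.serialize(format="xml"))  # the same three statements as RDF/XML
```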
What can we do?
There are already lots of validators around, and they help a lot. What would really help during the learning stages would be a simple data explorer that could be used locally to load, visualise and navigate a dataset. I don't know of one yet — you?
Summary
None of the steps above are actually hard; taken individually they are all easy to understand and work through, especially with the help of someone who already knows what they're doing. But taken together they add up to a perception that Linked Data is complex, esoteric and different to simply building a website, and it is that (false) perception that we need to do more to address.
[...] "twitter,facebook,delicious,google_plusone,digg,hackernews,favorites,email,print";In the blog post What people find hard about Linked Data, Rob Styles covered the difficulties that people face when they first learn about publishing Linked [...]