Choosing URIs, not a five minute task.

21 April 2011

This post originally appeared on the Talis Consulting Blog.

Chris Keene at Sussex is having a tough time making a decision on his URIs so I thought I’d wade in and muddy the waters a little.

He’s following advice from the venerable Designing URI Sets for the UK Public Sector. An eleven page document from the heady days of October 2009.

Chris discusses the choice between data.lib.sussex.ac.uk and www.sussex.ac.uk/library/ in terms of elegance, data merging and running infrastructure. He’s leaning toward data.lib.sussex.ac.uk on the basis that data.organisation.tld is the prevailing wind.

There are many more aspects worth considering, and while data.organisation.tld may be a way to get up and running quickly you might get longer term benefit from more consideration; after all we don’t want these URIs to change.

The key requirements are outlined well in ‘Designing URI Sets’ as follows

3. In particular, the domain will:

Expect to be maintained in perpetuity

Not contain the name of the department or agency currently defining and naming a concept, as that may be re-assigned

Support a direct response, or redirect to department/agency servers

Ensure that concepts do not collide

Require the minimum of central administration and infrastructure costs

Be scalable for throughput, performance, resilience

These are all key points, but one in particular stands out for me in terms of choosing the hostname part of a URI

Not contain the name of the department or agency currently defining and naming a concept, as that may be re-assigned

That simple sentence contains a lot more than at first reading and suggests that any or all of the concepts defined in the data may become someone else’s responsibility in time. I think over time we will see this becoming key to the longevity of URIs, along with much better redirect maintenance.

The approach data.gov.uk has taken is to break the data into broad subject areas within which many different types of data might sit – education.data.gov.uk, transport.data.gov.uk, crime.data.gov.uk, health.data.gov.uk and so on. This is one example of breaking up the hosts and while right now they all point to one cluster of web servers they can be moved around to allow hosting in different places.

This is good, yet I can’t help thinking that those subject matter areas are really rather broad. Then there are others that seem to work on a different axis, statistics.data.gov.uk and research.data.gov.uk. Leaving me confused at first glance as to where the responsibility for publishing crime research would lie. Then there is patents.data.gov.uk, not “innovation” or “invention” but “patents”, the things listed.

Data.gov.uk has done a great job trailblazing, making and publishing their decisions and allowing others to learn from them, develop on them and contribute back. I think we can push their thinking on hostnames still further. If we consider Linked Data to be descriptions of things, rather than publishing data, then directories of those things would be useful.

For example, we could give somebody the responsibility of publishing a list of all schools in the UK at schools.example.gov and that would be one part of the puzzle. A different group may have the responsibility of publishing the list of all universities and yet another the list of all companies at companies.example.gov.

Of course, we would expect all of these to interlink, patents.example.gov would have links to companies.example.gov and universities.example.gov to document the ownership of patents. We’d expect to see links in schools.example.gov to inspections.schools.example.gov and so on.

Notice that I’ve dropped the word data from those examples, as much of this is about making machine (and human) readable descriptions of things. It’s only because we describe lots of things at the same time and describe them uniformly we call it data.

I’d still expect health.example.gov to appear as well, but the responsibility would be one of aggregating what could be considered health data in order to support querying; it would aggregate doctors.example.gov, hospitals.examples.gov and more. I would expect as many of these aggregates to pop up as are useful.

Of course, in this approach, as in the current data.gov.uk approach, everyone who wants to say something about a particular doctor, school or patent has to be able to get access to that host to say it and, perhaps, conflicting things said by different people get mixed up.

At this point you’re probably thinking well, we might as well just use data.organisation.tld and be done with it then. Unfortunately that simple moves the same design decisions from the hostname to the resource part of the URI, the bit after the hostname. You still have to make decisions and with only one hostname your hosting options are drastically reduced.

Data.gov.uk places the type of thing in the resource part of the URI using what they call concept/reference pairs:

2. Examples of concept/reference pairs:
• road/M5
• school/123
3. The concept/reference construct may be repeated as necessary, for example:
• road/M5/junction/24
• school/123/class/5

I tend to do this slightly differently, using container/reference pairs so I would use “roads” rather than “road” as this lends itself better to then putting listings at those URIs.

The antithesis

We can often learn something by turning an approach on its head. In this case I wonder what would happen if we embraced the idea that many people will have different world-views about the same thing, their own two-penneth so to speak. None of them necessarily authoritative.

In that case we end up with me publishing data on data.my.domain and you publishing data about the same things on data.your.domain. Just as happens all over the web today. If I choose my domains carefully then maybe I can hand bits on as I find someone else to run them better, as above, but always there is more than world view.

There are two common ways to make this work and be interconnected. A common approach is to use owl:sameAs to indicate that data.my.domain/Winston_Churchill and data.your.domain/Winston_Churchill are describing the very same thing. The OWL community is not entirely supportive of that use.

The other approach is to use the annotation pattern and rdfs:seeAlso; in which case documents describing a resource live in many places, but they agree on a single, canonical, URI.

So what would that mean for Sussex?

Well, I’m not sure.

Fortunately, Chris has a limited decision to make right now, choosing a URI, or URIs, for the Mass Observation Archive. It is for this he is considering data.lib.sussex.ac.uk and www.sussex.ac.uk/library/.

Thinking about changing responsibilities over time, I have to say I would choose neither. It is perfectly conceivable that the mass observation may at some time move and not be under the remit of the University of Sussex Library, or even the university.

I would choose a hostname that can travel with the archive wherever it may live. Fortunately it already has one, http://www.massobs.org.uk/. Ideally the catalogue would live it something like catalogue.massobs.org.uk or maybe massobs.org.uk/archive or something like that.

My leaning on this is really because this web of data isn’t something separate from the web of documents, it’s “as well as” and “part of” the web as one whole thing. data.anything makes it somehow different; which in essence it’s not.

Postscript

Oh, on just one more thing…

URI type, for example one of:
• id – Identifier URI
• doc – Document URI, Representation URI
• def – Ontology URI
• set – Set URI

Personally, I really dislike this URI pattern. It leaves the distinguishing piece early in the URI, making it harder to spot the change as the server redirects and harder to select or change when working with the URIs.

I much prefer the pattern

/container/reference to mean the resource
/container/reference.rdf for the rdf/xml
/container/reference.html for the html

and expanding to

/container/reference.json, /container/reference.nt, /container/reference.xml and on and on.

My reasoning is simple, I can copy and paste the document URI from the address bar, paste it to curl on the command line and simply backspace a few to trim off the extension. Also, in the browser or wget, this pattern gives us files named something.html and something.rdf by default. Much easier to work with in most tools.