I _Really_ Don't Know

A low-frequency blog by Rob Styles

Distributed, Linked Data has significant implications for Intellectual Property Rights in Data.

What P2P networks have done for distribution of digital media is phenomenal. It is possible, easy even, to get almost any TV show, movie, track or album you can think of by searching one of the many torrent sites. As fast as the media industry takes down one site through legal action, another appears to take its place.

I don't want to discuss the legal, moral or social implications of this, but rather how the internet changes the nature of our relationship with media - and data. The internet is a great big copying machine, true enough, but it's also a fabric that allows mass co-operation. It's that mass peer-to-peer co-operation that makes so much content available for free: content that is published freely by its creator as well as infringing content.

Sharing of copyrighted content is always likely to be infringing on p2p networks, regardless of any tricks employed, but for data the situation may be different and the Linked Data web has real implications in this space.

Taking the Royal Mail's Postcode Address File as my working example, because it's been in the news recently as a result of the work done by ErnestMarples.com, I'll attempt to show how the Linked Data web changes the nature of data publishing and intellectual property.

First, in case you're not familiar, a quick introduction to Linked Data. In Linked Data we use http web addresses (which we call URIs) not only to refer to documents containing data but also to refer to real-world things and abstract concepts. We then combine those URIs with properties and values to make statements about the things the URIs represent. So, I might say that my postcode is referred to by the URI http://someorg.example.com/myaddress/postcode. Asking for that URI in the browser would then redirect you to a document containing data about my postcode, maybe sending you to http://someorg.example.com/myaddress/postcode.rdf if you asked for data and http://someorg.example.com/myaddress/postcode.html if you asked for a web page (that's called content negotiation). All of that works today and organisations like the UK Government, BBC, New York Times and others are publishing data this way.
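
As a rough illustration of the content negotiation step, here is a minimal Python sketch of how a server might pick the document to redirect to. The function name `negotiate`, the fallback to HTML and the single RDF media type it recognises are assumptions made for illustration, not a description of how any particular publisher does it; in practice the redirect is commonly sent as an HTTP 303 See Other response.

```python
def negotiate(resource_uri, accept_header):
    """Pick the document URL to redirect to, based on the Accept header."""
    preferences = []
    for part in accept_header.split(","):
        fields = [field.strip() for field in part.split(";")]
        media_type = fields[0].lower()
        if not media_type:
            continue
        quality = 1.0  # default weight when no q= parameter is given
        for field in fields[1:]:
            if field.startswith("q="):
                try:
                    quality = float(field[2:])
                except ValueError:
                    quality = 0.0  # treat an unparsable weight as "not wanted"
        preferences.append((media_type, quality))

    # Highest weight first; the sort is stable, so ties keep header order.
    preferences.sort(key=lambda pref: pref[1], reverse=True)

    for media_type, quality in preferences:
        if quality <= 0:
            continue
        if media_type == "application/rdf+xml":
            return resource_uri + ".rdf"   # the client asked for data
        if media_type == "text/html":
            return resource_uri + ".html"  # the client asked for a web page
    return resource_uri + ".html"  # assumed default, e.g. for "*/*"


uri = "http://someorg.example.com/myaddress/postcode"
print(negotiate(uri, "application/rdf+xml"))
print(negotiate(uri, "text/html,application/xhtml+xml;q=0.9"))
```

The same URI therefore names the postcode itself, while the `.rdf` and `.html` addresses name two documents about it.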

Also worth noting is the distinction between Linked Data (the technique described above) and Linked Open Data, the output of the W3C's Linking Open Data project. The distinction matters because I'm talking about how commercially owned and protected databases may be disrupted by Linked Data, whereas Linked Open Data is data that is already published on the web under an Open license.

Now, Royal Mail own the Postcode Address File, and other postcode data such as geo co-ordinates. In the UK this data is covered by Copyright and Database Right (which right covers which bits is a different story), so we assume it is "owned". The database contains more than 28 million addresses, so publishing my own postcode could not be considered an infringement in any meaningful way. Publishing the data for all the addresses within a single postcode would also be unlikely to infringe, as it's such a small fraction of the total data.

So I might publish some data like this (the format is Turtle, a way to write down Linked Data):

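# paf: is a placeholder namespace for illustration; geo: is the W3C WGS84 vocabulary.
@prefix paf: <http://someorg.example.com/schema/paf#> .
@prefix geo: <http://www.w3.org/2003/01/geo/wgs84_pos#> .

<http://someorg.example.com/myaddress/postcode>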
  a paf:Postcode;
  paf:correctForm "B37 7YB";
  paf:normalisedForm "b377yb";
  geo:long -1.717336;
  geo:lat 52.467971;
  paf:ordnanceSurveyCode "SP1930085600";
  paf:gridRefEast 41930;
  paf:gridRefNorth 28560;
  paf:containsAddress <http://someorg.example.com/myaddress/postcode#1>;
  paf:googleMaps <http://maps.google.co.uk/maps?hl=en&source=hp&q=B377YB&ie=UTF8&hq=&hnear=Birmingham,+West+Midlands+B377YB,+United+Kingdom&gl=uk&ei=Zs8HS_KVNNOe4QbIpITTCw&ved=0CAgQ8gEwAA&t=h&z=16>.

<http://someorg.example.com/myaddress/postcode#1>
  a paf:Address;
  paf:organisationName "Talis Group Ltd";
  paf:dependentThoroughfareName "Knight's Court";
  paf:thoroughfareName "Solihull Parkway";
  paf:dependentLocality "Birmingham Business Park";
  paf:postTown "Birmingham";
  paf:postcode <http://someorg.example.com/myaddress/postcode>.

I've probably made some mistakes in terms of the PAF properties as it's a long time since I worked with PAF, but it's clear enough to make my point. So, I publish this file on my own website as a way of describing the office where I work. That's not an infringement of any rights in the data and is a perfectly legitimate thing to do with the address.

As the web of Linked Data takes off, and the same schemas become commonly used for this kind of thing, we start to build a substantial copy of the original database. This time, however, the database is not on a single server as ErnestMarples.com was, but spread across the web of Linked Data. There is no single infringing organisation that can be made to take the data down again. If I were responsible for the revenue brought in from sales of PAF licenses this would be a concern, but not a major one, as the distributed nature means the data can't be queried.

The distributed nature of the web means the web itself can't be queried, but we already know how to address that technically - we build large aggregations of the web, index them and call them search engines. That is also already happening for the Linked Data web. As with the web of documents, some people are attempting to create general purpose search engines over the data and others specialised search engines for specific areas of interest. It's easy to see that areas of value, such as address data, are likely to attract specialist attention.
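
To make the aggregation idea concrete, here is a small Python sketch of an index built from many separately published documents. It assumes the statements have already been fetched and parsed into (subject, property, value) tuples; the function names, and the second publisher's URL and postcode, are made up for illustration.

```python
from collections import defaultdict


def build_index(documents):
    """Merge statements from many source documents into one lookup table.

    `documents` maps each source URL to a list of
    (subject, property, value) tuples found in that document.
    """
    index = defaultdict(set)
    for source_url, statements in documents.items():
        for subject, prop, value in statements:
            # Key on (property, value) so a known value finds its subjects.
            index[(prop, value)].add((subject, source_url))
    return index


def find_subjects(index, prop, value):
    """Return the URIs of all things described with this property value."""
    matches = index.get((prop, value), set())
    return sorted({subject for subject, _source in matches})


# Two independently published documents (the second is a made-up placeholder).
documents = {
    "http://someorg.example.com/myaddress/postcode.rdf": [
        ("http://someorg.example.com/myaddress/postcode", "paf:correctForm", "B37 7YB"),
        ("http://someorg.example.com/myaddress/postcode#1", "paf:postTown", "Birmingham"),
    ],
    "http://otherorg.example.org/office.rdf": [
        ("http://otherorg.example.org/office/postcode", "paf:correctForm", "ZZ99 9ZZ"),
    ],
}

index = build_index(documents)
print(find_subjects(index, "paf:correctForm", "B37 7YB"))
```

No single source document holds more than a handful of statements; only the merged index can answer a query across them.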

Here though, while the individual documents do not infringe, an aggregate of many of them would start to infringe. The defence of crowd-sourcing used in other contexts (such as Open Street Map) does not apply here as the PAF is not factual data - the connection between a postcode and address can only have come from one place, PAF, and is owned by Royal Mail however it got into the database.

So, with the aggregate now infringing it can be taken down through request, negotiation or due process. The obvious answer to that might be for the aggregate to hold the URIs only, not the actual values of the data. This would leave it without a useful search mechanism, however. This could be addressed by having a well-known URI structure as I used in the example data. We might have

@prefix owl: <http://www.w3.org/2002/07/owl#> .

<http://addresssearch.example.net/postcodes/B37_7YB>
  owl:sameAs <http://someorg.example.com/myaddress/postcode> .

This gets around the data issue, but the full list of postcodes itself may be considered infringing and they are clearly visible in the URIs. Taking them out of the URIs would leave no mechanism to go from a known postcode to data about it and addresses within it, the main use case for PAF. It doesn't take long to make the link with other existing technology though, an area where we want to match a short string we know with an original short string, but cannot make the original short string available in clear text... Passwords.

Password storage uses one-way hashes so that the password is not available in its original form once stored, but requests to log in can be matched by putting the attempted password through the same hash. Hashes are commonplace in the P2P world for a variety of functions, so are well-known and could be applied by an aggregator, or co-operative group, to solve this problem.

If I push the correct form of "B37 7YB" through MD5, I get "bdd2a7bf68119d001ebfd7f86c13a4c7", but there is no direct way to get back from that to the original postcode. So a service that uses hashed URIs would not be publishing the postcode list in a directly usable form, but could be searched easily by anyone knowing how the URIs were structured and hashed.

<http://addresssearch.example.net/postcodes/bdd2a7bf68119d001ebfd7f86c13a4c7>
  owl:sameAs <http://someorg.example.com/myaddress/postcode> .
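
The hashed lookup can be sketched in a few lines of Python using the standard hashlib module. This is only an illustration: the normalisation to upper case with a single space, the UTF-8 encoding and the function names are assumptions, and a real service would have to publish exactly how its URIs are formed so that clients can reproduce them.

```python
import hashlib

BASE = "http://addresssearch.example.net/postcodes/"


def correct_form(postcode):
    """Normalise to upper case with single spaces (an assumed convention)."""
    return " ".join(postcode.upper().split())


def plain_uri(postcode):
    """Well-known URI with the postcode visible in the path."""
    return BASE + correct_form(postcode).replace(" ", "_")


def hashed_uri(postcode):
    """Well-known URI with the postcode replaced by its MD5 hex digest."""
    digest = hashlib.md5(correct_form(postcode).encode("utf-8")).hexdigest()
    return BASE + digest


# The aggregator stores only hashed URIs and the sameAs targets they point to.
same_as = {
    hashed_uri("B37 7YB"): "http://someorg.example.com/myaddress/postcode",
}


def lookup(postcode):
    """Return the sameAs target for a postcode the caller already knows."""
    return same_as.get(hashed_uri(postcode))


print(plain_uri("b37 7yb"))
print(lookup("b37  7yb"))
```

Because the set of valid postcodes is small, anyone holding a candidate list could hash every entry and compare, so the hash obscures the list rather than protecting it in a cryptographic sense.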

Of course, a specialist address service, advertising address lookups and making money, could still be considered as infringing by the courts regardless of the technical mechanisms employed, but what of more general aggregations or informal co-operative sites? sameAs, a site for sharing sameAs statements, already provides the infrastructure that would be needed for a service like this. The ease with which sites that do this can be set up and mirrored would make it hard to defend against using the law, in the same way that torrent listing sites are difficult for the film and music industries to stop. Regardless of the technical approach and the degree to which that provides legal and practical defence, this is still publishing data in a way that is against the spirit of Copyright and Database Right.

The situation I describe above is one where many, many organisations and individuals are publishing data in a consistent form. That is likely to happen over the next few years for common data like business addresses and phone numbers, but much less likely for less mainstream data. The situation with addresses is one where it is clear there is a reason to publish your individual data other than to be part of the whole; in more contrived cases, where the only reason to publish is to contribute to a larger aggregate, the notion of fair use for a small amount of the data may not stand up. That is, over the longer term, address data will not be crowd-sourced - people deliberately creating a dataset - but web-sourced - the data will be on the web anyway.

We can see from this small example that the kinds of data that may be vulnerable to distributed publishing in this way are wide-ranging. The Dewey Decimal Classification scheme used by libraries, telephone directories (with lookups in both directions), Gracenote's music database and legal case numbering schemes could all be published this way. The problem, of course, is that the data has to be distributed sufficiently that no individual host can be seen as infringing. For common data this will happen naturally, but the co-ordination overhead for a group trying to do this pro-actively would be significant; though that might be solved easily by someone else thinking about how to do this.

As I see it a small group of unfunded individuals would have difficulty achieving the level of distribution necessary to be defensible. Though could 1% of a database be considered infringing? Could/Would 100 people use their existing domains to publish 1% of the PAF each? Would 200 join in for ½% each? Then, of course, there are the usual problems of trust, timeliness and accuracy associated with non-authoritative publication.

These problems notwithstanding, Linked Data has the potential to provide a global database at web-scale. Ways of querying that web of data will be invented; what I sketch out above is just one very basic approach. The change the web of data brings has huge implications for data owners and intellectual property rights in data.

Comments

Rob Styles

@Robert Richards, thanks for the pointers, I'm not familiar with those cases so will have a read - Could you say what you thought the implications were for what we're talking about? @Jonathan you're right, there is a degree of speculation in my post about what would or would not be considered infringing. Without precedent it's difficult to say, but I think it's worth exploring the notion that Linked Data changes our perception of what 'publishing' data means - as a result of being able to rejoin all the parts to form a whole.

Charles Cox

there are so many intellectual property and copyright violations these days

webtrendmap

Distributed, Linked Data has significant implications for Intellectual Property Rights in Data. …: http://bit.ly/5zu0om/ (via @mmmmmrob)

This comment was originally posted on Twitter

Rob Styles

@Eric, I'm familiar with the absence of database right in the US - you know it's reciprocal right? US databases don't qualify for protection here ;-) PAF is more than just a database, however, it's an invented numbering scheme rather than factual data. That means it qualifies for Copyright protection I suspect (just as Dewey does) so would still be subject to Copyright in the US. Though the jurisdictional differences would certainly make it harder to get it taken down.

Eric Hellman

You're probably thinking of how OCLC sued the Library Hotel over Dewey- that was a trademark infringement suit, not a copyright suit. If the Hotel had not used the name "Dewey", they couldn't have been sued. Numbering systems have never, to my knowledge, been found to be copyrightable, per se, in the US. In the relevant case, a phone company's telephone directory was found not to be copyrightable. Do you know of some precedent to establish that a "US database" can't be protected by copyright in the UK if it is also published in the UK? I was not aware of that.

futurescape

Distributed Linked Data http://bit.ly/4ysoFg via @Annemcx -

This comment was originally posted on Twitter

rantersparadise

RT @webtrendmap:Distributed,Linked Data has significant implications for Intellectual Property Rights http://bit.ly/5zu0om/ (via @mmmmmrob)

This comment was originally posted on Twitter

Rob Styles

@Eric, The US case you're referring to is Feist Publications vs Rural Telephone Service. http://en.wikipedia.org/wiki/Feist_Publications_v._Rural_Telephone_Service

That case settled several things. Firstly that the effort required to produce a database did not give it protection (no sweat-of-the-brow protection) and secondly that the intention to be complete made the database unoriginal in terms of Copyright as it had no editorial selection, only a mechanical process to collate it.

I wasn't really thinking much about the OCLC vs The Library Hotel Dewey case, you're right that that was a trademark case. OCLC claims Copyright over the labeling of the numbers, but not the numbering scheme itself. I can't find any case law that upholds or rejects Copyright protection of a numbering scheme; anyone else?

As to the non-US nature, wikipedia cites the bill itself as saying that a database qualifies for protection if the producer is an individual living in the EU or a company registered in the EU or primarily doing business here. http://en.wikipedia.org/wiki/Database_Directive

The big question remains in my mind (as Ian Davis put it) - Can a thousand non-infringing releases be used as a whole?

gluejar

Post by @mmmmmrob http://bit.ly/5zu0om/ has me working on a new post about effect of database copyrights.

This comment was originally posted on Twitter

richards1000

Fine post by @mmmmmrob about distributed retrieval using #linkeddata , with implications for legal information retrieval http://j.mp/82p6e7

This comment was originally posted on Twitter

hochstenbach

Linked Data and IR http://bit.ly/78JqEV by @mmmmmrob: provide a schema for a IR-protected dataset, the crowd+aggregators will set it free

This comment was originally posted on Twitter

rdmpage

RT @mmmmmrob Distributed, Linked Data has significant implications for Intellectual Property Rights in Data. http://bit.ly/5zu0om/

This comment was originally posted on Twitter

Jonathan Rochkind

"PAF is more than just a database, however, it’s an invented numbering scheme rather than factual data. That means it qualifies for Copyright protection I suspect (just as Dewey does) so would still be subject to Copyright in the US." Depends on exactly how "creative" or "original" the numbering scheme is. For instance, in the famous case that firmly expanded that "facts" (and thus databases) are not copyrightable in the US, Lexis's page numbers were considered not copyrightable, as they were just, well, page numbers. Incremental integers assigned one after another to pages. No creativity or originality whatsoever. Now, if a system has _some_ creativity or originality, does it have enough to be copyrightable? A judge would decide. And then it would be appealed and another judge would decide, who knows. [But, incidentally, while I'm not a lawyer, I find OCLC numbers, incremental integers assigned in order of receipt of a record by OCLC, to be _quite_ analagous to page numbers, with the obvious implications. ]

richards1000

“Legal case numbering schemes could be published this way” Distributed Linked Data has implications for IP Rights in Data http://j.mp/7PYkl3

This comment was originally posted on Twitter

Robert Richards

See American Dental Association v. Delta Dental Plans Association, 126 F.3d 977 (7th Cir. 1997), http://j.mp/7aiNhB ; CCC Information Services v. Maclean Hunter Market Reports, Inc., 44 F.3d 61 (2d Cir. 1994), http://j.mp/64SVUx .

gluejar

In dialog with Rob Styles on #linkeddata implications of US-Europe database #copyright mismatch http://bit.ly/5zu0om/

This comment was originally posted on Twitter

Eric Hellman

Thanks to Robert Richards for the references. To summarize, the first precedent establishes the applicability of copyright to taxonomies, based on the creative effort their authorship requires. But PAF is not a taxonomy. The second precedent establishes that it is possible for a table of numbers to be a protectable expression, if they represent the result of creative effort. But PAF isn't that either. The commentary clearly indicates that a numbering system cannot be protected by copyright, even if it is original and non-obvious- that would be the province of patent law.

Rob Styles

Which would leave only Database Right within Europe, but with some doubt over the applicability of that due to the date the PAF was first created. It is also not reciprocated overseas, leaving the possibility of a full copy of PAF being hosted by someone in the US perhaps. Interesting, but not the focus of what I was trying to convey - that the Linked Data web has the potential to be a distributed database; in which case an entire database may be published and queryable without any infringement taking place.

Eric Hellman

Great write-up. I've been thinking along similar lines, but in some different contexts. It's worth noting that the database right, i.e. the ability to copyright collections of facts, does not exist in the US. So a USA-based service that reconstituted the PAF database using the linked data techniques you describe ought to be immune from take-down, no matter how much of the database it accumulates. So ship us the data, we can let freedom ring!

soslab

Distributed, Linked Data has significant implications for … http://bit.ly/5j6jAe

This comment was originally posted on Twitter

infopeep

Styles, Rob: Distributed, Linked Data has significant implications for Intellectual Property Rights in Data. http://bit.ly/5j6jAe

This comment was originally posted on Twitter

Eric Hellman

I've written a follow-up to this post focusing on the comparison between US and UK postcode data availability.