I _Really_ Don't Know

A low-frequency blog by Rob Styles

Reification, Triples, Quads and not getting it...

I've been working with RDF for almost 3 years now. There's not much evidence of that here and I was recently challenged on why that is.

In large part it's because I don't get it. There are a lot of things I'm still struggling with in terms of how to think about solutions when using RDF and how best to work with it. Sure, I can write SPARQL with patterns several levels deep. Sure, I can work with Turtle and RDF/XML in several programming languages (Java, XSLT, PHP and sed of course). I think I even understand how to think in an open-world way.

But one big thing has bugged the hell out of me for ages and ages...

I WANT QUADS

At least, I thought I did. And I thought I was alone, but then I got this in an email from Alan Dix:

One of the LBi attendees mentioned a community site they had designed for a client that allowed users to create linkages between things on the site (e.g. song/artists) ... and then annotate the links. This led to short discussion (on one of my old hobby horses) on the way RDF privileges nodes over relationships because statements of triples are not labelled (do not have URIs). While the system described would have required everything to have been reified if done using RDF technology.

This sums up one of the things I've been struggling with so much - that there is no way to refer to the arc between two nodes. When we describe a node we use an instance URI, we say

<http://example.com/foo> a <http://example.com/schema#thing>

but standard practice when specifying predicates is simply to use the predicate itself. Here "a" is just shorthand for rdf:type, so the statement written out in full is:

<http://.../foo> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://.../schema#thing> .

This means that while all 'things' have unique URIs, all type relationships use the same URI, meaning you can't refer to the instance of a relationship directly. A URI identifying the triple would act as a surrogate, allowing you to say "The predicate on statement 97824". This is appealing because it could also act as a surrogate for the object, where the object is a literal.

I was thinking about a problem involving incrementing a value, where I was thinking in a way that led me to want an update facility like "Increment the object of statement 87642".

Now that was just plain wrong-thinking! A statement only has identity by virtue of what it says, unlike a row in an RDBMS table, which has identity because of its position in the table. That is, saying "increment field 3 of row 87642" makes sense, but saying "Increment the object of statement 87642" does not. It doesn't, because as soon as the object is incremented it is a different statement. So, having triple identity to allow modification of the predicate or the object is not consistent with the way RDF is.
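A minimal sketch of that point in Python, assuming a toy model where a graph is just a set of (subject, predicate, object) tuples. The URIs, the count predicate and the increment_object helper are all made up for illustration; no real store works exactly like this.

```python
# Toy model: a graph is a set of (subject, predicate, object) tuples.
# The URIs and the "count" predicate are hypothetical.
SUBJECT = "http://example.com/foo"
PREDICATE = "http://example.com/schema#count"

graph = {(SUBJECT, PREDICATE, 3)}

def increment_object(graph, subject, predicate):
    """'Increment' a numeric object by swapping one triple for another."""
    for triple in list(graph):        # iterate over a copy while mutating
        s, p, o = triple
        if s == subject and p == predicate:
            graph.remove(triple)      # retract the old statement
            graph.add((s, p, o + 1))  # assert a new, different statement

old = (SUBJECT, PREDICATE, 3)
new = (SUBJECT, PREDICATE, 4)
increment_object(graph, SUBJECT, PREDICATE)

print(old in graph)  # False
print(new in graph)  # True
```

There is no "statement 87642" left to point at afterwards: the old triple is gone and a different one has taken its place.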

I was thinking about a problem involving how many times a statement had been made. So, imagine a very simple tagging statement like:

<http://.../something> tags:taggedWith "Interesting" .

I wanted to know how many times that statement had been made, so that with tagging it would give you relative sizes for a tag cloud, for example.
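To see why the count is not simply there for the asking, here is a rough sketch using the same toy idea of a graph as a set of tuples. The example.com URI stands in for the elided one above and is purely an assumption.

```python
from collections import Counter

# Hypothetical subject URI; the predicate and tag value come from the example.
tagging = ("http://example.com/something", "tags:taggedWith", "Interesting")

# Three people make exactly the same tagging statement.
assertions = [tagging, tagging, tagging]

# A set-based graph only records what is said, so duplicates collapse.
graph = set(assertions)

# Counting needs something outside the graph, here a plain multiset.
counts = Counter(assertions)

print(len(graph))       # 1
print(counts[tagging])  # 3
```

The graph knows the statement was made, not how often, so the count has to live somewhere that can talk about the statement as a whole.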

This is a desire for a way to refer to the statement as a whole, rather than my previous wrong-thinking which was a desire to address the parts of a statement. Other common problems that I've come across discussing this are around provenance or audit - who said what, when; how did that statement come to be.

Whenever I tried to discuss this I would get a blanket "REIFICATION" response. I'd read the reification spec and re-read it, and it took me ages to get why I kept getting pointed that way.

If a triple only has identity by virtue of what it says, and giving it identity other than that leads to the kind of wrong-thinking I described earlier, then the only way to identify a statement is by virtue of what it says. That's all reification is.

So, if I want to know about the tagging statement from earlier, I can ask:

DESCRIBE ?statement WHERE {
?statement a rdf:Statement .
?statement rdf:subject <http://.../something> .
?statement rdf:predicate tags:taggedWith .
?statement rdf:object "Interesting" .
}

This allows us, simply, to identify a statement purely on the basis of what it says rather than any notion of identity other than that.
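The same lookup can be sketched in Python over the toy set-of-tuples model. The rdf:Statement, rdf:subject, rdf:predicate and rdf:object terms are the standard reification vocabulary; the reify and find_statements helpers, the example.com URIs, the _:stmt1 node label and the assertedCount property are all hypothetical.

```python
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

def reify(node, triple):
    """Return the four triples that describe `triple` as an rdf:Statement."""
    s, p, o = triple
    return {
        (node, RDF + "type", RDF + "Statement"),
        (node, RDF + "subject", s),
        (node, RDF + "predicate", p),
        (node, RDF + "object", o),
    }

def find_statements(graph, triple):
    """Find statement nodes purely by what the statement says."""
    s, p, o = triple
    candidates = {
        n for (n, pred, obj) in graph
        if pred == RDF + "type" and obj == RDF + "Statement"
    }
    return {
        n for n in candidates
        if (n, RDF + "subject", s) in graph
        and (n, RDF + "predicate", p) in graph
        and (n, RDF + "object", o) in graph
    }

tagging = ("http://example.com/something", "tags:taggedWith", "Interesting")
graph = reify("_:stmt1", tagging)

# Anything we want to say about the statement hangs off the statement node.
graph.add(("_:stmt1", "http://example.com/schema#assertedCount", 3))

print(find_statements(graph, tagging))  # {'_:stmt1'}
```

Note that the sketch never adds the tagging triple itself to the graph: describing a statement and asserting it are separate things.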

So the conclusion is, I'm wrong to want a URI for each triple and I need to fix my wrong thinking and embrace reification; just as soon as stores have real good support for it ;-)

Comments

Jonathan Rochkind

So, still trying to get into RDF and understand this kind of stuff, but do I read you right to say that you no longer currently wish for quads, you realize that everything CAN be done with triples, at least with reification?

bryan thompson

No, you do want quads. If you use a BNode for the context position then the store will assign a unique identifier for the statement. If you only do this for the "distinct triples" (s,p,o) then you have unique identifiers for your statements in the triple store and you can go ahead and make assertions about those statements. Reification is wrong here from two perspectives. First, reification is a statement model and says nothing about whether or not the statement itself is asserted. Second, while a lot of people have handled this problem using reification, it blows up the size of the database considerably since it adds 4 assertions for each original triple. Use quads. Use a bnode for the context position. You'll be fine. Caveat: SPARQL allows the quad position to be interpreted in a variety of manners and the semantics basically depend on your application's commitment to how it is going to manage the context position. Different applications can do different things and this can lead to confusion when you try to combine the data together.

Peter Murray

I'm with Jonathan -- if you've come to a great realization, I don't get it yet (why quads aren't needed) and would appreciate some further pointers.

Sam Tunnicliffe

WRT the tag cloud example, and aside from any arguments over the semantics of duplicate statements, many stores do not support them*, so the answer to the SPARQL query you use here will be the same no matter how many times you assert or reassert the statement <http://..../something> tags:taggedWith "Interesting". So I guess what you really want is for stores to internally reify statements so that one assertion of s p o can be differentiated from another.

In addition to this, large swathes of the RDF community consider reification in RDF to be fundamentally broken, due to the inability to differentiate between quoted and asserted statements. An alternative approach could be to make the modelling slightly more complex, along the lines of:

<http://.../tags/123> a <http://.../schema/Tag> .
<http://.../tags/123> tags:tagValue "Interesting" .
<http://.../taggings/abc> a <http://.../schema/Tagging> ;
tags:tag <http://.../tags/123> ;
tags:thingBeingTagged <http://.../something> .

Incrementing the tag count when someone else creates the same tag is a case of adding a few statements:

<http://.../taggings/def> a <http://.../schema/Tagging> ;
tags:tag <http://.../tags/123> ;
tags:thingBeingTagged <http://.../something> .

If you add graph support on top of this, I guess you could partition the individual taggings into distinct graphs and get some rudimentary provenance information (or you could add properties to the Tagging instances). I can't help but feel that I'm missing something here, so if we're at cross purposes, sorry.

*I'm sure there's better documentation about this, but http://lists.w3.org/Archives/Public/semantic-web/2005Oct/0193.html was all I could find right off.

Cheers, Sam

Rob Styles

I suspect my terminology has confused things - when many people talk about quads they are talking about the fourth aspect being the graph that a triple belongs to. I was saying that I have reached the conclusion that all the ways in which I was thinking about triples having identity, other than by virtue of what they state, were wrong. That's not to say others don't have good arguments in support - I just don't have any, anymore. As far as the rest of the comments go... ? eh? Took me several minutes to parse most of those sentences ;-)

Ivan Mikhailov

I don't like reification; it adds too much data for too little result. Reification can be replaced with proper use of graphs in many cases. Say, the best way of tracking a multiset of tags is to keep the personal tags of every origin in a separate graph and use a quad store that can efficiently query all graphs in a single triple pattern.

Nik

Informative. I want to read more. Will it be continued?