Marc, RDF and FRBR
I've been playing with some ideas over the past six months on how we can really move bibliographic data forwards into a structure that could have huge benefits.
The impetus to describe some of that for everyone finally came in the form of a conference, with a submission deadline just a few days away. The conference is WWW2008 and the workshop is entitled Linked Data on the Web. There are a whole load of reasons to go - I got to go to last year's and learnt a huge amount, as well as getting to speak about data licensing.
This year I've submitted a substantial paper about the work I've done on finding relationships in MARC data, something I'm already scheduled to present on at Code4Lib late next month. I've had a lot of help thinking about these problems from Nad, and he's helped out enormously in getting the paper finished. Thanks also have to go to Danny, who has been a huge help in understanding how to think about RDF and how to describe it - he wrote chunks of the paper too.
Please, grab the paper, have a read and let me know what you think.
Semantic Marc, MARC21 and The Semantic Web. (PDF, 440Kb)
Update: Thanks to Damian for pointing out the error in the example turtle - don't know how we missed that!
Still working my way through, but it looks interesting (maybe I'll finally understand MARC21). One major (yet trivial) issue needs fixing in your syntax examples: <marc21:recordStatus>. You don't want the > < there.
It's an interesting paper and approach. A few small things:
o When matching personal names I believe the subfields needed are abcdq.
o I'd recommend you NACO-normalize the names. That's not quite the same as 'paying no attention to case, whitespace, punctuation...'; see http://outgoing.typepad.com/outgoing/2006/03/naco_normalizat.html
The approach of just doing some normalization on the fields and hoping the resulting strings will match across records works fairly well, but not to the level that people expect from their library systems, and not as well as can be done. With a bit more effort you can put controlled terms in the records and have unambiguous URIs.
Before building WorldCat Identities, the first thing we do is make sure we have all the names we can linked to an appropriate authority. For example, J. K. Rowling's Identity page is: http://worldcat.org/identities/lccn-n97-108433, which should be a fairly stable URI. We do the same thing when linking to the authority file itself: http://errol.oclc.org/laf/n97-108433.html.
Of course for many names such a URI isn't available and we fall back to something similar to what you are doing, although we use the standard NACO normalization rules; anything else risks ambiguity even with perfect MARC-21 records. These strings are useful, but are just not as stable as the LCCNs.
This year we should have the Virtual International Authority File up, and that will be completely open for use, offering another larger set of names of interest to the library community. --Th
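To make the suggestion in the comment above concrete, here is a rough sketch of building a match key from the abcdq subfields of a personal-name field. It is only loosely inspired by NACO normalization and does not implement the real NACO rules (which treat certain characters and the first comma specially); the function name `name_key` and the `(code, value)` input format are assumptions made for illustration.

```python
import re
import unicodedata

# Subfields suggested in the comment for matching personal names.
NAME_SUBFIELDS = frozenset("abcdq")


def name_key(subfields):
    """Build a simplified match key from (code, value) subfield pairs.

    NOTE: this is an approximation for illustration, not the NACO
    normalization rules themselves.
    """
    # Keep only the subfields that identify the person.
    parts = [value for code, value in subfields if code in NAME_SUBFIELDS]
    text = " ".join(parts)
    # Decompose accented characters and drop the combining marks.
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = text.upper()
    # Treat anything that is not a letter, digit, or space as a separator.
    text = re.sub(r"[^A-Z0-9 ]+", " ", text)
    # Collapse runs of whitespace.
    return re.sub(r"\s+", " ", text).strip()


# Two differently punctuated forms of the same heading give the same key;
# the relator term in subfield e is ignored.
key_one = name_key([("a", "Rowling, J. K."), ("e", "author")])
key_two = name_key([("a", "rowling,  j.k.")])
```

As the comment points out, keys like this are useful but less stable than an identifier such as an LCCN, so they are best treated as a fallback when no authority link is available.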
Hi Rob, Sorry - I've been meaning to write this for weeks, but only just got round to it. Overall I think the paper is a good intro to using RDF to present bib data instead of MARC - I'm not sure I have a lot of wisdom to add, but here are my comments:
I'm surprised in section 8 you say that you still get a high degree of uniqueness even when you re-order the characters (NB - just surprised, not claiming you are wrong!). Have you done any work on this for non-Name fields?
In section 9 you mention using FOAF to represent people/organisations. Have you looked at whether FOAF can adequately represent the information from MARC name fields? I'm not incredibly familiar with FOAF, but it seems limited to me in comparison.
In section 10 you mention the approach taken by Thom Hickey et al of using authority information to clean data - I'm not very clear as to why you don't do this? This section loses me a bit. If you are committed to creating the URIs algorithmically then don't we need to see a similar mechanism for creating a URI algorithmically from an authority record, and check that these match up - obviously logically they ought to, but in my (limited) experience UK libraries aren't very uniform in terms of using the LC authority files, and are more interested in internal consistency in author usage rather than an authority file.
I also don't understand when you say "The authority data may also contain additional information about relationships between authors’ names; one example being that Iain Banks also publishes under the name Iain M Banks, another common example being that Mark Twain was the pen name of Samuel Clemens. These relationships are between different resources rather than different URIs representing the same resource, so require a "see also" relationship rather than a "same as" relationship." I would have thought ideally you'd want Mark Twain and Samuel Clemens to point to the same URI? You seem to be suggesting that it is correct for them to point at different URIs - I don't understand why?
In section 11/14 I like the idea of disambiguation along the lines of wikipedia - it seems to work quite well.
Section 13 - I think I like the idea of creating a URI for the work based on author/title information, as it goes along with my feeling that rather than saying explicitly 'these are the same work', it may be better to say 'we will call these the same work if they share x attributes'. However, you don't mention how this extends to manifestations and expressions - presumably the more information you feed to an algorithm creating a URI, the more likely you are to see data variation causing different URIs for things that are actually equivalent (I guess that at the manifestation level things are particularly messy as you are dealing with largely uncontrolled fields where consistency won't have been a key concern during the cataloguing process).
Clearly you are dealing with data that is already in existence - which is something we clearly need to worry about - but I think another question is around how we expect to catalogue in the future. If we catalogue 'into' a FRBR environment that seems to point at different requirements to existing records than if we continue to essentially catalogue at an item level, and then present in a FRBR way.
A thought that has just occurred to me - why do you go back to the author/title when you do the URI for the Work? Why not say that all things that share the same URI for author and title (based on the URIs already created by your algorithm) are the same work - and if you want a URI for the work then create it based on the existing URIs rather than going back to the 'raw' author and title info again?
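The closing suggestion in the comment above, deriving the Work URI from the already-minted author and title URIs instead of the raw strings, could look something like the sketch below. This is not the method from the paper: the base URI `http://example.org/work/`, the example author and title URIs, and the choice of a SHA-1 digest are all assumptions made purely for illustration.

```python
import hashlib

# Hypothetical namespace for minted Work URIs (an assumption, not from the paper).
WORK_BASE = "http://example.org/work/"


def work_uri(author_uri, title_uri):
    """Mint a Work URI from existing author and title URIs.

    Records whose author and title URIs are identical get the same
    Work URI, without re-normalizing the raw author/title strings.
    """
    # A newline separator keeps the two inputs from running together.
    combined = f"{author_uri}\n{title_uri}".encode("utf-8")
    return WORK_BASE + hashlib.sha1(combined).hexdigest()


# Hypothetical URIs standing in for ones an earlier step has already produced.
author = "http://example.org/people/BANKSIAIN"
title = "http://example.org/titles/WASPFACTORY"
uri = work_uri(author, title)
```

One consequence of this design is that any cleanup applied when the author and title URIs were created carries through to the Work URI automatically, which seems to be the point of the question.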
Styles, Ayers, Shabir: Semantic MARC, MARC 21 and the Semantic Web
Last week Talisman Rob Styles posted MARC, RDF and FRBR, two initialisms and an acronym that probably get your heart racing like they do mine. In it, he points to a paper he wrote with fellow Talismen Danny Ayers and Nadeem Shabir: Semantic MARC, MARC...