

Futures and Options: After MARC

Page history last edited by Karen Coyle 8 years ago

You may recall that the JSC developed a 3-scenario model for bibliographic records. The scenarios are couched in terms of database design, but inherent in these descriptions is the design of the data itself (although not necessarily the transfer formats). The three RDA scenarios (which should be read from the bottom up) are:

  1. Relational database model -- In this model, data is stored as separate entities, presumably following the entities defined in FRBR. Each entity has a defined set of data elements and the bibliographic description is spread across these entities which are then linked together using FRBR-like relationships.
  2. Linked authority files -- The database has bibliographic records and has authority records, and there are machine-actionable links between them. These links should allow certain strings, like name headings, to be stored only once, and should reflect changes to the authority file in the related bibliographic records.

  3. Flat file model -- The database has bibliographic records and it has authority records, but there is no machine-actionable linking between the two. This is the design used by some library systems, but it is also a description of the situation that existed with the card catalog.

These move from 3, being the least desirable, to 1, being the intended format of RDA data. I imagine that the JSC may not precisely subscribe to these descriptions today because, in the few years since the document was created, the technology environment has changed, and linked data now appears to be the goal. The models are still interesting in the way that they show progress. The "relational" intention behind scenario 1 could be achieved with today's relational database technology, but it so happens, fortunately, that this model is also reasonably compatible with a linked data model. It is also necessary for the implementation of RDA.

In the question of how we might make the transition from MARC21 to linked data, I too see a set of three possible scenarios. I will list these, like the RDA scenarios, with #3 as the least sophisticated and least desirable result, and #1 as the true goal.

#1 Using linked data as our native format

#2 Extraction of linked data from MARC records

#3 Serialization of MARC into RDF


#3 Serialization of MARC into RDF


Serialization takes the same data and the same data elements and re-writes them using a different record technology. MARCXML is a serialization of the ISO 2709 MARC format. It makes no change to the data or the coding of the data, but it puts MARC into a format that some programmers may prefer because it is more compatible with their tools. (The one change that it does make is that it does not have the record and field size limits of MARC21, and that is the only thing that could potentially make some data created in MARCXML not round-trippable with MARC21.) The question of course comes up: can't we simply re-serialize MARC in RDF? That would allow us to keep our current systems, continue cataloging the way we do today, and yet participate in the linked data world. Unfortunately, such a serialization will not get us far in achieving the goals of linked data.


Linked data expresses metadata as three-part statements. In each statement there is a subject (the thing being described), a predicate (which is often called a "verb" but which corresponds to a data element in pre-LD metadata), and an object, which is the information about the subject. So a linked data statement could be:

(this book) (has as author) (Herman Melville)

(subject) (predicate) (object)

In MARC we have tags, indicators and subfields. In turning this into linked data, the subject would be an identifier for the thing being described, such as the book. Since MARC records have record IDs, logically the record ID could be used as the identifier. The equivalent of "has as author" would be the 100 field. However, the 100 field has numerous subfields, each of which would need to be represented in a serialization if we don't want to lose most of the meaning of the information in the MARC record. So let's posit a serialization of a typical 100 (author) field and a 260 (publication) field:


100     1_ $a Tolkien, J. R. R. $q (John Ronald Reuel), $d 1892-1973

260     __ $a Boston, $b Houghton Mifflin $c [c1966]


and let's assume that the record id is http://lccn.loc.gov/67029221, and that we've made up predicates for each of the MARC field/subfield combinations. Our linked data version of these fields would be:


(http://lccn.loc.gov/67029221) (marc:100a) ("Tolkien, J. R. R.")

(http://lccn.loc.gov/67029221) (marc:100q) ("(John Ronald Reuel)")

(http://lccn.loc.gov/67029221) (marc:100d) ("1892-1973")

(http://lccn.loc.gov/67029221) (marc:260a) ("Boston")

(http://lccn.loc.gov/67029221) (marc:260b) ("Houghton Mifflin")

(http://lccn.loc.gov/67029221) (marc:260c) ("[c1966]")


This looks good... but it isn't. First, all of the objects will be text strings, and text strings do not link; in linked data, only identifiers link. One can write programs that act on text, but linking requires identifiers. This means that none of our objects will link, not even to other serialized MARC data. Next, the elements represented by the MARC field/subfield combinations are quite specific to the MARC format. No other community uses the data element 100 $a for a creator. Linked data allows you to code identified things and elements as the same as or similar to elements created by others. The complex subfielding of MARC makes this tricky, however: if one states that the MARC 100 $a can be considered exchangeable with Dublin Core's creator, what then should be done with the other subfields in the 100 field, such as the person's birth and death dates? Should a 245 $b subtitle be related to dc:title? What use could you make of the 250 $b - Remainder of edition statement?
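The mechanical serialization described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not an existing tool: the `serialize_field` and `is_identifier` functions, the `marc:` prefix, and the use of plain tuples for statements are all invented for the example, and trailing punctuation is stripped only so that the output matches the statements listed earlier.

```python
# Hypothetical sketch: mechanically turn MARC subfields into three-part statements.
# The marc: predicates are made up for illustration; they are not a published vocabulary.

RECORD_ID = "http://lccn.loc.gov/67029221"

# (tag, [(subfield code, value), ...]) for the two example fields
FIELDS = [
    ("100", [("a", "Tolkien, J. R. R."),
             ("q", "(John Ronald Reuel),"),
             ("d", "1892-1973")]),
    ("260", [("a", "Boston,"),
             ("b", "Houghton Mifflin"),
             ("c", "[c1966]")]),
]


def serialize_field(record_id, tag, subfields):
    """Return one (subject, predicate, object) tuple per subfield."""
    statements = []
    for code, value in subfields:
        predicate = f"marc:{tag}{code}"
        # Drop trailing separator commas and spaces; the value stays a text string.
        statements.append((record_id, predicate, value.rstrip(" ,")))
    return statements


def is_identifier(value):
    """Crude assumption: only http(s) URIs count as identifiers."""
    return value.startswith(("http://", "https://"))


triples = [t for tag, subs in FIELDS
           for t in serialize_field(RECORD_ID, tag, subs)]

for s, p, o in triples:
    print(f'({s}) ({p}) ("{o}")')

# Every object is a text string, so there is nothing to link to.
linkable = [t for t in triples if is_identifier(t[2])]
print("linkable objects:", len(linkable))
```

Every statement shares the record ID as its subject and carries a string as its object, so the list of linkable objects comes out empty, which is the problem this section describes.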


Those examples illustrate another point: the MARC subfields are not stand-alone data elements. Many of them represent parts of something, not a whole something. Furthermore, the subject of the 100 $q and $d is not the book identified by the LCCN URI; the subject of the $d is J. R. R. Tolkien the person, and the subject of the $q is a portion of the text string of the 100 $a. Neither has a direct relationship to the book. The third statement above could be read perhaps as meaning "this book has author dates 1892-1973." But what we probably really want to say is "J. R. R. Tolkien is a person with birth and death dates 1892-1973." Yet in some fields, like the 260, each subfield does make a statement about the subject: the book was published in Boston, the book was published by Houghton Mifflin, the book was published with date [c1966].


All of this is to say that we would get very little linking out of a serialized MARC record. The next step, extraction, solves some of these problems.


Advantages Disadvantages


#2 Extraction


Rather than the mechanical application of an RDF-like serialization to the MARC data, an extraction requires some transformation of the data elements. We already have a number of examples of transformation of MARC into RDF -- however, they are each quite different from the others, and none has been determined to be a "best practice" or a "better practice" than any of the others. Making such a determination is not easy; it requires the kind of decisions that are always inherent in the development of standards. The key step in the creation of an extraction from MARC is identifying the "things" of the bibliographic data encoded in the MARC format. The next step is deciding how to represent those things in linked data, and in particular to determine identifiers for descriptive properties and for data values that will provide maximum linking in the wider world of linked data.
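To illustrate what identifying the "things" might look like, the following Python sketch splits the example 100 and 260 data into two described entities, a book and a person, and links them by identifier. It is not one of the existing transformations mentioned above. Every `ex:` property name, the `http://example.org/` person URI, and the `extract` function are hypothetical placeholders; choosing real identifiers and properties is exactly the standards decision this section describes. Splitting the dates on the hyphen is a simplifying assumption that would not hold for all MARC date strings.

```python
# Hypothetical sketch of an extraction: the book and the person become separate
# subjects, and the author statement points at an identifier, not a text string.

BOOK = "http://lccn.loc.gov/67029221"
PERSON = "http://example.org/person/tolkien"  # placeholder URI, not a real identifier

F100 = {"a": "Tolkien, J. R. R.", "q": "(John Ronald Reuel),", "d": "1892-1973"}
F260 = {"a": "Boston,", "b": "Houghton Mifflin", "c": "[c1966]"}


def extract(book_uri, person_uri, f100, f260):
    """Return statements with each subfield attached to the thing it describes."""
    # Assumes a simple "birth-death" pattern in the $d subfield.
    birth, _, death = f100["d"].partition("-")
    return [
        # The object here is an identifier, so this statement can link.
        (book_uri, "ex:author", person_uri),
        # Name and dates describe the person, not the book.
        (person_uri, "ex:name", f100["a"]),
        (person_uri, "ex:fullerFormOfName", f100["q"].strip("(), ")),
        (person_uri, "ex:birthYear", birth),
        (person_uri, "ex:deathYear", death),
        # Publication subfields do describe the book.
        (book_uri, "ex:placeOfPublication", f260["a"].rstrip(" ,")),
        (book_uri, "ex:publisher", f260["b"]),
        (book_uri, "ex:publicationDate", f260["c"]),
    ]


statements = extract(BOOK, PERSON, F100, F260)
for s, p, o in statements:
    print(s, p, o)
```

In this sketch the publisher and place are still text strings; a fuller extraction would presumably look for identifiers for those values as well.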


#1 Natively linked data


While some of the decisions made during the development of an extraction from MARC may produce things that can be used in native linked data, in this scenario we are finally freed from the constraints of MARC.  

