MARC elements

Saved by Karen Coyle
on January 6, 2011 at 3:36:53 pm
 


Any future metadata format for library bibliographic information must be able to address the legacy data that is stored in MARC format.  A first step in addressing that data is to analyze MARC into a set of data elements, independent of the ISO 2709 record structure. As much as possible, the elements must be atomistic and clearly defined. This is in contrast to the MARC record structure, which bundles data elements into complex fields, and which has areas that are more in the nature of marked-up text than data elements in the usual sense.

 

The set of atomistic elements will not provide the same overall structure as the MARC record; that should be achieved by an actual application. Throughout the process of developing a MARC21 element description, it will be important to keep in mind the difference between the declaration of properties and vocabularies and the actual creation of instance data. Properties and vocabularies should be tested to make sure that they do function for instance data, but they do not have to replicate an actual MARC21 record in structure.

 

The Field Types in the MARC Record

 

There are three general field types in the MARC record:

  1. Fixed fields (00X)
  2. Number and code fields (0XX)
  3. Variable bibliographic fields (100-899)

 

The fixed fields meet the definition of "data elements" in most cases: they are coded values with defined value lists. (There are a few structured values, e.g. for dates.) Each field is of fixed length, and the data elements are assigned to fixed positions within the fields. This should be the easiest area to translate into a set of discrete data elements.
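
Because each element sits at a fixed position, pulling one out of a fixed field amounts to slicing the field string. The following is a minimal Python sketch of that idea. The helper name `fixed_element` and the sample 007 value are invented for illustration and do not come from a real record.

```python
from typing import Optional


def fixed_element(field: str, start: int, end: Optional[int] = None) -> str:
    """Return the characters at 0-based positions start..end, inclusive."""
    if end is None:
        end = start
    return field[start:end + 1]


# Invented 007 value for a microform; not taken from a real record.
sample_007 = "hd afb024baca"

material_type = fixed_element(sample_007, 0)    # position 00
single_byte = fixed_element(sample_007, 5)      # position 05
three_bytes = fixed_element(sample_007, 6, 8)   # positions 06-08
```

The slice only gives the coded value; the meaning of the code still has to come from the value list defined for that position.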

 

The number and code fields are variable-length fields with variable-length subfields. They contain data that is not considered to be very human-friendly, such as identifiers, mathematical data (mainly geographic), and classification numbers. These fields are mostly textual in nature, however, and some key data elements, like ISBNs, are in subfields that can also contain free-text notes:

 

  020 __ |a 046503912X (hardback)

 

Variable bibliographic fields make up the majority of the MARC format. These are for the most part the fields that display to users and that are searchable: authors, titles, information about publishing details, notes, and subjects. All of these are textual in nature. Most contain both the full punctuation that would normally be included in text as well as punctuation required by the International Standard Bibliographic Description (ISBD). For example, in the following, the use of " = " is the ISBD indication that the second set of text is a translation of the first into another language:

 

245 14 |a Der Ring des Nibelungen. |p Das Rheingold = |b The ring of the Nibelung. The Rhinegold /

 

Many of the variable fields that represent agents (persons, corporate bodies, events) and subjects have a corresponding record in an authority file. The authority file is a form of controlled list and to some extent the agents and subjects could be represented by the identifier for the entry in the appropriate authoritative list.

 

Complicating the analysis of variable fields is an internal structure of links between fields, and qualifiers that can significantly change the meaning of fields.

 

Fixed Fields

 

As elements, the fixed fields are relatively simple, although they also pose some problems for analysis. Most of the fixed field elements take controlled lists ("value vocabularies" in semantic web terminology), and the list implicitly has the same name as the data element. Many of the lists are specific to a particular physical resource type which is coded in the record or in the field.

 

List of fixed fields (with MARC identifier)

List of 007-008 fields and their values (with identifiers for fields and each code)

 

Identification of MARC Origin of Data Elements

 

To identify the MARC data element, we can create an identifier made up of the tag, the material designation, and the position or position range. This identification gives us a link back to the MARC field for mapping:

 

reduction ratio = 007microform05

projection = 008map22-23

 

The 006 terms are all repetitions of an 008 term. In that case, both identifiers could be associated with the same data element (rather than creating a duplicate data element for the 006). The positions of the elements in the two fields will be different. (Note that if round-trip mapping between MARC21 and RDA is desired, applications will need to store the MARC21 element of origin in the resulting data.)

 

Identification of the vocabulary terms is similar, but adds the term code to the string:

 

   reduction ratio: low reduction = 007microform05a
   projection: sinusoidal = 008map22-23bg
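
The identifier patterns above are regular enough to be generated. This is a small Python sketch, assuming two-digit zero-padded positions and the hypothetical function names `element_id` and `term_id`:

```python
from typing import Optional


def element_id(tag: str, material: str, start: int, end: Optional[int] = None) -> str:
    """Build tag + material designation + position (or position range)."""
    position = f"{start:02d}" if end is None else f"{start:02d}-{end:02d}"
    return f"{tag}{material}{position}"


def term_id(tag: str, material: str, start: int, code: str,
            end: Optional[int] = None) -> str:
    """Build the identifier for one vocabulary term: element identifier + term code."""
    return element_id(tag, material, start, end) + code


reduction_ratio = element_id("007", "microform", 5)
projection = element_id("008", "map", 22, 23)
low_reduction = term_id("007", "microform", 5, "a")
```

Because the term identifier is only the element identifier with the code appended, the element can always be recovered from the term by mapping software that knows the pattern.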

 

Property Names

 

Some MARC fixed elements have names that stand alone:

  • braille music format
  • azimuth

 

Others need something added for them to make sense:

  • "data type" in Remote sensing, probably needs to be named "Remote sensing data type"
  • "tape width" in Sound recording, probably needs to be named "Sound recording tape width"

 

There are terms that are used for more than one material type and with different values:

  • "dimensions" - used with video, sound recording, microform, etc.
  • "color" - used with motion picture, microform, globe, map

 

Each of these needs to have the material type in the name:

  • dimensions of video recording, or dimensions (video recording)
  • color of map, or color (map)

 

There are a few data elements, mostly in the 008, that are repeated exactly for different material types:

  • Government document
  • Audience

 

These could be a single data element, with links to multiple material-specific MARC 008 elements.
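
One way to picture a single data element with several MARC origins is a record that carries a list of identifiers, plus a reverse index for mapping. This is a sketch only: the class name `DataElement` is hypothetical, and the byte positions in the identifiers are illustrative and should be checked against the format documentation.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class DataElement:
    name: str
    marc_origins: List[str] = field(default_factory=list)


# Illustrative origins only; the positions are assumptions, not verified here.
government_publication = DataElement(
    name="Government publication",
    marc_origins=["008book28", "008map28", "006book11", "006map11"],
)


def origin_index(elements: List[DataElement]) -> Dict[str, DataElement]:
    """Map each MARC identifier back to the single element it belongs to."""
    return {origin: element
            for element in elements
            for origin in element.marc_origins}


index = origin_index([government_publication])
```

Looking up any of the listed identifiers in `index` returns the same element object, which is the behavior wanted when different material types share one definition.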

 

We do not want to include the MARC tag in the property names, partly because of the overlap between 008 and 006, but also because the names need to make sense to people who do not think in terms of MARC tags.

 

URIs

 

The URIs could all follow the pattern:

 

http://marc21.info/element/  -- for "properties" (this could also be "/property/"). The use of "element" follows the pattern that is used in the registry of RDA.

http://marc21.info/vocab/ -- for value vocabularies

For the fixed fields the name of the property and the name of the vocabulary list are the same.

 

There are three options for the final part of each URI:

  • Opaque names -- simply number the elements and vocabulary terms, "1001" "1002" etc. 
  • Human-friendly name -- "azimuth" "dimensions(map)"
  • A MARC21 identifier, e.g. following the pattern: tag+indicator, or tag+subfield
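
To make the three options concrete, here is a short Python sketch that builds one URI of each kind. The function names are hypothetical, and the slug rule used for the human-friendly form (lowercase, hyphens in place of other characters) is my assumption, not an agreed convention.

```python
import re

ELEMENT_BASE = "http://marc21.info/element/"


def opaque_uri(number: int) -> str:
    """Option 1: an opaque number."""
    return f"{ELEMENT_BASE}{number}"


def friendly_uri(name: str) -> str:
    """Option 2: a human-friendly name, reduced to a URI-safe slug."""
    slug = re.sub(r"[^a-z0-9]+", "-", name.lower()).strip("-")
    return f"{ELEMENT_BASE}{slug}"


def marc_uri(identifier: str) -> str:
    """Option 3: a MARC21-derived identifier."""
    return f"{ELEMENT_BASE}{identifier}"


examples = [opaque_uri(1001), friendly_uri("dimensions(map)"), marc_uri("007map03")]
```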

 

Discussion:

 

Using the human-friendly data element names in the URIs is risky because it may take a while to develop a set of names that people agree on.

 

Using opaque names makes it harder for humans to visualize the term meanings when URIs are used. Also, some utility programs that analyze RDF assume that the last part of the URI string is the name of the property, and use it in graphs.

 

Using the MARC21 "code" will make sense only to persons well-versed in MARC (e.g. librarians). Also, using the code in the URI does not make it easily searchable or easily linked to MARC.

 

 

Overlaps and Conflicts

 

This is for the future, but at some point we may want to reconcile some of the overlapping term lists.

 

There is real overlap and partial overlap between vocabularies. For real overlap, more than one material uses the same values for Government publication. This vocabulary can be defined once, but identified with more than one 008 material.

 

Partial overlap is when similar vocabularies, like Color, have a different selection of values. For example:

 

color = 007map03

  •  One color
  •  Multicolored

color = 007microform09

  • Black-and-white
  • Multicolored
  • Mixed

 

These seem to be a selection from a larger universe of color, and this selection may be better handled in an application profile than in the definition of the vocabulary terms themselves, if that is possible.
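
A rough sketch of the application-profile idea in Python: one shared color vocabulary, with each MARC element allowed only a subset of it. The names `COLOR_TERMS`, `PROFILES` and `is_allowed` are hypothetical, and the term spellings are normalized for the example.

```python
# One shared vocabulary: the "larger universe" of color terms.
COLOR_TERMS = {"one color", "multicolored", "black-and-white", "mixed"}

# Each element selects the terms it permits from the shared vocabulary.
PROFILES = {
    "007map03": {"one color", "multicolored"},
    "007microform09": {"black-and-white", "multicolored", "mixed"},
}


def is_allowed(element_id: str, term: str) -> bool:
    """True if the term is in the selection defined for this element."""
    return term in PROFILES.get(element_id, set())


map_allows_mixed = is_allowed("007map03", "mixed")
microform_allows_mixed = is_allowed("007microform09", "mixed")
```

The vocabulary terms are defined once; only the selections differ, so "multicolored" is the same term wherever it is used.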

 

007 Data Elements

 

The first byte of the 007 gives a material type. The list of 007/00 values provides a list of material types. Within those material types there are subtypes, so the full 007 material type vocabulary will be a list of types and sub-types, e.g.:

 

Globe

-- Celestial globe

-- Planetary or lunar globe

-- Terrestrial globe

-- Earth moon globe

-- Unspecified

-- Other

 

(Aside: Obviously it would be nice to do without "Unspecified" and "Other." Removing "Other" means having a way to add values to the list at the time of cataloging, however.)

 

The remaining bytes in the 007 are properties of the resource type that is represented by byte 007/00. (I do not know of a case where the value of a fixed field byte changes based on the value of 007/01, the specific material type. Need to confirm that is the case.)
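
The type and sub-type list could be held as a nested lookup keyed on 007/00 and 007/01. The Python sketch below covers only the globe example. The one-letter codes are given as I understand the MARC 21 definition of the 007 for globes and should be verified before use; the function name `material_type` is hypothetical.

```python
from typing import Optional, Tuple

# 007/00 code -> (type name, {007/01 code -> sub-type name}).
# Only "Globe" is filled in; codes should be checked against the format.
MATERIAL_TYPES = {
    "d": ("Globe", {
        "a": "Celestial globe",
        "b": "Planetary or lunar globe",
        "c": "Terrestrial globe",
        "e": "Earth moon globe",
        "u": "Unspecified",
        "z": "Other",
    }),
}


def material_type(field_007: str) -> Tuple[Optional[str], Optional[str]]:
    """Return (type, sub-type) for the first two bytes of an 007 field."""
    entry = MATERIAL_TYPES.get(field_007[0])
    if entry is None:
        return None, None
    type_name, subtypes = entry
    return type_name, subtypes.get(field_007[1])


globe = material_type("dc")
```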

 

006/008 Data Elements

 

Every 006 element is also an 008 element, but with a different byte position, since the 006 is only a partial repeat of the 008. There may be no need to create two sets of vocabularies; instead, a single vocabulary description with both the 008 and 006 identifier could suffice. When MARC records are transformed to another format, the originating field+position could be included in the data.

 

The 008 has different values for different formats. In this case, the format information is found in the record Leader, not in the 008 itself.  Position 00 of the 006 contains a code for the resource format.
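
In code, the difference is simply where the format code is read from. This is a minimal Python sketch, assuming that Leader position 06 (type of record) is the position consulted for the 008. In practice Leader/07 refines the choice for some record types, which is not handled here. The sample Leader and 006 values are invented.

```python
def format_code(tag: str, field_value: str, leader: str) -> str:
    """Return the code that selects the set of 006/008 data elements."""
    if tag == "006":
        return field_value[0]   # the 006 carries its own format code at 006/00
    if tag == "008":
        return leader[6]        # the 008 depends on the record Leader
    raise ValueError(f"not an 006 or 008 field: {tag}")


# Invented sample values, for illustration only.
sample_leader = "00000cam a2200000 a 4500"
sample_006 = "e" + " " * 17

code_for_008 = format_code("008", "", sample_leader)
code_for_006 = format_code("006", sample_006, sample_leader)
```

A transformation that keeps the originating field and position in the output data would need to store this code alongside them, since the 008 position alone does not say which element it holds.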

 

NOTE that the 008/006 resource types and the 007 resource types are not the same. (link to table)

 

Number and Code Fields (0XX)

 

The fields in this range are either simple (a single string or unit) or complex (multiple units in the data element). Each of these units may have qualifying information -- meta information about the value itself. Most frequently this qualifying information provides the agency, controlled list, or rules that define the provenance of the value. There are a few binary values. These latter cannot be qualified and are generally found in indicators.

 

There are many fields in the MARC 0XX area that make up a logical gathering of elements. One example is the field for various language codes (041). It is easy to assume that the data elements in this field need to be kept together through a data structure of some kind. While it may be the case that applications will display these together as a unit, they will be treated as independent simple data elements if there is no inherent dependency between them. In the case of the 041, each subfield is meaningful on its own without the others. A counterexample is that of classification codes that have two parts: the class code and the item number. While the class code has meaning on its own, the item number is only meaningful in the context of the "call number" which comprises both the class code and the item number.

 

This is primarily to point out that the analysis here follows the logic of the data elements and their relation to each other and to the focus bibliographic item, and not the structure of the MARC record. It is this generalization of the underlying data elements that should make it possible to create different record structures from the same data that is in MARC.

 

Note on Control subfields ($w, $3, $5, $6, $8)

 

I don't include the control subfields in this analysis. To the extent that they inform the list of MARC data elements they will need to be included at some point. In many cases, the control subfields have to do with the structure of the record, not the semantics of the data. The $3 links together statements about the same subunit in a record for a complex item. This may be handled differently in a different record structure. The $5 qualifies a field based on the institution to which it applies, often for item-level notes. These obviously could be placed in an item-level structure rather than the body of the bibliographic record for the manifestation. The $6 and $8 are most often used to link together parallel fields using different scripts. This gives a structural link that is particular to the structure of the MARC record.

 

Simple elements

These are elements that stand alone and can be used in any context. Examples are:

  • Constant ratio linear vertical scale
  • Library of Congress Catalog Number
  • ISBN
  • Terms of availability
  • Report number

 

Note that there may be more than one simple element in a MARC field. Thus,

 

020 $a ISBN $c Terms of availability $z Canceled/invalid ISBN

 

is actually three separate single elements:

 

ISBN

Terms of availability

Canceled/invalid ISBN

 

Each of these is a statement about the focus of the bibliographic record, and they are not dependent on each other. (I think; if I'm wrong, let me know.)
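
If the subfields really are independent, a field can be flattened into separate statements with no wrapper structure around them. Here is a small Python sketch of that reading of the 020. The function name is hypothetical, and the $c and $z values in the sample are invented.

```python
from typing import List, Tuple

SUBFIELD_PROPERTIES_020 = {
    "a": "ISBN",
    "c": "Terms of availability",
    "z": "Canceled/invalid ISBN",
}


def statements_020(subfields: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Turn (subfield code, value) pairs into independent (property, value) statements."""
    return [(SUBFIELD_PROPERTIES_020[code], value)
            for code, value in subfields
            if code in SUBFIELD_PROPERTIES_020]


# The $c and $z values are invented for the example.
field_020 = [("a", "0914378260"), ("c", "$12.00"), ("z", "0914378261")]
statements = statements_020(field_020)
```

Each resulting pair can be attached directly to the resource being described, and nothing is lost if one of them is absent.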

 

 

Complex Elements

 

Many of the elements are made up of two or more dependent parts. The call numbers are a relatively simple example of this, illustrated here with the LC Call Number:

 

050 $a Classification number $b Item number

 

The Classification number can be used without the Item number, but the Item number is not meaningful without the Classification number.

 

Note that the Classification number is repeatable, while the Item number is not. When there is more than one Classification number, any one of them can be combined with the single Item number to create a full call number.
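
The dependency can be sketched as a function that pairs every Classification number with the one Item number, if there is one. This is a Python illustration only; the function name and the sample numbers are invented, and joining the parts with a space is my assumption about display.

```python
from typing import List, Optional


def call_numbers(class_numbers: List[str], item_number: Optional[str] = None) -> List[str]:
    """Combine each classification number with the single item number, if present."""
    if item_number is None:
        # A classification number is usable on its own.
        return list(class_numbers)
    return [f"{class_number} {item_number}" for class_number in class_numbers]


# Invented sample values.
full = call_numbers(["QA76.9", "Z699.35"], ".C69 2011")
class_only = call_numbers(["QA76.9"])
```

Note that the function has no way to produce an Item number by itself, which mirrors the rule that the Item number is not meaningful without the Classification number.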

 

Other call numbers that follow this general pattern are: UDC (MARC 080), DDC (MARC 082).

 

Another example of a complex field is the Fingerprint Identifier (026). This is used for antiquarian books, and has the following subelements:

 

$a First and second groups of characters

$b Third and fourth groups of characters

$c Date

$d Number of volume or part

 

It also has a subfield that can contain the entire fingerprint and is a simple element:

 

$e Unparsed fingerprint

 

Simple strings, Complex elements

 

There are string values in the 0XX area that are simple strings but that can contain more than one data element. The ISBN is an excellent example of this:

 

020 $a 0914378260 (pbk. : v. 1)

020 $a 0670033480 (hbk. : alk. paper)

 

This is clearly more than just an ISBN, but it is coded only as a single data element. The punctuation could be considered to delineate sub-elements, but is probably not regular enough or rigorously enough followed to be processed as such.
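
For strings that do happen to follow the usual pattern, a best-effort split might look like the following Python sketch. It assumes a 10- or 13-character ISBN followed by an optional parenthesized note whose parts are separated by colons. Anything that does not fit is returned unparsed, which reflects the point that the punctuation cannot be relied on. The function name `split_isbn` is hypothetical.

```python
import re
from typing import List, Optional, Tuple

ISBN_WITH_NOTE = re.compile(r"^\s*([0-9]{9}[0-9Xx]|[0-9]{13})\s*(?:\((.*)\))?\s*$")


def split_isbn(value: str) -> Tuple[Optional[str], List[str]]:
    """Split an 020 $a string into (ISBN, list of note parts)."""
    match = ISBN_WITH_NOTE.match(value)
    if match is None:
        # Not regular enough to parse: keep the whole string as a single note.
        return None, [value.strip()]
    isbn, note = match.group(1), match.group(2)
    parts = [part.strip() for part in note.split(":")] if note else []
    return isbn, parts


paperback = split_isbn("0914378260 (pbk. : v. 1)")
hardback = split_isbn("0670033480 (hbk. : alk. paper)")
```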

 

Another field that is a single string but can contain more than one data point is the Number of Musical Instruments or Voices Code (048). This field has two (independent) subfields:

 

$a Performer or ensemble

$b Soloist

 

Each one carries a two-character code for the musical form, and "may be followed by a two digit number (01-99) that indicates the number of parts or performers (e.g., va02, a two-part composition for Voice - Soprano)." This is a complex element -- musical form plus number of parts -- that is coded as a single string.
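
If these strings were ever analyzed into their parts, the split is mechanical: two letters, optionally followed by two digits. This is a Python sketch under that assumption. The function name is hypothetical, and lowercase letters are assumed for the code.

```python
import re
from typing import Optional, Tuple


def split_instrument_code(value: str) -> Tuple[str, Optional[int]]:
    """Split a string such as 'va02' into (two-character code, number of parts)."""
    match = re.fullmatch(r"([a-z]{2})(\d{2})?", value)
    if match is None:
        raise ValueError(f"unexpected code format: {value!r}")
    code, count = match.group(1), match.group(2)
    return code, int(count) if count is not None else None


two_part = split_instrument_code("va02")
no_count = split_instrument_code("va")
```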

 

At some future date it may be desirable to further analyze these into their logical parts. For now, they will be listed here as single strings, matching their use in MARC.

 

Qualified elements

The qualified elements are ones that have additional information about the data element itself. These qualifying subfields are about the data element, not about the bibliographic item being described. The qualification changes the meaning or semantics of the data.

 

Due to its growth over time, there are different treatments of data in MARC that relate to qualification. For example, there are fields for a number of different classification numbers:

Library of Congress Call Number

National Library of Medicine Call Number

National Agricultural Library Call Number

Dewey Decimal Classification Number

 

In a sense, each of these is a classification or call number with the type of number (or "source" of the number) included in the definition of the field.

 

At some point it became clear that if MARC were to create a separate field for every possible call/classification number, it would run out of 0XX fields. The next step was to create a field for "Other classification number." This field takes a different approach to defining a call/class number and its source. In effect, this is a general field where any call/class number can be input:

$a Classification number $b Item number $2 Source of number

 

Were MARC being developed anew today it might make sense to use this field for all class numbers. Instead of having a separate field for LCC, you would have

$a Classification number $b Item number $2 Source of number=LCC
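
As a thought experiment, the dedicated fields can be rewritten into the general form by supplying the source that the tag implies. In the Python sketch below, the tag-to-source table reflects my understanding of the MARC tags (050, 060, 070, 082). Only "LCC" appears as a source label in this document; the other labels, the function name, and the sample values are placeholders.

```python
from typing import Dict, Optional

# Source implied by a dedicated field's tag. Labels other than "LCC" are placeholders.
IMPLIED_SOURCE = {"050": "LCC", "060": "NLM", "070": "NAL", "082": "DDC"}


def generalize(tag: str, subfields: Dict[str, str]) -> Dict[str, Optional[str]]:
    """Express a call/class number field in the general $a $b $2 form."""
    source = IMPLIED_SOURCE.get(tag, subfields.get("2"))
    return {"a": subfields.get("a"), "b": subfields.get("b"), "2": source}


# Invented sample values.
from_dedicated = generalize("050", {"a": "Z699.35", "b": ".C69 2011"})
from_general = generalize("084", {"a": "AN 74000", "2": "example-scheme"})
```

Both results have the same shape, so downstream code would need only one definition of "classification number with source".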

 

Many elements are qualified by assigning agency or code source. Of these, some have a finite list of types, and others are open-ended.

 

Qualifiers with finite lists

Fully designated types can in most cases be turned into a single element for each type. For example, the 082 DDC, which is typed as either the full edition or the abridged edition, can be created as two separate elements: full edition DDC and abridged edition DDC. If there are URIs in the future that distinguish between them, then only one element will be needed, to be used with the appropriate URI. In other cases, such as that of 033 Date of Event, it is possible that the structure of the data will be sufficient to distinguish between date types. If not, then each date type should be treated as a separate element.

 

Examples:

 

 033 Date of event

  • Single date
  • Multiple single dates
  • Range of dates

 

 082 Dewey Decimal Classification

  • Full edition
  • Abridged edition

 

Open ended qualifiers

Many of the elements with open-ended typing have a qualifier that gives the source or issuing agency for the value. There is no set list of agencies that can be used with these fields, although codes should be selected from the MARC Institution Code list.

 

In an environment where lists of values and the values themselves have URIs, the URI itself will be sufficient for the value to be fully qualified as to its source. Qualifiers are needed only where URIs are not available (which is still the majority of cases for MARC values). Qualified elements could be given a structure that includes the value and the qualifier for the value. The structure can be combined with the value (e.g. URN-like, ISMN:2222) or stored as a multi-part element (type="ISMN", number="2222").
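
The two storage choices carry the same information and can be converted into each other, as this Python sketch suggests. It assumes that the type never contains a colon; the function names are hypothetical.

```python
from typing import Dict


def to_compact(type_: str, number: str) -> str:
    """URN-like form: qualifier and value in one string."""
    return f"{type_}:{number}"


def to_parts(compact: str) -> Dict[str, str]:
    """Multi-part form: split on the first colon only."""
    type_, _, number = compact.partition(":")
    return {"type": type_, "number": number}


compact = to_compact("ISMN", "2222")
parts = to_parts(compact)
```

The compact form is easier to carry around as a single value, while the multi-part form avoids having to parse strings whose values may themselves contain punctuation.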

 

Examples:

 

  024 Other standard identifier

  • 0 - International Standard Recording Code
  • 1 - Universal Product Code
  • 2 - International Standard Music Number
  • 3 - International Article Number
  • 4 - Serial Item and Contribution Identifier
  • 7 - Source specified in subfield $2
  • 8 - Unspecified type of standard number or code
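
The 024 list above is a mix of both kinds of qualifier: fixed types for most indicator values, and an open-ended source when the value is 7. A Python sketch of resolving the type, with a hypothetical function name:

```python
from typing import Optional

IDENTIFIER_TYPES_024 = {
    "0": "International Standard Recording Code",
    "1": "Universal Product Code",
    "2": "International Standard Music Number",
    "3": "International Article Number",
    "4": "Serial Item and Contribution Identifier",
    "8": "Unspecified type of standard number or code",
}


def identifier_type(first_indicator: str, source: Optional[str] = None) -> Optional[str]:
    """Resolve the type of an 024 identifier from its first indicator and $2."""
    if first_indicator == "7":
        return source  # open-ended: the type is whatever $2 names
    return IDENTIFIER_TYPES_024.get(first_indicator)


ismn = identifier_type("2")
from_source = identifier_type("7", "doi")
```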

 

 015 National Bibliographic Number

  • $a - National bibliography number (R)
  • $z - Canceled/invalid national bibliography number (R)
  • $2 - Source (NR) Code that identifies the source of the National Bibliography Number. Code from: National Bibliography Number Source Codes.

Notes on Specific Fields

028 - Publisher number

This has a second indicator that controls whether or not notes or added entries are derived from the field. I have skipped that in my analysis, and would be interested to hear if it is considered an important bit of info. If so, I would probably create it as a single subelement representing the 2nd indicator and its four different values.

033 - Date/Time of Event, 045 - Time period of content

These two have an indicator to say whether the date is single, multiple, or a range. While this could be covered in the date/time format, I am making it a modifier on the dates, treating it like a subfield that modifies each date subfield.

035 - System control number

This is a simple data element, but in fact it could be considered complex because the data source is included in the string: "(CaOTULAS)41063988." I am treating it as simple for the time being.
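
Should the complex treatment be wanted later, the split is straightforward when the parenthesized prefix is present. This is a Python sketch with a hypothetical function name; values without the prefix are returned with no source.

```python
import re
from typing import Optional, Tuple


def split_control_number(value: str) -> Tuple[Optional[str], str]:
    """Split '(source)number' into (source, number); source is None if absent."""
    match = re.fullmatch(r"\(([^)]+)\)(.+)", value)
    if match is None:
        return None, value
    return match.group(1), match.group(2)


with_source = split_control_number("(CaOTULAS)41063988")
without_source = split_control_number("41063988")
```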

040 - Cataloging source

I'm not sure which of these, if any, can be treated as simple and which as complex, so for the moment I am moving past it. Because the 040 is not repeatable and only the "Modifying agency" subfield is repeatable, it would seem that all of the subfields would be treatable as simple -- that there is no dependency between the subfields. But I need confirmation on that.

046 - Special Coded Dates

There are instructions here that I do not understand ("The field must also contain...."), and I do not know which combinations of subfields make sense. I am skipping this one for now, and would love to talk to someone who understands the input conventions.

 

 

Connecting to MARC21 in ISO 2709

 

There is a need to make clear the connection between elements in MARC21 RDF and the MARC21 elements stored in ISO 2709 format. At the same time, it is not desirable to use the MARC21 2709 field/subfield designators as the identities of MARC21 in RDF because of the unfriendliness of the tag/subfield conventions to non-librarians. For this reason, it seems best to embed a link to MARC21 2709 in the description for each MARC21 in RDF element. One possibility is to use a MARC-centric URI for the MARC21 in 2709 elements:

 

http://marc21.info/element/007map03

 

To encode this, it may be suitable to use OWL sameAs. However, there is the disadvantage that these URIs do not resolve (at least, not on their own). Ideas about this would be appreciated!
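
For discussion, this is roughly what such a link would look like when written out as a single RDF statement in N-Triples syntax. The Python sketch only builds the string. The human-friendly URI is invented for the example, and whether sameAs is the right property for this purpose is still an open question.

```python
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"


def same_as_triple(subject_uri: str, object_uri: str) -> str:
    """Return one N-Triples line asserting owl:sameAs between two URIs."""
    return f"<{subject_uri}> <{OWL_SAME_AS}> <{object_uri}> ."


# The first URI is a hypothetical human-friendly name for the element.
triple = same_as_triple(
    "http://marc21.info/element/color-map",
    "http://marc21.info/element/007map03",
)
```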
