
DataFormatIssues

Page history last edited by PBworks 17 years, 3 months ago

Data Format Issues and Ideas


Issues (cooked)

 

Problems with MARC21

Record format limitations

  1. Limitations on the size of records that can be created with Z39.2: a maximum of 9999 characters per field and 99999 characters per record. The latter effectively limits the number of fields that a record can contain.
  2. Inherent limitations in the MARC implementation of Z39.2: a maximum of 26 distinct content subfields and 10 control subfields can be defined per tag. (Note, numeric subfields have been designated as having a special function.)
  3. Limited hierarchical levels: tag and subfield.
  4. Having only two indicator positions means that each field can carry only two attributes.
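
The size limits in item 1 follow from the fixed-width numeric slots of the Z39.2 record structure: the leader stores the record length in five digits, and each directory entry stores a three-character tag, a four-digit field length, and a five-digit starting position. A minimal sketch of that arithmetic in Python (the function names are illustrative and do not come from any library):

```python
# Widths of the numeric slots in a Z39.2 / MARC21 record.
RECORD_LENGTH_DIGITS = 5   # leader positions 00-04
TAG_CHARS = 3              # directory entry: tag
FIELD_LENGTH_DIGITS = 4    # directory entry: length of field
START_POS_DIGITS = 5       # directory entry: starting character position


def max_value(digits: int) -> int:
    """Largest number that fits in a zero-padded decimal slot."""
    return 10 ** digits - 1


def directory_entry(tag: str, length: int, start: int) -> str:
    """Build one 12-character directory entry, refusing values that overflow."""
    if len(tag) != TAG_CHARS:
        raise ValueError("tag must be exactly 3 characters")
    if length > max_value(FIELD_LENGTH_DIGITS):
        raise ValueError(f"field too long: {length} > {max_value(FIELD_LENGTH_DIGITS)}")
    if start > max_value(START_POS_DIGITS):
        raise ValueError(f"start position too large: {start}")
    return f"{tag}{length:0{FIELD_LENGTH_DIGITS}d}{start:0{START_POS_DIGITS}d}"


print(max_value(FIELD_LENGTH_DIGITS))   # 9999
print(max_value(RECORD_LENGTH_DIGITS))  # 99999
print(directory_entry("245", 74, 120))  # 245007400120
```

A field of 10000 characters cannot be described at all in this layout, which is why the limit is structural rather than a matter of policy.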

Redundancy of data elements

  1. A large number of data elements with some degree of redundancy (X00 fields, X10 fields, title fields and subfields, etc.) Comment from Martha Yee: moved to comments.
  2. Inconsistency between the treatment of same or similar data elements across fields (e.g. author and title information in 77x linking fields does not have the same subfields as those data elements in other fields).

Fixed field data elements

  1. Fixed fields have values that are embedded in the standard itself, so adding a new value means modifying the standard. These values should all come from external authoritative lists (if they should exist at all).
  2. Fixed fields that should be parallel to textual fields a) are located separately from those fields and b) may not have the same values, either because of input problems or because of limitations in the value list.
  3. Variable fields extend fixed fields (e.g. 041 extending the language code in 008) because the fixed fields lack flexibility. These data elements should be brought together.
  4. The use of defaults in fixed fields, which therefore convey little information because the fixed position must carry a value, even if that value is blank.
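
Item 3 can be illustrated with language coding. In MARC21 the 008 field carries a single three-character language code at positions 35-37, while further languages go into repeatable 041 $a subfields. A rough sketch of bringing the two together into one list, assuming a simplified in-memory record (the dict layout, the "041a" key and the function name are hypothetical):

```python
def languages(record: dict) -> list:
    """Collect language codes from 008/35-37 and 041 $a into one ordered list."""
    codes = []
    fixed = record.get("008", "")
    if len(fixed) >= 38:
        code = fixed[35:38]
        # Blank or fill characters mean no usable value was coded.
        if code.strip(" |"):
            codes.append(code)
    for code in record.get("041a", []):
        if code not in codes:
            codes.append(code)
    return codes


# A 40-character 008 with "eng" at positions 35-37; other positions are left blank here.
sample_008 = " " * 35 + "eng" + " " * 2
record = {"008": sample_008, "041a": ["eng", "fre"]}
print(languages(record))  # ['eng', 'fre']
```

The check for blanks and fill characters also touches on item 4: a position that must always carry a value needs extra logic just to find out whether anything was actually recorded.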

Levels and linking

  1. Record linking in MARC21 is awkward to use and is not implemented by many systems.
  2. A mixture of logical levels (from the work to the item level) in a single record with no structural differentiation. Comment from Martha Yee: moved to comments.
  3. No way to create a multi-level record for works within works.

 

The Field

The basis of any new metadata will be the definition of a field. The MARC21 field has four elements:

 

  1. Field tag
  2. Two indicator positions; the indicators have general information about the content of the field - how to display, what is the authority over the data, qualifiers for the data (type of heading, type of scale)
  3. General subfields (a-z); these carry the main data elements of description
  4. Control subfields (1-8); these perform functions within the record - linking between fields, identification of authority lists, role codes

 

Both indicators and control subfields perform multiple roles in relation to the field data, and some data (authority lists, as an example) can be coded either as indicators or as control subfields. The elements of the field, when sorted out, are:

 

  1. Field identifier (MARC tag)
  2. Data qualifiers (type, level)
  3. Data source (authority list)
  4. Display rules (display constants, note controllers)
  5. Data subfields
  6. Field linking (between fields in the same record)
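
The six elements above could be modelled as one explicit structure rather than being spread across indicators and control subfields. A sketch in Python; every class and attribute name here is an assumption made for illustration, not a proposed schema, and the sample values are only loosely based on a MARC 650 subject field:

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Subfield:
    code: str
    value: str


@dataclass
class Field:
    identifier: str                                  # 1. field identifier (MARC tag)
    qualifiers: dict = field(default_factory=dict)   # 2. data qualifiers (type, level)
    source: Optional[str] = None                     # 3. data source (authority list)
    display: dict = field(default_factory=dict)      # 4. display rules
    subfields: list = field(default_factory=list)    # 5. data subfields
    links: list = field(default_factory=list)        # 6. links to other fields in the record


subject = Field(
    identifier="650",
    qualifiers={"level": "primary"},
    source="lcsh",
    subfields=[Subfield("a", "Cataloging"), Subfield("x", "Data processing")],
)
print(subject.source)          # lcsh
print(len(subject.subfields))  # 2
```

Each role gets its own named slot, so the authority list no longer has to be coded either as an indicator or as a control subfield depending on the field.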

 

There is possibly also a need for metadata about the field itself:

 

  1. Field identifier
  2. Creation time and date
  3. Cataloging rules that govern the content, with Version

 

This metadata will make it possible to update individual fields rather than having all updates be a full record update, as they are in MARC, and to mix data from different sources or sets of rules, if desired.
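
As a sketch of what field-level metadata would enable, the following Python replaces a single field in a record only when the incoming copy is newer, and leaves every other field alone. The `FieldMeta` and `apply_update` names, the per-record unique `field_id`, and the sample values are all assumptions for illustration.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class FieldMeta:
    field_id: str       # identifier of this field instance
    created: datetime   # creation time and date
    rules: str          # cataloging rules governing the content
    rules_version: str  # version of those rules


@dataclass(frozen=True)
class Field:
    meta: FieldMeta
    content: str


def apply_update(record: dict, incoming: Field) -> bool:
    """Insert or replace one field; return True if the record changed."""
    current = record.get(incoming.meta.field_id)
    if current is not None and current.meta.created >= incoming.meta.created:
        return False  # existing field is as new or newer; keep it
    record[incoming.meta.field_id] = incoming
    return True


old = Field(FieldMeta("title-1", datetime(2007, 1, 17), "AACR2", "2002"), "Data format isues")
new = Field(FieldMeta("title-1", datetime(2007, 3, 30), "AACR2", "2002"), "Data format issues")
record = {"title-1": old}
print(apply_update(record, new))  # True
print(record["title-1"].content)  # Data format issues
print(apply_update(record, old))  # False
```

Because the rules and their version travel with each field, a record built this way could also hold fields created under different sets of rules side by side.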

 

In addition, there is a control subfield ($7) that provides a complex array of information about a linking entry field. As we have seen with MODS, this function is due to the limitations of the MARC format and can be overcome in other formats by allowing more levels of hierarchy in the data itself.

 

Ideas

Extensible controlled vocabulary lists

One of the problems that occurs in the current library metadata is that there are authority lists that cannot be easily extended. For the lists in the MARC format, actual changes to the standard are needed in order to extend a list (with the exception of the large lists managed by Library of Congress for languages and place of publication). There needs to be an easy way to add terms to a controlled vocabulary list without breaking the standard.

 

There are many different ways to develop extensibility for a set of terms. The main thing is that the newly minted term should have a clear context (what list does it belong to?), and that people who encounter the term can get to its definition. Take the USB flash drive (thumb drive) as an example: the context is that it is a carrier of information, and specifically a new kind of computer carrier. It also extends an existing list, say, the RDA carrier list.

 

Let's pretend that we have a registry of terms. And let's pretend that the registry has some management mechanism, such as a small group of participants that oversees the various lists in the registry (so it's not total anarchy). Our thumb drive could be added such that:

 

http://authoritylists.info/RDA:carrier:computer_carrier:USB_flash_drive

 

returns this information in a machine-readable format:

 

owner: RDA

list: carrier

sublist: computer carrier

element: USB flash drive

status: provisional

date added: 2007-03-30

description: "USB flash drives are NAND-type flash memory data storage devices integrated with a USB (universal serial bus) interface." (quotes because I took that from Wikipedia, but generally the expert adding the term would write a suitable description.)

synonyms: thumb drive, jump drive, flash drive
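
To make the example concrete, here is a minimal Python sketch of how a client might split the identifier in that URL into the first four fields of the response. The colon-separated layout is taken from the example URL; the function name is hypothetical, and status, date, description and synonyms would have to come from the registry itself rather than from the URL.

```python
from urllib.parse import urlparse


def parse_term_url(url: str) -> dict:
    """Split a registry URL of the form owner:list:sublist:element into its parts."""
    path = urlparse(url).path.lstrip("/")
    owner, list_name, sublist, element = path.split(":")
    return {
        "owner": owner,
        "list": list_name,
        # Underscores stand in for spaces in the URL form of a name.
        "sublist": sublist.replace("_", " "),
        "element": element.replace("_", " "),
    }


url = "http://authoritylists.info/RDA:carrier:computer_carrier:USB_flash_drive"
print(parse_term_url(url))
```

The URL alone therefore answers the context question (which owner, which list, which sublist) before any request is made.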

 

The vocabulary list could be used by anyone who finds it useful. Systems could make use of the registry to support the creation of new records and the reading of existing records. With some periodicity, these systems would check that their lists are up to date (like the automatic update of virus lists in anti-virus software). Such a system could decide that provisional entries would be flagged in some way (maybe they would show up as red on the screen). Or a system receiving a record with a previously unknown item in an authority list or a previously unknown list could quickly grab the description from the registry and use that to provide services, like definitions and synonyms, to its users.
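
The behaviour described above could look roughly like the following on the receiving system. This is a sketch only: `fetch` stands in for whatever network call a real registry would offer, and the class, the method names and the `[provisional]` marker are invented for illustration.

```python
from typing import Callable


class TermCache:
    """Local copy of registry entries, filled on demand."""

    def __init__(self, fetch: Callable[[str], dict]):
        self._fetch = fetch      # hypothetical function: term id -> registry entry
        self._entries: dict = {}

    def lookup(self, term_id: str) -> dict:
        # An unknown term triggers a single registry request, then is cached.
        if term_id not in self._entries:
            self._entries[term_id] = self._fetch(term_id)
        return self._entries[term_id]

    def label(self, term_id: str) -> str:
        entry = self.lookup(term_id)
        # Provisional entries are flagged so the interface can highlight them.
        flag = " [provisional]" if entry.get("status") == "provisional" else ""
        return entry["element"] + flag


def fake_registry(term_id: str) -> dict:
    """Stand-in for the registry, returning the thumb drive entry."""
    return {
        "element": "USB flash drive",
        "status": "provisional",
        "synonyms": ["thumb drive", "jump drive", "flash drive"],
    }


cache = TermCache(fake_registry)
print(cache.label("RDA:carrier:computer_carrier:USB_flash_drive"))
# USB flash drive [provisional]
```

A periodic refresh, like the virus-list update mentioned above, would amount to clearing or re-fetching the cached entries on a schedule.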

 

 


Issues (raw)

  • Relationship to FRBR and RDA. Is this a time of change that requires a new format, or should these be treated as separate needs?
  • Can we tackle just MARC bibliographic, or do we also need to include at least Authorities and Holdings in our analysis? Comment from Martha Yee: moved to comments.
  • MARC causes library data to be marginalized (but MODS, a friendlier XML format, hasn't had much more success at crossing over to other fields)
  • How do you express the current table of contents RSS feed for a journal title from CiteULike?
  • There are a lot of advantages to MARC. We have a lot of data in MARC. It would be expensive to move whole hog off of it. Maybe we keep MARC for some bibliographic data.
  • Advantages of moving to XML:
    • Ability to more easily share data with non-librarians
    • Ability to use existing XML tools and applications
    • XML can be human-readable as well as machine-readable
  • MARC has many fields and data elements (fixed) that can no longer be expanded, so new data elements cannot be added.
  • MARC Bibliographic is not just bibliographic -- there is info for ordering, URLs to related web items, holdings info...
  • What fixed fields are used by systems? What need is there to carry this information into a new data format?
  • Social tagging -- can it work? (kc: I'm copying this to the catalog discussion)
  • Browsing -- a requirement of the data format, or a system feature, or not needed (because it would be better to use topic maps)? (kc: Ditto, copying to catalog discussion). Comment from Martha Yee: moved to comments.
  • Need to list problem data elements, i.e. dates (some ambiguity), genres (see MODS)
  • Examine the results of the data analysis done by Moen, for the 007 field in particular. We suspect that few vendors do anything with many of the elements defined in the 007. This could be an interesting place to experiment: first, if there is very little legacy data there (check Moen's results); and second, if no one is using the majority of the data elements in the legacy data that IS there (i.e., end users retrieving or browsing in a meaningful way, not simply encoding on the creator side); then it may be fertile ground for 'lifting' (liberating) this part of the construct and remodeling it elsewhere.
  • 007 needs attention, but include staff reporting (e.g. for preservation, whatever) as a meaningful use of the data.
  • Hierarchy... as noted, MARC supports hierarchical description very little and poorly, and that's been a constraint. Given the scale and complexity with which we communicate records among systems, better support for linked or hierarchical descriptions will require sophistication in handling inheritance, identifiers, and update dates. Maintaining metadata that flows between hierarchical and flat environments is ugly. (R. Wendler) Comment from Martha Yee: moved to comments.
  • How well can the data format support resource discovery?

Comments (10)

Anonymous said

at 5:53 am on Jan 17, 2007

Comment from Martha Yee on differentiating levels: This could be pretty tricky to tease out. Consider the fact that the title, for example, could be the title of the work (if it coincides with the uniform title), the title of the expression (but only after at least one different expression has been published with a different title), AND the title of the manifestation (but only after at least one different manifestation (identical underlying content) has been published with a different title on the chief source of information); in other words, which level(s) the title "belongs to" will change over time as more and more manifestations and expressions are published, and as this change occurs, the title will be associated with more and more of the FRBR levels (not just one).

Anonymous said

at 5:58 am on Jan 17, 2007

I think this is a weakness in the FRBR model. But without going there, it would still be good to differentiate between the manifestation and the item (for item-level notes, for example). I know that some see the role for this in the holdings record, but if so then we need to re-think the holdings record and make it a true item record. Well, actually, we need to start over.

Anonymous said

at 6:01 am on Jan 17, 2007

Comment from Martha Yee on redundancy: Does redundancy refer here to the current necessity to record both a transcribed form of a name (how it appeared on the document cataloged) and a normalized form of a name? If so, I don't think the need to record a transcribed form goes away until every document ever created by humanity is digitized and linked to its description; that may never happen, and as long as we are describing things that are not digital, an important part of determining the names by which entities are commonly known is the recording of the way they actually appear on every document we collect...

Anonymous said

at 6:07 am on Jan 17, 2007

I was intending this to be redundancy in that data elements like "personal author" and "title" appear in many different fields, and each time they are re-described. In a data structural sense, MARC has many different title fields. It could have a data element defined as title that could be used in different fields (like MODS). This has various advantages for coding and for the upkeep of the standard. It would also mean that all title data elements would be consistent, which they are not today.

Anonymous said

at 6:16 am on Jan 17, 2007

Comment from Martha Yee on whether we need to tackle bibliographic and authority records: The FRBR entities work, expression, person, corporate body, topic, etc. are represented by authority records, not bibliographic records, so I would say it would be essential to include Authorities. At the UCLA Film & Television Archive, we have noticed that the only MARC 21 hierarchy supported by current systems is that created by authority record linked to bibliographic record linked to holdings record, so we have used the authority record to represent the work, the bibliographic record to represent the expression and the holdings record to represent the manifestation; that would argue for considering the holdings record, as well.

Anonymous said

at 6:18 am on Jan 17, 2007

Comment from Martha Yee on browsing: Could we define browsing here please? Does it refer to a search of headings as opposed to a search of bibliographic records? If so, I would say it is essential, since it is the heading that represents the FRBR entity of interest to the user, not the bibliographic record (currently defined as a particular manifestation of a particular expression of a particular work). Does it refer to a left-to-right match of a heading? If so, I would say systems would provide better service if they used a keyword in heading search as the default FRBR entity search, rather than the current left-to-right match which is the only possibility offered by most current systems, and which requires the users to know entry terms.

Anonymous said

at 6:21 am on Jan 17, 2007

Comment from Martha Yee on hierarchy: Is it possible that the major constraint on demonstrating hierarchy here is not MARC, but rather the requirement that we be able to communicate records among thousands of different systems? Will we need to do that into the indefinite future (see my article called "One catalog or no catalog")? And another question for people who know more about systems than I do: What exactly is it that prevents current systems from making the cross reference from FBI on the authority record for 'United States. Federal Bureau of Investigation' available to users who search on subdivisions of the FBI? Could current systems, if properly programmed, recognize that any cross reference that refers to the parent organization should refer also to any subdivision of the parent organization? Or are there underlying hardware or software constraints outside of MARC that make this currently impossible?

Anonymous said

at 6:25 am on Jan 17, 2007

Martha, you can have many hierarchical levels and still exchange records -- XML does this nicely, as do other data formats. As for the "FBI" question, I think the authority records would have to be designed to facilitate that. Instead the authority records are designed for a linear approach where the user would encounter the higher level, then work down to the specific. In other words, the authority records are designed for a different system.

Anonymous said

at 9:35 am on Mar 26, 2007

The basic concept of replacing numbered tags with terms I see as problematic. Canada is bilingual. Special Libraries Cataloguing even catalogues for a library in a trilingual country. Increasingly even the United States is becoming multilingual. Why abandon language and script neutral numbers as element/field tags?

Anonymous said

at 7:53 am on Mar 27, 2007

Actually, there's nothing in this that suggests that tags will be terms. Tags are tags and can be anything. The example needed to be human-readable, but the point here is the structure, not the content. In fact, the "tags" will probably be URLs (which are fully international), and each community will choose its own vocabulary for the display of the tags. If you look at the SKOS documentation on the W3C site, you can see how that works.
