Registry Update November 2014-April 2015

During these six months, staff have focused on development of the registry in anticipation of an Alpha release in early June. Work has centered on refining the mechanism for detecting duplicate and related records, incorporating that mechanism into a preliminary system for the manual review of records, and exploring approaches to the detection of gaps in holdings.

Project staff

Joshua Steverman began work on December 1, 2014 as the primary Applications Developer in support of the US Federal Documents Registry.

Relationship detection

Staff continued to refine the duplicate detection mechanism, focusing first on detecting record pairs with matching identifiers (i.e., OCLC number, Library of Congress Control Number, SuDoc call number), then moving on to title. By January 31st, four sets of records had been added, indexed, and run through the relationship detection process: the US federal documents records already identified in the HathiTrust repository, and records contributed by the University of Michigan, the University of Minnesota, and the Committee on Institutional Cooperation (CIC). Additional sets of records received in response to the Fall 2013 call for records are currently being processed, and records from nine other institutions have been added to date.
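The identifier-matching step described above can be illustrated with a minimal sketch. It is not the registry's actual code: the record layout, the field names (`oclc`, `lccn`, `sudoc`), the normalization rule, and the sample values are all assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical field names; the registry's real schema is not described in this report.
IDENTIFIER_FIELDS = ("oclc", "lccn", "sudoc")


def normalize_identifier(value):
    """Drop whitespace and lowercase, so trivially different strings compare equal.

    This is an assumed, deliberately crude rule; real SuDoc or LCCN
    normalization would need more care.
    """
    return "".join(value.split()).lower()


def candidate_pairs(records):
    """Map each pair of record ids to the set of identifier fields they share."""
    # Group record ids by (field, normalized value).
    buckets = defaultdict(set)
    for record in records:
        for field in IDENTIFIER_FIELDS:
            for value in record.get(field, []):
                buckets[(field, normalize_identifier(value))].add(record["id"])

    # Every two records in the same bucket form a candidate pair.
    pairs = defaultdict(set)
    for (field, _value), ids in buckets.items():
        for id_a, id_b in combinations(sorted(ids), 2):
            pairs[(id_a, id_b)].add(field)
    return dict(pairs)


# Made-up sample records, for illustration only.
records = [
    {"id": "rec-a", "oclc": ["0001"], "sudoc": ["X 1.2:3"]},
    {"id": "rec-b", "oclc": ["0001"], "lccn": ["99-000001"]},
    {"id": "rec-c", "sudoc": ["x 1.2:3"], "lccn": ["99-000001"]},
    {"id": "rec-d", "oclc": ["0002"]},
]

for pair, fields in sorted(candidate_pairs(records).items()):
    print(pair, sorted(fields))
```

Grouping by normalized identifier avoids comparing every record against every other record. Pairs that surface this way are only candidates, and a later step such as title comparison or manual review would still decide whether they are duplicates or merely related.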

One area of work that has proved to be very challenging for duplicate detection is the normalization of enumeration and chronology, or volume/date information. This is due in part to the inconsistency of the data as entered and the variances in library binding practices. Currently staff are able to normalize ~50% of the available data. It is anticipated that the implementation of a manual review process, where human reviewers will make decisions about whether or not records are for the same object, will inform and improve this effort.
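To give a sense of why this normalization is difficult, the sketch below parses a few common enumeration and chronology patterns and reports what share of a sample it could handle. The regular expressions, the output keys, and the sample strings are assumptions for illustration and do not reflect the rules staff actually use.

```python
import re

# Hypothetical patterns; real enumeration/chronology strings are far more varied.
VOLUME_RE = re.compile(r"\bv(?:ol)?\.?\s*(\d+)", re.IGNORECASE)
NUMBER_RE = re.compile(r"\bno\.?\s*(\d+)", re.IGNORECASE)
YEAR_RE = re.compile(r"\b(1[789]\d{2}|20\d{2})\b")


def normalize_enumchron(raw):
    """Return a dict of recognized parts, or None if nothing was recognized."""
    parts = {}
    for key, pattern in (("volume", VOLUME_RE), ("number", NUMBER_RE), ("year", YEAR_RE)):
        match = pattern.search(raw)
        if match:
            parts[key] = int(match.group(1))
    return parts or None


def normalized_share(raw_values):
    """Fraction of input strings for which at least one part was recognized."""
    if not raw_values:
        return 0.0
    recognized = sum(1 for raw in raw_values if normalize_enumchron(raw) is not None)
    return recognized / len(raw_values)


# Made-up sample strings, for illustration only.
samples = ["v.12 no.3 1987", "Vol. 3 (1990)", "pt. A", "suppl."]
for raw in samples:
    print(repr(raw), "->", normalize_enumchron(raw))
print("share normalized:", normalized_share(samples))
```

Strings such as "pt. A" or "suppl." fall through every pattern. Unparsed values like these are the kind of case where decisions from human reviewers could suggest new rules.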

Manual review of registry metadata

While the relationship detection process will aid the automated de-duplication of registry metadata, we recognize that some records will need to be reviewed manually in order to determine whether or not they are for the same object.

In January, two University of Washington iSchool students began working with registry staff on a targeted manual review project. The students have been involved in project planning and have researched several different ways to approach the review of record pairs. They have also reviewed record pairs, deciding whether each pair represents duplicate records, related records, or unrelated records, or whether the pieces being described by the records need to be consulted. In preparation for the student review of records, project staff developed a minimal review interface, incorporating pairs of records identified via the relationship detection mechanism as needing review. Staff also developed training documentation that can be repurposed for use by future reviewers. The students will continue to work on the project through May.

Gap detection/Comprehensiveness work

Project staff continued to identify records currently in the HathiTrust repository that are for US federal government documents but are not cataloged as such, along with records for items that are not US federal documents but are cataloged as US federal documents. Over 6,000 items from more than 25 institutions have been identified to date, and those partner institutions were contacted and encouraged to update and resubmit their records. Staff also worked with members of TRAIL (Technical Reports Archive and Image Library) to identify and update over 1,600 records for US federal documents already in the repository. As of April 30, there are 621,188 US federal government documents open in HathiTrust.

In addition to that work, staff have begun exploring methods for identifying gaps in the registry records. One approach has been to utilize authority records for US government authors. Project staff have compared data from the 110 and 710 fields of registry MARC records to entries in VIAF (Virtual International Authority File). Of the authors in the registry records, 90% have VIAF IDs, and the comparison identified thousands of VIAF IDs for US government authors that do not currently appear in any registry record.
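The comparison described above amounts to a coverage calculation plus a set difference. The sketch below assumes the author headings from the 110 and 710 fields have already been extracted and matched to VIAF. The function name, the input shapes, and the placeholder IDs are all hypothetical.

```python
def author_coverage(registry_author_viaf, government_viaf_ids):
    """Summarize how registry authors line up with a list of government-author VIAF IDs.

    registry_author_viaf: dict mapping an author heading (from a 110 or 710 field)
        to its VIAF ID, or to None when no VIAF match was found.
    government_viaf_ids: set of VIAF IDs for US government authors, assumed to
        have been gathered separately.

    Returns (share of registry authors with a VIAF ID,
             government VIAF IDs not found in any registry record).
    """
    matched = {
        heading: viaf_id
        for heading, viaf_id in registry_author_viaf.items()
        if viaf_id is not None
    }
    share = len(matched) / len(registry_author_viaf) if registry_author_viaf else 0.0
    missing = set(government_viaf_ids) - set(matched.values())
    return share, missing


# Placeholder headings and IDs, for illustration only.
registry_authors = {
    "Agency One": "viaf-0001",
    "Agency Two": "viaf-0002",
    "Agency Three": None,
    "Agency Four": "viaf-0004",
}
government_ids = {"viaf-0001", "viaf-0002", "viaf-0004", "viaf-0007", "viaf-0009"}

share, missing = author_coverage(registry_authors, government_ids)
print("share with VIAF IDs:", share)
print("government authors absent from registry:", sorted(missing))
```

An author that appears in the authority file but in no registry record does not prove a gap on its own, though it gives staff a concrete list of agencies whose publications may be missing.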

Staff also participated in conversations with the Government Publishing Office (GPO) regarding a potential bibliographic record exchange and content analysis partnership. No formal agreement is in place at this time.

Outreach

Project staff participated in discussions with several interested parties, including:

  • CIC Heads of Government Publications

  • Documents librarians from several University of California campuses

  • ALA Government Documents Round Table Cataloging Committee

  • Library staff from the University of Michigan Transportation Research Institute

We anticipate much more outreach and interaction with interested parties, both in HathiTrust partner institutions and the broader community, in the next year.

Staff also assisted with the development of some Frequently Asked Questions about the HathiTrust US Federal Government Documents Initiative.


Please contact valglenn@umich.edu with any questions.