Registry Update November 2015 - April 2016

The Federal Documents Registry is now in beta release and is available at https://www.hathitrust.org/usdocs_registry.

There are currently 5.5 million records in the Registry, down from 6.2 million at the time of the last update. Because more than 800,000 new source records were added over the same period, the decrease is evidence that duplicate detection techniques have improved.

Much of the project team’s work over the past six months was spent preparing the Registry for the beta release. The database was moved to a dedicated production environment; the interface is now more accessible for those who use screen readers and other assistive technologies; there is a daily feed of new and/or updated records from the HathiTrust repository; and Registry records have a persistent unique identifier.

Duplicate and Relationship Detection

Source records are clustered based on identifiers. While staff have experimented with string matching, the focus over the past few months has been on improving enumeration and chronology, also known as item description. In some cases this has meant removing information from the enumeration and chronology field that doesn’t describe the item, repeats data from another field, or prevents the record from being clustered with other source records. An example is the term “ONLINE”, which appears in the enumeration and chronology field of some records that include data in the MARC 856 field.
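As a rough illustration of this kind of cleanup, the sketch below drops non-descriptive tokens from an enumeration and chronology string before it is used for clustering. It is not the Registry's actual code: the function name clean_enumchron and the token list are hypothetical, and only the “ONLINE” example comes from the text above.

```python
# Hypothetical list of tokens that do not describe the item itself.
# Only "ONLINE" is taken from the update; a real list would be longer.
NON_DESCRIPTIVE_TOKENS = {"ONLINE"}


def clean_enumchron(value: str) -> str:
    """Drop non-descriptive tokens and collapse runs of whitespace."""
    kept = []
    for token in value.split():
        # Compare without surrounding punctuation, ignoring case.
        bare = token.strip(".,;:()[]").upper()
        if bare not in NON_DESCRIPTIVE_TOKENS:
            kept.append(token)
    return " ".join(kept)


# Two source records for the same volume now share the same item description.
print(clean_enumchron("v.12 1998 ONLINE"))
print(clean_enumchron("v.12  1998"))
```

After cleaning, both sample strings reduce to the same value, so records that share an identifier would no longer be kept apart by the stray token.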

Staff have also reduced duplication by clustering together records for the same item in different formats, based on information in the MARC 776 field. The Registry record includes all relevant information for the item (e.g., title, publisher, OCLC number) and also lists each format (e.g., Print, Microform, Online).
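One way to picture clustering across formats is as a union-find over record identifiers, where a link taken from a 776 field joins two records into the same cluster. The sketch below is an assumption-laden illustration, not the Registry's implementation: the record dictionaries, the "oclc" and "related_oclc" keys, and the function name cluster_by_identifier are all hypothetical, and the linked identifiers are assumed to have already been extracted from the 776 field.

```python
from collections import defaultdict


def cluster_by_identifier(records):
    """Group records that are linked, directly or transitively, by identifier.

    Each record is a dict with an "oclc" key and an optional "related_oclc"
    list (hypothetical field names, assumed to be derived from MARC 776).
    """
    parent = {}

    def find(x):
        # Find the cluster representative, shortening paths along the way.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    for rec in records:
        find(rec["oclc"])  # register the record even if it has no links
        for other in rec.get("related_oclc", []):
            union(rec["oclc"], other)

    clusters = defaultdict(list)
    for rec in records:
        clusters[find(rec["oclc"])].append(rec)
    return list(clusters.values())


records = [
    {"oclc": "1", "format": "Print", "related_oclc": ["2"]},
    {"oclc": "2", "format": "Online", "related_oclc": []},
    {"oclc": "3", "format": "Microform", "related_oclc": []},
]
for cluster in cluster_by_identifier(records):
    print(sorted(rec["format"] for rec in cluster))
```

In this sample, the Print and Online records end up in one cluster because the first points at the second, while the unlinked Microform record stays on its own.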

Staff are also developing specifications for parsing enumeration and chronology for individual titles such as the Federal Register in an attempt to decrease the amount of duplication in the Registry.

Gap Detection and Comprehensiveness

Staff have begun attempting to identify gaps between Registry holdings and HathiTrust repository holdings. Several methods have been tested to identify items in the HathiTrust repository that are federal documents but are not coded as such. Two of these methods have been incorporated into the daily updates from the repository: in addition to checking records for ‘f’ or ‘u’ in the MARC 008 field, the script also checks for the presence of a SuDoc call number. Records are added if they have an 086 field with first indicator ‘0’ or a colon in 086 subfield a.
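The checks described above can be sketched as a pair of small predicates. This is a minimal illustration under stated assumptions, not the script used for the daily updates: the function names are hypothetical, the government publication code is assumed to have been extracted from the 008 field already, and each 086 field is represented as a simple (first indicator, subfield a) tuple. How the 008 check and the 086 check are combined is also an assumption; here either one is enough to flag a record.

```python
def has_sudoc_call_number(fields_086):
    """Return True if any 086 field looks like a SuDoc call number.

    fields_086 is a list of (indicator1, subfield_a) tuples (hypothetical
    representation). A field qualifies if its first indicator is "0" or
    its subfield a contains a colon.
    """
    for indicator1, subfield_a in fields_086:
        if indicator1 == "0" or ":" in (subfield_a or ""):
            return True
    return False


def is_federal_candidate(gov_pub_code, fields_086):
    """Flag a record as a possible federal document.

    Assumption: a record qualifies if EITHER the 008 code is 'f' or 'u'
    OR it carries a SuDoc call number.
    """
    return gov_pub_code in ("f", "u") or has_sudoc_call_number(fields_086)


print(is_federal_candidate(" ", [("0", "A 1.2")]))            # first indicator 0
print(is_federal_candidate(" ", [(" ", "Y 4.J 89/1:S 1")]))   # colon in 086a
print(is_federal_candidate(" ", [("1", "CS13-211")]))         # neither rule applies
```

Keeping the 086 test in its own function makes it easy to tighten or replace if the colon heuristic turns out to admit records that are not federal documents.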

Aside from the daily updates from the repository, the Registry had not acquired much new data since the 2013 call for records. Data from the GPO Catalog of Government Publications is now being added to the Registry on an ongoing basis, which ensures that the Registry is kept up to date.

Project staff have also begun identifying gaps between the Registry and the HathiTrust repository by building several sample pick lists. These lists, based on criteria such as title, agency, or source library, allow us to identify records in the Registry that do not have a HathiTrust ID.
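In code, a pick list of this kind amounts to a filter over Registry records. The sketch below is illustrative only: the record layout, the "ht_ids" key, the sample data, and the function name build_pick_list are hypothetical and do not reflect the Registry's actual schema.

```python
def build_pick_list(registry_records, **criteria):
    """Return records that match every criterion and lack a HathiTrust ID.

    criteria maps a field name (e.g. agency or title) to the wanted value.
    """
    return [
        rec
        for rec in registry_records
        if not rec.get("ht_ids")
        and all(rec.get(field) == wanted for field, wanted in criteria.items())
    ]


# Made-up sample records for demonstration.
sample = [
    {"title": "Report A", "agency": "Agency X", "ht_ids": ["ht.001"]},
    {"title": "Report B", "agency": "Agency X", "ht_ids": []},
    {"title": "Report C", "agency": "Agency Y", "ht_ids": []},
]
for rec in build_pick_list(sample, agency="Agency X"):
    print(rec["title"])
```

Only the second sample record is listed, since it matches the agency and has no HathiTrust ID.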

Outreach

Bill Dueber and Valerie Glenn presented on the challenges of duplicate detection at the LITA Forum in Minneapolis, MN in November 2015.

Mike Furlough and Valerie Glenn will be presenting a paper on the Registry during the IFLA World Library and Information Congress in August 2016.

Project staff participated in discussions with several interested parties, including:

  • Committee on Institutional Cooperation Heads of Government Publications

  • ALA Government Documents Round Table Cataloging Committee

  • GODORT of Michigan

Future Plans

In the next six months, staff will continue to refine duplicate and gap detection methods. HathiTrust print holdings data will be incorporated into data analysis. Specifically, staff will look for records coded as federal documents that are not currently in the Registry, and will identify potential source libraries for items that are in the Registry but not in the HathiTrust repository. There will also be an initial assessment of the Registry based on use cases and other factors. Staff will hold discussions with a variety of stakeholders about policies and services around bulk record availability.

 

Please contact valglenn@umich.edu with any comments or questions.