Opening Up 15,000+ Federal Documents: An Algorithm Story

April 10, 2019

By Heather Christenson

Over 15,000 federal documents previously in limited view are now full view in HathiTrust thanks to an adjustment in our metadata management system. This is something to celebrate for a number of reasons. First and foremost, these thousands of digitized federal documents are now available to anyone! Additionally, we’ve improved our system so that new federal documents coming into HathiTrust will be identified more effectively and thus made available in full view. This work is a great example of librarians owning and managing our system and, in this case, engaging with an algorithm and optimizing it for public good.

When our libraries contribute digital volumes to HathiTrust, we receive bibliographic data along with them. For many titles in HathiTrust we have received multiple contributions, each with its own metadata and digital volumes. In those cases the multiple bibliographic records are clustered together in our system, with one record chosen as the preferred record that is then used in HathiTrust discovery and access services. The preferred record is chosen by a record scoring algorithm that assigns points to different aspects of each record and chooses the record with the highest score. Importantly, the viewability status of an item relies on data within the preferred record, so viewability within the HathiTrust Digital Library can be affected by which record the algorithm designates “preferred”.
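To make the clustering-and-scoring idea concrete, here is a minimal sketch in Python. It is not HathiTrust's actual code: the BibRecord class, the score_record function, and the "one point per populated field" rule are all assumptions made up for illustration. The only part taken from the description above is the overall shape, in which each record in a cluster gets a score and the highest-scoring record becomes the preferred one.

```python
from dataclasses import dataclass


@dataclass
class BibRecord:
    """Hypothetical stand-in for one contributed bibliographic record."""
    record_id: str
    fields: dict  # hypothetical mapping of field tag -> value


def score_record(record: BibRecord) -> int:
    """Assumed scoring rule for illustration: one point per populated field.

    The real algorithm assigns points to different aspects of a record;
    those aspects and their weights are not described in this post.
    """
    return sum(1 for value in record.fields.values() if value)


def choose_preferred(cluster: list) -> BibRecord:
    """Return the record with the highest score in a cluster."""
    return max(cluster, key=score_record)


# Two records contributed for the same title.
cluster = [
    BibRecord("a", {"245": "Leaflet", "260": "", "650": ""}),
    BibRecord("b", {"245": "Leaflet", "260": "Washington", "650": "Insect pests"}),
]
preferred = choose_preferred(cluster)
print(preferred.record_id)
```

Because viewability is read from the preferred record only, anything the other records in the cluster say about the volume is invisible to the rights determination step in a design like this one.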

In HathiTrust, it is our intent to provide federal documents in full view to the extent legally permissible. A group of us decided to look at the record scoring algorithm to see if there were adjustments that could be made to prefer any record within a cluster that indicates that a given volume is a federal document.

For some libraries, cataloging of federal documents has necessarily been minimal, and this is generally reflected in the bibliographic data that HathiTrust receives. We also saw cases where more richly cataloged bibliographic records for federal documents, some of which member libraries had specifically invested in, were not being chosen as the preferred record by the record scoring algorithm.

We found that the record scoring algorithm’s general-purpose rules sometimes chose preferred records for federal document volumes that did not include the information that the volume was a federal document. When these preferred records were acted upon in the bibliographic rights determination process, we were not able to identify the volumes as federal documents, so in many cases we were not able to present them to users in full view.

We decided to improve the algorithm by assigning higher weight to records indicating U.S. federal document status. (For those wanting details: the algorithm checks the 008 field for the “f” and “u” flags, that is, an “f” in 008/28 and a “u” in 008/17.)
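For readers who like to see this kind of check spelled out, here is a hedged sketch in Python. The character positions (008/17 and 008/28) come from the description above. Everything else is an assumption for illustration: the function names, the size of the bonus, the choice to require both flags together, and the make_008 helper that builds placeholder field values. Positions are counted from zero, so 008/17 is index 17 of the field.

```python
FEDERAL_DOC_BONUS = 100  # hypothetical weight; the real value is not given here


def is_us_federal_document(field_008: str) -> bool:
    """Check the two flags: 'u' at 008/17 and 'f' at 008/28.

    Requiring both flags at once is an assumption of this sketch.
    """
    if len(field_008) < 29:
        return False  # too short to contain position 28
    return field_008[17] == "u" and field_008[28] == "f"


def adjusted_score(base_score: int, field_008: str) -> int:
    """Add the assumed bonus when a record indicates federal document status."""
    if is_us_federal_document(field_008):
        return base_score + FEDERAL_DOC_BONUS
    return base_score


def make_008(place: str, gov_pub: str) -> str:
    """Build a 40-character placeholder 008 value for illustration only."""
    chars = [" "] * 40
    chars[15:18] = place.ljust(3)[:3]  # positions 15-17: place of publication
    chars[28] = gov_pub                # position 28: government publication
    return "".join(chars)


# A minimal record that carries the flags, and a richer one that does not.
minimal_flagged = make_008("dcu", "f")
rich_unflagged = make_008("dcu", " ")

print(adjusted_score(10, minimal_flagged))
print(adjusted_score(40, rich_unflagged))
```

With a bonus larger than the typical gap between records, a sparsely cataloged record that carries the flags can outrank a richer record that lacks them, which is the effect the change was after.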

By making this change we’re now providing more than 15,000 federal documents in full view that were previously in limited view, bringing the total to 1,182,913 federal documents in full view in HathiTrust as of April 1, 2019. As new digitized volumes and associated bibliographic data are contributed by our member libraries, federal documents will be identified more effectively and thus provided in full view.

Examples of the range of federal documents now in full view include volumes of Statistics of Income from the Internal Revenue Service, initially requested by a user, and this U.S. Senate Hearing on the Role of Giant Corporations, from 1971. There is also the briefly titled Leaflet from the U.S.D.A., covering the Cotton Aphid, The Meadow Spittlebug and How to Control It, or Centipedes and Millipedes in the House (eek!). These publications and many, many more can be found in the U.S. Federal Documents collection.

Reflecting on this project, it is important to recognize that these kinds of improvements are possible because we as libraries control the processes, can bring collective expertise to bear on problems, and do this in the spirit of greater access!

Many thanks to all who contributed to this project, especially: Tim Prettyman, Charlie Collett, Angelina Zaytsev, Josh Steverman, Kathryn Stine, Sandra McIntyre, and former HathiTrust staffer Valerie Glenn.

Illustration of a cotton aphid with text "Leaflet no. 467" and a library stamp of May 18, 1960