Federal Documents in HathiTrust: A Look at Our Collective Collection

March 20, 2017

By Heather Christenson

HathiTrust has an ambitious goal to build a comprehensive digital collection of U.S. federal documents distributed in print format. But what do we already have in our collective digital collection? And what can we learn about that collection? These are the questions that HathiTrust staff set out to answer in a project that we’ve called the “Federal Documents Collection Profile.” In January, we concluded an initial analysis of the U.S. Federal Documents collection as it existed on September 1, 2016. We call it “initial” because this hadn’t been done before, and because we expect it to be the precursor of more robust collection analysis and comparisons to come. A goal of the project was to investigate a variety of metrics based on the data available to us in order to establish a baseline for reporting on the collection. We were cautiously optimistic that we would be able to characterize at least some aspects of the collection.

We began the project by tackling the challenge of defining a set to analyze. What is the best way to identify federal documents in the large mass of HathiTrust metadata? This question is not entirely new to us: as we described in Detecting U.S. Federal Documents to Expand Access, cataloging has varied in completeness and accuracy over the years. Given that variation, is it even possible to delineate this set accurately, or at least reasonably? We settled on an approach: detection of “f” and “u” in the MARC 008 field, detection of a SuDoc number in the MARC 086 field, and, making use of our U.S. Federal Documents Registry, checking for a match between the HathiTrust record and the Registry record. By this method, we narrowed the universe of federal documents in the HathiTrust digital collection to 412,205 bibliographic records and 970,315 digital objects. Of the bibliographic records in this set, 94% represent monographs and 6% represent serials, while 56% of the digital objects are monographs and 44% are serials.
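For readers who want to picture the three signals, here is a minimal sketch in Python. It is not the code used for the profile. It assumes records are plain dictionaries (a hypothetical shape), that the “u” is the last character of the place-of-publication code (008 position 17) and the “f” is the government publication code (008 position 28, as in the books format), that a first indicator of 0 on the 086 field marks a SuDoc number, and that the Registry can be reduced to a set of matching record identifiers. How the three signals were actually combined is not described above, so the sketch simply reports each one and treats any hit as a candidate.

```python
def federal_signals(record, registry_ids):
    """Report which federal-document signals a record shows.

    `record` is a hypothetical dict: {"id": str, "008": str,
    "086": [{"ind1": str, "a": str}, ...]}.
    `registry_ids` is a hypothetical set of record ids that match
    an entry in the U.S. Federal Documents Registry.
    """
    field_008 = record.get("008", "")
    # Assumption: 008/17 == "u" (U.S. place code) and 008/28 == "f" (federal).
    has_008_codes = (
        len(field_008) > 28 and field_008[17] == "u" and field_008[28] == "f"
    )
    # Assumption: 086 first indicator "0" identifies a SuDoc number in $a.
    has_sudoc = any(
        f.get("ind1") == "0" and f.get("a", "").strip()
        for f in record.get("086", [])
    )
    in_registry = record.get("id") in registry_ids
    return {"008": has_008_codes, "086": has_sudoc, "registry": in_registry}


def is_candidate(record, registry_ids):
    # Assumption: any single signal is enough to include the record.
    return any(federal_signals(record, registry_ids).values())


def make_008(place="dcu", gov_pub="f"):
    # Builds a 40-character 008 string with made-up values for illustration.
    return "850101s1985    " + place + " " * 10 + gov_pub + " " * 11


example = {
    "id": "rec-1",  # invented identifier
    "008": make_008(),
    "086": [{"ind1": "0", "a": "Y 4.J 89/1:"}],
}
print(federal_signals(example, registry_ids={"rec-1"}))
```

Reporting the signals separately, rather than returning a single yes or no, makes it easier to see how much each source of evidence contributes to the final set.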

Because we did not limit our set to full-view items, our federal documents collection includes both kinds of access: approximately 852,488 digital objects are fully viewable in the U.S., while 117,827 are limited view (search only). Clearly more investigation can be done here to understand this breakdown.

One of the great strengths of HathiTrust is our large community of members. The power of aggregation was clear in our finding that fifty-one different organizations had deposited federal documents in HathiTrust. HathiTrust member partnerships with Google generated the great majority of digitized federal documents, but we have documents in our collection from almost twenty different digitization sources.

Things got more challenging when we dove into bibliographic data. Duplicates? We made some progress, but identifying true duplicates will require a focused in-depth analysis project. Breakdowns by subject? Corporate author? Publisher? Clearly there is work ahead of us to overcome decades of inconsistent cataloging practices and textual complications for a meaningful characterization. During this analysis we found more than forty-nine variations on “Government Printing Office” in the publisher field (260 $b), let alone the name change to “Government Publishing Office”!
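To give a sense of what reconciling those publisher variations might involve, here is a small Python sketch. It is only an illustration, not the method used in the analysis: the sample strings, the abbreviation table, and the list of words to drop are all invented for the example, and real 260 $b data would need a much longer table.

```python
import re
from collections import Counter

# Hypothetical abbreviation table; real data would need many more entries.
ABBREVIATIONS = {
    "govt": "government",
    "print": "printing",
    "off": "office",
    "gpo": "government printing office",
}
# Hypothetical words to drop so "U.S." prefixes do not create new variants.
DROP_WORDS = {"u", "s", "us", "united", "states", "the"}


def normalize_publisher(raw):
    """Collapse a 260 $b publisher string to a rough comparison key."""
    text = raw.lower().replace(".", "")      # "G.P.O." -> "gpo"
    text = re.sub(r"[^a-z\s]", " ", text)    # drop remaining punctuation, digits
    words = [ABBREVIATIONS.get(w, w) for w in text.split()]
    words = [w for w in words if w not in DROP_WORDS]
    return " ".join(words)


# Invented sample values, for illustration only.
samples = [
    "U.S. Govt. Print. Off.,",
    "Government Printing Office",
    "U.S. G.P.O. :",
    "U. S. Government Printing Office;",
    "Government Publishing Office",
]
counts = Counter(normalize_publisher(s) for s in samples)
for key, n in counts.most_common():
    print(n, key)
```

In this sketch the four “Printing Office” spellings collapse to one key, while “Government Publishing Office” stays separate, since the name change is a real difference rather than a cataloging variation.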

But other aspects of the collection did come into view. The date curve peaked nicely in a pattern mimicking overall government publishing, 1960s through mid-1990s. Our subset of records that contained SuDoc numbers (64% of the full set) broke out to show strengths in Congressional Publications, Forest Service documents, NASA, and more. We found 147 languages represented in the bibliographic records, the vast majority English with a very long tail (and including some head-scratchers like Ancient Greek, perhaps another sign of inconsistent cataloging in the collection).

A brief look at usage metrics revealed “Library of Congress Catalogs 1976 V. 4” and “Annual Report of the Commissioner of Patents for 1916” in first and second place, as well as “A short guide to New Zealand” in ninth place, apparently having gained fame by being discussed in a Reddit thread.

Finally, in addition to analysis of the full HathiTrust federal documents collection set, we zeroed in on a set of individual titles, as well as one agency, the Civil Rights Commission, to see what we could learn about comprehensiveness in HathiTrust. Although we estimate that our comprehensiveness measures are on the conservative side, HathiTrust clearly has a long way to go to fill in our collection, since comprehensiveness for individual titles ranged from around 3% to 60%.

We believe that our collection profile is one of the first attempts to delineate and characterize a collection from within the aggregate mass digitized library collection. We know that identifying the gaps and filling them in will be a task measured in years. 970,315 digitized documents is a great starting point, and our sleeves are rolled up.

Read the full report, Collection Profile: U.S. Federal Documents in HathiTrust.