It’s No Secret – Millions of Books Are Openly in the Public Domain

October 10, 2019

Kristina Hall, Copyright Review Program Manager, HathiTrust
Greg Cram, Associate Director of Copyright and Information Policy, New York Public Library


Since 2008, the HathiTrust Copyright Review Program has been researching hundreds of thousands of books to find ones that are in the public domain and can be opened for view in the HathiTrust Digital Library. Over the past 11 years, 168 people across North America have worked together toward a common goal: sharing public domain works from our libraries. As of September 2019, the HathiTrust Copyright Review Program has performed copyright reviews on 506,989 US publications; of those, 302,915 (59.7%) have been determined to be in the public domain in the United States. The opening of these works in HathiTrust has brought the total of openly available volumes to 6,540,522.

The Copyright Review Program, now an operational program of HathiTrust, began as a grant-funded ambition of the University of Michigan Library, under the leadership of Melissa Levine. The Institute of Museum and Library Services (IMLS) funded three consecutive grants enabling the University of Michigan Library and grant collaborators to build a copyright review management system. The program is still going strong eleven years later, resulting in hundreds of publications determined to be in the public domain each week.

One way the Copyright Review Program determines the copyright status of items in the HathiTrust corpus is to check whether their copyrights were properly renewed. In the United States, the copyright in works published between 1924 and 1964 had to be renewed about 28 years after the item was published; works that were not renewed moved into the public domain when their initial term of protection expired. The Stanford Copyright Renewal Database was one of the first to host monograph renewal records in an open access database, but much of the initial copyright registration information remains difficult to search.
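To make the renewal timing concrete, here is a minimal Python sketch that estimates which years of renewal records to check for a given publication year. The function name and the two-year search window are illustrative assumptions, as is reading "between 1924 and 1964" as excluding 1964 itself; this is not a description of the Copyright Review Program's actual workflow.

```python
from typing import Optional, Tuple

# Year bounds come from the paragraph above. Treating the upper bound as
# exclusive, and searching a two-year window, are assumptions of this sketch.
RENEWAL_ERA_START = 1924
RENEWAL_ERA_END = 1964  # assumed exclusive


def renewal_search_years(publication_year: int) -> Optional[Tuple[int, int]]:
    """Return the years in which a renewal would most likely be recorded,
    or None when the renewal rule described above does not apply."""
    if not (RENEWAL_ERA_START <= publication_year < RENEWAL_ERA_END):
        return None
    # "About 28 years after publication": look in years 27 and 28.
    return (publication_year + 27, publication_year + 28)


print(renewal_search_years(1940))  # (1967, 1968)
print(renewal_search_years(1970))  # None
```

A missing renewal in that window is only a starting point for review, not a determination.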

In 2018, the New York Public Library began the difficult process of unlocking the record of American creativity embedded in the US Copyright Office’s Catalog of Copyright Entries (CCE), which comprises 450,000 pages of registration and renewal records. The CCE is the published index of the records that are critical to understanding the copyright status and ownership of copyrighted works. The Copyright Office has been working to make images of these records available online, but searching these imaged records with precision and confidence remains elusive. No function exists to search the entire CCE reliably; instead, users rely on analog techniques, opening multiple digitized volumes and paging through the records.

NYPL has embarked on an effort to enable accurate searching of the CCE by converting CCE records for 1923–1977 publications into a machine-searchable format. To make the records searchable, NYPL has begun to extract the CCE data as text. NYPL’s approach is to accurately transcribe the data, then parse it into the appropriate fields so that users can facet their searches. The raw data is then made available on the project page and is freely accessible and usable. NYPL is actively gathering user stories about how people might access and use this data, in order to build a set of requirements for a search interface.

HathiTrust has been enthusiastically following the work of NYPL to determine the possibilities of this data set. Copyright determination is an information problem at heart. The work that NYPL is doing to make information about copyright registrations available in a powerfully searchable format will greatly assist libraries that want to make digital collections broadly available to the public.

Before we jump to the possibilities, remember that not all books lacking a copyright renewal are in the public domain. In our experience, the media articles claiming 80% of titles from this period are public domain don’t appear to take into account complexities such as:

  • restoration of copyright for foreign authorship or foreign publication
  • layers of copyright in a translated work
  • qualifying for copyright registration in another format like serialized novels, drama, poetry, lectures, and short stories
  • inclusion of materials reproduced by permission such as illustrations
  • renewal of a prior edition
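The list above can be read as a checklist: a missing renewal makes a book a candidate for the public domain only when none of these complications apply. The Python sketch below encodes that reading. The class, field, and function names are hypothetical and chosen for illustration; it is not HathiTrust's review procedure, which relies on trained reviewers.

```python
from dataclasses import dataclass, fields


@dataclass
class Complications:
    """Hypothetical flags, one per complexity in the list above."""
    foreign_author_or_publication: bool = False
    translation: bool = False
    registered_in_another_format: bool = False
    material_reproduced_by_permission: bool = False
    prior_edition_renewed: bool = False


def public_domain_candidate(renewal_found: bool, flags: Complications) -> bool:
    """True only when no renewal was found and no complication is flagged.

    A True result marks a candidate for review, not a determination.
    """
    if renewal_found:
        return False
    return not any(getattr(flags, f.name) for f in fields(flags))


print(public_domain_candidate(False, Complications()))                  # True
print(public_domain_candidate(False, Complications(translation=True)))  # False
```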

HathiTrust sampled a small set of the NYPL registration records, specifically records lacking a renewal where the item had not yet been opened as public domain in HathiTrust. HathiTrust staff discovered that the data could help prioritize items awaiting a copyright review and identify places where incomplete metadata in catalog records was hindering the search for items in the public domain. Out of 1,946 registration records in the NYPL sample set, 15% were already awaiting a HathiTrust copyright review and could be prioritized to go first. Another 18% were completely new items to add to the review queue; they had originally been passed over because the catalog record lacked a place of publication. A further 42% had already received a HathiTrust copyright review and were either opened as public domain or had encountered one of the previously mentioned complexities. The remaining 25% had some indication in the catalog record that a public domain outcome would not be likely, such as having a foreign publisher.
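For a sense of scale, the percentages in the sample can be converted into approximate record counts. The short Python sketch below does that arithmetic; the category labels are paraphrases, and the counts are estimates only, since the published percentages are rounded.

```python
# Sample size and shares are taken from the paragraph above. The labels are
# paraphrases, and the resulting counts are estimates because the published
# percentages are rounded.
SAMPLE_SIZE = 1946

SHARES = {
    "already awaiting review": 15,
    "new to the review queue": 18,
    "already reviewed": 42,
    "public domain outcome unlikely": 25,
}


def estimated_counts(total, shares):
    """Convert whole-number percentage shares into approximate record counts."""
    return {label: total * pct / 100 for label, pct in shares.items()}


counts = estimated_counts(SAMPLE_SIZE, SHARES)
for label, count in counts.items():
    print(f"{label}: about {count:.0f} records")

# The four categories should account for the whole sample.
print(sum(SHARES.values()))  # 100
```

Read this way, roughly a third of the sample (the first two categories) feeds directly into the review queue.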

Based on the sample set, HathiTrust has begun acting on the NYPL data to prioritize items for copyright review. The next round of work will include efforts to match more of the NYPL records to HathiTrust records. Then HathiTrust will be able to identify more books where insufficient metadata in the catalog record has prevented earlier review.

NYPL’s efforts to convert the entire CCE to machine-searchable text continue. NYPL was awarded a National Leadership Grant from IMLS in July 2019 to convert another 10,000 CCE pages, which would complete the registrations for Class A works registered between 1970 and 1977. At the conclusion of this grant, NYPL will have converted all Class A registrations for books registered between 1923 and 1977. Because the Copyright Office’s modern, machine-readable records begin in 1978, completing these registrations would close the gap between the historical and modern records and enable searching across nearly a century of records. In addition to the IMLS grant, NYPL’s project has been generously funded by the Ford Foundation and Arcadia Fund.

This is an exciting time for libraries as we strive to make digital collections broadly available. The collaboration between HathiTrust and NYPL is just one example of how new access to old data has helped improve ongoing work in copyright. Yes, millions of books are in the public domain; it’s no secret. We are grateful for the chance NYPL and HathiTrust have to work together and put more public domain books in the hands of the public.