Datasets

HathiTrust makes the texts of public domain works in its corpus available for research purposes. The works fall into two categories: non-Google-digitized volumes, which are freely available, and Google-digitized volumes, which are available through an agreement with Google. Within each category there is a distinction between public domain works available only in the US versus public domain works available anywhere in the world. These sets are described further below. Please contact feedback@issues.hathitrust.org with questions and inquiries.

Non-Google-digitized volumes

Description

Approximately 550,000 public domain volumes as of March 2015, primarily, though not exclusively, English-language materials published prior to 1923.

Access and Use

  • There are no restrictions on the availability or use of texts for non-Google-digitized public domain volumes for those users in the US. For users outside the US, only the subset of non-Google-digitized public domain volumes available anywhere in the world will be made available.

Obtaining Datasets

  • A version of the set, updated on a monthly basis, is available for download. Please contact us for more information.
  • Small numbers of works (the practical limit is about 10,000) in this set can be retrieved through our Data API.
  • Researchers receiving datasets must sign and return this statement regarding management of the dataset.
  • If you would like a custom set of volumes drawn from this set, we will need you to provide a list of HathiTrust ids for the items. Please see the section on Custom Datasets below.
  • See "Retrieving Datasets via rsync" below for specific instructions on receiving datasets.

Google-digitized volumes

Description

Approximately 4.8 million public domain volumes as of March 2015, representing a wide variety of languages, subjects, and dates. See the visualizations of HathiTrust public domain volumes.

Access and Use

These volumes were digitized by Google and are available through an agreement with Google that must be signed on behalf of researchers by an institutional sponsor (someone with appropriate signing authority at a researcher's institution). In general, the limits on use of these materials are as follows:

  • May be used only for scholarly research purposes
  • May not be used commercially
  • May not be re-hosted or used to support publicly available search services
  • May not be shared with third parties

In addition, for users outside the US, only the subset of Google-digitized public domain volumes available anywhere in the world will be made available.

Obtaining Datasets

To begin the process of receiving datasets, researchers or institutional representatives should:

  1. Review the agreement with Google.
  2. Determine who the authorized sponsor at the institution will be (because of liability and indemnification terms in the agreement, legal counsel will likely be needed). Interested researchers should begin this process as soon as possible.
  3. Send a brief proposal to feedback@issues.hathitrust.org that specifies
    1. the authorized sponsor
    2. a characterization of the desired texts (please be specific about subjects, dates, languages, etc. If all texts are desired, please be clear about the reason all subjects, languages, dates, etc. are needed).
    3. volume ids for the specific volumes requested (see Custom Datasets below)
    4. what research is to be done
    5. what the results will be, and
    6. how they will be used
  4. Sign and return this statement related to the use and management of the dataset.
  5. When you submit your proposal, please indicate whether or not you give permission to share the proposal publicly. 

The following institutions have signed, or are in the process of signing, an agreement with Google for use of texts. If you are a researcher affiliated with one of these institutions, you may proceed directly to submitting the proposal, and we will be in touch with your institutional sponsor.

  • Boston University
  • Carnegie Mellon University
  • Clemson University
  • Columbia University
  • Cornell University
  • Dartmouth College
  • Indiana University
  • Lehigh University
  • McGill University
  • Michigan State University
  • New York University
  • Princeton University
  • Stanford University
  • Texas A&M University
  • University of California
  • University of Colorado
  • University of Illinois
  • University of Iowa
  • University of Massachusetts, Amherst
  • University of Massachusetts, Lowell
  • University of Michigan
  • University of Oxford
  • University of Toronto
  • University of Vermont
  • University of Virginia
  • University of Waikato
  • Yale University

See "Retrieving Datasets via rsync" below for specific instructions on receiving datasets.

Custom Datasets

If you are not interested in receiving the entire set of non-Google-digitized or Google-digitized materials, you may also rsync only the volumes you are interested in. To do so, you will need a list of the volume ids of desired volumes. Volume ids are present in the persistent identifiers for HathiTrust volumes (e.g., http://hdl.handle.net/2027/mdp.39015021715670). IDs can be retrieved in the following ways:

  • HathiTrust's tab-delimited metadata files. These files are an inventory of repository holdings, containing a variety of identifiers for volumes (ISBN, LCCN, OCLC, etc.), copyright information, and limited bibliographic metadata for each volume in HathiTrust. Users wishing to select batches of materials that cannot be defined by search queries may find these files useful in selecting volumes. A description of the files is available at http://hathitrust.org/hathifiles_description.
  • HathiTrust Data API. In addition to retrieving entire volume packages from HathiTrust (including images and OCR), the Data API can be used to find ids for volumes digitized from a particular source. The University of Michigan has built a demonstration application using the Data API that illustrates how this can be done. Please see http://www.lib.umich.edu/two-over-threehundred.
  • Defining a query or set of queries from a catalog or full-text search (see the search portals on the home page). If the desired texts can be defined using facets or queries in a catalog or full-text search (e.g., English-language materials published before 1800), we can generate the list of ids that are needed. Please send queries to feedback@issues.hathitrust.org.
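
If you already have persistent identifiers for the volumes you want, the volume id is the part of the handle URL that follows the 2027 prefix, as in the example above. The following Python sketch shows one way to pull it out. The function name and the validation rules are illustrative assumptions, not part of any HathiTrust tool.

```python
from urllib.parse import unquote, urlparse

# Handle prefix as it appears in the example URL above.
HANDLE_PREFIX = "/2027/"


def volume_id_from_handle(url: str) -> str:
    """Return the volume id embedded in a HathiTrust handle URL.

    Hypothetical helper: it assumes the URL has the form
    http://hdl.handle.net/2027/<volume id>.
    """
    parsed = urlparse(url.strip())
    if parsed.netloc != "hdl.handle.net" or not parsed.path.startswith(HANDLE_PREFIX):
        raise ValueError(f"not a HathiTrust handle URL: {url!r}")
    volume_id = unquote(parsed.path[len(HANDLE_PREFIX):])
    # A volume id is a namespace and a local id joined by the first period.
    namespace, sep, local_id = volume_id.partition(".")
    if not (namespace and sep and local_id):
        raise ValueError(f"no volume id found in: {url!r}")
    return volume_id


if __name__ == "__main__":
    print(volume_id_from_handle("http://hdl.handle.net/2027/mdp.39015021715670"))
```

Applying this to a list of handle URLs, one per line, yields the kind of volume id list that a custom dataset request needs.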

Retrieving Datasets via rsync

Researchers must provide a static non-NAT IP address in order for us to grant access to retrieve datasets using the rsync command line tool (rsync is our preferred method of transfer). Once the IP address is configured for access, you will be able to rsync the dataset using the instructions found in this GitHub gist.

Most of the options in the gist are necessary, but a few are optional. For example, you could use --verbose only on updates to get a list of what changed. Let us know if you want more information about the options. rsync may report errors, and most of them are likely to be harmless. We do not have a catalog of likely errors yet, so if you receive some, please forward them to us. Depending on the size of your dataset, the initial rsync may take a day or more. One benefit of rsync is that if the process is terminated, you will not waste time transferring the same information again when you restart the sync. It will pick up where it left off, after comparing the source and target files.

We update the dataset regularly. If you have a download in progress while we make an update to a dataset, your rsync will be terminated. To recover, simply restart rsync. It will spend time comparing the source and destination trees, but it will not waste time downloading anything already present on the destination.
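
Because a terminated transfer is recovered by simply running rsync again, a long initial sync can be wrapped in a small retry loop. The Python sketch below is one possible approach, not a HathiTrust-provided tool. The function name, the attempt limit, and the wait time are assumptions, and so is the choice to treat every non-zero exit status as a reason to retry. The command itself is supplied by the caller and should be the rsync invocation from the gist.

```python
import subprocess
import sys
import time


def run_until_success(command, max_attempts=5, wait_seconds=60.0):
    """Run `command` until it exits with status 0; return the attempts used.

    `command` is a list of arguments, for example the rsync command
    line taken from the gist. Nothing here is specific to rsync.
    """
    for attempt in range(1, max_attempts + 1):
        result = subprocess.run(command)
        if result.returncode == 0:
            return attempt
        print(
            f"attempt {attempt} exited with status {result.returncode}",
            file=sys.stderr,
        )
        if attempt < max_attempts:
            # Pause before restarting, in case a dataset update is underway.
            time.sleep(wait_seconds)
    raise RuntimeError(f"command still failing after {max_attempts} attempts")
```

Since some rsync errors are harmless, a run that keeps failing for the same reason is better inspected by hand than retried indefinitely, which is why the loop gives up after a fixed number of attempts.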

At the top level you will find a directory for each namespace, a file containing a list of IDs in the dataset (id), and one or more files containing the bibliographic data for the items in the dataset (meta*.json.gz). If you need more information on what this all means, let us know. The directory tree under each namespace is in pairtree format (http://tools.ietf.org/html/draft-kunze-pairtree-01) based on the HathiTrust identifier. Tools to help with pairtree usage can be found at https://confluence.ucop.edu/display/Curation/PairTree. Data for any given volume is found at the end of a pairtree directory structure created using the HathiTrust item identifier. For each item there are two files: a metadata file (<id>.mets.xml) and a zip file with the page text files (<id>.zip). Again, if you need more information about these files, let us know (feedback@issues.hathitrust.org).
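
To locate a volume without a dedicated pairtree tool, the path can be computed from the identifier. The Python sketch below follows the character-cleaning and two-character splitting rules of the pairtree draft cited above. Several details are assumptions to check against your own copy of the dataset: that each namespace directory contains a pairtree_root directory, that the final directory is named with the cleaned local id, and that <id> in the two file names is that same cleaned id. The function names are hypothetical.

```python
from pathlib import PurePosixPath

# Characters the pairtree draft hex-encodes as ^hh before path mapping.
_HEX_ENCODE = set('"*+,<=>?\\^|')
# Single-character substitutions applied after hex-encoding.
_CHAR_MAP = str.maketrans({"/": "=", ":": "+", ".": ","})


def clean_local_id(local_id: str) -> str:
    """Apply pairtree identifier cleaning to the part after the namespace."""
    out = []
    for byte in local_id.encode("utf-8"):
        ch = chr(byte)
        if byte < 0x21 or byte > 0x7E or ch in _HEX_ENCODE:
            out.append(f"^{byte:02x}")
        else:
            out.append(ch)
    return "".join(out).translate(_CHAR_MAP)


def volume_dir(htid: str, root: str = ".") -> PurePosixPath:
    """Directory expected to hold the files for one HathiTrust identifier."""
    namespace, sep, local_id = htid.partition(".")
    if not (namespace and sep and local_id):
        raise ValueError(f"not a namespace.localid identifier: {htid!r}")
    cleaned = clean_local_id(local_id)
    # Pairtree splits the cleaned id into two-character directory names.
    pairs = [cleaned[i:i + 2] for i in range(0, len(cleaned), 2)]
    # Assumption: a pairtree_root directory sits under each namespace.
    return PurePosixPath(root, namespace, "pairtree_root", *pairs, cleaned)


def volume_files(htid: str, root: str = "."):
    """Return the assumed (METS path, zip path) pair for an identifier."""
    directory = volume_dir(htid, root)
    name = directory.name  # assumption: files are named with the cleaned id
    return directory / f"{name}.mets.xml", directory / f"{name}.zip"


if __name__ == "__main__":
    for path in volume_files("mdp.39015021715670", "dataset"):
        print(path)
```

Here "dataset" stands for whatever local directory you synced into. Identifiers that contain slashes or colons are the main reason to apply the cleaning step rather than splitting the raw id.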