Datasets

HathiTrust makes the texts of public domain works in its corpus available for research purposes. The works fall into two categories: non-Google-digitized volumes, which are freely available, and Google-digitized volumes, which are available through an agreement with Google. These sets are described further below. Please contact feedback@issues.hathitrust.org with questions and inquiries.

Non-Google-digitized volumes

Description

Approximately 350,000 public domain volumes as of February 2013, primarily, though not exclusively, English-language materials published prior to 1923.

Access and Use

  • There are no restrictions on the availability or use of texts for non-Google-digitized public domain volumes.

Obtaining Datasets

  • A version of the set, updated on a quarterly basis, is available for download. Please contact us for more information.
  • Small numbers of works in this set can be retrieved through our Data API.
  • Bibliographic data for volumes in the set is available for download.
  • Researchers receiving datasets must sign and return this statement regarding management of the dataset.
  • If you would like a custom set of volumes drawn from this set, we will need you to provide a list of HathiTrust ids for the items. Please see the section on Custom Datasets below.
  • See "Retrieving Datasets via rsync" below for specific instructions on receiving datasets.

Google-digitized volumes

Description

Approximately 2.8 million public domain volumes as of February 2013, representing a wide variety of languages, subjects, and dates. See the visualizations of HathiTrust public domain volumes.

Access and Use

These volumes were digitized by Google and are available through an agreement with Google that must be signed on behalf of researchers by an institutional sponsor (someone with appropriate signing authority at a researcher's institution). In general, the limits on use of these materials are as follows:

  • May only be used for scholarly research purposes
  • May not be used commercially
  • May not be re-hosted or used to support publicly available search services
  • May not be shared with third parties

Obtaining Datasets

To begin the process of receiving datasets, researchers or institutional representatives should:

  1. Review the agreement with Google.
  2. Determine who the authorized sponsor at the institution will be (because of liability and indemnification terms in the agreement, legal counsel will likely be needed). Interested researchers should begin this process as soon as possible.
  3. Send a brief proposal to feedback@issues.hathitrust.org that specifies
    1. the authorized sponsor
    2. a characterization of the desired texts (please be specific about subjects, dates, languages, etc. If all texts are desired, please be clear about the reason all subjects, languages, dates, etc. are needed).
    3. volume ids for the specific volumes requested (see Custom Datasets below)
    4. what research is to be done
    5. what the results will be, and
    6. how they will be used
  4. Sign and return this statement related to the use and management of the dataset.
  5. When you submit your proposal, please indicate whether or not you give permission to share the proposal publicly. 

The following institutions have signed, or are in the process of signing, an agreement with Google for use of the texts. If you are a researcher affiliated with one of these institutions, you may proceed directly to submitting the proposal and we will be in touch with your institutional sponsor.

  • Carnegie Mellon University
  • Clemson University
  • Columbia University
  • Dartmouth College
  • Indiana University
  • McGill University
  • Michigan State University
  • New York University
  • Princeton University
  • Stanford University
  • University of California
  • University of Illinois
  • University of Massachusetts, Amherst
  • University of Massachusetts, Lowell
  • University of Michigan
  • Yale University

See "Retrieving Datasets via rsync" below for specific instructions on receiving datasets.

Custom Datasets

If you are not interested in receiving the entire set of non-Google-digitized or Google-digitized materials, we can assist in the creation of custom datasets. To do so, we need a list of the volume ids of desired volumes. Volume ids are present in the persistent identifiers for HathiTrust volumes (e.g., http://hdl.handle.net/2027/mdp.39015021715670). IDs can be retrieved in the following ways:

  1. Defining a query or set of queries from a catalog or full-text search (see the search portals on the home page). If the desired texts can be defined using facets or queries in a catalog or full-text search (e.g., English-language materials published before 1800), we can generate the list of ids that are needed.
  2. Building a collection in Collection Builder. Volumes can be saved to collections in batches from full-text search results or individually when viewing a given volume. We can retrieve the list of ids from one or more designated collections.
  3. HathiTrust's tab-delimited metadata files. These files are an inventory of repository holdings, containing a variety of identifiers for volumes (ISBN, LCCN, OCLC, etc.), copyright information, and limited bibliographic metadata for each volume in HathiTrust. Users wishing to select batches of materials that cannot be defined by search queries may find these files useful in selecting volumes. A description of the files is available at http://hathitrust.org/hathifiles_description.
  4. HathiTrust Data API. In addition to retrieving entire volume packages from HathiTrust (including images and OCR), the Data API can be used to find ids for volumes digitized from a particular source. The University of Michigan has built a demonstration application using the Data API that illustrates how this can be done. Please see http://www.lib.umich.edu/two-over-threehundred.

Please send lists of ids or queries to feedback@issues.hathitrust.org.
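
If you have collected persistent identifiers rather than bare volume ids, the ids can be extracted before sending the list. The following is a minimal sketch in Python, assuming every identifier has the same shape as the example above (a handle URL whose path begins with /2027/). The function names volume_id_from_handle and build_id_list are hypothetical and are not part of any HathiTrust tool.

```python
from urllib.parse import urlparse

# Path prefix seen in the example handle URL above; assumed to hold for all handles.
HANDLE_PREFIX = "/2027/"


def volume_id_from_handle(url):
    """Return the volume id embedded in a HathiTrust handle URL."""
    path = urlparse(url.strip()).path
    if not path.startswith(HANDLE_PREFIX):
        raise ValueError(f"not a recognized HathiTrust handle: {url!r}")
    volume_id = path[len(HANDLE_PREFIX):]
    # A volume id is expected to look like <namespace>.<local id>.
    if "." not in volume_id:
        raise ValueError(f"no namespace found in volume id: {volume_id!r}")
    return volume_id


def build_id_list(urls):
    """Extract volume ids from handle URLs, dropping duplicates but keeping order."""
    seen = set()
    ids = []
    for url in urls:
        volume_id = volume_id_from_handle(url)
        if volume_id not in seen:
            seen.add(volume_id)
            ids.append(volume_id)
    return ids
```

Calling build_id_list on a list of handle URLs returns the ids in their original order, ready to be written one per line to a text file.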

Retrieving Datasets via rsync

Researchers must provide a static non-NAT IP address in order for us to grant access to retrieve datasets using the rsync command line tool (rsync is our preferred method of transfer). Once the IP address is configured for access, you will be able to use the instructions below to retrieve the dataset:

You now have access to the <name of dataset> dataset from <IP address(es)>. The rsync command you should use is:

rsync --copy-links --delete --ignore-errors --recursive --times --verbose datasets.hathitrust.org::<name-of-dataset>/ /local/path/to/save

Most of those options are necessary, but a few are optional. For example, you could use --verbose only on updates to get a list of what changed. Let us know if you want more information about the options. rsync may report errors, most of which are likely to be harmless. We do not yet have a catalog of likely errors, so if you receive any, please forward them to us. Depending on the size of your dataset, the initial rsync may take a day or more. One benefit of rsync is that if the process is terminated, it does not transfer the same data again when you restart the sync: after comparing the source and target files, it picks up where it left off.

Our intention is to update the dataset regularly. If you have a download in progress while we make an update to a dataset, your rsync will be terminated. To recover, simply restart rsync. It will spend time comparing the source and destination trees, but it will not waste time downloading anything already present on the destination.
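
Because a terminated transfer can simply be restarted, the restart can be automated. The following is a minimal sketch in Python that builds the rsync command shown above and reruns it until it exits cleanly. The function names, the retry count, and the wait time are hypothetical choices. Treating every non-zero exit status as a reason to retry is also an assumption: some errors reported by rsync may be harmless, and others may need to be forwarded to us rather than retried.

```python
import subprocess
import time

# Options copied from the rsync command given above.
RSYNC_OPTIONS = [
    "--copy-links", "--delete", "--ignore-errors",
    "--recursive", "--times", "--verbose",
]


def build_rsync_command(dataset_name, local_path):
    """Return the rsync command as an argument list for subprocess."""
    source = f"datasets.hathitrust.org::{dataset_name}/"
    return ["rsync", *RSYNC_OPTIONS, source, local_path]


def sync_with_retries(dataset_name, local_path, max_attempts=5,
                      wait_seconds=60, run=subprocess.run, sleep=time.sleep):
    """Run rsync, restarting after a failure; return the number of attempts used.

    Assumption: any non-zero exit status is worth a retry.
    """
    command = build_rsync_command(dataset_name, local_path)
    for attempt in range(1, max_attempts + 1):
        result = run(command)
        if result.returncode == 0:
            return attempt
        if attempt < max_attempts:
            sleep(wait_seconds)  # pause before restarting the transfer
    raise RuntimeError(f"rsync still failing after {max_attempts} attempts")
```

A call such as sync_with_retries("<name-of-dataset>", "/local/path/to/save") would then take the place of running the command by hand. The run and sleep parameters exist only so the retry logic can be exercised without contacting the server.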

At the top level you will find a directory for each name space, a file containing a list of IDs in the dataset (id), and a file containing all the bibliographic data for the items in the dataset (meta.tar.gz). If you need more information on what this all means, let us know. The directory tree under each name space is in pairtree format (http://tools.ietf.org/html/draft-kunze-pairtree-01) based on the HathiTrust Identifier. Tools to help with pairtree usage can be found at https://confluence.ucop.edu/display/Curation/PairTree. Data for any given volume is found at the end of a pairtree directory structure created using the HathiTrust item identifier. For each item there are two files, a metadata file (<id>.mets.xml) and a zip file with the page text files (<id>.zip). Again, if you need more information about these files, let us know (feedback@issues.hathitrust.org).
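
To illustrate the layout, here is a minimal sketch in Python that maps a volume id to the expected location of its two files and reads the page text from the zip file. It applies the identifier cleaning rules from the pairtree draft cited above. Several details are assumptions that you should check against your own copy of the dataset: that the pairtree is built from the part of the id after the name space, that the tree sits under a directory named pairtree_root, that the file names use the cleaned form of the id, and that page text files inside the zip end in .txt. The function names are hypothetical.

```python
import zipfile
from pathlib import PurePosixPath

# Characters the pairtree draft hex-encodes as ^hh, plus its single-character swaps.
_HEX_ENCODE = set('"*+,<=>?\\^|')
_SWAP = {"/": "=", ":": "+", ".": ","}


def clean_pairtree_id(local_id):
    """Apply pairtree identifier cleaning to the part of the id after the name space."""
    out = []
    for byte in local_id.encode("utf-8"):
        ch = chr(byte)
        if byte < 0x21 or byte > 0x7E or ch in _HEX_ENCODE:
            out.append(f"^{byte:02x}")
        else:
            out.append(_SWAP.get(ch, ch))
    return "".join(out)


def volume_files(volume_id, root="pairtree_root"):
    """Return the relative paths of the METS file and the zip file for a volume.

    Assumption: the layout is <name space>/<root>/<2-char segments>/<cleaned id>/.
    """
    namespace, local_id = volume_id.split(".", 1)
    cleaned = clean_pairtree_id(local_id)
    pairs = [cleaned[i:i + 2] for i in range(0, len(cleaned), 2)]
    directory = PurePosixPath(namespace, root, *pairs, cleaned)
    return directory / f"{cleaned}.mets.xml", directory / f"{cleaned}.zip"


def read_pages(zip_path):
    """Return the text of each page file in the zip, sorted by member name."""
    with zipfile.ZipFile(zip_path) as archive:
        names = sorted(n for n in archive.namelist() if n.endswith(".txt"))
        return [archive.read(n).decode("utf-8", errors="replace") for n in names]
```

Under these assumptions, volume_files("mdp.39015021715670") points to mdp/pairtree_root/39/01/50/21/71/56/70/39015021715670/, and joining the returned zip path to your local dataset directory gives a path that read_pages can open.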