Datasets

HathiTrust makes the text data of public domain works in its collection available to researchers to bulk download directly, for non-commercial research purposes. If you wish to access a dataset, complete the approval process described in more detail below.

These datasets are meant for researchers seeking large numbers of public domain texts they wish to analyze locally. HathiTrust also offers other data access and analysis options, including:

  • The Data API for retrieving a relatively small number of volumes’ text data and page images. The practical limit for retrieving items through the Data API is about 10,000 volumes. Volumes digitized by Google are not available via the Data API.
  • The HathiTrust Research Center tools and services for text data mining of any item in the HathiTrust corpus. Access to HTRC tools is separate from the dataset approval process.

Learn more about the kinds of research that dataset recipients have published.

Available Datasets

The available datasets are divided between:

  • Public domain volumes not digitized by Google: available for non-commercial research by request
  • Public domain volumes digitized by Google: available for non-commercial research by request for those whose institution has signed a distribution agreement with Google

Regardless of who digitized the data, HathiTrust distinguishes between works that are public domain only in the US, and thus available only to researchers in the United States, and works that are public domain worldwide and available outside of the United States. Access to data in a dataset follows HathiTrust's policy for access to public domain materials. The worldwide public domain datasets are subsets of the volumes available to researchers in the United States.

Table 1. The datasets of full text available to qualified researchers with approval from HathiTrust.

Public domain text, excluding Google-digitized volumes

Dataset for researchers in the U.S.: ht_text_pd_open_access
Dataset for researchers outside the U.S.: ht_text_pd_world_open_access

  • 814,045 public domain and Creative Commons-licensed volumes as of February 2019 for researchers in the United States (610,575 for researchers outside the United States). Total dataset is at least 480GB for researchers in the United States (351GB for researchers outside the United States).
  • Due to variation in the length of copyright protection in different nations, the version of this dataset available to researchers outside the United States represents only the public domain volumes that are public domain anywhere in the world, a subset of HathiTrust’s overall public domain corpus. Learn more about HathiTrust’s copyright determination policies.

All public domain text, including Google-digitized volumes

Dataset for researchers in the U.S.: ht_text_pd
Dataset for researchers outside the U.S.: ht_text_pd_world

  • 6,649,535 public domain and Creative Commons-licensed volumes as of February 2019 for researchers in the United States (4,316,648 for researchers outside the United States). Total dataset is at least 5.4TB for researchers in the United States (3.4TB for researchers outside the United States).
  • Due to variation in the length of copyright protection in different nations, the version of this dataset available to researchers outside the United States represents only the volumes that are public domain anywhere in the world, a subset of HathiTrust’s overall public domain corpus. Learn more about HathiTrust’s copyright determination policies.
  • Includes volumes digitized by Google and those digitized by other parties.
  • Represents a wide variety of languages, subjects, and dates. See the visualizations of HathiTrust public domain volumes.
  • Volumes digitized by Google are available through a Distribution Agreement with Google, which an institutional sponsor must sign before researchers at that institution can access them. An institutional sponsor is someone with appropriate contractual signing authority at a researcher's institution. The Google Distribution Agreement needs to be signed only once per institution; once it has been received, it applies to all subsequent dataset requests from researchers at that institution. Researchers are responsible for understanding and abiding by the terms of the Google Distribution Agreement. In general, the limits on the use of Google-digitized materials are as follows:
    • Can be used only for scholarly research purposes
    • May not be used commercially
    • May not be re-hosted or used to support search services
    • May not be shared with third parties

Accessing a subset of volumes from a dataset

If you do not need the entire set of volumes described above, you must limit your sync to only the volumes you are interested in within one of the larger sets.

To do so, you will need a list of the HathiTrust volume identifiers of desired volumes. Volume IDs are present in the persistent identifiers for HathiTrust volumes (e.g., for the volume at http://hdl.handle.net/2027/mdp.39015021715670, the volume ID is mdp.39015021715670).
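
As a small illustration of the identifier convention above, the following Python sketch extracts the volume ID from a handle URL. It assumes the handle path always begins with the 2027 prefix shown in the example; the function name is invented for this illustration.

```python
from urllib.parse import unquote, urlparse

# Assumption: HathiTrust handles use the "2027" prefix, as in the example above.
HANDLE_PREFIX = "/2027/"


def volume_id_from_handle(url: str) -> str:
    """Return the volume ID embedded in a HathiTrust handle URL."""
    path = urlparse(url).path
    if not path.startswith(HANDLE_PREFIX):
        raise ValueError(f"not a HathiTrust handle URL: {url!r}")
    volume_id = unquote(path[len(HANDLE_PREFIX):])
    if not volume_id:
        raise ValueError(f"handle URL has no volume ID: {url!r}")
    return volume_id


if __name__ == "__main__":
    print(volume_id_from_handle("http://hdl.handle.net/2027/mdp.39015021715670"))
```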

A list of volume IDs can be generated in one of the following ways:

  • Using HathiTrust's tab-delimited metadata files, called “Hathifiles”. These files are an inventory of repository holdings, containing a variety of identifiers for volumes (ISBN, LCCN, OCLC, etc.), copyright information, and limited bibliographic metadata for each volume in HathiTrust. Users wishing to select batches of materials that cannot be defined by search queries may find these files useful in selecting volumes. A description of the files is available at https://hathitrust.org/hathifiles_description.
  • Via the HathiTrust Bibliographic API. Can be used to find HathiTrust volume IDs for volumes with a known alternate identifier, such as ISBN or OCLC number.
  • Building a collection in HathiTrust Collection Builder. Define a query or set of queries from a catalog or full-text search (see the search options on the HathiTrust home page) and add volumes to your collection. You can download the volume IDs for a collection.
  • HTRC’s Beta Workset Builder 2.0. A beta tool and interface built over the HTRC Extracted Features Dataset to enable both volume-level metadata search and volume- and page-level unigram (single word) text search in order to build worksets for use with HTRC tools and services.

Once you have a list of volume IDs, you can use it to generate volume paths for the rsync, as detailed in the rsync instructions described below.
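
For the Hathifiles route, a minimal Python sketch of the filtering step might look like the following. It assumes that the volume ID is in the first tab-delimited column and a rights code in the third; confirm both positions against the Hathifiles description before relying on them. The function name, the default rights code, and the file name in the usage note are placeholders.

```python
import csv
from typing import Iterable, Iterator


def volume_ids_from_hathifile(
    lines: Iterable[str],
    id_col: int = 0,                # assumed position of the volume ID
    rights_col: int = 2,            # assumed position of the rights code
    wanted_rights: frozenset = frozenset({"pd"}),
) -> Iterator[str]:
    """Yield volume IDs from tab-delimited rows whose rights code is wanted."""
    # QUOTE_NONE: quotation marks inside titles are ordinary characters here.
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) <= max(id_col, rights_col):
            continue  # skip short or malformed rows
        if row[rights_col] in wanted_rights:
            yield row[id_col]
```

To use it on a downloaded, gzip-compressed file (the name below is a placeholder), open the file in text mode, for example with `gzip.open("hathifile.txt.gz", "rt", encoding="utf-8")`, and pass the file object as `lines`.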

For guidance and support crafting a list of volume IDs, please email feedback@issues.hathitrust.org.
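
For the Bibliographic API route, the lookup can be sketched in Python as below. The URL pattern and the response shape (a top-level "items" list whose entries carry an "htid" field) are assumptions based on the API's brief JSON output and should be checked against the Bibliographic API documentation. The function names are invented for this sketch.

```python
import json
from urllib.request import urlopen

# Assumed URL pattern for brief records; verify against the API documentation.
BIB_API_TEMPLATE = (
    "https://catalog.hathitrust.org/api/volumes/brief/{id_type}/{value}.json"
)


def htids_from_response(payload: dict) -> list:
    """Collect volume IDs from a decoded response, skipping items without one."""
    return [item["htid"] for item in payload.get("items", []) if "htid" in item]


def lookup_htids(id_type: str, value: str, timeout: float = 30.0) -> list:
    """Query the API for one identifier (e.g. id_type='oclc') and return volume IDs."""
    url = BIB_API_TEMPLATE.format(id_type=id_type, value=value)
    with urlopen(url, timeout=timeout) as response:
        return htids_from_response(json.load(response))
```

Keeping the parsing separate from the network call makes it easy to test the parsing on saved responses and to add rate limiting around the lookup.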

Data that is not available

  • Data that is neither in the public domain nor available under a Creative Commons license. Text data for all HathiTrust volumes, including those that are not in the public domain, can be analyzed with non-consumptive research methods, using the tools and derived datasets from the HathiTrust Research Center, under certain access conditions.
  • Volumes that are restricted due to privacy concerns, such as those containing personally identifiable information (e.g., Social Security numbers paired with individual names).
  • Page images. (Only provided in limited cases, and by special approval from HathiTrust.)

Requesting and retrieving a dataset

Datasets are available from HathiTrust only by request. If your request is approved, then you will be given access to an rsync endpoint from which you can transfer and synchronize the files. Rsync is a protocol for transferring files. See "Retrieving Datasets via rsync" for specific instructions on downloading datasets.

If you want to ask questions prior to submitting the research proposal described in step 3 below, please send an email to feedback@issues.hathitrust.org to start a conversation.

The request process

  1. Determine which dataset you would like to request access to. Even if you want to sync a subset of volumes, you will be granted access to the endpoint for the entire dataset. Your options are:
    • Public domain, excluding Google-digitized volumes: all available volumes you can access based on your location, but does not include Google-digitized volumes. Datasets are either ht_text_pd_open_access (for US-based researchers) or ht_text_pd_world_open_access (for researchers outside the US).
    • All public domain, including Google-digitized volumes: all available volumes you can access based on your location, including Google-digitized volumes. Datasets are either ht_text_pd (for US-based researchers) or ht_text_pd_world (for researchers outside the US).
  2. For those requesting Google-digitized volumes, check whether your institution has signed the Distribution Agreement with Google.
    • If not, determine who the authorized sponsor is at your institution. Because of liability and indemnification terms in the agreement, review from legal counsel could be needed. Interested researchers should begin this process as soon as possible. Contact legal counsel at Google, Tamar Fruchtman, tamarf@google.com, to begin the process.
    • If so, review the Distribution Agreement with Google to understand what your responsibilities are.
  3. Send a brief proposal in an email message to feedback@issues.hathitrust.org that specifies:
    • The name, institutional affiliation, and country of residence of all researchers who will contribute to the research project
    • The dataset to which you would like to have rsync access
      • Public domain, excluding Google-digitized volumes (ht_text_pd_open_access or ht_text_pd_world_open_access)
      • All public domain, including Google-digitized volumes (ht_text_pd or ht_text_pd_world)
      • Note: If you would like to sync only a subset of data relevant to your work but are unsure which of the datasets contains the items you desire, provide either a characterization of the desired texts (dates, languages, subjects, etc.) or details about the specific volumes requested (either in the form of a link to a public HathiTrust Collection or a list of HathiTrust volume IDs). Someone will help you determine which dataset is both best for your needs and is available to you based on the access limitations described elsewhere on this page.
    • What research is to be done
    • What the result outputs will be
    • How the research outputs will be used
    • Whether or not you give permission for HathiTrust to share the details of your proposal
      • Note: With your permission, we may share your proposal with other researchers as an example if asked, or describe the details of your proposal to other HathiTrust users. We will not post your proposal online. In all cases, we reserve the right to describe how datasets have been used by recipients, based on publicly-available information, including linking to research outputs from HathiTrust dataset usage.
  4. Sign and return the researcher agreement that applies to the data you are requesting.
  5. When your request is approved, we will ask you to send a static IP address. Once your access is cleared, you will be able to rsync the data you have requested, but only from that IP address.

Institutions with a signed Google Distribution Agreement

The following institutions have signed a Distribution Agreement with Google for controlled use of texts.

  • Arizona State University
  • Boston University
  • Carnegie Mellon University
  • Clemson University
  • Columbia University
  • Cornell University
  • Dartmouth College
  • Georgetown University
  • Georgia Institute of Technology
  • Harvard College
  • Indiana University
  • Lehigh University
  • McGill University
  • Michigan State University
  • Northeastern University
  • Pennsylvania State University
  • Princeton University
  • Purdue University
  • Queen's University at Kingston
  • San José State University
  • Stanford University
  • State University of New York at Stony Brook
  • Texas A&M University
  • University of California
  • University of Chester
  • University of Chicago
  • University of Colorado
  • University of Illinois
  • University of Iowa
  • University of Massachusetts, Lowell
  • University of Michigan
  • University of Oxford
  • University of Toronto
  • University of Vermont
  • University of Virginia
  • University of Waikato
  • Westfälische Wilhelms-Universität Münster (University of Münster)
  • Yale University

    Deletion notifications

    Occasionally we determine that items have erroneously been identified as public domain. When this happens we close the item and update the rights status of volumes in the public domain datasets. Dataset recipients are then responsible for deleting those items from their dataset. Dataset recipients are added to an email list, from which they receive emails when there are files to delete. It is HathiTrust’s expectation that you will remove them from your dataset, and this requirement is stipulated in the agreements signed by each recipient.

    HathiTrust expects that you will either continue with the deletions for the long term, or delete the entire dataset from your machine. If you want to stop receiving the deletion notifications, then you can delete the dataset and let HathiTrust know by emailing feedback@issues.hathitrust.org.
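
One way to act on a deletion notification is sketched below in Python. It assumes that the volume IDs to delete have been collected into a list and that each volume sits in its own directory under `<namespace>/pairtree_root/...`; both the directory layout and the function names are assumptions of this sketch, not a HathiTrust tool.

```python
import shutil
from pathlib import Path

# Pairtree identifier cleaning: hex-encode these characters, then map / : .
_HEX_CHARS = set('"*+,<=>?\\^|')
_CHAR_MAP = str.maketrans({"/": "=", ":": "+", ".": ","})


def volume_dir(root, htid: str) -> Path:
    """Return the directory assumed to hold one volume's files."""
    namespace, _, local_id = htid.partition(".")
    if not namespace or not local_id:
        raise ValueError(f"not a volume ID: {htid!r}")
    encoded = "".join(
        f"^{b:02x}" if b < 0x21 or b > 0x7E or chr(b) in _HEX_CHARS else chr(b)
        for b in local_id.encode("utf-8")
    )
    cleaned = encoded.translate(_CHAR_MAP)
    pairs = [cleaned[i:i + 2] for i in range(0, len(cleaned), 2)]
    return Path(root, namespace, "pairtree_root", *pairs, cleaned)


def delete_volumes(root, htids, dry_run: bool = True) -> list:
    """Remove local copies of the listed volumes; return the IDs that were found.

    With dry_run=True (the default) nothing is deleted, so the result can be
    reviewed before running again with dry_run=False.
    """
    found = []
    for htid in htids:
        target = volume_dir(root, htid)
        if target.is_dir():
            if not dry_run:
                shutil.rmtree(target)
            found.append(htid)
    return found
```

The dry-run default is deliberate: it lets you compare the returned IDs with the notification before anything is removed.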

    Retrieving Datasets via rsync

    Datasets are retrieved using rsync.

    During the dataset approval process, you must provide a static non-NAT IP address in order for us to grant access to retrieve datasets using the rsync command line tool.

    Once the IP address is configured for access, you may follow these instructions to rsync the dataset. For those syncing a subset of volumes, the instructions include steps to generate the file paths for volumes based on their HathiTrust ID.

    Most of the options in the gist are necessary, but a few are optional. For example, you could use the option --verbose only on updates to get a list of what changed. Please contact HathiTrust if you want more information about the options.

    It's possible that rsync will report errors, and quite likely that most will be harmless. We do not have a catalog of likely errors, so if you receive some, please forward them to us.

    Depending on the size of your dataset, the initial rsync may take a day or more. One benefit of rsync is that if the process is terminated, you will not waste time transferring the same information again when you restart the sync. It will pick up where it left off, after comparing the source and target files.

    We update the datasets regularly as new materials are added to HathiTrust. If you have a download in progress while we make an update to a dataset, your rsync will be terminated. To recover, simply restart rsync. It will spend time comparing the source and destination trees, but it will not waste time downloading anything already present on the destination.

    Data format

    Once you sync the dataset, you will find a directory for each namespace at the top level, a file containing a list of the volume IDs in the dataset (id), and one or more files containing the bibliographic data for the items in the dataset (meta*.json.gz).

    The directory tree under each namespace is in pairtree format based on the HathiTrust volume IDs. Tools to help with pairtree usage can be found at https://confluence.ucop.edu/display/Curation/PairTree. Data for any given volume is found at the end of a pairtree directory structure created using the HathiTrust IDs.
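
To make the pairtree layout concrete, the Python sketch below turns a volume ID into a relative file path. It follows the identifier-cleaning rules of the pairtree specification and assumes that each namespace directory contains a `pairtree_root` directory; check that assumption against your synced copy. The function names are invented for this illustration.

```python
from pathlib import PurePosixPath

# Characters the pairtree specification hex-encodes as ^hh before mapping.
_HEX_CHARS = set('"*+,<=>?\\^|')
# Single-character substitutions applied after hex-encoding.
_CHAR_MAP = str.maketrans({"/": "=", ":": "+", ".": ","})


def pairtree_clean(local_id: str) -> str:
    """Apply pairtree identifier cleaning to the part of the ID after the namespace."""
    encoded = []
    for byte in local_id.encode("utf-8"):
        ch = chr(byte)
        if byte < 0x21 or byte > 0x7E or ch in _HEX_CHARS:
            encoded.append(f"^{byte:02x}")
        else:
            encoded.append(ch)
    return "".join(encoded).translate(_CHAR_MAP)


def volume_path(htid: str, extension: str = "zip") -> str:
    """Build the relative path of a volume file from its HathiTrust ID."""
    namespace, _, local_id = htid.partition(".")
    if not namespace or not local_id:
        raise ValueError(f"not a volume ID: {htid!r}")
    cleaned = pairtree_clean(local_id)
    # Pairtree splits the cleaned ID into two-character directory names.
    pairs = [cleaned[i:i + 2] for i in range(0, len(cleaned), 2)]
    return str(
        PurePosixPath(namespace, "pairtree_root", *pairs, cleaned, f"{cleaned}.{extension}")
    )


if __name__ == "__main__":
    print(volume_path("mdp.39015021715670"))
    # mdp/pairtree_root/39/01/50/21/71/56/70/39015021715670/39015021715670.zip
```

Passing `extension="mets.xml"` gives the path of the metadata file in the same directory.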

    For each item there are two files, a metadata file (<id>.mets.xml) and a zip file with the page text files (<id>.zip).
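
Reading the page text from a volume's zip file can be sketched in Python as follows. The sketch assumes that each page is stored as a UTF-8 `.txt` member and that sorting the member names gives page order; the function name is invented here.

```python
import zipfile


def read_pages(zip_path) -> list:
    """Return the text of every .txt member in a volume zip, sorted by member name."""
    with zipfile.ZipFile(zip_path) as archive:
        names = sorted(
            name for name in archive.namelist() if name.lower().endswith(".txt")
        )
        # errors="replace" keeps going if a page contains undecodable bytes.
        return [archive.read(name).decode("utf-8", errors="replace") for name in names]
```

Joining the returned list with newlines gives the full text of the volume as a single string.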

    If you need more information about the files or file structure, let us know (feedback@issues.hathitrust.org).

    Other policies and procedures

    Dispersed research teams

    If the primary researcher on the request is from an institution with a signed Google Distribution Agreement and/or inside the U.S. and another team member is not:

    • The data with the higher approval level may be rsynced only to the primary researcher
    • Others on the project team can be allowed access to the data by the primary researcher, but only without moving the data off the primary researcher's servers. There may be no redistribution of the data, and no copying or moving of the data to the servers of research team members located in other jurisdictions.
    • If the research team cannot agree to this, we will authorize the entire team for the lower access level only (e.g., non-Google Digitized, or public domain worldwide items).

    Sharing results

    Dataset recipients are welcome to share the results of any research that uses the dataset.

    Preferred citation format

    The citation should contain a few key components:

    • An author name. If you are using a full dataset
    • Recognition that the data was received from HathiTrust.
    • A title describing your dataset.
    • If you are working with a subset of volumes, link to the HathiTrust collection, HTRC workset, or volume ID list that corresponds to your dataset.
    • An access date.

    HathiTrust. TITLE DESCRIBING YOUR DATASET. Distributed by HathiTrust (https://www.hathitrust.org/datasets). LINK TO COLLECTION/WORKSET/VOLUME ID LIST. Accessed DATE.

    Example for dataset corresponding to HathiTrust collection:
    HathiTrust. Adventure Novels: G.A. Henty. Distributed by HathiTrust (https://www.hathitrust.org/datasets). https://babel.hathitrust.org/cgi/mb?a=listis;c=464226859. Accessed December 1, 2019.