HathiTrust makes the texts of public domain works in its corpus available for research purposes. The works fall into two categories: non-Google-digitized volumes, which are freely available, and Google-digitized volumes, which are available through an agreement with Google. These sets are described further below. Please contact hathitrust-datasets@umich.edu with questions and inquiries.
Non-Google-digitized volumes
Description
Approximately 350,000 public domain volumes as of February 2013, primarily, though not exclusively, English-language materials published prior to 1923.
Access and Use
- There are no restrictions on the availability or use of texts for non-Google-digitized public domain volumes.
Obtaining Texts
- A version of the set, updated on a quarterly basis, is available for download. Please contact us for more information.
- Small numbers of works in this set can be retrieved through our Data API.
- Bibliographic data for volumes in the set is available for download.
- If you would like a custom set of volumes drawn from this set, we will need you to provide a list of HathiTrust ids for the items. Please see the section on Custom Datasets below.
Google-digitized volumes
Description
Approximately 2.8 million public domain volumes as of February 2013, representing a wide variety of languages, subjects, and dates. See the visualizations of HathiTrust public domain volumes.
Access and Use
These volumes were digitized by Google and are available through an agreement with Google that must be signed on behalf of researchers by an institutional sponsor (someone with appropriate signing authority at a researcher's institution). In general, the limits on use of these materials are as follows:
- May be used only for scholarly research purposes
- May not be used commercially
- May not be re-hosted or used to support publicly available search services
- May not be shared with third parties
Obtaining Texts
To begin the process of receiving texts (or making them available to researchers), researchers or institutional representatives should:
- Review the agreement with Google.
- Determine who the authorized sponsor at the institution will be (because of liability and indemnification terms in the agreement, legal counsel will likely be needed). Interested researchers should begin this process as soon as possible.
- Send a brief proposal to hathitrust-datasets@umich.edu that specifies:
- the authorized sponsor
- a characterization of the desired texts (please be specific about subjects, dates, languages, etc. If all texts are desired, please be clear about the reason all subjects, languages, dates, etc. are needed).
- volume ids for the specific volumes requested (see Custom Datasets below)
- what research is to be done
- what the results will be, and
- how they will be used
When you submit your proposal, please indicate whether or not you give permission to share the proposal publicly.
The following institutions have signed, or are in the process of signing, an agreement with Google for use of texts. If you are a researcher affiliated with one of these institutions, you may proceed directly to submitting the proposal and we will be in touch with your institutional sponsor.
- Clemson University
- Columbia University
- Dartmouth College
- Indiana University
- McGill University
- Michigan State University
- New York University
- University of California
- University of Illinois
- University of Michigan
- Yale University
Custom Datasets
If you are not interested in receiving the entire set of non-Google-digitized or Google-digitized materials, we can assist in the creation of custom datasets. To do so, we need a list of the volume ids of desired volumes. Volume ids are present in the persistent identifiers for HathiTrust volumes (e.g., http://hdl.handle.net/2027/mdp.39015021715670). IDs can be retrieved in the following ways:
- Defining a query or set of queries from a catalog or full-text search (see the search portals on the home page). If the desired texts can be defined using facets or queries in a catalog or full-text search (e.g., English-language materials published before 1800), we can generate the list of ids that are needed.
- Building a collection in Collection Builder. Volumes can be saved to collections in batches from full-text search results or individually when viewing a given volume. We can retrieve the list of ids from one or more designated collections.
- HathiTrust's tab-delimited metadata files. These files are an inventory of repository holdings, containing a variety of identifiers for volumes (ISBN, LCCN, OCLC, etc.), copyright information, and limited bibliographic metadata for each volume in HathiTrust. Users wishing to select batches of materials that cannot be defined by search queries may find these files useful in selecting volumes. A description of the files is available at http://hathitrust.org/hathifiles_description.
- HathiTrust Data API. In addition to retrieving entire volume packages from HathiTrust (including images and OCR), the Data API can be used to find ids for volumes digitized from a particular source. The University of Michigan has built a demonstration application using the Data API that illustrates how this can be done. Please see http://www.lib.umich.edu/two-over-threehundred.
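If you have already gathered persistent identifiers for the volumes you want, the volume ids can be pulled out of them with a short script. The following is a minimal sketch in Python, assuming that every identifier follows the pattern of the example above (the id is everything after `/2027/` in the handle URL). The function names are illustrative and are not part of any HathiTrust tool.

```python
from urllib.parse import urlparse

# Assumption: the handle prefix is the one seen in the example URL
# http://hdl.handle.net/2027/mdp.39015021715670
HANDLE_PREFIX = "/2027/"


def volume_id_from_handle(url):
    """Return the volume id embedded in a single handle URL."""
    path = urlparse(url.strip()).path
    if not path.startswith(HANDLE_PREFIX):
        raise ValueError(f"not a recognized handle URL: {url!r}")
    volume_id = path[len(HANDLE_PREFIX):]
    if not volume_id:
        raise ValueError(f"no volume id found in: {url!r}")
    return volume_id


def volume_ids_from_handles(urls):
    """Return unique volume ids in first-seen order, skipping blank lines."""
    seen = set()
    ids = []
    for url in urls:
        if not url.strip():
            continue
        volume_id = volume_id_from_handle(url)
        if volume_id not in seen:
            seen.add(volume_id)
            ids.append(volume_id)
    return ids


if __name__ == "__main__":
    handles = ["http://hdl.handle.net/2027/mdp.39015021715670"]
    # Print one id per line, ready to paste into a request.
    print("\n".join(volume_ids_from_handles(handles)))
```

The resulting list, one id per line, is the kind of list requested for a custom dataset.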
Please send lists of ids or queries to hathitrust-datasets@umich.edu.
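For selections based on the tab-delimited metadata files, a short script can narrow the inventory to a list of ids before you send it. The sketch below, in Python, is illustrative only: the column positions and the sample rows are hypothetical, so please check the file description linked above for the actual layout before adapting it.

```python
import csv


def select_volume_ids(lines, id_column, conditions):
    """Return ids from tab-delimited lines whose columns pass every condition.

    lines      -- iterable of tab-delimited text lines (e.g., an open file)
    id_column  -- zero-based position of the volume id (an assumption;
                  confirm it against the file description)
    conditions -- dict mapping a zero-based column position to a function
                  that returns True when the value in that column is wanted
    """
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    highest_needed = max([id_column, *conditions])
    selected = []
    for fields in reader:
        if len(fields) <= highest_needed:
            continue  # skip blank or truncated rows
        if all(test(fields[col]) for col, test in conditions.items()):
            selected.append(fields[id_column])
    return selected


if __name__ == "__main__":
    # Made-up rows in a hypothetical three-column layout:
    # id, rights code, language code.
    sample = [
        "test.0001\tpd\teng\n",
        "test.0002\tic\teng\n",
        "test.0003\tpd\tfre\n",
    ]
    ids = select_volume_ids(
        sample,
        id_column=0,
        conditions={1: lambda v: v == "pd", 2: lambda v: v == "eng"},
    )
    print("\n".join(ids))
```

Passing an open file object instead of the sample list lets the same function read a downloaded metadata file line by line without loading it all into memory.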