Datasets
HathiTrust is making sample datasets of two different sizes available to researchers for computational processing and analysis. The first sample is available to all researchers through an application process (see below). The second sample is available to participants in the Digging Into Data Challenge.
Sample 1
Texts in all bundles are pre-1923 (pre-1869 for works published outside of the United States). The first sample is composed of 5,000 texts*, which may be requested in one of three bundles:
- A random sample representing 4 character sets and 5 languages (Arabic, English, French, Japanese, and Russian)
- A random sample of English language literary and historical texts
- A random sample of Classics and ancient religious texts, including original language texts and translations
*Note that the Classics and ancient religious texts dataset has only 2,540 volumes, because the repository does not currently hold 5,000 texts in this category. Volumes will be added to this dataset as more enter the repository.
Download Multilingual Dataset Part 1
Download Multilingual Dataset Part 2
Download MARCXML for Multilingual Dataset
Download English Language Dataset
Download MARCXML for English Language Dataset
Download Classics/Ancient Dataset
Download MARCXML for Classics/Ancient Dataset
Sample 2 - Digging Into Data
A corpus of 50,000 volumes will be made available for the Digging Into Data Challenge. The corpus represents a mix of dates (as above, all pre-1923, and pre-1869 for materials published outside the United States), countries of origin, languages, character sets, and formats (i.e., some serial literature in a body of mostly monographic literature). Though the corpus is delivered as a single package, each volume will be in its own directory, associated with a METS file. The structure of the METS file is given below. Descriptive metadata such as authors and titles will not be provided with the data, but the unique identifier for each volume can be used to gather bibliographic data through the steps below:
Retrieving Bibliographic Information
Download MARCXML for Digging Into Data Dataset
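As a rough illustration of how a volume's unique identifier might be matched to bibliographic data in a MARCXML file, the Python sketch below builds a lookup from identifier to title. It is an assumption-laden example, not HathiTrust's documented procedure: the function names (subfields, index_titles) and the sample record are invented for illustration, and the field holding the volume identifier (datafield 974, subfield u) is an assumption that should be checked against the downloaded file. The MARCXML namespace and the use of field 245, subfield a for the title follow the MARC 21 standard.

```python
import xml.etree.ElementTree as ET

MARC_URI = "http://www.loc.gov/MARC21/slim"  # standard MARCXML namespace
MARC_NS = {"marc": MARC_URI}

# ASSUMPTION: the volume identifier lives in datafield 974, subfield u.
# Verify this against the actual MARCXML file before relying on it.
ID_TAG, ID_CODE = "974", "u"


def subfields(record, tag, code):
    """Return the text of every matching subfield in one MARC record."""
    path = f"marc:datafield[@tag='{tag}']/marc:subfield[@code='{code}']"
    return [(el.text or "").strip() for el in record.findall(path, MARC_NS)]


def index_titles(root, id_tag=ID_TAG, id_code=ID_CODE):
    """Map each volume identifier to the title (245 $a) of its record."""
    index = {}
    # iter() also matches the root itself, so a lone <record> works too.
    for record in root.iter(f"{{{MARC_URI}}}record"):
        titles = subfields(record, "245", "a")
        title = titles[0] if titles else None
        for volume_id in subfields(record, id_tag, id_code):
            index[volume_id] = title
    return index


# Invented sample record, for illustration only.
SAMPLE = """<collection xmlns="http://www.loc.gov/MARC21/slim">
  <record>
    <datafield tag="245" ind1="1" ind2="0">
      <subfield code="a">An example title /</subfield>
    </datafield>
    <datafield tag="974" ind1=" " ind2=" ">
      <subfield code="u">example.0001</subfield>
    </datafield>
  </record>
</collection>"""

titles_by_id = index_titles(ET.fromstring(SAMPLE))
print(titles_by_id)
```

For a downloaded file, the same function could be called on ET.parse(path).getroot(), where path is the local location of the MARCXML file. A record with several volume identifiers yields one entry per identifier, all pointing to the same title.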
Data API
A data-oriented API will be available soon. It will allow researchers to retrieve small portions of these texts on request (for example, one page at a time). The spec for this API is available now.
Application and Use
Individual texts in HathiTrust have different conditions of use, but many may not be redistributed or used in commercial applications. Researchers using either sample should submit a description of the research they wish to conduct and a brief form statement (printed below) confirming that they intend to use the dataset for research purposes and will not redistribute the texts in whole or in part. Once the statement has been submitted on official letterhead and approved by HathiTrust, the texts will be made available to the researchers for download.
Contact
Inquiries and applications should be sent to hathitrust-datasets@umich.edu.

