HTRC Collections and Tools

The HathiTrust Research Center makes available a toolkit of services for text data mining. HTRC Analytics is the main gateway for accessing the Research Center's offerings. You must log in to interact with much of the site.

Collections and data

Worksets

HTRC worksets are user-created collections of HathiTrust volumes to be treated as data and analyzed using HTRC tools and services. Worksets are curated by researchers, and they can be shared and cited to improve reproducibility.

Derived Data

HTRC Extracted Features is an open, large-scale dataset of features extracted from over 15 million volumes in HathiTrust. The dataset consists of one file per volume from the digital library. Each JSON-formatted file contains metadata about the volume, including bibliographic metadata, computationally inferred metadata about page contents and structure, and page-level tokens (words) and token counts.
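
As a rough illustration of working with one per-volume file, the sketch below sums page-level token counts into volume-level totals. The sample record is invented and heavily trimmed, and the field names it uses (`features`, `pages`, `body`, `tokenPosCount`) are assumptions made for this example; check them against the published Extracted Features schema before relying on them.

```python
import json
from collections import Counter

# Hypothetical, heavily trimmed stand-in for one per-volume JSON file.
# Field names are assumptions for illustration, not a schema reference.
SAMPLE_VOLUME = """
{
  "id": "example.000001",
  "metadata": {"title": "An Example Volume"},
  "features": {
    "pages": [
      {"seq": "00000001",
       "body": {"tokenPosCount": {"whale": {"NN": 2}, "the": {"DT": 3}}}},
      {"seq": "00000002",
       "body": {"tokenPosCount": {"whale": {"NN": 1}, "sea": {"NN": 4}}}}
    ]
  }
}
"""


def volume_token_counts(volume):
    """Sum page-level token counts into one Counter for the whole volume."""
    totals = Counter()
    for page in volume["features"]["pages"]:
        # A page may lack a body section; treat it as having no tokens.
        body = page.get("body") or {}
        for token, pos_counts in body.get("tokenPosCount", {}).items():
            # Each token maps part-of-speech tags to counts; add them up.
            totals[token.lower()] += sum(pos_counts.values())
    return totals


volume = json.loads(SAMPLE_VOLUME)
counts = volume_token_counts(volume)
print(volume["metadata"]["title"], counts.most_common(3))
```

For a real file, the inline string would be replaced by the parsed contents of a downloaded volume file. Because the dataset holds counts rather than running text, this kind of bag-of-words aggregation is the natural starting point.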

Tools and services

The HTRC offers a suite of tools for computational text analysis. These tools cover a wide variety of functions ranging from simple statistical analysis of words to complex algorithms relating concepts and meaning.

HTRC Analytics

HTRC Analytics is the primary site for interacting with HTRC. It provides access to HTRC worksets and off-the-shelf algorithms to analyze them. It also contains a dashboard where researchers can create a secure computing environment, called a Data Capsule (see below). Several of the HTRC algorithms are based on the Software Environment for the Advancement of Scholarly Research (SEASR, pronounced “Caesar”), a legacy project developed with funding from the Andrew W. Mellon Foundation.

HathiTrust+Bookworm

The HathiTrust+Bookworm visualization tool allows researchers to graph word trends across the HathiTrust corpus and facet their search by bibliographic metadata.

Data Capsules

The HTRC Data Capsules secure computing environment allows researchers to create a virtual machine desktop “capsule” that can be used to run customized research methods and tools not supported by the pre-built algorithms. Researchers control their research process while in a capsule, and only derived data may be released when they are finished.