HathiTrust Research Center
Non-Consumptive Use Policy
Created by: HTRC Task Force for Non-Consumptive Research Use Policy
Approved: 20 Feb 2017, HathiTrust Research Center Executive Management
The HathiTrust (HT) is a partnership of research institutions and libraries operating a shared repository of cultural heritage materials. HathiTrust preserves and provides access to digitized library collections. The HathiTrust Research Center (HTRC) leverages the scope and scale of the repository to develop avenues for non-consumptive research of the HathiTrust Digital Library. The user modalities for non-consumptive research making up the HTRC include:
- Web-accessible data analysis and visualization tools. Provided by the HTRC, these tools allow researchers to assemble collections (worksets) of volumes and analyze them using HTRC-supported off-the-shelf algorithms and visualization interfaces. No substantial portion of text in any individual volume is revealed by interaction with these tools.
- Derived downloadable datasets. Most notably, the HTRC Extracted Features dataset is derived from bibliographic and paratextual metadata and includes part-of-speech-tagged unigram counts. No substantial portion of the text in any individual volume is revealed in any derived downloadable dataset that the HTRC provides for use.
- HTRC Data Capsules. A system that grants a user access to a virtual machine: a dedicated, secure desktop environment (called a “Capsule”) that exists within the HTRC’s secure compute environment located in the United States. Through a Capsule, a user can carry out non-consumptive research on the HT collection using HTRC-provided tools or their own data analysis and visualization tools. Technical and policy constraints (outlined in this document) work together to guarantee that resulting data products are non-consumptive exports and that improper outputs (e.g., leaks) are prevented.
This policy document defines non-consumptive research and non-consumptive exports as implemented for non-profit research and educational analytical use of the HT collection. The HathiTrust-provided web-accessible data analysis and visualization tools and derived downloadable datasets (the first two modalities listed above) are inherently non-consumptive, and the allowable data exports (datasets and tool outputs) have been pre-verified to comply with the following policy. In contrast, HTRC Data Capsules give a user direct access to the HT collection and flexibility in the choice of analysis tools. Any data export that a user requests to release from a Capsule must be a non-consumptive data export in compliance with the following policy.
This policy defines Non-consumptive Research, Non-consumptive Exports, and the levels of access permitted at various stages of this research using HTRC Data Capsules. Its goal is to ensure that the HTRC is facilitating the widest possible variety of non-consumptive research and educational use with the HT collection while remaining clearly within the bounds of the fair use rights courts have recognized as applying to this type of activity. More generally, the policy aims to achieve the same goals as copyright itself: to promote progress in the discovery and spread of knowledge, without harming the commercial interests of authors, publishers, and other stakeholders.
- Non-consumptive Research (also called “non-consumptive analytics”) means research in which computational analysis is performed on one or more volumes (textual or image objects) in the HT collection, but not research in which a researcher reads or displays substantial portions of an in-copyright or rights-restricted volume to understand the expressive content presented within that volume. Non-consumptive analytics includes such computational tasks as text extraction, textual analysis and information extraction, linguistic analysis, automated translation, image analysis, file manipulation, OCR correction, and indexing and search.
- “Substantial portion” means a portion of an individual volume sufficient in quality or quantity to provide a substitute for access to the volume’s expressive content. A portion that merely reveals factual information (about the work or about the world) is not thereby a substitute for access to the volume’s original expressive content.
- A Capsule is not a reading app and should only be used for non-consumptive research purposes as defined in Section 1. Examples of acceptable in-capsule uses of corpus text that may facilitate non-consumptive research include referrals to specific passages in order to verify or evaluate results, to develop and revise algorithms for processing the text, and to select appropriate short quotes as necessary examples in reporting the research, as may be supported by fair use.
- The responsibility for the fair use judgment with respect to the types of research resides with the researcher.
- By using the HTRC Data Capsules environment, users acknowledge that their actions in the Capsule, including their use of the HTRC Data API, may be monitored solely to verify proper use as it is described in this policy and to improve HathiTrust services.
- HTRC employs manual or automated means to review data exports from a Capsule prior to their release to the user solely to ensure that a substantial portion as defined above is not released from the HTRC Trust Ring.
- Where interpretations of “substantial portion” differ, or in any other disagreement about a Non-consumptive Export, the HTRC reserves the right to refuse release of a data export that it deems not to be in compliance with this policy.
- Non-consumptive Export is a data product emerging as an output of computational analysis that meets the criteria for being non-consumptive, that is, it would pass the HTRC manual or automated results review. Web-browser accessible tools are pre-confirmed to produce only Non-consumptive Exports before their hosting is allowed in the HTRC secure compute environment. Non-consumptive Exports created through Capsule use are released from a Capsule through a specific action by the Capsule user. Non-consumptive Exports from a Capsule must be in human-readable form (such as Unicode or ASCII). These exports undergo manual or automated review by HTRC prior to release in order to affirm compliance with this policy for non-consumptive research.
Non-consumptive Exports may include, but are not limited to:
- Statistical summaries and derived data, which consist of factual information about volumes in the HT collection such as token counts, topic models, extracted named entities, N-grams, and visualizations, in a non-consumptive form such that, to the best of the HTRC’s knowledge, the information cannot easily be processed to reconstruct a substantial portion of the original expression of any individual volume.
- Keywords-in-context or concordances, which consist of excerpts from volumes in the HT collection that may contain expression protected by copyright and shall be released in quality and quantity such that, to the best of the HTRC’s knowledge, the excerpts cannot be combined to reveal a substantial portion of the original expression of any individual volume.
- Examples of compliant data exports are included below. These examples are illustrative, do not represent thresholds for allowable data release, and are not representative of all fields of research.
Token counts: the number of times a token (generally a word) appears in a text.
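As an illustration only, token counting can be sketched in a few lines of Python. The function name `token_counts`, the regular-expression tokenizer, and the sample sentence are assumptions made for this sketch; they are not an HTRC tool, and the sample is not drawn from the HT collection.

```python
import re
from collections import Counter


def token_counts(text: str) -> Counter:
    """Count lowercase word tokens in a text.

    The regex tokenizer is a simplifying assumption; real pipelines
    usually apply language-aware tokenization.
    """
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return Counter(tokens)


# Hypothetical sample text written for this sketch.
sample = "The cat sat on the mat. The mat was warm."
counts = token_counts(sample)

# Print each token with its frequency, most frequent first.
for token, count in counts.most_common():
    print(f"{token}\t{count}")
```

The output is a table of tokens and frequencies, which conveys facts about the text without preserving word order.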
Topic models: generated by machine learning techniques that identify frequently co-occurring words in a text that may represent a topic. The output of topic modeling tools is often structured, such as in XML format.
Extracted named entities: words, such as names and dates, that are extracted from a text and may be presented as a simple list or in a structured format with metadata about use in the text.
"entity": "GEORGE V. JOURDAN",
N-grams: sequences of text where N represents the number of items in the sequence. The examples below are bigrams.
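A minimal sketch of how bigram counts might be produced, assuming whitespace tokenization and a hypothetical helper named `ngrams`; the sample sentence is invented for illustration and does not come from the HT collection.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def ngrams(tokens: Sequence[str], n: int) -> List[Tuple[str, ...]]:
    """Return every run of n consecutive tokens, in order of appearance."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Hypothetical sample text written for this sketch.
tokens = "the ship left the harbour and the ship returned".split()

# Count bigrams (n = 2) and print them with their frequencies.
bigram_counts = Counter(ngrams(tokens, 2))
for bigram, count in bigram_counts.most_common():
    print(" ".join(bigram), count)
```

Reporting counts of short sequences, rather than the sequences in their original order, is what keeps this kind of output a derived statistic rather than an excerpt.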
Keywords-in-context or concordance: selected words and the strings of text preceding or following to demonstrate how a word is used.
the next danger ahead. The
is the greatest blessing that
of time. The first authentic
was Sigeum, on the Hellespont
name in connection with the
, which in France is called
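A keyword-in-context listing of this kind could be generated roughly as follows. This is a sketch under stated assumptions: the function name `kwic`, the whitespace tokenization, and the `window` parameter are hypothetical, and the window size used here is illustrative, not a threshold set by this policy.

```python
from typing import List, Sequence, Tuple


def kwic(tokens: Sequence[str], keyword: str, window: int = 4) -> List[Tuple[str, str, str]]:
    """Return (left context, keyword, right context) for each match.

    `window` is the number of tokens kept on each side of the keyword.
    Matching is case-insensitive on whole tokens.
    """
    rows = []
    for i, token in enumerate(tokens):
        if token.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            rows.append((left, token, right))
    return rows


# Hypothetical sample text written for this sketch.
tokens = "the old lighthouse stood on the cliff above the harbour".split()

# Print each occurrence with two tokens of context on either side.
for left, keyword, right in kwic(tokens, "the", window=2):
    print(f"{left:>20} | {keyword} | {right}")
```

Because each row carries only a few tokens of context, a researcher can inspect usage while the export stays short; whether a given set of rows could be recombined into a substantial portion remains a matter for the HTRC results review.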
Statistical summaries: the results of running statistical analysis on text, such as a confusion matrix.
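For illustration, a confusion matrix for a text classifier could be computed as in the sketch below. The function name `confusion_matrix`, the genre labels, and the label lists are all hypothetical values invented for this example; they are not results from any HTRC analysis.

```python
from collections import Counter
from typing import List, Sequence


def confusion_matrix(actual: Sequence[str], predicted: Sequence[str],
                     labels: Sequence[str]) -> List[List[int]]:
    """Build a matrix where rows are actual labels and columns are predicted labels."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    pairs = Counter(zip(actual, predicted))
    return [[pairs[(a, p)] for p in labels] for a in labels]


# Hypothetical labels for five volumes, invented for this sketch.
labels = ["fiction", "poetry"]
actual = ["fiction", "fiction", "poetry", "poetry", "poetry"]
predicted = ["fiction", "poetry", "poetry", "poetry", "fiction"]

# Print one row per actual label.
matrix = confusion_matrix(actual, predicted, labels)
for label, row in zip(labels, matrix):
    print(label, row)
```

The matrix contains only counts of classification outcomes, so it reveals nothing of the expressive content of the volumes that were classified.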