
Non-Consumptive Use Research Policy

HathiTrust Research Center
Non-Consumptive Use Policy

Effective Immediately

Created by: HTRC Task Force for Non-Consumptive Research Use Policy

Approved: 20 Feb 2017, HathiTrust Research Center Executive Management

Preamble

The HathiTrust (HT) is a partnership of research institutions and libraries operating a shared repository of cultural heritage materials.  HathiTrust preserves and provides access to digitized library collections. The HathiTrust Research Center (HTRC) leverages the scope and scale of the repository to develop avenues for non-consumptive research of the HathiTrust Digital Library. The user modalities for non-consumptive research making up the HTRC include:

  1. Web-accessible data analysis and visualization tools. Provided by the HTRC, these tools allow researchers to assemble collections (worksets) of volumes and analyze them using HTRC-supported off-the-shelf algorithms and visualization interfaces. No substantial portion of text in any individual volume is revealed by interaction with these tools.
  2. Derived downloadable datasets. Most notably, the HTRC Extracted Features dataset is derived from bibliographic and paratextual metadata and includes part-of-speech-tagged unigram counts. No substantial portion of the text in any individual volume is revealed in any derived downloadable dataset that the HTRC provides for use.
  3. HTRC Data Capsules. This system grants a user access to a virtual machine: a dedicated, secure desktop environment (called a “Capsule”) that exists within the HTRC’s secure compute environment located in the United States. Through a Capsule, a user can carry out non-consumptive research on the HT collection using HTRC-provided tools or their own data analysis and visualization tools. Technical and policy constraints (outlined in this document) work together to guarantee that resulting data products are non-consumptive exports and that improper outputs (e.g., leaks) are prevented.

This policy document defines non-consumptive research and non-consumptive exports as implemented for non-profit research and educational analytical use of the HT collection. The HathiTrust-provided web-accessible data analysis and visualization tools and derived downloadable datasets (items 1 and 2 above) are inherently non-consumptive, and the allowable data exports (datasets and tool outputs) have been pre-verified to comply with the following policy. By contrast, HTRC Data Capsules give a user direct access to the HT collection and flexibility in choice of analysis tools. Any data export a user requests to release from a Capsule must be a non-consumptive data export in compliance with the following policy.

Policy

This policy defines Non-consumptive Research, Non-consumptive Exports, and the levels of access permitted at various stages of this research using HTRC Data Capsules. Its goal is to ensure that the HTRC is facilitating the widest possible variety of non-consumptive research and educational use with the HT collection while remaining clearly within the bounds of the fair use rights courts have recognized as applying to this type of activity. More generally, the policy aims to achieve the same goals as copyright itself: to promote progress in the discovery and spread of knowledge, without harming the commercial interests of authors, publishers, and other stakeholders.

  1. Non-consumptive Research (also called “non-consumptive analytics”) means research in which computational analysis is performed on one or more volumes (textual or image objects) in the HT collection, but not research in which a researcher reads or displays substantial portions of an in-copyright or rights-restricted volume to understand the expressive content presented within that volume. Non-consumptive analytics includes such computational tasks as text extraction, textual analysis and information extraction, linguistic analysis, automated translation, image analysis, file manipulation, OCR correction, and indexing and search.
    1. “Substantial portion” means a portion of an individual volume sufficient in quality or quantity to provide a substitute for access to the volume’s expressive content. A portion that merely reveals factual information (about the work or about the world) is not thereby a substitute for access to the volume’s original expressive content.

 

  2. Non-consumptive research using a Capsule is research occurring within the controlled environment of the Data Capsule service. A separate HTRC Data Capsules Terms of Use enumerates terms of consent.
    1. A Capsule is not a reading app and should only be used for non-consumptive research purposes as defined in Section 1. Examples of acceptable in-capsule uses of corpus text that may facilitate non-consumptive research include referrals to specific passages in order to verify or evaluate results, to develop and revise algorithms for processing the text, and to select appropriate short quotes as necessary examples in reporting the research, as may be supported by fair use.
    2. The responsibility for the fair use judgment with respect to the types of research resides with the scholar.
    3. The scholar must explicitly agree to the HTRC Data Capsule Terms of Use in order to access the service.
    4. HTRC employs manual or automated means to review data exports from a Capsule prior to their release from the HTRC environment. HTRC reserves the right to refuse, in part or totality, any release of Data Capsule content that is deemed not in compliance with the HTRC Non-Consumptive Use Policy.

 

  3. Non-consumptive Export is a data product emerging as an output of computational analysis that meets the criteria for being non-consumptive; that is, it would pass the HTRC manual or automated results review. Web-browser accessible tools are pre-confirmed to produce only Non-consumptive Exports before their hosting is allowed in the HTRC secure compute environment. Non-consumptive Exports created through Capsule use are released from a Capsule through a specific action by the Capsule user. Non-consumptive Exports from a Capsule must be in human readable form (such as Unicode or ASCII). These exports undergo manual or automated review by HTRC prior to release in order to affirm compliance with this policy for non-consumptive research.

Non-consumptive Exports may include, but are not limited to:

  1. Statistical summaries and derived data, which consist of factual information about volumes in the HT collection such as token counts, topic models, extracted named entities, N-grams, and visualizations, and shall be released in a non-consumptive form such that, to the best of the HTRC’s knowledge, the information cannot easily be processed to reconstruct a substantial portion of the original expression of any individual volume.
  2. Keywords-in-context or concordances, which consist of excerpts from volumes in the HT collection that may contain expression protected by copyright and shall be released in quality and quantity such that, to the best of the HTRC’s knowledge, the excerpts cannot be combined to reveal a substantial portion of the original expression of any individual volume.

Examples of compliant data exports are included below. These examples are illustrative, do not represent thresholds for allowable data release, and are not representative of all fields of research.

Token counts: the number of times a token (generally a word) appears in a text.

that:   8871
his:    8283
was:    7355
he:     7044
with:   6865
my:     5149
you:    5063
had:    5016
her:    4413
Mr.:    4346
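
As a rough illustration of how counts like these could be produced, the following sketch tallies word tokens with the Python standard library. The function name `token_counts`, the simple regular-expression tokenizer, and the sample sentence are assumptions made for this example; they are not the HTRC's own tooling or data.

```python
import re
from collections import Counter


def token_counts(text: str) -> Counter:
    """Count lowercase word tokens in a text.

    The tokenizer (letters, with an optional internal apostrophe) is an
    illustrative assumption, not an HTRC-defined tokenization rule.
    """
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    return Counter(tokens)


# Hypothetical sample sentence, used only to exercise the function.
sample = "He was with her, and she was with him."
counts = token_counts(sample)

# Print tokens from most to least frequent, one per line.
for token, n in counts.most_common():
    print(f"{token}:\t{n}")
```

The output keeps only the tokens and their frequencies, so the word order of the source text is not preserved.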

Topic models: generated by machine learning techniques that identify frequently co-occurring words in a text that may represent a topic. The output of topic modeling tools is often structured, such as in XML format.

 <topic id="0">
 <word weight="1146">feet</word>
 <word weight="1018">found</word>
 <word weight="823">form</word>
 <word weight="697">strata</word>
 <word weight="683">beds</word>
 <word weight="632">part</word>
 <word weight="626">colour</word>
 <word weight="603">rocks</word>
 <word weight="569">red</word>
 <word weight="562">substance</word>
 </topic>
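
A researcher receiving topic-model output in this shape might load it as in the sketch below, which uses Python's standard `xml.etree.ElementTree` module. The element and attribute names follow the example above (shortened to three words); the variable names are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# A shortened copy of the example topic above.
topic_xml = """<topic id="0">
 <word weight="1146">feet</word>
 <word weight="1018">found</word>
 <word weight="823">form</word>
</topic>"""

topic = ET.fromstring(topic_xml)

# Collect (word, weight) pairs; weights are stored as attribute strings.
words = [(w.text, int(w.get("weight"))) for w in topic.findall("word")]
total = sum(weight for _, weight in words)

for word, weight in words:
    # Share of this topic's listed weight carried by each word.
    print(f"{word}\t{weight}\t{weight / total:.3f}")
```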

Extracted named entities: words, such as names and dates, that are extracted from a text and may be presented as a simple list or in a structured format with metadata about use in the text.

{
 "volumeID": "yale.39002053338092",
 "lang": "en",
 "langHitRatio": "362/376(11)",
 "entities": [
 {
  "entity": "GEORGE V. JOURDAN",
  "class": "PERSON",
  "startCharOffset": 231,
  "endCharOffset": 248,
  "linePos": "3",
  "pagePosSeq": "7",
  "pagePosLabel": "N/A"
 },
 {
  "entity": "—\nCamb",
  "class": "PERSON",
  "startCharOffset": 485,
  "endCharOffset": 491,
  "linePos": "10,11",
  "pagePosSeq": "7",
  "pagePosLabel": "N/A"
 }
 ]
}

N-grams: sequences of text where N represents the number of items in the sequence. The examples below are bigrams.

dark waters
father saw
God would
his child
large book
lonely lighthouse
Mary's father
mother's Bible
mother loved
old Bible
one wick
placed under
Poor Mary
quite fearful
sailors looked
manufactured goods
steam engine
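
The idea of an N-gram can be sketched in a few lines of Python. The helper `ngrams` and the sample token list are hypothetical and exist only for this illustration; real pipelines typically also normalize case or filter stopwords, which this sketch does not attempt.

```python
from typing import List


def ngrams(tokens: List[str], n: int = 2) -> List[str]:
    """Return every run of n consecutive tokens, joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Hypothetical sample tokens, used only to exercise the function.
tokens = "the lonely lighthouse stood over dark waters".split()

bigram_list = ngrams(tokens, 2)   # N = 2
trigram_list = ngrams(tokens, 3)  # N = 3

for gram in bigram_list:
    print(gram)
```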

 

Keywords-in-context or concordance: selected words and the strings of text preceding or following to demonstrate how a word is used.

Left                            Term         Right
the next danger ahead. The      lighthouse   is the greatest blessing that
of time. The first authentic    lighthouse   was Sigeum, on the Hellespont
name in connection with the     lighthouse   , which in France is called
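
A keyword-in-context listing can be sketched as follows. The function `kwic` and its `width` parameter (the number of tokens kept on each side of the term) are assumptions for illustration only; the chosen width is not an HTRC threshold for allowable release.

```python
from typing import List, Tuple


def kwic(tokens: List[str], term: str, width: int = 5) -> List[Tuple[str, str, str]]:
    """Return (left, term, right) rows for each occurrence of term.

    Only `width` tokens of context are kept on each side, so each row
    exposes a short window of the text rather than a long passage.
    """
    rows = []
    for i, tok in enumerate(tokens):
        if tok.lower() == term.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            rows.append((left, tok, right))
    return rows


# Tokens taken from the first example row in this section.
tokens = "the next danger ahead. The lighthouse is the greatest blessing that".split()

rows = kwic(tokens, "lighthouse", width=5)
for left, term, right in rows:
    print(f"{left:>30}  {term}  {right}")
```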

 

Statistical summaries: the results of running statistical analysis on text, such as a confusion matrix.

ID                    Class     Class prediction
mdp.39015006975265    Dickens   Dickens
nyp.33433074920525    Austen    Austen
nyp.33433074954219    Dickens   Dickens
mdp.49015002596832    Dickens   Austen
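
Per-volume predictions like the four example rows in this section can be tallied into a confusion matrix, as in this minimal sketch. The data are those four rows; the function name `confusion_matrix` and the printed layout are assumptions made for illustration.

```python
from collections import Counter
from typing import List, Tuple

# (volume ID, actual class, predicted class), copied from the example rows.
predictions: List[Tuple[str, str, str]] = [
    ("mdp.39015006975265", "Dickens", "Dickens"),
    ("nyp.33433074920525", "Austen", "Austen"),
    ("nyp.33433074954219", "Dickens", "Dickens"),
    ("mdp.49015002596832", "Dickens", "Austen"),
]


def confusion_matrix(rows: List[Tuple[str, str, str]]) -> Counter:
    """Count how often each (actual, predicted) pair occurs."""
    return Counter((actual, predicted) for _, actual, predicted in rows)


matrix = confusion_matrix(predictions)
labels = sorted({label for _, a, p in predictions for label in (a, p)})

# Rows are actual classes, columns are predicted classes.
print("actual \\ predicted", labels)
for actual in labels:
    print(actual, [matrix[(actual, predicted)] for predicted in labels])

# Fraction of volumes whose predicted class matches the actual class.
accuracy = sum(matrix[(label, label)] for label in labels) / len(predictions)
```

The matrix reports only aggregate counts per class pair, which is the kind of factual summary the examples in this section describe.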