HathiTrust Rights Database

Introduction

Overview of requirements

The repository must store and track rights information for each digitized volume in HathiTrust. This information is used by mechanisms such as the page-turner access system. Some of the challenges in doing this are

a) modeling the rights information properly to ease maintenance,

b) ensuring accuracy in the semantics of rights, and

c) tracking the changes to rights information over time.

Frequent updates to millions of records, changes to database structure that make the database unavailable for long periods, or subtle changes over time in the meaning of millions of access control rules are some of the consequences of a poor design. Complicating these challenges is the need for flexibility to accommodate different types of rights information and to develop new access rules, including those that come as a result of negotiations with publishers and manual copyright clearance.

Philosophy

Copyright is complex, and although there are good efforts in modeling and expressing copyright status information for library holdings, this project requires a practical and flexible approach. At best, any solution will be imprecise. We should expect this imprecision and hope to improve on the situation over time. The safest approach is, as much as possible, to base the rights database on simple, established copyright policy and terminology that is not likely to change. Attributes in the database should be deliberately defined in ways that are consistent with copyright policy. Using cataloging metadata, we can make a basic determination of rights by, for example, characterizing the publications as being either in the public domain or in copyright, being in-copyright but out-of-print and brittle, or being authoritatively copyright-orphaned. Unfortunately, the exceptions to these general principles do not necessarily follow any sort of rule or pattern.

Storage and Maintenance Strategy

Can MARC handle it?

The MARC record format does not have fields intended for the storage of rights information and is not able to store this or similar information at the level of the volume (e.g., for multi-volume works). We are consistent with our colleagues at other institutions in recommending that this information be stored in a separate, large-scale database.

Storage of Rights Information

Our strategy for storing rights information is two-pronged, based on the notion of two extensible sets of attributes: The first set of attributes characterizes the copyright status of the volume. Examples of this type of attribute are "public domain," "public domain when viewed in the U.S." and "in-copyright"; each attribute is only present when appropriate. The main benefits of this approach are

a) insulation from frequent change and

b) accuracy in legal terms.

The bulk of this rights information is bibliographically-derived (by automated query) at the point of ingest using the relatively stable criteria of US federal government documents, country of publication and publication date. Over time, additional attributes will be defined and added as we identify out-of-print books, orphaned-copyright works, etc.

The second set of attributes does not characterize the volume in terms of copyright status, and instead directly specifies access control rules. These can be thought of as "overrides" to the first, more general set of attributes. An obvious example of this type of information is "available to UM affiliates".

For accurate representation, or due to changes in copyright status over time, some volumes may have more than one rights attribute. However, to simplify access decisions, rights attributes should be defined so that the most recent rights attribute is authoritative. For example, suppose a volume initially classified as public domain is discovered to be in-copyright, but the copyright holder has explicitly granted access to UM affiliates. In such a case, three attributes apply to the volume: its original public domain classification (as valuable history), its current status as in-copyright, and the explicit access control granted by the copyright holder. The last of these takes precedence; as mentioned above, explicit access controls can be thought of as "overrides" to the more general copyright status attributes.
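The "most recent attribute is authoritative" rule can be sketched in a few lines of Python. This is only an illustration: the `RightsEntry` class, the `current_rights` function, the timestamps, and the `ren` reason chosen for the in-copyright entry are all hypothetical and are not part of the rights database itself.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class RightsEntry:
    attr: str       # attribute name, e.g. "pd", "ic", "umall"
    reason: str     # reason name, e.g. "bib", "con"
    time: datetime  # when the attribute was recorded


def current_rights(history):
    """Return the most recently recorded entry, which is authoritative."""
    if not history:
        raise ValueError("volume has no rights entries")
    return max(history, key=lambda entry: entry.time)


# Hypothetical history matching the example in the text: public domain,
# then in-copyright, then explicit access granted by the copyright holder.
history = [
    RightsEntry("pd", "bib", datetime(2006, 1, 12, 11, 34, 27)),
    RightsEntry("ic", "ren", datetime(2006, 2, 1, 9, 0, 0)),
    RightsEntry("umall", "con", datetime(2006, 3, 8, 9, 12, 45)),
]

authoritative = current_rights(history)
print(authoritative.attr, authoritative.reason)
```

The earlier entries stay in the history for reference, but only the latest one drives access decisions.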

How Rights Decisions Are Made

At the core of the rights system is an algorithm that considers a) the copyright status and/or explicit access controls associated with the volume, b) the volume's digitizing agent (e.g., Google or the University of Chicago), and c) the identity of the user (if known) in order to determine access rights. The access rights may differ based on any of these criteria. Because most rights attributes will be static and will characterize the copyright status of the volume in general terms (e.g., "out-of-print and brittle"), the decision matrix underlying this algorithm can easily accommodate changes in rights over time. For example, we launched a service that prohibited access to in-copyright materials, even when those volumes were out-of-print and damaged, but over time, a change in policy granted in-library users access to those materials by virtue of Section 108 provisions in US copyright law. Such a policy change requires only a simple change in the decision matrix. Some volumes will have a series of rights attributes applied over time. For these volumes, the most recent rights attribute will be used to determine access rights.
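The idea of a decision matrix keyed on static attributes can be sketched as follows. Everything here is an assumption made for illustration: the `User` fields, the `VIEW_RULES` table, the `allowed_actions` function, and the specific rules are hypothetical and do not reproduce the production access policy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class User:
    authenticated: bool = False  # logged in as an affiliate
    in_us: bool = False          # request comes from a US IP address
    in_library: bool = False     # walk-in or in-library user


def deny(user):
    return False


def allow(user):
    return True


# Hypothetical decision matrix: attribute name -> rule deciding who may view.
VIEW_RULES = {
    "pd": allow,
    "pdus": lambda user: user.in_us,
    "ic": deny,
    "op": deny,  # initial policy: out-of-print volumes are closed
    "umall": lambda user: user.authenticated or user.in_library,
    "ic-world": allow,
    "nobody": deny,
}


def allowed_actions(attr, access_profile, user):
    """Combine the rights attribute, the access profile, and the user."""
    if not VIEW_RULES.get(attr, deny)(user):
        return set()
    actions = {"view"}
    # Assumed reading of the access profiles: "open" always permits download,
    # "google" permits it only for logged-in users, other profiles never do.
    if access_profile == "open" or (access_profile == "google" and user.authenticated):
        actions.add("download")
    return actions


reader = User(in_library=True)
before = allowed_actions("op", "page", reader)

# A policy change (such as the Section 108 example) touches one matrix cell;
# no per-volume rights records have to be rewritten.
VIEW_RULES["op"] = lambda user: user.in_library
after = allowed_actions("op", "page", reader)
print(before, after)
```

Because the volume records only say "op", the change in who may view those volumes is confined to the rule table.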

Rights Assignment

General Process Overview

As a volume's images are ingested, they are placed in storage, and the retrieval system sends the identifier to Mirlyn for the item record to be created or updated. Mirlyn then performs a simple test for copyright status: (1) was the volume published in the US or outside the US? (2) depending on where it was published, was it published before a known cutoff date? and (3) if published in the US, is the volume a US federal government publication? The appropriate attributes are then stored in the rights database. With the volume in storage and represented in the rights database, it is available via the access system. When an action is requested in the access system, the access system consults the rights database. Based on the most recent rights attribute, the access profile or source, and whether the user has authenticated, a list of allowable actions is composed. The access system either performs the requested action, prompts the user for authentication, or denies the action.
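The three-question test can be sketched as a small function. The cutoff years (1923 for US publications, 1870 for non-US publications) are taken from Use case 1 in this document. The function name `bib_rights` and its parameters are hypothetical, and treating a publication year of exactly 1923 as in-copyright is an assumption, since the text says only "prior to 1923".

```python
US_CUTOFF = 1923      # US works published before this year are treated as pd
NON_US_CUTOFF = 1870  # non-US works published before this year are treated as pd


def bib_rights(pub_year, published_in_us, us_gov_doc=False):
    """Return an (attribute, reason) pair derived from bibliographic data alone."""
    if published_in_us:
        if us_gov_doc or pub_year < US_CUTOFF:
            return ("pd", "bib")
        return ("ic", "bib")
    if pub_year < NON_US_CUTOFF:
        return ("pd", "bib")
    if pub_year < US_CUTOFF:
        # Public domain only when viewed from the US.
        return ("pdus", "bib")
    return ("ic", "bib")


print(bib_rights(1932, published_in_us=True))   # a US work published in 1932
print(bib_rights(1900, published_in_us=False))  # a non-US work published in 1900
```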

Use Cases

The following simplified, hypothetical rights examples help illustrate both the attributes applied and the rules interpreted:

  1. Mass identification of copyright status based on bibliographically-derived information:

    a) As texts are loaded, a set query in Mirlyn identifies those texts that are:
    • US federal government documents, or
    • published in the US prior to 1923, or
    • published outside of the US before 1870

    These are treated as public domain (ATTRIBUTE name=pd) based on bibliographically-derived information (REASON name=bib). We do not restrict access to these materials.

    b) Those texts that do not meet these criteria (e.g., US post-1923 and not a government document) are treated as in-copyright (i.e., ATTRIBUTE name=ic and REASON name=bib).

    c) An additional attribute is used to represent works published outside the United States from 1870 to 1923 because copyright status for these works depends on the location of the user. Works published outside the US prior to 1923 are in the public domain in the US; however, because copyright law varies in countries outside the US, 1870 is estimated to be the earliest publication date for which such works may still be under copyright elsewhere. Therefore, users accessing the volume from US IP addresses will have access to works published outside the US from 1870 to 1923; however, users with non-US IP addresses will not (ATTRIBUTE name=pdus and REASON name=bib).

  2. Manually-determined public domain: The text, Edward Carpenter, The British Tolstoi by Bell, was published in the United States in 1932. As outlined in Use case #1, it enters the collection as an in-copyright text, and access is restricted. Upon investigation, we note that no copyright notice was printed in the text. According to US copyright law, this text is in the public domain. This text is treated as public domain (ATTRIBUTE name=pd) based on the absence of a copyright statement (REASON name=ncn). We do not restrict access to this text.
  3. Negotiated UM access: The text, Wishbone by Stirling Bowen, was published in 1930 in the United States. As outlined in Use case #1, it enters the collection as an in-copyright text, and access is restricted. Upon investigation, we confirm that copyright has been renewed for the text, and that the current rights holder is Penguin. We contact Penguin and, after negotiation, Penguin grants permission to provide access to affiliated users, stipulates that we not make reprints available, and requires that the agreement be renewed in five years. We provide open access to partner IP addresses and to authenticated users, and do not provide the page images to our reprint services. This text is treated as in-copyright and restricted to affiliates and walk-in patrons based on a letter or contract (ATTRIBUTE name=umall; REASON name=con).
  4. Negotiated open access: Tradition and design in the Iliad, by C. M. Bowra was published by the Clarendon Press (Oxford University) in 1930. As outlined in Use case #1, it enters the collection as an in-copyright text, and access is restricted. A letter requesting permission to provide online access is sent to the rights holder, Oxford University Press, and they grant permission with the stipulation that no reprints are sold. We provide open access to the world, but do not make the page image files available to our reprint services. This text is treated as in-copyright and open access, based on a letter or contract (ATTRIBUTE name=ic-world for REASON name=con).
  5. Out-of-print and brittle: Through standard mechanisms (e.g., review of a volume at a circulation desk or other staff determination), The mission to Spain of Pierre Soule, 1853-1855 by Amos Ettinger, OUP, 1932 is determined by a HathiTrust member institution to either be too brittle to circulate, or lost or missing from the member's library collection. The work is found to be preserved in HathiTrust. Investigation of the work by the member or another HathiTrust institution has determined that the work is in copyright and no unused replacements are available on the market at a fair price. The work is classified in the HathiTrust rights database as out-of-print (ATTRIBUTE name=op) with an indication of research on file (REASON name=ipma). The institution updates the print holdings information submitted to HathiTrust to indicate the brittle status of the work, or that it is lost or missing. The updated holdings information is loaded by HathiTrust, and access to the work is made available to affiliated users and walk-in users at the institution.
  6. Copyright not renewed: Cheese production in Nebraska, by Walter Martin Kollmorgen, published in the United States in 1938, is initially designated in-copyright. Subsequent review of copyright renewal records indicates that the copyright in this work was not renewed. Therefore this volume is reclassified as public domain (ATTRIBUTE name=pd) based on the results of copyright renewal research (REASON name=ren).

Precedence of Rights Information

Background

Over time, the rights status of any volume may be redetermined by a number of methods. For example, updates to the bibliographic record or manual copyright determination processes may change the rights status of a volume.

Some determinations are more authoritative than others. Updates to rights data must take that into account and enforce precedence so that the most recent, most authoritative determination is in effect.

Using rights types and reason codes to infer precedence

We have identified five levels of precedence in the rights model we are currently using. These are, in order of increasing authority, as follows:

precedence | rights type | reason code | examples
1 (lowest) | copyright | bib | pd/bib, ic/bib, und/bib, pdus/bib
2 | copyright | gfv | pdus/gfv
3 | copyright | any but bib, gfv, man | ic/unp, pd/ncn, pd/ren, ic/ren, und/nfi, pd/cdpp, ic/cdpp, pdus/cdpp, ic/add, pdus/add, pd/add, pd/exp, op/ipma, ic/ipma, und/ipma, ic/crms, pd/crms, und/crms, icus/gatt
4 | any | pvt, con | ic-world/con, nobody/pvt, cc-by/con, cc-by-nd/con, cc-by-nc/con, cc-by-sa/con, cc-by-nc-nd/con, cc-by-nc-sa/con, cc-zero/con, und-world/con, pd/con
5 (highest) | any | man | pd/man, pdus/man, ic-world/man, und-world/man, ic/man, nobody/man, nobody/del, nobody/supp
(note: these are never allowed in automatic rights updates, and should have a corresponding explanation in the 'note' field)

In this model, rights of a given precedence should be superseded only by rights of an equal or greater precedence. For example:

  • Rights determined by bibliographic record extract (precedence level 1) are by far the most plentiful but have the least authority. They can only override other bibliographic record-derived rights, and can be overridden by processes that involve manual inspection as well as by access controls placed on volumes.
  • Other copyright-type attributes, such as "ncn" (no printed copyright notice), are the result of manual inspection and have greater authority than those derived from bibliographic records. However, access controls for blocking private information or due to special contractual arrangements should take precedence over these determinations.
  • Access controls are applied when the rights of the volume are independent of copyright status, such as a special release from the copyright holder. Even the most thorough copyright determination should not be allowed to change the availability (or non-availability) of such items.
  • A manually-applied access control (as opposed to an algorithmically-applied one) is absolutely authoritative.

Rules for rights precedence

With one exception, the behavior of rights update processes should be to insert new rights information for a given volume when the newly-supplied rights status is of equal or greater precedence to the active (latest) rights status for that volume.

The exception concerns material that Google has determined to be viewable in the US (reason code gfv). We allow this information to override ic/bib or und/bib, but NOT pdus/bib or pd/bib. That is, if Google has determined a work is open but HathiTrust's bibliographic information would cause it to be closed, we will trust Google's determination. But if HathiTrust's bibliographic information already makes the work open, we do not override that with Google's determination.

Note that rights may be supplied and re-supplied from manual copyright determination processes. According to these rules, those rights all fall within the same level of precedence, so each newer determination supersedes the previous one. It is assumed that a more recent determination is a more accurate determination.

Manual access controls are ignored by rights updates and must currently be entered manually by an administrator.
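The precedence rules above can be sketched as a pair of functions. This is an illustrative reading, not the actual update code: the function names are hypothetical, precedence is inferred from the reason code alone, and the `del` and `supp` reasons are placed at level 5 only because the table lists nobody/del and nobody/supp among the level 5 examples.

```python
def precedence(attr, reason):
    """Infer the precedence level (1-5) of an attr/reason pair."""
    if reason in {"man", "del", "supp"}:
        return 5
    if reason in {"pvt", "con"}:
        return 4
    if reason == "gfv":
        return 2
    if reason == "bib":
        return 1
    return 3  # other determinations (ncn, ren, cdpp, crms, ...)


def automatic_update_allowed(current, new):
    """Decide whether an automatic update may insert `new` over `current`.

    Both arguments are (attr, reason) tuples; `current` is None when the
    volume has no rights yet.
    """
    if precedence(*new) == 5:
        return False  # manual controls are never applied automatically
    if current is None:
        return True
    if new[1] == "gfv" and current[1] == "bib":
        # The Google exception: override closed bib rights only.
        return current[0] in {"ic", "und"}
    return precedence(*new) >= precedence(*current)


print(automatic_update_allowed(("ic", "bib"), ("pdus", "gfv")))
print(automatic_update_allowed(("pdus", "bib"), ("pdus", "gfv")))
```

Because an existing level 5 entry can only be matched by another level 5 entry, and those are rejected up front, manual access controls are never replaced by an automatic update in this sketch.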

Security Considerations

In addition to building the model, the following requirements have been identified to ensure proper enforcement of rights:

  • If viewing an item that requires authentication, the access system must re-check for valid authentication on every access.
  • To the extent that rights are enforced via cookies, cookies must be adequately protected from theft.

Database Layout

Diagram

 

RIGHTS_LOG
namespace id attr reason source access_profile user time note
mdp 39015054477651 1 1 1 2 root 2006-01-12 11:34:26  
mdp 39015017678577 1 1 1 2 root 2006-01-12 11:34:27  
mdp 39015017678577 4 4 1 2 sooty 2006-02-08 15:18:24 determined by jaheim as in-copyright, but orphaned
mdp 39015034781842 2 1 1 2 root 2006-01-12 11:34:28  
mdp 39015034781842 7 3 1 2 pwillett 2006-03-08 09:12:45 agreement reached with publisher for open access

 

RIGHTS_CURRENT
namespace id attr reason source access_profile user time note
mdp 39015054477651 1 1 1 2 root 2006-01-12 11:34:26  
mdp 39015017678577 4 4 1 2 sooty 2006-02-08 15:18:24 determined by jaheim as in-copyright, but orphaned
mdp 39015034781842 7 3 1 2 pwillett 2006-03-08 09:12:45 agreement reached with publisher for open access
 
ATTRIBUTES
id name type dscr
1 pd copyright public domain
2 ic copyright in-copyright
3 op copyright out-of-print (implies in-copyright)
4 orph copyright copyright-orphaned (implies in-copyright)
5 und copyright undetermined copyright status
6 umall access available to UM affiliates and walk-in patrons (all campuses)
7 ic-world access in-copyright and permitted as world viewable by the copyright holder
8 nobody access available to nobody; blocked for all users
9 pdus copyright public domain only when viewed in the US
10 cc-by-3.0 copyright Creative Commons Attribution license, 3.0 Unported
11 cc-by-nd-3.0 copyright Creative Commons Attribution-NoDerivatives license, 3.0 Unported
12 cc-by-nc-nd-3.0 copyright Creative Commons Attribution-NonCommercial-NoDerivatives license, 3.0 Unported
13 cc-by-nc-3.0 copyright Creative Commons Attribution-NonCommercial license, 3.0 Unported
14 cc-by-nc-sa-3.0 copyright Creative Commons Attribution-NonCommercial-ShareAlike license, 3.0 Unported
15 cc-by-sa-3.0 copyright Creative Commons Attribution-ShareAlike license, 3.0 Unported
16 orphcand copyright orphan candidate - in 90-day holding period (implies in-copyright)
17 cc-zero copyright Creative Commons Zero license (implies pd)
18 und-world access undetermined copyright status and permitted as world viewable by the depositor
19 icus copyright in copyright in the US
20 cc-by-4.0 copyright Creative Commons Attribution 4.0 International license
21 cc-by-nd-4.0 copyright Creative Commons Attribution-NoDerivatives 4.0 International license
22 cc-by-nc-nd-4.0 copyright Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International license
23 cc-by-nc-4.0 copyright Creative Commons Attribution-NonCommercial 4.0 International license
24 cc-by-nc-sa-4.0 copyright Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International license
25 cc-by-sa-4.0 copyright Creative Commons Attribution-ShareAlike 4.0 International license
 
REASONS
id name dscr
1 bib bibliographically-derived by automatic processes
2 ncn no printed copyright notice
3 con contractual agreement with copyright holder on file
4 ddd due diligence documentation on file
5 man manual access control override; see note for details
6 pvt private personal information visible
7 ren copyright renewal research was conducted
8 nfi needs further investigation (copyright research partially complete; an ambiguous, unclear, or other time-consuming situation was encountered)
9 cdpp title page or verso contain copyright date and/or place of publication information not in bib record
10 ipma in-print and market availability research was conducted
11 unp unpublished work
12 gfv Google viewability set at VIEW_FULL
13 crms derived from multiple reviews in the Copyright Review Management System (CRMS) via an internal resolution policy; consult CRMS records for details
14 add author death date research was conducted or notification was received from authoritative source 
15 exp expiration of copyright term for non-US work with corporate author 
16 del deleted from the repository; see note for details
17 gatt non-US public domain work restored to in-copyright in the US by GATT
18 supp suppressed from view; see note for details

SOURCES
id name dscr
1 google Google
2 lit-dlps-dc Library IT, Digital Library Production Service, Digital Conversion
3 ump University of Michigan Press
4 ia Internet Archive
5 yale Yale University
6 umn University of Minnesota
7 mhs Minnesota Historical Society
8 usup Utah State University Press
9 ucm Universidad Complutense de Madrid
10 purd Purdue University
11 getty Getty Research Institute
12 um-dc-mp University of Michigan, Duderstadt Center, Millennium Project
13 uiuc University of Illinois at Urbana-Champaign
14 brooklynmuseum Brooklyn Museum
15 uf University of Florida
16 tamu Texas A&M
17 udel University of Delaware
18 private Private Donor
19 umich University of Michigan (General)
20 clark Clark Art Institute
21 ku Knowledge Unlatched
22 mcgill McGill University
23 bc Boston College

ACCESS PROFILES
id name dscr
1 open Unrestricted image and full-volume download (e.g. Internet Archive)
2 google Restricted public full-volume download - watermarked PDF only, when logged in or with Data API key (e.g. Google)
3 page Page access only: no PDF or ZIP download for anyone (e.g. UM Press)
4 page+lowres Low resolution watermarked image derivatives only (e.g. MDL)
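To show how the numeric codes in RIGHTS_CURRENT relate to the lookup tables, the following sketch decodes one of the sample rows into the attr/reason notation used in the precedence table. The dictionaries copy only a few entries from the tables above, and the `decode` function is hypothetical.

```python
# Partial copies of the lookup tables above (id -> name).
ATTRIBUTES = {1: "pd", 2: "ic", 4: "orph", 7: "ic-world"}
REASONS = {1: "bib", 3: "con", 4: "ddd"}
SOURCES = {1: "google", 2: "lit-dlps-dc"}
ACCESS_PROFILES = {1: "open", 2: "google", 3: "page", 4: "page+lowres"}


def decode(row):
    """Translate the numeric codes of a rights row into their names."""
    namespace, vol_id, attr, reason, source, profile = row[:6]
    return {
        "volume": (namespace, vol_id),
        "rights": f"{ATTRIBUTES[attr]}/{REASONS[reason]}",
        "source": SOURCES[source],
        "access_profile": ACCESS_PROFILES[profile],
    }


# Sample row taken from the RIGHTS_CURRENT table above.
row = ("mdp", "39015034781842", 7, 3, 1, 2, "pwillett",
       "2006-03-08 09:12:45", "agreement reached with publisher for open access")
decoded = decode(row)
print(decoded["rights"], decoded["source"], decoded["access_profile"])
```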

SQL Create Statement

CREATE TABLE rights_log (
    namespace VARCHAR(8) NOT NULL,
    id     VARCHAR(32) NOT NULL,
    attr   TINYINT NOT NULL,
    reason TINYINT NOT NULL,
    source TINYINT NOT NULL,
    access_profile TINYINT NOT NULL,
    user   VARCHAR(32) NOT NULL,
    time   TIMESTAMP NOT NULL default CURRENT_TIMESTAMP,
    note   TEXT,
  PRIMARY KEY (namespace, id, time)
);

CREATE TABLE rights_current (
    namespace VARCHAR(8) NOT NULL,
    id VARCHAR(32) NOT NULL,
    attr TINYINT NOT NULL,
    reason TINYINT NOT NULL,
    source TINYINT NOT NULL,
    access_profile TINYINT NOT NULL,
    user VARCHAR(32) NOT NULL,
    time TIMESTAMP NOT NULL default CURRENT_TIMESTAMP,
    note TEXT,
  PRIMARY KEY (namespace, id)
);

-- Copy every inserted or updated rights_current row into rights_log.
-- Trigger timing (AFTER) is an assumption.
CREATE TRIGGER ins_rights AFTER INSERT ON rights_current FOR EACH ROW
  INSERT INTO rights_log VALUES (NEW.namespace, NEW.id, NEW.attr, NEW.reason, NEW.source, NEW.access_profile, NEW.user, NEW.time, NEW.note);

CREATE TRIGGER upd_rights AFTER UPDATE ON rights_current FOR EACH ROW
  INSERT INTO rights_log VALUES (NEW.namespace, NEW.id, NEW.attr, NEW.reason, NEW.source, NEW.access_profile, NEW.user, NEW.time, NEW.note);

CREATE TABLE attributes (
  id     TINYINT UNSIGNED NOT NULL AUTO_INCREMENT UNIQUE KEY,
  type   ENUM('access','copyright') NOT NULL,
  name   VARCHAR(16) NOT NULL,
  dscr   TEXT NOT NULL);

CREATE TABLE reasons (
  id     TINYINT UNSIGNED NOT NULL AUTO_INCREMENT UNIQUE KEY,
  name   VARCHAR(16) NOT NULL,
  dscr   TEXT NOT NULL);

CREATE TABLE sources (
  id     TINYINT UNSIGNED NOT NULL AUTO_INCREMENT UNIQUE KEY,
  name   VARCHAR(16) NOT NULL,
  dscr   TEXT NOT NULL);

CREATE TABLE access_profiles (
  id tinyint(3) unsigned NOT NULL,
  name varchar(16) NOT NULL,
  dscr text NOT NULL,
  PRIMARY KEY (`id`)
);
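The relationship between rights_current and rights_log can be tried out with Python's built-in `sqlite3` module. This is a sketch under stated assumptions: it uses SQLite rather than MySQL, so the column types and trigger syntax are adapted, the `set_rights` helper is hypothetical, and the lookup tables are omitted. The sample values come from the RIGHTS_LOG table above.

```python
import sqlite3

COLUMNS = """
    namespace TEXT NOT NULL, id TEXT NOT NULL,
    attr INTEGER NOT NULL, reason INTEGER NOT NULL,
    source INTEGER NOT NULL, access_profile INTEGER NOT NULL,
    user TEXT NOT NULL, time TEXT NOT NULL, note TEXT
"""
LOG_ROW = """INSERT INTO rights_log VALUES (NEW.namespace, NEW.id, NEW.attr,
    NEW.reason, NEW.source, NEW.access_profile, NEW.user, NEW.time, NEW.note);"""

conn = sqlite3.connect(":memory:")
conn.executescript(f"""
CREATE TABLE rights_log ({COLUMNS}, PRIMARY KEY (namespace, id, time));
CREATE TABLE rights_current ({COLUMNS}, PRIMARY KEY (namespace, id));
CREATE TRIGGER ins_rights AFTER INSERT ON rights_current FOR EACH ROW
BEGIN {LOG_ROW} END;
CREATE TRIGGER upd_rights AFTER UPDATE ON rights_current FOR EACH ROW
BEGIN {LOG_ROW} END;
""")


def set_rights(conn, namespace, vol_id, attr, reason, source, profile,
               user, time, note=None):
    """Update the current rights row for a volume, inserting it if absent."""
    cur = conn.execute(
        "UPDATE rights_current SET attr=?, reason=?, source=?, "
        "access_profile=?, user=?, time=?, note=? "
        "WHERE namespace=? AND id=?",
        (attr, reason, source, profile, user, time, note, namespace, vol_id),
    )
    if cur.rowcount == 0:
        conn.execute(
            "INSERT INTO rights_current VALUES (?,?,?,?,?,?,?,?,?)",
            (namespace, vol_id, attr, reason, source, profile, user, time, note),
        )


# First the bibliographic determination (ic/bib), then the negotiated
# open access (ic-world/con) for the same volume.
set_rights(conn, "mdp", "39015034781842", 2, 1, 1, 2, "root",
           "2006-01-12 11:34:28")
set_rights(conn, "mdp", "39015034781842", 7, 3, 1, 2, "pwillett",
           "2006-03-08 09:12:45",
           "agreement reached with publisher for open access")

log_rows = conn.execute(
    "SELECT attr, reason FROM rights_log ORDER BY time").fetchall()
current_rows = conn.execute(
    "SELECT attr, reason FROM rights_current").fetchall()
print(log_rows, current_rows)
```

After both calls, rights_current holds a single row for the volume while rights_log keeps both determinations, which is the history that the triggers are meant to preserve.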