Rights Database

Introduction

Overview of requirements

The repository must store and track rights information for each digitized volume in HathiTrust; this information is used by mechanisms such as the PageTurner access system. Some of the challenges in doing this are

a) modeling the rights information properly to ease maintenance,

b) ensuring accuracy in the semantics of rights, and

c) tracking the changes to rights information over time.

Frequent updates to millions of records, changes to database structure that make the database unavailable for long periods, or subtle changes over time in the meaning of millions of access control rules are some of the consequences of a poor design. Complicating these challenges is the need for flexibility to accommodate different types of rights information and to develop new access rules, including those that come as a result of negotiations with publishers and manual copyright clearance.

Philosophy

Copyright is complex, and although there are good efforts in modeling and expressing copyright status information for library holdings, this project requires a practical and flexible approach. At best, any solution will be imprecise. We should expect this imprecision and hope to improve on the situation over time. The safest approach is, as much as possible, to base the rights database on simple, established copyright policy and terminology that is not likely to change. Attributes in the database should be deliberately defined in ways that are consistent with copyright policy. Using cataloging metadata, we can make a basic determination of rights by, for example, characterizing the publications as being either in the public domain or in copyright, or being in-copyright but out-of-print and brittle. Unfortunately, the exceptions to these general principles do not necessarily follow any sort of rule or pattern.

Storage and Maintenance Strategy

Can MARC handle it?

The MARC record format does not have fields intended for the storage of rights information and is not able to store this or similar information at the level of the volume (e.g., for multi-volume works). We are consistent with our colleagues at other institutions in recommending that this information be stored in a separate, large-scale database.

Storage of Rights Information

Our strategy for storing rights information is two-pronged, based on the notion of two extensible sets of attributes: The first set of attributes characterizes the copyright status of the volume. Examples of this type of attribute are “public domain,” “public domain when viewed in the U.S.” and “in-copyright”; each attribute is only present when appropriate. The main benefits of this approach are

a) insulation from frequent change and

b) accuracy in legal terms.

The bulk of this rights information is bibliographically derived (by automated query) at the point of ingest using the relatively stable criteria of US federal government documents, country of publication and publication date. Over time, additional attributes will be defined and added as we identify out-of-print books, etc. The second set of attributes does not characterize the volume in terms of copyright status, and instead directly specifies access control rules. These can be thought of as “overrides” to the first, more general set of attributes. For accurate representation, or due to changes in copyright status over time, some volumes may have more than one rights attribute. However, to simplify access decisions, rights attributes should be defined so that the most recent rights attribute is authoritative.

How Rights Decisions Are Made

At the core of the rights system is an algorithm that considers (a) the copyright status and/or explicit access controls associated with the volume, (b) the volume’s digitizing agent (e.g., Google or the University of Chicago), and (c) relevant characteristics of the user (if known) in order to determine access rights. The access rights may differ based on any of these criteria. Because most rights attributes will be static and will characterize the copyright status of the volume in general terms (e.g., “out-of-print and brittle”), the decision matrix underlying this algorithm can easily accommodate changes in rights over time. For example, we launched a service that prohibited access to in-copyright materials, even when those volumes were out-of-print and damaged, but over time, a change in policy granted in-library users access to those materials by virtue of Section 108 provisions in US copyright law. Such a policy change requires only a simple change in the decision matrix. Some volumes will have a series of rights attributes applied over time. For these volumes, the most recent rights attribute will be used to determine access rights.
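
A minimal sketch of how such a decision matrix might be represented, written in Python for illustration. The matrix entries, the `User` fields, and the function name `can_view` are assumptions made for this example and are not the production rules; the digitizing agent is left out for brevity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class User:
    """Hypothetical user characteristics relevant to an access decision."""
    in_us: bool = False        # request comes from a US IP address
    in_library: bool = False   # user is physically in a member library


# Hypothetical decision matrix: attribute name -> predicate on the user.
DECISION_MATRIX = {
    "pd": lambda user: True,           # public domain: open to everyone
    "pdus": lambda user: user.in_us,   # public domain only when viewed in the US
    "ic": lambda user: False,          # in-copyright: closed
    "op": lambda user: False,          # out-of-print: closed at launch
}


def can_view(attr: str, user: User) -> bool:
    """Return True if the matrix allows this user to view a volume with `attr`."""
    rule = DECISION_MATRIX.get(attr)
    # Unknown attributes are denied by default (an assumption of this sketch).
    return rule(user) if rule else False


# A policy change (e.g. Section 108 access for in-library users) touches
# one matrix entry; no per-volume rights records have to change.
DECISION_MATRIX["op"] = lambda user: user.in_library
```

The point of the design is that volumes keep their descriptive attribute (here `op`) while the interpretation of that attribute lives in one place.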

Rights Assignment

General Process Overview

As a volume’s images are ingested, they are placed in storage, and the retrieval system sends the identifier to Mirlyn for the item record to be created or updated. Mirlyn then performs a simple test for copyright status: (1) was the volume published in the US or outside the US? (2) depending on where it was published, was it published before a known cutoff date? and (3) if published in the US, is the volume a US federal government publication? The appropriate attributes are then stored in the rights database. With the volume in storage and represented in the rights database, it is available via the access system. When an action is requested in the access system, the access system consults the rights database. Based on the most recent rights attribute, the access profile or source, and whether the user has authenticated, a list of allowable actions is composed. The access system either performs the requested action, prompts the user for authentication, or denies the action.

Use Cases

The following simplified, hypothetical rights examples help illustrate both the attributes applied and the rules interpreted:

1. Mass identification of copyright status based on bibliographically derived information: (a) As texts are loaded, a set query in Mirlyn identifies those texts that are:
  - US federal government documents, or
  - published in the US prior to 1928, or
  - published outside of the US before 1898
  These are treated as public domain (ATTRIBUTE name=pd) based on bibliographically derived information (REASON name=bib). We do not restrict access to these materials. (b) Those texts that do not meet these criteria (e.g., US 1928 or later and not a government document) are treated as in-copyright (i.e., ATTRIBUTE name=ic and REASON name=bib). (c) An additional attribute is used to represent works published outside the United States from 1898 through 1927 because copyright status for these works depends on the location of the user. Works published outside the US prior to 1928 are in the public domain in the US; however, due to the variations in copyright law in countries outside the US, it is estimated that 1898 is the earliest date works published in these countries may still be under copyright. Therefore, users accessing the volume from US IP addresses will have access to the works published outside the US from 1898 through 1927; however, users with non-US IP addresses will not (ATTRIBUTE name=pdus and REASON name=bib).
2. Manually determined public domain: The text, Edward Carpenter, The British Tolstoi by Bell, was published in the United States in 1932. As outlined in Use case #1, it enters the collection as an in-copyright text, and access is restricted. Upon investigation, we note that no copyright notice was printed in the text. According to US copyright law, this text is in the public domain. This text is treated as public domain (ATTRIBUTE name=pd) based on the absence of a copyright statement (REASON name=ncn). We do not restrict access to this text.
3. Negotiated open access: Tradition and design in the Iliad, by C. M. Bowra was published by the Clarendon Press (Oxford University) in 1930. As outlined in Use case #1, it enters the collection as an in-copyright text, and access is restricted. A letter requesting permission to provide online access is sent to the rights holder, Oxford University Press, and they grant permission with the stipulation that no reprints are sold. We provide open access to the world, but do not make the page image files available to our reprint services. This text is treated as in-copyright and open access, based on a letter or contract (ATTRIBUTE name=ic-world for REASON name=con).
4. Out-of-print and brittle: Through standard mechanisms (e.g., review of a volume at a circulation desk or other staff determination), The mission to Spain of Pierre Soule, 1853-1855 by Amos Ettinger, OUP, 1932 is determined by a HathiTrust member institution to either be too brittle to circulate, or lost or missing from the member’s library collection. The work is found to be preserved in HathiTrust. Investigation of the work by the member or another HathiTrust institution has determined that the work is in copyright and no unused replacements are available on the market at a fair price. The work is classified in the HathiTrust rights database as out-of-print (ATTRIBUTE name=op) with an indication of research on file (REASON name=ipma). The institution updates the print holdings information submitted to HathiTrust to indicate the brittle status of the work, or that it is lost or missing. The updated holdings information is loaded by HathiTrust, and access to the work is made available to affiliated users and walk-in users at that institution.
5. Copyright not renewed: Cheese production in Nebraska, by Walter Martin Kollmorgen, published in the United States in 1938, is initially designated in-copyright. Subsequent review of copyright renewal records indicates that the copyright in this work was not renewed. Therefore this volume is reclassified as public domain (ATTRIBUTE name=pd) based on the results of copyright renewal research (REASON name=ren).
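
The bibliographic test in use case 1 can be sketched as a small function. This is an illustration only: the function name, parameter names, and constants are hypothetical, and the real determination is made by a set query in Mirlyn rather than by code like this.

```python
from typing import Tuple

US_CUTOFF = 1928       # US works published before this year are treated as pd
NON_US_CUTOFF = 1898   # non-US works published before this year are treated as pd


def bib_rights(pub_year: int, published_in_us: bool,
               us_gov_doc: bool = False) -> Tuple[str, str]:
    """Return (attribute, reason) derived from bibliographic data alone."""
    if us_gov_doc:
        return ("pd", "bib")
    if published_in_us:
        return ("pd", "bib") if pub_year < US_CUTOFF else ("ic", "bib")
    if pub_year < NON_US_CUTOFF:
        return ("pd", "bib")
    if pub_year < US_CUTOFF:
        # Public domain only when viewed from the US.
        return ("pdus", "bib")
    return ("ic", "bib")
```

For example, `bib_rights(1932, True)` and `bib_rights(1930, False)` both yield `("ic", "bib")`, which is how the volumes in use cases 2 and 3 enter the collection before any manual determination.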

Precedence of Rights Information

Background

Over time, the rights status of any volume may be redetermined by a number of methods. For example, updates to the bibliographic record or manual copyright determination processes may change the rights status of a volume.

Some determinations are more authoritative than others. Updates to rights data must take that into account and enforce precedence so that the most recent, most authoritative determination is in effect.

We have identified five levels of precedence in the rights model we are currently using. These are, in order of increasing authority, as follows:

Using rights types and reason codes to infer precedence
precedence	rights type	reason code	examples
1 (lowest)	copyright	bib	pd/bib, ic/bib, und/bib, pdus/bib
2	copyright	gfv	pdus/gfv
3	copyright	any but bib, gfv, man	ic/unp, pd/ncn, pdus/ncn, pd/ren, pdus/ren, ic/ren, und/nfi, pd/cdpp, ic/cdpp, pdus/cdpp, ic/add, pdus/add, pd/add, pd/exp, op/ipma, ic/ipma, und/ipma, ic/crms, pd/crms, und/crms, pdus/crms, icus/gatt, icus/ren, und/ren
4	any	pvt, con	ic-world/con, nobody/pvt, (Creative Commons licenses)/con, cc-zero/con, und-world/con, pd/con, pd-pvt/pvt
5 (highest)	any	man	pd/man, pdus/man, ic-world/man, und-world/man, ic/man, nobody/man, nobody/del, supp/supp, (Creative Commons licenses)/man (note: these are never allowed in automatic rights updates, and should have a corresponding explanation in the 'note' field)

In this model, rights of a given precedence should be superseded only by rights of an equal or greater precedence. For example:

Rights determined by bibliographic record extract (precedence level 1) are by far the most plentiful but have the least authority. They can only override other bibliographic record-derived rights, and can be overridden by processes that involve manual inspection as well as by access controls placed on volumes.
Other copyright-type attributes, such as “ncn” (no printed copyright notice), are the result of manual inspection and have greater authority than those derived from bibliographic records. However, access controls for blocking private information or due to special contractual arrangements should take precedence over these determinations.
Access controls are applied when the rights of the volume are independent of copyright status, such as a special release from the copyright holder. Even the most thorough copyright determination should not be allowed to change the availability (or non-availability) of such items.
A manually applied access control (as opposed to an algorithmically applied one) is absolutely authoritative.

Rules for rights precedence

With one exception, the behavior of rights update processes should be to insert new rights information for a given volume when the newly-supplied rights status is of equal or greater precedence to the active (latest) rights status for that volume.

The exception concerns material that Google has determined is viewable in the US. We allow this information to override ic/bib or und/bib, but NOT pdus/bib or pd/bib. That is, if Google has determined a work is open but HathiTrust’s bibliographic information would cause it to be closed, we will trust Google’s determination. But if HathiTrust’s bibliographic information already makes the work open, we do not override that with Google’s determination.

Note that rights may be supplied and re-supplied from manual copyright determination processes. According to these rules, those rights would all fall within the same level of precedence, and so would continue to take precedence over each other. It is assumed that a more recent determination is a more accurate determination.

Manual access controls will be ignored by rights updates and must currently be entered manually by an administrator.
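
The rules above can be sketched in Python as follows. This is a hedged illustration, not the update code itself: the names `Rights`, `precedence`, and `automated_update_applies` are hypothetical, and the sketch infers the precedence level from the reason code alone, which is a simplification of the rights type and reason code table.

```python
from typing import NamedTuple


class Rights(NamedTuple):
    attr: str     # e.g. "pd", "ic", "pdus"
    reason: str   # e.g. "bib", "gfv", "ncn", "con", "man"


def precedence(reason: str) -> int:
    """Map a reason code to a precedence level from 1 (lowest) to 5 (highest)."""
    if reason in ("man", "del", "supp"):
        return 5
    if reason in ("pvt", "con"):
        return 4
    if reason == "gfv":
        return 2
    if reason == "bib":
        return 1
    return 3  # any other reason (ncn, ren, crms, ...)


def automated_update_applies(new: Rights, current: Rights) -> bool:
    """Decide whether an automated update should insert `new` over `current`."""
    # Level-5 rights are never supplied by automated updates.
    if precedence(new.reason) == 5:
        return False
    # Exception: Google viewability overrides ic/bib and und/bib,
    # but not pd/bib or pdus/bib.
    if new.reason == "gfv" and current.reason == "bib":
        return current.attr in ("ic", "und")
    # General rule: equal or greater precedence wins.
    return precedence(new.reason) >= precedence(current.reason)
```

Under these assumptions, `pd/ncn` replaces `ic/bib`, `pdus/gfv` replaces `ic/bib` but not `pd/bib`, and nothing automated replaces `ic-world/con` unless it is itself a `con` or `pvt` determination.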

Security Considerations

In addition to building the model, the following requirements have been identified to ensure proper enforcement of rights:

If viewing an item that requires authentication, the access system must re-check for valid authentication on every access.
To the extent that rights are enforced via cookies, cookies must be adequately protected from theft.

Database Layout

RIGHTS_LOG
namespace	id	attr	reason	source	access_profile	user	time	note
mdp	39015054477651	1	1	1	2	root	2006-01-12 11:34:26
mdp	39015017678577	1	1	1	2	root	2006-01-12 11:34:27
mdp	39015017678577	4	4	1	2	sooty	2006-02-08 15:18:24	determined by jaheim as in-copyright, but orphaned
mdp	39015034781842	7	3	1	2	pwillett	2006-03-08 09:12:45	agreement reached with publisher for open access

RIGHTS_CURRENT
namespace	id	attr	reason	source	access_profile	user	time	note
mdp	39015054477651	1	1	1	2	root	2006-01-12 11:34:26
mdp	39015017678577	4	4	1	2	sooty	2006-02-08 15:18:24	determined by jaheim as in-copyright, but orphaned
mdp	39015034781842	7	3	1	2	pwillett	2006-03-08 09:12:45	agreement reached with publisher for open access
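
The relationship between the two sample tables can be sketched in Python: RIGHTS_CURRENT appears to hold, for each volume, the most recent row in RIGHTS_LOG. The function name `current_rights` is hypothetical, and only a subset of the columns is carried over from the sample rows.

```python
# (namespace, id, attr, reason, time) taken from the RIGHTS_LOG sample rows.
LOG = [
    ("mdp", "39015054477651", 1, 1, "2006-01-12 11:34:26"),
    ("mdp", "39015017678577", 1, 1, "2006-01-12 11:34:27"),
    ("mdp", "39015017678577", 4, 4, "2006-02-08 15:18:24"),
    ("mdp", "39015034781842", 7, 3, "2006-03-08 09:12:45"),
]


def current_rights(log):
    """Return {(namespace, id): (attr, reason, time)} keeping the latest row per volume."""
    latest = {}
    for namespace, vol_id, attr, reason, time in log:
        key = (namespace, vol_id)
        # Timestamps in this fixed format compare correctly as strings.
        if key not in latest or time >= latest[key][2]:
            latest[key] = (attr, reason, time)
    return latest
```

Applied to `LOG`, this yields three entries, with volume 39015017678577 resolved to its later determination (attr 4, reason 4), matching the RIGHTS_CURRENT sample.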

Attributes

id	name	type	dscr
1	pd	copyright	public domain
2	ic	copyright	in-copyright
3	op	copyright	out-of-print (implies in-copyright)
4	orph	copyright	copyright-orphaned (implies in-copyright)
5	und	copyright	undetermined copyright status
6	umall	access	available to UM affiliates and walk-in patrons (all campuses)
7	ic-world	access	in-copyright and permitted as world viewable by the copyright holder
8	nobody	access	available to nobody; blocked for all users
9	pdus	copyright	public domain only when viewed in the US
10	cc-by-3.0	copyright	Creative Commons Attribution license, 3.0 Unported
11	cc-by-nd-3.0	copyright	Creative Commons Attribution-NoDerivatives license, 3.0 Unported
12	cc-by-nc-nd-3.0	copyright	Creative Commons Attribution-NonCommercial-NoDerivatives license, 3.0 Unported
13	cc-by-nc-3.0	copyright	Creative Commons Attribution-NonCommercial license, 3.0 Unported
14	cc-by-nc-sa-3.0	copyright	Creative Commons Attribution-NonCommercial-ShareAlike license, 3.0 Unported
15	cc-by-sa-3.0	copyright	Creative Commons Attribution-ShareAlike license, 3.0 Unported
16	orphcand	copyright	orphan candidate - in 90-day holding period (implies in-copyright)
17	cc-zero	copyright	Creative Commons Zero license (implies pd)
18	und-world	access	undetermined copyright status and permitted as world viewable by the depositor
19	icus	copyright	in copyright in the US
20	cc-by-4.0	copyright	Creative Commons Attribution 4.0 International license
21	cc-by-nd-4.0	copyright	Creative Commons Attribution-NoDerivatives 4.0 International license
22	cc-by-nc-nd-4.0	copyright	Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International license
23	cc-by-nc-4.0	copyright	Creative Commons Attribution-NonCommercial 4.0 International license
24	cc-by-nc-sa-4.0	copyright	Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International license
25	cc-by-sa-4.0	copyright	Creative Commons Attribution-ShareAlike 4.0 International license
26	pd-pvt	access	public domain but access limited due to privacy concerns
27	supp	access	suppressed from view; see note for details

Reasons

Rights code reasons document why a certain rights status has been determined for an item.
id	name	description
1	bib	bibliographically-derived by automatic processes
2	ncn	no printed copyright notice
3	con	contractual agreement with copyright holder on file
4	ddd	due diligence documentation on file
5	man	manual access control override; see note for details
6	pvt	private personal information visible
7	ren	copyright renewal research was conducted
8	nfi	needs further investigation (copyright research partially complete; an ambiguous, unclear, or other time-consuming situation was encountered)
9	cdpp	title page or verso contain copyright date and/or place of publication information not in bib record
10	ipma	in-print and market availability research was conducted
11	unp	unpublished work
12	gfv	Google viewability set at VIEW_FULL
13	crms	derived from multiple reviews in the Copyright Review Management System (CRMS) via an internal resolution policy; consult CRMS records for details
14	add	author death date research was conducted or notification was received from author
15	exp	expiration of copyright term for non-US work with corporate author
16	del	deleted from the repository; see note for details
17	gatt	non-US public domain work restored to in-copyright in the US by GATT
18	supp	suppressed from view; see note for details

Sources

Source codes identify the original holder and contributor of an item.
id	name	dscr
1	google	Google
2	lit-dlps-dc	Library IT, Digital Library Production Service, Digital Conversion
3	ump	University of Michigan Press
4	ia	Internet Archive
5	yale	Yale University
6	mdl	Minnesota Digital Library
7	mhs	Minnesota Historical Society
8	usup	Utah State University Press
9	ucm	Universidad Complutense de Madrid
10	purd	Purdue University
11	getty	Getty Research Institute
12	um-dc-mp	University of Michigan, Duderstadt Center, Millennium Project
13	uiuc	University of Illinois at Urbana-Champaign
14	brooklynmuseum	Brooklyn Museum
15	uf	State University System of Florida
16	tamu	Texas A&M
17	udel	University of Delaware
18	private	Private Donor
19	umich	University of Michigan (General)
20	clark	Clark Art Institute
21	ku	Knowledge Unlatched
22	mcgill	McGill University
23	bc	Boston College
24	nnc	Columbia University
25	geu	Emory University
26	borndigital	Born Digital (placeholder)
27	yale2	Yale University
28	mou	University of Missouri-Columbia
29	chtanc	National Central Library of Taiwan
30	bentley-umich	Bentley Historical Library, University of Michigan
31	clements-umich	William L. Clements Library, University of Michigan
32	wau	University of Washington
33	cornell	Cornell University
34	cornell-ms	Cornell University (with support from Microsoft)
35	umd	University of Maryland
36	frick	The Frick Collection
37	northwestern	Northwestern University
38	umn	University of Minnesota
39	berkeley	University of California, Berkeley
40	ucmerced	University of California, Merced
41	nd	University of Notre Dame
42	princeton	Princeton University
43	uq	The University of Queensland
44	ucla	University of California, Los Angeles
45	osu	The Ohio State University
46	upenn	University of Pennsylvania
47	aub	American University of Beirut
48	ucsd	University of California, San Diego
49	harvard	Harvard University

Access Profiles

id	name	dscr
1	open	Unrestricted image and full-volume download (e.g. Internet Archive)
2	google	Restricted public full-volume download - watermarked PDF only, when logged in or with Data API key (e.g. Google)
3	page	Page access only: no PDF or ZIP download for anyone (e.g. UM Press)
4	page+lowres	Low resolution watermarked image derivatives only (e.g. MDL)

SQL Create Statement

CREATE TABLE rights_log (
    namespace VARCHAR(8) NOT NULL,
    id     VARCHAR(32) NOT NULL,
    attr   TINYINT NOT NULL,
    reason TINYINT NOT NULL,
    source TINYINT NOT NULL,
    access_profile TINYINT NOT NULL,
    user   VARCHAR(32) NOT NULL,
    time   TIMESTAMP NOT NULL default CURRENT_TIMESTAMP,
    note   TEXT,
  PRIMARY KEY (namespace, id, time)
);

CREATE TABLE rights_current (
    namespace VARCHAR(8) NOT NULL,
    id VARCHAR(32) NOT NULL,
    attr TINYINT NOT NULL,
    reason TINYINT NOT NULL,
    source TINYINT NOT NULL,
    access_profile TINYINT NOT NULL,
    user VARCHAR(32) NOT NULL,
    time TIMESTAMP NOT NULL default CURRENT_TIMESTAMP,
    note TEXT,
  PRIMARY KEY (namespace, id)
);

-- Copy each inserted or updated rights_current row into rights_log.
CREATE TRIGGER ins_rights AFTER INSERT ON rights_current FOR EACH ROW
  INSERT INTO rights_log VALUES (new.namespace, new.id, new.attr, new.reason, new.source, new.access_profile, new.user, new.time, new.note);

CREATE TRIGGER upd_rights AFTER UPDATE ON rights_current FOR EACH ROW
  INSERT INTO rights_log VALUES (new.namespace, new.id, new.attr, new.reason, new.source, new.access_profile, new.user, new.time, new.note);

CREATE TABLE attributes (
  id     TINYINT UNSIGNED NOT NULL AUTO_INCREMENT UNIQUE KEY,
  type   ENUM('access','copyright') NOT NULL,
  name   VARCHAR(16) NOT NULL,
  dscr   TEXT NOT NULL);

CREATE TABLE reasons (
  id     TINYINT UNSIGNED NOT NULL AUTO_INCREMENT UNIQUE KEY,
  name   VARCHAR(16) NOT NULL,
  dscr   TEXT NOT NULL);

CREATE TABLE sources (
  id     TINYINT UNSIGNED NOT NULL AUTO_INCREMENT UNIQUE KEY,
  name   VARCHAR(16) NOT NULL,
  dscr   TEXT NOT NULL);

CREATE TABLE access_profiles (
  id     TINYINT UNSIGNED NOT NULL,
  name   VARCHAR(16) NOT NULL,
  dscr   TEXT NOT NULL,
  PRIMARY KEY (id)
);
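
The logging behavior that the triggers are meant to provide can be tried out with Python's built-in `sqlite3` module. This is a sketch under stated assumptions: SQLite stands in for the production database (its trigger bodies need `BEGIN ... END`), the schema is trimmed to a few columns, and the `time` values are supplied explicitly rather than defaulted.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE rights_log (
    namespace TEXT NOT NULL, id TEXT NOT NULL,
    attr INTEGER NOT NULL, reason INTEGER NOT NULL,
    time TEXT NOT NULL,
    PRIMARY KEY (namespace, id, time)
);
CREATE TABLE rights_current (
    namespace TEXT NOT NULL, id TEXT NOT NULL,
    attr INTEGER NOT NULL, reason INTEGER NOT NULL,
    time TEXT NOT NULL,
    PRIMARY KEY (namespace, id)
);
-- Every insert or update on rights_current is copied into rights_log.
CREATE TRIGGER ins_rights AFTER INSERT ON rights_current FOR EACH ROW
BEGIN
    INSERT INTO rights_log VALUES (new.namespace, new.id, new.attr, new.reason, new.time);
END;
CREATE TRIGGER upd_rights AFTER UPDATE ON rights_current FOR EACH ROW
BEGIN
    INSERT INTO rights_log VALUES (new.namespace, new.id, new.attr, new.reason, new.time);
END;
""")

# Initial bibliographic determination, then a later manual redetermination.
conn.execute(
    "INSERT INTO rights_current VALUES ('mdp', '39015017678577', 1, 1, '2006-01-12 11:34:27')"
)
conn.execute(
    "UPDATE rights_current SET attr = 4, reason = 4, time = '2006-02-08 15:18:24' "
    "WHERE namespace = 'mdp' AND id = '39015017678577'"
)

log_rows = conn.execute(
    "SELECT attr, reason, time FROM rights_log ORDER BY time"
).fetchall()
current_rows = conn.execute(
    "SELECT attr, reason, time FROM rights_current"
).fetchall()
```

After these two statements, `rights_current` holds one row for the volume while `rights_log` holds both determinations in order, which is the history-plus-latest split the two tables are designed for.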
