HathiTrust Rights Database

Introduction

Overview of requirements

The repository must store and track rights information for each digitized volume in HathiTrust; this information is used by mechanisms such as the page-turner access system. Some of the challenges in doing this are a) modeling the rights information properly to ease maintenance, b) ensuring accuracy in the semantics of rights, and c) tracking the changes to rights information over time. Frequent updates to millions of records, changes to database structure that make the database unavailable for long periods, or subtle changes over time in the meaning of millions of access control rules are some of the consequences of a poor design. Complicating these challenges is the need for flexibility to accommodate different types of rights information and to develop new access rules, including those that come as a result of negotiations with publishers and manual copyright clearance.

Philosophy

Copyright is complex, and although there are good efforts in modeling and expressing copyright status information for library holdings, this project requires a practical and flexible approach. At best, any solution will be imprecise. We should expect this imprecision and hope to improve on the situation over time.

The safest approach is, as much as possible, to base the rights database on simple, established copyright policy and terminology that is not likely to change. Attributes in the database should be deliberately defined in ways that are consistent with copyright policy. Using cataloging metadata, we can make a basic determination of rights by, for example, characterizing the publications as being either in the public domain or in copyright, being in-copyright but out-of-print and brittle, or being authoritatively copyright-orphaned. Unfortunately, the exceptions to these general principles do not necessarily follow any sort of rule or pattern.

Storage and Maintenance Strategy

Can MARC handle it?

The MARC record format does not have fields intended for the storage of rights information and is not able to store this or similar information at the level of the volume (e.g., for multi-volume works). We are consistent with our colleagues at other institutions in recommending that this information be stored in a separate, large-scale database.

Storage of Rights Information

Our strategy for storing rights information is two-pronged, based on the notion of two extensible sets of attributes:

The first set of attributes characterizes the copyright status of the volume. Examples of this type of attribute are "public domain," "public domain when viewed in the U.S." and "in-copyright"; each attribute is only present when appropriate. The main benefits of this approach are a) insulation from frequent change and b) accuracy in legal terms. The bulk of this rights information is bibliographically-derived (by automated query) at the point of ingest using the relatively stable criteria of US federal government documents, country of publication and publication date. Over time, additional attributes will be defined and added as we identify out-of-print brittle books, orphaned-copyright works, etc.

The second set of attributes does not characterize the volume in terms of copyright status, and instead directly specifies access control rules. These can be thought of as "overrides" to the first, more general set of attributes. An obvious example of this type of information is "available to UM affiliates".

For accurate representation, or due to changes in copyright status over time, some volumes may have more than one rights attribute. However, to simplify access decisions, rights attributes should be defined so that the most recent rights attribute is authoritative. For example, a volume initially classified as public domain is discovered to be in-copyright, but explicit access to UM affiliates has been granted by the copyright holder. In such a case, three attributes apply to the volume: its original public domain classification (as valuable history), its current status as in-copyright, and the explicit access control granted by the copyright holder. The latter takes precedence; as mentioned above, explicit access controls can be thought of as "overrides" to the more general copyright status attributes.

How Rights Decisions Are Made

At the core of the rights system is an algorithm that considers a) the copyright status and/or explicit access controls associated with the volume, b) the volume's digitizing agent (e.g., Google or the University of Chicago), and c) the identity of the user (if known) in order to determine access rights. The access rights may differ based on any of these criteria.

Because most rights attributes will be static and will characterize the copyright status of the volume in general terms (e.g., "out-of-print and brittle"), the decision matrix underlying this algorithm can easily accommodate changes in rights over time. For example, we launched a service that prohibited access to in-copyright materials, even when those volumes were out-of-print and damaged, but over time, a change in policy granted in-library users access to those materials by virtue of Section 108 provisions in US copyright law. A change in access due to such a policy change will only require a simple change in the decision matrix.

Some volumes will have a series of rights attributes applied over time. For these volumes, the most recent rights attribute will be used to determine access rights.
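
The "most recent attribute wins" rule and the decision matrix can be sketched as follows. This is a minimal illustration under stated assumptions, not the production algorithm: the user classes (anonymous, affiliate, in_library), the wildcard source "*", and every matrix entry are hypothetical and chosen only to mirror the Section 108 example above.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class RightsRow:
    attr: str        # attribute name, e.g. "ic" or "opb"
    source: str      # digitizing agent, e.g. "google"
    time: datetime   # when the attribute was applied


def active_row(rows):
    """Return the most recently applied rights row for a volume."""
    return max(rows, key=lambda row: row.time)


def can_view(rows, user_class, matrix):
    """Look up the active attribute in the decision matrix."""
    row = active_row(rows)
    # Prefer an entry for this specific source, then fall back to the wildcard.
    allowed = matrix.get(
        (row.attr, row.source),
        matrix.get((row.attr, "*"), frozenset()),
    )
    return user_class in allowed


# Hypothetical matrix: (attribute, source) -> user classes allowed to view.
MATRIX_V1 = {
    ("pd", "*"): frozenset({"anonymous", "affiliate", "in_library"}),
    ("ic", "*"): frozenset(),
    ("opb", "*"): frozenset(),
}
# The policy change edits one matrix entry and touches no rights rows.
MATRIX_V2 = {**MATRIX_V1, ("opb", "*"): frozenset({"in_library"})}

history = [
    RightsRow("ic", "google", datetime(2006, 1, 12, 11, 34, 26)),
    RightsRow("opb", "google", datetime(2006, 2, 8, 15, 18, 24)),
]
print(can_view(history, "in_library", MATRIX_V1))  # False
print(can_view(history, "in_library", MATRIX_V2))  # True
```

Keeping the policy in the matrix rather than in the per-volume rows is what lets a policy change apply to every "opb" volume without rewriting millions of records.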

Rights Assignment

General Process Overview

As a volume's images are ingested, they are placed in storage, and the retrieval system sends the identifier to Mirlyn for the item record to be created or updated. Mirlyn then performs a simple test for copyright status: (1) was the volume published in the US or outside the US? (2) depending on where it was published, was it published before a known cutoff date? and (3) if published in the US, is the volume a US federal government publication? The appropriate attributes are then stored in the rights database.
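
The bibliographic test can be sketched as a small function. The cutoff years are the ones given in Use case 1 in the Use Cases section; the function name and its parameters are hypothetical, and the real determination is performed by a query in Mirlyn rather than by code like this.

```python
def bib_rights(published_in_us: bool, pub_year: int, us_gov_doc: bool = False):
    """Return (attribute, reason) from bibliographic data alone.

    Cutoffs follow Use case 1: US before 1923, non-US before 1869,
    and non-US 1869 through 1908 as public domain in the US only.
    """
    if published_in_us:
        if us_gov_doc or pub_year < 1923:
            return ("pd", "bib")
        return ("ic", "bib")
    if pub_year < 1869:
        return ("pd", "bib")
    if pub_year <= 1908:
        return ("pdus", "bib")
    return ("ic", "bib")


print(bib_rights(True, 1932))          # ('ic', 'bib')
print(bib_rights(True, 1950, True))    # ('pd', 'bib')
print(bib_rights(False, 1890))         # ('pdus', 'bib')
```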

With the volume in storage and represented in the rights database, it is available via the access system. When an action is requested in the access system, the access system consults the rights database. Based on the most recent rights attribute, the source, and whether the user has authenticated, a list of allowable actions is composed. The access system either performs the requested action, prompts the user for authentication, or denies the action.
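
The three possible outcomes of a request can be sketched as follows. This is an illustration only: the Outcome names, the action strings, and the idea of passing in two precomputed sets of allowable actions are assumptions, not the access system's actual interface.

```python
from enum import Enum


class Outcome(Enum):
    PERFORM = "perform"
    PROMPT = "prompt for authentication"
    DENY = "deny"


def resolve(action, authenticated, allowed_anonymous, allowed_authenticated):
    """Decide what the access system does with a requested action.

    allowed_anonymous:     actions open to any user
    allowed_authenticated: additional actions open after authentication
    """
    if action in allowed_anonymous:
        return Outcome.PERFORM
    if action in allowed_authenticated:
        # Logging in would unlock the action, so ask for it if needed.
        return Outcome.PERFORM if authenticated else Outcome.PROMPT
    return Outcome.DENY


anon = {"view_metadata"}
auth = {"view_pages"}
print(resolve("view_pages", False, anon, auth))     # Outcome.PROMPT
print(resolve("view_pages", True, anon, auth))      # Outcome.PERFORM
print(resolve("download_pdf", True, anon, auth))    # Outcome.DENY
```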

Use Cases

The following simplified, hypothetical rights examples help illustrate both the attributes applied and the rules interpreted:

  1. Mass identification of copyright status based on bibliographically-derived information:
    a) As texts are loaded, a set query in Mirlyn identifies those texts that are:
    • US federal government documents, or
    • published in the US prior to 1923, or
    • published outside of the US before 1869

    These are treated as public domain (ATTRIBUTE name=pd) based on bibliographically-derived information (REASON name=bib). We do not restrict access to these materials.

    b) Those texts that do not meet these criteria (e.g., published in the US in 1923 or later and not a government document) are treated as in-copyright (i.e., ATTRIBUTE name=ic and REASON name=bib).

    c) An additional attribute is used to represent works published outside the United States from 1869 through 1908, because copyright status for these works depends on the location of the user. Works published outside the US prior to 1909 are in the public domain in the US; however, due to the variations in copyright law in countries outside the US, it is estimated that 1869 is the earliest date from which foreign works may still be under copyright. Therefore, users accessing the volume from US IP addresses will have access to the works published outside the US from 1869 through 1908; however, users with non-US IP addresses will not (ATTRIBUTE name=pdus and REASON name=bib).

  2. Manually-determined public domain: The text, Edward Carpenter, The British Tolstoi by Bell, was published in the United States in 1932. As outlined in Use case #1, it enters the collection as an in-copyright text, and access is restricted. Upon investigation, we note that no copyright notice was printed in the text. According to US copyright law, this text is in the public domain. This text is treated as public domain (ATTRIBUTE name=pd) based on the absence of a copyright statement (REASON name=ncn). We do not restrict access to this text.
  3. Negotiated UM access: The text, Wishbone by Stirling Bowen, was published in 1930 in the United States. As outlined in Use case #1, it enters the collection as an in-copyright text, and access is restricted. Upon investigation, we confirm that copyright has been renewed for the text, and that the current rights holder is Penguin. We contact Penguin and, after negotiation, Penguin grants permission to provide access to affiliated users, stipulates that we not make reprints available, and requires that the agreement be renewed in five years. We provide open access to partner IP addresses and to authenticated users, and do not provide the page images to our reprint services. This text is treated as in-copyright and restricted to affiliates and walk-in patrons based on a letter or contract (ATTRIBUTE name=umall; REASON name=con).
  4. Negotiated open access: Tradition and design in the Iliad, by C. M. Bowra was published by the Clarendon Press (Oxford University) in 1930. As outlined in Use case #1, it enters the collection as an in-copyright text, and access is restricted. A letter requesting permission to provide online access is sent to the rights holder, Oxford University Press, and they grant permission with the stipulation that no reprints are sold. We provide open access to the world, but do not make the page image files available to our reprint services. This text is treated as in-copyright and open access, based on a letter or contract (ATTRIBUTE name=world for REASON name=con).
  5. Out-of-print and brittle: Through standard mechanisms (e.g., review of a volume at a circulation desk), The mission to Spain of Pierre Soule, 1853-1855 by Amos Ettinger, OUP, 1932 is evaluated and determined to be too brittle to circulate. Subsequent searching determines that no later imprints or editions are available. We also conclude that Ettinger's volume is in copyright. The volume is reformatted and put online and classified as in-copyright, out-of-print, and brittle (ATTRIBUTE name=opb) with an indication of our research on file (REASON name=cip). Access is restricted to affiliated users and walk-in users in the campus libraries.
  6. Copyright not renewed: Cheese production in Nebraska, by Walter Martin Kollmorgen, published in the United States in 1938, is initially designated in-copyright. Subsequent review of copyright renewal records indicates that the copyright in this work was not renewed. Therefore this volume is reclassified as public domain (ATTRIBUTE name=pd) based on the results of copyright renewal research (REASON name=ren).
  7. Orphaned copyright work: (NOTE: Although all of these use cases are hypothetical, this one depends on legislation that has not been passed.) Alfredo Candia Guzman's Bolivia: un experimento comunista en la America, published in Bolivia in the 1950s, is reformatted as part of a topical conversion effort. Although, based on its bibliographic information (see Use case #1), the volume enters the collection as an in-copyright text and access is restricted, our research determines that the author has died and the publishing house no longer exists. The volume is classified as an orphaned copyright work (ATTRIBUTE name=orph) with noted due diligence (REASON name=ddd) and put online with access open to the world.

Precedence of Rights Information

Background

Over time, the rights status of any volume may be redetermined by a number of methods. For example, updates to the bibliographic record or manual copyright determination processes may change the rights status of a volume.

Some determinations are more authoritative than others. Updates to rights data must take that into account and enforce precedence so that the most recent, most authoritative determination is in effect.

Using rights types and reason codes to infer precedence

We have identified four levels of precedence in the rights model we are currently using. These are, in order of increasing authority, as follows:

precedence   rights type  reason code          examples
1 (lowest)   copyright    bib                  pd/bib, ic/bib, und/bib, pdus/bib
2            copyright    any but bib and man  pd/ncn, orph/ddd, pd/ren, ic/ren, und/nfi, pd/cdpp, ic/cdpp, opb/cip
3            access       any but man          umall/con, world/con, nobody/pvt
4 (highest)  any          man                  pd/man, world/man, ic/man, nobody/man

(note on precedence 4: these are never allowed in automatic rights updates, and should have a corresponding explanation in the 'note' field)

In this model, rights of a given precedence should be superseded only by rights of an equal or greater precedence. For example:

  • Rights determined by bibliographic record extract (precedence level 1) are by far the most plentiful but have the least authority. They can only override other bibliographic record-derived rights, and can be overridden by processes that involve manual inspection as well as by access controls placed on volumes.
  • Other copyright-type attributes, such as "ncn" (no printed copyright notice), are the result of manual inspection and have greater authority than those derived from bibliographic records. However, access controls for blocking private information or due to special contractual arrangements should take precedence over these determinations.
  • Access controls are applied when the rights of the volume are independent of copyright status, such as a special release from the copyright holder. Even the most thorough copyright determination should not be allowed to change the availability (or non-availability) of such items.
  • A manually-applied access control (as opposed to an algorithmically-applied) is absolutely authoritative.
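
The four levels can be derived from the attribute type and the reason code alone. The sketch below assumes the attribute types listed in the ATTRIBUTES table under Database Layout; the function name and the dictionary are illustrative, not part of the existing system.

```python
# Attribute name -> type, copied from the ATTRIBUTES table.
ATTRIBUTE_TYPES = {
    "pd": "copyright", "ic": "copyright", "opb": "copyright",
    "orph": "copyright", "und": "copyright", "pdus": "copyright",
    "umall": "access", "world": "access", "nobody": "access",
}


def precedence(attr: str, reason: str) -> int:
    """Return the precedence level, 1 (lowest) to 4 (highest)."""
    if reason == "man":
        return 4  # manual override, regardless of attribute type
    if ATTRIBUTE_TYPES[attr] == "access":
        return 3  # access control, any reason but man
    if reason == "bib":
        return 1  # copyright status from the bibliographic record
    return 2      # copyright status from manual inspection or research


print(precedence("pd", "bib"))      # 1
print(precedence("pd", "ncn"))      # 2
print(precedence("umall", "con"))   # 3
print(precedence("nobody", "man"))  # 4
```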

Rules for rights precedence

With one exception, rights update processes should insert new rights information for a given volume only when the newly supplied rights status is of equal or greater precedence than the active (latest) rights status for that volume.

The exception to this rule is as follows: because access controls have precedence over other rights attributes, they must always remain in effect. However, we will not discard manually-determined rights information for those volumes because, for example, we plan to eventually strip some access controls (such as those which are blocking volumes where private information is visible in images). Our system does not currently handle this very gracefully; the workaround is that when new rights data for such volumes is inserted, the access control should immediately be given a later timestamp so that it remains in effect. This will allow us to, in the near future, remove the access control and the most recent rights determination will take effect.

Note that rights may be supplied and re-supplied from manual copyright determination processes. According to these rules, those rights all fall within the same level of precedence, so each newer determination supersedes the earlier one. It is assumed that a more recent determination is a more accurate determination.

Manual access controls will be ignored by rights updates and must currently be entered manually by an administrator.
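
The update rules can be sketched as a pure function over a volume's rights history. This is one reading of the rules above, not the actual update code: the Rights record, the precomputed level field, the one-second re-stamp offset, and the choice to reject incoming "man" rights with an error are all assumptions.

```python
from dataclasses import dataclass, replace
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Rights:
    attr: str
    reason: str
    level: int       # precedence level, 1 (lowest) to 4 (highest)
    time: datetime


def active(history):
    """The rights row currently in effect is the most recent one."""
    return max(history, key=lambda r: r.time)


def apply_update(history, new):
    """Return the history after an automatic update offers `new`."""
    if new.reason == "man":
        # Assumption: manual overrides never arrive via automatic updates.
        raise ValueError("manual access controls must be entered by an administrator")
    if not history:
        return [new]
    current = active(history)
    if new.level >= current.level:
        return history + [new]
    if current.level == 3 and new.level == 2:
        # Exception: keep the manual determination, then re-stamp the
        # access control so that it remains the active row.
        restamped = replace(current, time=new.time + timedelta(seconds=1))
        kept = [r for r in history if r is not current]
        return kept + [new, restamped]
    return list(history)  # lower precedence: the update is discarded


t0 = datetime(2006, 1, 12, 11, 34, 26)
history = [
    Rights("ic", "bib", 1, t0),
    Rights("nobody", "pvt", 3, t0 + timedelta(days=1)),
]
updated = apply_update(history, Rights("pd", "ren", 2, t0 + timedelta(days=2)))
print(active(updated).attr)  # nobody
```

Once the access control is later removed, the most recent remaining row (here pd/ren) becomes the active one without any further update.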

Security Considerations

In addition to building the model, the following requirements have been identified to ensure proper enforcement of rights:

  • If viewing an item that requires authentication, the access system must re-check for valid authentication on every access.
  • To the extent that rights are enforced via cookies, cookies must be adequately protected from theft.
  • more?
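
As one illustration of the cookie requirement, the sketch below builds a Set-Cookie value with the standard attributes that limit exposure of a session cookie. It assumes a Python web tier and a cookie named "session", neither of which is stated in this document, and it covers only transport and script access; it does not replace re-checking authentication on every access.

```python
from http.cookies import SimpleCookie


def session_cookie_header(token: str) -> str:
    """Build a Set-Cookie value for an opaque session token."""
    cookie = SimpleCookie()
    cookie["session"] = token
    cookie["session"]["secure"] = True      # only sent over HTTPS
    cookie["session"]["httponly"] = True    # not readable from page scripts
    cookie["session"]["samesite"] = "Strict"  # requires Python 3.8+
    cookie["session"]["path"] = "/"
    return cookie["session"].OutputString()


header = session_cookie_header("abc123")
print(header)
```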

Database Layout

Diagram

RIGHTS
id              attr  reason  source  user      time                 note
39015054477651  1     1       1       root      2006-01-12 11:34:26
39015017678577  1     1       1       root      2006-01-12 11:34:27
39015017678577  4     4       1       sooty     2006-02-08 15:18:24  determined by jaheim as in-copyright, but orphaned
39015034781842  2     1       1       root      2006-01-12 11:34:28
39015034781842  6     5       1       pwillett  2006-03-08 09:12:45  agreement reached with publisher for campus access


ATTRIBUTES
id  name    type       dscr
1   pd      copyright  public domain
2   ic      copyright  in-copyright
3   opb     copyright  out-of-print and brittle (implies in-copyright)
4   orph    copyright  copyright-orphaned (implies in-copyright)
5   und     copyright  undetermined copyright status
6   umall   access     available to UM affiliates and walk-in patrons (all campuses)
7   world   access     available to everyone in the world
8   nobody  access     available to nobody; blocked for all users
9   pdus    copyright  public domain only when viewed in the US


REASONS
id  name  dscr
1   bib   bibliographically-derived by automatic processes
2   ncn   no printed copyright notice
3   con   contractual agreement with copyright holder on file
4   ddd   due diligence documentation on file
5   man   manual access control override; see note for details
6   pvt   private personal information visible
7   ren   copyright renewal research was conducted
8   nfi   needs further investigation (copyright research partially complete; an ambiguous, unclear, or other time-consuming situation was encountered)
9   cdpp  title page or verso contain copyright date and/or place of publication information not in bib record
10  cip   condition review and in-print status research was conducted
11  unp   unpublished work


SOURCES
id  name         dscr
1   google       Google
2   lit-dlps-dc  Library IT, Digital Library Production Service, Digital Conversion

SQL Create Statement

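-- One row per rights determination; a volume accumulates rows over time.
CREATE TABLE rights (
    id     VARCHAR(32) NOT NULL,   -- volume identifier
    attr   TINYINT     NOT NULL,   -- refers to attributes.id
    reason TINYINT     NOT NULL,   -- refers to reasons.id
    source TINYINT     NOT NULL,   -- refers to sources.id
    user   VARCHAR(32) NOT NULL,   -- account that recorded the determination
    time   TIMESTAMP   NOT NULL,   -- the latest row per volume is in effect
    note   TEXT,
    PRIMARY KEY (id, time)
);

-- Lookup table: copyright status attributes and access control attributes.
CREATE TABLE attributes (
    id   TINYINT UNSIGNED NOT NULL AUTO_INCREMENT UNIQUE KEY,
    type ENUM('access','copyright') NOT NULL,
    name VARCHAR(16) NOT NULL,
    dscr TEXT NOT NULL
);

-- Lookup table: why an attribute was assigned.
CREATE TABLE reasons (
    id   TINYINT UNSIGNED NOT NULL AUTO_INCREMENT UNIQUE KEY,
    name VARCHAR(16) NOT NULL,
    dscr TEXT NOT NULL
);

-- Lookup table: digitizing agents.
CREATE TABLE sources (
    id   TINYINT UNSIGNED NOT NULL AUTO_INCREMENT UNIQUE KEY,
    name VARCHAR(16) NOT NULL,
    dscr TEXT NOT NULL
);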
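
A query for the rights currently in effect selects, for each volume, the row with the latest timestamp. The sketch below runs that query against an in-memory SQLite database so that it is self-contained; the schema is a simplified SQLite adaptation of the MySQL statements above (an assumption, with the sources table omitted), and the rows are the sample data from the diagram.

```python
import sqlite3

SCHEMA = """
CREATE TABLE attributes (id INTEGER PRIMARY KEY, type TEXT NOT NULL,
                         name TEXT NOT NULL, dscr TEXT NOT NULL);
CREATE TABLE reasons (id INTEGER PRIMARY KEY, name TEXT NOT NULL,
                      dscr TEXT NOT NULL);
CREATE TABLE rights (id TEXT NOT NULL, attr INTEGER NOT NULL,
                     reason INTEGER NOT NULL, source INTEGER NOT NULL,
                     user TEXT NOT NULL, time TEXT NOT NULL, note TEXT,
                     PRIMARY KEY (id, time));
"""

# Latest row per volume, with attribute and reason names resolved.
CURRENT_RIGHTS = """
SELECT r.id, a.name, rs.name
FROM rights r
JOIN attributes a ON a.id = r.attr
JOIN reasons rs ON rs.id = r.reason
WHERE r.time = (SELECT MAX(r2.time) FROM rights r2 WHERE r2.id = r.id)
ORDER BY r.id
"""


def build_sample_db():
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany("INSERT INTO attributes VALUES (?, ?, ?, ?)", [
        (1, "copyright", "pd", "public domain"),
        (2, "copyright", "ic", "in-copyright"),
        (4, "copyright", "orph", "copyright-orphaned (implies in-copyright)"),
        (6, "access", "umall", "available to UM affiliates and walk-in patrons"),
    ])
    conn.executemany("INSERT INTO reasons VALUES (?, ?, ?)", [
        (1, "bib", "bibliographically-derived by automatic processes"),
        (4, "ddd", "due diligence documentation on file"),
        (5, "man", "manual access control override; see note for details"),
    ])
    conn.executemany("INSERT INTO rights VALUES (?, ?, ?, ?, ?, ?, ?)", [
        ("39015054477651", 1, 1, 1, "root", "2006-01-12 11:34:26", None),
        ("39015017678577", 1, 1, 1, "root", "2006-01-12 11:34:27", None),
        ("39015017678577", 4, 4, 1, "sooty", "2006-02-08 15:18:24",
         "determined by jaheim as in-copyright, but orphaned"),
        ("39015034781842", 2, 1, 1, "root", "2006-01-12 11:34:28", None),
        ("39015034781842", 6, 5, 1, "pwillett", "2006-03-08 09:12:45",
         "agreement reached with publisher for campus access"),
    ])
    return conn


def current_rights(conn):
    """Return (volume id, attribute name, reason name) for each volume."""
    return conn.execute(CURRENT_RIGHTS).fetchall()


for row in current_rights(build_sample_db()):
    print(row)
```

Because timestamps are stored here as text in YYYY-MM-DD HH:MM:SS form, MAX compares them correctly as strings; in MySQL the TIMESTAMP column would be compared natively.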