Overview of requirements
The repository must store and track rights information for each digitized volume in HathiTrust; mechanisms such as the PageTurner access system rely on this information. Some of the challenges in doing this are
a) modeling the rights information properly to ease maintenance,
b) ensuring accuracy in the semantics of rights, and
c) tracking the changes to rights information over time.
Frequent updates to millions of records, changes to database structure that make the database unavailable for long periods, or subtle changes over time in the meaning of millions of access control rules are some of the consequences of a poor design. Complicating these challenges is the need for flexibility to accommodate different types of rights information and to develop new access rules, including those that come as a result of negotiations with publishers and manual copyright clearance.
Copyright is complex, and although there are good efforts in modeling and expressing copyright status information for library holdings, this project requires a practical and flexible approach. At best, any solution will be imprecise. We should expect this imprecision and hope to improve on the situation over time. The safest approach is, as much as possible, to base the rights database on simple, established copyright policy and terminology that is not likely to change. Attributes in the database should be deliberately defined in ways that are consistent with copyright policy. Using cataloging metadata, we can make a basic determination of rights by, for example, characterizing the publications as being either in the public domain or in copyright, or being in-copyright but out-of-print and brittle. Unfortunately, the exceptions to these general principles do not necessarily follow any sort of rule or pattern.
Storage and Maintenance Strategy
Can MARC handle it?
The MARC record format does not have fields intended for the storage of rights information and is not able to store this or similar information at the level of the volume (e.g., for multi-volume works). We are consistent with our colleagues at other institutions in recommending that this information be stored in a separate, large-scale database.
Storage of Rights Information
Our strategy for storing rights information is two-pronged, based on the notion of two extensible sets of attributes: The first set of attributes characterizes the copyright status of the volume. Examples of this type of attribute are “public domain,” “public domain when viewed in the U.S.” and “in-copyright”; each attribute is only present when appropriate. The main benefits of this approach are
a) insulation from frequent change and
b) accuracy in legal terms.
The bulk of this rights information is bibliographically derived (by automated query) at the point of ingest using the relatively stable criteria of US federal government documents, country of publication and publication date. Over time, additional attributes will be defined and added as we identify out-of-print books, etc. The second set of attributes does not characterize the volume in terms of copyright status, and instead directly specifies access control rules. These can be thought of as “overrides” to the first, more general set of attributes. For accurate representation, or due to changes in copyright status over time, some volumes may have more than one rights attribute. However, to simplify access decisions, rights attributes should be defined so that the most recent rights attribute is authoritative.
How Rights Decisions Are Made
At the core of the rights system is an algorithm that considers (a) the copyright status and/or explicit access controls associated with the volume, (b) the volume’s digitizing agent (e.g., Google or the University of Chicago), and (c) relevant characteristics of the user (if known) in order to determine access rights. The access rights may differ based on any of these criteria. Because most rights attributes will be static and will characterize the copyright status of the volume in general terms (e.g., “out-of-print and brittle”), the decision matrix underlying this algorithm can easily accommodate changes in rights over time. For example, we launched a service that prohibited access to in-copyright materials, even when those volumes were out-of-print and damaged; over time, a change in policy granted in-library users access to those materials by virtue of Section 108 provisions in US copyright law. Such a policy change requires only a simple change in the decision matrix. Some volumes will have a series of rights attributes applied over time. For these volumes, the most recent rights attribute will be used to determine access rights.
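To illustrate why a policy change touches only the decision matrix and not the per-volume records, here is a minimal sketch. The user classes, the matrix layout, and the function name are assumptions made for illustration; the attribute names pd, ic, and op are the ones used in this document. For brevity the sketch ignores the digitizing agent, which the real algorithm also considers.

```python
# Hypothetical user classes; the document only says "relevant characteristics of the user".
ANYONE = "anyone"
IN_LIBRARY = "in_library"

# Hypothetical decision matrix: attribute name -> user classes allowed to view.
# The entries are illustrative, not the production policy.
DECISION_MATRIX = {
    "pd": {ANYONE, IN_LIBRARY},
    "ic": set(),
    "op": set(),  # initial policy: no access, even for out-of-print volumes
}


def can_view(matrix, attr, user_class):
    """Return True if the matrix lets user_class view a volume carrying attr.

    Unknown attributes are denied by default.
    """
    return user_class in matrix.get(attr, set())


# A policy change (e.g. in-library access under Section 108) is a single
# matrix edit; no volume's stored attribute has to change.
UPDATED_MATRIX = {**DECISION_MATRIX, "op": {IN_LIBRARY}}
```

Under these assumptions, a volume marked op is closed to everyone with DECISION_MATRIX and open only to in-library users with UPDATED_MATRIX, while its stored attribute stays the same.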
General Process Overview
As a volume’s images are ingested, they are placed in storage, and the retrieval system sends the identifier to Mirlyn for the item record to be created or updated. Mirlyn then performs a simple test for copyright status: (1) was the volume published in the US or outside the US? (2) depending on where it was published, was it published before a known cutoff date? and (3) if published in the US, is the volume a US federal government publication? The appropriate attributes are then stored in the rights database. With the volume in storage and represented in the rights database, it is available via the access system. When an action is requested in the access system, the access system consults the rights database. Based on the most recent rights attribute, the access profile or source, and whether the user has authenticated, a list of allowable actions is composed. The access system either performs the requested action, prompts the user for authentication, or denies the action.
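The final step of this process (perform, prompt for authentication, or deny) can be sketched as follows. The function name, the two sets of allowable actions, and the action names in the usage note are hypothetical; the document does not specify how the list of allowable actions is represented.

```python
def resolve_action(requested, allowed_anonymous, allowed_authenticated, authenticated):
    """Decide what the access system does with a requested action.

    allowed_anonymous: actions permitted without logging in (assumed set).
    allowed_authenticated: extra actions permitted once logged in (assumed set).
    Returns "perform", "authenticate", or "deny".
    """
    if requested in allowed_anonymous:
        return "perform"
    if requested in allowed_authenticated:
        # The action exists for this volume but requires a login first.
        return "perform" if authenticated else "authenticate"
    return "deny"
```

For example, with a hypothetical "view_page" action allowed anonymously and "download_pdf" allowed only after login, an anonymous download request yields "authenticate" rather than "deny".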
The following simplified, hypothetical rights examples help illustrate both the attributes applied and the rules interpreted:
- Mass identification of copyright status based on bibliographically derived information: (a) As texts are loaded, a set query in Mirlyn identifies those texts that are:
- US federal government documents, or
- published in the US prior to 1928, or
- published outside of the US before 1898
These are treated as public domain (ATTRIBUTE name=pd) based on bibliographically derived information (REASON name=bib). We do not restrict access to these materials.
(b) Those texts that do not meet these criteria (e.g., US 1928 or later and not a government document) are treated as in-copyright (i.e., ATTRIBUTE name=ic and REASON name=bib).
(c) An additional attribute is used to represent works published outside the United States from 1898 to 1927, because copyright status for these works depends on the location of the user. Works published outside the US prior to 1928 are in the public domain in the US; however, because copyright law varies in countries outside the US, 1898 is estimated to be the earliest date from which works published in those countries may still be under copyright. Therefore, users accessing the volume from US IP addresses will have access to works published outside the US from 1898 through 1927; users with non-US IP addresses will not (ATTRIBUTE name=pdus and REASON name=bib).
- Manually determined public domain: The text, Edward Carpenter, The British Tolstoi by Bell, was published in the United States in 1932. As outlined in Use case #1, it enters the collection as an in-copyright text, and access is restricted. Upon investigation, we note that no copyright notice was printed in the text. According to US copyright law, this text is in the public domain. This text is treated as public domain (ATTRIBUTE name=pd) based on the absence of a copyright statement (REASON name=ncn). We do not restrict access to this text.
- Negotiated open access: Tradition and design in the Iliad, by C. M. Bowra was published by the Clarendon Press (Oxford University) in 1930. As outlined in Use case #1, it enters the collection as an in-copyright text, and access is restricted. A letter requesting permission to provide online access is sent to the rights holder, Oxford University Press, and they grant permission with the stipulation that no reprints are sold. We provide open access to the world, but do not make the page image files available to our reprint services. This text is treated as in-copyright and open access, based on a letter or contract (ATTRIBUTE name=ic-world for REASON name=con).
- Out-of-print and brittle: Through standard mechanisms (e.g., review of a volume at a circulation desk or other staff determination), The mission to Spain of Pierre Soule, 1853-1855 by Amos Ettinger, OUP, 1932 is determined by a HathiTrust member institution to either be too brittle to circulate, or lost or missing from the member’s library collection. The work is found to be preserved in HathiTrust. Investigation of the work by the member or another HathiTrust institution has determined that the work is in copyright and no unused replacements are available on the market at a fair price. The work is classified in the HathiTrust rights database as out-of-print (ATTRIBUTE name=op) with an indication of research on file (REASON name=ipma). The institution updates the print holdings information submitted to HathiTrust to indicate the brittle status of the work, or that it is lost or missing. The updated holdings information is loaded by HathiTrust, and access to the work is made available to affiliated users and walk-in users at that institution.
- Copyright not renewed: Cheese production in Nebraska, by Walter Martin Kollmorgen, published in the United States in 1938, is initially designated in-copyright. Subsequent review of copyright renewal records indicates that the copyright in this work was not renewed. Therefore this volume is reclassified as public domain (ATTRIBUTE name=pd) based on the results of copyright renewal research (REASON name=ren).
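The bibliographic test in the first example above can be sketched as a small function. This is an illustration only: the function name and parameters are hypothetical, and the cutoff years are the ones given in the example (1928 for US publications, 1898 for non-US publications), not values read from any live system.

```python
US_CUTOFF = 1928      # US works published before this year are treated as pd
NON_US_CUTOFF = 1898  # non-US works published before this year are treated as pd


def bib_rights(published_in_us, pub_year, us_fed_doc=False):
    """Return (attribute, reason) from bibliographic data alone.

    published_in_us: True if the place of publication is in the US.
    pub_year: publication year as an integer.
    us_fed_doc: True for US federal government documents.
    """
    if published_in_us:
        if us_fed_doc or pub_year < US_CUTOFF:
            return ("pd", "bib")
        return ("ic", "bib")
    if pub_year < NON_US_CUTOFF:
        return ("pd", "bib")
    if pub_year < US_CUTOFF:
        # Public domain only when viewed from the US.
        return ("pdus", "bib")
    return ("ic", "bib")
```

Applied to the examples above, a US work from 1932 and a non-US work from 1930 both enter the collection as ("ic", "bib"); any later manual determination is recorded as a new attribute and reason rather than by changing this test.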
Precedence of Rights Information
Over time, the rights status of any volume may be redetermined by a number of methods. For example, updates to the bibliographic record or manual copyright determination processes may change the rights status of a volume.
Some determinations are more authoritative than others. Updates to rights data must take that into account and enforce precedence so that the most recent, most authoritative determination is in effect.
Using rights types and reason codes to infer precedence
We have identified five levels of precedence in the rights model we are currently using. These are, in order of increasing authority, as follows:
In this model, rights of a given precedence should be superseded only by rights of an equal or greater precedence. For example:
- Rights determined by bibliographic record extract (precedence level 1) are by far the most plentiful but have the least authority. They can only override other bibliographic record-derived rights, and can be overridden by processes that involve manual inspection as well as by access controls placed on volumes.
- Other copyright-type attributes, such as “ncn” (no printed copyright notice), are the result of manual inspection and have greater authority than those derived from bibliographic records. However, access controls for blocking private information or due to special contractual arrangements should take precedence over these determinations.
- Access controls are applied when the rights of the volume are independent of copyright status, such as a special release from the copyright holder. Even the most thorough copyright determination should not be allowed to change the availability (or non-availability) of such items.
- A manually applied access control (as opposed to an algorithmically applied one) is absolutely authoritative.
Rules for rights precedence
With one exception, the behavior of rights update processes should be to insert new rights information for a given volume when the newly-supplied rights status is of equal or greater precedence to the active (latest) rights status for that volume.
The exception concerns material that Google has determined to be viewable in the US. We allow this information to override ic/bib or und/bib, but NOT pdus/bib or pd/bib. That is, if Google has determined a work is open but HathiTrust’s bibliographic information would cause it to be closed, we will trust Google’s determination. But if HathiTrust’s bibliographic information already makes the work open, we do not override that with Google’s determination.
Note that rights may be supplied and re-supplied from manual copyright determination processes. According to these rules, those rights would all fall within the same level of precedence, so each new determination would supersede the previous one. It is assumed that a more recent determination is a more accurate determination.
Manual access controls will be ignored by rights updates and must currently be entered manually by an administrator.
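The update rule and its Google exception can be sketched as below. The precedence numbers are assumptions chosen for illustration, apart from bib being level 1; in particular, the level assigned to gfv and the attribute carried by a Google determination (pdus in the usage note) are not stated in this document. Manual entry by an administrator is outside the scope of the sketch.

```python
# Hypothetical reason -> precedence level. Only bib = 1 is stated in the text;
# every other value here is an assumption for illustration.
PRECEDENCE = {"bib": 1, "gfv": 1, "ncn": 2, "ren": 2, "con": 3, "man": 4}


def should_insert(current, new, precedence=PRECEDENCE):
    """Return True if the new rights should become the active rights.

    current and new are (attribute, reason) pairs, e.g. ("ic", "bib").
    """
    cur_attr, cur_reason = current
    _new_attr, new_reason = new
    # Exception: a Google viewability determination never overrides
    # bibliographic rights that already open the work (pd/bib, pdus/bib).
    if new_reason == "gfv" and cur_reason == "bib" and cur_attr in ("pd", "pdus"):
        return False
    # General rule: equal or greater precedence wins, so a newer determination
    # at the same level supersedes an older one.
    return precedence[new_reason] >= precedence[cur_reason]
```

With these assumed levels, should_insert(("ic", "bib"), ("pdus", "gfv")) is true and should_insert(("pd", "bib"), ("pdus", "gfv")) is false, matching the exception described above.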
Beyond the model itself, the following requirements have been identified to ensure proper enforcement of rights:
- If viewing an item that requires authentication, the access system must re-check for valid authentication on every access.
- To the extent that rights are enforced via cookies, cookies must be adequately protected from theft.
Reasons
| id | name | description |
|---|---|---|
| 1 | bib | bibliographically-derived by automatic processes |
| 2 | ncn | no printed copyright notice |
| 3 | con | contractual agreement with copyright holder on file |
| 4 | ddd | due diligence documentation on file |
| 5 | man | manual access control override; see note for details |
| 6 | pvt | private personal information visible |
| 7 | ren | copyright renewal research was conducted |
| 8 | nfi | needs further investigation (copyright research partially complete; an ambiguous, unclear, or other time-consuming situation was encountered) |
| 9 | cdpp | title page or verso contain copyright date and/or place of publication information not in bib record |
| 10 | ipma | in-print and market availability research was conducted |
| 12 | gfv | Google viewability set at VIEW_FULL |
| 13 | crms | derived from multiple reviews in the Copyright Review Management System (CRMS) via an internal resolution policy; consult CRMS records for details |
| 14 | add | author death date research was conducted or notification was received from author |
| 15 | exp | expiration of copyright term for non-US work with corporate author |
| 16 | del | deleted from the repository; see note for details |
| 17 | gatt | non-US public domain work restored to in-copyright in the US by GATT |
| 18 | supp | suppressed from view; see note for details |
Sources
| id | name | description |
|---|---|---|
| 2 | lit-dlps-dc | Library IT, Digital Library Production Service, Digital Conversion |
| 3 | ump | University of Michigan Press |
| 6 | mdl | Minnesota Digital Library |
| 7 | mhs | Minnesota Historical Society |
| 8 | usup | Utah State University Press |
| 9 | ucm | Universidad Complutense de Madrid |
| 11 | getty | Getty Research Institute |
| 12 | um-dc-mp | University of Michigan, Duderstadt Center, Millennium Project |
| 13 | uiuc | University of Illinois at Urbana-Champaign |
| 15 | uf | State University System of Florida |
| 17 | udel | University of Delaware |
| 19 | umich | University of Michigan (General) |
| 20 | clark | Clark Art Institute |
| 26 | borndigital | Born Digital (placeholder) |
| 28 | mou | University of Missouri-Columbia |
| 29 | chtanc | National Central Library of Taiwan |
| 30 | bentley-umich | Bentley Historical Library, University of Michigan |
| 31 | clements-umich | William L. Clements Library, University of Michigan |
| 32 | wau | University of Washington |
| 34 | cornell-ms | Cornell University (with support from Microsoft) |
| 35 | umd | University of Maryland |
| 36 | frick | The Frick Collection |
| 38 | umn | University of Minnesota |
| 39 | berkeley | University of California, Berkeley |
| 40 | ucmerced | University of California, Merced |
| 41 | nd | University of Notre Dame |
| 43 | uq | The University of Queensland |
| 44 | ucla | University of California, Los Angeles |
| 45 | osu | The Ohio State University |
| 46 | upenn | University of Pennsylvania |
| 47 | aub | American University of Beirut |
| 48 | ucsd | University of California, San Diego |
Access Profiles
| id | name | description |
|---|---|---|
| 1 | open | Unrestricted image and full-volume download (e.g. Internet Archive) |
| 2 | | Restricted public full-volume download - watermarked PDF only, when logged in or with Data API key (e.g. Google) |
| 3 | page | Page access only: no PDF or ZIP download for anyone (e.g. UM Press) |
| 4 | page+lowres | Low resolution watermarked image derivatives only (e.g. MDL) |
SQL Create Statement
-- Full history of rights determinations, one row per change.
CREATE TABLE rights_log (
  namespace VARCHAR(8) NOT NULL,
  id VARCHAR(32) NOT NULL,
  attr TINYINT NOT NULL,
  reason TINYINT NOT NULL,
  source TINYINT NOT NULL,
  access_profile TINYINT NOT NULL,
  user VARCHAR(32) NOT NULL,
  time TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
  note TEXT,
  PRIMARY KEY (namespace, id, time)
);

-- Active (latest) rights, one row per volume.
CREATE TABLE rights_current (
  namespace VARCHAR(8) NOT NULL,
  id VARCHAR(32) NOT NULL,
  attr TINYINT NOT NULL,
  reason TINYINT NOT NULL,
  source TINYINT NOT NULL,
  access_profile TINYINT NOT NULL,
  user VARCHAR(32) NOT NULL,
  time TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
  note TEXT,
  PRIMARY KEY (namespace, id)
);

-- Copy every insert and update on rights_current into rights_log.
CREATE TRIGGER ins_rights AFTER INSERT ON rights_current
FOR EACH ROW
  INSERT INTO rights_log VALUES (new.namespace, new.id, new.attr, new.reason, new.source, new.access_profile, new.user, new.time, new.note);

CREATE TRIGGER upd_rights AFTER UPDATE ON rights_current
FOR EACH ROW
  INSERT INTO rights_log VALUES (new.namespace, new.id, new.attr, new.reason, new.source, new.access_profile, new.user, new.time, new.note);

CREATE TABLE attributes (
  id TINYINT UNSIGNED NOT NULL AUTO_INCREMENT UNIQUE KEY,
  type ENUM('access','copyright') NOT NULL,
  name VARCHAR(16) NOT NULL,
  dscr TEXT NOT NULL
);

CREATE TABLE reasons (
  id TINYINT UNSIGNED NOT NULL AUTO_INCREMENT UNIQUE KEY,
  name VARCHAR(16) NOT NULL,
  dscr TEXT NOT NULL
);

CREATE TABLE sources (
  id TINYINT UNSIGNED NOT NULL AUTO_INCREMENT UNIQUE KEY,
  name VARCHAR(16) NOT NULL,
  dscr TEXT NOT NULL
);

CREATE TABLE access_profiles (
  id TINYINT(3) UNSIGNED NOT NULL,
  name VARCHAR(16) NOT NULL,
  dscr TEXT NOT NULL,
  PRIMARY KEY (`id`)
);
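To show how the rights_current and rights_log tables work together, the following sketch reproduces a trimmed version of the schema in an in-memory SQLite database. It is an approximation, not the production setup: SQLite's trigger syntax (AFTER INSERT with a BEGIN ... END body) differs from MySQL's, several columns are omitted, and the namespace, volume id, attr values, and timestamps are made up. The reason ids 1 (bib) and 7 (ren) follow the reasons listed in this document.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE rights_current (
    namespace TEXT NOT NULL, id TEXT NOT NULL,
    attr INTEGER NOT NULL, reason INTEGER NOT NULL, time TEXT NOT NULL,
    PRIMARY KEY (namespace, id)
);
CREATE TABLE rights_log (
    namespace TEXT NOT NULL, id TEXT NOT NULL,
    attr INTEGER NOT NULL, reason INTEGER NOT NULL, time TEXT NOT NULL,
    PRIMARY KEY (namespace, id, time)
);
CREATE TRIGGER ins_rights AFTER INSERT ON rights_current FOR EACH ROW
BEGIN
    INSERT INTO rights_log VALUES (new.namespace, new.id, new.attr, new.reason, new.time);
END;
CREATE TRIGGER upd_rights AFTER UPDATE ON rights_current FOR EACH ROW
BEGIN
    INSERT INTO rights_log VALUES (new.namespace, new.id, new.attr, new.reason, new.time);
END;
""")

# Initial bibliographic determination (attr 2 is a placeholder id, reason 1 = bib).
conn.execute(
    "INSERT INTO rights_current VALUES (?, ?, ?, ?, ?)",
    ("ns", "vol1", 2, 1, "2008-01-01 00:00:00"),
)
# Later redetermination after renewal research (attr 1 is a placeholder id, reason 7 = ren).
conn.execute(
    "UPDATE rights_current SET attr = ?, reason = ?, time = ? "
    "WHERE namespace = ? AND id = ?",
    (1, 7, "2009-06-01 00:00:00", "ns", "vol1"),
)

current = conn.execute("SELECT attr, reason FROM rights_current").fetchall()
history = conn.execute("SELECT attr, reason FROM rights_log ORDER BY time").fetchall()
```

After these two statements, rights_current holds only the latest determination for the volume, while rights_log holds both, which is how changes to rights information are tracked over time.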