Guidelines for Digital Object Deposit

Version 1.0 - January 14, 2011
Version 1.1 - Updated number of partners September 21, 2011
Version 1.2 - Updated number of partners December 10, 2011
Version 1.3 - Updated date for U.S. copyright check from 1923 to 1924, January 3, 2019
Version 1.4 - Updated date for U.S. copyright check from 1924 to 1925, December 17, 2019
Version 2.0 - Light editing throughout; updated links to revised digitization spec and submission package spec, February 20, 2020

Summary

The purpose of these guidelines is to facilitate deposit of digital content from a variety of sources into the HathiTrust repository. The guidelines contain an introduction to the HathiTrust Digital Library, a description of its guiding principles and design, a brief overview of the ingest process, and definitions, policies and procedures related to the ingest of digitized book and journal content and associated metadata. Additional information is available at the following links

I. Introduction

HathiTrust is a web-scale digital preservation repository launched as a partnership of major research libraries in 2008. We preserve and provide access to digital materials contributed by our growing list of member institutions.   Together with our members, we:

  • maintain an increasingly comprehensive archive of published literature from around the world, and

  • collaborate on the development of shared strategies for managing and developing digital and print collections at scale.

HathiTrust primarily serves the constituents of its member libraries. The materials in HathiTrust are also available to non-members as permitted by U.S. copyright law, providing access to the scholarly record as a public good to users worldwide.

II. Principles and Design

HathiTrust is guided by principles of trustworthiness, openness and responsible stewardship. We strive to provide reliable, long-term preservation of and access to content in ways that:

  • maximize the contributions of partner institutions, and

  • make the most efficient use of available resources. 

The HathiTrust repository was designed according to the framework for Open Archival Information Systems (OAIS)[1] and is a TRAC-certified Trustworthy Digital Repository.[2] Operating a preservation repository at the scale of HathiTrust[3] requires implementation solutions that favor:

  • consistency and standardization over variation, 

  • simplicity over complexity (in design, not function), and 

  • practicality over ideology.

III. Ingest Overview

Members wishing to deposit content in HathiTrust must prepare and provide the following materials for each digital object:

  • MARC21-compliant bibliographic metadata

  • A Submission Information Package (SIP) containing:

    • Page images that meet our requirements

    • OCR text for each page

    • A metadata file describing the contents of the item

    • Fixity information in the form of a checksum.md5 file
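
As an illustration of the fixity requirement, the following is a minimal sketch of how a depositor might generate and re-check a checksum.md5 file for a package directory. The one-line-per-file "<md5>  <filename>" layout, the flat directory structure, and the function names are assumptions made for this example, not the HathiTrust specification; consult the submission package specification for the required format.

```python
import hashlib
from pathlib import Path


def md5_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_checksum_manifest(package_dir: Path, manifest_name: str = "checksum.md5") -> Path:
    """Write one '<md5>  <filename>' line per file in package_dir.

    The line layout is an assumption for this sketch (it mirrors the
    common md5sum convention), not a documented HathiTrust format.
    """
    manifest = package_dir / manifest_name
    lines = []
    for path in sorted(package_dir.iterdir()):
        # Skip subdirectories and the manifest itself.
        if path.is_file() and path.name != manifest_name:
            lines.append(f"{md5_of_file(path)}  {path.name}")
    manifest.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return manifest


def verify_checksum_manifest(package_dir: Path, manifest_name: str = "checksum.md5") -> list:
    """Return the names of listed files that are missing or whose MD5 differs."""
    failures = []
    text = (package_dir / manifest_name).read_text(encoding="utf-8")
    for line in text.splitlines():
        if not line.strip():
            continue
        expected, name = line.split(None, 1)
        name = name.strip()
        target = package_dir / name
        if not target.is_file() or md5_of_file(target) != expected:
            failures.append(name)
    return failures
```

Running verify_checksum_manifest on a package before transfer gives a quick local check that no file changed after the manifest was written.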

Institutions must provide accurate bibliographic records for submitted content before ingest into the repository can occur. These records describe each digital item and are part of HathiTrust’s digital object management strategy. Specifications for bibliographic metadata are available at https://bit.ly/2UMFUNE, and the submission process is described at https://bit.ly/39uD9Vq.

Prior to deposit, the contributing institution assumes responsibility for content transformation and possible remediation to repository standards.  At ingest, HathiTrust performs a series of content validation steps, and acceptable objects are subsequently written to repository storage (including replication to HathiTrust’s redundant storage sites).

IV. Definitions

The definitions below are adapted from the OAIS reference model and Brian Lavoie’s Introductory Guide to OAIS.[4]

Archival Information Package (AIP) – The Information Package, consisting of the Content Information and the associated Preservation Description Information (PDI), that is preserved in HathiTrust.

Core Information Package (CIP) – The set of files that must be present in an SIP and included in the AIP for a package to be eligible for ingest.

Preferred Information Package (PIP) – The Core Package plus the set of files that may be present in an SIP and included in the AIP for the package to be eligible for ingest.

Submission Information Package (SIP) – The Information Package that is delivered to HathiTrust for use in the construction of one or more AIPs. In the HathiTrust implementation, these may also be referred to as Submission Packages or Content Packages.

Workstream – A Workstream is a group of SIPs that are similar in their technical characteristics (content formats, included metadata, digitization workflow, etc.).  In the HathiTrust implementation, these are referred to as Content Streams.

Content Information – The set of information that is the original target of preservation. It is an Information Object composed of its Content Data Object and Representation Information.

  • Content Data Object – The data object that, together with Representation Information, is the original target of preservation (in HathiTrust currently, page image files and associated OCR files and metadata) 

  • Representation Information – The information that maps a Data Object into more meaningful concepts (this includes at a very granular level standards such as Unicode and TIFF). 

Preservation Description Information – The information that is necessary for adequate preservation of the Content Information and which can be categorized as Provenance, Reference, and Fixity information.

  • Provenance Information – Documents the history of the Content Information, including its creation, any alterations to its content or format over time, its chain of custody, any actions (such as media refreshment or migration) taken to preserve the Content Information, and the outcome of these actions.  

  • Reference Information – Uniquely identifies the Content Information within HathiTrust (e.g., repository identifier), as well as in relation to entities and systems external to HathiTrust (e.g., OCLC number, ISBN, etc.). Note that some reference information, such as ISBN, OCLC number, and other standard identifiers, is contained in MARC records with the Descriptive Information.

  • Fixity Information – Validates the authenticity or integrity of the Content Information: for example, a checksum, a digital signature, or a digital watermark.

Packaging Information – Information that is used to bind all of these information components into a single logical package. Packaging Information serves to associate all of the various components of an AIP, permitting them to be identified and located as a single logical unit within the archival system.  

Descriptive Information – Information that supports the discovery, retrieval, and ordering of Content Information in HathiTrust (bibliographic and rights metadata).

V. Policies

Preservation

HathiTrust strives to ensure that each digital object it preserves is complete, suitable for long-term preservation, and useful for a variety of access purposes.  We do this through conformance with accepted community standards, including the Open Archival Information Systems framework. Preservation in HathiTrust encompasses the content characteristics, metadata, and processes that support the bit-level integrity of content over time and facilitate migration to new formats as technology, standards and developments in the library community necessitate.

From the HathiTrust website (https://www.hathitrust.org/preservation):

“HathiTrust is committed to preserving the intellectual content and in many cases the exact appearance of materials digitized for deposit...HathiTrust is committed to bit-level preservation and format migration of materials…as technology, standards, and best practices in the digital library community change.”

We do this through consideration of content file formats, preservation and descriptive metadata, validation routines, and attention to quality. HathiTrust maintains a sufficient level of conformance with community-wide standards for digital repositories, including redundant storage of materials in geographically separated locations, that it is able to serve reliably as the sole preservation repository for deposited content.

Content Formats

1. HathiTrust supports formats for Content Data Object Files that:

  • are widely held standards in the library community

  • have proven to be robust as far as carrying digital data and minimizing loss of that data over time

  • enable multiple alternative and downstream uses (such as print-on-demand) with high confidence, and

  • are likely to be supported by forward migration paths: should a more robust format be developed, or should technological or other developments dictate the need to alter or abandon the current format standards, there is high confidence that the library community as a whole will devote energy and resources to such paths

2. We support three Content Data Object formats currently: TIFF ITU G4, JP2 (JPEG 2000 part 1), and Unicode OCR with and without coordinates. The first two are base image formats from which the third can be derived.

3. We do not currently (as of September 2010) support schemas or DTDs that describe publication structures (e.g., DocBook, TEI, ePub).

4. We do not accept derivative image formats (JPEG or PNG, for example) in the digital object package.

5. We do not perform additional processing to de-skew or crop files for display to end-users.  
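
The format policy above lends itself to a simple pre-deposit check. The sketch below flags derivative image files in a list of package filenames. The mapping from formats to file extensions (.tif and .jp2 for base images; .jpg, .jpeg, and .png for derivatives) and the function names are assumptions for illustration only. A real validator would inspect file contents (for example, confirming ITU G4 compression in a TIFF) rather than rely on extensions.

```python
from pathlib import Path

# Assumed extension mappings; the guidelines name formats, not extensions.
BASE_IMAGE_EXTENSIONS = {".tif", ".tiff", ".jp2"}
DERIVATIVE_IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png"}


def is_base_image(filename: str) -> bool:
    """True if the extension suggests a supported base image format."""
    return Path(filename).suffix.lower() in BASE_IMAGE_EXTENSIONS


def find_derivative_images(filenames):
    """Return filenames whose extension suggests a derivative image format."""
    return [
        name
        for name in filenames
        if Path(name).suffix.lower() in DERIVATIVE_IMAGE_EXTENSIONS
    ]
```

A depositor could run find_derivative_images over a package listing and remove or replace any flagged files before building the SIP.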

Preservation Description Information (PDI)

1. PDI in HathiTrust is typically stored in two files conforming to the Metadata Encoding and Transmission Standard (http://www.loc.gov/standards/mets/), which contain Provenance, Reference, Fixity, and Context Information for Content Information as follows:

  • A “Source” METS file is assembled from metadata provided to HathiTrust in the SIP, and contains information about the Content Information from the time of its creation to the time it enters the repository;

  • A “HathiTrust” METS file is created on ingest and includes a subset of the Source METS file data, but is primarily a record of the object from the time it enters the repository forward. The Source METS is kept for preservation purposes only. The HathiTrust METS is used for both preservation and access purposes (i.e., in both the archival and dissemination information packages).

  • More information, including sample METS files, is available at https://www.hathitrust.org/digital_object_specifications.

2. Provenance Information – Preservation information is recorded using PREMIS (Preservation Metadata: Implementation Strategies).

3. Reference Information – Each HathiTrust item has a single primary identifier in the repository, which is composed of two parts: a namespace and an item identifier (typically a physical volume barcode or ARK identifier). Namespaces are usually selected by the contributing institution and delineate content in the repository contributed by that institution from content contributed by other institutions. If items from an institution have more than one distinct identifier scheme, multiple namespaces are used. Namespaces do not delineate content based on technical characteristics, so multiple Workstreams may exist within a single namespace.

We strongly prefer to use item identifiers that are currently in use at the contributing institution because:

  • introducing another identifier scheme requires mapping and/or bulk updates to existing records and ongoing additional maintenance, and

  • by leveraging an existing identifier scheme, institutions may more easily make references to HathiTrust representations of items by automated means within their own systems

If such identifiers are unavailable, or undesirable because they do not have good identifier properties, a vendor or secondary identifier may be used instead. If valid identifiers do not exist, HathiTrust may be able to assign them. Good identifier properties include:

  • guaranteed uniqueness

  • a deterministic process for creating new identifiers

  • an internal check scheme

  • an explicit, existing association to the physical item to be digitized (such as an item barcode)

  • either an accurate correlation or no correlation to existing names or other identifying characteristics of the item (i.e. no implied or misleading relationships)
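
To make the two-part identifier and the idea of an internal check scheme concrete, here is a minimal sketch. Joining the namespace and item identifier with a period, the RepositoryId class, and the use of the Luhn mod-10 algorithm as the check scheme are all assumptions chosen for illustration. The guidelines do not prescribe a particular check scheme, and an institution's barcodes may use a different one.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RepositoryId:
    """A namespace plus an item identifier (hypothetical representation)."""
    namespace: str
    item_id: str

    def __str__(self) -> str:
        # The "." separator is an assumption for this sketch.
        return f"{self.namespace}.{self.item_id}"


def parse_repository_id(text: str) -> RepositoryId:
    """Split on the first '.', so item identifiers may themselves contain dots."""
    namespace, separator, item_id = text.partition(".")
    if not separator or not namespace or not item_id:
        raise ValueError(f"expected '<namespace>.<item_id>', got {text!r}")
    return RepositoryId(namespace, item_id)


def luhn_check_digit(payload: str) -> int:
    """Compute the Luhn (mod 10) check digit for a string of digits."""
    total = 0
    # Walk from the rightmost payload digit, doubling every other digit.
    for position, char in enumerate(reversed(payload)):
        digit = int(char)
        if position % 2 == 0:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return (10 - total % 10) % 10


def has_valid_luhn(number: str) -> bool:
    """True if the final digit of number is the Luhn check digit of the rest."""
    if not number.isdigit() or len(number) < 2:
        return False
    return luhn_check_digit(number[:-1]) == int(number[-1])
```

An identifier with a check digit lets a mistyped or mis-scanned barcode be caught before it is associated with the wrong volume, which is why an internal check scheme appears among the good identifier properties.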

Other Reference Information such as ISBN, OCLC number, and other standard identifiers may be present in the Descriptive Information (bibliographic metadata) that is stored in the Source and HathiTrust METS files. However, these identifiers are stored for preservation purposes only and are not used actively in access systems. The specifications for bibliographic metadata in Content Data files are less directive than those for bibliographic metadata submitted to the bibliographic metadata management system: the metadata need only be valid MARCXML. Specifications for Descriptive Information submitted to the bibliographic management system are provided at https://www.hathitrust.org/bib_specifications.

Descriptive Information and Copyright Determination

1. Bibliographic data must be submitted to HathiTrust before ingest of the corresponding content package can begin, and should follow the specifications listed at https://www.hathitrust.org/bib_specifications.

2. This bibliographic data is then used to make initial automated rights determinations for every deposited item, which are saved in a Rights Database (separate from, but integrated with, items in the repository). Bibliographic data should therefore be as complete and accurate as possible. If insufficient data exists to make a rights determination, volumes default to a status of in copyright, with the same limitations on visibility and use that apply to in-copyright volumes. Details on the bibliographic fields used and processes involved in automatic rights determination can be found at https://www.hathitrust.org/bib_rights_determination.

3. Rights information and permissions pertaining to specific volumes may be attached to the Digital Assets Submission Inventory, which must accompany each deposit. A Creative Commons declaration form is available for rights holders to open access to their works in HathiTrust, including the assignment of Creative Commons licenses, and to enable the production of reprints.

4. Staff from multiple HathiTrust institutions are involved in ongoing manual copyright review of volumes published in the United States between 1925 and 1963. More information about this work can be found at https://www.hathitrust.org/copyright-review. As an outcome of this review, we sometimes ask partner institutions to correct and resubmit their bibliographic data. Determinations resulting from the review process are also saved in the Rights Database.

5. Further information about Rights Management is available on HathiTrust.org.

Quality

1. Although strategies to expand quality review processes in HathiTrust are in development, quality review remains the responsibility of partner institutions.

2. Quality is a high priority for HathiTrust, and we take action where possible to prevent content of poor quality from entering the repository. We work with Google and Google library partners on an ongoing basis to improve the metrics and processes Google employs. We also work with Google to make corrections and improvements to individual volumes based on user feedback.

3. We are exploring opportunities to extend the manual review that partners undertake on their own content more broadly across the repository and develop criteria to certify content in HathiTrust as useful for distinct purposes (see projects). We are also investigating ways to communicate aspects and characteristics of volumes that affect quality (such as missing pages, crop errors, warp errors, etc.) in the user interface.

 


[1] OAIS. Consultative Committee for Space Data Systems. The Open Archival Information System Reference Model (OAIS). Washington, D.C.: CCSDS Secretariat, National Aeronautics and Space Administration, 2002.

[2] TRAC. Trustworthy Repositories Audit & Certification: Criteria and Checklist. Center for Research Libraries and OCLC, 2007. http://www.crl.edu/sites/default/files/attachments/pages/trac_0.pdf

[3] As of May 2017, HathiTrust contained more than 15 million volumes, more than 5 million of which are in the public domain.

[4] Lavoie, B. (2004). The Open Archival Information System Reference Model: Introductory Guide. Digital Preservation Coalition Technology Watch Report 04-01. Dublin, OH: OCLC.