Ingest Checklist

Key Components of Ingest

Digital Assets Submission Inventory

All depositors must fill out a Digital Assets Submission Inventory (DASI) prior to ingest. The DASI designates the specific body of content to be ingested and is the official record of submission. A Permissions Agreement may be used to provide access to submitted works that are not in the public domain if appropriate rights have been obtained.

Submission of Bibliographic Metadata

Bibliographic metadata for each digital volume must be present in HathiTrust in order for the volumes to be ingested. Depositors make accurate bibliographic data available to the University of Michigan to be loaded into HathiTrust's bibliographic management system. The bibliographic records act as a manifest of the digital content and are used as part of a broader digital object management strategy, including rights management. HathiTrust uses information in the submitted bibliographic records to make an initial rights determination about each volume. Details about the information in each bibliographic record that is used to make rights determinations are available at http://www.hathitrust.org/bib_rights_determination.

Bibliographic metadata associated with digital volumes should conform to the following specifications.

  • Data should be in MARC or MARCXML format, in UTF-8 encoding
  • Records as complete as possible
  • Strongly preferred: One bibliographic record per item (multi-volume works should have the same record repeated for each item).
  • Each record should contain a single 955 field
  • Local system number in 001
  • OCLC number (master, not institutional) in an 035 field with appropriate identifying prefix (OCoLC, ocm, ocn, etc.)
  • Strongly preferred: Barcode in 955|b; any alphabetic characters should be lowercase.
  • Strongly preferred: Item description (enumeration / chronology) in 955|v.

Specifications for Internet Archive-digitized content are similar, but the Internet Archive ARK identifier should be in the 955|b field and the Internet Archive identifier in the 955|q:

  • Data should be in MARC or MARCXML format, in UTF-8 encoding
  • Records as complete as possible
  • One bibliographic record per item (multi-volume works should have the same record repeated for each item).
  • Each record should contain a single 955 field
  • Local system number in 001
  • OCLC number (master, not institutional) in an 035 field with appropriate identifying prefix (OCoLC, ocm, ocn, etc.)
  • ARK Identifier in 955|b
  • Strongly preferred: Internet Archive identifier in 955|q; this id should not be lowercased.
  • Strongly preferred: Item description (enumeration / chronology) in 955|v
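
As a rough illustration of the specifications above, the following Python sketch checks a single MARCXML record before submission. It is not HathiTrust's own validation. The function name check_record, the source parameter, and the sample record are hypothetical. It also assumes that the OCLC number sits in subfield a of the 035 field and that an ARK identifier begins with "ark:" (as in the example identifier uc2.ark:/13960/t26973133). The strongly preferred 955|v item description is not checked, since single-volume items may have none.

```python
import re
import xml.etree.ElementTree as ET

MARC_NS = {"marc": "http://www.loc.gov/MARC21/slim"}

# Prefixes named in the checklist (OCoLC, ocm, ocn), matched case-insensitively.
OCLC_PREFIX = re.compile(r"^\(?(ocolc|ocm|ocn)\)?\s*\d+", re.IGNORECASE)

# Hypothetical sample record for a Google-digitized volume.
SAMPLE = """<record xmlns="http://www.loc.gov/MARC21/slim">
  <controlfield tag="001">000123456</controlfield>
  <datafield tag="035" ind1=" " ind2=" ">
    <subfield code="a">(OCoLC)12345678</subfield>
  </datafield>
  <datafield tag="955" ind1=" " ind2=" ">
    <subfield code="b">89094366424</subfield>
    <subfield code="v">v.1</subfield>
  </datafield>
</record>"""


def check_record(record, source="google"):
    """Return a list of problems found in one MARCXML <record> element."""
    problems = []

    # Local system number in 001.
    c001 = record.find("marc:controlfield[@tag='001']", MARC_NS)
    if c001 is None or not (c001.text or "").strip():
        problems.append("missing local system number in 001")

    # OCLC number in an 035 field with an identifying prefix (assumes subfield a).
    values = [
        (sf.text or "").strip()
        for sf in record.findall(
            "marc:datafield[@tag='035']/marc:subfield[@code='a']", MARC_NS
        )
    ]
    if not any(OCLC_PREFIX.match(v) for v in values):
        problems.append("no prefixed OCLC number in 035")

    # Exactly one 955 field per record.
    fields = record.findall("marc:datafield[@tag='955']", MARC_NS)
    if len(fields) != 1:
        problems.append(f"expected a single 955 field, found {len(fields)}")
        return problems

    subs = {
        sf.get("code"): (sf.text or "").strip()
        for sf in fields[0].findall("marc:subfield", MARC_NS)
    }
    b = subs.get("b", "")
    if source == "ia":
        # Internet Archive content: ARK in 955|b, IA identifier in 955|q.
        if not b.startswith("ark:"):
            problems.append("955|b should hold the ARK identifier")
        if not subs.get("q"):
            problems.append("955|q (Internet Archive identifier) is strongly preferred")
    else:
        # Other content: barcode in 955|b, alphabetic characters lowercase.
        if not b:
            problems.append("955|b (barcode) is strongly preferred")
        elif b != b.lower():
            problems.append("955|b barcode should be lowercase")
    return problems


print(check_record(ET.fromstring(SAMPLE)))  # prints []
```

Passing source="ia" applies the Internet Archive variant of the rules. Note that the sketch never lowercases the value in 955|q, in keeping with the specification.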

Submission of content

Submission of content can happen in a variety of ways, specified in the Digital Assets Submission Inventory.

Ingest Checklist

Administrative Information

  • Institution determines namespace identifier for content
    • HathiTrust separates content in the storage system by creating a new namespace for each institution, and for each body of content within an institution that has its own identifier scheme.
    • Some current namespaces include mdp, inu, uc1, pst, wu, and mnu. The namespace is at most 4 characters and is followed by an identifier that is unique within that namespace. This is often, but not always, the physical item barcode. The two together (namespace plus identifier) make up an object's repository identifier. Some example identifiers include:
      • Google-digitized (University of Wisconsin): wu.89094366424
      • Internet Archive-digitized (University of California): uc2.ark:/13960/t26973133
      • Locally-digitized (University of Michigan): miun.aaj0523.1950.001
  • Institution provides information on any validation that can be done of original barcodes (e.g., check digits, starting and ending digit requirements, or other characteristics).
  • Institution selects preferred WorldCat Registry ID from http://www.worldcat.org/registry/institutions
    • The Registry ID will determine how the institution name displays in the forthcoming HathiTrust-OCLC catalog, and will not be used for any other purpose. Many institutions have multiple Registry IDs, so OCLC is asking us to specify each institution's official ID.
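
The namespace-plus-identifier structure described above can be illustrated with a short Python sketch. The function name split_repository_id is hypothetical, and splitting on the first period is an assumption inferred from the three example identifiers, not a documented HathiTrust rule.

```python
def split_repository_id(repo_id):
    """Split a repository identifier into (namespace, identifier).

    Assumes the namespace ends at the first period, as in the
    example identifiers listed in the checklist.
    """
    namespace, sep, identifier = repo_id.partition(".")
    if not sep or not identifier:
        raise ValueError(f"no identifier after namespace: {repo_id!r}")
    # The checklist states that a namespace is at most 4 characters.
    if not 1 <= len(namespace) <= 4:
        raise ValueError(f"namespace must be 1 to 4 characters: {namespace!r}")
    return namespace, identifier


for example in (
    "wu.89094366424",
    "uc2.ark:/13960/t26973133",
    "miun.aaj0523.1950.001",
):
    print(split_repository_id(example))
```

Only the first period is treated as the separator, so identifiers that contain periods themselves, such as aaj0523.1950.001, are kept whole.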

Bibliographic Metadata

  • Institution sends bibliographic data to the University of Michigan (UM)
    • Bibliographic data from each distinct source (e.g., Google, Internet Archive, local) should be sent separately, one file per source
    • Data transfer for Google-digitized volumes: the method the institution uses to make bibliographic data available to Google will work well, if available. The bibliographic data specifications above were selected based on Google’s standards to be as broadly applicable as possible.
    • Data transfer for Internet Archive- and other-digitized volumes: UM will work with institutions on the most convenient way to transfer the bibliographic data if it is not downloaded directly from the Internet Archive.
  • UM receives bibliographic data
  • UM does duplicate detection based on OCLC number
    • This duplicate detection does not weed out items. It associates the items UM receives with existing bib records (if duplicates are detected) or creates new records (if bib records with matching OCLC numbers do not exist).
  • UM loads bibliographic data into metadata management system (currently Aleph)
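
The duplicate detection step above (associate with an existing record, or create a new one) can be sketched in Python as follows. This is a minimal illustration, not UM's actual process: the function names, the dictionary standing in for the catalog, and the sample identifiers are hypothetical. Stripping leading zeros when normalizing OCLC numbers and setting aside items without a usable OCLC number are also assumptions.

```python
import re

# Optional "(OCoLC)" and/or "ocm"/"ocn" prefix, then the number itself.
OCLC_RE = re.compile(r"^(?:\(ocolc\))?\s*(?:ocm|ocn)?\s*0*(\d+)$", re.IGNORECASE)


def normalize_oclc(value):
    """Return the bare OCLC number as a string, or None if unrecognized."""
    match = OCLC_RE.match(value.strip())
    return match.group(1) if match else None


def associate_items(catalog, incoming):
    """Attach incoming items to records keyed by OCLC number.

    catalog maps a normalized OCLC number to a list of item identifiers.
    incoming is an iterable of (raw_oclc_value, item_id) pairs.
    No item is discarded: each is attached to a matching record, given
    a new record, or set aside when it has no usable OCLC number.
    """
    attached, created, skipped = [], [], []
    for raw, item_id in incoming:
        key = normalize_oclc(raw)
        if key is None:
            skipped.append(item_id)
        elif key in catalog:
            catalog[key].append(item_id)  # duplicate detected: reuse the record
            attached.append(item_id)
        else:
            catalog[key] = [item_id]  # no match: create a new record
            created.append(item_id)
    return attached, created, skipped


catalog = {"12345678": ["mdp.1"]}
result = associate_items(
    catalog,
    [("(OCoLC)12345678", "wu.2"), ("ocm00099", "wu.3"), ("n/a", "wu.4")],
)
print(result)
print(catalog)
```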

Content

Google Content:

  • Institution selects a content package type and communicates the choice to Google
  • Most HathiTrust institutions are using a hybrid format for scanned volumes, containing bitonal TIFF files for all-text pages and JPEG2000 files for images. This is the most cost-effective package, as average volume sizes are 25% smaller for the hybrid packages than for the all-JPEG2000 packages.
  • Institution requests that Google open Institution's GRIN instance to UM
  • UM requests decryption keys from Google and begins download from GRIN

Non-Google Content:

  • Institution delivers content to UM through agreed-upon mechanisms (hard drive, file download (e.g., from the Internet Archive), etc.)
  • UM performs pre-ingest transformations and normalizations to create the repository object package
  • Pre-ingest quality assurance may be performed by partner institutions

All Content:

  • UM ingests a sample of objects, which includes validating the objects and creating HathiTrust METS and PREMIS metadata
  • UM performs internal testing to be sure volumes are working properly in the system
  • Ingest reports are made available to the institution at http://www.hathitrust.org/ingest_logs
  • Institution may perform additional QA testing on ingested objects
  • Full ingest of partner institution's digital content begins