Key Components of Ingest
Digital Assets Submission Inventory
All depositors must fill out a Digital Assets Submission Inventory (DASI) prior to ingest. The DASI designates the specific body of content to be ingested and is the official record of submission. A Permissions Agreement may be used to provide access to submitted works that are not in the public domain if appropriate rights have been obtained.
Submission of Bibliographic Metadata
Bibliographic metadata for each digital volume must be present in HathiTrust in order for the volumes to be ingested. Depositors make accurate bibliographic data available to the University of Michigan to be loaded into HathiTrust's bibliographic management system. The bibliographic records act as a manifest of the digital content and are used as part of a broader digital object management strategy, including rights management. HathiTrust uses information in the submitted bibliographic records to make an initial rights determination about each volume. Details about the information in each bibliographic record that is used to make rights determinations are available at http://www.hathitrust.org/bib_rights_determination.
Bibliographic Metadata Specifications
Bibliographic metadata associated with digital volumes should conform to our specifications.
Submission of content
Submission of content can happen in a variety of ways, specified in the Digital Assets Submission Inventory.
Ingest Checklist
Administrative Coversheet
We request a variety of administrative information related to the ingest of both bibliographic metadata and content. A fillable coversheet that can be copied and submitted is available as a Google doc.
Bibliographic Metadata Ingest
- Institution sends bibliographic data to the University of Michigan (UM)
- Bibliographic data from each distinct source (e.g., Google, Internet Archive, local) should be sent separately, one file per source
- Data transfer for Google-digitized volumes: the method the institution uses to make bibliographic data available to Google will work well, if available. The bibliographic data specifications above were selected based on Google’s standards to be as broadly applicable as possible.
- Data transfer for Internet Archive- and other-digitized volumes: UM will work with institutions on the most convenient way to transfer the bibliographic data if it is not downloaded directly from the Internet Archive.
- UM receives bibliographic data
- UM does duplicate detection based on OCLC number
- This duplicate detection does not weed items; it associates items UM receives with existing bib records (if duplicates are detected) or creates new records (if bib records with matching OCLC numbers do not exist).
- UM loads bibliographic data into metadata management system (currently Aleph)
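The duplicate-detection step above can be illustrated with a minimal sketch. This is not UM's actual implementation: the `BibRecord` class, the `associate_items` function, and the in-memory dictionary standing in for the metadata management system are all hypothetical names chosen for illustration. It shows only the behavior described above, in which no incoming item is discarded and each item is either attached to an existing record with the same OCLC number or given a new record.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class BibRecord:
    """Hypothetical stand-in for a bibliographic record."""
    oclc_number: str
    title: str
    item_ids: list[str] = field(default_factory=list)


def associate_items(
    catalog: dict[str, BibRecord],
    incoming: list[tuple[str, str, str]],
) -> list[str]:
    """Attach each incoming (item_id, oclc_number, title) to a record.

    Items are never dropped: a matching OCLC number means the item is
    associated with the existing record; otherwise a new record is created.
    Returns the OCLC numbers for which new records were created.
    """
    created = []
    for item_id, oclc_number, title in incoming:
        record = catalog.get(oclc_number)
        if record is None:
            # No record with this OCLC number exists yet: create one.
            record = BibRecord(oclc_number=oclc_number, title=title)
            catalog[oclc_number] = record
            created.append(oclc_number)
        # In both cases the item is kept and linked to the record.
        record.item_ids.append(item_id)
    return created
```

Records that lack an OCLC number are outside the scope of this sketch; how such records are handled is not described here.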
Content Ingest
Google Content:
- Institution selects a content package type and communicates the choice to Google
- Most HathiTrust institutions are using a hybrid format for scanned volumes, containing bitonal TIFF files for all-text pages and JPEG2000 files for images. This is the most cost-effective package, as average volume sizes are 25% smaller for hybrid packages than for all-JPEG2000 packages.
- Institution requests that Google open Institution's GRIN instance to UM
- UM requests decryption keys from Google and begins download from GRIN
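To give a rough sense of what the 25% size difference between package types means for storage planning, the following sketch applies that ratio to a collection. The function name, the volume count, and the average all-JPEG2000 volume size used in the example are hypothetical placeholders, not HathiTrust or Google figures; only the 25% reduction comes from the text above.

```python
# Hybrid packages average 25% smaller than all-JPEG2000 packages.
HYBRID_SIZE_RATIO = 0.75


def estimate_hybrid_storage_gb(volume_count: int, avg_jp2_volume_mb: float) -> float:
    """Estimate total storage (in GB, 1 GB = 1000 MB) for hybrid packages.

    avg_jp2_volume_mb is the assumed average size of one volume in the
    all-JPEG2000 format; institutions should substitute their own figure.
    """
    total_mb = volume_count * avg_jp2_volume_mb * HYBRID_SIZE_RATIO
    return total_mb / 1000


# Hypothetical collection: 10,000 volumes averaging 400 MB as all-JPEG2000.
print(estimate_hybrid_storage_gb(10_000, 400.0))
```

Because the 25% figure is an average across volumes, actual savings for a given collection will depend on its mix of all-text and image pages.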
Non-Google Content:
- Institution performs any pre-ingest transformations that are needed (HathiTrust provides tools to assist in transformation, validation, and packaging of materials for ingest).
- Institution delivers content to UM through agreed-upon mechanisms (hard drive, file download (e.g., from the Internet Archive), etc.)
- Pre-ingest quality assurance may be performed by partner institutions
All Content:
- UM ingests a sample of objects, including validation of the objects and creation of HathiTrust METS and PREMIS
- UM performs internal testing to be sure volumes are working properly in the system
- Ingest reports are made available to the institution at http://www.hathitrust.org/ingest_logs
- Institution may perform additional QA testing on ingested objects
- Full ingest of partner institution's digital content begins