Getting Content Into HathiTrust

Ingest of digital objects and associated metadata is performed at the University of Michigan (UM). The digital objects are then replicated to HathiTrust's active mirror site in Indiana and stored on backup tape. Information about HathiTrust's technical infrastructure can be found at http://www.hathitrust.org/technology.

Specifications for the two components of ingest (bibliographic data and digital content) are given below, followed by a complete Ingest Checklist.

Bibliographic Data

Prior to ingest of content, a Participating Institution makes available accurate bibliographic records for digital objects. Those records are loaded into a database at the University of Michigan. The records act as a manifest of the digital content and are used as part of the Repository digital object management strategy. Each record and digital page image identifies the source of the print volume upon which the digital file is based.

Bibliographic metadata specifications

Bibliographic metadata must be loaded before content ingest can begin. Ingest from a variety of digitization sources is supported, but the two most general cases are Google and the Internet Archive. By default, metadata for objects originating from other sources should conform to the specifications for Google-digitized content. If there are exceptions, UM will work with Institutions to develop an acceptable scheme.

Google

  • Data should be in MARC or MARCXML format, in UTF-8 encoding
  • Records as complete as possible
  • Strongly preferred: One bibliographic record per item (multi-volume works should have the same record repeated for each item). Each record should contain a single 955 field
  • Local system number in 001
  • OCLC number in an 035 field with appropriate identifying prefix (OCoLC, ocm, ocn, etc.)
  • Strongly preferred: Barcode in 955|b; any alphabetic characters should be lowercase.
  • Strongly preferred: Item description (enumeration / chronology) in 955|v.

Specifications for Internet Archive-digitized content are similar, but the Internet Archive ARK identifier should be in the 955|b field and the Internet Archive identifier in the 955|q.

Internet Archive

  • Data should be in MARC or MARCXML format, in UTF-8 encoding
  • Records as complete as possible
  • One bibliographic record per item (multi-volume works should have the same record repeated for each item). Each record should contain a single 955 field
  • Local system number in 001
  • OCLC number in an 035 field with appropriate identifying prefix (OCoLC, ocm, ocn, etc.)
  • ARK Identifier in 955|b
  • Strongly preferred: Internet Archive identifier in 955|q; this id should not be lowercased.
  • Strongly preferred: Item description (enumeration / chronology) in 955|v
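
The two lists above can be turned into a small pre-submission check. The following is a minimal sketch, not UM's actual loader: it assumes the records are MARCXML in the standard MARC21 slim namespace, that an ARK identifier begins with "ark:/" (inferred from the example identifiers later on this page), and that the function names and the sample record are hypothetical. It does not check 955|v.

```python
import xml.etree.ElementTree as ET

NS = {"marc": "http://www.loc.gov/MARC21/slim"}  # standard MARCXML namespace
OCLC_PREFIXES = ("ocolc", "ocm", "ocn")  # prefixes named in the lists above


def has_oclc_prefix(value):
    # Accept "(OCoLC)123", "ocm00123", "ocn123456789", case-insensitively.
    return value.strip().lstrip("(").lower().startswith(OCLC_PREFIXES)


def check_record(record_xml, source="google"):
    """Return a list of problems for one MARCXML <record>; empty means OK."""
    record = ET.fromstring(record_xml)
    problems = []

    field_001 = record.find("marc:controlfield[@tag='001']", NS)
    if field_001 is None or not (field_001.text or "").strip():
        problems.append("missing local system number in 001")

    values_035 = [sf.text or "" for sf in
                  record.findall("marc:datafield[@tag='035']/marc:subfield", NS)]
    if not any(has_oclc_prefix(v) for v in values_035):
        problems.append("no OCLC number with a recognized prefix in 035")

    fields_955 = record.findall("marc:datafield[@tag='955']", NS)
    if len(fields_955) != 1:
        problems.append(f"expected a single 955 field, found {len(fields_955)}")
        return problems

    subfields = {sf.get("code"): (sf.text or "").strip()
                 for sf in fields_955[0].findall("marc:subfield", NS)}
    b = subfields.get("b", "")
    if not b:
        problems.append("missing 955|b")
    elif source == "google" and b != b.lower():
        problems.append("955|b barcode contains uppercase letters")
    elif source == "ia" and not b.startswith("ark:/"):  # assumed ARK shape
        problems.append("955|b does not look like an ARK identifier")
    if source == "ia" and not subfields.get("q"):
        problems.append("missing 955|q (strongly preferred)")
    return problems


# Made-up record for illustration only.
SAMPLE = """<record xmlns="http://www.loc.gov/MARC21/slim">
  <controlfield tag="001">000123456</controlfield>
  <datafield tag="035" ind1=" " ind2=" ">
    <subfield code="a">(OCoLC)12345</subfield>
  </datafield>
  <datafield tag="955" ind1=" " ind2=" ">
    <subfield code="b">39015012345678</subfield>
    <subfield code="v">v.1 1950</subfield>
  </datafield>
</record>"""

print(check_record(SAMPLE))
```

Calling check_record(SAMPLE, source="ia") on the same record reports the 955|b and 955|q problems, since the sample is shaped like a Google-digitized item.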

Digital Content

Ingesting content into HathiTrust can happen through a variety of mechanisms, including ingest from Google, removable drives, or Internet delivery (see below). All content must be documented with a signed Digital Assets Submission Inventory. A Permissions Agreement can be used to provide access to works that are not in the public domain.

Digital content specifications

The following specifications are those used for Google content and, like the bibliographic specifications, describe the form that content should ideally take. If existing content does not meet these specifications, transformations or accommodations may be made in the ingest process. These specifications are taken from the METS Profile for Google content, available on the Digital Objects Specifications page.

  • All bitonal images referenced by a conforming METS document must be in TIFF format and must be well-formed and valid.  The MIME type must be image/tiff. The compression scheme must be CCITT Group 4.  Bits per sample must be “1”.
  • All bitonal images must be accompanied by an XMP packet.  It must contain the following elements from the tiff namespace: PlanarConfiguration,  ImageWidth and ImageLength (must match the ImageWidth and ImageLength values from the header), BitsPerSample (must be 1), Compression (must be 4), PhotometricInterpretation (must be 0), Orientation (must be 1), SamplesPerPixel (must be 1), XResolution and YResolution (must be 600/1 for Google-scanned materials), ResolutionUnit (must be 2), DateTime, Artist, Make, and Model.  It must contain the following element from the Dublin Core namespace: source (must contain “barcode/filename”). 
  • All continuous-tone images referenced by a conforming METS document must be in JPEG2000 format and must be well-formed and valid.  The MIME type must be image/jp2. The compression scheme must be JPEG2000.  The JPEG2000 metadata must be of the brand jp2, minor version 0, and compatibility jp2. The number of layers must be 8.  The color specification EnumCS must be sRGB or Greyscale.  Bits per sample value is “8, 8, 8” if EnumCS is “sRGB”; value is “8” if EnumCS is “Greyscale”.  The number of decomposition levels must be a value between 5 and 32, inclusive. XSize and YSize must be present in the header. 
  • All continuous-tone images must be accompanied by an XMP packet.  An x-packet declaration with a 'begin' attribute must be present and the 'id' attribute must be W5M0MpCehiHzreSzNTczkc9d.  It must contain the following elements from the tiff namespace:  ImageWidth and ImageLength (must match the XSize and YSize values from the header), BitsPerSample, Compression (value must be “34712”), PhotometricInterpretation (value is “2” if EnumCS is “sRGB”; value is “1” if EnumCS is “Greyscale”), Orientation (must be 1), SamplesPerPixel (value is “1” if EnumCS is “Greyscale”; value is “3” if EnumCS is “sRGB”), XResolution and YResolution (must be 300/1 for Google-scanned materials), ResolutionUnit (must be 2), DateTime, Artist, Make, and Model.  It must contain the following element from the Dublin Core namespace: source (must contain “barcode/filename”). An x-packet declaration with an end attribute must be present.
  • All OCR text files must be well-formed and valid UTF-8 with a MIME type of text/plain.
  • All hOCR HTML files must be well-formed and valid UTF-8 with a MIME type of text/html.
  • There must not be any behaviors associated with a conforming document.
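
The fixed XMP values listed above lend themselves to a table-driven check. The sketch below is illustrative only and is not the validator used at ingest: it assumes the tiff-namespace elements have already been extracted from the XMP packet into a dictionary of strings (the extraction itself is not shown), and the function names and sample values are hypothetical.

```python
# Elements that must be present for both image types, per the list above.
COMMON_PRESENT = ("ImageWidth", "ImageLength", "BitsPerSample", "XResolution",
                  "YResolution", "DateTime", "Artist", "Make", "Model")


def expected_values(kind, enum_cs=None, google_scanned=True):
    """Return the fixed tiff-namespace values required for this image type."""
    if kind == "bitonal":
        expected = {"BitsPerSample": "1", "Compression": "4",
                    "PhotometricInterpretation": "0", "SamplesPerPixel": "1"}
        resolution = "600/1"
    elif kind == "contone":
        if enum_cs not in ("sRGB", "Greyscale"):
            raise ValueError("EnumCS must be sRGB or Greyscale")
        rgb = enum_cs == "sRGB"
        expected = {"Compression": "34712",
                    "PhotometricInterpretation": "2" if rgb else "1",
                    "SamplesPerPixel": "3" if rgb else "1"}
        resolution = "300/1"
    else:
        raise ValueError("kind must be 'bitonal' or 'contone'")
    expected.update({"Orientation": "1", "ResolutionUnit": "2"})
    if google_scanned:  # resolution is only fixed for Google-scanned materials
        expected.update({"XResolution": resolution, "YResolution": resolution})
    return expected


def check_xmp(tiff, kind, header_size, enum_cs=None, google_scanned=True):
    """Compare extracted XMP values (dict of str) against the requirements."""
    expected = expected_values(kind, enum_cs, google_scanned)
    required = list(COMMON_PRESENT)
    required += [name for name in expected if name not in COMMON_PRESENT]
    if kind == "bitonal":
        required.append("PlanarConfiguration")

    problems = []
    for name in required:
        if name not in tiff:
            problems.append(f"missing {name}")
        elif name in expected and tiff[name] != expected[name]:
            problems.append(f"{name} is {tiff[name]!r}, expected {expected[name]!r}")

    # XMP dimensions must agree with the image header.
    width, length = header_size
    if "ImageWidth" in tiff and "ImageLength" in tiff:
        if (tiff["ImageWidth"], tiff["ImageLength"]) != (str(width), str(length)):
            problems.append("ImageWidth/ImageLength do not match the image header")
    return problems


# Made-up values for illustration only.
BITONAL_SAMPLE = {
    "PlanarConfiguration": "1", "ImageWidth": "4000", "ImageLength": "6000",
    "BitsPerSample": "1", "Compression": "4", "PhotometricInterpretation": "0",
    "Orientation": "1", "SamplesPerPixel": "1", "XResolution": "600/1",
    "YResolution": "600/1", "ResolutionUnit": "2",
    "DateTime": "2010:01:01 00:00:00", "Artist": "Example",
    "Make": "Example", "Model": "Example",
}

print(check_xmp(BITONAL_SAMPLE, "bitonal", header_size=(4000, 6000)))
```

Keeping the required values in one table per image type makes it easy to compare the sketch against the specification line by line.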

Ingest Checklist

Bibliographic Data

  • Institution sends bibliographic data to UM
    • Bibliographic data from each distinct source (e.g., Google, Internet Archive, local) should be sent separately, one file per source
    • Data transfer for Google-digitized volumes: the method the institution uses to make bibliographic data available to Google will work well, if available. The bibliographic data specifications above were selected based on Google’s standards to be as broadly applicable as possible, so they should match what the institution already produces.
    • Data transfer for IA-digitized volumes: UM will work with the institution to find the most convenient way to transfer the bibliographic data.
  • UM receives bibliographic data
  • UM does duplicate detection based on OCLC number
    • This duplicate detection does not weed items. It associates items UM receives with existing bib records (if duplicates are detected) or creates new records (if bib records with matching OCLC numbers do not exist).
    • If there is no OCLC number, the record is added as a new record.
  • UM loads bibliographic data into system (currently Aleph)
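
The duplicate-detection step in the checklist above can be pictured as a lookup keyed on a normalized OCLC number. This is a hedged sketch of the idea only, not the process UM runs: the normalization rule (strip the prefix and leading zeros), the in-memory dictionary standing in for the catalog, and all names and item identifiers are assumptions made for illustration.

```python
import re

# Optional "(OCoLC)" and/or "ocm"/"ocn" prefix, then the number itself.
OCLC_PATTERN = re.compile(r"\s*(?:\(ocolc\))?\s*(?:ocm|ocn)?\s*0*(\d+)\s*",
                          re.IGNORECASE)


def normalize_oclc(value):
    """Return the bare OCLC number without prefix or leading zeros, or None."""
    match = OCLC_PATTERN.fullmatch(value or "")
    return match.group(1) if match else None


def load_items(catalog, incoming):
    """Attach incoming items to bib records keyed by OCLC number.

    catalog maps a normalized OCLC number to the list of item ids on that
    record. Items without a usable OCLC number are returned separately,
    each as its own new record. Nothing is ever discarded.
    """
    new_without_oclc = []
    for oclc_raw, item_id in incoming:
        key = normalize_oclc(oclc_raw)
        if key is None:
            new_without_oclc.append([item_id])  # no OCLC number: new record
        else:
            # Existing key: associate with that record; otherwise create one.
            catalog.setdefault(key, []).append(item_id)
    return new_without_oclc


# Made-up identifiers for illustration only.
catalog = {"12345": ["mdp.0001"]}
incoming = [("(OCoLC)ocm00012345", "wu.0002"),  # matches the existing record
            ("ocn999", "wu.0003"),               # no match: new record
            (None, "wu.0004")]                   # no OCLC number: new record
unmatched = load_items(catalog, incoming)
print(catalog, unmatched)
```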

Content

  • Institution determines namespace identifier for content
    • HathiTrust separates content in the storage system by creating a new namespace for each institution, and for each body of content within an institution that has a unique identifier scheme.
    • Some current namespaces include mdp, inu, uc1, pst, wu, and mnu. The namespace is at most 4 characters and is followed by an identifier that is unique within that namespace. This is often, but not always, the physical item barcode. The two together (namespace plus identifier) comprise an object's repository identifier. Some example identifiers include:
      • Google-digitized (University of Wisconsin): wu.89094366424
      • Internet Archive-digitized (University of California): uc2.ark:/13960/t26973133
      • Locally-digitized (University of Michigan): miun.aaj0523.1950.001
  • Institution provides information on any validation that can be done of original barcodes (e.g., check digits, starting and ending digit requirements, or other characteristics).
  • Institution selects preferred WorldCat Registry ID from http://www.worldcat.org/registry/institutions
    • The Registry ID will determine how the institution name displays in the forthcoming HathiTrust-OCLC catalog, and will not be used for any other purpose. Many institutions have multiple Registry IDs, so OCLC is asking us to specify each institution's official ID.
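
The repository identifier structure described in the namespace item above (a namespace of at most 4 characters followed by a local identifier) can be illustrated with a short parser. This is a sketch under one assumption drawn from the example identifiers: the namespace ends at the first period, and any later periods belong to the local identifier. The names RepositoryId and parse_repository_id are hypothetical.

```python
from typing import NamedTuple


class RepositoryId(NamedTuple):
    namespace: str
    local_id: str


def parse_repository_id(identifier):
    """Split 'namespace.local_id' at the first period and validate the parts."""
    namespace, separator, local_id = identifier.partition(".")
    if not separator or not namespace or not local_id:
        raise ValueError(f"not a namespace.identifier pair: {identifier!r}")
    if len(namespace) > 4:
        raise ValueError(f"namespace longer than 4 characters: {namespace!r}")
    return RepositoryId(namespace, local_id)


# The three example identifiers given in the checklist above.
for example in ("wu.89094366424",
                "uc2.ark:/13960/t26973133",
                "miun.aaj0523.1950.001"):
    print(parse_repository_id(example))
```

Splitting only at the first period matters for the locally-digitized example, whose local identifier itself contains periods.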

For Google Content only:

  • Institution selects a content format and communicates the choice to Google
    • Most HathiTrust institutions are using a hybrid format for scanned volumes, containing bitonal TIFF files for all-text pages and JPEG2000 files for images. This is the most cost-effective package, as average volume sizes are 25% smaller for the hybrid package than for the all-JPEG2000 package.
  • Institution requests that Google open Institution's GRIN instance to UM
  • UM requests decryption keys from Google and begins download from GRIN

For non-Google Content:

  • Institution delivers content to UM through agreed-upon mechanisms
    • Hard drive, file download (e.g., from the Internet Archive), etc.
  • UM performs pre-ingest transformations and normalizations to create the repository object package
  • Pre-ingest quality assurance may be performed by partner institutions
  • UM ingests a sample of objects, including validation of the objects and creation of HathiTrust METS and PREMIS
  • UM performs internal testing to be sure volumes are working properly in the system
  • Institution may perform additional QA testing on ingested objects
  • Full ingest of partner institution's digital content begins