
Submission Package Requirements for Digitized Content Submitted to HathiTrust

A Guide for HathiTrust Members

Version 1.0

February 19, 2020

INTRODUCTION

The HathiTrust Digital Library ensures long-term access to the content held in its repository by requiring conformance with accepted community standards for digital preservation. Specifically, the HathiTrust repository was designed according to the Open Archival Information System (OAIS) reference model. In this model, the package of data required for submission of an individual volume is referred to as a Submission Information Package, or SIP. In the HathiTrust implementation, these objects may also be referred to as Submission Packages or Content Packages.

 

1.0 What is a Submission Package?

2.0 Submission Package Specifications

2.1 Digital Content

2.2 Metadata

3.0 Fixity

4.0 Package Structure

5.0 Package Validation

 

1.0 What is a Submission Package?

The content package contains the set of files that comprise the complete digital object (usually image files and OCR representing a physical volume), as well as metadata and other files that facilitate processing into the repository and manage preservation of the volume over the long term.

 

2.0 Submission Package Specifications

Each individual volume MUST be submitted as a separate submission package containing all the files necessary to represent, process and manage the item.  These include:

  • Digital content (page images and OCR)

  • A metadata file

  • Fixity information in the form of a .md5 file

2.1 Digital Content

The following items are required in order to fully represent the source volume and support services (search, display, etc.) in the HathiTrust Digital Library:

2.1.1 Page images

The content package MUST include a single, well-formed TIFF or JPEG 2000 image file for each page.  For more information on acceptable image formats, please see our digitization requirements for page images submitted to HathiTrust.

2.1.2 Optical Character Recognition (OCR) 

Plain text OCR MUST be present for every page, unless the item is a handwritten manuscript or in a language that cannot be OCRed. 

  • OCR MUST be provided as one page of plain text UTF-8 per image.

  • Raw OCR output is acceptable; there is no minimum requirement for conformance with source text.

  • Filenames MUST match the corresponding page image, plus the .txt extension

    • Example: The OCR file for 00000001.jp2 MUST be named 00000001.txt
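
The one-text-file-per-image rule above can be checked before packaging. The following is a minimal sketch, not part of any HathiTrust tool: the function name images_missing_ocr is hypothetical, and the set of image extensions is an assumption based on the .tif and .jp2 examples in this guide.

```python
from pathlib import Path

# Assumption: page images use the extensions shown in this guide's examples.
IMAGE_EXTENSIONS = {".tif", ".jp2"}


def images_missing_ocr(filenames):
    """Return page image names that have no matching plain text OCR file."""
    names = set(filenames)
    missing = []
    for name in sorted(names):
        path = Path(name)
        # 00000001.jp2 needs 00000001.txt: same stem, .txt extension.
        if path.suffix in IMAGE_EXTENSIONS and f"{path.stem}.txt" not in names:
            missing.append(name)
    return missing


print(images_missing_ocr(["00000001.jp2", "00000001.txt", "00000002.jp2"]))
# prints: ['00000002.jp2']
```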

 In addition to plain text OCR, hOCR or other forms of coordinate OCR MAY be provided. 

  • hOCR or coordinate OCR MUST be valid UTF-8

  • hOCR or coordinate OCR SHOULD be well-formed XML/XHTML. 

  • There is no schema or format requirement for coordinate OCR.   Examples in our repository include hOCR, ALTO XML, and DjVuXML.

  • Filenames MUST match the corresponding page image, plus the appropriate extension

    • Example: The hOCR or coordinate OCR for 00000001.jp2 MUST be named 00000001.html or 00000001.xml.

The only control characters that MAY appear in OCR are tabs, carriage returns, and line feeds.  All other control characters, including form feed characters (ctrl-L, ASCII 12 / 0x0C, etc.) MUST NOT appear in the OCR.
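
As an illustration of the encoding and control character rules, the sketch below scans the bytes of one OCR file. It assumes that "control character" means any character in Unicode general category Cc; that interpretation and the function name find_ocr_problems are assumptions made for this example, not statements about how HathiTrust validates OCR.

```python
import unicodedata

# Tabs, carriage returns, and line feeds are the only permitted controls.
ALLOWED_CONTROLS = {"\t", "\r", "\n"}


def find_ocr_problems(data: bytes):
    """Return a list of human-readable problems found in one OCR file."""
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError as err:
        return [f"not valid UTF-8: {err.reason} at byte {err.start}"]
    problems = []
    for offset, char in enumerate(text):
        # Assumption: category "Cc" is what the guide means by control character.
        if unicodedata.category(char) == "Cc" and char not in ALLOWED_CONTROLS:
            problems.append(
                f"control character U+{ord(char):04X} at character offset {offset}"
            )
    return problems


print(find_ocr_problems(b"page one\x0cpage two\n"))
# prints: ['control character U+000C at character offset 8']
```

A form feed between pages is a common culprit when a multi-page OCR file has been split into per-page files.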

2.2 Metadata

The meta.yml file provides additional metadata used for ingesting material into HathiTrust. This file MUST be a well-formed YAML file.  See the YAML specification for more information.

Information supplied in the .yml file should be formatted as a series of element/value pairs, with a colon and one character space separating the two components:

[Element]: [Value]

Each element/value pair should be listed on a separate line.  

Note that in this document, element names are boldfaced for visibility.

YAML files can be created or edited with any text editor. We suggest TextWrangler on the Mac, vi or emacs on Linux, or Notepad++ on Windows. Ruth Tillman (formerly of University of Notre Dame) has also created a Python script that generates YAML files to our specifications from a spreadsheet containing the necessary data.

2.2.1 Capture Date

Required - the date and approximate time the volume was scanned. This date will be used for the PREMIS capture event. It will also be used to populate the ModifyDate and the XMP tiff:DateTime image metadata elements if they are missing from the submitted image files.

The capture date MUST be in the ISO 8601 combined date and time format with a time zone offset.

Example:

capture_date: 2013-11-01T12:31:00-05:00

Note: the -05:00 is a representation of a time zone offset from UTC, not a representation of a time range.
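
A capture date in this form can be produced from a timezone-aware datetime. This is a minimal sketch; the helper name format_capture_date is hypothetical, and the fixed -05:00 offset simply mirrors the example above.

```python
from datetime import datetime, timedelta, timezone


def format_capture_date(moment: datetime) -> str:
    """Format a timezone-aware datetime as a meta.yml capture_date line."""
    if moment.tzinfo is None or moment.utcoffset() is None:
        # A naive datetime would be written without the required offset.
        raise ValueError("capture date needs a time zone offset")
    return "capture_date: " + moment.isoformat(timespec="seconds")


# Offset chosen to match the example above; use the scanner's real offset.
utc_minus_five = timezone(timedelta(hours=-5))
print(format_capture_date(datetime(2013, 11, 1, 12, 31, 0, tzinfo=utc_minus_five)))
# prints: capture_date: 2013-11-01T12:31:00-05:00
```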

2.2.2 Scanner Make and Model

The scanner make and model are used to populate the XMP tiff:Make and tiff:Model headers if they are not present in the submitted image files. These free-text elements are optional.

Example:

scanner_make: CopiBook

scanner_model: HD

2.2.3 Scanner User

Required - This value should reflect "who pushed the button" to actually scan the item. This could be an organizational unit or an outside vendor.  It will be used to populate the TIFF ImageProducer and XMP tiff:Artist image headers if they are missing from the submitted image files.

Example:

scanner_user: "University of Michigan: Digital Conversion Unit"

2.2.4 Resolution

If the submitted images are missing resolution information, the resolution MUST be supplied here. Resolution supplied in the meta.yml file will overwrite the tiff:XResolution, tiff:YResolution, and tiff:ResolutionUnit values encoded in the image header, if present.  This element can therefore be used as a mechanism for supplying the correct resolution if the image file is found to contain incorrect information (usually as the result of an ingest failure).

Example:

bitonal_resolution_dpi: 600

or

contone_resolution_dpi: 300

2.2.5 Compression

The following information regarding image compression is required if the images were compressed, converted, or normalized before creation of the submission package. If no compression or other post-processing occurred, this information should not be included.

Examples:

image_compression_date: 2013-11-01T12:15:00-05:00

image_compression_agent: umich

image_compression_tool: ImageMagick 6.7.8

Notes:

  • image_compression_date MUST be in ISO 8601 combined date and time format.

  • image_compression_agent MUST be a HathiTrust institution identifier.

  • image_compression_tool (free-text) should include both software name and version. Multiple values should be comma-separated, per the above example.

2.2.6 Scanning Order and Reading Order

Scanning order and reading order designations are used to ensure the correct reading experience when viewing items in the HathiTrust Digital Library.  

Examples:

scanning_order: left-to-right

reading_order: left-to-right

Note: If the volume was scanned right-to-left and/or should read right-to-left, put "right-to-left" for the scanning or reading order here. If this information is not provided, volumes are assumed to be scanned left-to-right and read left-to-right.

Possible combinations are:

  • Book reads left-to-right and 00000001.tif is the FRONT cover of the book:

scanning_order: left-to-right

reading_order: left-to-right

  • Book reads left-to-right but 00000001.tif is the BACK cover of the book:

scanning_order: right-to-left

reading_order: left-to-right

  • Book reads right-to-left and 00000001.tif is the FRONT cover of the book:

scanning_order: right-to-left

reading_order: right-to-left

  • Book reads right-to-left but 00000001.tif is the BACK cover of the book:

scanning_order: left-to-right

reading_order: right-to-left

For more complicated cases (e.g., books that are half in English and half in Hebrew and are read either left to right or right to left, or books that are in two left-to-right languages and one language is printed upside-down from the other), pick the correct scanning order and one of the correct reading orders. Users of the other language can use the HathiTrust Digital Library interface to adjust the view appropriately.
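
The four standard combinations listed above reduce to one rule: the scanning order equals the reading order when the first image is the front cover, and is the opposite otherwise. The sketch below encodes that rule. The function name page_orders is hypothetical, and the hyphenated values follow the spelling quoted in the note in this section, which is an assumption about the accepted form.

```python
def page_orders(reads_right_to_left: bool, first_image_is_front_cover: bool) -> dict:
    """Derive scanning_order and reading_order for the four standard cases."""
    reading = "right-to-left" if reads_right_to_left else "left-to-right"
    opposite = "left-to-right" if reads_right_to_left else "right-to-left"
    # Scanning from the back cover forward reverses the order relative to reading.
    scanning = reading if first_image_is_front_cover else opposite
    return {"scanning_order": scanning, "reading_order": reading}


for key, value in page_orders(reads_right_to_left=False,
                              first_image_is_front_cover=False).items():
    print(f"{key}: {value}")
# prints:
# scanning_order: right-to-left
# reading_order: left-to-right
```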

2.2.7 Page Data

Optionally, page numbers and page tags can be supplied here. The orderlabel attribute holds the source page number and the label attribute holds the page tag. Multiple page tags should be comma-separated.

Allowable page tags include:

  • BACK_COVER - Image of the back cover.

  • BLANK - An intentionally blank page.

  • CHAPTER_PAGE - A sort of half title page for a chapter or grouping of chapters -- that is, a page that gives the name of the chapter or section that begins on the next page.

  • CHAPTER_START - Subsequent chapters with regular page numbering after the first. Also use this for the beginning of each appendix.

  • COPYRIGHT - Title page verso (the back of the real title page).

  • FIRST_CONTENT_CHAPTER_START - First page of the first chapter with regular page numbering. If the first chapter with regular numbering is called the introduction, that's okay.

  • FOLDOUT - A page that folded out of the print original.

  • FRONT_COVER - Image of the front cover (if the cover of the book was scanned).

  • IMAGE_ON_PAGE - Use for plates (pages with only images, which often do not contain the regular page numbering).

  • INDEX - The first page in a sequence containing an index.

  • MULTIWORK_BOUNDARY - For items with multiple volumes bound together.

  • PREFACE - First page of each section that appears between the title page verso and the first regularly numbered page. For example, a one-page dedication on page xvi would get this tag, and then the first page of a three-page preface starting on page xviii would also get this.  However, if the introduction of the text starts on page 1 (or on an unnumbered page followed by page 2), do not use this tag (use CHAPTER_START instead). May be used for frontmatter components occurring both before and after the table of contents.

  • REFERENCES - The first page in a sequence containing endnotes or a bibliography.

  • TABLE_OF_CONTENTS - First page of the table of contents.

  • TITLE - Title page recto (the front of the real title page).

  • TITLE_PARTS - Half title page (a sort of preliminary title page before the real one).

We are aware that there are other page tagging schemes in use at various institutions.  Please contact HathiTrust staff for additional guidance in mapping your existing page tags to HathiTrust conventions.

Example:

pagedata:

  00000001.jp2: { label: "FRONT_COVER" }

  00000007.jp2: { label: "TITLE" }

  00000008.jp2: { label: "COPYRIGHT" }  

  00000009.jp2: { orderlabel: "i", label: "TABLE_OF_CONTENTS" }

  00000010.jp2: { orderlabel: "ii", label: "PREFACE" }

  00000011.jp2: { orderlabel: "iii" }

  00000012.jp2: { orderlabel: "iv" }

  00000013.jp2: { orderlabel: "v" }

  00000014.jp2: { orderlabel: "vi" }

  00000015.jp2: { orderlabel: "1", label: "FIRST_CONTENT_CHAPTER_START" }

  00000016.jp2: { orderlabel: "2" }

  00000017.jp2: { orderlabel: "3" }

  00000018.jp2: { orderlabel: "4", label: "IMAGE_ON_PAGE" }

Note: the indentation above MUST use character spaces only, never tabs: see http://www.yaml.org/spec/1.2/spec.html#id2777534.
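
For contributors who generate meta.yml programmatically, the pagedata block can be written with plain string formatting, which keeps the indentation under direct control. This is a minimal sketch under stated assumptions: the function name build_pagedata and its tuple input are hypothetical, multiple tags are joined with a comma and a space (the guide only says comma-separated), and page numbers are assumed not to contain double quotes.

```python
ALLOWED_TAGS = {
    "BACK_COVER", "BLANK", "CHAPTER_PAGE", "CHAPTER_START", "COPYRIGHT",
    "FIRST_CONTENT_CHAPTER_START", "FOLDOUT", "FRONT_COVER", "IMAGE_ON_PAGE",
    "INDEX", "MULTIWORK_BOUNDARY", "PREFACE", "REFERENCES",
    "TABLE_OF_CONTENTS", "TITLE", "TITLE_PARTS",
}


def build_pagedata(pages):
    """Render (filename, orderlabel, tags) tuples as a pagedata block."""
    lines = ["pagedata:"]
    for filename, orderlabel, tags in pages:
        unknown = sorted(set(tags) - ALLOWED_TAGS)
        if unknown:
            raise ValueError(f"{filename}: unknown page tag(s) {unknown}")
        fields = []
        if orderlabel:
            fields.append(f'orderlabel: "{orderlabel}"')
        if tags:
            joined = ", ".join(tags)  # assumption: comma plus space separator
            fields.append(f'label: "{joined}"')
        if fields:
            # Two character spaces of indentation, never a tab.
            lines.append(f"  {filename}: {{ {', '.join(fields)} }}")
    return "\n".join(lines) + "\n"


print(build_pagedata([
    ("00000001.jp2", None, ["FRONT_COVER"]),
    ("00000009.jp2", "i", ["TABLE_OF_CONTENTS"]),
    ("00000011.jp2", "iii", []),
]), end="")
```

Pages with neither a page number nor a tag are simply omitted, as in the example above.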

 

3.0 Fixity

The Submission Package MUST also contain a file named checksum.md5 which contains checksums for all other files contained in the package.

The .md5 file can be generated with md5sum on Linux or md5 -r on Mac OS X. On Windows use md5sum from CoreUtils for Windows: http://gnuwin32.sourceforge.net/packages/coreutils.htm.

Example:

8c1a363eb3682542a16edf7dba036fe1  00000001.tif

8df14295ce4b6194bbb6ae66fc41d03b  00000001.txt

f30cc4a3d27f54329b3d9aaa5b2d7bda  00000002.tif 

6a621fe605578f95cc66cc27b7ca77b5  00000002.txt 

97c664aa9fb998dde78ce2aecbf59d73  00000003.tif 

01cb4b01a9de2aa1660da009989f5f13  00000003.txt

...

e67cad94ae85bf6ae439583f4ab88227  meta.yml

The checksum.md5 file MUST NOT contain a checksum for checksum.md5 itself, because a file cannot in general contain its own checksum. (Adding the computed checksum to the file changes the file's contents, which invalidates the checksum just computed. The correct final checksum would have to be in the file before it could be computed.)
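
Where md5sum is not convenient, the same file can be produced with a short script. The sketch below is one possible approach, not an official tool: the function names are hypothetical, and it assumes all package files sit in a single flat directory. It writes each line as the checksum, two spaces, and the filename, matching the example above, and skips checksum.md5 itself.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1024 * 1024):
    """Compute an MD5 hex digest without loading the whole file into memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_checksum_file(package_dir):
    """Write checksum.md5 covering every other file in package_dir."""
    package = Path(package_dir)
    lines = []
    for path in sorted(package.iterdir()):
        # Skip subdirectories and the checksum file itself.
        if not path.is_file() or path.name == "checksum.md5":
            continue
        lines.append(f"{md5_of_file(path)}  {path.name}")
    (package / "checksum.md5").write_text("\n".join(lines) + "\n", encoding="utf-8")
    return lines
```

Calling write_checksum_file on the directory that holds the page images, OCR files, and meta.yml regenerates checksum.md5 in place, so it can be rerun safely after any file changes.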

 

4.0 Package Structure

Each item MUST be encapsulated in a .zip file, which MUST be named according to the object ID (barcode or ARK ID).

  • Example: 39015012345678.zip or ark:/28722/h2000017z.zip

  • The zip file SHOULD NOT contain any directories or internal hierarchy.

  • If the zip filename contains any alphabetic characters, these should be lowercase.
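
A flat, correctly named zip file can be assembled with Python's standard zipfile module. This is a minimal sketch: the function name build_package is hypothetical, and it assumes a barcode-style object ID with no path separators. How an ARK ID maps onto a filename is not addressed here.

```python
import zipfile
from pathlib import Path


def build_package(source_dir, object_id, output_dir):
    """Zip every file in source_dir into <object_id>.zip with no directories."""
    source = Path(source_dir)
    # Assumption: object_id contains no "/" or other path separators.
    zip_path = Path(output_dir) / f"{object_id.lower()}.zip"
    with zipfile.ZipFile(zip_path, "w") as archive:
        for path in sorted(source.iterdir()):
            # Skip subdirectories, and the archive itself if it lives in source_dir.
            if not path.is_file() or path.resolve() == zip_path.resolve():
                continue
            # arcname=path.name stores the bare filename, so the archive stays flat.
            archive.write(path, arcname=path.name)
    return zip_path
```

The archive is written without compression (the zipfile default), which is a choice made for this sketch on the assumption that page images are already compressed.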

Sample Package:

This package contains:

  1. An image file for each page (either .tif or .jp2)
  2. A plain text OCR file for each page (.txt)
  3. Coordinate OCR for each page (.html)
  4. A checksum file for the volume (.md5)
  5. A metadata file (.yml)

4.1 Sample Package

A simple digital object package is available for download here.

 

5.0 Package Validation

Contributors should use the HathiTrust SIP validation tool to verify that digital object packages meet these specifications prior to submission.

5.1 Special instructions for large batches

Packages larger than 15GB should be split to avoid difficulties in transfer.  One option is to split the larger packages into multiple zipped files using a tool like 7-zip and the method described at https://www.webhostinghub.com/help/learn/website/managing-files/split-file.

Questions? Please contact us at feedback@issues.hathitrust.org

 

 
