Note: You can sign up for a Google Group mailing list to receive updates on HathiTrust ingest tools.
HathiTrust's mission is to ensure the long-term preservation and accessibility of materials in its digital archive. One way we do this is by keeping materials submitted from different sources consistent. To that end, we have defined baseline requirements for content in a number of areas, including:
- Item identifiers (i.e. how each individual submitted item is identified and named)
- Package layout (file names, directory structure, etc.)
- Image technical characteristics (file format, resolution, color depth, etc.)
- Image metadata (scanning time, scanning artist, etc.)
- Source METS file comprising the following:
  - Descriptive metadata such as MARC
  - PREMIS metadata in a particular format
  - Package contents (fileSec) with file groups separated by file type
  - Physical structMap, optionally with page numbers and page tags
The specifications that content submitted to HathiTrust must meet are detailed in Section II of the HathiTrust Deposit Form.
Ingest Tool Options
We have made three types of tools available to help depositors prepare content to these specifications:
- A single-image validator (http://bit.ly/1gvD9q7) - validates single uploaded images and provides a report on compliance with HathiTrust specifications.
- A full-volume validator and packaging service (http://bit.ly/1jboMIC) - validates full volumes and remediates problems if provided with sufficient instructions and metadata.
- Ingest tools in the form of Perl code (http://bit.ly/1fphFai) - can be used to validate and package volumes for ingest. An example Source METS file is also available for download.
The single-image and full-volume tools are documented at the links above. The Perl code includes a ReadMe.txt file with full instructions. An introduction is provided below:
From the Ingest Tools ReadMe file
Feed is a suite of tools to assist in preparing content for ingest into
HathiTrust. Feed can assist in transforming and remediating content and
technical metadata, prevalidating the content to HathiTrust specifications, and
packaging content into a submission information package (SIP).
Feed is not a general-purpose validation environment: it is narrowly targeted
towards the needs of ingesting digitized print material into HathiTrust. Major
functionality includes remediating and prevalidating TIFF and JPEG2000 images,
creating METS files with PREMIS metadata, and packing files into a .zip for
submission to HathiTrust.
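The final packing step can be pictured with a minimal Python sketch (Feed itself is written in Perl). It assumes a flat volume directory and writes `<ID>.zip` beside it. The `pack_volume` name and the flat layout are assumptions of this example, not part of Feed, and the image remediation, validation, and METS creation that Feed performs are not shown.

```python
import tempfile
import zipfile
from pathlib import Path


def pack_volume(volume_dir):
    """Zip the regular files of volume_dir into <dirname>.zip next to it."""
    volume = Path(volume_dir)
    # Build the name by hand so that IDs containing dots are not mangled.
    zip_path = volume.parent / f"{volume.name}.zip"
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in sorted(volume.iterdir()):
            if path.is_file():
                # Store each file under its bare name (flat archive).
                archive.write(path, arcname=path.name)
    return zip_path
```

Storing files under their bare names keeps the archive independent of the staging path used on the depositor's machine.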
Feed works best out of the box when input volumes consist of sequentially named
images and text files within a directory whose name is the ID that will be used
for the HathiTrust object. For example:
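A layout of this kind can be illustrated with a short Python sketch that builds a small example volume and checks it. The directory name `39015012345678`, the eight-digit file names, the `.tif`/`.jp2`/`.txt` extensions, and the `check_layout` helper are all assumptions made for this illustration; they are not taken from Feed, which is written in Perl.

```python
import re
import tempfile
from pathlib import Path

# Assumed naming convention for this sketch: 00000001.tif, 00000001.txt, ...
PAGE_FILE = re.compile(r"^(\d{8})\.(tif|jp2|txt)$")


def check_layout(volume_dir):
    """Return a list of problems found in a volume directory (empty if none)."""
    problems = []
    images, texts = set(), set()
    for path in sorted(Path(volume_dir).iterdir()):
        match = PAGE_FILE.match(path.name)
        if not match:
            problems.append(f"unexpected file name: {path.name}")
            continue
        seq, ext = int(match.group(1)), match.group(2)
        (texts if ext == "txt" else images).add(seq)
    # Image sequence numbers should run 1..N with no gaps.
    if images != set(range(1, len(images) + 1)):
        problems.append("image sequence has gaps or does not start at 1")
    if texts - images:
        problems.append("text file without a matching image")
    return problems


def demo():
    """Build a three-page example volume in a temp directory and check it."""
    with tempfile.TemporaryDirectory() as staging:
        volume = Path(staging) / "39015012345678"  # hypothetical object ID
        volume.mkdir()
        for seq in (1, 2, 3):
            (volume / f"{seq:08d}.tif").touch()
            (volume / f"{seq:08d}.txt").touch()
        return check_layout(volume)
```

Calling `demo()` returns an empty list, meaning the example volume has no layout problems under these assumed rules.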
Feed is very extensible; it can be adapted to remediate and package input of
almost any type, but the farther the departure from the simple layout above,
the more work will be required.
This guide assumes fluency with Perl and the command-line environment of Linux
or Mac OS X as well as familiarity with digitization and digital preservation.
Some introductory references to consult include:
- Learning Perl
- Programming Perl
- Introduction to Linux: http://www.tldp.org/LDP/intro-linux/html/index.html
- Library of Congress's Digital Preservation web site: http://www.digitalpreservation.gov/
Steps to ingest
Whether using the tools provided or not, it is recommended that depositors review the Guidelines for Deposit and fill out the HathiTrust Deposit Form to gain an understanding of the ingest requirements and how much transformation of content may be needed prior to submission to HathiTrust.
If the HathiTrust ingest tools are used to build packages for submission, we recommend the following workflow after reviewing the Guidelines and filling out the Deposit Form:
- Create a custom package type for your content (see the README documentation for more information). In general, a package type lists the files that will be present in your package, the transformations that should be applied to your submitted packages, and so on.
- If necessary, subclass and extend the ImageRemediate stage to add or fix image technical characteristics and metadata (see HTFeed/PackageType/*/ImageRemediate.pm for some examples)
- If necessary, create a custom stage to produce plain-text OCR (and optionally coordinate OCR in any XML-based format) on a per-page basis (see HTFeed/PackageType/Kirtas/ExtractOCR.pm and HTFeed/PackageType/IA/OCRSplit.pm for examples of transforming OCR).
- Subclass and extend the SourceMETS stage (see HTFeed/PackageType/*/SourceMETS.pm for examples) to generate a METS file that includes descriptive metadata such as MARC, page numbers and page tags, and any other metadata that you might want to preserve.
- Use the generate_sip.pl wrapper script to test and generate submission packages.
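The workflow above amounts to declaring what a package may contain and then running an ordered series of stages over each volume. The Python sketch below models that idea only. Feed is written in Perl, and `PackageType`, `run_stages`, and the two placeholder stage functions are hypothetical names chosen for this illustration, not Feed's API.

```python
import re
from dataclasses import dataclass, field


@dataclass
class PackageType:
    """Hypothetical description of one kind of input package."""
    name: str
    file_patterns: list  # regular expressions for files allowed in a package
    stages: list = field(default_factory=list)  # callables run in order


def run_stages(package_type, volume):
    """Check file names, then apply each stage to the volume dict in turn."""
    for filename in volume["files"]:
        if not any(re.fullmatch(p, filename) for p in package_type.file_patterns):
            raise ValueError(f"{filename} is not allowed in {package_type.name}")
    for stage in package_type.stages:
        volume = stage(volume)
    return volume


def image_remediate(volume):
    # Placeholder: a real stage would fix image headers and metadata.
    return {**volume, "log": volume["log"] + ["image_remediate"]}


def source_mets(volume):
    # Placeholder: a real stage would write a METS file for the volume.
    return {**volume, "log": volume["log"] + ["source_mets"]}


simple = PackageType(
    name="simple",
    file_patterns=[r"\d{8}\.(tif|jp2)", r"\d{8}\.txt"],
    stages=[image_remediate, source_mets],
)
result = run_stages(simple, {"files": ["00000001.tif", "00000001.txt"], "log": []})
```

Here `result["log"]` records the stages in the order they ran. Keeping stages as independent callables mirrors the advice above to replace or extend only the stages that your content needs.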
We can help guide depositors through these stages and diagnose problems along the way (email email@example.com). So that we can be as efficient and effective as possible, we ask depositors to fill out a Deposit Form before contacting us for assistance.