Download Ingest tools
Download example Source METS file
Note: You can sign up for a Google Group mailing list to receive notifications when new versions of the tools are available.
HathiTrust has a mission to ensure the long-term preservation and accessibility of materials in the digital archive. One way we do this is by ensuring consistency among materials submitted from different sources. To that end, we have defined baseline requirements for content in a number of areas, including:
Item identifiers (i.e. how each individual submitted item is identified and named)
Package layout (file names, directory structure, etc.)
Image technical characteristics (file format, resolution, color depth, etc.)
Image metadata (scanning time, scanning artist, etc.)
Source METS file comprising the following:
  - Descriptive metadata such as MARC
  - PREMIS metadata in a particular format
  - Package contents (fileSec) with file groups separated by file type
  - Physical structMap, optionally with page numbers and page tags
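To make the shape of a Source METS file more concrete, here is a minimal Python sketch that builds a skeleton with a fileSec grouped by file type and a physical structMap. It is an illustration only, not the HathiTrust profile and not part of the ingest tools: the function name `build_source_mets`, the `USE` values, the file ID scheme, and the empty `dmdSec`/`amdSec` placeholders are all assumptions, and the output is not guaranteed to validate against the METS schema or the HathiTrust specifications.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
XLINK_NS = "http://www.w3.org/1999/xlink"
ET.register_namespace("mets", METS_NS)
ET.register_namespace("xlink", XLINK_NS)


def q(ns, tag):
    """Return a namespace-qualified tag or attribute name for ElementTree."""
    return f"{{{ns}}}{tag}"


def build_source_mets(object_id, pages):
    """Build a skeletal METS tree (hypothetical helper).

    pages: list of (image_name, ocr_name, page_label) in reading order;
    page_label may be None when no page number is known.
    """
    root = ET.Element(q(METS_NS, "mets"), {"OBJID": object_id})

    # Empty placeholders where descriptive (e.g. MARC) and PREMIS metadata would go.
    ET.SubElement(root, q(METS_NS, "dmdSec"), {"ID": "DMD1"})
    ET.SubElement(root, q(METS_NS, "amdSec"), {"ID": "AMD1"})

    # One file group per file type; the USE values are assumptions.
    file_sec = ET.SubElement(root, q(METS_NS, "fileSec"))
    groups = {
        use: ET.SubElement(file_sec, q(METS_NS, "fileGrp"), {"ID": f"FG_{use}", "USE": use})
        for use in ("image", "ocr")
    }

    # Physical structMap: one div per page, pointing at that page's files.
    struct_map = ET.SubElement(root, q(METS_NS, "structMap"), {"TYPE": "physical"})
    volume = ET.SubElement(struct_map, q(METS_NS, "div"), {"TYPE": "volume"})

    for seq, (image, ocr, label) in enumerate(pages, start=1):
        attrs = {"TYPE": "page", "ORDER": str(seq)}
        if label:
            attrs["ORDERLABEL"] = label  # printed page number, if any
        page_div = ET.SubElement(volume, q(METS_NS, "div"), attrs)
        for use, name in (("image", image), ("ocr", ocr)):
            file_id = f"{use.upper()}{seq:08d}"
            file_el = ET.SubElement(groups[use], q(METS_NS, "file"), {"ID": file_id})
            ET.SubElement(
                file_el,
                q(METS_NS, "FLocat"),
                {"LOCTYPE": "OTHER", "OTHERLOCTYPE": "SYSTEM", q(XLINK_NS, "href"): name},
            )
            ET.SubElement(page_div, q(METS_NS, "fptr"), {"FILEID": file_id})
    return root


if __name__ == "__main__":
    tree = build_source_mets(
        "example_volume_id",
        [("00000001.tif", "00000001.txt", None), ("00000002.tif", "00000002.txt", "1")],
    )
    print(ET.tostring(tree, encoding="unicode"))
```

The example Source METS file linked above remains the authoritative reference for element and attribute usage.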
The specifications that content submitted to HathiTrust must meet are detailed in Section II of the HathiTrust Deposit Form. We have made tools available (see the link above) that depositors can download and use to (a) validate content against our specifications and identify any areas in need of remediation, and (b) aid in content transformation and packaging for submission to HathiTrust. Depositors may choose to create their own tools for remediation, in which case the specifications can be used as a guide. All content must validate against the HathiTrust specifications in order to be ingested.
Introduction to the Ingest Tools
(From the Ingest Tools ReadMe file, included in the materials that can be downloaded above.)
Feed is a suite of tools to assist in preparing content for ingest into
HathiTrust. Feed can assist in transforming and remediating content and
technical metadata, prevalidating the content to HathiTrust specifications, and
packaging content into a submission information package (SIP).
Feed is not a general-purpose validation environment: it is narrowly targeted
towards the needs of ingesting digitized print material into HathiTrust. Major
functionality includes remediating and prevalidating TIFF and JPEG2000 images,
creating METS files with PREMIS metadata, and packing files into a .zip for
submission to HathiTrust.
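As a rough illustration of the final packing step, the Python sketch below zips the files of one volume directory into an archive named after the directory. It is not Feed's implementation (Feed is written in Perl); the function name `pack_volume` and the flat layout inside the archive are assumptions made for the example, so check the specifications for the layout HathiTrust actually expects.

```python
import zipfile
from pathlib import Path


def pack_volume(volume_dir, out_dir):
    """Zip every regular file in volume_dir into <out_dir>/<volume id>.zip.

    Hypothetical helper: files are stored at the top level of the archive,
    which is an assumption, not a documented HathiTrust requirement.
    """
    volume_dir = Path(volume_dir)
    out_path = Path(out_dir) / f"{volume_dir.name}.zip"
    with zipfile.ZipFile(out_path, "w", zipfile.ZIP_DEFLATED) as archive:
        # Sort so the archive listing is stable from run to run.
        for path in sorted(volume_dir.iterdir()):
            if path.is_file():
                archive.write(path, arcname=path.name)
    return out_path
```

Calling `pack_volume("example_volume_id", "out")` would write `out/example_volume_id.zip`, provided the `out` directory already exists.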
Feed works best out of the box when input volumes consist of sequentially named
images and text files within a directory whose name is the ID that will be used
for the HathiTrust object. For example:
Feed is very extensible; it can be adapted to remediate and package input of
almost any type, but the farther the departure from the simple layout above,
the more work will be required.
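Before running the tools, it can help to confirm that a volume directory follows the simple layout described above. The Python sketch below does this under stated assumptions: page files are named with an eight-digit sequence number starting at 1, images use the `.tif` or `.jp2` extension, and each image has a matching `.txt` file. The naming pattern and the function name `check_layout` are hypothetical and are not taken from the Feed documentation.

```python
import re
from pathlib import Path

# Assumed naming pattern: eight digits plus a known extension, e.g. 00000001.tif
PAGE_FILE = re.compile(r"^(\d{8})\.(tif|jp2|txt)$")


def check_layout(volume_dir):
    """Return a sorted list of layout problems; an empty list means none found."""
    images, texts, problems = set(), set(), []
    for path in Path(volume_dir).iterdir():
        match = PAGE_FILE.match(path.name)
        if not match:
            problems.append(f"unexpected file name: {path.name}")
            continue
        seq = int(match.group(1))
        (texts if match.group(2) == "txt" else images).add(seq)

    # Images should be numbered 1..N with no gaps.
    if images != set(range(1, len(images) + 1)):
        problems.append("image sequence has gaps or does not start at 1")
    # Every image should have exactly one text file with the same number.
    if texts != images:
        problems.append("text files do not match image files one-to-one")
    return sorted(problems)
```

A directory that passes this check is not necessarily valid for ingest; the check covers file naming only, not image characteristics or metadata.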
This guide assumes fluency with Perl and the command-line environment of Linux
or Mac OS X as well as familiarity with digitization and digital preservation.
Some introductory references to consult include:
- Learning Perl
- Programming Perl
- Introduction to Linux: http://www.tldp.org/LDP/intro-linux/html/index.html
- Library of Congress's Digital Preservation web site: http://www.digitalpreservation.gov/
Steps to ingest
Whether or not they use the tools provided, we recommend that depositors review the Guidelines for Deposit and fill out the HathiTrust Deposit Form to gain an understanding of the ingest requirements and how much transformation of content may be needed prior to submission to HathiTrust.
If the HathiTrust ingest tools are used to build packages for submission, a recommended workflow, after reviewing the Guidelines and filling out the Deposit Form, would be as follows:
Create a custom package type for your content; see the README documentation for more information. The package type specifies which files will be present in your package, which transformations should be applied to your submitted packages, and so on.
If necessary, subclass and extend the ImageRemediate stage to add or fix image technical characteristics and metadata (see HTFeed/PackageType/*/ImageRemediate.pm for some examples).
If necessary, create a custom stage to produce plain-text OCR (and optionally coordinate OCR in any XML-based format) on a per-page basis. For examples of transforming OCR, see HTFeed/PackageType/Kirtas/ExtractOCR.pm and HTFeed/PackageType/IA/OCRSplit.pm.
Subclass and extend the SourceMETS stage (see HTFeed/PackageType/*/SourceMETS.pm for examples) to generate a METS file that includes descriptive metadata such as MARC, page numbers and page tags, and any other metadata that you might want to preserve.
Use the generate_sip.pl wrapper script to test and generate submission packages.
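The workflow above follows a common pattern: a package type names an ordered list of stages, and a depositor customizes behavior by subclassing a stage and overriding one step. The Python sketch below illustrates that pattern only. Feed itself is written in Perl, and its real classes and method names differ; everything here (`Stage`, `PackageType.process`, `fix_metadata`, the dictionary used to represent a volume, and the example artist value) is hypothetical.

```python
class Stage:
    """One unit of work applied to a volume (represented here as a dict)."""

    def run(self, volume):
        raise NotImplementedError


class ImageRemediate(Stage):
    """Generic stage: prepares image metadata, then calls an overridable hook."""

    def run(self, volume):
        volume.setdefault("image_metadata", {})
        self.fix_metadata(volume)

    def fix_metadata(self, volume):
        # Default behavior: change nothing. Subclasses override this hook.
        pass


class LocalImageRemediate(ImageRemediate):
    """Depositor-specific subclass that fills in a missing scanning artist."""

    def fix_metadata(self, volume):
        volume["image_metadata"].setdefault("artist", "Example Scanning Unit")


class PackageType:
    """Names the stages to run, in order, for one kind of content."""

    def __init__(self, name, stages):
        self.name = name
        self.stages = stages

    def process(self, volume):
        for stage in self.stages:
            stage.run(volume)
            # Record which stages have completed, for troubleshooting.
            volume.setdefault("completed", []).append(type(stage).__name__)
        return volume


if __name__ == "__main__":
    package_type = PackageType("local", [LocalImageRemediate()])
    print(package_type.process({"id": "example_volume_id"}))
```

Keeping the generic stage and the local override separate means local fixes stay in one small subclass, which is the same motivation behind subclassing the ImageRemediate and SourceMETS stages in the steps above.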
We can help guide depositors through these stages (email email@example.com) and diagnose problems along the way. We ask depositors to fill out a Deposit Form before contacting us for assistance so that we can be as efficient and effective as possible.