Note: You can sign up for a Google Group mailing list to receive updates on HathiTrust ingest tools.
HathiTrust's mission is to ensure the long-term preservation and accessibility of materials in its digital archive. One way we do this is by keeping materials submitted from different sources consistent. To that end, we have defined baseline requirements for content in a number of areas, including:
- Item identifiers (i.e. how each individual submitted item is identified and named)
- Package layout (file names, directory structure, etc.)
- Image technical characteristics (file format, resolution, color depth, etc.)
- Image metadata (scanning time, scanning artist, etc.)
- Source METS file comprising the following:
  - Descriptive metadata such as MARC
  - PREMIS metadata in a particular format
  - Package contents (fileSec) with file groups separated by file type
  - Physical structMap, optionally with page numbers and page tags
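To make the Source METS outline above more concrete, the following is a minimal sketch of how such a skeleton might be assembled in Python. The element and attribute names (mets, dmdSec, amdSec, fileSec, fileGrp, structMap, div, fptr, ORDERLABEL, LABEL) come from the METS schema itself. Everything else is an assumption made for illustration: the function name build_source_mets, the USE values "image" and "ocr", the file ID format, the sample file names, and the use of ORDERLABEL for page numbers and LABEL for page tags. The empty dmdSec and amdSec are only placeholders for the MARC and PREMIS metadata. This is not the output of the HathiTrust ingest tools and should not be read as the required format.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
XLINK_NS = "http://www.w3.org/1999/xlink"
ET.register_namespace("METS", METS_NS)
ET.register_namespace("xlink", XLINK_NS)


def q(tag):
    """Return a METS-namespaced tag name for ElementTree."""
    return f"{{{METS_NS}}}{tag}"


def build_source_mets(pages):
    """Build a skeleton METS tree (hypothetical helper, not part of HTFeed).

    pages: list of dicts with keys 'image' and 'ocr' (file names) and
    optional 'number' (printed page number) and 'tags' (list of page tags).
    """
    mets = ET.Element(q("mets"))

    # Placeholders only: real descriptive (e.g. MARC) and PREMIS metadata
    # would be wrapped inside these sections.
    ET.SubElement(mets, q("dmdSec"), ID="DMD1")
    ET.SubElement(mets, q("amdSec"), ID="AMD1")

    # fileSec: one file group per file type (USE values are assumptions).
    file_sec = ET.SubElement(mets, q("fileSec"))
    groups = {
        use: ET.SubElement(file_sec, q("fileGrp"), USE=use)
        for use in ("image", "ocr")
    }

    # Physical structMap: one div per page, in reading order.
    struct_map = ET.SubElement(mets, q("structMap"), TYPE="physical")
    volume = ET.SubElement(struct_map, q("div"), TYPE="volume")

    for order, page in enumerate(pages, start=1):
        attrs = {"TYPE": "page", "ORDER": str(order)}
        if page.get("number"):
            attrs["ORDERLABEL"] = page["number"]      # page number (assumed mapping)
        if page.get("tags"):
            attrs["LABEL"] = ", ".join(page["tags"])  # page tags (assumed mapping)
        div = ET.SubElement(volume, q("div"), attrs)

        for use, group in groups.items():
            file_id = f"{use.upper()}{order:08d}"     # assumed ID format
            file_el = ET.SubElement(group, q("file"), ID=file_id)
            ET.SubElement(file_el, q("FLocat"), {
                "LOCTYPE": "OTHER",
                "OTHERLOCTYPE": "SYSTEM",
                f"{{{XLINK_NS}}}href": page[use],
            })
            # Link the page div back to each of its files.
            ET.SubElement(div, q("fptr"), FILEID=file_id)

    return mets


# Usage with two hypothetical pages.
example = build_source_mets([
    {"image": "00000001.tif", "ocr": "00000001.txt", "tags": ["TITLE"]},
    {"image": "00000002.tif", "ocr": "00000002.txt", "number": "1"},
])
print(ET.tostring(example, encoding="unicode"))
```

Keeping the file groups and the structMap in one loop means every page div points to exactly one file in each group, which makes it harder for the package contents and the page structure to drift apart.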
Ingest Tool Options
We have made three types of tools available to help depositors prepare content to these specifications:
- A single-image validator (http://bit.ly/1gvD9q7) - validates single uploaded images and provides a report on compliance with HathiTrust specifications.
- A full-volume validator and packaging service (http://bit.ly/1jboMIC) - validates full volumes and remediates problems if provided with sufficient instructions and metadata.
- Ingest tools in the form of Perl code (http://bit.ly/1fphFai) - can be used to validate and package volumes for ingest. An example Source METS file is also available for download.
The single-image and full-volume tools are documented at the links above. The Perl code includes a ReadMe.txt file with full instructions. An introduction is provided below:
From the Ingest Tools ReadMe file
Steps to ingest
Whether or not they use the tools provided, depositors should review the Guidelines for Deposit and fill out the HathiTrust Deposit Form to gain an understanding of the ingest requirements and of how much transformation of content may be needed prior to submission to HathiTrust.
- Create a custom package type for your content (see the README documentation for details). In general, a package type lists the files that will be present in your package, the transformations that should be applied to your submitted packages, and so on.
- If necessary, subclass and extend the ImageRemediate stage to add or fix image technical characteristics and metadata (see HTFeed/PackageType/*/ImageRemediate.pm for some examples)
- If necessary, create a custom stage to produce plain-text OCR (and optionally coordinate OCR in any XML-based format) on a per-page basis (see HTFeed/PackageType/Kirtas/ExtractOCR.pm and HTFeed/PackageType/IA/OCRSplit.pm for examples of transforming OCR).
- Subclass and extend the SourceMETS stage (see HTFeed/PackageType/*/SourceMETS.pm for examples) to generate a METS file that includes descriptive metadata such as MARC, page numbers and page tags, and any other metadata that you might want to preserve.
- Use the generate_sip.pl wrapper script to test and generate submission packages.
We can help guide depositors through these stages (email firstname.lastname@example.org) and diagnose problems along the way. So that we can be as efficient and effective as possible, we ask depositors to fill out a Deposit Form before contacting us for assistance.