Note: You can sign up for a Google Group mailing list to receive notifications when new versions of the tools are available.
HathiTrust has a mission to ensure the long-term preservation and accessibility of materials in the digital archive. One way we do this is by ensuring consistency among materials submitted from different sources. To that end, we have defined baseline requirements for content in a number of areas, including:
- Item identifiers (i.e. how each individual submitted item is identified and named)
- Package layout (file names, directory structure, etc.)
- Image technical characteristics (file format, resolution, color depth, etc.)
- Image metadata (scanning time, scanning artist, etc.)
- Source METS file comprising the following:
  - Descriptive metadata such as MARC
  - PREMIS metadata in a particular format
  - Package contents (fileSec) with file groups separated by file type
  - Physical structMap, optionally with page numbers and page tags
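As a rough illustration of how the source METS pieces listed above fit together, the following Python sketch builds a bare skeleton with the standard library. It is not the HathiTrust profile: the function name `build_source_mets`, the `ID` values, the `USE` values `image` and `ocr`, and the `div` `TYPE` values are all assumptions made for the example, and the MARC and PREMIS payloads are left out.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
XLINK_NS = "http://www.w3.org/1999/xlink"
ET.register_namespace("METS", METS_NS)
ET.register_namespace("xlink", XLINK_NS)


def m(tag):
    """Qualify a tag name with the METS namespace."""
    return f"{{{METS_NS}}}{tag}"


def build_source_mets(object_id, pages):
    """Build a skeleton METS tree (hypothetical helper, not part of any tool).

    pages: list of dicts with 'image' and 'ocr' file names and optional
    'orderlabel' (printed page number) and 'label' (page tag).
    """
    root = ET.Element(m("mets"), {"OBJID": object_id})

    # Descriptive metadata wrapper; the MARC record itself is omitted here.
    dmd = ET.SubElement(root, m("dmdSec"), {"ID": "DMD1"})
    ET.SubElement(dmd, m("mdWrap"), {"MDTYPE": "MARC"})

    # PREMIS wrapper; the events are omitted here.
    amd = ET.SubElement(root, m("amdSec"), {"ID": "AMD1"})
    digiprov = ET.SubElement(amd, m("digiprovMD"), {"ID": "PREMIS1"})
    ET.SubElement(digiprov, m("mdWrap"), {"MDTYPE": "PREMIS"})

    # fileSec with one fileGrp per file type (USE values are assumptions).
    file_sec = ET.SubElement(root, m("fileSec"))
    groups = {
        use: ET.SubElement(file_sec, m("fileGrp"), {"USE": use})
        for use in ("image", "ocr")
    }

    # Physical structMap: one div per page, in scan order.
    struct_map = ET.SubElement(root, m("structMap"), {"TYPE": "physical"})
    volume = ET.SubElement(struct_map, m("div"), {"TYPE": "volume"})

    for seq, page in enumerate(pages, start=1):
        attrs = {"TYPE": "page", "ORDER": str(seq)}
        if page.get("orderlabel"):
            attrs["ORDERLABEL"] = page["orderlabel"]  # page number
        if page.get("label"):
            attrs["LABEL"] = page["label"]  # page tag
        div = ET.SubElement(volume, m("div"), attrs)

        for use, prefix in (("image", "IMG"), ("ocr", "OCR")):
            file_id = f"{prefix}{seq:08d}"
            file_el = ET.SubElement(groups[use], m("file"), {"ID": file_id})
            ET.SubElement(
                file_el,
                m("FLocat"),
                {
                    "LOCTYPE": "OTHER",
                    "OTHERLOCTYPE": "SYSTEM",
                    f"{{{XLINK_NS}}}href": page[use],
                },
            )
            ET.SubElement(div, m("fptr"), {"FILEID": file_id})

    return root
```

Calling `ET.tostring(build_source_mets("example.001", pages), encoding="unicode")` serializes the tree for inspection. A real source METS must still be checked against the specifications in the Deposit Form.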
The specifications that content submitted to HathiTrust must meet are detailed in Section II of the HathiTrust Deposit Form. We have made tools available (see the link above) that depositors can download and use to a) validate content against our specifications and identify any areas in need of remediation and b) aid in content transformation and packaging for submission to HathiTrust. Depositors may instead create their own tools for remediation, in which case the specifications can be used as a guide. All content must validate against the HathiTrust specifications in order to be ingested.
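For depositors writing their own checks, a package layout validation might look roughly like the Python sketch below. The rules it enforces (eight-digit sequence file names, `tif` or `jp2` images, one `txt` OCR file per page, no gaps in the sequence) are hypothetical stand-ins chosen for the example; the actual requirements are those in the Deposit Form, and `check_package_layout` is not part of the provided tools.

```python
import re

# Hypothetical naming rule: 8-digit sequence number plus a known extension.
PAGE_FILE = re.compile(r"^(\d{8})\.(tif|jp2|txt)$")
IMAGE_EXTS = {"tif", "jp2"}


def check_package_layout(filenames):
    """Return a list of problems found; an empty list means the layout passes."""
    problems = []
    images, texts = set(), set()

    for name in filenames:
        match = PAGE_FILE.match(name)
        if not match:
            problems.append(f"unexpected file name: {name}")
            continue
        seq, ext = int(match.group(1)), match.group(2)
        (images if ext in IMAGE_EXTS else texts).add(seq)

    # Every page image needs matching OCR text, and vice versa.
    for seq in sorted(images - texts):
        problems.append(f"page {seq:08d} has an image but no OCR text")
    for seq in sorted(texts - images):
        problems.append(f"page {seq:08d} has OCR text but no image")

    # With N images present, sequence numbers 1..N should all be used.
    expected = set(range(1, len(images) + 1))
    for seq in sorted(expected - images):
        problems.append(f"missing image for page {seq:08d}")

    return problems
```

Returning a list of problems rather than raising on the first one lets a depositor see everything that needs remediation in a single pass.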
Steps to ingest
Whether or not they use the tools provided, we recommend that depositors review the Guidelines for Deposit and fill out the HathiTrust Deposit Form to understand the ingest requirements and how much transformation of content may be needed prior to submission to HathiTrust.
- Create a custom package type for your content; see the README documentation for more information. A package type specifies, among other things, the files that will be present in your package and the transformations that should be applied to your submitted packages.
- If necessary, subclass and extend the ImageRemediate stage to add or fix image technical characteristics and metadata (see HTFeed/PackageType/*/ImageRemediate.pm for some examples)
- If necessary, create a custom stage to produce plain-text OCR (and optionally coordinate OCR in any XML-based format) on a per-page basis (see HTFeed/PackageType/Kirtas/ExtractOCR.pm and HTFeed/PackageType/IA/OCRSplit.pm for examples of transforming OCR).
- Subclass and extend the SourceMETS stage (see HTFeed/PackageType/*/SourceMETS.pm for examples) to generate a METS file that includes descriptive metadata such as MARC, page numbers and page tags, and any other metadata that you might want to preserve.
- Use the generate_sip.pl wrapper script to test and generate submission packages.
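To make the per-page OCR step above more concrete, here is a minimal Python sketch of the idea, separate from the Perl stages the tools provide. It assumes the input is a single volume-level text in which form-feed characters mark page breaks, and it writes one file per page using a hypothetical eight-digit naming scheme; both the function name `split_ocr_by_page` and these conventions are assumptions for the example.

```python
from pathlib import Path


def split_ocr_by_page(volume_text, out_dir):
    """Write one plain-text file per page and return the paths in page order.

    Assumes form feeds ("\f") separate pages in volume_text.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    pages = volume_text.split("\f")
    # A trailing form feed leaves one empty chunk at the end; drop it.
    if pages and pages[-1] == "":
        pages.pop()

    written = []
    for seq, text in enumerate(pages, start=1):
        # Blank pages still get an (empty) file so that the text files
        # stay aligned with the page images.
        path = out / f"{seq:08d}.txt"
        path.write_text(text, encoding="utf-8")
        written.append(path)
    return written
```

Keeping empty files for blank pages is a deliberate choice in this sketch: it preserves a one-to-one correspondence between page images and text files.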
We can help guide depositors through these stages and diagnose problems along the way (email firstname.lastname@example.org). So that we can be as efficient and effective as possible, we ask depositors to fill out a Deposit Form before contacting us for assistance.