Preservation & Technology
HathiTrust is guided by principles of trustworthiness, openness, and responsible stewardship. We provide reliable long-term preservation for digital content, with access to the extent legally possible, in ways that maximize the contributions of member libraries and make the most efficient use of available resources. Our Shared Print Program extends our preservation philosophy to ensure preservation of print and digital collections by linking the two, to reduce overall costs of collection management for HathiTrust members, and to catalyze national/continental collective management of collections.
HathiTrust is committed to preserving the intellectual content and in many cases the exact appearance of materials that have been digitized for deposit. This includes:
- Digital representations (images) of content as the content appeared in its original form, with the same layout and color (e.g., for illustrations and artwork), and in the same order
- Textual representations of content where possible through Optical Character Recognition technologies
HathiTrust employs a number of strategies to ensure the long-term integrity of deposited materials. These include:
- Use of standard and open content formats that meet community-accepted digital preservation standards, are widely supported on a number of platforms, and that we are confident can be preserved and migrated forward to new preservation formats over time
  - HathiTrust currently relies on the extensive specifications of file formats, preservation metadata, and quality control methods that are detailed in our Technical Requirements for Digitized Page Images Submitted to HathiTrust.
  - HathiTrust is committed to bit-level preservation and format migration of materials created according to these specifications as technology, standards, and best practices in the library community change.
  - Formats preserved in HathiTrust include TIFF ITU G4 files stored at 600 dpi, JPEG or JPEG 2000 files stored at several resolutions ranging from 200 dpi to 400 dpi, Unicode text, and XML files with an accompanying DTD (typically METS).
- Rigorous validation of content on ingest
- Reliance on standards for repository design and trustworthiness, such as OAIS and TRAC (see HathiTrust Digital Library and Content Standards)
- Reliance on standards for metadata such as METS and PREMIS (see HathiTrust Digital Object Specifications)
- Regular checks on the integrity of stored content through:
  - Automated system checks that verify the integrity of digital objects against their ingested versions. These are performed on all files on a quarterly schedule
  - User access, and
  - Repository processes such as full-text indexing that use the content on a regular basis
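As an illustration of what an automated integrity check can look like, the following is a minimal sketch in Python, not HathiTrust's actual implementation. It assumes that a checksum for each file was recorded at ingest in a simple manifest (a mapping of relative path to hex digest); the function names, the manifest shape, and the choice of SHA-256 are all assumptions made for this example.

```python
import hashlib
from pathlib import Path


def file_checksum(path: Path, algorithm: str = "sha256", chunk_size: int = 1 << 20) -> str:
    """Compute a hex digest for a file, reading it in chunks to bound memory use."""
    digest = hashlib.new(algorithm)
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_fixity(manifest: dict, root: Path, algorithm: str = "sha256") -> dict:
    """Compare stored files against checksums recorded at ingest.

    `manifest` is a hypothetical mapping of relative path -> expected hex digest.
    Returns a mapping of relative path -> "ok", "mismatch", or "missing".
    """
    results = {}
    for relative_path, expected in manifest.items():
        target = root / relative_path
        if not target.is_file():
            results[relative_path] = "missing"
        elif file_checksum(target, algorithm) == expected.lower():
            results[relative_path] = "ok"
        else:
            results[relative_path] = "mismatch"
    return results
```

A scheduled job could run `verify_fixity` over every stored object and report any entry whose status is not "ok" for repair from a second copy.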
HathiTrust Technology Infrastructure
HathiTrust serves its repository from a University of Michigan-managed data center in Ann Arbor with a mirror site in Indianapolis managed by Indiana University. Each data center has a 1.4 petabyte spinning-disk storage array holding a complete copy of the images and OCR text of all 17.6 million digitized books. There are Apache Solr indexes comprising over 12 terabytes of full text from these books and a separate index with library catalog metadata (MARC records) for each item. We manage a variety of metadata in MariaDB and in MongoDB including information about holdings from member libraries, copyright and licensing information, US federal government documents, and more.
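To make the role of the full-text index more concrete, here is a small, hedged sketch of how a phrase query against a Solr index might be constructed. The `/select` handler and the `q`, `fl`, `rows`, and `wt` parameters are standard Solr; the host, the core name `fulltext`, and the field name `ocr` are hypothetical and do not describe HathiTrust's actual schema or endpoints.

```python
from urllib.parse import urlencode


def build_fulltext_query_url(base_url: str, core: str, phrase: str, rows: int = 10) -> str:
    """Build a Solr /select URL for a phrase search.

    The core name and the `ocr` field are placeholders for this example.
    """
    # Escape backslashes and double quotes so the phrase stays one quoted term.
    escaped = phrase.replace("\\", "\\\\").replace('"', '\\"')
    params = {
        "q": f'ocr:"{escaped}"',  # hypothetical full-text field
        "fl": "id,score",         # return only the item id and relevance score
        "rows": rows,
        "wt": "json",
    }
    return f"{base_url.rstrip('/')}/solr/{core}/select?{urlencode(params)}"


if __name__ == "__main__":
    print(build_fulltext_query_url("http://localhost:8983", "fulltext", "digital preservation"))
```

Keeping query construction separate from the HTTP request makes the escaping logic easy to test without a running Solr instance.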
Our applications support search, discovery, and access to material in the repository, and manage content ingest and indexing for data coming into the repository. They were written in-house, primarily in the Perl and Ruby programming languages. Much of the code is publicly available on HathiTrust’s GitHub. Our applications increasingly use containers: Docker for development and testing, and Kubernetes in production.