HathiTrust Commitment to Quality

June 16, 2016

HathiTrust is committed to providing a high-quality digitized corpus for all its stakeholders. These include a range of users, library communities, and related member organizations:

  • online end users

  • print retention and shared print programs

  • users with print disabilities

  • print and digital preservation communities

  • digitization communities

  • computational research communities

Quality as a concept may be defined differently by each of these stakeholder communities (the differing definitions are detailed in the appendix below). Additionally, quality control measures may vary among the multiple digitization partners and automated methods through which HathiTrust acquires content. Recognizing these factors, HathiTrust focuses on addressing quality issues at multiple levels and at a macro scale that includes the digital objects as well as their metadata. The HathiTrust commitment thus includes a range of assessment approaches that aspire to ensure and improve quality over time as issues are identified.

In all cases, HathiTrust materials are subject to quality review as a fundamental step in the digitization process. Materials digitized by partner institutions or made available through entities such as the Internet Archive undergo formal quality review before the digital content enters the HathiTrust Digital Library. In the case of materials digitized by Google, HathiTrust devotes significant resources to ensuring content of the highest possible quality, despite the extraordinarily high volume of material. This work is in addition to the substantial quality assurance processes that Google undertakes.

HathiTrust partners have worked to define a shared sense of quality more precisely. Individual partners commit resources to externally verifying the quality of the materials and to documenting systemic problems in the digitization process. Partners work with Google to address specific quality issues and problems raised by the user community. HathiTrust partners that collaborate with Google have worked collectively to refine quality indicators and even processes.

HathiTrust’s processes rely on partner-contributed descriptive cataloging, ensuring that discovery is based on the highest standards of bibliographic data. HathiTrust has not to date relied on automated approaches to gather information about the works or infer information about the identity of the work from the metadata. Such methods are considered as opportunities arise and resources permit.

In order to address and prioritize quality issues, HathiTrust works directly with partners and through the Program Steering Committee and related groups to measure and document the current quality state of items and support an ongoing process of quality improvement.



Stakeholder Communities: Use Cases, Quality Issues and Improvement Strategies

HathiTrust is committed to an ongoing process of quality improvement for digital objects and metadata for all use cases. The stakeholder groups, with their use cases, quality issues, and improvement strategies, are:


Online end users

  • Use Cases: Discovery, including full-text search, close reading (text or image), referencing/citation, production of print surrogates.
  • Quality Issues: Large range of use cases, with difficulty anticipating future use cases; differences between scholarly and general users are suspected, but not verified.
  • Improvement Strategies: Focus on high severity errors (e.g., metadata errors, missing pages, illegible content, OCR inaccuracies); facilitate contributions from members to replace missing or illegible content.

Print retention and shared print programs (e.g., HathiTrust’s Federal Documents program; Shared Print; national/consortial-level shared print/digital programs)

  • Use Cases: Digital preservation in case of loss of print; coordinated print and digital archives; shared print programs; collection analyses.
  • Quality Issues: Quality issues need consideration as part of a retention decision; libraries are unlikely to digitize multiple copies.
  • Improvement Strategies: Focus on brittle books (e.g., pre-1920s) where the digital copy will outlast the physical one; focus on specific community needs (e.g., federal documents), which may identify specific QC assessment and remediation strategies.

Users with print disabilities

  • Use Cases: Use of optical character recognition (OCR) text via screen readers.
  • Quality Issues: Image contrast; the overall quality of digitized text and the impact on OCR.
  • Improvement Strategies: Assess quality of OCR content for users with print disabilities; identify and continuously monitor strategies for improving quality.

Print and digital preservation communities

  • Use Cases: Coordinated print and digital archives; shared print programs; over time, physical copies may become at risk and less accessible to users given deterioration/loss; HathiTrust (HT) is used as an evaluative factor in print preservation (e.g., if an item is in HT, decisions may be made differently); some users may rely on HT as a preservation repository.
  • Quality Issues: Addressing fatal flaws is important for items that are “at risk”; is there a threshold of “poor” quality below which items should not be ingested into the corpus?
  • Improvement Strategies: Ensure that items with known poor quality characteristics/measures are ingested with appropriate markers; establish guidelines that preserve items with data on what quality issues exist.

Scholars of languages using non-roman alphabets/scripts

  • Use Cases: Discovery of materials in non-roman scripts, which currently suffers from poor discoverability.
  • Quality Issues: Problems with OCR and romanization errors/quality in both content and metadata; variation in transliteration in metadata.
  • Improvement Strategies: Assessment and potential pilot involving language experts.

Digitization communities

  • Use Cases: If a copy of sufficient quality is already in HathiTrust, a library may not need to re-digitize an item.
  • Quality Issues: The completeness of a digital object would influence local efforts (e.g., foldouts, inserts, etc.).
  • Improvement Strategies: Apply an assessment program to digital resources to help guide library local digitization practice.

Computational research community (e.g., HTRC)

  • Use Cases: Text analysis, linguistic tagging, full-text search, ascertaining content type based on formatting (e.g., music, indented text), extraction of images from page images.
  • Quality Issues: Quality of scanning and accuracy of OCR increase the functionality of HTRC and confidence in the dataset.
  • Improvement Strategies: OCR quality is a key issue; are there ways of assessing the quality of OCR to exclude garbage OCR from the index/HTRC? Could image detection help with quality control efforts?
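
As one illustration of what assessing OCR quality could look like, the sketch below scores a page of OCR text by the share of tokens that resemble ordinary words or numbers. It is not a HathiTrust or HTRC method. The function names, the token pattern, and the 0.6 threshold are all assumptions chosen for illustration, and a real assessment would need tuning per language and script.

```python
import re

# Assumption: a token is "plausible" if it is made only of letters (any
# script), optionally joined by internal apostrophes or hyphens, or if it
# is a plain number such as 1920 or 1,250.
_WORD_RE = re.compile(
    r"^(?:[^\W\d_]+(?:['\u2019-][^\W\d_]+)*|\d+(?:[.,]\d+)*)$"
)

# Punctuation stripped from the edges of each token before matching.
_EDGE_PUNCT = ".,;:!?\"'()[]{}\u201c\u201d\u2018\u2019"


def plausible_token_ratio(text: str) -> float:
    """Return the fraction of whitespace-separated tokens that look like words.

    Empty or all-punctuation text scores 0.0; callers may want to treat
    blank pages separately rather than as garbage.
    """
    tokens = [t.strip(_EDGE_PUNCT) for t in text.split()]
    tokens = [t for t in tokens if t]
    if not tokens:
        return 0.0
    good = sum(1 for t in tokens if _WORD_RE.match(t))
    return good / len(tokens)


def is_probably_garbage(text: str, threshold: float = 0.6) -> bool:
    """Flag text whose plausible-token ratio falls below a hypothetical threshold."""
    return plausible_token_ratio(text) < threshold


if __name__ == "__main__":
    clean = "The quick brown fox jumps over the lazy dog."
    noisy = "Th3 qu1~k b#own f0x ||| ::: j^mps ov3r"
    for sample in (clean, noisy):
        print(round(plausible_token_ratio(sample), 2), is_probably_garbage(sample))
```

A score like this says nothing about whether the recognized words are the correct ones, and it assumes whitespace-delimited text, so it would serve only as a coarse first filter for pages to review or exclude.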