Program Steering Committee Planning Briefs

HathiTrust Program Steering Committee Planning Brief: Non-Text Formats

September 2014

Proposed revisions to the HathiTrust mission statement explicitly embrace collections beyond digitized print, including support for “a variety of formats and born-digital materials.” HathiTrust has mounted several pilots involving non-text (or non-print or non-book) materials, such as the University of Minnesota image ingest pilot and a current pilot project to ingest audio materials. The Ithaka 2011 report also surfaced substantial partner interest in special collections materials. Nonetheless, extending the corpus to encompass non-book materials will present many new issues of a strategic, economic, technical, and operational nature. Moving in this direction will require careful consideration of all of these factors.

Defining what is and is not included by the terminology “non-text, non-print, non-book” is a necessary first step for reducing ambiguity and misunderstanding around potential initiatives. For purposes of discussion, non-text may be defined as what is not predominantly being stored in HathiTrust now: objects that are not books or journals, or that have not originated as a printed word. Examples include:

  • Still images
  • Time-based media such as sound and moving images
  • Executable content
  • Formats such as archival materials and web archives may or may not be considered in this category, subject to further exploration of member interests and priorities.

We view born-digital versions of books and journals, such as files that might be deposited directly by a book publisher or material being published directly into HathiTrust, as an extension of current work and out of scope for this area.

Introducing non-text material, no matter how it is defined, necessarily requires a careful examination of cost and service models to support an extended mission. Support for non-text materials will entail significant development costs as well as ongoing management costs that may be very different from those required for book-like content. The current cost-sharing model is based on an assumption of significant overlap in partner holdings of commercially published material, and this may not be true of all non-text materials (e.g., published or government-issued maps and commercial sound recordings may be widely held, but archival materials will not be). Non-text materials, particularly unpublished or archival materials, will also present very different rights considerations, with respect to not only copyright status (e.g., the potential for multiple layers of copyright in audiovisual works), but also issues of donor restrictions and privacy. Additionally, non-text materials may have very different metadata, indexing, and user interface requirements from those of the current corpus, and different content types will have very different needs.

In short, any activities that HathiTrust undertakes in this area will require extensive and detailed planning as well as a significant commitment of resources. This is not something that will happen quickly.

Issues to be considered in this domain:

Needs and priorities

  • What are the specific needs and priorities of current HathiTrust partners in this area – what kinds of materials are of interest, and for whom? Is there a critical mass of interest in particular areas?
  • How can we move in this direction without undermining our ability to continue to develop the kinds of collections for which HathiTrust is best known? Will it divert us from building on our core strengths and dilute our effectiveness? What is the perceived value of building on our core strength in printed material vs. branching out in new directions?
  • Are the needs centered on preservation, access, or both?

Strategic considerations

  • What are the perceived benefits, both to members and to HathiTrust as a whole?
  • How would this relate to other initiatives in this area, such as DPLA?
  • What is the best way to move forward in these areas without adversely impacting core activities?

Technical considerations

  • Managing some non-text formats, particularly time-based media, would entail significant new costs in terms of development, systems architecture, storage and preservation – are there sufficient resources and collective will to address these in a large-scale way?
  • If unpublished and archival holdings are included, there will be new issues to deal with in terms of rights status, ownership conditions, and privacy concerns; will these have scalable solutions?
  • Non-print materials will have very different characteristics and requirements from digitized print in the areas of uniqueness vs. duplication; rights status; metadata standards and requirements; indexing; user interface; format migration needs; and potentially, audience. Inclusion of non-text formats opens new challenges and may well require different and currently undeveloped solutions.

Operational considerations

  • Might approaches to non-text formats be centered on indexing and/or metadata aggregation rather than building a central object repository? Or on a hybrid approach that supports the needs of both partners seeking a solution for storing and/or preserving objects and those who may have local asset management solutions?
  • Should participation in non-book collection aggregation be organized on an opt-in basis?
  • What cost model would be appropriate for non-text collections, given the likelihood of very little holdings overlap (at least for unpublished and archival materials)?

HathiTrust Program Steering Committee Planning Brief: Print Disability Services  

September 2014

The HathiTrust Program Steering Committee has identified the potential of the HathiTrust print disability service as an area that deserves attention in the coming year.   

Through a designated proxy, eligible patrons at HathiTrust partner institutions can receive special access to in-copyright digital copies held in the HathiTrust. The current guidelines state that the materials must be held currently or have been held previously by the institution’s library, as indicated through print holdings information submitted to HathiTrust.

With the favorable ruling in the latest challenge by the Authors Guild, this is the time to think more broadly about what services could be offered to users with print disabilities. This is an opportunity to define the types of print disabilities that HathiTrust will address (e.g., visual impairment, dyslexia, inability to hold and manipulate a print book); to weigh the importance of extending services beyond HathiTrust members (providing value to society, enhancing awareness and reputation of HathiTrust); and to identify a potential funding source to help improve services (e.g., developing the Page Turner, OCR correction, correcting file structure).

Questions to explore are:

  1. Are we doing all we can to support people who have access to HathiTrust’s print disability services?
    1. Should the corpus of the HathiTrust be thought of as a national library collection for the purposes of service to all people with print disabilities?
    2. HathiTrust has potential partners in the higher education and disability services communities who could help expand the service. How could those relationships be realized?
      1. Just passing eligible and authenticated users on to HathiTrust?
      2. Using HathiTrust assets and adding services? And if so, would this be a reciprocal relationship?
      3. Is there the potential for cost sharing through potential partnerships?
    3. Does this effort stay true to the mission of the HathiTrust?
  2. How broad is the potential impact for the users?   
    1. How much more impact can HathiTrust have, either independently or in partnership with public or private entities, in supporting people with print disabilities?
    2. What can we do within the law? How far does the partnership want to challenge the limits?   
    3. How does the HathiTrust maintain sensitivity to political issues and reputation of the HathiTrust brand?
      1. Is this the right thing to do?
      2. This is an important moment, with the legal standing so prominent; should we take advantage of this moment in time?
  3. What is the scope of the need?
    1. Are the HathiTrust partners taking advantage of the service to help people with print disabilities on their campuses?
    2. What is the potential number of users if we fully exploited this service on campuses?
    3. What is the potential number of users if we expanded the service beyond the partnership?
    4. What would be the cost to the partnership if HathiTrust did expand the service?
  4. Can we provide direct access for those who have demonstrated need?
    1. Do we need to have a proxy or can there be a user initiated validation process?
    2. What additional authentication can be put in place so that this service is not abused, ensuring that we comply with the intent of the service and honor the legal precedent?
    3. Is two-factor authentication achievable with current campus infrastructures?
  5. Is there a potential revenue stream to benefit partners?
    1. Can we ethically collect fees for the print disability service?
    2. Can we ethically collect more than just the cost to access the service to benefit the partnership?
    3. Are the potential revenue and the effect on the HathiTrust brand balanced?
      1. Is being a public good more important than the potential income?
      2. Is the potential access revenue large enough to support infrastructure that will benefit the entire partnership?

HathiTrust Program Steering Committee Planning Brief: HathiTrust Quality and Validation Issues

September 2014

HathiTrust has long sought to establish itself as a repository of high-quality digital content designed around the needs of research libraries and their constituencies. In 2011, HathiTrust received certification as a trustworthy digital repository by the Center for Research Libraries after an extensive audit. The final report included a recommendation to “clarify and strengthen the quality assurance and print archiving components” of our operations. Thus, ensuring the quality of the HathiTrust corpus is essential to fulfilling our mission and exercising responsible long-term stewardship over these incomparable collections.

The HathiTrust Program Steering Committee plans to explore several issues and potential projects this year relating to the quality of the digital surrogates in the repository. By quality in this context, we are referring specifically to the fidelity, completeness, and authenticity of the digital surrogates in the corpus. Clearly a focus on quality can have many other dimensions, including collection policy, services, user experience, etc., but our focus here is on the digital surrogates themselves.   

The aim will be threefold:

  1. To develop policies that can guide decisions and procedures for evaluating the quality of content (images, OCR and metadata) ingested into HathiTrust, including how materials with varying quality indicators are treated within the repository 
  2. To explore the feasibility of a quality certification program for items in the repository, tied to specific use cases
  3. To consider avenues or projects for quality remediation

1. To develop policies that can guide decisions and procedures for evaluating the quality of content (images, OCR and metadata) upon ingest into HathiTrust, including how materials with varying quality indicators are treated within the repository

Since its inception, HathiTrust has applied a quality filter to Google books ingested into the repository, by rejecting material whose Google-assigned error rate exceeded a certain threshold (>15%). This policy was intended to ensure that HathiTrust served as a preservation repository of high-quality content. However, Google’s error detection algorithms are designed primarily for internal processing purposes, and as they have evolved over time, their reliability for HathiTrust purposes has become uncertain. In June of this year, HathiTrust modified its ingest gating criteria after partners noticed that a significant number of volumes digitized from their libraries by Google were not being ingested into the repository, due to changes in Google’s error scoring that were unrelated to the quality of the items themselves. These developments have prompted broader discussions about the role of quality and of ingest gating in HathiTrust, how this relates to the repository’s preservation mission, and whether there are similar considerations that should inform our approach to a broader range of quality considerations (such as metadata quality). The Program Steering Committee proposes to charge a group to review HathiTrust’s overall approach to quality and how this should be reflected in ingest gating practices, metadata ingest and remediation, and the display of content of uncertain quality (including implications for the user interface and feedback mechanisms). The goal would be a recommended set of policies and practices to be implemented by HathiTrust systems.
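For discussion purposes, the threshold-based gate described above might be sketched as follows. This is an illustration only, not HathiTrust's actual ingest code: the record shape, field names, and function names are all hypothetical, and the error rate is assumed to be expressed as a fraction between 0 and 1.

```python
from dataclasses import dataclass

# Hypothetical record shape; the field names are illustrative only.
@dataclass
class Volume:
    volume_id: str
    error_rate: float  # vendor-assigned error rate as a fraction (0.0 to 1.0)

# The ">15%" gate described in the text, expressed as a fraction.
ERROR_RATE_THRESHOLD = 0.15

def passes_ingest_gate(volume, threshold=ERROR_RATE_THRESHOLD):
    """Return True when the volume's error rate does not exceed the threshold."""
    return volume.error_rate <= threshold

def partition_by_gate(volumes, threshold=ERROR_RATE_THRESHOLD):
    """Split volumes into (accepted, rejected) lists under the given threshold."""
    accepted = [v for v in volumes if passes_ingest_gate(v, threshold)]
    rejected = [v for v in volumes if not passes_ingest_gate(v, threshold)]
    return accepted, rejected
```

A gate of this kind is only as reliable as the score it reads: if the scoring method changes upstream, the same threshold will admit or reject a different set of volumes, which is the situation partners observed.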

2. To explore the feasibility of a quality certification program for items in the repository, tied to specific use cases

The questions raised about Google’s quality indicators have highlighted the need for reliable indicators of quality for materials in HathiTrust. Any number of issues, ranging from a desire to reduce the number of unnecessary duplicate items stored in the repository and minimize storage costs, to evaluation by end users of the fitness of particular materials for specific purposes, hinge on an ability to ascertain the quality of a given item in the repository. That said, what is meant by ‘quality’ itself requires definition, and raises the corollary question “quality for what purpose?” 

Paul Conway’s IMLS grant “Validating Quality in Large-Scale Digitization: Metrics, Measurement, and Use-Cases”[1] sought to develop a framework for evaluating quality tied to specific user needs. A proposal was subsequently made to the HathiTrust Board of Governors to build on that work by fleshing out this framework and, using a distributed staffing model similar to the CRMS project, undertake a project by which HathiTrust volumes could be reviewed and certified for particular uses. The Board of Governors asked the PSC to consider the proposal in greater depth and analyze the business case for such a service.

3. To consider avenues or projects for quality remediation

Beyond establishing thoughtful and well-understood quality thresholds for acceptance of digital surrogates into the publicly accessible repository, what are the opportunities for HathiTrust and the community to address quality deficit issues through a remediation program? Might HathiTrust become deliberately transparent to the community about existing quality challenges as an engagement strategy for addressing these issues? At a minimum, HathiTrust might consider framing a remediation strategy around these activities:

  • Identify and transparently (publicly?) acknowledge works in need of error correction or quality remediation
  • Determine criteria and processes for prioritizing and directing remediation efforts and investments
  • Review methods for achieving impactful and scalable remediation, including consideration of innovative contribution and validation methods used by entities such as Wikipedia and Zooniverse.

Questions to explore are:

  • The role of quality in HathiTrust, broadly speaking – to what extent is this a core value, what do we mean by it, and how should it inform decisions and actions with respect to the corpus?
  • The effectiveness of previous and current efforts to evaluate quality of digitized materials, including evaluation of images, OCR text and bibliographic records.
  • The broad effects (pros and cons) of gating materials from ingest.
  • The different levels and types of quality measures for different purposes (reading, text mining, print on demand / replacement of print, preservation ...)
  • The different needs and implications for full view and limited search.
  • Potential external sources of quality data and how to leverage these (Google n-gram collections, high-resolution TIFFs, etc.).
  • If programs are mounted for quality certification or remediation (or both), how should these be organized and funded? Should this work be designed to align with other HathiTrust projects, such as print monographs archiving or government documents digitization?  Are there particular kinds of collections that should be targeted (e.g. based on rights status)?
  • To what extent should we involve external partners such as Google in these projects or project outcomes?

[1] http://hathitrust-­‐


HathiTrust Program Steering Committee Planning Brief: Metadata Strategy and Policy 

September 2014

The HathiTrust mission is to contribute to the common good by collecting, organizing, preserving, communicating, and sharing the record of human knowledge and to dramatically improve access to the collection. To support these goals, HathiTrust gathers and maintains a variety of metadata and has evolved a general set of policies and practices over time for how this metadata is managed and shared, providing open access to its metadata in all cases where legal and privacy restrictions allow. However, beyond the recent development of a formal policy governing bibliographic metadata correction for the Zephir metadata management system,[1] there exists at present no formal policy or set of policies governing the development and management of HathiTrust metadata, nor has HathiTrust explored in any detail how the metadata it maintains might be leveraged to enhance its value and the value it provides to HathiTrust services and end users. This work item will review the range of metadata managed by HathiTrust, consider whether there are opportunities to develop a more holistic approach to metadata management that can benefit the partnership, and seek to articulate a set of policies and/or strategies for the management and future development of HathiTrust metadata.  

Metadata currently managed by HathiTrust includes:

  • Bibliographic metadata contributed by HathiTrust partner institutions to describe their contributed works, currently managed via Zephir, the metadata management system developed and maintained by the California Digital Library. Zephir data is used as the basis for several metadata exposure methods: the hathifiles (a tab-delimited subset of HathiTrust bibliographic metadata), a bibliographic API which provides up to 20 brief or full metadata records in a single call, and an OAI feed which provides records with limited bibliographic metadata.
  • Rights metadata developed and maintained in a separate database at Michigan, populated with publication date information from Zephir and updated in part through the work of the Copyright Review Management System (CRMS) project. Rights metadata is linked to but managed separately from Zephir.
  • Holdings metadata for print collections that is collected annually by Michigan to determine collection overlap and derive HathiTrust cost allocations – this data includes OCLC numbers but is not otherwise linked to HathiTrust’s bibliographic or rights metadata systems  
  • US Federal Documents registry – This project seeks to develop a comprehensive registry of US federal documents to support HathiTrust’s federal documents digitization initiative (see
  • Technical and preservation metadata is stored in the METS files that accompany digital objects in HathiTrust, but is not independently accessible and queryable.
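As an aside on the bibliographic API noted above: because a single call returns at most 20 records, a client that needs more must split its identifier list into batches. The sketch below shows only that batching step. The function name and identifier format are hypothetical, and no request URL or response format is assumed.

```python
from itertools import islice

# Per-call record limit stated in the description of the bibliographic API.
MAX_RECORDS_PER_CALL = 20

def batch_identifiers(identifiers, batch_size=MAX_RECORDS_PER_CALL):
    """Yield successive lists of at most batch_size identifiers."""
    iterator = iter(identifiers)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch
```

For example, 45 identifiers would be split into batches of 20, 20, and 5, each of which could then be submitted as one call.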

Issues to be explored in this work item include: 

  • Metadata integration – could HathiTrust benefit from increased integration of its various metadata initiatives? Are there additional services that could be supported through a more integrated approach? (e.g., the possibility of enhanced collection analysis services by combining bibliographic and holdings metadata, or utilizing bibliographic metadata to populate the federal documents registry)
  • Maintenance and enhancement – HathiTrust’s current bibliographic metadata correction policy relies on individual partners to maintain and enhance the bibliographic metadata corresponding to their works in the corpus – critical updates are made centrally (e.g., those involving rights), but corrections are not made to partner records themselves. Should HathiTrust be more proactive about correcting and enhancing bibliographic metadata, especially where algorithmic or bulk operations can be employed, e.g., by including authority data elements such as author death dates from trusted authority data providers such as OCLC’s VIAF (Virtual International Authority File) and LCNAF, the Library of Congress Name Authority File?
  • Metadata use and sharing – should HathiTrust develop a formal policy to clarify the conditions under which HathiTrust metadata can be shared and reused?
  • Metadata impact – accuracy, quality, and completeness of metadata have an obvious impact on users’ ability to discover and evaluate HathiTrust collections. These same qualities affect services in less obvious ways: in making accurate rights determinations; in detecting (and connecting) duplicates; in creating data sets for analysis through the HathiTrust Research Center; in accurately matching members’ holdings data with the HathiTrust collection. What measures to improve HathiTrust metadata would have the greatest impact on services? What central staffing or action by members will be needed to support this work? What partnerships or automated techniques (such as the use of linked data) can be used to advantage?