HathiTrust Short- and Long-Term Functional Objectives
January 15, 2010
Functional Objectives – Short-term
- Page turner mechanism: A page turner has been deployed for all content in HathiTrust. We hope to report soon on a strategy to re-engineer the current page turner application so that it provides access to materials in HathiTrust through an API. The intention is to provide a wider variety of functions or modes of access to the collections than are currently available. A first draft of the functional specification for the Data API was completed in January 2009. Following internal discussion and revision it was released in April 2009 on the HathiTrust website for broader comment. It is now available at http://www.hathitrust.org/data_api. Feedback on the specification is requested and should be sent to hathitrust-info@umich.edu.
- Branding (overall initiative; individual libraries): After consultation with our partners, we released several new elements that provide support for branding in the HathiTrust repository. These elements include:
- The pageturner now prominently identifies the HathiTrust initiative;
- A watermark on every page identifies the digitizing agent; and
- A watermark on every page identifies the source library of the print material.
- The source of the print material is included in our feed of bibliographic identifiers so that institutions can import or update records with this information.
- The pageturner contains institution-specific branding, identifying to users at partner institutions that their institution is a member of HathiTrust.
- Format validation, migration and error-checking: Format validation and error-checking are currently performed for all content that enters HathiTrust. No migration of content has been necessary to date, and we believe that we have reduced the likelihood of needing one by choosing rich, flexible, standards-based formats. We have performed the work required to store a variety of technical and digital preservation metadata along with each object in order to aid in migration should it become necessary. Finally, the Isilon storage automatically conducts periodic parity and media checks in the background, an uncommon feature in storage systems and one of the reasons this storage system was seen as an appropriate match to the project.
- Development of APIs that will allow partner libraries to access information and integrate it into local systems individually:
- Bib API: HathiTrust partners identified the need for a mechanism by which a bibliographic identifier (e.g., an ISBN or OCLC number) could be submitted to HathiTrust and resolved as a persistent URL with information about levels of access (e.g., full text or search only). An API to accomplish this (the Rights API) was released and is being implemented in the online catalogs of several institutions (a list of known institutions is available at http://www.hathitrust.org/access). The Bib API, released in January 2010, replaces this initial implementation, and users of the Rights API are asked to switch to the new API. In addition to having a more robust backend than the initial implementation, the Bib API delivers more complete information about volumes (e.g., all items in a multi-volume set or serial), as well as an option to include full MARCXML metadata.
- Data API: A second API, known as the Data API, is also available to retrieve content (page images and OCR text files) and content metadata (METS files, rights and administrative information) from the repository. The availability of these resources to client applications (examples of current applications are the HathiTrust Collection Builder and Pageturner) will enable the creation of additional services and uses of repository materials. Other similar APIs will be developed as needed in the future.
Information on all modes of content and metadata distribution (including OAI and tab-delimited metadata files) can be found at http://www.hathitrust.org/data.
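As a rough illustration of the Bib API lookup described above, the following Python sketch builds a request URL for one identifier and pulls the persistent URL and access level out of a decoded response. The base URL, the path layout, and the response field names (`items`, `itemURL`, `usRightsString`) are assumptions made for this sketch and are not stated in this report; the sample response is invented for illustration. Consult the Bib API documentation for the actual interface.

```python
import json
from urllib.parse import quote

# Assumed base URL and path layout; verify against the Bib API documentation.
BIB_API_BASE = "http://catalog.hathitrust.org/api/volumes"


def build_bib_api_url(id_type, id_value, detail="brief"):
    """Build a lookup URL for one identifier (hypothetical helper).

    id_type is an identifier scheme such as "oclc" or "isbn";
    detail selects a brief record or a full one (assumed to carry MARCXML).
    """
    if detail not in ("brief", "full"):
        raise ValueError("detail must be 'brief' or 'full'")
    return f"{BIB_API_BASE}/{detail}/{quote(id_type)}/{quote(str(id_value))}.json"


def summarize_items(response):
    """Return (persistent URL, access level) pairs from a decoded response.

    The field names used here are assumptions about the response shape.
    """
    return [
        (item["itemURL"], item["usRightsString"])
        for item in response.get("items", [])
    ]


# Invented sample response, for illustration only (not real repository data).
SAMPLE_RESPONSE = json.loads("""
{
  "records": {},
  "items": [
    {"htid": "example.0001",
     "itemURL": "http://hdl.handle.net/2027/example.0001",
     "usRightsString": "Full view"},
    {"htid": "example.0002",
     "itemURL": "http://hdl.handle.net/2027/example.0002",
     "usRightsString": "Limited (search-only)"}
  ]
}
""")

if __name__ == "__main__":
    print(build_bib_api_url("oclc", 424023))
    for url, access in summarize_items(SAMPLE_RESPONSE):
        print(url, "->", access)
```

A local catalog could call a helper like `build_bib_api_url` for each record it displays and use the returned access level to decide whether to show a "full text" or "search only" link.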
- Access mechanisms for persons with disabilities: HathiTrust has deployed an interface for visually impaired users (optimized for use with JAWS and other screen readers). This interface presents to the user the entire text version, with navigation, on one screen. Staff members at the University of Michigan are currently working with UM School of Information interns to optimize this interface for use with screen readers and to improve the general accessibility of the pageturner. For in-copyright resources, access is currently limited to authorized users at the University of Michigan. We plan to add Shibboleth support to the HathiTrust repository so that resources like access mechanisms for persons with disabilities can tie into the authentication environments of our partner institutions.
- Public ‘Discovery’ Interface for HathiTrust: HathiTrust has initiated a multi-stage strategy to create a “public interface” mechanism, an interface with which digital books and journals in the HathiTrust repository can be discovered and accessed.
- The first phase of this effort was the creation of a temporary public beta of a comprehensive bibliographic search. This was made available in April 2009. The temporary catalog provides bibliographic search and faceted browse of all content in HathiTrust, with the ability to restrict to all public domain resources or volumes digitized from a specific institution’s collection. This public beta serves as a real-world proof of concept for the second phase.
- The second phase, which began concurrently with the first, is to create a replacement for the temporary public beta catalog. In April 2009 the HathiTrust partners began working with OCLC to create a production-level catalog for materials in the repository. The first release of this catalog is targeted for April 2010. Information about this project and progress that has been made can be found at http://www.hathitrust.org/projects#wg_discovery.
- Ability to publish virtual collections: Vast bodies of digital content benefit from methods to gather together subsets into “collections” that can be searched and browsed. HathiTrust has created an early release of a Collection Builder that permits individuals to create public (i.e., shared) and private collections. We will turn our attention to creating mechanisms by which persons such as bibliographers can create and share collections with a more formal identity (imagine, for example, having full text resources associated with classic bibliographies such as the Wing or Pollard and Redgrave short title lists). We are now performing intensive usability review on the Collection Builder. Although the Collection Builder’s authentication and authorization now rely on the University of Michigan “friend” system (see http://www.itd.umich.edu/itcsdocs/s4316/), Shibboleth support is targeted for addition to the HathiTrust repository in the first quarter of 2010 so that resources like the Collection Builder can tie into different authentication environments.
- Mechanism for direct ingest of non-Google content: We are currently finalizing standards and procedures for the ingest of partner content that has been digitized by the Internet Archive. We expect to begin ingest of IA-digitized volumes from the University of California in January 2010, with IA-digitized content from other partner institutions to follow. The University of Michigan is in the process of hiring a programmer dedicated to transforming content from a variety of non-Google sources (including non-IA sources) for ingest into the repository. We are also working to provide a tool that partners can use to validate their own content prior to ingest.
Functional Objectives – Long-term
- Compliance with required elements in the Trustworthy Repositories Audit and Certification (TRAC) criteria and checklist: HathiTrust has addressed most of the minimum required elements in the TRAC criteria and checklist. All of the required elements will receive ongoing attention, with incomplete items being assigned the highest priority. In addition, the Center for Research Libraries and HathiTrust have made plans for an independent assessment of the HathiTrust repository, based largely on the TRAC criteria. The assessment will take place during the winter of 2009-2010. More information is available on the CRL website at http://www.crl.edu/content.asp?l1=13&l2=58&l3=181.
- Robust discovery mechanisms like full-text cross-repository searching: Full-text search of the entire repository was released on November 19, 2009. It can be accessed at http://catalog.hathitrust.org. More information about full-text search is available at http://www.hathitrust.org/large_scale_search and in the Large-scale search blog at http://www.hathitrust.org/blogs/large-scale-search. The official full-text press release can be found at http://www.ns.umich.edu/htdocs/releases/story.php?id=7426. We are currently investigating the integration of full-text search into the Collection Builder application.
- Development of an open service definition to make it possible for partner libraries to develop other secure access mechanisms and discovery tools: We believe that the great wealth of resources that HathiTrust now makes available can only be effectively exploited through the creation of an open service definition that makes it possible for others to create new tools and approaches to access. As a first step, we are in the process of creating a parallel production system that does not compromise the content in the repository, and gives developers access to the functions of the HathiTrust repository system. A preliminary instantiation of this environment was created in December 2009 based on specifications of a dedicated working group. It is initially configured to facilitate collaborative development of the HathiTrust PageTurner, which the University of California and University of Michigan undertook in 2009. We hope that the availability of this development sandbox, in conjunction with existing and future APIs, will make it possible for partner institutions to collaborate in creating new services for the repository.
- Support for formats beyond books and journals: Our first “content” priority is support for digitized books and journals, but we believe that HathiTrust must expand its support to other formats (particularly born-digital publications) and materials. To this end, HathiTrust has begun to investigate issues relating to the storage and delivery of electronic publications (in the ePub format in particular) and digital audio and image files (such as maps). Pilot projects in each of these areas are planned for 2010. Updates on timelines for these initiatives will be posted in the coming weeks.
- Development of data mining tools for HathiTrust and use by HathiTrust of other analysis tools from other sources: Because of the vast bodies of content held by HathiTrust, an important function of the HathiTrust repository will be to support data mining and other forms of large-scale analysis. In July 2009, HathiTrust engaged members of partner institutions in a working group to develop specifications for a HathiTrust Research Center. The work of this group has resulted in a Call for Proposals that will be distributed to interested HathiTrust institutions to implement this Center. Information about this working group can be found at http://www.hathitrust.org/projects#wg_research_center. As a first step toward enabling computational research with repository materials, HathiTrust has made sample datasets of two different sizes available to researchers for computational processing and analysis. The first sample is available to all researchers through an application process. The second sample will be available to participants in the Digging into Data Challenge. The samples are described below.
Sample 1: The first sample is composed of 5,000 texts, which may be requested in one of three bundles. Texts in all bundles are pre-1923 (pre-1869 for works published outside of the United States) and are as follows:
- A random sample representing 4 character sets and 5 languages (Arabic, English, French, Japanese, and Russian)
- A random sample of English language literary and historical texts
- A random sample of Classics texts, including original language texts and translations
Sample 2 - Digging into Data: A second sample of 50,000 texts will be made available for participants in the Digging into Data Challenge. The corpus represents a mix of dates (as above, all pre-1923, and pre-1869 for materials published outside the United States), countries of origin, languages, character sets, and formats (i.e., some serial literature in a body of mostly monographic literature). More information is available at http://www.hathitrust.org/hathitrust_datasets.

