HathiTrust Research Center Semi-annual Report (October 1, 2011 - March 31, 2012)

Updates are provided in relation to the milestones listed at HathiTrust Research Center Timeline and Deliverables.

Executive Summary

The HathiTrust Research Center (HTRC) had a productive six months working through core issues in Phase I of its development effort. In terms of milestones, we are planning a public demonstration of functionality, tentatively scheduled for June 2012 in accordance with the MOU between HathiTrust and HTRC. Phase II, in which HTRC is operational, is scheduled to begin on January 1, 2013.

HTRC is pleased to report that three legal agreements guiding the Center have been completed at the University level. The MOU between HathiTrust and the HathiTrust Research Center has been signed at IU and UIUC and is now with the University of Michigan. The MOU between IU and UIUC has been fully executed. For the Google Agreement, UIUC and IU have each entered into a separate agreement with Google under the same terms; both agreements have been signed at the University level and are now with Google.

The philosophy behind the technical infrastructure is to use existing services and cyberinfrastructure as much as possible to reduce development and maintenance costs. One example of that philosophy is our recent evaluation of existing tools such as Blacklight.

The HTRC cyberinfrastructure is up and running on four 4-core virtual machines hosted at IU. We are arranging access to more disk space in anticipation of the next steps once the Google agreement is executed. We have a sandbox set up at UIUC to permit broader internal testing. The sandbox consists of a Cassandra NoSQL server v1.0 (for the volume store), a Solr index, and v0.1 of the HTRC Data API. The sandboxes make available 68,724 volumes of non-Google scanned content.

1.  Technical Accomplishments 

Data API: The HathiTrust Research Center released a beta version 0.1 of the HTRC Data API. The API is RESTful and provides access to the HTRC Solr index and volume store. It cannot be used to download volumes, but it can be used to move data to a location where computation takes place. It can also be used to search the Solr index for a set of volume IDs and pass those IDs to a service for access and computation. Access to the API will require OpenID authentication and appropriate authorization. The Data API is installed on two sandbox machines, one at UIUC and another at IU, for internal testing. Both sandbox installations work against a small subset of non-Google scanned volumes.
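As a rough illustration of the search-then-compute workflow described above, the following Python sketch builds a Solr-style search request and extracts volume IDs from a Solr JSON response. The base URL, the `/solr/select` path under the Data API, and the `id` field name are assumptions made for illustration only; this report does not specify the actual Data API endpoints. The `q`, `fl`, `rows`, and `wt` parameters and the `response` / `docs` layout follow standard Solr conventions.

```python
from urllib.parse import urlencode

# Hypothetical base URL; the real Data API location is not given in this report.
BASE_URL = "https://sandbox.example.org/data-api"


def build_search_url(query, rows=10):
    """Build a Solr-style select URL that asks only for volume IDs."""
    params = {
        "q": query,    # Solr query string, e.g. a fielded search
        "fl": "id",    # return only the ID field (field name is assumed)
        "rows": rows,  # maximum number of hits to return
        "wt": "json",  # ask Solr for a JSON response
    }
    return f"{BASE_URL}/solr/select?{urlencode(params)}"


def extract_volume_ids(solr_response):
    """Pull the volume IDs out of a parsed Solr JSON response."""
    return [doc["id"] for doc in solr_response["response"]["docs"]]


# A hand-written response in the standard Solr JSON shape (IDs are made up).
sample_response = {
    "response": {
        "numFound": 2,
        "docs": [{"id": "volume-0001"}, {"id": "volume-0002"}],
    }
}
volume_ids = extract_volume_ids(sample_response)
```

The resulting list of IDs is what would be handed to a downstream service for access and computation, so that no volume text needs to be downloaded by the caller.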

Sandbox: HTRC has set up a sandbox at UIUC that consists of a Cassandra volume store, a corresponding Solr index, and a collection of 68,724 volumes of non-Google scanned content. It supports the Data API v0.1, but with security disabled to encourage testing and exploration.

Data Management components:

  • Solr Indexes: HTRC has one 4-core, 16 GB RAM machine at IU to run HTRC Solr indexes. We currently run one index for each of our test collections, an arrangement that will remain in place until the Google agreement is executed. The Solr index is accessed through the Data API layer, which limits some access and performs auditing but otherwise passes requests through to the Solr API.
  • Volume Store: HTRC uses a NoSQL data store cluster to hold the volumes of digitized text. HTRC currently supports a 250,000-volume collection of non-Google digitized content and a 50,000-volume collection of content that IU libraries digitized. These collections reside in a cluster of three 4-core, 16 GB RAM machines running v1.0 of Cassandra. These nodes give volume-level and page-level access to HTRC data through the HTRC Data API. Each machine has 500 GB of disk, and the volumes are partitioned and replicated across the three Cassandra instances.
  • Registry: IU is running a version of WSO2 Governance Registry, where applications are registered prior to running in the non-consumptive framework. The registry is also used as temporary storage for returned results.

Non-Consumptive Research: HTRC received funding from the Alfred P. Sloan Foundation to develop a secure infrastructure for executing large-scale parallel tasks on copyrighted data using public compute resources such as FutureGrid or resources at NCSA. The high-level design uses a pool of VM images that run in a secure-capsule mode and are deployed onto compute resources. The team is working on a proof-of-concept deployment process onto an OpenStack platform using Sigiri.

Blacklight proof of concept. Searching for data is an important function for digital humanists. Finding all of the works with a specific set of concepts, in a certain genre, by one or more known authors, or matching other such criteria is generally the first step in the research process. Rather than develop a new search interface, a time-consuming activity, the HTRC technical team determined that Blacklight, an open source library catalog search and retrieval system, would be a reasonable choice for the HTRC, for three reasons. Blacklight is designed to support data that is both full text and bibliographic, exactly the type of data that the HTRC has. It is built on Solr, the same technology that we already use to index the HTRC data. And it supports faceted searches, a known need of researchers. We have a test implementation on a shared server at UIUC, which we were able to deploy with very few problems. In the next quarter, we expect to use the customization options to configure the look and feel of the interface and perhaps to extend the functionality to show snippets of the text to help researchers refine their results. Any new functionality that we develop will be shared with the larger Blacklight community. We expect Blacklight to be a significant component of the public face of the HTRC.

OCR error detection study: Members of the HTRC recently undertook a study to quantify OCR errors in the HathiTrust corpus. Scholars are interested in doing quality text analysis, but results can be confounded by OCR errors. Information on which books (or pages) in the collection have significant rates of OCR errors could help. The HTRC explored a couple of approaches to OCR error detection and has results for one approach that uses machine-generated and expert-evaluated rules. Starting with a large dictionary of correctly spelled words, HTRC members identified outlier words that were in the HathiTrust corpus but not in the dictionary.
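The dictionary-based idea can be sketched in a few lines of Python. This is a minimal illustration, not the HTRC implementation: the tokenization, the tiny dictionary, and the representation of a rule as a simple "outlier word to correction" mapping are all assumptions, since this report does not describe the actual rule format.

```python
import re
from collections import Counter


def find_outliers(text, dictionary):
    """Count words that occur in the text but not in the dictionary."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(word for word in words if word not in dictionary)


def apply_rules(text, rules):
    """Replace each word that matches a rule with its correction.

    `rules` maps a lowercase outlier word to its corrected form
    (an assumed, simplified rule format).
    """
    def fix(match):
        word = match.group(0)
        return rules.get(word.lower(), word)

    return re.sub(r"[A-Za-z]+", fix, text)


# Toy data: a long-s misread as "f" is a typical OCR confusion.
dictionary = {"the", "ship", "sailed", "fast"}
page_text = "The fhip sailed faft"

outliers = find_outliers(page_text, dictionary)
rules = {"fhip": "ship", "faft": "fast"}  # would be expert-verified in practice
corrected = apply_rules(page_text, rules)
```

In this toy case the outliers are `fhip` and `faft`, and applying the two rules yields the corrected text. Keeping detection separate from correction mirrors the idea that candidate rules are generated by machine and only applied once an expert has evaluated them.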

As a check, the rules by which outliers were detected were verified by a human expert. Using this approach, HTRC formulated 48,308 rules that identified outlier words and provided corrections. HTRC members applied the rules to approximately 256,000 non-Google digitized volumes from HathiTrust, which took 4 hours using the National Center for Supercomputing Applications (NCSA) Ember supercomputer. The results showed that the probability of a word having an OCR error (as detected by the rule set) was 0.20%. The average number of errors per page was 0.57, and the average number of errors per volume was 156. The probability that a page had one or more errors was 11%, and the probability that a volume had one or more errors was 84.9%. Overall, 217,754 of the 256,416 volumes had one or more OCR errors, and 7,745,034 of the 69,297,000 pages had one or more errors.
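The reported percentages can be cross-checked against the raw counts given above. The short Python sketch below uses only figures from this report; the variable names are illustrative, and the pages-per-volume and implied errors-per-page values are derived here rather than reported by the study.

```python
# Counts as reported in the study summary above.
VOLUMES_TOTAL = 256_416
VOLUMES_WITH_ERRORS = 217_754
PAGES_TOTAL = 69_297_000
PAGES_WITH_ERRORS = 7_745_034
ERRORS_PER_VOLUME = 156  # reported average

# Share of volumes and pages with at least one detected error.
volume_error_rate = VOLUMES_WITH_ERRORS / VOLUMES_TOTAL
page_error_rate = PAGES_WITH_ERRORS / PAGES_TOTAL

# Derived values (not stated in the report): average volume length and the
# errors-per-page figure implied by the per-volume average.
pages_per_volume = PAGES_TOTAL / VOLUMES_TOTAL
implied_errors_per_page = ERRORS_PER_VOLUME / pages_per_volume

print(f"Volumes with errors: {volume_error_rate:.1%}")
print(f"Pages with errors:   {page_error_rate:.1%}")
print(f"Pages per volume:    {pages_per_volume:.1f}")
print(f"Errors per page:     {implied_errors_per_page:.2f}")
```

The volume-level share works out to 84.9% and the page-level share to roughly 11%, matching the reported figures. The implied errors-per-page value comes out close to, though not exactly equal to, the reported 0.57, which is consistent with rounding in the per-volume average.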

2. Outreach 

Talk: “Digital Humanities At Scale: Hathi Trust Research Center,” by Beth Plale at the University of Maryland on February 29, 2012, hosted by the Maryland Institute for Technology in the Humanities and the University of Maryland Libraries.

Google Digital Humanities Awards Recipient Interview Report. John Unsworth commissioned a study of recipients of the Google Digital Humanities Award over the period 2010 to 2011. The resulting report, Google Digital Humanities Awards Recipient Interview Report, draws on interviews with the recipients to determine what difficulties they encountered when working with the Google corpus. Recurring themes were weak metadata and poor OCR.

OCR Summit, Oct 17-18, 2011, Texas A&M University. Loretta Auvil of UIUC attended the OCR Summit, whose purpose was to bring together experts to work on the problem of optical character recognition for early modern texts, where the printing techniques of the period make it difficult for machines to produce text by “reading” page images. HTRC's proposed involvement would be to provide post-processing capabilities based on work we have done in the SEASR Services project to correct the Google Ngrams data as well as a corpus of 18th and 19th century novels.

Digging Into Data: The HTRC is working with one of the recently awarded Digging into Data projects. The Principal Investigators of the “Digging by Debating” project have approached the HTRC for help in extending the infrastructure and developing functionality that would benefit humanities researchers in general and the Digging by Debating project in particular. The investigators are Colin Allen and Katy Börner, Indiana University, Bloomington (NEH); and Andrew Ravenscroft, University of East London, Chris Reed, University of Dundee, and David Bourget, University of London (AHRC/ESRC/JISC). They are interested in creating interfaces and systems that harvest concepts that cross domains; for example, they want to investigate how concepts from one domain, such as philosophy, are used in another domain, such as physics.

HTRC Web Site: A new HTRC web presence was launched in early December 2011 as part of the HathiTrust website. In addition to direct access via its URL, the HTRC webpages are available from the HathiTrust site navigation; under the About option, the HTRC is referred to as “Our Research Center.” The HTRC website has entirely new content that covers the governance of the HTRC, technical architecture and organization, policies for access and use, project information with timelines and deliverables, information about collaboration opportunities, and demonstrations of future functionality.

3. Initiatives 

Community Engagement through “Challenges”. HTRC is creating a set of ongoing “challenges” as a mechanism for piquing and cultivating community interest and engagement in the research opportunities represented by the HTRC. These challenges are inspired by successful challenges in other domains, such as TREC (Text Retrieval Conference), Netflix, and the Music Information Retrieval Evaluation eXchange (MIREX). J. Stephen Downie is the founder of MIREX (2004) and has considerable experience in organizing and running evaluation challenges. As it stands now, four challenges are being sketched out as possible candidates:

  1. Optical character recognition (OCR) error identification and correction; 
  2. Metadata error identification and correction (and possible enhancement); 
  3. Genre detection; and
  4. Gender identification.

The first two challenges are inspired by the problems identified in the Google awardee interview report discussed in Section 2. We believe that creating challenges around these two topics will pay off in two important ways: it will build active community engagement with the HTRC collections and tools, and it can result in usable OCR and metadata tools that increase the usability of the HTRC collections for future researchers.

4. Governance Matters 

Change in leadership. With the departure of John Unsworth to Brandeis University, J. Stephen Downie has stepped into the role of co-director of HTRC representing UIUC. Stephen is Professor and Associate Dean for Research at the University of Illinois Graduate School of Library and Information Science. He joins the other members of the HathiTrust Research Center Executive Management Team: Beth Plale (co-director and chair); Marshall Scott Poole; Robert McDonald; and John Unsworth, now Vice Provost for Library and Technology Services and Chief Information Officer at Brandeis University.