HathiTrust Research Center Semi-annual Report (April 1, 2011 - September 30, 2011)

Updates are provided in relation to the milestones listed at HathiTrust Research Center Timeline and Deliverables.

Technical

  • Develop bridge and caching strategies between the HathiTrust Repository and indices and the HTRC data store. Work on a versioning database. 

Progress: HTRC is working with a 50,000-volume collection of materials digitized from the IU library and a 250,000-volume collection of non-Google digitized content. Both collections reside at IU on a network-attached storage (NAS) unit and are regularly synchronized with the main HathiTrust collection at Michigan using the Unix tool rsync.
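As an illustration of the synchronization step, the following is a minimal sketch of how an rsync mirror run could be scripted. It is not HTRC's actual job: the host name, paths, and the choice of flags (`--archive`, `--delete`, `--dry-run`) are assumptions made for the example.

```python
import subprocess


def build_rsync_command(source: str, destination: str, dry_run: bool = True) -> list[str]:
    """Build an rsync command line that mirrors `source` into `destination`.

    --archive preserves permissions, timestamps and directory structure;
    --delete removes local files that no longer exist at the source.
    Both flag choices are assumptions, not HTRC's documented settings.
    """
    command = ["rsync", "--archive", "--delete"]
    if dry_run:
        # Report what would change without touching the destination.
        command.append("--dry-run")
    command += [source, destination]
    return command


def run_sync(source: str, destination: str, dry_run: bool = True) -> None:
    """Run the mirror; raises CalledProcessError if rsync exits non-zero."""
    subprocess.run(build_rsync_command(source, destination, dry_run), check=True)


if __name__ == "__main__":
    # Hypothetical source host and local NAS mount point.
    print(build_rsync_command("ht.example.org::collection/", "/mnt/nas/htrc/"))
```

Defaulting to a dry run is a deliberate choice in this sketch: with `--delete` in play, a mistyped destination could remove files, so the destructive run has to be requested explicitly.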

  • Develop a prototype system for non-consumptive research.

Progress: HTRC received a 3-year grant from the Alfred P. Sloan Foundation to build this prototype system. The system will prove experimentally and theoretically that it is possible to comply with the non-consumptive constraint in computational research. It will serve as a platform on which the broad research community can develop, test, and execute new algorithms that run at scale on the HathiTrust corpus. This research involves Atul Prakash of the University of Michigan.

  • Develop web and portal capabilities.

Progress: HTRC staff set up a development portal for HTRC. The portal is built using the Lift framework. A key aspect of the portal implementation is its support for InCommon identity and access management, which enables users to log in with their home university credentials. The portal is consequently more secure because HTRC does not need to manage identities itself, and users benefit because they are not required to remember another user ID and password.

  • Develop distributed access capabilities and improve data quality.

Progress: In this first phase, HTRC is setting up core infrastructure components, including the portal, InCommon sign-on, a service registry, Solr indexes, and file system and database storage for the collections. Staff are also working on infrastructure for user-created collections and experimenting with text-mining techniques for improving descriptive metadata across the collections. Finally, a set of 60,000 rules developed with the aid of domain experts is being applied to correct OCR errors across the collection.
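To make the rule-based OCR correction concrete, here is a minimal sketch of applying a rule table to page text. The report does not describe the form of the 60,000 rules; this example assumes they are whole-word substitutions, and the three rules shown are invented for illustration.

```python
import re

# Hypothetical rules (misrecognized word -> corrected word). The real rule
# set and its format are not described in this report.
EXAMPLE_RULES = {
    "tbe": "the",
    "aud": "and",
    "wbich": "which",
}

WORD = re.compile(r"\w+")


def correct_ocr(text: str, rules: dict[str, str]) -> str:
    """Replace every whole word that has an entry in `rules`.

    Each word costs one dictionary lookup, so the running time depends on
    the length of the text rather than on the number of rules.
    """
    return WORD.sub(lambda match: rules.get(match.group(0), match.group(0)), text)


if __name__ == "__main__":
    print(correct_ocr("tbe cat aud tbe dog", EXAMPLE_RULES))
```

Matching whole words only keeps a rule such as "aud" from rewriting the inside of "audit". The sketch is case-sensitive, so "Tbe" would need its own rule or a case-folding step.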

  • Implement SEASR in the HTRC portal.

Progress: HTRC staff demonstrated SEASR running against a small HTRC collection at the Digital Humanities Conference in June 2011, using 50,000 volumes from the Indiana University collection in the HathiTrust. The content was prepared for this use by flattening the internal HathiTrust pairtree directory structure and converting bibliographic data to RIS format (http://www.refman.com/support/risformat_intro.asp). Future work includes integrating SEASR into the HTRC portal infrastructure by supporting InCommon identity and access management. Other work in progress includes scaling to access a large remote data collection and ensuring algorithm integrity against the copyrighted collection, particularly given users' ability to rewire workflows at will.
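The two preparation steps mentioned above can be sketched as follows. This is an illustration, not the script HTRC used. It assumes volume identifiers of the form `namespace.local_id`, a repository layout of `namespace/pairtree_root/.../cleaned_id`, and a hypothetical record dictionary with `title`, `authors`, and `year` keys. The identifier cleaning follows the Pairtree specification, and the RIS tags used (TY, TI, AU, PY, ER) are standard ones.

```python
from pathlib import PurePosixPath

# Characters the Pairtree specification hex-encodes as ^hh.
_HEX_ENCODE = set('"*+,<=>?\\^|')
# Single-character swaps applied after hex encoding.
_SWAP = {"/": "=", ":": "+", ".": ","}


def clean_id(local_id: str) -> str:
    """Make an identifier safe to use as file system path components."""
    out = []
    for byte in local_id.encode("utf-8"):
        ch = chr(byte)
        if byte < 0x21 or byte > 0x7E or ch in _HEX_ENCODE:
            out.append(f"^{byte:02x}")
        else:
            out.append(_SWAP.get(ch, ch))
    return "".join(out)


def pairtree_path(volume_id: str) -> PurePosixPath:
    """Nested pairtree directory for a volume (assumed repository layout)."""
    namespace, _, local_id = volume_id.partition(".")
    cleaned = clean_id(local_id)
    pairs = [cleaned[i:i + 2] for i in range(0, len(cleaned), 2)]
    return PurePosixPath(namespace, "pairtree_root", *pairs, cleaned)


def flat_name(volume_id: str) -> str:
    """Single directory name that could replace the nested path when flattening."""
    namespace, _, local_id = volume_id.partition(".")
    return f"{namespace}.{clean_id(local_id)}"


def to_ris(record: dict) -> str:
    """Render a hypothetical bibliographic record as one RIS book entry."""
    lines = ["TY  - BOOK"]
    if "title" in record:
        lines.append(f"TI  - {record['title']}")
    for author in record.get("authors", []):
        lines.append(f"AU  - {author}")
    if "year" in record:
        lines.append(f"PY  - {record['year']}")
    lines.append("ER  - ")
    return "\n".join(lines)


if __name__ == "__main__":
    # Illustrative identifier, not taken from the report.
    print(pairtree_path("inu.30000011384484"))
    print(flat_name("inu.30000011384484"))
    print(to_ris({"title": "Example Title", "authors": ["Doe, Jane"], "year": 1901}))
```

Flattening here means copying each volume from its deep pairtree location into one directory named by `flat_name`, so a tool that expects a flat folder of volumes does not have to walk the nested tree.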

  • Perform a risk and security analysis and begin development of security infrastructure and procedures.

Progress: HTRC has chosen the InCommon framework for trustworthy shared management of access to on-line resources. Researchers have single sign-on convenience using their existing credentials at their host organization, which eliminates the need to create additional accounts. InCommon uses Shibboleth or other SAML-compliant software to exchange attributes with partners, providing only the information necessary for authentication and authorization. The InCommon Federation provides the policy and technical framework that makes all of this possible. As of a recent count, all but 15 of the members of HathiTrust are members of the InCommon Federation. We anticipate that membership will grow to 100% of HathiTrust members.
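The idea of releasing only the attributes a service needs can be illustrated with a small sketch. It does not use Shibboleth or any SAML library. The attribute names come from the eduPerson schema, but the set a portal would request, the sample record, and the function name are assumptions for the example.

```python
# Hypothetical set of attributes a service provider asks for.
REQUIRED_ATTRIBUTES = frozenset({
    "eduPersonPrincipalName",
    "eduPersonScopedAffiliation",
})


def release_attributes(user_record: dict, required=REQUIRED_ATTRIBUTES) -> dict:
    """Return only the attributes the service needs, dropping everything else."""
    return {name: value for name, value in user_record.items() if name in required}


if __name__ == "__main__":
    # Invented directory entry held by a home institution.
    record = {
        "eduPersonPrincipalName": "jdoe@example.edu",
        "eduPersonScopedAffiliation": "faculty@example.edu",
        "mail": "jane.doe@example.edu",
        "displayName": "Jane Doe",
    }
    print(release_attributes(record))
```

In this sketch the home institution keeps the full record, and the service receives only a scoped identifier and an affiliation, which is enough to decide whether the user may log in.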

Public Engagement

  • Publicize HTRC and engage potential users.

Progress: An official kick-off of the HTRC was held at the Digital Humanities Conference in Palo Alto, CA on June 20, 2011. The HathiTrust Research Center team has also given seven presentations to various other groups and conferences.

  • Co-sponsor proposals with potential users of HTRC.

Progress: HTRC has co-sponsored three grant proposals with institutions inside and outside the HathiTrust partnership community.

  • Develop relationships with other Digital Humanities/Social Science initiatives.

Progress: HTRC has met with Project Bamboo on multiple occasions, and discussions are continuing.

Sustainability

  • Develop and implement a sustainability plan.

Progress: The Alfred P. Sloan Foundation award is a step towards sustainability. We are working on a long-term sustainability plan.