Update on January 2009 Activities
February 13, 2009
- General News
- Public ‘Discovery’ Interface and OCLC collaboration – We have begun active planning discussions with OCLC on the creation of a “catalog” for HathiTrust. Chaired by Lee Konrad (Wisconsin) and John Butler (Minnesota), this group will create specifications for adaptation of WorldCat Local (WCL) for HathiTrust. The deployment of the HathiTrust WCL interface is scheduled for early 2010, with work ongoing throughout 2009.
- HathiTrust growth – Ingest rates remained lower this month (after very fast growth through November 2008), with slightly fewer than 100,000 volumes added to HathiTrust. The rate will increase when we start ingesting content from the University of California, which we hope will begin in March. We will also begin ingesting volumes from Indiana University around the same time, pending resolution of issues by Google.
- Technical Walkthrough with UC Staff – On January 29th, staff from the University of Michigan held a video conference with staff from the University of California (including staff from CDL and UCSD) to review technical aspects of HathiTrust in preparation for ingest of UC content. This is the first of a series of working meetings on HathiTrust technologies.
- Datasets – HathiTrust will soon be making sample datasets of two different sizes available to researchers for computational processing and analysis. The first sample will be available to all researchers through an application process. The second sample will be available to participants in the Digging into Data Challenge.
Sample 1: The first sample will be composed of 5,000 texts, which may be requested in one of three bundles. Texts in all bundles are pre-1923 (pre-1869 for works published outside of the United States) and are as follows:
- A random sample representing 4 character sets and 5 languages (Arabic, English, French, Japanese, and Russian)
- A random sample of English language literary and historical texts
- A random sample of Classics texts, including original language texts and translations
Sample 2 - Digging into Data: A second sample of 50,000 texts will be made available for participants in the Digging into Data Challenge. The corpus represents a mix of dates (as above, all pre-1923, and pre-1869 for materials published outside the United States), countries of origin, languages, character sets, and formats (i.e., some serial literature in a body of mostly monographic literature).
More information about these datasets, as well as specifications of file formats and modes of access, will be posted soon on HathiTrust.org.
- Print-On-Demand – The UM Library’s Scholarly Publishing Office (SPO) is actively engaged in designing a workflow and negotiating business arrangements to create print-on-demand copies of the public domain UM materials from HathiTrust. For the past five years, SPO has been managing a very successful and active reprint program of more than 10,000 titles from the Library’s digitized collection. SPO will leverage that experience and its printing and distribution partnerships to create a robust POD program for HathiTrust. Currently, SPO is in negotiations with Amazon and Hewlett-Packard (and its partner distributors) to design a workflow that will minimize human intervention in the preparation of scans for printing, cover creation, and metadata manipulation. Although currently only University of Michigan content is under discussion, SPO hopes to design a set of services that are of value, in terms of both efficiency and revenue generation, to all HathiTrust partners.
- Deployment Status
- Establishing Indiana Mirror Site – Several unexpected network problems slowed our work on the mirror site, but with the problems now resolved, we have resumed work and expect to complete the full deployment of the mirror site soon. End users will not be able to detect a difference when the mirror goes online: having two operational sites (with load balancing and fail-over) will enable us to avoid the brief outages we have scheduled for routine maintenance.
- Development Update
- Large-scale Search – January saw our experiments shift to load testing. An array of tests was executed against Solr indexes using JMeter to estimate real-world use with 4, 8, and 10 simultaneous users, randomized delays based on times of 50ms, 500ms, 1000ms, and 2000ms, and indexes as large as 1 million documents. The results are being reviewed now.
- API – We have completed a first draft of the functional specification for the HathiTrust Data API. This document is being discussed and revised internally and will be shared with a wider audience for input.
- Replication – Enhancements were made to support replication of full-text indexing on the repository instance in Indiana.
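For readers who want a feel for the size of the load-testing grid described under Large-scale Search, the following is a minimal sketch in Python. It is not the JMeter test plan itself. The user counts and base delay times come from this update; the assumption that every user count was paired with every delay, the function names, and the treatment of the base time as an average pause between requests are ours, for illustration only.

```python
from itertools import product

# Values reported in this update.
USERS = [4, 8, 10]
DELAYS_MS = [50, 500, 1000, 2000]


def test_matrix(users=USERS, delays_ms=DELAYS_MS):
    """Return every (users, delay) pairing; full pairing is an assumption."""
    return [{"users": u, "delay_ms": d} for u, d in product(users, delays_ms)]


def max_request_rate(users, delay_ms):
    """Upper bound on requests per second if responses were instantaneous
    and each user paused delay_ms on average between requests (assumption)."""
    return users * 1000.0 / delay_ms


if __name__ == "__main__":
    for cfg in test_matrix():
        print(cfg, round(max_request_rate(**cfg), 1))
```

Under these assumptions the grid has 12 combinations, ranging from a ceiling of 2 requests per second (4 users, 2000ms) to 200 requests per second (10 users, 50ms). Actual rates would be lower once Solr response times are included.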
- 94,960 new volumes were added in January.
- As of February 1st, the repository contained a total of 2,572,831 volumes.
- 7,448 public domain volumes were added in January, bringing the total number of public domain volumes to 379,533 (15% of the total content).
- Ingest of Wisconsin materials continued. As of February 1, 2009, HathiTrust contained 148,820 Wisconsin volumes.
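The percentages implied by the figures above can be checked with a few lines of Python. This is a small sketch using only the counts reported in this update; the variable and function names are illustrative.

```python
def share(part, whole):
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


# Counts reported in this update (as of February 1, 2009).
TOTAL_VOLUMES = 2_572_831
PUBLIC_DOMAIN_TOTAL = 379_533
NEW_IN_JANUARY = 94_960
PUBLIC_DOMAIN_NEW = 7_448
WISCONSIN_VOLUMES = 148_820

if __name__ == "__main__":
    print(f"Public domain share of repository: {share(PUBLIC_DOMAIN_TOTAL, TOTAL_VOLUMES):.1f}%")
    print(f"Public domain share of January additions: {share(PUBLIC_DOMAIN_NEW, NEW_IN_JANUARY):.1f}%")
    print(f"Wisconsin share of repository: {share(WISCONSIN_VOLUMES, TOTAL_VOLUMES):.1f}%")
```

The first figure works out to about 14.8%, which matches the rounded 15% reported above. Public domain volumes made up about 7.8% of January's additions.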
- Forecast for February development
- Complete web hosting infrastructure work at the Indiana site and put it into active service.
- Begin discussions and planning for adaptations to the ingest system to work with the new version of Google’s delivery system (GRIN) and to accommodate UC content, which is the first body of materials to include OCR coordinate information.
- Large-scale search activity will continue with more load testing, evaluation of results, and planning of next steps. We would also like to discuss the capabilities of XTF with California Digital Library staff. XTF employs bigrams to improve full-text query performance and, like Solr, is built on top of Lucene.
- Work will continue on the HathiTrust Data API specification with
intent to finalize it in February. Coding will begin in February as
PLEASE NOTE: Please contact Chris Butchart-Bailey (chrisbu at umich.edu) with email addresses of individuals or groups that should be added to our system outage mailing list.
We schedule system maintenance work that requires a system outage during time windows (in Eastern time) when academic user activity is generally lowest:
- For major work, Friday evenings (8pm-1am) and Sunday mornings (5am-10am);
- For minor work, weekdays from 6:30am-8am.
Advance notice for scheduled outages is given on business days and at least 24 hours in advance. Notice of unscheduled outages is given upon discovery, and additional updates are given as appropriate.
- Outages in December: On Friday, December 19 at 7:30am EST, HathiTrust was down briefly to apply security updates to a database server. Service was restored at 7:40am EST.
- Outages planned for January/February: A brief outage will be scheduled in January for a storage system software upgrade.