Update on March 2009 Activities
April 10, 2009
- General News
- Datasets – Sample datasets containing the OCR of volumes in HathiTrust are now available. These datasets are provided in the same directory structure and format as they are stored in the repository. They are intended to give researchers the opportunity to develop routines that can be run later on larger portions of the corpus. Interested parties should contact email@example.com with a description of the research they intend to conduct. More information is available at http://www.hathitrust.org/hathitrust_datasets.
- Coordination between UM and UC Staff – Collaboration ramped up significantly between teams at the University of Michigan and the University of California in March, in preparation for ingest of content from the University of California. Weekly conference calls sped the teams’ progress through a checklist of ingest items, including coordination of bibliographic information, inclusion of coordinate data for OCR files, and reporting on ingested volumes.
- Ingest from Indiana University – Bibliographic metadata from Indiana University has been received at the University of Michigan and is being loaded into local systems. Once the metadata is loaded, ingest of content will begin.
- HathiTrust growth – Ingest rates decreased in March, with just under 130,000 volumes entering the repository. As in previous months, this decrease reflects the fact that ingest rates are matching the output of digital content from the University of Michigan and the University of Wisconsin. When ingest of content from the University of California and Indiana University begins (projected for April), ingest rates will rise closer to our planned capacity of 500,000 volumes per month.
- Deployment Status
- Establishing Indiana Mirror Site – Deployment of indexing and access systems on the Indiana University repository instance was completed in March. The repository is now a fully functioning mirror of the site at the University of Michigan with load balancing and fail-over.
- Development Update
- Storage – The partners purchased additional storage for the Michigan and Indiana sites in March. The new storage will be installed in April and May, respectively, bringing both environments to approximately 320TB of capacity.
- Large-scale Search – We are using the results of large-scale search testing done so far to develop a hardware configuration for production Solr infrastructure. Investigations continue into software solutions for improving response times for slow queries.
- Data API – The first draft of a functional specification for the HathiTrust Data API is complete and has been made available publicly on the HathiTrust website for feedback (http://www.hathitrust.org/hathitrust_data_api). Work on the implementation of this specification is underway and will continue in parallel as feedback is received.
- Replication – Additional work was done to support replication of full-text indexing on the repository instance in Indiana.
- Public Discovery Interface – Initial development of the temporary beta catalog for HathiTrust is nearly complete, and the catalog will be released within the next several weeks. It will provide bibliographic search and faceted browse of all volumes in HathiTrust, integrating with the HathiTrust Page Turner to provide access to individual items. Integration with the Collection Builder application will be completed in a second phase of development.
- 129,819 new volumes were added in March.
- As of April 1st, the repository contained a total of 2,780,007 volumes.
- 30,758 public domain volumes were added in March, bringing the total number of public domain volumes to 433,641 (15% of the total content).
- Ingest of Wisconsin materials continued. As of April 1, 2009, HathiTrust contained 168,098 Wisconsin volumes.
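As a quick check on the figures above, the public domain share can be recomputed from the reported totals. This is a minimal sketch: the variable and function names are illustrative and not part of any HathiTrust tooling, and the end-of-February total assumes that no volumes were removed during March.

```python
# Figures copied from the March 2009 statistics above.
total_volumes = 2_780_007          # repository total as of April 1
public_domain_volumes = 433_641    # public domain total as of April 1
new_volumes = 129_819              # volumes added in March


def share(part: int, whole: int) -> float:
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


public_domain_share = share(public_domain_volumes, total_volumes)

# Assumption: nothing was withdrawn in March, so the earlier total is a
# simple difference.
total_before_march = total_volumes - new_volumes

print(f"Public domain share of repository: {public_domain_share:.1f}%")
print(f"Estimated total at end of February: {total_before_march:,}")
```

The computed share is closer to 15.6%, so the 15% reported above appears to be rounded down.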
- Forecast for April development
- Continue to investigate ways to improve performance for slow queries in large-scale search.
- Continue work on the HathiTrust Data API specification and gather input from a broader audience.
- Continue coding the initial Data API implementation.
- Complete initial development of the temporary public beta catalog for HathiTrust.
PLEASE NOTE: Please contact Chris Butchart-Bailey (chrisbu at umich.edu) with email addresses of individuals or groups that should be added to our system outage mailing list.
We schedule system maintenance work that requires a system outage during time windows (in Eastern time) when academic user activity is generally lowest:
- For major work, Friday evenings (8pm-1am) and Sunday mornings (5am-10am);
- For minor work, weekdays from 6:30am-8am.
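The windows above can be expressed as a small check, for example to confirm that a proposed outage time falls inside one of them. This is only a sketch under stated assumptions: the function name `maintenance_window` is hypothetical, the input is assumed to be a naive datetime already in Eastern time, and each window is treated as including its start time and excluding its end time.

```python
from datetime import datetime, time
from typing import Optional


def maintenance_window(local: datetime) -> Optional[str]:
    """Classify a naive Eastern-time datetime as 'major', 'minor', or None."""
    day = local.weekday()  # Monday=0 ... Sunday=6
    t = local.time()

    # Major work: Friday 8pm through 1am, which crosses midnight into Saturday.
    if day == 4 and t >= time(20, 0):
        return "major"
    if day == 5 and t < time(1, 0):
        return "major"

    # Major work: Sunday 5am-10am.
    if day == 6 and time(5, 0) <= t < time(10, 0):
        return "major"

    # Minor work: weekdays 6:30am-8am.
    if day <= 4 and time(6, 30) <= t < time(8, 0):
        return "minor"

    return None


# Hypothetical usage with dates from March 2009.
print(maintenance_window(datetime(2009, 3, 3, 7, 0)))   # a Tuesday morning
print(maintenance_window(datetime(2009, 3, 7, 0, 30)))  # just after midnight on Saturday
print(maintenance_window(datetime(2009, 3, 7, 12, 0)))  # Saturday at noon
```

A 7:00am start on a Tuesday or Thursday, for instance, falls inside the weekday window for minor work.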
Notice of scheduled outages is given on business days, at least 24 hours in advance. Notice of unscheduled outages is given upon discovery, and additional updates are given as appropriate.
- Outages in March: HathiTrust was unavailable on Tuesday, March 3 from 7:00-8:00am EST and on Thursday, March 5 from 7:00-7:45am EST for operating system and database software upgrades.
- Outages planned for April/May: No outages are planned at this time.