8 August 2008
This is the fifth regular update on activities in the HathiTrust, previously referred to as the Shared Digital Repository (SDR). These updates are distributed monthly, typically on the 2nd Friday of the month, and provide a variety of information about the general health of the repository and updates on the development of the HathiTrust. Each update will be sent via e-mail to the Library Director and CIO at each participating institution. We will soon release a website for the HathiTrust initiative, and will post all updates on that site. We plan to make an RSS feed for the updates available in order to share the information as broadly as possible.
Throughout this update, we note where a work item relates to the draft Short-Term and Long-Term Functional Objectives (being articulated by the CIC’s SDR committee). We plan to restructure future updates to pair the report on activities with a review of work on those Objectives.
Growth of the HathiTrust
As of August 1st, the HathiTrust contains:
- 1,475,755 volumes
- 1,032,641 titles
- Approximately 517 million pages
- 269,317 individual volumes in the public domain (approximately 18% of total)
We have completed a draft response to the required elements in the “Trustworthy Repositories Audit & Certification (TRAC): Criteria and Checklist.” The draft will be published online as part of the release of the HathiTrust website. We have had preliminary discussions with the Center for Research Libraries about the possibility of a formal review of the HathiTrust repository.
As mentioned in an earlier update, we coordinated a site visit by a team from the Digital Repository Audit Method Based on Risk Assessment (DRAMBORA) effort in the European Union. The DRAMBORA team plans to make their report public. (CIC SDR Short-Term Functional Objectives)
- Basic hardware deployment: New equipment has been ordered and may arrive in Indianapolis by early August. If it arrives on schedule, Michigan staff will travel to IU on August 12-13 to work with IU staff on installation and to make arrangements for maintenance. Shipment of the second instance of storage from Michigan to Indiana will take place after synchronization testing has been completed in Michigan.
- Deployment issues: Development work on reorganizing our file system into a more efficient and scalable layout, in preparation for synchronization testing, is in progress and should be complete in early August.
- Ingesting Wisconsin content: Routine processes for bibliographic data ingest and linking to digital content are in place for Wisconsin data, and we have loaded 127,120 records. We loaded roughly 100 Wisconsin volumes before encountering validation problems with image metadata in JPEG 2000 files provided by Google. Discussions with Google on the problem are in progress, and we expect to resume ingest shortly.
Note: Last month we reported having ingested nearly 10,000 volumes of Wisconsin content. This was an error in data analysis that is related to duplication of volumes being digitized (i.e., from Michigan’s and Wisconsin’s collections). Working through this problem has helped us to refine our data gathering mechanisms.
- Ingest infrastructure: Recently installed hardware is being used to increase the rate of ingest. More than 200,000 volumes were added to the repository in July.
- Large-scale search: Beginning in August and using existing hardware, we will index 200,000 volumes and then add increments of 100,000 volumes. We will conduct tests at each breakpoint and collect data on response time. We will also begin research to determine whether replication or sharding is the better method of splitting the index, and explore which facets to include in the index and how to configure them for optimal performance. These two elements of analysis will help determine whether additional hardware might make large-scale full-text searching possible and, if so, what types. As part of our work, we may wish to engage a consultant such as Sematext for guidance on the best options.
- Institution-specific pageturner: An updated version of the pageturner, which includes support for the Collection Builder functions, is now in production. After consultation with the Operational Advisory Board, a new design (with branding elements) was created and will be released in September. (CIC SDR Short-Term Functional Objectives)
- Services for disabled users: Usability testing continues on the new interface for visually impaired users (optimized for use with JAWS and other screen readers). The current version presents the user with the entire text of a volume, along with navigation, on one screen. Two University of Michigan School of Information interns are working over the summer to optimize this interface for use with screen readers and to improve the general accessibility of the pageturner. (CIC SDR Short-Term Functional Objectives)
- Fedora programmer: The programmer position was reposted in mid-July as a more general system-engineering job with an initial focus on work with Fedora, and we are receiving applications. (CIC SDR Long-Term Functional Objectives)
- Collection Builder: The Collection Builder is now in production at http://sdr.lib.umich.edu/cgi/mb. An online survey is being used to collect feedback on this new functionality. (CIC SDR Short-Term Functional Objectives)
- API development: HathiTrust now provides an API, much like the Google API, that allows an institution to pass an identifier and receive information about an item's availability, its URL, and access privileges (e.g., that the item is in the public domain). Documentation on the use of this API will be available on the HathiTrust website. (CIC SDR Long-Term Functional Objectives)
- Initial release of bibliographic data distribution mechanisms: Metadata identifying materials in the HathiTrust repository are now available for download as tab-delimited files. Please contact Tim Prettyman (timothy at umich.edu) for early release access to the files. The files include a small number of bibliographic elements to help an institution decide which records it wants to retrieve. More information will be available on the HathiTrust website. (CIC SDR Short-Term Functional Objectives)
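For institutions planning to work with the tab-delimited metadata files, the following is a minimal sketch of how such a file might be read and filtered. The column names (`volume_id`, `title`, `rights`) and the `pd` rights value are hypothetical, chosen only for illustration; this update does not specify the actual file layout, so consult the forthcoming documentation before relying on any of them.

```python
import csv
import io

# Hypothetical sample data: the real column names and values are not
# specified in this update and will likely differ.
SAMPLE = (
    "volume_id\ttitle\trights\n"
    "vol-0001\tExample Title One\tpd\n"
    "vol-0002\tExample Title Two\tic\n"
)


def read_records(handle):
    """Return a list of dicts, one per row of a tab-delimited file with a header row."""
    return list(csv.DictReader(handle, delimiter="\t"))


def public_domain_ids(records, rights_field="rights", pd_value="pd"):
    """Return the identifiers of records whose rights field marks public domain.

    The field name and value are assumptions made for this sketch.
    """
    return [r["volume_id"] for r in records if r[rights_field] == pd_value]


records = read_records(io.StringIO(SAMPLE))
print(public_domain_ids(records))
```

For a downloaded file, the same function can take a handle opened with `open(path, newline="", encoding="utf-8")` in place of the in-memory sample; the encoding is likewise an assumption.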
Forecasting August development
- Benchmarking of large-scale SOLR indexing performance.
- Pending equipment ship dates, initial installation of hardware in Indianapolis.
- Beginning tests of synchronization of storage instances.
- Reinitiating ingest of Wisconsin content.
- The executive management committee of the HathiTrust meets monthly and continues to work on a variety of issues ranging from HathiTrust finances to development priorities. The first meeting of the Operational Advisory Board took place in June. The agenda focused on a review of the CIC Steering Committee’s Short- and Long-Term Functional Objectives and, where appropriate, status reports. It was agreed that some of these items would be best addressed by CIC collaborations, while others are the responsibility of the centrally-funded effort. The CIC will soon convene a committee to help better define the objective to create a Public Interface for the HathiTrust.
- We continue to have productive conversations with several other institutions about possible participation in the HathiTrust and hope to provide information on our progress in this regard in future updates.
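The large-scale search benchmarking planned for August (index 200,000 volumes, then add increments of 100,000 and measure response time at each breakpoint) can be outlined roughly as follows. This is an illustrative sketch only, not our test harness: the stopping point of 600,000 volumes, the `run_query` callable, and the sample queries are all assumptions introduced here.

```python
import time


def breakpoints(first=200_000, step=100_000, limit=600_000):
    """Index sizes at which to test: 200,000 volumes, then +100,000 each time.

    `limit` is an assumed stopping point; the plan does not name one.
    """
    return list(range(first, limit + 1, step))


def mean_response_time(run_query, queries):
    """Return the mean wall-clock seconds per query for the given callable."""
    start = time.perf_counter()
    for query in queries:
        run_query(query)
    return (time.perf_counter() - start) / len(queries)


def fake_search(query):
    """Stand-in for a real search call against the index (hypothetical)."""
    return [word for word in query.split() if word]


for size in breakpoints():
    # In a real run, the index would be grown to `size` volumes before timing.
    seconds = mean_response_time(fake_search, ["civil war", "botany illustrations"])
    print(f"{size} volumes: {seconds:.6f} s per query")
```

Recording one mean response time per breakpoint gives a simple curve from which to judge whether performance degrades gradually or sharply as the index grows.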
Status/availability of the HathiTrust
We schedule system maintenance work that requires a system outage during time windows (in Eastern time) when academic user activity is generally lowest:
- For major work, Friday evenings (8pm-1am) and Sunday mornings (5am-10am);
- For minor work, weekdays from 6:30am-8am.
Advance notice for scheduled outages is given on business days and at least 24 hours in advance. Notice of unscheduled outages is given upon discovery, and additional updates are given as appropriate.
Please contact Phyllis White (pmwhite at umich.edu) with email addresses of individuals or groups that should be added to our system outage mailing list.
Service was unavailable on Thursday July 31 from 7:00-7:30am EDT for a storage system software upgrade.
At this time, the following outages are scheduled:
- August: No outages are planned at this time.
- September: No outages are planned at this time.