11 July 2008
This is the fourth regular update on activities in HathiTrust, previously referred to as the Shared Digital Repository (SDR). These updates are distributed monthly, typically on the 2nd Friday of the month, and provide a variety of information about the general health of the repository and updates on the development of HathiTrust. Each update will be sent via e-mail to the Library Director and CIO at each participating institution. We will soon release a website for the initiative, and will post all updates on that site. We plan to make an RSS feed for the updates available in order to share the information as broadly as possible.
Throughout this update, where a work item relates to the draft Short-Term and Long-Term Functional Objectives (being articulated by the CIC’s SDR committee), we note the relevant Objectives in parentheses. We plan to restructure future updates to provide specific reports on the CIC’s short-term and long-term functional objectives.
- The executive management committee of HathiTrust meets monthly and continues to work on a variety of issues ranging from HathiTrust finances to development priorities. The first meeting of the Operational Advisory Board took place in June. The agenda focused on a review of the CIC Steering Committee’s Short- and Long-Term Functional Objectives and, where appropriate, status reports. It was agreed that some of these items would be best addressed by CIC collaborations, while others are the responsibility of the centrally funded effort. The CIC will soon convene a committee to help better define the objective of creating a Public Interface for HathiTrust.
- We continue to have productive conversations with several other institutions about possible participation in HathiTrust and hope to report on our progress in future updates.
Growth of HathiTrust
As of July 1st, HathiTrust contains:
- 1,273,784 volumes
- 908,161 titles
- Approximately 446 million pages
- 234,583 individual volumes in the public domain (approximately 18% of total)
We have completed a draft response to the required elements in the “Trustworthy Repositories Audit & Certification (TRAC): Criteria and Checklist” and are currently reviewing the draft for public release in July or early August.
As mentioned in an earlier update, we coordinated a site visit by a team from the Digital Repository Audit Method Based on Risk Assessment (DRAMBORA) effort in the European Union. Their report, an extremely favorable review of the repository, should be released publicly soon. (CIC SDR Short-Term Functional Objectives)
- Basic hardware deployment: IU and Michigan staff continue to work on site preparations for our redundant site in Indianapolis. We expect to deploy this site in summer.
- Deployment issues: The bundling of page images and text files into a single file per volume is complete, and we are currently reorganizing our file system to a more efficient and scalable layout in preparation for synchronization.
- Ingesting Wisconsin content: Routine processes are in place for ingesting Wisconsin bibliographic data and linking it to the digital content once that content is received. We discovered a problem with image metadata in JPEG 2000 files provided by Google, and this has caused us to halt ingest. Discussions with Google about the problem are in progress, and we expect to resume ingest in July. We have preliminarily loaded approximately 10,000 Wisconsin volumes.
- Ingest infrastructure: We have installed and configured additional server hardware dedicated to the task of ingest processing and validation. The additional capacity will accommodate the projected demands of CIC content.
- Large-scale search: Lucene and Solr have been installed on development and production servers, and testing is beginning on indexing large amounts of text, starting with all volumes in the public domain. (CIC SDR Long-Term Functional Objectives)
- Institution-specific pageturner: An updated version of the pageturner, which includes support for the Collection Builder functions, is now in production. We will deliver XSL and CSS to Wisconsin in July for their modifications. (CIC SDR Short-Term Functional Objectives)
- Services for visually-disabled users: We released the new interface for visually impaired users (optimized for use with JAWS and other screen readers) in May. That version presents the user with the entire text of a volume, along with navigation, on one screen. We are working with two University of Michigan School of Information interns over the summer to optimize this interface for use with screen readers, as well as to improve the general accessibility of the pageturner. (CIC SDR Short-Term Functional Objectives)
- Fedora programmer: We continue to search for a programmer to aid in implementing Fedora in conjunction with HathiTrust. The position will soon be reposted as a more general system-engineering job, with Fedora testing as its first project focus. This job posting will be announced in early July. (CIC SDR Long-Term Functional Objectives)
- Collection Builder: The Collection Builder is now in production. An online survey is being used to collect feedback on this new functionality. (CIC SDR Short-Term Functional Objectives)
- API development: We have made available to participating partner institutions a proof-of-concept interface allowing simple lookup of handle and rights information (similar to the functionality of the Google API). Chicago, Wisconsin, and Northwestern have all expressed interest in testing the API. They are currently evaluating this preliminary API and will be providing input on additions/changes to functionality. (CIC SDR Long-Term Functional Objectives)
- Distributing bibliographic information about the contents of HathiTrust: Multiple design sessions for the initial mechanism have taken place, and we are now implementing the results. The mechanism will make available a tab-delimited file, one line per volume, containing the volume’s ID number, standard numbers found in the record, the record number from the originating institution, Michigan’s record number, enumeration and chronology data, rights data, title (245), and imprint (260). We will begin with a full file and then make available daily updates; we will also make available a new, cumulated file on a monthly basis. We will distribute information on obtaining and using the file shortly. In addition, records describing the public domain materials in HathiTrust continue to be available via OAI. (CIC SDR Short-Term Functional Objectives)
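For partners planning to consume the tab-delimited bibliographic file described above, the following Python sketch shows one way such a file might be read. It is illustrative only: the column order, the field names, the "pd" rights value, and the sample rows are all assumptions made for this example, since the final file layout has not yet been distributed.

```python
import csv
import io
from typing import Iterator, NamedTuple, TextIO


class VolumeRecord(NamedTuple):
    # Field names and their order are assumptions for illustration;
    # the actual layout will be defined when the file is released.
    volume_id: str
    standard_numbers: str
    source_record_number: str    # record number from the originating institution
    michigan_record_number: str
    enum_chron: str              # enumeration and chronology data
    rights: str
    title: str                   # MARC 245
    imprint: str                 # MARC 260


def read_volume_file(stream: TextIO) -> Iterator[VolumeRecord]:
    """Yield one VolumeRecord per well-formed line of a tab-delimited stream."""
    reader = csv.reader(stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) != len(VolumeRecord._fields):
            continue  # skip lines that do not have the expected column count
        yield VolumeRecord(*row)


# Made-up sample data: one complete line and one truncated line.
SAMPLE = (
    "vol-0001\t12345678\tsrc-1\tmi-1\tv.1 1900\tpd\t"
    "An Example Title\tPlace : Publisher, 1900\n"
    "vol-0002\tincomplete line\n"
)

records = list(read_volume_file(io.StringIO(SAMPLE)))
# "pd" is a hypothetical code standing in for a public-domain rights value.
public_domain = [r for r in records if r.rights == "pd"]
```

When reading from disk rather than from a string, opening the file with `newline=""` and an explicit encoding is the usual practice with the `csv` module. A daily update file could be processed by the same function and merged into a local copy keyed on the volume ID.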
Forecasting July development
- A HathiTrust.org web site for information about the initiative, including updates and technical information.
- Initial release of bibliographic distribution mechanisms.
- Testing of XSLT and CSS for Wisconsin version of the institution-specific pageturner.
- Large-scale loading of Wisconsin data.
- Work on scaling analysis of large-scale search.
Status/availability of HathiTrust
We schedule system maintenance work that requires a system outage during time windows (in Eastern time) when academic user activity is generally lowest:
- For major work, Friday evenings (8pm-1am) and Sunday mornings (5am-10am);
- For minor work, weekdays from 6:30am-8am.
Notice of scheduled outages is given on a business day, at least 24 hours in advance. Notice of unscheduled outages is given upon discovery, and additional updates are given as appropriate.
Please contact Phyllis White (pmwhite at umich.edu) with email addresses of individuals or groups that should be added to our system outage mailing list.
There were no interruptions in service in June.
At this time, the following outages are scheduled:
- July: The brief outage originally anticipated for June (for a minor storage system software upgrade) will be scheduled for a date in July.
- August: No outages are planned at this time.