October 10, 2008
- HathiTrust web site – The HathiTrust web site, http://www.hathitrust.org/, continues to evolve. This month, we begin providing more detailed reports on specific development initiatives (e.g., large-scale search). We have also added a dynamically updated report of the size of the HathiTrust repository, shown in the sidebar on the opening page of the web site and on the Updates page.
- Deployment Status
- Establishing Indiana mirror site – Our work on configuration of the servers in Indiana now provides us with full remote management capabilities. We plan to have the second instance of storage in place in Indianapolis during the week of October 20th, at which point we will perform the remaining configuration work.
- Directory Organization – In early September, all files in the repository were reorganized using the “Pairtrees for Object Storage” scheme (http://www.ietf.org/internet-drafts/draft-kunze-pairtree-00.txt).
- Data synchronization – With the pairtree reorganization complete, we began testing synchronization between the primary and secondary storage systems, both currently in Ann Arbor. The first full synchronization, of approximately 70TB, completed in less than four days. We will finalize testing of incremental synchronization in early October.
- New storage – Just under 180TB of additional storage was ordered and received, half going to each of the two sites. Adding new storage to the existing system will not disrupt access to the repository. The additional storage will bring the total repository capacity to nearly 190TB at each site. We will install the new storage after the existing secondary storage system is shipped to Indiana. The expansion is timely, as we are already consuming 75TB of the current 100TB capacity and growing rapidly.
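For readers unfamiliar with pairtree, the mapping from an identifier to a directory path, as described in the draft cited under Directory Organization, can be sketched roughly as follows. This is an illustration only, not the code used in the repository: the function names `clean_id` and `pairtree_path` are hypothetical, the `pairtree_root` directory name and the sample identifier are taken from the draft rather than from our own layout, and the character rules reflect our reading of that draft.

```python
# Illustrative sketch of the pairtree identifier-to-path mapping.
# Assumption: rules follow the Pairtree draft; names here are hypothetical.

# Characters the draft says to hex-encode as ^hh (in addition to any byte
# outside the visible ASCII range 0x21-0x7e).
HEX_ENCODE = set('"*+,<=>?\\^|')

# Single-character substitutions applied after hex-encoding.
SINGLE_CHAR_MAP = str.maketrans({"/": "=", ":": "+", ".": ","})


def clean_id(identifier: str) -> str:
    """Return the filesystem-safe form of an identifier."""
    out = []
    for byte in identifier.encode("utf-8"):
        ch = chr(byte)
        if byte < 0x21 or byte > 0x7E or ch in HEX_ENCODE:
            out.append(f"^{byte:02x}")  # e.g. '*' becomes '^2a'
        else:
            out.append(ch)
    return "".join(out).translate(SINGLE_CHAR_MAP)


def pairtree_path(identifier: str, root: str = "pairtree_root") -> str:
    """Split the cleaned identifier into two-character directory names."""
    cleaned = clean_id(identifier)
    pairs = [cleaned[i:i + 2] for i in range(0, len(cleaned), 2)]
    return "/".join([root, *pairs])


if __name__ == "__main__":
    print(pairtree_path("ark:/13030/xt12t3"))
    # prints: pairtree_root/ar/k+/=1/30/30/=x/t1/2t/3
```

Because each directory name is at most two characters long, no single directory grows very large, which is the main attraction of the scheme for a repository of this size.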
- Development Update
- Large-scale Search – Substantial progress was made on a large-scale full-text search strategy. We mapped out an overall strategy, built most of the large indexes (including an index for 700,000 volumes), and performed some preliminary benchmarking of search results. A full discussion of the work to date is online at http://www.hathitrust.org/large_scale_search.
- Future Strategies – We will be embarking on a three-part strategy to facilitate development by, or in collaboration with, partner institutions. This includes:
- Migrating the current page turner application to the use of an API – We hope to report soon on a strategy to reengineer the current page turner application so that it provides access through an API. This API should, in turn, make other functions or modes of access possible.
- Creating a repository development 'sandbox' for shared development – We intend to create a sandbox system that contains representative content from the repository and gives developers access to the functions of the HathiTrust repository system. We hope that the availability of this development sandbox will make it possible for partner institutions to collaborate in creating new services through, for example, new or expanded APIs.
- Supporting a general “public interface” to content – We have initiated a multi-stage strategy to create a “public interface” mechanism, an interface through which digital books and journals in the HathiTrust repository can be discovered and accessed. For more information, please see the report on Short- and Long-term Objectives.
- 295,108 volumes were added in September.
- As of October 1st, the repository contained a total of 1,903,670 volumes.
- 28,898 public domain volumes were added in September, bringing the total number of public domain volumes to 319,594 (17% of the total content).
- Wisconsin Ingest – After resolving Google-related content validation issues in August, we ingested Wisconsin content throughout September. By the end of the month, we had loaded 107,529 Wisconsin volumes into the repository.
Forecast for October development
- Complete content synchronization in early October.
- Install the second instance of storage in Indianapolis in late October.
- Index larger numbers of volumes (e.g., 1 million) with Solr, run query suite for all index increments, and test for different levels of memory. For more information, see http://www.hathitrust.org/large_scale_search.
PLEASE NOTE: We do not yet have contact email addresses for institutions for notification. As the service becomes more widely used, this will be an essential means of communication. Please contact Chris Butchart-Bailey (chrisbu at umich.edu) with email addresses of individuals or groups that should be added to our system outage mailing list.
We schedule system maintenance work that requires a system outage during time windows (in Eastern time) when academic user activity is generally lowest:
- For major work, Friday evenings (8pm-1am) and Sunday mornings (5am-10am);
- For minor work, weekdays from 6:30am-8am.
Notice of scheduled outages is given on a business day, at least 24 hours in advance. Notice of unscheduled outages is given upon discovery, and additional updates are given as appropriate.
- Outages in September: On Thursday, September 18 at approximately 9:30am EDT, HathiTrust became inaccessible due to a software problem on a storage system; the problem was related to our work with data synchronization. Support was contacted and the problem was resolved at 10:45am EDT.
- Outages planned for October/November: A brief outage will be scheduled in October for a storage system software upgrade.