Large-Scale Search

An introduction to Large-scale Full-text Search for the HathiTrust Repository is given following the most recent monthly progress reports below. A report detailing results of stages 1 and 2 is available at http://www.hathitrust.org/technical_reports/Large-Scale-Search.pdf. In addition to reviewing the monthly updates, please visit the Large-Scale Search Blog for up-to-date information.

Updates by month

From the November Update:

On November 19, HathiTrust launched a new service enabling full-text search of all volumes in the repository. Indexing of newly ingested volumes is ongoing, but the release of the first production index (containing approximately 4.6 million volumes) is the culmination of more than a year of research and benchmark testing conducted by staff at the University of Michigan. This new service dramatically changes the way researchers are able to use our collections and, along with the release of the bibliographic catalog in May, demonstrates HathiTrust’s commitment to providing sophisticated ways of accessing and using collections preserved in the digital repository. The official news release is available at http://www.ns.umich.edu/htdocs/releases/story.php?id=7426. More can be read about large-scale search in the Development Updates section below.

From Development Updates:

The launch of HathiTrust’s large-scale search application was postponed in October in order to acquire additional hardware to accommodate new index growth. Due to a variety of factors including a delay in hardware delivery, staff at the University of Michigan altered their index storage strategy and reconfigured the Solr index servers at Michigan to use the Isilon storage system as a back-end. In addition to solving issues related to the size of the index, moving from existing direct-attached storage to the Isilon network-attached storage more readily accommodates the significant index growth that occurs during routine index optimization. The move to Islion is a temporary strategy, however, and staff at UM will be investigating alternative options for storing the large-scale search index over the long-term.

After the storage reorganization, a small backlog of indexing was completed and a new automatic daily indexing process was developed. The University of Michigan launched the full-text service in mid-November and it is performing well.

With an eye toward achieving full redundancy of the search service, staff at UM implemented a nightly synchronization of the index to the Indiana site. Work toward redundancy is ongoing, however, and will involve further research to determine the optimal size of index shards. The size of index shards will help to determine the optimal number of index servers to deploy to guarantee adequate search performance, as well as the additional server deployments and workflows needed to support continuing testing of the search system, routine indexing, and volume re-indexing. Once complete, additional equipment will be purchased and installed at both the Michigan and Indiana sites as appropriate to establish full redundancy.

From the October Update:

Staff at the University of Michigan successfully indexed all volumes in HathiTrust using the newly acquired hardware. However, the official launch of the large-scale search application was postponed in order to acquire additional hardware to accommodate new index growth. The original estimate of storage requirements turned out to be low once common-grams technology was introduced. Common-grams offer significantly better search performance but result in an increased index size. The very large number of volumes ingested into the repository in October contributed to the immediate need for more indexing space as well. Optimization of the index, a process occurring at regular intervals, requires as much as 3 times the size of the index shard being optimized.

Faceting of search results, a feature supported by Solr, was further explored in October. Faceting requires the addition of bibliographic data to the full-text index. A faceted index was built across two shards to look for potential problems in scaling. Early indications are that performance is only affected slightly with the facets employed.

From the September Update:

In September, the University of Michigan worked to revise and debug production index-building routines to support a comprehensive index of HathiTrust volumes. This index is distributed across five servers with two Solr shards, or index fragments, on each server. In the process of running the routines it was confirmed that Logical Volume Manager (LVM) snapshots could be used effectively to deploy index updates. Concurrent testing of the indexes in the new search environment showed a significant improvement in performance over the current environment, as had been expected. The new full-text search service is targeted for release on October 19. When it is live, the full text of the more than 4 million volumes in HathiTrust will be searchable by anyone with a Web browser. At that time, a new portal interface will replace the current page at http://catalog.hathitrust.org, providing access to full-text search, bibliographic search, and linking to custom collections in the Collection Builder.

With the release of full-text search on the horizon, HathiTrust has begun exploring options for offering faceted browsing of content in conjunction with full-text search. The University of Michigan has built and performed preliminary testing on an index of 500,000 volumes that includes metadata suitable for faceting of search results. The tests suggest that the impact of faceting on full-text search performance will be tolerable in the new environment.

Principal developers for the open source Solr software integrated Michigan’s contribution of common-grams code into the Solr code base. It is now a permanent feature of Solr and, of course, the HathiTrust indexing process.

From the August Update:

After additional search performance testing in August, an improved index configuration was established by staff at the University of Michigan using a punctuation filter and a list of 400 common words (see blog post for details: http://www.hathitrust.org/blogs/large-scale-search/tuning-search-perform...). This index configuration will be put into production on the new dedicated server hardware, which was installed in August. Michigan also completed additions to the indexing control software (SLIP) to support distribution of indexing across several servers, each with multiple Solr index shards. A continuous indexing strategy for this distributed system and corresponding requirements for storage configuration and scripting has been implemented, and the first indexing tests will have begun by the time this report is published.

From the July Update:

University of Michigan staff investigated the indexing problems with the beta large-scale search that were reported in the last update. The problems were due to a shortage of available memory. However, a decision was taken to wait for new hardware to be deployed before taking further action. The new hardware, purchased in June to support large-scale search, was received in July, and is currently being prepared for testing and use. With the new hardware in place, it is planned to have full text search of all volumes in HathiTrust by October 1st.

UM staff made refinements to the custom punctuation filter for large scale search, and ran tests only to discover the filter did not provide the performance boost anticipated. The punctuation filter has been set aside temporarily, but has potential for future implementation. Tests conducted by staff to compare response times for common-grams Solr indexes in various configurations resulted in a new emphasis being placed on the importance of a well-tuned list of common words. A new program that evaluates the total number of term occurrences for the most frequently occurring words in an index was created to aid in the selection of common words for this list. Additional details can be found on the HathiTrust Large Scale Search Blog (http://www.hathitrust.org/blogs/large-scale-search/). Four new posts were added to the blog in July.

From the June Update:

University of Michigan staff ordered additional servers to support large-scale search in June, and prepared space for them in the MACC data center in Ann Arbor. UM also continued to explore the use of common-grams in large-scale search with a focus on refining the set of common terms in order to strike a balance between Solr index size and performance. Performance testing that was conducted in the process generated unexpected results that led to the discovery of a bug in a custom Solr punctuation filter. The bug was fixed, and tests will be conducted again in July.

The large-scale search team has also encountered a problem when building full-text indexes for the beta large-scale search (http://babel.hathitrust.org/cgi/ls), in which indexing stops when memory errors are encountered after about a day and a half of indexing. This problem will be investigated further in July.

From the May Update:

A hardware configuration for servers to support large-scale search in HathiTrust was finalized in May, and data center space is nearly ready for use, pending installation of network gear. An order for the servers will be placed in early June for a planned deployment in July. Deployment of common-grams phrase searching in HathiTrust’s experimental full-text beta search (http://babel.hathitrust.org/cgi/ls) was delayed in May, but was completed as of press time for this newsletter. In search benchmarking tests, the use of common-grams in phrase searching reduced average query response over a representative set of queries by more than 85%. The response time for the slowest query in the set decreased from 2 minutes to only 8 seconds. Up-to-date information on large-scale search benchmarking will be posted on the new large-scale search blog (http://www.hathitrust.org/blogs/large-scale-search).

From the April Update:

Michigan and California developers shared experiences and ideas in a fruitful discussion about Lucene-based search engines, XTF and Solr. Investigations into software solutions for improving response times for slow queries led us to add common-gram indexing and searching capabilities to Solr, significantly improving performance of slow phrase queries. Common-grams increase index size, but the difference so far seems to be manageable and worthwhile. We are continuing to refine a hardware configuration to use for Solr servers based on discussions of indexing workflows and continuing research of different indexing algorithms, which have an impact on storage requirements.

From the March Update:

We are using the results of large-scale search testing done so far to develop a hardware configuration for production Solr infrastructure. Investigations continue into software solutions for improving response times for slow queries.

From the February Update:

Load testing was completed in February, with the exception of tests on smaller indexes to confirm limits on input/output operations per second. Strategies for hardware configuration and acquisition are now being explored.

From the January Update:

January saw our experiments shift to load testing. An array of tests were executed against Solr indexes using JMeter to estimate real world use with 4, 8, and 10 simultaneous users, randomized delays based on times of 50ms, 500ms, 1000ms, and 2000ms, and indexes as large as 1 million documents. The results are being reviewed now.

Introduction

The ability to discover content in the HathiTrust repository benefits the archive in a variety of ways. The greater the ability of users to find and use content in the repository, the greater their appreciation of what might otherwise be seen as a preservation effort of hypothetical value. In addition, the process of revealing content in the repository also adds a method for ensuring the integrity of the files; use of those files can reveal problems that might go undetected in a dark archive. While we can facilitate basic discovery through bibliographic searches, deeper discovery through full-text searches across the entire repository provides even greater benefits.

When we began investigating options for large-scale search over a year ago, research in this area was its infancy and there were few clear strategies for searching a repository the size of HathiTrust. The major large-scale open source search engine, Lucene, did not provide benchmarking information for data sets this large, and Solr, the most widely deployed implementation of Lucene, had only recently begun gathering benchmarking data. We embarked on trying to solve this problem with only general guidance on strategies.

Research programmers in the University of Michigan’s Digital Library Production Service undertook a process to generate benchmarking data to help shape our strategies. After a preliminary investigation of options, they chose to use Solr and they engaged the Solr development community in helping to define paths. One feature of Solr is its ability to scale searches across very large bodies of content through its use of distributed searching and “shards.” When an index becomes too large to fit on a single system, or when a single query takes too long to execute, an index can be split into multiple shards, and Solr can query and merge results across those shards. Although the size of our data clearly points to the need for shards, there are many other variables in designing a successful approach, one that scales to large amounts of data and provides meaningful results. This introduction summarizes the strategy we are taking.

Overview of Strategy

We have attempted to define the variables that have the greatest impact on large-scale searching. We have also tried to stage our benchmarking process so that we start with the simplest approach and introduce each new variable only after collecting benchmarking data on the previous instantiation of the index and environment. Our stages are as follows:

  • Stage 1: Growing the index Complete: As the index gets larger, we expect to learn about the size of the index relative to the body of content, time to index with a growing body of content, and degradation of search performance as the amount of content increases. In order to gain a clear sense of the way that these phenomena take place, we are conducting tests on indexes in 100,000 volume steps, from 100,000 to 1,000,000 (with additional increments at 10,000 and 50,000).
  • Stage 2: Impact of memory Complete: Increasing physical memory and changing JVM configurations will also influence performance. We will increase physical memory from 4GB to 8GB and test several JVM configuration changes in combination with the test suite and each of these index sizes.
  • Stage 3: Using shards Complete: We will introduce shards in the approach, employing multiple shards with the 8GB memory and optimal JVM configuration. We will test the suite of searches with one shard on each of two physical servers. We will then test the suite of searches with one shard on each of two virtual servers on each physical server (i.e., four shards). Benchmarking data will be gathered for all of the index steps.
  • Stage 4: Load testing Complete: We will introduce load testing for the single shard and multi-shard approaches, attempting to see what impact a large number of users will have on performance.
  • Stage 5: Faceting results: Full-text searching of a large number of documents will undoubtedly lead to the retrieval of a large numbers of results, and thus usability problems. One obvious strategy for improving navigation of large numbers of results is the use of faceted displays from associated bibliographic data. We will add relatively full bibliographic records to each of the documents and repeat the testing process with a faceted display of results from the bibliographic data.