Large-Scale Search
An introduction to Large-scale Full-text Search for the HathiTrust Repository is given following the most recent monthly progress reports below. A report detailing results of stages 1 and 2 is available at http://www.hathitrust.org/technical_reports/Large-Scale-Search.pdf. In addition to reviewing the monthly updates, please visit the Large-Scale Search Blog for up-to-date information.
Updates by month
From the May Update:
From the April Update:
From the March Update:
From the February Update:
From the January Update:
Introduction
The ability to discover content in the HathiTrust repository benefits the archive in a variety of ways. The greater the ability of users to find and use content in the repository, the greater their appreciation of what might otherwise be seen as a preservation effort of hypothetical value. In addition, the process of revealing content in the repository also adds a method for ensuring the integrity of the files; use of those files can reveal problems that might go undetected in a dark archive. While we can facilitate basic discovery through bibliographic searches, deeper discovery through full-text searches across the entire repository provides even greater benefits.
Research into searching a body of content this large is still in its infancy (the repository is approaching one billion pages), and few clear strategies for accomplishing such tasks exist. The major large-scale open source search engine, Lucene, does not provide benchmarking information for data sets this large, and Solr, the most widely deployed implementation of Lucene, has only recently begun gathering benchmarking data. We embark on trying to solve this problem with only general guidance on strategies.
Research programmers in the University of Michigan’s Digital Library Production Service have undertaken a process to generate benchmarking data to help shape our strategies. After a preliminary investigation of options, they chose to use Solr and they engaged the Solr development community in helping to define paths. One feature of Solr is its ability to scale searches across very large bodies of content through its use of distributed searching and “shards.” When an index becomes too large to fit on a single system, or when a single query takes too long to execute, an index can be split into multiple shards, and Solr can query and merge results across those shards. Although the size of our data clearly points to the need for shards, there are many other variables in designing a successful approach, one that scales to large amounts of data and provides meaningful results. This introduction summarizes the strategy we are taking. A report on ongoing activities is available here.
One final challenge in this process is assembling a suite of test searches. In initial testing, we quickly learned that our first suite of queries would not provide us with a typical range of performance results. Typical queries are likely to vary significantly, depending on a vast number of variables. We continue to refine our approach to defining a query set, a process likely to change shape as we deploy a search mechanism for end users.
Overview of Strategy
We have attempted to define the variables that have the greatest impact on large-scale searching. We have also tried to stage our benchmarking process so that we start with the simplest approach and introduce each new variable only after collecting benchmarking data on the previous instantiation of the index and environment. Our stages are as follows:
- Stage 1: Growing the index Complete: As the index gets larger, we expect to learn about the size of the index relative to the body of content, time to index with a growing body of content, and degradation of search performance as the amount of content increases. In order to gain a clear sense of the way that these phenomena take place, we are conducting tests on indexes in 100,000 volume steps, from 100,000 to 1,000,000 (with additional increments at 10,000 and 50,000).
- Stage 2: Impact of memory Complete: Increasing physical memory and changing JVM configurations will also influence performance. We will increase physical memory from 4GB to 8GB and test several JVM configuration changes in combination with the test suite and each of these index sizes.
- Stage 3: Using shards Complete: We will introduce shards in the approach, employing multiple shards with the 8GB memory and optimal JVM configuration. We will test the suite of searches with one shard on each of two physical servers. We will then test the suite of searches with one shard on each of two virtual servers on each physical server (i.e., four shards). Benchmarking data will be gathered for all of the index steps.
- Stage 4: Load testing: We will introduce load testing for the single shard and multi-shard approaches, attempting to see what impact a large number of users will have on performance.
- Stage 5: Faceting results: Full-text searching of a large number of documents will undoubtedly lead to the retrieval of a large numbers of results, and thus usability problems. One obvious strategy for improving navigation of large numbers of results is the use of faceted displays from associated bibliographic data. We will add relatively full bibliographic records to each of the documents and repeat the testing process with a faceted display of results from the bibliographic data.

