An introduction to HathiTrust's strategy and rationale for providing large-scale full-text search of the digital library is given below. Further information can be found in our monthly updates and the Large-Scale Search Blog.
The ability to discover content in the HathiTrust repository benefits the archive in a variety of ways. The greater the ability of users to find and use content in the repository, the greater their appreciation of what might otherwise be seen as a preservation effort of hypothetical value. In addition, the process of revealing content in the repository also adds a method for ensuring the integrity of the files; use of those files can reveal problems that might go undetected in a dark archive. While we can facilitate basic discovery through bibliographic searches, deeper discovery through full-text searches across the entire repository provides even greater benefits.
When we began investigating options for large-scale search in 2008, research in this area was in its infancy and there were few clear strategies for searching a repository the size of HathiTrust. The major large-scale open source search engine, Lucene, did not provide benchmarking information for data sets this large, and Solr, the most widely deployed implementation of Lucene, had only recently begun gathering benchmarking data. We embarked on trying to solve this problem with only general guidance on strategies. Research programmers in the University of Michigan’s Digital Library Production Service undertook a process to generate benchmarking data to help shape our strategies. After a preliminary investigation of options, they chose to use Solr, and they engaged the Solr development community in helping to define paths.
One feature of Solr is its ability to scale searches across very large bodies of content through its use of distributed searching and “shards.” When an index becomes too large to fit on a single system, or when a single query takes too long to execute, an index can be split into multiple shards, and Solr can query and merge results across those shards. Although the size of our data clearly points to the need for shards, there are many other variables in designing a successful approach, one that scales to large amounts of data and provides meaningful results. This introduction summarizes the strategy we are taking.
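To make the idea of querying and merging across shards concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not a description of the HathiTrust configuration. The host names, core name, and the `ocr` field are hypothetical. The `shards` request parameter is Solr's standard way of listing the shards to include in a distributed query. In practice Solr performs the merge itself, so the second function only mimics that step on small in-memory lists to show what "merge" means here.

```python
import heapq
from urllib.parse import urlencode


def build_distributed_query(base_url, shard_addresses, query, rows=10):
    """Build a Solr select URL that fans one query out over several shards.

    `shard_addresses` are host:port/path strings (hypothetical in this sketch).
    """
    params = {
        "q": query,
        "rows": rows,
        # Solr expects a comma-separated list of shard addresses.
        "shards": ",".join(shard_addresses),
    }
    return f"{base_url}/select?{urlencode(params)}"


def merge_shard_results(per_shard_hits, rows=10):
    """Merge per-shard lists of (score, doc_id) into one global top-`rows` list.

    This imitates the merge step only; it is not how Solr is implemented.
    """
    all_hits = (hit for hits in per_shard_hits for hit in hits)
    # Keep the highest-scoring hits overall, best first.
    return heapq.nlargest(rows, all_hits, key=lambda hit: hit[0])


if __name__ == "__main__":
    url = build_distributed_query(
        "http://localhost:8983/solr/core0",          # hypothetical host and core
        ["shard1:8983/solr/core0", "shard2:8983/solr/core0"],
        "ocr:whale",                                  # hypothetical field name
        rows=3,
    )
    print(url)

    shard_a = [(0.9, "vol-a"), (0.4, "vol-b")]
    shard_b = [(0.7, "vol-c"), (0.1, "vol-d")]
    print(merge_shard_results([shard_a, shard_b], rows=3))
```

Each shard only needs to return its own best hits, which is why the approach can keep per-query work bounded as the total index grows. The cost of merging and of the slowest shard is one of the variables that benchmarking has to measure.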
We have attempted to define the variables that have the greatest impact on large-scale searching. We have also tried to stage our benchmarking process so that we start with the simplest approach and introduce each new variable only after collecting benchmarking data on the previous instantiation of the index and environment. Our stages are as follows. A report detailing results of stages 1 and 2 is available.