To scale up from 500,000 volumes of full-text to 5 million, we decided to use Solr's distributed search feature, which allows us to split an index into a number of separate indexes (called "shards"). The shards are searched in parallel and the results aggregated, so performance is better than with a single very large index.
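As a minimal sketch of how a distributed query fans out, Solr accepts a single request carrying a "shards" parameter listing every shard to search in parallel. The host and core names below are hypothetical, not our production names:

```python
# Hypothetical shard hosts; Solr fans a single query out to each of these
# and merges the results before returning them to the client.
shard_hosts = [f"solr{n}.example.org:8983/solr/shard{n}" for n in range(1, 11)]

def distributed_query_params(q):
    """Build the query parameters for one distributed search across all shards."""
    return {"q": q, "shards": ",".join(shard_hosts)}

params = distributed_query_params("ocean")
```

The client sees one response; the fan-out and merge happen inside Solr.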
Sizing the shards
Our previous tests indicated that disk I/O was the bottleneck in Solr search performance and that once a single index grows beyond about 600,000 volumes, I/O demands rise sharply. (See Update on Testing: Memory and Load tests for details.) For that reason we decided that shards of between 500,000 and 600,000 volumes would be appropriate. For 5 million volumes, we decided on 10 shards, each indexing 500,000 volumes. The corresponding index size for each shard would be around 200 GB, for a total of about 2 terabytes. We plan to add shards to scale beyond 5 million volumes. For details on the server hardware see New Hardware for searching 5 million+ volumes of full-text.
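The sizing arithmetic above can be sketched directly from the figures in the text:

```python
# Back-of-the-envelope shard sizing using the figures above.
volumes_total = 5_000_000
volumes_per_shard = 500_000      # stays under the ~600,000-volume I/O knee
index_gb_per_shard = 200         # observed index size per shard

num_shards = volumes_total // volumes_per_shard       # 10 shards
total_index_gb = num_shards * index_gb_per_shard      # 2,000 GB, about 2 TB
```

Scaling beyond 5 million volumes then just means adding shards at roughly 200 GB apiece.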
Several postings on the Solr list advised against using Network Attached Storage (NAS), so we initially configured the hardware to use Direct Attached Storage (DAS). 
Earlier tests indicated we needed to optimize our indexes to get acceptable performance. Solr requires about 2x the size of the index as temporary space during optimization, so we needed about 400 GB of disk to accommodate a 200 GB index. Our tests also indicated that while Solr allows searching to continue while an index is being optimized, search performance degrades severely during optimization. To avoid this, we decided to build and optimize the indexes on separate Solr instances. However, to avoid frequently copying our large indexes over the network, we set up the index-building and searching Solr instances for each shard on the same server.
We want to update the indexes daily to reflect newly ingested content. Because of the size of our index shards (almost 200GB) and our desire for a daily index update, we didn't want to spend time making copies of the newly-built shards every day. We decided to use LVM snapshots and simply take snapshots of the quiescent index shards after their daily build, avoiding copying the entire index every day.
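The nightly snapshot step can be sketched as the pair of commands below; the volume-group, logical-volume, and mount-point names are made up for illustration, and the copy-on-write size is an arbitrary placeholder, not our production setting:

```python
# Hypothetical sketch of the nightly LVM snapshot of a quiescent index volume.
# Names and sizes are illustrative, not the production configuration.
def snapshot_commands(vg="vg0", lv="solr-shard1", snap="solr-shard1-snap"):
    """Commands to snapshot an index volume and mount the snapshot read-only."""
    return [
        # Create a copy-on-write snapshot of the just-optimized index volume.
        ["lvcreate", "--snapshot", "--size", "10G",
         "--name", snap, f"/dev/{vg}/{lv}"],
        # Mount the snapshot read-only for the serving Solr instance.
        ["mount", "-o", "ro", f"/dev/{vg}/{snap}", f"/mnt/serve/{lv}"],
    ]

cmds = snapshot_commands()
```

Because the snapshot is copy-on-write, it consumes almost no space at creation time; only blocks changed by subsequent indexing need to be preserved.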
We configured our servers with the largest and fastest hard disks available. Each server had six 450 GB 15K RPM SAS drives in a RAID 6 configuration.
After the overhead for the OS, filesystem, and RAID, we had about 1,600 GB usable per server, which we divided into four 400 GB logical volumes for the four Solr shards: two for indexing and two for serving. With two 500,000-volume serving shards per server, we have 1 million volumes per server; to accommodate 5 million volumes, we set up 5 servers.
Index building process
Nightly, after indexing for all shards was completed, our driver script ran an optimize on each of the indexing shards. After the optimize finished the driver script triggered read-only snapshots of the optimized indexes and mounted them for use as serving shards. Once the snapshots were taken, indexing could resume on the indexing shards.
As we were scaling up, when we got close to an index size of 190 GB per shard, we started having problems with the optimization process running out of disk space. Several posts to the Solr list revealed that under various conditions, optimization can take up to 3 times the final optimized index size rather than the 2 times we were expecting.
We implemented a workaround: a program that uses the "maxSegments" parameter of the Solr optimize command to optimize in stages. For example, the program would optimize down to 16 segments, then to 8, then to 4… and finally down to 1. The cleanup process at the end of each stage gets rid of any unneeded files and/or file handles.
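A minimal sketch of this staged approach, assuming a halving schedule and Solr's update handler taking optimize requests as URL parameters (the host and core name are hypothetical):

```python
# Staged optimize: halve maxSegments each pass so Solr can clean up
# intermediate files between passes instead of needing all the temporary
# space at once. The halving schedule is an assumption for illustration.
def optimize_schedule(start=16):
    """Return the sequence of maxSegments targets, e.g. 16, 8, 4, 2, 1."""
    steps, n = [], start
    while n > 1:
        steps.append(n)
        n //= 2
    steps.append(1)
    return steps

def optimize_urls(core="http://localhost:8983/solr/shard1"):
    """Optimize requests against a hypothetical core, one per stage."""
    return [f"{core}/update?optimize=true&maxSegments={n}"
            for n in optimize_schedule()]
```

Each request is issued only after the previous stage has finished, so the peak temporary-space demand stays bounded.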
Our early performance tests with this architecture indicated good response time and confirmed that disk I/O continued to be the bottleneck in search performance. Other resources appeared to be underutilized. During continuous search testing, CPU utilization tended to be between 10 and 20%.
Our intent was to have nightly index updates and for the search system to be available 24/7. At some point we ran our tests while an optimization was occurring and discovered that performance slowed to a crawl during optimization. This is because optimization is very disk intensive (the entire index needs to be read and then the optimized index written), and the disk I/O for optimization competes with the disk I/O needed for searching.
Once we noticed the poor search performance during optimization, we realized that the disk I/O impact of optimization was exacerbated by the additional I/O load of block-level LVM snapshots. We examined our options for reducing I/O contention in a small 6-drive configuration and considered eliminating our use of snapshots and putting the indexing shard and the serving shard on separate RAID sets. This would reduce contention for disk I/O, since indexing/optimizing and serving would be making read/write requests to entirely different disks, although some contention on the shared RAID controller would remain. However, splitting the configuration into two RAID sets would mean doubling the RAID overhead, which would effectively reduce capacity by almost a terabyte, or require changing from RAID 6 to RAID 5. A 1 terabyte reduction in capacity would mean reducing the size of each index by almost half, thus requiring twice as many servers. We abandoned the option of separate RAID sets because we were neither willing to lose almost 1 TB of capacity nor to forgo RAID 6.
Given the options with the existing hardware, we also considered confining optimization to once a week, during off-peak hours on the weekend. However, we soon ran into another problem: we were rapidly running out of disk space to accommodate daily index growth. This was due to a number of factors, including:
- Ingest of volumes accelerated considerably beyond our predictions
- We underestimated the disk space needed. When we implemented CommonGrams to meet our performance goals, the index size went from about 35% to about 47% of the size of the documents. Using CommonGrams, the 200 GB index size will accommodate about 425,000 documents per shard instead of the 500,000-600,000 we first estimated.
- Our order for more hardware to keep up with the demand was delayed significantly.
These considerations along with a concern for a more manageable storage solution resulted in a decision to consider our options for a high-performance NAS solution, and to use our existing Isilon cluster (which houses the repository) provisionally.
Moving Large Scale Search to a high performance Network Attached Storage system
NAS or SAN storage is generally preferable to DAS because it is easier to manage, among other reasons. We initially pursued DAS based on various posts on the Solr list that led us to believe DAS would perform better. Given the size of our indexes and our requirements, we learned that NAS was a much better fit for our circumstances. The benefits of using our large-scale NAS over our simple DAS configuration for this project are significant, including:
- Significantly more spindles contributing to I/O. Even though we had 30 15K drives for storage in our DAS configuration, any given I/O transaction had access to only 6 of those, which barely distributes I/O load. Our Isilon cluster, on the other hand, has 480 drives, and is highly virtualized: every block of every file is striped over no less than 14 different drives. I/O is therefore distributed throughout the cluster, minimizing the I/O demand on any given drive.
- Shared storage. We could now build index shards on completely different servers without copying terabytes of index data over the network.
- Storage efficiency resulting from a shared overhead for redundancy and free space. With the DAS configuration, we allocated 1/3 of our storage to RAID 6 parity and could use at most 50% of our storage because of the need for excess capacity on each server during optimization. With the NAS configuration, we have a smaller, fixed redundancy overhead of under 20% because our stripe width is more than twice as large, and our free space overhead is shared among all uses of the system distributed over time, so our overall storage utilization can be higher.
- Independent scaling of compute and storage resources. With DAS, we were going to be adding servers to get more storage when we didn't really need more compute resources. With NAS, we can add more of the resource we need. We can add more shards per server or index more fields by increasing our storage allocation, or increase Solr query throughput capacity by adding more servers.
- The Isilon system we are currently using is well suited to this task because its cluster architecture scales performance linearly with capacity.
- NAS systems typically offer file-based rather than the block-based snapshots used by LVM, which are more compatible with the I/O characteristics encountered during optimization.
With the flexibility of the Isilon NAS, we can completely separate index building from serving. The Isilon snapshot technology is file based rather than block based, and because data can be striped across so many disks, we can use snapshots without the adverse effects we saw with direct attached drives. Our current configuration is:
- 4 servers, each with one Tomcat and 3 shards (only 10 of 12 shards currently in use)
  - 72 GB memory, 16 GB allocated to the JVM
- 1 server with 12 Tomcats and 12 shards (only 10 of 12 Tomcats/shards currently in use)
  - 72 GB memory, 6 GB allocated to each of 10 JVMs
Current Production Index building process
Every morning the index driver gets a list of newly added HathiTrust volumes and queues them for indexing. Indexing currently takes from 1 to 3 hours depending on the number of documents being indexed. After indexing is completed, the index driver runs a program that optimizes the index on each shard down to two segments. This currently takes about 3 hours but will increase as the number of documents in the second segment increases. At 3:00 am the next morning, a snapshot is taken of each optimized shard and synchronization of the index to our second site in Indiana begins. At 6:00 am, a script performs sanity checks to ensure indexing and optimization finished correctly, changes the symbolic link pointing at the day-old shard to point to the newly updated shard, and restarts Tomcat. At 6:05 am a separate program runs 1,600 cache-warming queries to warm the Solr and OS caches, and then runs our test suite to monitor response time.
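The symbolic-link cutover in the step above can be sketched as follows; the paths and the helper name are illustrative, not the production script:

```python
import os

# Sketch of the cutover: repoint a "current" symlink from the day-old shard
# to the freshly optimized one. Creating the new link under a temporary name
# and renaming it over the old link makes the swap effectively atomic.
def swap_current(link_path, new_target):
    """Point link_path at new_target, replacing the existing symlink."""
    tmp = link_path + ".tmp"
    os.symlink(new_target, tmp)   # new link under a temporary name
    os.replace(tmp, link_path)    # atomic rename over the old symlink
```

Tomcat is then restarted so Solr opens the index now behind the link.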
Scaling beyond 5 million volumes
With the use of NAS, we can easily add more disk capacity to Large Scale search, so decisions about scaling up from 5 million to 10 million volumes will primarily involve resources other than disk storage.
We currently have about 5.3 million documents in Large-Scale search, which averages out to about 530,000 volumes per shard. We are not yet using shards 11 and 12. At the current shard size we could accommodate about an additional 1.1 million documents using those two shards. However, the amount of growth that can be accommodated by adding more documents to existing shards and utilizing the two currently empty shards depends on how large we allow the shards to grow. We are working on experiments to determine the optimum shard size, based primarily on search response time considerations. We also plan to experiment to determine the optimum number of shards per server. As mentioned previously, CPU utilization on the search servers is low. Once we determine the optimum size of each shard and the optimum number of shards per server, we will be able to develop better strategies for adding more memory or more servers.
As the size of the index grows, we are reaching the limits of the 72 GB of memory currently installed on our build machine with just 10 Solrs building shards. We will not be able to build indexes on the other two shards without either adding memory or implementing some complex code to start and stop Solr instances. We plan to obtain two new build servers, each with twice the memory (144 GB). This should accommodate future growth as well as provide us with failover options and the flexibility we will need to rebuild the entire index when necessary without disrupting our normal production indexing workflow.
 See http://www.lucidimagination.com/search/document/2105e4dba8711d71/two_solr_instances_index_and_commit#2105e4dba8711d71  and http://www.lucidimagination.com/search/document/f67e23ea39be9361/solr_and_nfs_in_distributed_deployment_real_time_indexing_and_real_time_searching#f67e23ea39be9361 
After hearing a presentation on using Solid State Drives for the Summa project, we considered Solid State Drives, but the cost of 2 terabytes of SSDs (over $150,000) was prohibitive. (PDF of presentation on SSDs)
The index building shards needed 400 GB, or 2x the size of the 200 GB index, in order to run the index optimization process. Because of the nature of the LVM block based snapshot technology, we also needed 400 GB for the serving shards. This is explained in more detail in note 6.
  These posts to the Solr list revealed that optimizing can take 3 times the final optimized index size rather than the 2 times we were expecting: http://old.nabble.com/How-much-disk-space-does-optimize-really-take-to25790344.html#a25792748 
  See http://old.nabble.com/Optimization-of-large-shard-succeeded-to25809281.html#a25809281 . One of the Solr committers has opened an issue to implement our workaround as a default in Solr: https://issues.apache.org/jira/browse/SOLR-1560 
  LVM is block based and its snapshots are block based. When first taken a snapshot consumes no space other than metadata. As changes are made to the source file system, the snapshot grows because changed blocks must be preserved, regardless of whether the blocks contained information about current files before they were changed. So in our setup where file systems were sized at 2x the index to accommodate optimization, the act of optimization changed every block in the file system. This in turn required LVM to copy every block in the file system in order to maintain the snapshot, effectively doubling the I/O load on the RAID set. In addition, our system administrators observed lengthy pauses in disk I/O during optimization (on the order of a minute), where the only explanation appeared to be that LVM required time for "house cleaning", for lack of a better term. During such pauses, the processing of a search query would likely have to wait until I/O operations are resumed.
  With the build and serve shards on separate RAID sets, we would still need 400GB for each shard. Once the build shard optimizes the index (which requires 400GB for a 200GB index) there are two alternatives to mounting the newly optimized index on the serving shard:
1) Copy the newly optimized index from the build shard to the serving shard. Since this takes considerable time, the serving shard would have to continue to serve from the old index until the copy is completed. This would require 200GB for the old index and 200GB for the new index. After the copy the old index could be deleted.
2) Change symlinks so that the server is pointing to the newly optimized index and the building Solr instance is pointing to the previous serve index.
  Conventionally, the benefits of networked storage (NAS or SAN) are easier management, consolidation, and flexible deployment, all coming with the additional cost of some sort of sophisticated storage controller system. If architected correctly, performance of I/O-intensive workloads, even though directed over a network, will also benefit.
Most of these benefits come through virtualization, the technology that allows all the disks in a networked storage system to be pooled together and logical volumes to be created within the pool, typically spread across many devices, controllers, and caches. Not only is it quicker and easier to create logical volumes with a single unified system, but logical volumes tend to perform better because the I/O activity in a virtualized system is balanced throughout the system and thus minimized on any single drive. In particular, throughput is increased because multiple drives each read or write through their own disk interface, and rotational delay is minimized because data retrieval can be parallelized rather than serialized across multiple drives. This effect is maximized in storage architectures that are large, highly virtualized, and designed for performance scaling as well as capacity scaling (so that the controllers can keep up with the additional I/O capacity of the aggregated system).
In evaluating whether to use a NAS or SAN system, the primary consideration is whether multiple servers need to be able to access the same data. NAS systems, by their nature, use protocols such as NFS that allow multiple read-write access to file systems. SAN systems, on the other hand, use protocols such as SCSI over fibre channel that permit only a single server to have access without special lock management software. Either provides the benefits described above, but generally one will be a much better fit for a given application. For our project, NAS makes the most sense because we need the same indexes visible on multiple servers for index building versus serving, and down the road, for server failover.
  Prior to going live in production we optimized all the shards. At that point the index size per shard was about 220 GB. Every night we indexed additional documents, which created unoptimized index segments. If we optimized every night down to 1 segment, the entire index needs to be read, processed, and written to disk. This would be about 220 GB of reading and 220 GB of writing. We decided instead to optimize down to 2 segments, which means that the large 220 GB segment is not affected and the incremental added indexing gets optimized down to a second segment. So far adding about 600,000 total volumes (60,000 per shard), has resulted in a second segment in each shard of about 50GB. It currently takes between 3 and 4 hours to optimize the daily added indexing because of the time it takes to read and write 50GB of data for each shard. At some point when this time becomes too large, we will optimize down to 1 segment and start over.
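The I/O savings of the two-segment approach described above follow directly from the figures in the note:

```python
# Rough nightly optimize I/O, using the figures above: a full optimize
# rewrites the whole ~220 GB index, while a two-segment optimize only
# rewrites the ~50 GB incremental segment.
full_index_gb = 220
incremental_segment_gb = 50

io_full_optimize = 2 * full_index_gb          # read + write whole index: 440 GB
io_two_segment = 2 * incremental_segment_gb   # read + write increment: 100 GB
```

As the second segment grows, the two-segment cost approaches the full cost, which is why a periodic full optimize to one segment eventually makes sense.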
Our previous experiments showing a sharp rise in I/O demands for indexes over 600,000 volumes were conducted in a significantly different environment. In particular, those experiments were conducted prior to the implementation of CommonGrams, which significantly reduces I/O for queries with common words. The new environment also has significantly more processing power and memory. We plan to repeat our experiments in the new environment on a test shard to determine the relationship of index/shard size to I/O demands.
We plan to have each of the two new build servers host only 6 of the 12 shards during the normal indexing/build process. However, they will be configured so either one can take over building all 12 shards if the other fails. When we need to re-index the entire 5 million volumes, we plan to dedicate one server to the normal incremental indexing process on all 12 shards and the other to building the new index on 12 new shards. We need to be able to reindex the entire corpus whenever we change the Solr schema or the list of common words. Adding fields such as OCLC numbers or MARC metadata fields, changing the way the indexer processes punctuation, or other changes to the indexing to improve response time or the user experience will require reindexing the entire corpus.