Scaling up Large Scale Search from 500,000 volumes to 5 Million volumes and beyond

To scale up from 500,000 volumes of full-text to 5 million, we decided to use Solr's distributed search feature, which allows us to split an index into a number of separate indexes (called "shards"). The shards are searched in parallel and the results aggregated, so performance is better than with a single very large index.
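A distributed request looks like an ordinary Solr query with an added "shards" parameter listing the shards to fan out to. A minimal sketch, assuming hypothetical host names and the default /solr/select handler:

```shell
# Query all shards in parallel; Solr merges the per-shard results.
# Host names and ports here are hypothetical.
curl -s 'http://solr-server1:8080/solr/select' \
  --data-urlencode 'q=dog' \
  --data-urlencode 'shards=solr-server1:8080/solr,solr-server2:8080/solr'
```

Each entry in the shards list is given as host:port/path, without the http:// prefix.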

Sizing the shards

Our previous tests indicated that disk I/O was the bottleneck in Solr search performance and that once a single index grows beyond about 600,000 volumes, I/O demands rise sharply. (See Update on Testing: Memory and Load tests for details.)  For that reason we decided that shards of between 500,000 and 600,000 volumes would be appropriate.  For 5 million volumes, we decided on 10 shards, each indexing 500,000 volumes. The corresponding index size for each shard would be around 200 GB, for a total of about 2 terabytes.  We plan to add shards to scale beyond 5 million volumes.  For details on the server hardware see New Hardware for searching 5 million+ volumes of full-text.

Several postings on the Solr list advised against using Network Attached Storage (NAS), so we initially configured the hardware to use Direct Attached Storage (DAS).[1]

Earlier tests indicated we needed to optimize our indexes in order to get acceptable performance.  Solr requires about 2x the size of the index as temporary space during optimization, so we needed about 400 GB of disk to accommodate a 200 GB index.  Our tests also indicated that while Solr provides a mechanism to allow searching to take place while an index is being optimized, search performance severely degrades during optimization. To avoid this, we decided to build and optimize the indexes on separate Solr instances. However, to avoid frequently copying our large indexes over the network, we set up the index-building and serving Solr instances for each shard on the same server.

We want to update the indexes daily to reflect newly ingested content. Because of the size of our index shards (almost 200GB) and our desire for a daily index update, we didn't want to spend time making copies of the newly-built shards every day. We decided to use LVM snapshots and simply take snapshots of the quiescent index shards after their daily build, avoiding copying the entire index every day.

We configured our servers with the largest and fastest hard disks available. Each server had 6 450GB 15K RPM SAS drives in a RAID 6 configuration. [2]

After the overhead for the OS, filesystem, and RAID, we had about 1,600 GB usable per server, which we divided into four 400 GB logical volumes for the four Solr shards: two for indexing and two for serving.  [3]  So with two 500,000-volume serving shards per server, we will have 1 million volumes per server.  To accommodate 5 million volumes, we set up 5 servers.

Index building process

Nightly, after indexing for all shards was completed, our driver script ran an optimize on each of the indexing shards.  After the optimize finished the driver script triggered read-only snapshots of the optimized indexes and mounted them for use as serving shards.  Once the snapshots were taken, indexing could resume on the indexing shards.   
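The nightly sequence above can be sketched as a driver script. The ports, volume-group, and mount-point names below are hypothetical, and error handling is omitted:

```shell
#!/bin/sh
# 1. Optimize each index-building Solr instance (hypothetical ports).
for port in 8081 8082; do
  curl -s "http://localhost:$port/solr/update" \
    --data-binary '<optimize/>' -H 'Content-Type: text/xml'
done

# 2. Take a read-only LVM snapshot of the now-quiescent build volume.
lvcreate --snapshot --permission r --size 400G \
  --name shard1-serve /dev/vg0/shard1-build

# 3. Mount the snapshot where the serving Solr instance expects it.
mount -o ro /dev/vg0/shard1-serve /solr/serve/shard1

# 4. Indexing may now resume on /dev/vg0/shard1-build.
```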

Implementation Issues

As we were scaling up, when we got close to an index size of 190 GB per shard, we started having problems with the optimization process running out of disk space.  Several posts to the Solr list revealed that under various conditions optimization can take up to 3 times the final optimized index size, rather than the 2 times we were expecting. [4]

We implemented a workaround: a program that uses the "maxSegments" parameter to the Solr optimize command and optimizes in stages.  For example, the program would optimize down to 16 segments, then to 8, then to 4… and finally down to 1.  The clean-up process at the end of each optimize gets rid of any unneeded files and/or file handles. [5]
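A minimal sketch of the staged approach, assuming a hypothetical Solr URL; halving the target segment count on each pass keeps the merge from ever needing the worst-case 3x temporary space:

```shell
# Build the descending list of segment targets: 16 8 4 2 1.
targets=""
n=16
while [ "$n" -ge 1 ]; do
  targets="$targets $n"
  n=$((n / 2))
done
echo "optimize passes:$targets"

# Run one optimize per target; "maxSegments" is the Solr optimize
# parameter named above. The URL is hypothetical.
for t in $targets; do
  curl -s -m 5 "http://localhost:8080/solr/update" \
    --data-binary "<optimize maxSegments=\"$t\"/>" \
    -H 'Content-Type: text/xml' || true   # tolerate absence of a live Solr
done
```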

Our early performance tests with this architecture indicated good response time and confirmed that disk I/O continued to be the bottleneck in search performance.  Other resources appeared to be underutilized.  During continuous search testing, CPU utilization tended to be between 10 and 20%.

Our intent was to have nightly index updates and for the search system to be available 24/7. At some point we ran our tests while an optimization was occurring and discovered that performance slowed to a crawl during optimization.  This is because optimization is very disk-intensive (the entire index must be read and then the optimized index written), so the disk I/O for optimization competes with the disk I/O needed for searching.

Once we noticed the poor searching performance during optimization we realized that the disk I/O impact of optimization was exacerbated by the additional I/O load of block-level LVM snapshots. [6] We examined our options for reducing I/O contention in a small 6-drive configuration and considered eliminating our use of snapshots and putting the indexing shard and the serving shard on separate RAID sets. This would reduce contention for disk I/O since indexing/optimizing and serving would be making read/write requests to entirely different disks, although some contention to the same RAID controller would remain. [7] However, splitting the configuration into two RAID sets would mean doubling the RAID overhead which would effectively reduce capacity by almost a terabyte or change from RAID 6 to RAID 5. A 1 terabyte reduction in capacity would mean reducing the size of each index by almost half, thus requiring twice as many servers. We abandoned the option of separate RAID controllers because we were neither willing to lose almost 1TB in capacity nor forego RAID 6.

Given the options with the existing hardware, we also considered confining optimizing to once a week during off-peak hours on the weekend. However, we soon ran into another problem: we were rapidly running out of disk space to accommodate daily index growth.  This was due to a number of factors including:

  • Ingest of volumes accelerated considerably beyond our predictions
  • We underestimated the disk space needed.  When we implemented CommonGrams to meet our performance goals, the index size went from about 35% of the size of the documents to about 47%.  Using CommonGrams, the 200 GB index size will accommodate about 425,000 documents per shard instead of the 500,000 to 600,000 we first estimated
  • Our order for more hardware to keep up with the demand was delayed significantly. 

These considerations along with a concern for a more manageable storage solution resulted in a decision to consider our options for a high-performance NAS solution, and to use our existing Isilon cluster (which houses the repository) provisionally.

Moving Large Scale Search to a high performance Network Attached Storage system

NAS or SAN storage is generally preferable to DAS because it is easier to manage, among other reasons [8].  We initially pursued DAS based on various posts on the Solr list that led us to believe DAS would have better performance. Given the size of our indexes and our requirements, we learned that NAS was a much better fit in our circumstances. The benefits of using our large-scale NAS over our simple DAS configuration for this project are significant, including:

  • Significantly more spindles contributing to I/O. Even though we had 30 15K drives for storage in our DAS configuration, any given I/O transaction had access to only 6 of those, which barely distributes I/O load. Our Isilon cluster, on the other hand, has 480 drives, and is highly virtualized: every block of every file is striped over no less than 14 different drives. I/O is therefore distributed throughout the cluster, minimizing the I/O demand on any given drive.
  • Shared storage. We could now build index shards on completely different servers without copying terabytes of index data over the network.
  • Storage efficiency resulting from a shared overhead for redundancy and free space. With the DAS configuration, we allocated 1/3 of our storage to RAID 6 parity and could use at most 50% of our storage because of the need for excess capacity on each server during optimization. With the NAS configuration, we have a smaller, fixed redundancy overhead of under 20% because our stripe width is more than twice as large, and our free space overhead is shared among all uses of the system distributed over time, so our overall storage utilization can be higher.
  • Independent scaling of compute and storage resources. With DAS, we were going to be adding servers to get more storage when we didn't really need more compute resources. With NAS, we can add more of the resource we need. We can add more shards per server or index more fields by increasing our storage allocation, or increase Solr query throughput capacity by adding more servers.
  • The Isilon system we are currently using has some characteristics that make it well-suited to this task, namely that it can scale performance linearly with capacity due to its cluster architecture.
  • NAS systems typically offer file-based rather than the block-based snapshots used by LVM, which are more compatible with the I/O characteristics encountered during optimization. 

Current architecture

With the flexibility of the Isilon NAS, we can completely separate building the index from serving.  The Isilon snapshot technology is file based rather than block based, and because data can be striped across so many disks, we can use snapshots without the adverse effects we experienced with direct-attached drives. Our current configuration is:

Search Servers

  • 4 servers with one Tomcat and 3 shards on each server (only 10 of 12 shards currently in use)
  • 72 GB memory, 16 GB allocated to the JVM

Index Server

  • 1 server with 12 Tomcats and 12 shards (only 10 of 12 Tomcats/shards currently in use)
  • 72 GB memory, 6 GB allocated to each of 10 JVMs

Current Production Index building process

Every morning the index driver gets a list of newly added HathiTrust volumes and queues them for indexing.  Indexing currently takes from 1 to 3 hours depending on the number of documents being indexed.  After indexing is completed the index driver runs a program that optimizes the index on each shard down to two segments. [9] This currently takes about 3 hours but will increase as the number of documents in the second segment increases.  At 3:00 am the next morning, a snapshot is taken of each optimized shard and synchronization of the index to our second site in Indiana begins. At 6:00 am, a script performs sanity checks to ensure indexing and optimization finished correctly, changes the symbolic link pointing at the day-old shard to point at the newly-updated shard, and restarts Tomcat.  At 6:05 am a separate program runs 1600 cache-warming queries to warm the Solr and OS caches, and then runs our test suite to monitor response time.
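The 6:00 am cutover amounts to repointing a symbolic link and restarting Tomcat. A sketch, with hypothetical paths, service names, and warming script:

```shell
# Repoint the "current" link from the day-old snapshot to today's
# optimized snapshot, then restart Tomcat and warm the caches.
NEW=/snapshots/$(date +%F)/shard1      # hypothetical snapshot path
ln -sfn "$NEW" /solr/shard1/current    # -n: replace the link itself, not its target
/etc/init.d/tomcat-shard1 restart      # hypothetical init script
./warm-caches.sh shard1                # hypothetical cache-warming script
```

Because the link swap is a single rename at the filesystem level, the serving Tomcat never sees a half-updated index directory.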

Scaling beyond 5 million volumes

With the use of NAS, we can easily add more disk capacity to Large Scale search, so decisions about scaling up from 5 million to 10 million volumes will primarily involve resources other than disk storage.

Scaling Search

We currently have about 5.3 million documents in Large-Scale search, which averages out to about 530,000 volumes per shard.  We are not yet using shards 11 and 12.  At the current shard size we could accommodate about an additional 1.1 million documents using those two shards.  However, the amount of growth that can be accommodated by adding more documents to existing shards and utilizing the two currently empty shards depends on how large we allow the shards to grow.  We are working on experiments to determine the optimum shard size, based primarily on search response time considerations. [10]  We also plan to experiment to determine the optimum number of shards per server.  As mentioned previously, the CPU utilization in the search servers is low.  Once we determine the optimum size of each shard and the optimum number of shards per server, we will be able to develop better strategies for adding more memory or more servers.

Scaling Indexing

As the size of the index grows, we have seen that we are reaching the limits of the 72 GB of memory currently installed on our build machine with just 10 Solrs building shards.  We believe we will not be able to build indexes on the other 2 shards without either adding memory or implementing some complex code that would start and stop Solr instances.  We plan to obtain two new build servers, each with twice the memory (144 GB).  This should accommodate future growth as well as provide us with failover options and the flexibility we will need to rebuild the entire index when necessary without disrupting our normal production indexing workflow. [11]



[1]See http://www.lucidimagination.com/search/document/2105e4dba8711d71/two_solr_instances_index_and_commit#2105e4dba8711d71 and http://www.lucidimagination.com/search/document/f67e23ea39be9361/solr_and_nfs_in_distributed_deployment_real_time_indexing_and_real_time_searching#f67e23ea39be9361

[2]  After hearing a presentation on using Solid State Drives for the Summa project we considered Solid State Drives, but the cost of 2 terabytes of SSDs (over $150,000) was prohibitive.  (PDF of presentation on SSDs)

[3]  The index building shards needed 400GB or 2x the size of the 200GB index in order to run the index optimization process.  Because of the nature of the LVM block based snapshot technology, we also needed 400GB for the serving shards.  This is explained in more detail in note 6.

[4]  This post to the Solr list revealed that optimizing can take 3 times the final optimized index size rather than the 2 times we were expecting: http://old.nabble.com/How-much-disk-space-does-optimize-really-take-to25790344.html#a25792748



[5]  See http://old.nabble.com/Optimization-of-large-shard-succeeded-to25809281.html#a25809281.  One of the Solr committers has opened an issue to implement our workaround as a default in Solr: https://issues.apache.org/jira/browse/SOLR-1560

[6] LVM is block based and its snapshots are block based. When first taken, a snapshot consumes no space other than metadata.  As changes are made to the source file system, the snapshot grows because changed blocks must be preserved, regardless of whether the blocks contained information about current files before they were changed.  So in our setup, where file systems were sized at 2x the index to accommodate optimization, the act of optimization changed every block in the file system.  This in turn required LVM to copy every block in the file system in order to maintain the snapshot, effectively doubling the I/O load on the RAID set.  In addition, our system administrators observed lengthy pauses in disk I/O during optimization (on the order of a minute), where the only explanation appeared to be that LVM required time for "house cleaning", for lack of a better term.  During such pauses, the processing of a search query would likely have to wait until I/O operations resumed.

[7]  With the build and serve shards on separate RAID sets, we would still need 400GB for each shard.  Once the build shard optimizes the index (which requires 400GB for a 200GB index)  there are two alternatives to mounting the newly optimized index on the serving shard:


1) Copy the newly optimized index from the build shard to the serving shard.  Since this takes considerable time, the serving shard would have to continue to serve from the old index until the copy is completed.  This would require 200GB for the old index and 200GB for the new index. After the copy the old index could be deleted.


2) Change symlinks so that the serving Solr instance points to the newly optimized index and the building Solr instance points to the previous serving index.

[8]   Conventionally, the benefits of networked storage (NAS or SAN) are easier management, consolidation, and flexible deployment, all coming with the additional cost of some sort of sophisticated storage controller system. If architected correctly, performance of I/O-intensive workloads, even though directed over a network, will also benefit.

Most of these benefits come through virtualization, which is the technology that allows all the disks in a networked storage system to be pooled together and logical volumes to be created within the pool, typically spread across many devices, controllers, and caches. Not only is it quicker and easier to create logical volumes with a single unified system, but logical volumes tend to perform better because the I/O activity in a virtualized system is balanced throughout the system and thus minimized on any single drive. In particular, throughput is increased because of the aggregation of multiple drives each capable of reading or writing through their own disk interface, and rotational delay is minimized because data retrieval can be parallelized rather than serialized across multiple drives. This effect is maximized in storage architectures that are large, highly virtualized, and designed for performance scaling as well as capacity scaling (so that the controllers can keep up with the additional I/O capacity of the aggregated system).

In evaluating whether to use a NAS or SAN system, the primary consideration is whether multiple servers need to be able to access the same data. NAS systems, by their nature, use protocols such as NFS that allow multiple read-write access to file systems. SAN systems, on the other hand, use protocols such as SCSI over fibre channel that permit only a single server to have access without special lock management software. Either provides the benefits described above, but generally one will be a much better fit for a given application. For our project, NAS makes the most sense because we need to have the same indexes visible on multiple servers for building versus serving, and down the road, for server failover.
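The shared-access property that makes NFS a fit here shows up directly in the mount commands; the export path and host name below are hypothetical:

```shell
# Build server mounts the shard's index directory read-write...
mount -t nfs isilon:/ifs/solr/shard1 /solr/shard1

# ...while each search server mounts the same export read-only.
mount -t nfs -o ro isilon:/ifs/solr/shard1 /solr/shard1
```

With a SAN, by contrast, giving two servers concurrent access to the same block device would require a cluster file system or lock manager on top.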


[9] Prior to going live in production we optimized all the shards.  At that point the index size per shard was about 220 GB.  Every night we indexed additional documents, which created unoptimized index segments.  If we optimized down to 1 segment every night, the entire index would need to be read, processed, and written to disk: about 220 GB of reading and 220 GB of writing.  We decided instead to optimize down to 2 segments, which means that the large 220 GB segment is not affected and the incremental added indexing gets optimized down to a second segment. So far adding about 600,000 total volumes (60,000 per shard) has resulted in a second segment in each shard of about 50 GB.  It currently takes between 3 and 4 hours to optimize the daily added indexing because of the time it takes to read and write 50 GB of data for each shard.  At some point when this time becomes too large, we will optimize down to 1 segment and start over.

[10] Our previous experiments showing a sharp rise in I/O demands with indexes over 600,000 volumes were conducted in a significantly different environment. In particular, these experiments were conducted prior to the implementation of CommonGrams, which significantly reduces I/O for queries with common words. The new environment also has significantly more processing power and memory. We plan to repeat our experiments in the new environment on a test shard, to determine the relationship of index/shard size to I/O demands.

[11] We plan to have each of the two new build servers host only 6 of the 12 shards during the normal indexing/build process.  However, they will be configured so either one can take over building all 12 shards in the case that one of them fails.  When we need to re-index the entire 5 million volumes, we plan to dedicate one server to the normal incremental indexing process on all 12 shards and the other to building the new index on 12 new shards.  We need to be able to reindex the entire corpus whenever we change the Solr schema or the list of common words. Adding fields such as OCLC numbers or MARC metadata fields, changing the way the indexer processes punctuation, or making other changes to the indexing to improve response time or the user experience will require reindexing the entire corpus.



2TB of SSD for $150,000 sounds like the supplier is taking a very high premium. Looking at Intel's server-class X25 (the best choice in my book at the moment), the retail price for the drives is $30,000. As using the enterprise-class drives for the "write once a day, read all the time" workflow is overkill, one might even argue going for the mainstream drives, putting the price for 2TB at $7,000. I realise that the enterprise world does not normally permit sending a guy down to Newegg to fill a bag (a small and light bag in this case) with SSDs, but that is more or less the way we're going as we're ramping up for TB-scale indexing this fall.

The approximate pricing at the time we looked into this (November 2008) was for a 2TB fiber-attached flash-based SAN from Texas Memory Systems. These systems have full management capabilities, striping, parity redundancy and hot-swappable components, DDR cache, wear leveling, etc. That's why the cost per GB of this sort of approach doesn't compare to that of just raw SSD storage devices. Using raw devices is one way to go, but at our scale we would want them to be aggregated by some sort of managed controller unit. The thought of installing commodity SSDs into a JBOD or SCSI/SATA hardware RAID array did cross our minds :), but we decided against that approach because the internal components had not been designed to work with SSDs, and so would likely not maximize performance.

As I understand your setup, your indexing process is batch-based and does not have near-realtime requirements. This makes it possible (and often desirable) to have hardware dedicated for indexing and hardware dedicated for searching. With such a setup, enterprise-level stability on the search side is not needed, as a catastrophic hardware crash does not mean loss of data. Your argument about not maximizing performance is technically valid: RAIDing or just connecting SSDs up to the TB level would probably saturate most standard controllers. However, not achieving maximum possible performance still leaves room for a huge performance boost over conventional hard drives. Your setup is very interesting as you need both fast random IO and high bulk transfer rate. Our setup is not heavy on the bulk side (we don't use phrase searches much) and on a 4-core machine with a single previous-generation SSD, 4 parallel searches performed at 308% of a single search, indicating that the CPU was the main bottleneck. Thus, I would not worry too much about the random access performance for a commodity RAID setup with Lucene. This still leaves bulk transfers, but here I guesstimate that hardware specs will be fairly accurate as it is simpler to design for and measure.

Tom, what is the reason for running 10 independent Solr/JVM instances for the Indexer? Why did you choose that over running a single Solr/JVM with 10 Solr cores/indices?

Hi Otis,

I can answer the part about why we don't use a single JVM with 10 Solr instances. In our earlier system where we had 2 Solr instances sharing one tomcat/JVM, we found that for unknown reasons one Solr instance would claim all free heap (more than it would ever claim when running in its own JVM) and then the other Solr instance would hang when it needed more memory for a critical operation. One advantage of using 10 separate Solr/JVMs is that if one Solr instance/JVM gets an OOM while indexing, the others can continue indexing.

As far as using 10 cores, I don't know enough about the advantages (other than ease of configuration and deployment) of using 10 Solr cores instead of 10 separate Solr instances. Is there a performance or resource management advantage?


I agree with Toke - SSDs, in particular Intel's X25-E drives, are definitely the way to go. We run 24 off in 2 off Infortrend FC/SATA 12-slot chassis, multi-path connected to 2 HP SAN switches, in turn multi-path connected to 2 off HP DL380G5's running Oracle's cut of Redhat - OEL5, with Oracle RAC cluster and ASM mirroring data across the 2 Infortrend chassis. The rig ABSOLUTELY FLIES - Oracle's ASM "loves" the SSDs because the most important criterion - seek time - is basically eliminated compared to spinning disks. We never see any of the drives reach even 20% of their IO bandwidth capability, even running parallel queries across all 16 cores on the cluster. For your environment, SSDs will wipe the floor both performance- and cost-wise when compared with Texas Mem and Isilon. Do try them, you will be pleasantly surprised by their effectiveness. We also load them into DL360's and DL380's, replacing what were many spindles for MySQL with just a single mirrored pair on SmartArray controllers. We're finding that we're now limited by CPU cycles (3GHz 5160 and X5450's) - query performance is now always cpu-bound, never IO. Several benchmarks I've run recently for multi 500K row parallel inserts across 6 innodb tables on 1 MySQL instance are all showing CPU-bound performance limits - the SSD just keeps upping its IO numbers (180+MB/sec writes frequently seen) on 1 (mirrored) spindle. --D.

Straight IOps-to-IOps, I couldn't agree more; there is no comparison...and, as far as the geek factor, I'd love to play with SSDs...but our solutions need to be fiscally responsible. First and foremost, we need a storage configuration sophisticated enough to be shared across multiple nodes so we can offload indexing and handle failover. Simple DAS will not easily work for a 24x7x365 production environment with a large and dynamic index. The TMS product, or the incremental cost of your SAN fabric, are examples of how the cost increases beyond the simple $/GB in order to make storage manageable. Which brings me to the point I make with people on a weekly basis: cost comparisons of managed storage environments and raw media (disks) are apples and oranges. Production environments at any scale cannot efficiently use raw, unmanaged storage. So, how do you figure SSDs will wipe the floor cost-wise as compared to Isilon? I do not find the oft-quoted statistic of $/IOps compelling, for example, when there are other ready alternatives on my data center floor that provide performance within acceptable tolerances. We have 504 (soon to be 672) stripes in our Isilon cluster that see relatively little I/O; I would be a fool to not take advantage of that.
