
Performance at 5 million volumes

On November 19, 2009, we put new hardware into production to provide full-text searching against about 4.6 million volumes. The index now holds about 5.3 million volumes. The average response time is about 3 seconds: 90% of queries take under 4 seconds, 9% take between 4 and 24 seconds, and 1% take longer than 24 seconds.

These times approximate the response times experienced by users. They measure the elapsed time between the moment our test program sends a query to the Large Scale Search application and the moment it receives a response. The actual response time a user experiences is probably a bit longer, because the user's web browser must also load CSS, images, and JavaScript and then render the page.
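A minimal sketch of this kind of end-to-end measurement, in Python. It is not the actual test program: `timed_call` and `fake_query` are hypothetical names, and `fake_query` merely sleeps in place of a real HTTP request to the search application.

```python
import time


def timed_call(send_query):
    """Return (elapsed_ms, response) for one call to send_query()."""
    start = time.perf_counter()
    response = send_query()
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return elapsed_ms, response


def fake_query():
    # Hypothetical stand-in for sending a query and reading the HTML reply.
    time.sleep(0.05)
    return "<html>results</html>"


elapsed_ms, body = timed_call(fake_query)
print(f"{elapsed_ms:.0f} ms for {len(body)} bytes")
```

Because the timer wraps the whole call, the figure includes application processing and network time, but not anything the browser does afterwards.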

The table below shows the average, median, 90th percentile, and 99th percentile response times for the 5.3 million volume index. It compares times taken from our logs of actual user queries with the response times for our test queries. The differences between the user queries and the test queries are most visible in the 99th percentile times.[1]

Response times in milliseconds

               Average   Median   90th percentile   99th percentile
User queries     2,730    1,940             3,533            23,177
Test queries     1,583    1,452             2,300             3,442
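For readers who want to produce the same four summary figures from their own logs, here is a small Python sketch. It assumes the nearest-rank definition of a percentile, which may differ from the method used for the numbers above, and the sample data is made up.

```python
import math
import statistics


def percentile(sorted_times, pct):
    """Nearest-rank percentile of an ascending, non-empty list."""
    rank = max(1, math.ceil(pct * len(sorted_times) / 100))
    return sorted_times[rank - 1]


def summarize(times_ms):
    """Average, median, 90th and 99th percentile of response times."""
    ordered = sorted(times_ms)
    return {
        "average": statistics.mean(ordered),
        "median": statistics.median(ordered),
        "p90": percentile(ordered, 90),
        "p99": percentile(ordered, 99),
    }


# Made-up response times (1..100 ms), for illustration only.
sample = list(range(1, 101))
summary = summarize(sample)
print(summary)
```

A long tail of slow queries pulls the average and the 99th percentile up while barely moving the median, which is the pattern the user-query row shows.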


Our earlier tests reported only the Solr response times rather than the time it takes our Large Scale Search application to respond to a query. The elapsed times reported above reflect the time the Large Scale Search application takes to process the user's request, send 2 queries to Solr, process the responses from Solr, and generate and send HTML. [2] This includes any network time in communication between the application and Solr, and between the application and our test program. This overhead adds a second or more to the raw Solr response times reported in our earlier benchmarking.

For comparison with our earlier benchmarking, the chart below shows the Solr response times (qtime) for our previous tests against a 500,000 volume index on our test hardware and the same tests run against our production system with indexes of 4.7 million and 5.2 million volumes.

Solr response time for 1,000 test queries (ms)

                                      Average   Median   90th percentile   99th percentile
Single 500K doc index, test machine        87       35               157             1,200
Production, 4.7 million volumes           134       67               300               894
Production, 5.2 million volumes           207       86               444             1,399

We are working on a few things that should improve performance.[3]

  1. Tuning the network 
  2. Load balancing and replication
  3. Tuning the list of common words
  4. Testing to determine the optimum shard size and optimum number of shards per server
  5. Monitoring our logs for slow queries and working to determine the bottlenecks in processing


[1] We are currently investigating the causes of the differences in response times between user queries and our test suite. We may need to modify our test suite to better reflect user queries. On the other hand, when we rerun slow user queries (against a newly cleared and warmed cache), we see much faster response times than those reported in the logs. We are trying to identify the factors responsible for the poorer performance showing up in the logs.

[2] The Large Scale Search application sends two queries to Solr for each user query it receives.  The first query is to get the first page of results and the second query is to get the count of either "full view only"  hits or all hits to populate the "All items" or "Only Full view" tabs.  The second query can get its results from the Solr caches and so is much faster than the first.
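The two-query pattern can be sketched roughly as follows. This is an illustration only, not the application's code: the endpoint URL and the `rights:fullview` filter are hypothetical, and using `rows=0` for the count-only query is an assumption. The Solr parameters `q`, `start`, `rows`, and `fq` are standard.

```python
from urllib.parse import urlencode

SOLR_SELECT = "http://localhost:8983/solr/select"  # hypothetical endpoint
FULL_VIEW_FILTER = "rights:fullview"  # hypothetical field and value


def build_queries(user_query, page_size=25, full_view_only=False):
    """Return (page_url, count_url) for one user query."""
    # First query: fetch the first page of results for the active tab.
    page_params = {"q": user_query, "start": 0, "rows": page_size}
    # Second query: fetch only the hit count for the other tab.
    count_params = {"q": user_query, "rows": 0}
    if full_view_only:
        page_params["fq"] = FULL_VIEW_FILTER
    else:
        count_params["fq"] = FULL_VIEW_FILTER
    page_url = f"{SOLR_SELECT}?{urlencode(page_params)}"
    count_url = f"{SOLR_SELECT}?{urlencode(count_params)}"
    return page_url, count_url


page_url, count_url = build_queries("history of printing")
print(page_url)
print(count_url)
```

Both URLs carry the same `q`, which is what would let the second request benefit from work Solr has already cached for the first.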

[3] Performance measures

1) The network cards on our Solr servers have intermittent problems handling jumbo packets. We have programs in place to reset the cards when these problems occur. We are experimenting with different NICs and drivers, which we hope will eliminate the problem.


2) We currently run the Large Scale Search application both here at the University of Michigan and at Indiana. However, until our planned deployment of new hardware at Indiana, both instances must query the Solr servers here at UM. Even with Internet2, the network latency between Indiana and UM is about 35 ms per packet. Solr HTTP requests and responses use multiple TCP packets, so user queries handled by the Indiana instance have slower overall response times. We plan to install new hardware at Indiana to mirror/replicate our Solr servers here at UM, both to eliminate this latency problem and to provide for failover.
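To get a feel for how quickly this adds up, here is a back-of-the-envelope Python sketch. The 35 ms figure comes from the text; the number of exchanges per query is an assumption, and treating them as strictly serial overstates the cost whenever TCP has several packets in flight at once.

```python
LATENCY_MS = 35  # Indiana <-> UM latency per packet, as reported above


def added_latency_ms(serial_exchanges, per_exchange_ms=LATENCY_MS):
    """Extra wait if each exchange must finish before the next starts."""
    return serial_exchanges * per_exchange_ms


# The exchange counts below are illustrative guesses, not measurements.
for exchanges in (2, 10, 20):
    print(exchanges, "exchanges ->", added_latency_ms(exchanges), "ms")
```

Since the application sends two Solr queries per user query, any per-exchange cost is paid at least twice.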

3) We did some analysis of the most frequent terms in our index and will be adding terms to our list of common words for CommonGrams. A future blog post will provide details.

4)  Our previous tests on  index sizes and I/O requirements  were performed before we implemented CommonGrams and in a significantly different hardware and storage environment.  We have created new indexes for shard sizes from about 500,000 documents up to over 1 million documents.  We plan on doing a series of tests on  the new hardware to determine the relationship of index size and shards per server, to I/O demands and response time.

5) Our logs show that about 0.5% of user queries (about 1 out of every 200) take over 30 seconds. When we rerun these same queries, we get response times of under 10 seconds. We are working to determine the cause of these slow queries so that we can eliminate them.
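A simple way to track this figure over time is to compute the share of logged queries above a threshold. The Python sketch below assumes the elapsed times have already been extracted from the logs into a list of seconds; the sample data is made up to match the 1-in-200 ratio.

```python
def slow_query_share(elapsed_seconds, threshold=30.0):
    """Fraction of queries whose elapsed time exceeds the threshold."""
    if not elapsed_seconds:
        return 0.0
    slow = sum(1 for t in elapsed_seconds if t > threshold)
    return slow / len(elapsed_seconds)


# Made-up log: 199 ordinary queries and one 45-second outlier.
times = [2.0] * 199 + [45.0]
share = slow_query_share(times)
print(f"{share:.1%} of queries took over 30 seconds")
```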






Thank you for taking the time to post this information. Is the main factor the size of your corpus? Are these times on a relatively lightly loaded box?

The slower Solr response times are due to the size of the corpus and the related disk I/O for queries containing common words (the total index size for all the shards is now about 3 terabytes.) We suspect that the slower *elapsed* times are due to network problems, which we are in the process of solving.

As for the load on the servers, the Solr servers are dedicated to serving the search, so there is no real load from anything else. So far we have not had a demanding query rate. We average about one query every 5-10 seconds. If we start seeing a sustained rate of over 1 query per second, we will consider replicating the index. Each shard would be replicated and we would load balance between the replicas. In the near future we will have a second copy of the index in Indiana (primarily for failover purposes), but we will load balance between the instance here and the one at Indiana, so that should give us some extra time before we have to consider further replication.


>>>Our logs show about 0.5% of user queries (or about 1 out of every 200) take over 30 seconds. When we rerun these same queries, we get response times of under 10 seconds. We are currently working on trying to determine the cause of these slow queries so that we can eliminate these slow response times.
With regard to the above, was it resolved? Is it related to network bandwidth or CPU utilization? How many concurrent users were there, and what was the QPS?

We couldn't find any correlations with high QPS and/or concurrent users. We tend to get rates of less than 1 QPS. It appears that there were at least two causes. The first was a problem with the drivers for our network cards that resulted in dropped packets. Once that was resolved, the number of slow queries was reduced. The second cause appears to be long full garbage collections by the JVM. When we looked at the GC logs we noticed periodic stop-the-world garbage collections that took around 30 seconds. We recently made some changes that reduced memory consumption significantly and have not seen those large garbage collection times. (See http://www.hathitrust.org/blogs/large-scale-search/too-many-words-again)
We also plan to change to the concurrent garbage collector after we run some tests but we haven't implemented this yet. I still need to do some further log analysis to be sure that this has solved the problem.
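For anyone wanting to check their own GC logs for similar pauses, here is a rough Python sketch. It assumes the simple one-line format that older HotSpot JVMs print with `-verbose:gc`; logs with more detailed output enabled have nested sections and would need a different pattern. The sample lines are invented for illustration and are not taken from our logs.

```python
import re

# Matches lines such as "[Full GC 10K->5K(20K), 1.2345678 secs]".
FULL_GC = re.compile(r"\[Full GC.*?(\d+\.\d+) secs\]")


def long_full_gcs(lines, min_seconds=10.0):
    """Return the pause times (seconds) of full GCs at or above min_seconds."""
    pauses = []
    for line in lines:
        match = FULL_GC.search(line)
        if match and float(match.group(1)) >= min_seconds:
            pauses.append(float(match.group(1)))
    return pauses


# Invented sample lines in the assumed format.
sample = [
    "[GC 325407K->83000K(776768K), 0.2300771 secs]",
    "[Full GC 7340032K->2097152K(8388608K), 30.4512000 secs]",
    "[Full GC 1048576K->524288K(8388608K), 1.2000000 secs]",
]
print(long_full_gcs(sample))
```

Matching the timestamps of such pauses against slow entries in the query log is one way to confirm whether garbage collection explains them.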

