Large Scale Search Blog
Making personal collections from Large Scale Search Results
Submitted by Tom Burton-West on Tue, 07/06/2010 - 16:44
We just released a new feature in our full-text Large Scale Search. When you do a search, you will see check boxes next to each search result. You can select the items you want from the search results and create a personal collection. This should make it much easier to do repeated searches and explore a targeted subset of the HathiTrust volumes. If you are not logged in, the collection will be temporary. If you log in, you can save the collection permanently.
Too Many Words!
Submitted by Tom Burton-West on Fri, 02/19/2010 - 18:59
When we read that the Lucene index format used by Solr has a limit of 2.1 billion unique words per index segment, we didn't think we had to worry. However, a couple of weeks ago, after we optimized our indexes on each shard to one segment, we started seeing Java "ArrayIndexOutOfBounds" exceptions in our logs. After a bit of investigation we determined that indeed, most of our index shards contained over 2.1 billion unique words and some queries were triggering these exceptions. Currently ea
Performance at 5 million volumes
Submitted by Tom Burton-West on Thu, 02/18/2010 - 19:11
On November 19, 2009, we put new hardware into production to provide full-text searching against about 4.6 million volumes. Currently we have about 5.3 million volumes. The average response time is about 3 seconds, 90% of queries take under 4 seconds, 9% of queries take between 4 seconds and 24 seconds, and 1% of queries take longer than 24 seconds.
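A breakdown like the one above can be produced from a log of per-query response times. The following is a minimal sketch in Python, not the tooling used for these measurements: the function name `latency_summary`, the 4-second and 24-second thresholds as parameters, and the sample data are all assumptions made for illustration.

```python
def latency_summary(times_s, fast=4.0, slow=24.0):
    """Summarize query response times (in seconds) into three buckets.

    Hypothetical helper: buckets are 'under fast', 'fast to slow', and
    'over slow', mirroring the 4 s / 24 s breakdown quoted in the post.
    """
    if not times_s:
        raise ValueError("need at least one response time")
    n = len(times_s)
    under = sum(1 for t in times_s if t < fast)
    over = sum(1 for t in times_s if t > slow)
    middle = n - under - over  # everything from fast to slow, inclusive
    return {
        "mean_s": sum(times_s) / n,
        "pct_under_fast": 100.0 * under / n,
        "pct_middle": 100.0 * middle / n,
        "pct_over_slow": 100.0 * over / n,
    }


# Made-up sample data, chosen only to exercise the function.
sample = [2.0] * 90 + [10.0] * 9 + [106.0]
summary = latency_summary(sample)
print(summary)
```

Note that a small number of very slow queries can pull the mean well above the typical response time, which is why the percentage breakdown is reported alongside the average.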
Scaling up Large Scale Search from 500,000 volumes to 5 Million volumes and beyond
Submitted by Tom Burton-West on Mon, 02/01/2010 - 16:56
To scale up from 500,000 volumes of full-text to 5 million, we decided to use Solr's distributed search feature, which allows us to split up an index into a number of separate indexes (called "shards"). The shards are searched in parallel and the results are then aggregated, so performance is better than with a single very large index.
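The search-in-parallel-then-aggregate pattern described above can be sketched in a few lines of Python. This is a toy illustration, not Solr's implementation: each "shard" is an in-memory dictionary, the score is a simple term count, and the names `search_shard` and `distributed_search` are hypothetical.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor


def search_shard(shard, term, k):
    """Return the top-k (score, doc_id) hits from one shard.

    Toy scoring (an assumption): score = number of times `term` occurs.
    """
    scored = ((text.split().count(term), doc_id) for doc_id, text in shard.items())
    return heapq.nlargest(k, (hit for hit in scored if hit[0] > 0))


def distributed_search(shards, term, k=10):
    """Query every shard in parallel, then merge the per-shard top-k lists."""
    with ThreadPoolExecutor(max_workers=max(1, len(shards))) as pool:
        per_shard = pool.map(lambda s: search_shard(s, term, k), shards)
        # Each shard only returns its own top k, so the merge step
        # handles at most k * len(shards) hits.
        return heapq.nlargest(k, (hit for hits in per_shard for hit in hits))


# Two tiny made-up shards.
shards = [
    {"vol1": "beat generation beat", "vol2": "whale"},
    {"vol3": "beat poets", "vol4": "beat beat beat"},
]
top = distributed_search(shards, "beat", k=2)
print(top)
```

In Solr itself, a distributed request is made by adding a `shards` parameter that lists the shard addresses to an ordinary query; the server receiving the request does the merging.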
Sizing the shards
New Hardware for searching 5 million+ volumes of full-text
Submitted by Tom Burton-West on Thu, 01/07/2010 - 18:43
On November 19, 2009, we put new hardware into production to provide full-text searching against about 4.6 million volumes. Currently we have about 5.3 million volumes indexed. Below is a brief description of our current production hardware. Future posts will give details about performance and background on our experiments with different system architectures and configurations.
Hardware details
Solr Server configuration
Tuning search performance
Submitted by Tom Burton-West on Fri, 08/28/2009 - 18:44
Before we implemented the CommonGrams index, our slowest query with the standard index was "the lives and literature of the beat generation", which took about 2 minutes for the 500,000-volume index. When we implemented the CommonGrams index, that query took only 3.6 seconds.
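The general idea behind a common-grams index is to index each pair of adjacent words that includes a common word as a single token, so that a phrase query can match the much rarer pair instead of scanning the long positions list of the common word alone. The sketch below illustrates that idea in Python; it is a simplified approximation rather than the Solr filter itself, and the function name `common_grams` and the common-word list are assumptions.

```python
def common_grams(tokens, common_words):
    """Emit every token, plus a joined bigram wherever a pair of
    adjacent tokens contains at least one common word."""
    out = []
    for i, tok in enumerate(tokens):
        out.append(tok)
        if i + 1 < len(tokens):
            nxt = tokens[i + 1]
            if tok in common_words or nxt in common_words:
                out.append(tok + "_" + nxt)
    return out


# Hypothetical common-word list; a real one would be drawn from the
# most frequent terms in the collection.
COMMON = {"the", "and", "of"}
grams = common_grams("the lives and literature".split(), COMMON)
print(grams)
```

A phrase query over these tokens can look up "the_lives" and "and_literature", each of which occurs far less often than "the" or "and" by itself.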
Slow Queries and Common Words (Part 2)
Submitted by Tom Burton-West on Mon, 07/27/2009 - 17:18
In part 1 we talked about why some queries are slow and the effect of these slow queries on overall performance. The slowest queries are phrase queries containing common words. These queries are slow because the size of the positions index for common terms on disk is very large and disk seeks are slow. These long positions index entries cause three problems relating to overall response time:
Current Hardware Used for Testing
Submitted by Tom Burton-West on Fri, 07/24/2009 - 18:41
This is a brief note on the current hardware and software environment we are using for Solr testing.
Solr Servers
- Two Dell PowerEdge 1950 blades
- 2 x Dual Core Intel Xeon 3.0 GHz 5160 Processors
- 8GB - 32GB RAM depending on the test configuration
- Red Hat Enterprise Linux 5.3 (kernel: 2.6.18 PAE)
- Java(TM) SE Runtime Environment (build: 1.6.0_11-b03)
- Solr 1.3
- Tomcat 5.5.26
Storage Server
Slow Queries and Common Words (Part 1)
Submitted by Tom Burton-West on Thu, 07/23/2009 - 16:19
All Queries are not created equal
Update on Testing (Memory and Load tests)
Submitted by Tom Burton-West on Wed, 07/15/2009 - 13:56
Since we finished the work described in the Large Scale Search Report, we have made some changes to our test protocol and upgraded our Solr implementations to Solr 1.3. We have completed some testing with increased memory and some preliminary load testing.
The new test protocol has these features

