At the core of the Solr/Lucene search engine is an inverted index. The inverted index has a list of tokens and, for each token, a list of the documents that contain it. In order to index text, Solr needs to break strings of text into “tokens.” In English and Western European languages, spaces are used to separate words, so Solr uses whitespace to determine what is a token for indexing. In a number of languages, however, words are not separated by spaces.
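As a rough illustration of the idea, here is a minimal Python sketch of an inverted index built by splitting on whitespace. It is not Solr's or Lucene's actual implementation: the function name `build_inverted_index`, the lowercasing step, and the sample documents are assumptions made for the example.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each whitespace-delimited token to the sorted ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        # Assumption: lowercase and split on whitespace; real analyzers do much more.
        for token in text.lower().split():
            index[token].add(doc_id)
    return {token: sorted(ids) for token, ids in index.items()}

# Hypothetical sample documents.
docs = {1: "forty days and forty nights", 2: "forty million books"}
index = build_inverted_index(docs)
print(index["forty"])  # [1, 2]
print(index["books"])  # [2]
```

A string written without spaces between words comes out of `split()` as a single token, which is why whitespace tokenization alone is not enough for such languages.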
Forty days forty nights: Re-indexing 7+ million books (part 1)
Forty days and forty nights: that’s how long we estimated it would take to re-index all 7+ million volumes in HathiTrust. Because of this forty-day turnaround time, when we found a problem with our current indexing, we were reluctant to do a complete re-index. Whenever feasible, we would just re-index the affected materials.
After Mike McCandless increased the limit of unique words in a Lucene/Solr index segment from 2.1 billion words to around 274 billion words, we thought we didn't need to worry about having too many words (see http://www.hathitrust.org/blogs/large-scale-search/too-many-words). We recently discovered that we were wrong!
We just released a new feature in our full-text Large Scale Search. When you do a search, you will see check boxes next to each search result. You can select the items you want from the search results and create a personal collection. This should make it much easier to do repeated searches and explore a targeted subset of the HathiTrust volumes. If you are not logged in, the collection will be temporary. If you log in, you can save the collection permanently.
When we read that the Lucene index format used by Solr has a limit of 2.1 billion unique words per index segment, we didn't think we had to worry. However, a couple of weeks ago, after we optimized our indexes on each shard to one segment, we started seeing Java "ArrayIndexOutOfBounds" exceptions in our logs. After a bit of investigation we determined that, indeed, most of our index shards contained over 2.1 billion unique words and some queries were triggering these exceptions. Currently ea
On November 19, 2009, we put new hardware into production to provide full-text searching against about 4.6 million volumes. Currently we have about 5.3 million volumes. The average response time is about 3 seconds, 90% of queries take under 4 seconds, 9% of queries take between 4 seconds and 24 seconds, and 1% of queries take longer than 24 seconds.
To scale up from 500,000 volumes of full-text to 5 million, we decided to use Solr’s distributed search feature, which allows us to split up an index into a number of separate indexes (called “shards”). The shards are searched in parallel and the results are then aggregated, so performance is better than with a single very large index.
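The search-in-parallel-then-aggregate pattern can be sketched in a few lines of Python. This is only an illustration of the general idea, not Solr's distributed search code: the in-memory shards, the term-count scoring, and the function names `search_shard` and `distributed_search` are all hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
import heapq

def search_shard(shard, term, k):
    """Return one shard's top-k (score, doc_id) hits; the score is a toy term count."""
    hits = [(text.lower().split().count(term), doc_id) for doc_id, text in shard.items()]
    return heapq.nlargest(k, [hit for hit in hits if hit[0] > 0])

def distributed_search(shards, term, k=10):
    """Query every shard in parallel, then merge the per-shard results into a global top-k."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        per_shard = list(pool.map(lambda shard: search_shard(shard, term, k), shards))
    merged = [hit for hits in per_shard for hit in hits]
    return heapq.nlargest(k, merged)

# Hypothetical shards, each a small {doc_id: text} index.
shards = [
    {"vol1": "beat generation literature", "vol2": "the lives of poets"},
    {"vol3": "beat beat beat", "vol4": "generation gap"},
]
print(distributed_search(shards, "beat", k=2))  # [(3, 'vol3'), (1, 'vol1')]
```

Each shard only needs to return its own top k hits, so the aggregation step handles at most k results per shard no matter how large the shards are.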
Sizing the shards
On November 19, 2009, we put new hardware into production to provide full-text searching against about 4.6 million volumes. Currently we have about 5.3 million volumes indexed. Below is a brief description of our current production hardware. Future posts will give details about performance and background on our experiments with different system architectures and configurations.
Solr Server configuration
Before we implemented the CommonGrams index, our slowest query with the standard index was “the lives and literature of the beat generation”, which took about 2 minutes on the 500,000-volume index. Once we implemented the CommonGrams index, that query took only 3.6 seconds.
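To give a feel for what a CommonGrams-style index stores, here is a simplified Python sketch that emits every word plus a joined two-word token whenever either word of a pair is a common word. It is an approximation for illustration only, not the Solr filter itself: the function name `common_grams` and the three-word common list are assumptions, and a production list would be chosen from the actual corpus.

```python
def common_grams(tokens, common_words):
    """Emit each token, plus a 'left_right' bigram when either neighbor is a common word."""
    out = []
    for i, tok in enumerate(tokens):
        out.append(tok)
        if i + 1 < len(tokens):
            nxt = tokens[i + 1]
            if tok in common_words or nxt in common_words:
                out.append(tok + "_" + nxt)
    return out

# Assumed common-word list for this example only.
common = {"the", "and", "of"}
query = "the lives and literature of the beat generation".split()
print(common_grams(query, common))
```

A pair such as "of_the" occurs in far fewer places than "of" or "the" alone, so a phrase query that matches on the joined tokens has much shorter lists to read. That is the intuition behind the speedup.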