In part 1 we talked about why some queries are slow and the effect of these slow queries on overall performance. The slowest queries are phrase queries containing common words. These queries are slow because the size of the positions index for common terms on disk is very large and disk seeks are slow. These long positions index entries cause three problems relating to overall response time:
- With the 1 million document index, one out of 200 queries will have unacceptable response time. This could get worse as we scale up to 7 million documents.
- When multiple simultaneous queries are sent to the server, the slow queries contend for resources with other queries and lower overall throughput.
- Since the positions lists for common words are so large, phrase queries containing common words pollute the cache and cause a large number of subsequent queries to be slower than they would otherwise be, increasing the average response time for all queries.
There are several strategies for dealing with the problems caused by the size of the positions index entries for common terms.
One approach to dealing with slow queries due to common words is to create a list of “stop words” and process both incoming queries and the indexing process to ignore those words. This strategy was employed in the early days of information retrieval in order to keep index sizes relatively small. In general “stop words” are frequently occurring words that don’t add much specificity to the query. For example, since the word “the” occurs in 80% of the documents, it does not help much to narrow down search results. However, in a phrase query, a stop word or series of stop words may be important to the query and in combination they may result in a phrase which is very specific. One such phrase is “to be or not to be”. Some other queries that would be difficult or impossible if stop words are removed from queries and the index are[i]:
- “the who”
- “man on the moon” vs “man in the moon”
- “being there”
We have an additional problem in that we are currently indexing multiple languages in one index. A stop word in one language is often a content word in another. For example the German word “die” is a stopword in German but a content word in English. The words: “is” and “by” are stopwords in English but content words in Swedish (ice and village respectively). Because of these considerations, we would like to avoid removing stopwords from our index.
Bigrams and CommonGrams
There are several approaches to dealing with optimizing the performance of phrase queries containing common terms other than removing stop words. Some of them such as nextword indexes and phrase indexes(Chang & Poon, 2008; Williams, Zobel, & Bahle, 2004)), involve special indexing data structures and custom query processing, which would require serious hacking deep in the bowels of Solr/Lucene to implement. However there are two approaches that can be easily implemented as custom filters in Lucene. They are bigrams, (The Solr Shingle filter implements bigrams) and CommonGrams. The CommonGrams approach has been implemented in the Nutch project, an open-source web crawler and search engine based on Lucene. A similar approach is used in the California Digital Library’s XTF .
Bigrams work by combining every two words into one token. For example the slow query: “The lives and literature of the beat generation” would generate the following tokens:
“the-lives” “lives-and” “and-literature” “literature-of” “of-the” “the-beat” “beat-generation”.
The advantage of this is twofold. First, for a two word phrase query such as “beat generation”, the position index does not need to be consulted since the two words are indexed as one token. ( Studies of web logs have found that a very large proportion of phrase queries are two word queries.) Second, for phrase queries containing three or more words the position lists are shorter. (Phrase queries containing more than two words require consulting the position index since a single bigram only can match 2 of the 3 or more words of a query.) As an example of the comparitive size of the position list for a regular index compared to a bigram index, the position list for the token “the-lives” will be much shorter than the position list for “the”. The disadvantage of the simple bigram approach is that each word (with the exception of the first and last word in each document) gets indexed 3 times: Once by itself (all words are also indexed as unigrams in order to allow single term queries), once as the first word of a two word pair, and once as the second word of a two word pair. This means the size of the index is three times as large as an index without bigrams. Since we are mostly concerned with the position lists for very common words, we can get almost the same performance benefit as bigrams by only creating bigrams for two word pairs where one or both of the words is a stop word (or common word). To implement this approach we ported the Nutch CommonGrams algorithm to Solr : (https://issues.apache.org/jira/browse/SOLR-908.) We implemented a test index of 500,000 volumes using CommonGrams and compared it with a standard 500,000 volume index. The CommonGrams index (using 32 common words based on the Solr/Lucene StopFilter list), was 40% larger than the standard index.
The two charts below show the total number of occurrences of a token in the 500,000 volume index for the standard index and the index with CommonGrams.[ii]
|Term ||Total occurrences in corpus (millions)||Number of Docs (thousands)|
Total occurrences in corpus (millions)
|Number of Docs (thousands)|
The 3 most common terms in the query: “the”,”of” and “and” occurred a total of almost 4 billion times in the standard index. In contrast the most frequent term in the CommonGrams version of the query “of-the” occurred 446 million times with the rest of the terms all under 3 million times.
We ran tests against the two indexes. Below is a chart showing the improvements in query response time for a 500,000 document index using the standard index and for a 500,000 document index using CommonGrams with 32 common words used in the standard lucene/Solr stopfilter. The average response time went from 459ms to 68ms.
For the slowest 0.5% of the queries (queries taking from 10 seconds to 2 minutes with the standard index) the CommonGrams index reduced them all to below 8 seconds.
Comparison of 500 thousand document indexes (Response time in milliseconds)
Standard 500K Docs
CommonGrams 500K Docs
The chart below which is plotted on a logarithmic scale shows that the ratio of response times varies from a reduction of about 50% down to a reduction of 95%. For a detailed breakdown see Appendix 1
The chart below which is also plotted on a logarithmic scale, shows the slowest %1 of the queries. The slowest 0.5% of the queries were between 10 seconds and 2 minutes for the Standard index, but were reduced to between 5 and 10 seconds for the CommonGrams index.
If we add more words to the list of common words, we should get better performance, but the size of the index will increase. The CommonGrams index built with a 32 word list of common words is about 40% larger than the Standard index. We are currently experimenting to try to characterize the trade-off in increased performance with more words added to the common words list versus the increased index size.
Chang, M., & Poon, C. K. (2008). Efficient phrase querying with common phrase index. Information Processing & Management, 44(2), 756-769. doi: 10.1016/j.ipm.2007.06.003.
Williams, H. E., Zobel, J., & Bahle, D. (2004). Fast phrase querying with combined indexes. ACM Trans. Inf. Syst., 22(4), 573-594. doi: 10.1145/1028099.1028102.
[i] Walter Underwood the software engineer responsible for the Solr implementation at Netflix has a list of examples based on movie titles: http://wunderwood.org/most_casual_observer/2007/05/invisible_titles.html
[ii] We wrote a command line utility to read the index and output the number of documents containing a term and the total number of occurrences of the term in the corpus. To estimate the sizes of the postings and positions list, we simply multiplied the number (of the number of documents, or number of occurrences) by 1 byte. Actually the index stores only the difference between a document id and the document id preceding it in the list encoded in a variable byte encoding using from 1 to 4 bytes. The same logic holds for word positions. Our procedure probably underestimates the sizes as one byte only holds values from 0 to 127 and its likely that the difference between the position of a word and the subsequent position of the next occurrence of the word would be greater than 127 for many words. For more information on the data structures used in the Lucene index see Apache Lucene - Index File Formats