
Slow Queries and Common Words (Part 1)

Not all queries are created equal

We have been running a test suite of 10,000 warm-up queries and 1,000 test queries against indexes of up to 1 million full-text documents. One striking aspect of our results is that response time varies across queries by several orders of magnitude. For our 1 million volume index, the slowest query took over 2 minutes while the fastest took less than one thousandth of a second. That's a ratio of about 100,000 to 1. The chart below shows the response time for our 1,000 test queries on a log scale, sorted from fastest to slowest.

Response time: 1 Million Volumes, 2 Shards, 2 Machines (log scale)

The good news is that 97% of queries took less than a second. The real problem is in the upper right of the chart: about 1% of queries took more than 7 seconds, and the slowest 0.5% took between 30 seconds and 2 minutes. From a user-experience point of view, this means that most of the time response time will be just fine, but one time in 200 (0.5%) it will be unacceptable. [i]

In our experience so far, response time scales linearly with index size. So when we scale up to 7 million volumes, if we don't split the index over more machines, the 1% of queries that took between 7 seconds and 2 minutes could take roughly seven times as long.

Throughput

While a response time of under 1 second for 97% of queries would seem acceptable, these response times are for queries executed one at a time. In production, however, new queries may arrive faster than Solr can process them. What happens if one of the 0.5% of queries that take 30 seconds to 2 minutes is executing and we then get one additional query per second for the next 30 seconds? Those additional queries will contend with the slow query for resources such as disk access, CPU, and memory. In our load testing we found that as the rate rose above 1 query per second, average response time got significantly worse. So we now have two problems related to slow queries:

  1. With the 1 million document index, one out of 200 queries will have unacceptable response time.  This could get worse as we scale up to 7 million documents.
  2. When multiple simultaneous queries are sent to the server, the slow queries contend for resources with other queries and lower overall throughput.

There is one additional problem: the disk accesses required to process the slow queries cause "cache pollution", which makes subsequent queries slower than they would otherwise be. We will discuss cache pollution shortly.

Why are some queries significantly slower?

The short answer is "more disk seeks". The slowest queries are phrase queries containing commonly occurring words such as "a", "of", "the", "and", "an", etc. Queries with common words take longer because the lists of documents containing those words are larger. For example, in our 1 million document index the word "the" occurs in about 800,000 documents, whereas the word "aardvark" occurs in 1,500 documents. In processing a query, the search engine has to look at the list of document IDs for each word in the query and determine which documents contain all of the words. The longer the lists, the more data must be read from disk into memory and the more processing must be done to compare the lists. (See the appendix in the Large-Scale Search report, or http://nlp.stanford.edu/IR-book/html/htmledition/a-first-take-at-building-an-inverted-index-1.html and http://nlp.stanford.edu/IR-book/html/htmledition/processing-boolean-queries-1.html)
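To make the cost model concrete, here is a small toy sketch (our own illustration, not Lucene's actual implementation) of the merge-style intersection of two sorted postings lists. The work is proportional to the combined length of the two lists, which is exactly why common words with long lists are expensive:

import java.util.ArrayList;
import java.util.List;

public class PostingsIntersect {
    // Intersect two sorted lists of document IDs (the classic merge walk).
    public static List<Integer> intersect(int[] a, int[] b) {
        List<Integer> result = new ArrayList<Integer>();
        int i = 0, j = 0;
        while (i < a.length && j < b.length) {
            if (a[i] == b[j]) {           // this document contains both terms
                result.add(a[i]);
                i++; j++;
            } else if (a[i] < b[j]) {     // advance whichever list is behind
                i++;
            } else {
                j++;
            }
        }
        return result;
    }
}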

Phrase queries take longer than Boolean queries because when the search engine processes a phrase query it needs to ensure not only that all the words in the query are in a document, but that the words occur in the document adjacent to each other and in the same order as in the query. To do this the search engine needs to consult the positions index, which records the position of each occurrence of each word in each document. (See http://nlp.stanford.edu/IR-book/html/htmledition/positional-postings-and-phrase-queries-1.html)
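As a toy illustration of this extra work (again our own sketch, not Lucene's code): once a document is known to contain both words of a two-word phrase, checking the phrase condition means walking the two sorted position lists looking for an occurrence of the second word at the position immediately after an occurrence of the first:

public class PhraseCheck {
    // positions1 and positions2 are the sorted word positions of the two
    // query terms within a single document.
    public static boolean adjacentInOrder(int[] positions1, int[] positions2) {
        int i = 0, j = 0;
        while (i < positions1.length && j < positions2.length) {
            if (positions2[j] == positions1[i] + 1) {
                return true;      // term2 immediately follows term1
            } else if (positions2[j] <= positions1[i]) {
                j++;              // this occurrence of term2 is too early
            } else {
                i++;              // this occurrence of term2 is too far ahead
            }
        }
        return false;
    }
}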

While the size of the postings index is proportional to the number of documents in the index, the size of the positions index is proportional to the total number of words in the index. Thus the number of documents in the collection times the average number of words per document provides a measure of the size of the positions index. The average length of our documents is around 100,000 words: full-text books are far longer than web pages, the document type most common in information retrieval research, so in contrast to most large research test collections our documents are very large. The chart and table below compare the average document size of the HathiTrust collection with several research test collections designed for testing large-scale information retrieval.[ii]
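As a rough back-of-the-envelope check (using the one-byte-per-entry estimate described in endnote [iv]): 1 million documents times roughly 100,000 words per document gives about 10^11 position entries, or on the order of 100GB of positions data, consistent with the positions index accounting for the bulk of our total index size.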

Average Document Size for Test Collections

Collection    Size       Documents     Average Doc Size
HathiTrust    4.5TB*     7 million*    700KB
TREC GOV2     0.456TB    25 million    18KB
SPIRIT        1TB        94 million    10KB
NW100G-04     1.3TB      100 million   16KB

* Projected size by 2010

Because our documents are so long, the positions index in our Solr/Lucene indexes, which is used for processing phrase queries, is about 85% of the total index size. In comparison, the postings index, which is used for Boolean queries and in the initial processing of phrase queries, is about 12% of the total index size.

We can see how these issues play out by looking at the slowest of our test queries, the phrase query "the lives and literature of the beat generation". [iii] The table below shows, for each term, the number of documents containing it, an estimate of the size of its postings list, the total number of times it occurs in the entire 1 million volume corpus, and an estimate of the size of its positions list (for the 1 million document index). Note that the postings list sizes are in kilobytes, while the positions list sizes are in gigabytes.[iv]
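For example, "the" occurs in about 800,000 documents, so at the one-byte-per-entry estimate of endnote [iv] its postings list is roughly 800,000 bytes, or 800KB; it occurs about 4,351 million times in total, so its positions list is roughly 4.35GB.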

Word         Number of Documents   Postings List Size (KB)   Total Occurrences (Millions)   Positions List Size (GB)
the          800,000               800                       4,351                          4.35
of           892,000               892                       2,795                          2.8
and          769,000               769                       1,870                          1.87
literature   453,000               453                       9                              0.13
generation   414,000               414                       5                              0.01
lives        432,000               432                       5                              0.01
beat         278,000               278                       1                              0.01
Total                              4,037                                                    9

To process "the lives and literature of the beat generation" as a Boolean query, about 4 megabytes of data has to be read from cache or disk. To process it as a phrase query, nearly 9 gigabytes of data has to be read from cache or disk.

Cache Pollution

The difference in the amount of data needed to process a normal (Boolean "AND") query and a phrase query has interesting implications for caching. Because reading data from disk is several orders of magnitude slower than reading it from memory, search engines implement caching schemes to keep frequently used data in memory. Solr, the search engine we use for Large-Scale Search, relies on the operating system's disk cache to cache data from the postings and positions lists. This is why two of the most frequently suggested performance tuning tips on the Solr mailing list are:

  1. Make sure that the server has enough memory to fit most of the index into memory.
  2. Run a large number of "realistic" queries against the server to warm the cache before doing performance testing.

(See: http://www.nabble.com/Re%3A-Help-optimizing-p17074425.html)

Our 1 million volume index is about 225GB, which means that only a fraction of the index will fit into the total of 32GB of memory in our test machines. Once the cache is full, each further query causes some data to be evicted from the cache. This is the normal way caches work, and the end result, after a reasonable number of queries have been run, is that the data for the words occurring most frequently in queries tends to stay in the cache.

Phrase queries are generally between 3% and 10% of all queries (Spink & Jansen, 2005), and about 5% of the queries in our test sets. From the analysis of our slowest query above, we can estimate that the upper limit on the amount of data that must be read for a Boolean query is a few megabytes, whereas the upper limit for a phrase query is around 9GB. The total memory available for caching the postings and positions lists in our system is about 30GB (we allocate 2GB to Tomcat and Solr). Since Boolean queries occur much more frequently and cache smaller amounts of data, a phrase query containing common words can evict the postings data of a very large number of Boolean queries from the cache. The end result is that the phrase query with common words "pollutes" the cache, and a large number of subsequent Boolean queries that might have read their data from the cache must instead re-read it from disk. Thus, slow phrase queries increase the response time for a large number of subsequent Boolean queries.
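To put rough numbers on that: at about 4MB of postings per worst-case Boolean query, a single 9GB phrase query can displace the cached data of over 2,000 such Boolean queries (9GB / 4MB ≈ 2,250), nearly a third of the 30GB cache, in one pass.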

Summary

The slowest queries are phrase queries containing common words. These queries are slow because the on-disk positions index entries for common terms are very large and disk seeks are slow. These long positions index entries cause three problems relating to overall response time:

  1. With the 1 million document index, one out of 200 queries will have unacceptable response time.  This could get worse as we scale up to 7 million documents.
  2. When multiple simultaneous queries are sent to the server, the slow queries contend for resources with other queries and lower overall throughput.
  3. Since the positions lists for common words are so large, phrase queries containing common words pollute the cache and cause a large number of subsequent queries to be slower than they would otherwise be, increasing the average response time for all queries.

Coming next: dealing with common words (stop words, bigrams, and CommonGrams)

 


References

Spink, A., & Jansen, B. J. (2005). Web Search: Public Searching of the Web. Springer Netherlands. http://dx.doi.org/10.1007/1-4020-2269-7


Endnotes

[i] General guidelines for computer interfaces suggest that response times of more than 10 seconds are distracting to users; how long a user is willing to wait varies.

[ii] To be fair, it must be noted that research test collections may limit the total collection size in order to facilitate sharing the collections among researchers. A possible exception is the collection for this year's TREC web track, ClueWeb09, which features a 25TB collection of 1 billion web pages. This is certainly a very large collection; however, its average document size is only about 24KB. (See: http://boston.lti.cs.cmu.edu/Data/clueweb09/.) The recent INEX Book track is the closest match to HathiTrust in document size; however, its total collection is 50,000 documents with a total text size of about 50GB. See Kazai et al. (2008), "Overview of the INEX 2008 Book Track", in INEX 2008 Workshop Pre-Proceedings, available at: http://www.inex.otago.ac.nz/data/proceedings/INEX2008-preproceedings.pdf. For more background on the INEX Book track see: http://www.inex.otago.ac.nz/tracks/books/books.asp and http://research.microsoft.com/en-us/projects/booksearch/default.aspx.

[iii] This is the subtitle of a book: "Naked Angels: The Lives and Literature of the Beat Generation"

[iv] We wrote a command line utility to read the index and output the number of documents containing a term and the total number of occurrences of the term in the corpus (https://issues.apache.org/jira/browse/LUCENE-2393). To estimate the sizes of the postings and positions lists, we simply multiplied the count (the number of documents, or the number of occurrences) by 1 byte. In reality the index stores only the difference between a document ID and the preceding document ID in the list, encoded in a variable-byte encoding using from 1 to 4 bytes; the same logic holds for word positions. Our procedure probably underestimates the sizes, since one byte only holds values from 0 to 127 and it's likely that the gap between successive positions of a word is greater than 127 for many words. For more information on the data structures used in the Lucene index see Apache Lucene - Index File Formats.
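To make the variable-byte idea concrete, here is a small sketch of the encoding scheme described in the Lucene file-format documentation (our own illustration, not Lucene's source code):

import java.io.ByteArrayOutputStream;

public class VIntDemo {
    // Encode a non-negative int in Lucene-style variable-byte (VInt) form:
    // seven bits per byte, low-order bytes first, with the high bit set on
    // every byte except the last to signal that more bytes follow.
    public static byte[] encodeVInt(int value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((value & ~0x7F) != 0) {        // more than 7 bits remain
            out.write((value & 0x7F) | 0x80); // emit 7 bits, set continuation bit
            value >>>= 7;
        }
        out.write(value);                     // final byte, high bit clear
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // Gaps up to 127 fit in one byte; larger gaps (common for word
        // positions in 100,000-word documents) need two or more bytes.
        System.out.println(encodeVInt(127).length);  // prints 1
        System.out.println(encodeVInt(128).length);  // prints 2
    }
}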



Posting/position list size computation

Hi Tom, I just want a little clarification: do you iterate over all the terms for one field? Regards, Ahsan Iqbal

Re: Posting/position list size computation

Hi Ahsan,

The stats on the size of the postings are for only one field. We have one field that contains all the OCR for a document (about 800KB) and a number of very small metadata fields based on the MARC bibliographic metadata; those other fields average 1KB or less. We are presently only searching against the OCR field, so all the calculations are for that one field. The tool we use to get the total number of occurrences of a term is now part of lucene/contrib. If you use it with the -t flag, it will count the total number of occurrences for the top N most frequent terms, which is a proxy for the size of the positions index. You can also specify the field as well as N.

java org.apache.lucene.misc.HighFreqTerms [-t] [number_terms] [field]

-t: include totalTermFreq

See:

http://svn.apache.org/viewvc/lucene/dev/branches/branch_3x/lucene/contri...

Tom

Info Required regarding implementation of Common Grams

Hi All, I need some clarification regarding the implementation of CommonGrams. I have an index of approximately 250GB that contains almost 7 million documents, and some phrase queries (involving stop words) take a frustratingly long time to execute. After reading your very helpful post I decided to implement CommonGrams. I built a test index of almost 20,000 documents without the CommonGrams approach; its size is 362MB. I then took 65 common words from the test index, built a commonwords.txt, and indexed the same 20,000 documents with the CommonGrams approach. I noticed three things:

  1. The index size with the CommonGrams approach is almost double (702MB).
  2. Queries involving common words are slower against the CommonGrams index.
  3. When I perform a search involving common words, like "+Contents:of", the CommonGrams index retrieves fewer documents than the one without CommonGrams.

Please guide me if I am missing something.

Re: Info Required regarding implementation of Common Grams

Hi Ahsan,

The index size increase is an expected result, since any common word gets indexed 3 times: once by itself, once with the preceding word, and once with the following word. We have seen a 50-60% increase rather than a doubling, but it probably depends on your input data.
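For example (using "_" as the joining character, the CommonGramsFilter default), the text "lives of the beat" with "of" and "the" as common words should produce the tokens lives, lives_of, of, of_the, the, the_beat, and beat: each common word appears by itself and inside the bigrams with its neighbors.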

Response time should not be slower. Can you give some example queries? You might also want to look at the queries in the Solr analysis tool and/or run a debug query to see if you can determine why they are slower.

Are you saying a one word query for the word "of" retrieves fewer documents than the index without commongrams? Are you doing any stemming or other processing?

Tom

Tom: Yes, I think this could

Tom: Yes, I think this could find its way into some Solr Admin tool. It would be nice to see this in JIRA, where we could figure out how to make the most use of this.

Posting/position list size computation?

Very useful, Tom. How did you compute the posting list and position list sizes for individual terms? Thanks.

Posting/position list size computation

Hi Otis,

I wrote a command line utility that takes a term as an argument and outputs the df (docFreq) of the term and the total number of occurrences of the term in the index. The df is related to the size of the postings list, since the postings list has to encode the document ID for each document containing the term. The total number of occurrences is related to the size of the positions index entry for that term, since the positions index needs to list the position of each occurrence. I assumed 1 byte per docID or position as an estimate. This probably underestimates the size, since the data are stored as VInts, which can be 1-4 bytes, as explained in endnote iv.

The code:

  • reads the term from the args[] array
  • opens an IndexReader
  • gets the df by calling IndexReader.docFreq(term)
  • gets a TermDocs enumeration by calling IndexReader.termDocs(term)
  • iterates through the TermDocs enumeration, totaling up the tf counts for each document:

TermDocs termDocs = reader.termDocs(term);  // the enumeration from the previous step
while (termDocs.next()) {                   // advance to the next document containing the term
	total_tf += termDocs.freq();        // add that document's term frequency to the total
}

The code needs some clean up but I could post it to JIRA if you think others might want to use it.
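In the meantime, putting those steps together, a minimal self-contained version might look like the following (a sketch against the Lucene 3.x-era API, not necessarily the exact code we ran):

import java.io.File;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;
import org.apache.lucene.store.FSDirectory;

public class TermStats {
    // Usage: java TermStats <indexDir> <field> <term>
    public static void main(String[] args) throws Exception {
        IndexReader reader = IndexReader.open(FSDirectory.open(new File(args[0])));
        Term term = new Term(args[1], args[2]);

        // Number of documents containing the term (proxy for postings list size).
        System.out.println("docFreq: " + reader.docFreq(term));

        // Total occurrences across all documents (proxy for positions list size).
        long totalTf = 0;
        TermDocs termDocs = reader.termDocs(term);
        while (termDocs.next()) {
            totalTf += termDocs.freq();
        }
        System.out.println("totalTermFreq: " + totalTf);

        termDocs.close();
        reader.close();
    }
}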
