Available Indexes

Too Many Words!

When we read that the Lucene index format used by Solr has a limit of 2.1 billion unique words per index segment, we didn't think we had to worry. However, a couple of weeks ago, after we optimized the indexes on each shard down to one segment, we started seeing Java "ArrayIndexOutOfBounds" exceptions in our logs. After a bit of investigation we determined that most of our index shards did indeed contain over 2.1 billion unique words, and some queries were triggering these exceptions. Currently each shard indexes a little over half a million documents.

Running the Lucene CheckIndex tool against one of the indexes showed we had 555,000 documents with a total of 97 billion words, of which 2.49 billion were unique. The OCR for the 555,000 documents totals somewhere between 450 GB and 600 GB. In comparison, Williams and Zobel analyzed 45 GB of web documents containing a total of 2 billion words and found only 9.74 *million* unique words.[1]

Some of the factors contributing to the large number of unique terms in our index are:

  1. We index materials in over 200 languages.
  2. Dirty OCR probably contributes a significant number of "unique words".
  3. CommonGrams increase the number of unique words. (For example, any word preceding or following the word "the" creates a new "word".)
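To make the third point concrete, here is a simplified sketch of the CommonGrams idea. It is not the actual Solr CommonGramsFilter (which works as a token filter and also handles positions); it just shows how every distinct neighbor of a common word like "the" becomes a new term in the index:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

public class CommonGramsSketch {
    // Illustrative stand-in for a common-words list; the real filter
    // reads its list from configuration.
    static final Set<String> COMMON = Set.of("the", "of", "and");

    // Emits each word, plus a fused bigram whenever either member of
    // a word pair is a common word. Every distinct word that appears
    // next to "the" therefore produces a brand-new index term.
    static List<String> tokens(String text) {
        List<String> out = new ArrayList<>();
        String[] words = text.split("\\s+");
        for (int i = 0; i < words.length; i++) {
            out.add(words[i]);
            if (i + 1 < words.length
                    && (COMMON.contains(words[i]) || COMMON.contains(words[i + 1]))) {
                out.add(words[i] + "_" + words[i + 1]);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // "over_the" and "the_lazy" exist as terms only because of "the".
        System.out.println(tokens("over the lazy dog"));
    }
}
```

Run against 97 billion words of text in 200+ languages, this pairing effect multiplies the unique-term count well beyond what the raw vocabulary alone would produce.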

The Solr/Lucene index stores term information sorted lexicographically. To trigger the Java exception, we just needed a query containing a term that sorts after the 2,147,483,648th term in the index. We tried "zebra" and "zoo", but neither triggered the exception.

After a bit of digging in the log files, we found a query containing a Korean term that consistently triggered the exception, and used that for testing. We re-read the index documentation and realized that the index entries are sorted lexicographically by UTF-16 character code. (Korean is written in Hangul, which sits near the end of the Unicode BMP and therefore occupies a very high UTF-16 code range.)
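The sort order is easy to demonstrate, since Java's String.compareTo uses the same UTF-16 code-unit ordering. (The actual Korean term from the query isn't reproduced here; "한글" is just an illustrative Hangul string.)

```java
public class TermOrder {
    public static void main(String[] args) {
        // Lucene (in this era) sorts terms by UTF-16 code unit, the same
        // order Java's String.compareTo uses. Hangul syllables occupy
        // U+AC00..U+D7AF, far above the Latin range around U+0041..U+007A,
        // so any Hangul term lands much deeper in the term dictionary
        // than "zebra" or "zoo".
        String zoo = "zoo";             // 'z' is U+007A
        String hangul = "\uD55C\uAE00"; // "한글", starting at U+D55C
        System.out.println(zoo.compareTo(hangul) < 0); // "zoo" sorts first
    }
}
```

This is why Latin-script probes like "zebra" never reached the overflowing region of the term dictionary, while a Korean term did.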

We posted this issue to the Solr mailing list (http://old.nabble.com/TermInfosReader.get-ArrayIndexOutOfBoundsException-tt27506243.html) and did some testing. After some back and forth on the list, Michael McCandless, one of the Lucene committers and a co-author of Lucene in Action, 2nd edition, produced a patch that raises the limit to about 274 billion unique terms (https://issues.apache.org/jira/browse/LUCENE-2257). We did some testing to confirm that the patch worked, and Michael then committed it for Lucene 2.9.2, 3.0.1, and 3.1. Hooray for open source!

We just put the patch into production and ran a series of tests confirming that it has solved the problem. Although we may increase the number of documents indexed in each segment from about half a million to one or two million, we certainly don't plan to increase the number of documents indexed by a factor of 100 (2.1 billion vs. 274 billion terms).

This incident did get us thinking about how we might reduce the amount of dirty OCR in our index, and we may implement some filtering in our index pre-processing in the future.  Whatever heuristics we use to remove the dirty OCR have to work across all 200+ languages, so this may be challenging.
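As a sketch of how hard that filtering is, here is one hypothetical heuristic (not anything we run in production): reject tokens that are implausibly long or that mix multiple Unicode scripts, two patterns common in OCR noise. The length threshold is arbitrary, and the mixed-script test would wrongly flag languages such as Japanese that legitimately mix scripts, which illustrates why a single rule won't work across 200+ languages:

```java
public class OcrNoiseSketch {
    // Hypothetical dirty-OCR detector: flags very long tokens and
    // tokens whose letters come from more than one Unicode script.
    // Both rules misfire on real languages (long German compounds,
    // mixed-script Japanese), so this is a sketch of the problem,
    // not a proposed solution.
    static boolean looksLikeOcrNoise(String token) {
        if (token.length() > 25) return true; // arbitrary length cutoff
        Character.UnicodeScript first = null;
        for (int i = 0; i < token.length(); ) {
            int cp = token.codePointAt(i);
            if (Character.isLetter(cp)) {
                Character.UnicodeScript s = Character.UnicodeScript.of(cp);
                if (s != Character.UnicodeScript.COMMON
                        && s != Character.UnicodeScript.INHERITED) {
                    if (first == null) first = s;
                    else if (s != first) return true; // mixed scripts
                }
            }
            i += Character.charCount(cp);
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(looksLikeOcrNoise("zebra")); // clean Latin token
        System.out.println(looksLikeOcrNoise("zeЬra")); // Latin + Cyrillic 'Ь'
    }
}
```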




[1] H. E. Williams and J. Zobel, "Searchable words on the web", International Journal on Digital Libraries, Springer, 5(2):99-105, 2005. http://ww2.cs.mu.oz.au/~jz/fulltext/ijodl05.pdf



Yes, this might seem surprising, but there are a lot of words out there. Still, billions of unique words from only 555,000 documents is striking. In comparison, I recently indexed ClueWeb09 Subset B, a collection of 50 million mostly English pages, and collected only around 200 million unique words consisting solely of Latin letters and digits (i.e., I ignored words with non-ASCII characters). I would suggest that n-gram sequences are the major contributor to the number of unique words in your case.
