When we first started working on large scale search, we confronted the issue of whether to use pages or complete books as our fundamental unit of indexing.[i] We had some concerns about indexing at the page level. We knew we would need to scale to 10-20 million books, and at an average of 300 pages per book that comes out to about 6 billion pages. At that time we did not think that Solr would scale to 6 billion pages.[ii] If we indexed by page, we also wanted to be able to present the user with the option of switching back and forth between seeing their search results as a list of relevance-ranked pages or a list of relevance-ranked books. We didn’t know how we would efficiently group result sets by book if we indexed by page. We looked at Tricia Williams’s solution (SOLR-380), but at the time it seemed that development was stalled.[iii]
We decided to use a two-tier indexing architecture. In the first tier, each of the 10 million books is indexed in its entirety as a single Solr document. When a user clicks on a link to a particular document in the search results list, the search application activates the second indexing tier, which indexes that document on the fly at the page level. Once the document is indexed in the second-tier index, the search application searches the newly created index to return page-level hits within the document.[iv]
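The two-tier flow above can be sketched in a few lines. This is a minimal illustration, not HathiTrust’s actual code: `BOOK_STORE` stands in for the repository of full-text volumes, and `PAGE_INDEX` plays the role of the on-the-fly, page-level Solr index.

```python
# Illustrative sketch of the two-tier "search within a book" flow.
# All names here are assumptions for the sake of the example.

BOOK_STORE = {
    "mdp.001": ["first page text", "second page text"],
}
PAGE_INDEX = {}  # book_id -> list of page-level documents

def index_book(book_id):
    """Second tier: index every page of one book as its own document."""
    PAGE_INDEX[book_id] = [
        {"book_id": book_id, "page": i + 1, "text": text}
        for i, text in enumerate(BOOK_STORE[book_id])
    ]

def search_within_book(book_id, term):
    """Index the book on the fly if needed, then return matching page numbers."""
    if book_id not in PAGE_INDEX:      # second-tier miss: index on the fly
        index_book(book_id)
    return [d["page"] for d in PAGE_INDEX[book_id] if term in d["text"]]
```

The key property is that the expensive page-level indexing happens lazily, only for books a user actually opens.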
In the interests of getting the project up and running, for the second-tier index we decided to use our home-grown search software based on the OpenText xpat search engine (which we have been using internally for many years in our Digital Library Software, DLXS).
This “temporary” solution worked well for several years, but as we added new features to large scale search, it was getting harder and harder to make search results consistent between the Solr-based large scale search and the xpat-based “search within a book.”
For example, when we announced in January 2011 that we had made changes to large scale search to better accommodate non-Latin languages, we received feedback indicating that there were some problems with the “search within a book” feature. The problems were due to differences between the capabilities of our aging xpat-based search engine, which powers “search within a book”, and Solr, which powers large scale search. A user, excited that we had improved Hindi language searching, discovered that although they could get hits when searching all 10 million books (large scale search), when they clicked on any of the links to the individual volumes in the search results list (“search within a book”), they got the message: “No pages matched your search for काव्य in this item.”
While Solr can index any language and script, our xpat-based search engine was not able to index Devanagari, the script used for Hindi. We had recently added support for Arabic to xpat, but the idea of adding support for all the Unicode blocks that Solr supports was daunting. An additional factor was that the xpat software consists of ancient C++ code, and we have had increasing difficulty compiling it with modern compilers. Another problem with the xpat search engine is that it does not do relevance ranking. Relevance ranking for “search within a book” was one of the top items in the list of high-priority features for HathiTrust full-text search produced by the HathiTrust Full-Text Search working group.
Replacing the xpat-based search engine with Solr for “search within a book”
We decided to replace the xpat back-end with Solr. With xpat we had been creating a separate index for each book on the fly. With Solr we planned to add each book on the fly to one central page-level index. We added a “book_id” field to the Solr document for each page of a book so that the search application could use a Solr filter query on “book_id” to limit search results to a particular book, thus producing a “search within a book” type of search.
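As a hedged sketch of what such a request might look like, the function below builds the parameters for a page-level Solr query restricted to one book via a filter query (`fq`). The field names (`book_id`, `page_no`) are illustrative, not our production schema.

```python
# Sketch: restricting a page-level search to one book with a Solr
# filter query. Field names are assumptions for illustration.

def build_page_query(book_id, user_query):
    """Solr request params for a 'search within a book' query."""
    return {
        "q": user_query,                    # the user's search terms
        "fq": 'book_id:"%s"' % book_id,     # cacheable restriction to one book
        "fl": "page_no,score",              # return page numbers and scores
        "sort": "score desc",               # relevance-ranked pages
    }
```

A filter query is a good fit here because it does not affect relevance scoring and Solr can cache it, so repeated searches within the same book reuse the cached book restriction.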
With large scale search, in order to get the fastest possible response time when searching 10 million volumes, we index and optimize the index overnight on separate machines from the machines serving the search.[v] The next morning, after the index is built and optimized, we mount the optimized index on separate servers for searching.[vi]
For “search within a book” we index books on the fly, so we needed to pay careful attention to the trade-off between fast indexing and fast searching. We wanted to tune the indexing to minimize the time it takes to make a book searchable, as opposed to maximizing overall indexing throughput or tuning for search latency and throughput. (See Lucene in Action 11.1.3, “Tuning for index-to-search delay”.)[vii]
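One common way (available since Solr 1.4) to bound index-to-search delay is the `commitWithin` parameter on each update request, which lets Solr batch nearby adds into a single commit while guaranteeing the new pages become searchable within the deadline. The sketch below builds such a request; the path, deadline, and field names are assumptions, not our production settings.

```python
# Sketch: a Solr update request that bounds index-to-search delay
# with commitWithin. Path, field names, and values are illustrative.

def page_update_request(pages, commit_within_ms=2000):
    """Build the path, params, and body of a Solr update for one book's pages."""
    return {
        "path": "/solr/pages/update",
        # Solr commits these docs within this many milliseconds,
        # batching concurrent adds into one commit.
        "params": {"commitWithin": str(commit_within_ms)},
        "body": [
            {"book_id": p["book_id"], "page": p["page"], "ocr": p["text"]}
            for p in pages
        ],
    }
```

Tuning this deadline is exactly the trade-off described above: a shorter deadline means a shorter wait before the book is searchable, at the cost of more frequent commits.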
In order to keep response time fast, we wanted to avoid large merges of the “search within a book” index. Originally we planned to set up a merge policy that would prevent large merges and to run an optimize in the wee hours of the night. However, even with this relatively small index (compared to our terabyte-size large scale search shard indexes), we found that optimizing took too long, and search performance during the optimize slowed to an unacceptable level. Instead of optimizing the index, we decided to simply delete it every night, which is very fast compared to optimizing. The trade-off is that someone searching just after the index is deleted has to wait a few seconds for their book to be re-indexed. Deleting the index has a smaller worst-case impact than having searches slow to a crawl during an optimize.[viii] The average re-indexing time for a book is around 4-6 seconds. Deleting the index every night also keeps the overall index size small.
Because the index is deleted nightly, it is relatively small in size, which allows us to implement features that might not scale up to all 10 million books. The maximum size of the index averages about 15GB and contains between 1,000 and 2,000 books (300,000-600,000 pages), depending on the volume of queries for different books during the day. (After the index is deleted, it starts empty and grows in size over 24 hours to reach its maximum size at the end of the day.)
Advantages of the two-tier indexing scheme
There are a number of scalability and performance advantages to this two-tier indexing strategy. For example, to provide snippets (Solr highlighting), the entire document has to be stored in the index. If we did this in our seven-terabyte index of all 10 million books, the size of the index would more than double, to somewhere over 14 terabytes. We also suspect the Solr highlighting algorithm would not work well with documents containing an average of 100,000 words. With the separate page-level index, the entire index, including the stored content, comes to around 15GB, so snippets are feasible.[ix]
We are currently investigating a number of features which would be feasible to implement in “search within a book” but are not feasible in large scale search.
There are a number of features that are normally implemented in Solr by using the copyField feature to copy the data to a different field and then process the copied field in a different way. For example, if we wanted to provide the option of case-sensitive searching, we would create two fields: in one of the fields we would lower-case the text and in the other field keep the case as is. This would allow us to provide a default of case-insensitive searching, while still allowing users the option of switching to case-sensitive searching. We don’t use these features for the full-text field in large scale search because each copyField would increase the index size by another seven terabytes. However, we are considering implementing a number of features which require the use of copyField in the “search within a book” search.
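As an illustration of the copyField approach described above, a schema fragment along the following lines could support optional case-sensitive searching. This is a sketch only; the field and type names are invented for the example and are not our production schema.

```xml
<!-- Illustrative schema.xml fragment for optional case-sensitive search.
     Field and type names are assumptions, not HathiTrust's schema. -->
<field name="ocr"       type="text_lowercased" indexed="true" stored="true"/>
<field name="ocr_exact" type="text_exactcase"  indexed="true" stored="false"/>

<!-- copyField duplicates the raw page text into the second field at index
     time, roughly doubling this field's contribution to index size -->
<copyField source="ocr" dest="ocr_exact"/>
```

The application would then query `ocr` by default and switch to `ocr_exact` when the user asks for case-sensitive matching. The doubling of the field’s index footprint is exactly why this is feasible for the small “search within a book” index but not for the seven-terabyte large scale search index.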
Another scalability issue involves the size of the terms index for large scale search. The terms index for each shard of the large scale search index contains over 3 billion unique terms, the result of a combination of indexing books in over 400 languages and dirty OCR.[x] Currently we don’t allow truncation operators or wild-cards in large scale search because the large terms index makes truncation and wild-card searches take several minutes, and while one of these searches is running, the response time of every other search becomes unacceptably slow. However, the “search within a book” index contains only 10-50 million unique terms,[xi] and the resulting terms index is small enough that allowing truncation and wild-card operators would be feasible.[xii]
There are a number of other features currently under investigation that might be feasible to implement in “search within a book” that are not feasible in large scale search. However, we need to determine how to implement them in such a way that users don’t get confused when they find something works in “search within a book” but not in large scale search.
User experience issues
One of the issues with this two-tiered search is that occasionally a query which gets results in large scale search will not get search results when the same query is issued within the book. There are two primary causes of this issue.
One problem occurs when the query terms match something in the MARC metadata which is not in the full text (large scale search indexes both MARC metadata and full-text, but "search within a book" only indexes the full-text). This occurs most frequently when the title or author of a work is transliterated in the MARC record. For example, with a search for [Emperor of Japan], most of the top ten results are books by “The Emperor of Japan.” However, if you went to the second book in the results list, "Ā Meiji Daikōtei," and searched for those words [Emperor of Japan], you would not get any results, since the book is in Japanese and the author’s name is also in Japanese: "明治天皇".
The most frequent cause of missing search results in “search within a book” is the difference in indexing units. When you are searching in large scale search, you are searching all 10 million documents with the entire book as a Solr document, so a two-word query can match a book even if the first query word is on page one of the book and the second query word is on the last page. When the user tries their search within the book, they may get no results if the two query words do not occur on the same page. To partially mitigate this, if a search within the book gets no hits for a multiple-word query, we put up a link that, when clicked, repeats the search but switches the default operator between query terms from a Boolean “AND” to a Boolean “OR”: “Broaden your search to find pages having just one or more of your terms.” This will find pages containing any of the search terms, even when the terms don’t all occur on the same page.
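The query rewrite behind that “broaden your search” link amounts to joining the user’s terms with explicit ORs. A minimal sketch, purely illustrative:

```python
# Sketch: broadening an implicit-AND query into an explicit-OR query,
# as in the "broaden your search" fallback described above.

def broaden_query(user_query):
    """Rewrite a multi-word query so any single term can match a page."""
    terms = user_query.split()
    if len(terms) < 2:
        return user_query          # nothing to broaden for one term
    return " OR ".join(terms)
```

For example, a failed page-level search for [emperor of japan] would be re-run as `emperor OR of OR japan`, so pages containing any one of the terms match.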
When we move our storage back-end to a Solid State Drive appliance sometime in the next six months, we will add a proximity query to all large scale search queries that will have the effect of ranking documents where all the query terms are on the same page higher than matches where the terms do not occur on the same page. This should help to reduce the problem of a query matching the entire document, but not matching a page within a document.[xiii]
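In Lucene/Solr query syntax, such a proximity clause can be expressed as a phrase query with slop, boosted and OR'd with the plain query so that same-page co-occurrence raises the score without being required. The sketch below shows the general shape; the slop and boost values are assumptions, not our planned settings.

```python
# Sketch: adding a boosted proximity (phrase-with-slop) clause so that
# documents whose terms co-occur close together rank higher.
# Slop and boost values are illustrative.

def with_proximity_boost(terms, slop=100, boost=10):
    """Combine the plain query with a boosted proximity clause."""
    base = " ".join(terms)
    if len(terms) < 2:
        return base                # proximity is meaningless for one term
    # "a b"~slop matches when the terms occur within `slop` positions;
    # ^boost raises the score of those matches.
    return '%s OR "%s"~%d^%d' % (base, base, slop, boost)
```

Since a page holds a few hundred words, a slop on the order of a page length would favor books where all the terms land on (roughly) one page.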
[i] In this post we refer to HathiTrust full-text search (search the full text of all 10 million books) as “large scale search” to distinguish it from HathiTrust catalog search, searching within a collection, and searching within an individual book.
We use the term “books” rather than the more accurate “volumes.” Each indexed item represents one bound volume. A volume might contain a single book, two books bound together, or it might contain an entire year or more of journal issues bound together.
[ii] Solr has a flat data model. Representing hierarchical relationships such as the relationship between a book, its chapters, and its pages is not something that is easy to do within Solr’s data model.
[iii] We are in the process of testing Solr’s recently implemented grouping (field-collapsing) and (in development) block-join capabilities. Preliminary tests indicate that we could not index all 10 million books on a page level without very significant investment in new hardware. More details will be forthcoming in a future post.
[iv] When first implemented, a click on the “Full view” or “Limited (search-only)” link in the search results in large scale search would automatically index the book and then repeat the search within the book. However, because of some user experience issues discussed later in this post, we later changed the behavior so that clicking on the link would take you to the first page of the book and populate the “search in this text” search box.
[v] Large merges are extremely I/O intensive, since all the segments to be merged have to be read from disk and then written out into the merged segment. They are also CPU intensive due to decompressing, merging, and then recompressing the index segments before writing the merged index to disk. During our early testing, we found that if merging occurs on the same machine that serves searches, query response time slows to a crawl during large merges. Because our indexes are so large (around 700GB), we can get very large merges. Merging segments containing a total of about 30,000 volumes takes over 30 minutes and produces a final segment over 50GB in size. Much more frequent merges of segments containing about 5,000 volumes take about 5 minutes. In contrast, the average merge time for the search-within-a-book index is 2 seconds, and the 99th percentile merge time is 5 seconds.
[vi] See the diagram in http://www.hathitrust.org/blogs/large-scale-search/forty-days-and-forty-nights-re-indexing-7-million-books-part-1. The indexing machines are represented in the diagram near the bottom as “Slurm-5” (since that diagram was created we have moved to 4 dedicated indexing machines). The serving machines are indicated at the top of the diagram as Slurm-1 through Slurm-4. For a discussion of the rationale for having separate indexing and serving machines see: http://www.hathitrust.org/blogs/large-scale-search/scaling-large-scale-s...
[viii] Optimizing takes between 20 minutes and an hour. About 90% of all searches within a specific book occur on the same day. Each day that the same item is searched it has to be re-indexed for the first search since we deleted the index the previous day. (These statistics are based on raw log numbers without removing robots. Preliminary analysis indicates that robots may account for a significant number of repeated searches extending over more than one day.)
[ix] Of the 15GB total for the page-level index, the termvectors file (used for highlighting) is about 8GB (55%) and the stored fields (used to present the highlighted terms in context) are about 3.4GB (23%) of the total size. Without these two components needed for highlighting/snippets, the index size would be about 3GB.
[xi] Looking at a sample index from one day, the index contained about 1,300 books and about 650,000 pages. These 1,300 books contained a total of about 540 million words, of which about 38 million were unique. We looked at the languages of the works in this index and found works in 25 different languages, including 16 works in more than one language (works in multiple languages tend to have OCR errors) and 7 works in languages that tend to contain a high number of OCR errors (such as Mongolian, Ancient Egyptian, Ancient Greek, Malay, Hebrew, Japanese, and Ottoman Turkish).
[xii] When we move to Solr4 we will test to see whether the new Finite State Transducer (FST) based terms index makes using wildcards and truncation feasible even with our 3 billion unique terms per index.
[xiii] Proximity queries require reading the Solr positions index (see Slow Queries and Common Words, part 1), which with our current spinning disk back-end takes too long and would cause unacceptable response times. Since SSD I/O is at least an order of magnitude faster than spinning disk, once we have the SSD back-end in production, we should be able to implement proximity queries without a significant impact on response time.