Features Analysis for Full-text Search

The charge of the HathiTrust Full-Text Search working group is to examine the HathiTrust Full-Text search application and "identify and prioritize features and functions anticipated to have immediate high-impact value to users that can be reasonably afforded by the existing technology framework."

The chart below gives a brief description of potential features that meet these criteria and provides estimates for each of the three following areas: value to users, ease of implementation, and ease of user experience (UX) design. The body of the report gives a detailed description of the features and lists the rationale for including each feature, as well as applicable user-centered design principles and implementation issues that may have an impact either on implementation effort or usability.

The group started by brainstorming and listing features and functions believed to be of high value to users. We then did a preliminary rating of value to users based on our collective knowledge of user needs, information-seeking behavior, and usability principles. After a number of group discussions about the features, each member of the group did a final rating of estimated value to users on a scale of 1-10. The average of those ratings is reported in the chart below.

Because the charge emphasizes features that can be added quickly without additional resources, we included estimates of the implementation effort. These estimates were made by Tom Burton-West of the University of Michigan, who will have primary responsibility for implementing the features.

Implementation will also require considerable user experience design work. Estimates of the effort for user experience design were added by Suzanne Chapman of the University of Michigan, who will be the lead for UX design for the implementation.

In order to facilitate comparison, estimates of implementation effort were converted to ease of implementation. These are expressed on a scale of 1-10 where 10 is easiest to implement.

These estimates of value to the user and ease of implementation were combined to provide a final score. Since ease of implementation was evaluated separately for technical implementation and UX design, the scores for those two aspects were combined to determine a total ease of implementation score. In order to give equal weight to “value to users” and “ease of implementation”, the rating for “value to users” was doubled. Thus, the doubled “value to users” was added to the combined score for the ease of technical and UX design implementation to provide the final weighted total score shown in the chart.

The result, presented on the chart below, is a list of recommended features in priority order. If in the future more resources become available, or a higher priority is given to “value to users,” these estimates could be combined in a different way. For example, the “value to users” estimates could be multiplied by four to produce a prioritized list that provides a greater emphasis on “value to users” than on minimizing the need for additional development resources.
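The weighting arithmetic described above can be sketched in a few lines of Python. This is an illustration only: the function name `weighted_total` and the `value_weight` parameter are assumptions introduced here and are not part of the working group's tooling.

```python
def weighted_total(user_value, impl_ease, ux_ease, value_weight=2):
    """Combine the three 1-10 estimates into one priority score.

    With value_weight=2, "value to users" has the same maximum (20) as
    the two ease scores combined (10 + 10).
    """
    return value_weight * user_value + impl_ease + ux_ease


# Facets on the results page: value 8, implementation ease 10, UX ease 8.
print(weighted_total(8, 10, 8))                  # 34
# The same feature with "value to users" multiplied by four instead.
print(weighted_total(8, 10, 8, value_weight=4))  # 50
```

Re-running the function over every row with a different `value_weight` is enough to produce an alternative priority order.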

The top 5 features in the list are:

  1. Facets
  2. Using bibliographic metadata in relevance ranking
  3. Improved results display for multiword searches in “within book search”
  4. Show number of times term(s) occur in book
  5. Spelling suggestion(s)

| Feature | User value | Implementation ease | UX design ease | Weighted total |
|---|---|---|---|---|
| 1a. Facets: show, and interact with, on results page | 8 | 10 | 8 | 34 |
| 2. Using bibliographic metadata in relevance ranking | 9 | 8 | 6 | 32 |
| 3a. Improved multiple word searches for search within a book: coordination matching | 9 | 8 | 4 | 30 |
| 3b. Improved multiple word searches for search within a book: true relevance ranking | 9 | 4 | 8 | 30 |
| 4b. Show number of times search term(s) occurred within the book: within book search | 6 | 10 | 6 | 28 |
| 4a. Show number of times search term(s) occurred within the book: whole-collection search results | 5 | 10 | 6 | 26 |
| 5a. Spelling suggestion: “Did you mean?” after low-hits search | 6 | 8 | 6 | 26 |
| 6. Grouping of multiple copies of same item | 6 | 10 | 4 | 26 |
| 7. Advanced search: Boolean combinations of searching on title/author/subject/full-text | 7 | 8 | 4 | 26 |
| 1b. Facets: limit by facets on advanced search page | 8 | 10 | 8 | * (34) |
| 8a. Show snippets: whole-collection results | 8 | 4 | 4 | 24 |
| 8b. Show snippets: within book results | 10 | 2 | 2 | 24 |
| 9. Provide e-book download as well as PDF | 5 | 6 | 6 | 22 |
| 10. Sorting of search results | 5 | 6 | 4 | 20 |
| 11a. Improve non-Roman searching: CJK | 5 | 4 | 6 | 20 |
| 12. Show related books | 5 | 6 | 4 | 20 |
| 5b. Spelling suggestion: suggestions as you type | 4 | 4 | 6 | 18 |
| 13. RSS feeds | 4 | 6 | 2 | 16 |
| 11b. Improve non-Roman searching: beyond CJK (e.g. Arabic, Hebrew, …) | 4 | 2 | 4 | 14 |
| 14. Tag cloud on search results page for whole-collection search | 3 | 2 | 2 | 10 |

Notes on the table:

  • User value scores: average value of ratings by committee members. 1 = low value, 10 = high value
  • Ease of implementation scores: 1 = hard, 10 = easy.
  • UX effort includes not only the effort to design the user interface, but also to do some minimal required design iterations.
  • Weighted total is: 2 * user value score, plus implementation ease, plus UX design ease. This equally weights user value vs. total implementation.
  • (*) The feature “Facets: limit by facets on advanced search page” only makes sense if there is an advanced search page; thus it has been moved down the list to below the advanced search task.
  • All estimates of feasibility assume that the technology will scale and work with Solr’s distributed search feature. Due to the size of our indexes, it is possible that we will run into scalability issues with any feature, which could make the feature difficult or impossible to implement. Where there is reason to believe there could be a scalability issue, this is noted in the feasibility discussion.

Rationale and Feasibility issues

1. Facets

  • On search results page
  • On advanced search page

Weighted total score: 34

Team user value score: 8 (same for both options)

Rationale:

Help users narrow large result sets. Facet counts provide a quick overview of search results and give informative feedback to facilitate narrowing result sets. Facets facilitate browsing-based discoverability.

Design guidelines:

Reduce memory load/Favor recognition over recall. Integrate navigation and search.

Hearst p 20 sect 1.7.3 and chapter 8:

Feasibility Issues:

The search software we use (Solr) comes with support for facets out-of-the-box. There may be some scalability issues. Memory and distributed search issues are probably tractable based on others’ experience with similarly sized collections of MARC records. Assumes only full-text items with bibliographic metadata. UI issues include:

  1. Determining which facets to display, in what order, and with what labels.
  2. Determining how many values for each facet to show by default and, if a “show more” feature is implemented, how many to show when “show more” is activated.
  3. Determining if facets should be sticky on the search results page and/or on the advanced search page. (See: http://searchworks.stanford.edu/?f[geographic_facet][]=France&f[language][]=English&f[pub_date_group_facet][]=More+than+50+years+ago&q=art+history&search_field=search)
  4. Determining whether to implement multi-valued facets (more than one value for a facet can be checked).
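
For orientation, a facet request to Solr is mostly a matter of query parameters. The sketch below builds such a parameter list in Python. The parameters `facet`, `facet.field`, `facet.limit`, `facet.mincount`, and `fq` are standard Solr request parameters; the field names (`language`, `pub_date_range`) and the helper name `facet_query_params` are hypothetical and are not taken from the HathiTrust schema.

```python
from urllib.parse import urlencode


def facet_query_params(query, facet_fields, selected=None, values_per_facet=5):
    """Build Solr request parameters for a faceted search.

    selected maps a facet field to the list of values the user has
    checked; each becomes a filter query (fq) that narrows the results.
    """
    params = [
        ("q", query),
        ("facet", "true"),
        ("facet.limit", str(values_per_facet)),  # values shown by default
        ("facet.mincount", "1"),                 # hide empty facet values
    ]
    params += [("facet.field", field) for field in facet_fields]
    for field, values in (selected or {}).items():
        for value in values:
            params.append(("fq", f'{field}:"{value}"'))
    return params


params = facet_query_params(
    "art history",
    facet_fields=["language", "pub_date_range"],  # hypothetical field names
    selected={"language": ["English"]},
)
print(urlencode(params))
```

Separate `fq` parameters are intersected by Solr, so a multi-valued facet whose checked values should be ORed together would need those values combined into a single filter. That is one reason UI issue 4 also affects implementation.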

Two possible implementations

a. Search results page

Implementation Estimate: 10 (easy), Uncertainty: low

UX Design Estimate: 8 (medium-easy)

b. Advanced search page (Note: this feature would require implementation of an advanced search page)

Implementation Estimate: 10 (easy), Uncertainty: low

UX Design Estimate: 8 (medium-easy)

2. Using MARC metadata in relevance ranking

Weighted total score: 32

Team user value score: 9

Rationale

The occurrence of a user’s query words in the MARC metadata for a volume is an indicator that the volume is likely to be relevant. Implementing this feature would likely greatly increase the relevance of search results. Some users will do author/title/subject searches in the Full-Text search rather than switching to advanced search, and this feature will accommodate the habits of those users.

Design guidelines:

Proper acceptance testing should ensure that the search results returned to the end user have been improved for a representative set of affected records.

Feasibility Issues:

This should be relatively easy to implement using the Solr dismax query handler. We should be able to leverage the work already done for the VuFind HathiTrust bibliographic catalog. We assume there are no significant scalability issues based on others’ experience with similarly sized collections of MARC records. This will require some work on tuning the weighting formula, especially the weight of the OCR text relative to the metadata fields. Martin noted that XTF worked well with minimal tweaking due to length normalization of MARC records vs. full text.
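
As a rough illustration of the weighting that would need tuning, the dismax handler takes per-field boosts in its `qf` parameter. `defType` and `qf` are standard Solr parameters; the field names and boost values below are placeholders chosen for the example, not a proposed configuration.

```python
from urllib.parse import urlencode


def dismax_params(user_query, field_boosts):
    """Build Solr dismax parameters that search several fields at once.

    field_boosts maps a field name to its boost; a higher boost makes a
    match in that field count more toward the relevance score.
    """
    qf = " ".join(f"{field}^{boost}" for field, boost in field_boosts.items())
    return {"defType": "dismax", "q": user_query, "qf": qf}


# Hypothetical fields and boosts: metadata matches outweigh OCR matches.
params = dismax_params("origin of species", {"title": 10, "author": 5, "ocr": 1})
print(params["qf"])       # title^10 author^5 ocr^1
print(urlencode(params))
```

Tuning then amounts to adjusting the boost values and checking the effect on a representative set of queries.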

Implementation Estimate: 8 (medium-easy), Uncertainty: medium

UX Design Estimate: 6 (medium)

Note: While it should be relatively easy to get a reasonable default relevance ranking system in place, tuning relevance ranking is a complicated and difficult process which will probably require an ongoing effort.

3. Improved results display for multiword searches in “within book search”

Weighted total score: 30

Team user value score: 9

As an alternative to showing results in page order, the default display should show results in relevance order. Pages where all the words in the query appear close together should be ranked higher.

Rationale

It is likely that many users are actually searching for all their query words occurring on the same page rather than anywhere in a book. The current implementation mixes pages where only one query word occurs with pages where all query words occur. This makes it difficult to locate the pages where all query words occur.

Implementation of this feature would facilitate more consistency between the whole-collection search and within-book search.

Design guidelines:

Consistency: similar actions should provide similar results

Feasibility Issues:

There are two possible implementations:

a. “coordination matching” using the existing back-end.

This would rank pages according to the number of matching query words on the page. For a 4-word query, pages with all 4 terms would be first, pages with 3 of the 4 would be second, pages with 2 of the 4 would be third, and so on. Within each group, pages would be listed in page number order. This would be relatively easy to implement. However, we would need to put a limit on the size of the query that would receive this ranking. For example, it would be computationally difficult to do this kind of ranking for a 10-word query. Setting a limit in a way that will not confuse users requires some UX design.
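
The ordering rule can be sketched as follows. This is a simplified model, not the existing back-end: it assumes each page is available as a set of words, and the function name `coordination_rank` is introduced here for illustration.

```python
def coordination_rank(pages, query_terms):
    """Order pages by how many distinct query terms they contain.

    pages maps a page number to the set of words on that page.
    Pages matching more terms come first; ties keep page-number order.
    Pages matching no term are dropped.
    """
    terms = set(query_terms)
    scored = [(len(terms & words), page) for page, words in pages.items()]
    matches = [(count, page) for count, page in scored if count > 0]
    # Sort by descending match count, then ascending page number.
    matches.sort(key=lambda item: (-item[0], item[1]))
    return [page for count, page in matches]


pages = {
    1: {"whale"},
    2: {"white", "whale", "ship", "captain"},
    3: {"harbor"},
    4: {"white", "whale", "ship"},
    5: {"captain", "white", "whale", "ship"},
}
print(coordination_rank(pages, ["white", "whale", "ship", "captain"]))
# [2, 5, 4, 1]
```

Pages 2 and 5 contain all four terms and stay in page order, page 4 contains three, page 1 contains one, and page 3 is omitted.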

Implementation Estimate: 8 (medium-easy), Uncertainty: low

UX Design Estimate: 4 (medium-hard)

b. True relevance ranking including both proximity and traditional relevance ranking.

Pages containing all the terms close together would be ranked highest with pages containing all the terms being ranked above pages containing some of the terms.

This would require significant re-architecting of current on-the-fly, within-book indexing. It would require replacement of our XPAT-based engine with Solr. Solr has some issues with snippets which would have to be resolved. This will require a significant development effort. Further research into feasibility issues is needed.

There are other reasons why in the long term it is desirable to move the back end for within-book searching to Solr, including providing consistent searching of CJK and other non-Roman scripts and maintaining consistent feature sets between whole-collection and within-book searching.

Implementation Estimate: 4 (medium-hard), Uncertainty: medium

UX Design Estimate: 8 (medium-easy)

Notes:

At some point, the need for more user-controlled advanced searching within books will need to be addressed. For example, the ability to do Boolean “AND” searches is disabled in the current implementation. (We hope to fix this in the near future.) An additional issue is determining which functionality in the whole-collection search should also be implemented in within-book search.

4. Show number of times word(s) occurred in book

  • Counts for search-entire-collection

Weighted total score: 26

Team user value score: 5

  • Counts for search within book

Weighted total score: 28

Team user value score: 6

Rationale

Currently, in the search within a book, the count is reported for each page of the book but there is no total for the whole book. The total count could be helpful for literary analysis, for instance when looking across the works of a particular author to gauge how significant a given term might be.

Note: the initial thought was to put this on the within-book search, but it might also be useful on the results page for whole-collection search.

Design guidelines:

Offer informative feedback.

Feasibility Issues:

We can get these counts for both the search-entire-collection results page and the search-within-the-book results page. However, there is a performance impact which needs to be evaluated. There are also potential UI issues if we show this for multi-word, non-phrase searches. For example, how would we show word counts for a 5-word query for each book on the results page without taking up too much space?
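
A minimal sketch of the summation, assuming the per-page counts are available as a mapping from page number to term counts (a data shape chosen for this example, not the application's actual structure):

```python
def total_occurrences(per_page_counts):
    """Sum per-page term counts into whole-book totals.

    per_page_counts maps a page number to a {term: count} dict.
    """
    totals = {}
    for counts in per_page_counts.values():
        for term, count in counts.items():
            totals[term] = totals.get(term, 0) + count
    return totals


per_page = {
    12: {"whale": 3},
    40: {"whale": 1, "harpoon": 2},
    41: {"harpoon": 1},
}
print(total_occurrences(per_page))  # {'whale': 4, 'harpoon': 3}
```

The result has one entry per query term, which is where the space problem for multi-word queries comes from.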

a. Counts for search entire collection

Implementation Estimate: 10 (easy), Uncertainty: low

UX Design Estimate: 6 (medium)

b. Counts for search within book

Implementation Estimate: 10 (easy), Uncertainty: low

UX Design Estimate: 6 (medium)

5. Spelling suggestions

  • “Did you mean”

Weighted total score: 26

Team user value score: 6

  • Suggestions as you type

Weighted total score: 18

Team user value score: 4

Rationale:

Users may be confused when they get no results or relatively few results from a spelling error or typo, especially users with dyslexia and/or difficulty typing.

Design guidelines:

Offer informative feedback. Favor recognition over recall. Provide easy recovery from errors. Support user control.

Hearst Section: 1.5.4: Show Query Term Suggestions, plus sect 6.2-6.3

Feasibility Issues:

Spelling suggestions are often based on the contents of the works being searched. For various reasons, including the size of our index, dirty OCR, and works in 400 languages, it would not be feasible to build a spelling suggester based on the OCR text. However, we could build a spelling suggester based on the MARC metadata.

There are a number of implementations available for the Solr software we use for the Full-Text search. We need to investigate choices of implementations. There are also user interaction/usability issues, such as how many suggestions to show and whether we provide a “did you mean” suggestion on a results page or provide a dropdown list of several suggestions as the user types a query.

Providing as-you-type suggestions involves significantly more work.
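
To make the “Did you mean” idea concrete, the sketch below suggests corrections from a small word list standing in for vocabulary extracted from MARC metadata. It uses Python's standard `difflib` module as a stand-in for whichever Solr spell-check implementation is eventually chosen; the function name, the word list, and the similarity cutoff are all assumptions made for the example.

```python
from difflib import get_close_matches


def did_you_mean(query, vocabulary, max_suggestions=3):
    """Suggest vocabulary words that are close to each unknown query word.

    vocabulary stands in for a word list extracted from bibliographic
    metadata (titles, authors, subjects); how it is built is out of scope.
    """
    known = set(vocabulary)
    suggestions = {}
    for word in query.lower().split():
        if word in known:
            continue  # spelled as in the metadata, nothing to suggest
        close = get_close_matches(word, vocabulary, n=max_suggestions, cutoff=0.8)
        if close:
            suggestions[word] = close
    return suggestions


vocabulary = ["shakespeare", "hamlet", "tragedy", "dickens"]
print(did_you_mean("shakespear hamlet", vocabulary))
# {'shakespear': ['shakespeare']}
```

The `max_suggestions` parameter corresponds to the usability question of how many suggestions to show.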

a. “Did you mean”:

Implementation Estimate: 8 (medium-easy), Uncertainty: high

UX Design Estimate: 6 (medium)

b. Suggestions as you type:

Implementation Estimate: 4 (medium-hard), Uncertainty: high

UX Design Estimate: 6 (medium)

6. Grouping of multiple copies of same item

Weighted total score: 26

Team user value score: 6

Rationale

Rather than displaying a large number of records for one bibliographic item, it would be useful to group them together.

Design guidelines:

Facilitate chunking. Group like things together.

Feasibility Issues:

It would be feasible to group multiple items belonging to the same bib record that occur on the same result page. However, grouping multiple items that belong to the same bib record but occur on multiple pages of search results could be complicated if the items received different relevance rankings. Items within a group of serial items or multi-volume sets might receive different rankings.
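
The easier case, grouping items that fall on the same results page, can be sketched as below. The identifiers and the `(bib_id, item_id)` data shape are hypothetical and chosen only for the example.

```python
def group_by_bib_record(page_results):
    """Group one page of results by bibliographic record.

    page_results is a relevance-ordered list of (bib_id, item_id) pairs.
    Groups appear in the order of their highest-ranked item.
    """
    groups = {}
    for bib_id, item_id in page_results:
        groups.setdefault(bib_id, []).append(item_id)
    return list(groups.items())


page = [("bib1", "vol.1"), ("bib2", "copy A"), ("bib1", "vol.2"), ("bib2", "copy B")]
print(group_by_bib_record(page))
# [('bib1', ['vol.1', 'vol.2']), ('bib2', ['copy A', 'copy B'])]
```

Because the function only sees one page of results, items of the same record that rank onto a later page are not pulled into the group, which is the complication described above.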

Implementation Estimate: 10 (easy) (to group items on same page)

UX Design Estimate: 4 (medium-hard)

7. Advanced search: Boolean combinations of MARC metadata and full-text

Weighted total score: 26

Team user value score: 7

Rationale

Users of information systems have diverse goals, tasks, expertise, and cognitive styles. Providing an advanced search page relieves pressure to make the main interface complicated in an attempt to accommodate all users with a single interface.

Some users will want to be able to do author, title, or subject searches or do a full-text search limited to an author, title, subject, or journal ISSN.

Note: The group ranked this feature relatively low, because most users will never use the advanced search page.

Design guidelines:

Accommodate diverse users.

Feasibility Issues:

This is relatively easy to implement with the software we are using. There are potentially complex UI issues which might require iteration. Martin points out that showing the query on the results page can be complicated for an advanced search.

Implementation Estimate: 8 (medium-easy), Uncertainty: medium

UX Design Estimate: 4 (medium-hard)

8. Snippets on search results page

  • Whole-collection search

Weighted total score: 24

Team user value score: 8

  • Within book search

Weighted total score: 24

Team user value score: 10

Rationale:

Showing snippets helps the user to determine if their search was successful and gives an indication of whether it’s worth clicking on the link to the result or the page.

Design Guidelines:

Offer informative feedback. Show informative document surrogates. Highlight search terms.

Hearst: p 8 sect 1.5.2 and p 120-130 sect 5.1-5.3:

Feasibility issues:

  1. Legal and contractual issues

For in-copyright works we will require a solution to prevent human or machine-based harvesting of complete works via the snippets.

  2. Technical issues

a. Snippets for whole-collection search

  1. The current software cannot produce snippets efficiently.
  2. One option is to use AJAX and index all books on the results page on the fly to get snippets. This would require substantial design effort to provide software that can generate and retrieve snippets within 30 seconds.

Implementation Estimate: 4 (medium-hard), Uncertainty: high

UX Design Estimate: 4 (medium-hard)

b. Snippets for within-book search

  1. Snippets for non-copyrighted material are already implemented.
  2. Snippets for copyrighted material will be difficult to implement because we have to design, implement, and test complex logic to prevent a user from downloading the whole text, snippet by snippet. This may require showing only some of the snippets matching the query and thus may reduce the benefit to users.

Implementation Estimate: 2 (hard), Uncertainty: high

UX Design Estimate: 2 (hard)

9. Provide e-book as well as PDF

Weighted total score: 22

Team user value score: 5

Rationale

It could be convenient for users to be able to download an e-book from HathiTrust as well as a PDF, e.g. so that the text can re-flow easily on a portable device with a small screen.

Design guidelines:

Accommodate diverse users. UX design should offer clear links to both formats.

Feasibility Issues:

The infrastructure currently used for producing PDFs could probably be adapted to produce e-books as well. There may be workflow and metadata issues. Further research is needed.

Implementation Estimate: 6 (medium), Uncertainty: high

UX Design Estimate: 6 (medium)

10. Sorting of search results

Weighted total score: 20

Team user value score: 5

Rationale

Users may want to sort their results by publication date, author, or title.

Note: The group ranked this low because it is likely that searching 8 million books will provide such large result sets that sorting will not be useful. (See further discussion below.)

Design guidelines:

Provide informative feedback. Support user control.

Hearst 1.5.3 and 8.3:

Feasibility Issues:

There are a number of user-interface and technical issues. In searching the full text of 8 million books, large result sets of 10,000 to 1 million results are possible. Sorting such large result sets may have performance issues and is unlikely to be useful to the user. Using facets or changing the query to return a smaller result set is probably more useful than sorting large result sets. For example, we can provide a date range facet (and/or an advanced search limit) that would help people narrow their search results to a specific date range.

Implementation Estimate: 6 (medium), Uncertainty: high

UX Design Estimate: 4 (medium-hard)

There are some alternatives that might be useful:

  1. Provide for sorting by author/title/date for result sets smaller than some number N, perhaps 1,000.
  2. Sort only the N most relevant results. The technical feasibility of this needs to be confirmed.

Either of these alternatives would require some UX design to explain the seemingly inconsistent behavior of the software. For alternative No. 1, we must explain to the user why they can sort only when there are fewer than 1,000 records. For alternative No. 2, we must explain to the user that the sorted results include only the most relevant 1,000 records. If the performance penalty for sorting very large result sets is not significant, it might be better simply to allow users to sort without setting any limit, rather than have inconsistent behavior that requires complicated explanation.
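
Alternative No. 2 could look roughly like the following. This is a sketch of the behavior only, with made-up records; it says nothing about whether Solr can do this efficiently at scale, which is the open feasibility question.

```python
def sort_top_n(results, n, sort_key):
    """Sort only the n most relevant results by another field.

    results is a list of dicts with a "score" field (higher is more
    relevant); everything beyond the top n is discarded, not sorted.
    """
    most_relevant = sorted(results, key=lambda r: r["score"], reverse=True)[:n]
    return sorted(most_relevant, key=lambda r: r[sort_key])


results = [
    {"title": "Moby Dick", "date": 1851, "score": 9.1},
    {"title": "Typee", "date": 1846, "score": 7.4},
    {"title": "Omoo", "date": 1847, "score": 2.0},
    {"title": "White-Jacket", "date": 1850, "score": 8.2},
]
print([r["title"] for r in sort_top_n(results, n=3, sort_key="date")])
# ['Typee', 'White-Jacket', 'Moby Dick']
```

The least relevant record ("Omoo") silently disappears from the sorted list even though its date falls inside the displayed range, which is exactly the behavior that would need explaining to users.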

Implementation Estimate: 6 (medium), Uncertainty: high

UX Design Estimate: 2 (hard)

11. Improve searching for languages using non-Roman scripts

  • Chinese/Japanese/Korean (CJK)

Weighted total score: 20

Team user value score: 5

  • Other non-Latin languages (e.g. Hebrew, Arabic)

Weighted total score: 14

Team user value score: 4

Rationale:

The HathiTrust repository contains materials in over 400 languages. To the extent possible, the ability to search in those languages should be supported.

An informal look at collection statistics showed the same pattern in both HathiTrust holdings and user access: CJK represented about 10% in both cases, and other non-Latin languages less than 2%.

Design guidelines:

See Shneiderman et al., Design for Universal Usability.

Feasibility Issues:

a. CJK issues

Chinese and Japanese are normally not written with spaces between the words. Users don't normally put spaces between the words when entering queries in Japanese or Chinese in search engines. However, our current implementation does not split CJK into words very well, so searches in Chinese or Japanese will get many false hits unless the user puts quotes around words and separates words with spaces.

When the user follows the link to search “within the book,” they may get no results as the “search within the book” search engine does not segment CJK at all, so all the characters are searched as though they were an English word (in effect, a phrase query).

The current whole-collection search implementation segments Chinese and Japanese characters into unigrams, which results in many false drops. Overlapping bigrams, which provide better results, are supported using a different Solr/Lucene analyzer. There is an issue open to allow overlapping bigrams with the filter we are currently using (ICUTokenizer) https://issues.apache.org/jira/browse/LUCENE-2906. (see also http://www.lucidimagination.com/search/document/252c91d0aa6bbe72/bigrams_for_cjk_with_icutokenizer#252c91d0aa6bbe72)
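
The difference between the two tokenization strategies can be shown with a toy example. This is a plain-Python sketch of the idea, not the Solr/Lucene analyzers themselves, and it simplifies matching to "every query token occurs somewhere in the document".

```python
def unigrams(text):
    """Split a string into single-character tokens."""
    return list(text)


def overlapping_bigrams(text):
    """Split a string into overlapping two-character tokens."""
    if len(text) < 2:
        return list(text)
    return [text[i:i + 2] for i in range(len(text) - 1)]


def matches(query_tokens, doc_tokens):
    """True if every query token occurs somewhere in the document."""
    return set(query_tokens) <= set(doc_tokens)


doc = "生物学"    # "biology"
query = "学生"    # "student", which does not occur in the document

print(overlapping_bigrams(doc))  # ['生物', '物学']
print(matches(unigrams(query), unigrams(doc)))                        # True
print(matches(overlapping_bigrams(query), overlapping_bigrams(doc)))  # False
```

With unigrams, the query for "student" matches the document about "biology" because both characters appear in it, which is a false hit. With overlapping bigrams, the same query no longer matches.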

It might also be useful to provide context-sensitive help (based on query terms in Unicode character ranges) to suggest that the user might get better results if they put words in quotes and separate them with spaces. This technique will work both for the whole-collection search and for “search within a book.”
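
Detecting when to show such help could be as simple as checking the query against a few Unicode blocks. The sketch below covers Hiragana, Katakana, and the basic CJK Unified Ideographs block only; the list is not exhaustive (extension blocks are omitted), and the function names and help text are placeholders.

```python
CJK_RANGES = [
    (0x3040, 0x309F),  # Hiragana
    (0x30A0, 0x30FF),  # Katakana
    (0x4E00, 0x9FFF),  # CJK Unified Ideographs
]


def contains_cjk(query):
    """True if any character falls in one of the ranges above."""
    return any(lo <= ord(ch) <= hi for ch in query for lo, hi in CJK_RANGES)


def search_tip(query):
    """Return a help message for Chinese/Japanese queries, else None."""
    if contains_cjk(query):
        return ("Tip: for better results, put each word in quotes "
                "and separate words with spaces.")
    return None


print(search_tip("生物学 歴史") is not None)  # True
print(search_tip("art history"))  # None
```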

Note: Once a contributor or committer to the software we use implements LUCENE-2906, the ease of implementation will rise to 8 or 10 (easy to implement). UX design effort is estimated based on designing and implementing help that explains to users how to enter Chinese and Japanese queries for best results.

Implementation Estimate: 4 (medium-hard), Uncertainty: high

UX Design Estimate: 6 (medium)

b. “Search within this book” issues (other non-Latin scripts)

The "search within this book" can handle only CJK, Latin, Cyrillic, and Greek. The whole-collection search will work with Hindi, Arabic, and Hebrew (which are among the top 20 languages in HathiTrust), but the "search within a book" will give the user an error message that says: “No pages matched your search for XXX this item,” where "XXX" is your query in a script other than the ones supported. There are similar issues for other non-Roman scripts that are supported by the whole-collection search.

This could be handled by replacing the current search engine for the "search within a book" with the same search engine we are using for whole-collection search. However, that will require major work.

Implementation Estimate: 2 (hard), Uncertainty: high

UX Design Estimate: 4 (medium-hard)

Note: Replacing the “search within a book” search engine with the same search engine we are using for whole-collection search would also resolve item number 3, “improve within book search.” There are other reasons to consider this change, including the long-term goal of reducing the amount of work involved in maintaining consistency between the whole-collection search and the “search within a book” search.

12. Show related books

Weighted total score: 20

Team user value score: 5

Rationale

Assist the user with query reformulation by showing similar books.

Design guidelines:

Hearst sect 6.6:

Feasibility Issues:

Solr has a “More like this” feature built in. However, we need to determine if it works with distributed search. This would require storing term vectors, which would increase the size of the index. An alternative would be to build our own “More like this” feature, assuming we already indexed the individual book using a Solr-based index.

Implementation Estimate: 6 (medium), Uncertainty: high

UX Design Estimate: 4 (medium-hard)

13. RSS feeds

Weighted total score: 16

Team user value score: 4

Rationale

Researchers interested in a specific subject or author might wish to keep abreast of new holdings related to that subject or author as they enter into HathiTrust.

Design guidelines:

Follow industry-standard practices for providing, linking to, and formatting RSS feeds.

Feasibility Issues:

There are a number of design issues, both on the technical level and on the UX design level, that need to be investigated. These include:

  1. Privacy and security.
  2. Changes to indexing to allow differentiation between newly added items and updated/modified items.
  3. Scalability concerns.
  4. Limiting the feed to a “reasonable” number of results in a way that is understandable to users.

Implementation Estimate: 6 (medium), Uncertainty: high

UX Design Estimate: 2 (hard)

14. Tag cloud on search results page

Weighted total score: 10

Team user value score: 3

Notes: If tag clouds were created for each book on the search results page, there would likely be usability issues, as this might result in too much clutter and make it harder for users to scan result sets. However, a tag cloud for the entire search might be useful for suggesting terms for query reformulation. On the within-book search results page, a tag cloud for the book might be useful for suggesting terms to find similar books.

Rationale

Tag clouds provide a visualization and might provide term suggestions for query reformulation.

Design guidelines:

Provide informative feedback. Favor recognition over recall.

Feasibility Issues:

There are various user-interface issues as noted above. Creating a tag cloud for the whole-collection search might be computationally expensive. Creating tag clouds for individual books would be feasible if we store term vectors. This might have a performance impact and would definitely increase the size of the index.

Implementation Estimate: 2 (hard), Uncertainty: high

UX Design Estimate: 2 (hard)

References

Hearst, M.A. Search User Interfaces. New York: Cambridge University Press, 2009.
Shneiderman, B. Designing the User Interface: Strategies for Effective Human-Computer Interaction. Reading, MA: Addison-Wesley, 1998.