Solr Proxy API User Guide

The HTRC Solr Proxy is a thin layer over a Solr service that limits access, audits requests, and provides additional functionality. It runs at http://chinkapin.pti.indiana.edu:9994, which the examples in this guide use.

The HTRC Solr Proxy has two cores: one for OCR full-text search, and one for metadata search that covers all MARC bibliographic fields plus HTRC-contributed fields (field names with the prefix "htrc_"). The OCR core lets users issue a query against the ocr field and get back a list of matching volume IDs. The metadata core lets users run a wider variety of queries against different fields. Before the queries are introduced, the table below lists the frequently used fields of the HTRC metadata Solr core and how they are processed. Indexed fields are searchable, whereas the content of stored fields can be returned to users.

For the metadata Solr core:

Field name | Indexed | Stored | Explanation
id | Y | Y | Volume ID
fullrecord | N | Y | The full MARC record
title | Y | Y | Book title
author | Y | Y | Author of the volume
oclc | Y | Y | The OCLC Control Number
rptnum | Y | Y |
sdrnum | Y | Y |
isbn | Y | Y | The International Standard Book Number (ISBN) of the volume
issn | Y | Y | The International Standard Serial Number (ISSN) of the volume
callnumber | Y | Y | The call number of the volume
sudoc | Y | Y | Superintendent of Documents number of the volume. Please refer to http://www.fdlp.gov/cataloging/856-sudoc-classification-scheme?start=3
language | Y | N | The languages in which the volume is written
htsource | Y | Y | The sources of the volume, e.g. "Indiana University"
era | Y | N | The era of the volume
geographic | Y | N | Brief geographic information for the volume, e.g. "pennsylvania"
country_of_pub | Y | N | The country of publication
topic | Y | N | Topic of the volume
genre | Y | N | Genre the volume belongs to
publishDate | Y | Y | Publication date of the volume
publisher | Y | Y | Publisher of the volume
edition | Y | Y | Edition of the volume
allfields | N | Y | All bibliographic fields above, but not the OCR field, are indexed in this field. This is the default search field.
htrc_pageCount | Y | Y | The page count of the volume
htrc_wordCount | Y | Y | The word count of the volume (HTRC uses the Lucene 3.6 standard tokenizer to split words)
htrc_charCount | Y | Y | The character count of the volume
htrc_gender | Y | Y | Gender of the authors of the volume: male, female, or both
htrc_genderMale/htrc_genderFemale/htrc_genderUnknown | Y | Y | Male/female/gender-unknown authors of the volume, multivalued
htrc_volumePageCountBin | Y | Y | Quartile bin of the volume based on its page count. Values are S/M/L/XL
htrc_volumeWordCountBin | Y | Y | Quartile bin of the volume based on its word count. Values are S/M/L/XL

 

Note that some fields are indexed but not stored. These fields do not appear in search results, but they are very useful for faceted search, which HTRC Blacklight relies on heavily.

The OCR core also has many of these metadata fields, but users are discouraged from querying metadata fields on the OCR core. The metadata fields there are stale and will not be maintained in the future. In fact, HTRC will remove all metadata fields from the OCR core so that it has only two fields, id and ocr; all metadata queries should go to the metadata core.

1.1. HTRC Solr basic queries

All basic queries are allowed; "update" operations are banned. Detailed instructions can be found at http://wiki.apache.org/solr/SolrQuerySyntax and http://wiki.apache.org/solr/CommonQueryParameters.

Because HTRC has very large index files, distributed search is used to make use of more system resources. Previously, "qt=sharding" had to be appended to the REST call to make sure the query was sent to all shards. This is no longer necessary because the default "qt" is now "sharding". Users can still set "qt" explicitly to override the default query type. What follows are instructions for the most frequently used queries, which are sufficient for most uses:

(1) basic term query pattern:

{field name} : {term}

For example, "title: war" returns the IDs of all volumes whose title field contains the word "war".

(2) Boolean query: two queries joined by "AND" or "OR"

{field name1} : {term1} AND {field name2} : {term2}

{field name1} : {term1}  OR  {field name2} : {term2}

Example:

title:war AND author: Hill

returns the IDs of volumes that have "war" in the title and an author named "Hill".

(3) Numeric range query: HTRC Solr has only one numeric field, publishDate.

{field name} : [{value1} TO {value2}] //value1 and value2 are included.

Example:

publishDate : [1990 TO 1999]

returns the IDs of all volumes whose publishDate is between 1990 and 1999, inclusive.

(4) Prefix query: "*" matches zero or more characters

{field name} : {prefix}*

Example:  

title: chil*   

returns the IDs of all volumes whose title field contains a word starting with "chil". To match every volume in the index, use "*:*".
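The four query patterns above can be composed programmatically. The following is a minimal Python sketch; the helper names (term_query, boolean_query, range_query, prefix_query) are hypothetical and are not part of the HTRC Solr Proxy. They only build the query strings described in this section.

```python
def term_query(field, term):
    """Basic term query: {field name}:{term}."""
    return f"{field}:{term}"


def boolean_query(left, operator, right):
    """Join two queries with AND or OR."""
    if operator not in ("AND", "OR"):
        raise ValueError("operator must be 'AND' or 'OR'")
    return f"{left} {operator} {right}"


def range_query(field, low, high):
    """Numeric range query; both end points are included."""
    return f"{field}:[{low} TO {high}]"


def prefix_query(field, prefix):
    """Prefix query; '*' matches zero or more trailing characters."""
    return f"{field}:{prefix}*"


if __name__ == "__main__":
    war = term_query("title", "war")
    hill = term_query("author", "Hill")
    print(boolean_query(war, "AND", hill))
    print(range_query("publishDate", 1990, 1999))
    print(prefix_query("title", "chil"))
```

The resulting strings are the values that go into the "q" parameter of the REST call.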

To issue a basic query to HTRC Solr Proxy, the full RESTful request will be:

http://{hostname}:{port number}/solr/select/?q={basic query}

Example: 

http://chinkapin.pti.indiana.edu:9994/solr/select/?q=title:war

returns the IDs of volumes that have "war" in the title.

You can also set the number of results you want by setting "rows".

Example:

http://chinkapin.pti.indiana.edu:9994/solr/ocr/select/?q=ocr:love&rows=5

returns the top 5 hits among the volumes whose OCR text contains "love".
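A request URL for the metadata core can be assembled as in the sketch below. The function name build_select_url is hypothetical; the sketch uses only the Python standard library and does not contact the server. Note that urlencode percent-encodes the colon and replaces spaces with "+", which is ordinary URL encoding of the same query.

```python
from urllib.parse import urlencode

# Metadata core endpoint used in the examples of this guide.
SELECT_URL = "http://chinkapin.pti.indiana.edu:9994/solr/select/"


def build_select_url(query, rows=None, base_url=SELECT_URL):
    """Return a select URL for the given query string.

    rows is optional; when given, it limits the number of results.
    """
    params = {"q": query}
    if rows is not None:
        params["rows"] = rows
    return base_url + "?" + urlencode(params)


if __name__ == "__main__":
    print(build_select_url("title:war"))
    print(build_select_url("title:war AND author:Hill", rows=5))
```

The returned string can be passed to any HTTP client, for example urllib.request.urlopen.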

Ranking is done by Solr itself. Apache Solr's scoring and ranking mechanism combines the Boolean Model (BM) of information retrieval with the Vector Space Model (VSM): documents "approved" by the BM are scored by the VSM.

Solr query parameters are very flexible. For more details, please refer to http://wiki.apache.org/solr/CommonQueryParameters .

1.2. Facet Query

Facet query syntax is:

http://{hostname}:{port number}/solr/select/?q={basic query}&facet=on&facet.field={field name}

Example:

http://chinkapin.pti.indiana.edu:9994/solr/select/?q=*:*&facet=on&facet.field=genre&qt=sharding

returns all volume IDs in the index; at the bottom of the result set, every genre is listed with the number of volumes in the search result that belong to it.

For more details of facet parameters, please refer to http://wiki.apache.org/solr/SimpleFacetParameters .
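The sketch below shows one way to read facet counts from a response. It rests on assumptions that this guide does not state: that the proxy passes through Solr's "wt=json" parameter, and that the response uses Solr's default JSON facet layout, in which each facet field is a flat list alternating between value and count. The function names and the sample response (with made-up counts) are illustrative only.

```python
import json
from urllib.parse import urlencode

SELECT_URL = "http://chinkapin.pti.indiana.edu:9994/solr/select/"


def facet_url(query, field):
    """Build a facet request URL.

    wt=json is an assumption; it asks Solr for a JSON response.
    """
    params = {"q": query, "facet": "on", "facet.field": field, "wt": "json"}
    return SELECT_URL + "?" + urlencode(params)


def facet_counts(response_text, field):
    """Turn the flat [value, count, value, count, ...] list into a dict."""
    data = json.loads(response_text)
    flat = data["facet_counts"]["facet_fields"][field]
    return dict(zip(flat[0::2], flat[1::2]))


# Illustrative response fragment with made-up counts, not real HTRC data.
SAMPLE = '{"facet_counts": {"facet_fields": {"genre": ["fiction", 3, "poetry", 1]}}}'

if __name__ == "__main__":
    print(facet_url("*:*", "genre"))
    print(facet_counts(SAMPLE, "genre"))
```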

1.3. HTRC Solr full text search

Users can run full-text searches on the ocr field through the Solr OCR core. Basic, Boolean, phrase, and wildcard queries are all supported by the OCR core. A full-text search on the ocr field uses a slightly different REST call pattern:

http://{hostname}:{port number}/solr/ocr/select/?q={query string}

Otherwise, a 400 response code is returned, because the metadata core has no ocr field.

<warn>RESPONSE CODE: 400</warn>

A quick example of an OCR full-text search:

http://chinkapin.pti.indiana.edu:9994/solr/ocr/select/?q=ocr:hathitrust

returns the volumes that have "hathitrust" in their textual content.
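The following sketch selects the endpoint by core and turns the 400 response described above into a more descriptive error. The names select_url and run_query are hypothetical, and the opener parameter exists only so that the function can be exercised without network access.

```python
import urllib.error
import urllib.request
from urllib.parse import urlencode

PROXY = "http://chinkapin.pti.indiana.edu:9994"


def select_url(query, core=None):
    """core=None targets the metadata core; core='ocr' targets the OCR core."""
    path = "/solr/select/" if core is None else f"/solr/{core}/select/"
    return PROXY + path + "?" + urlencode({"q": query})


def run_query(url, opener=urllib.request.urlopen):
    """Fetch the URL and return the response body as text."""
    try:
        with opener(url) as response:
            return response.read().decode("utf-8")
    except urllib.error.HTTPError as err:
        if err.code == 400:
            # One documented cause: querying the ocr field on the metadata core.
            raise ValueError(
                "400 from the proxy; check that ocr queries use /solr/ocr/select/"
            ) from err
        raise


if __name__ == "__main__":
    print(select_url("ocr:hathitrust", core="ocr"))
```

Calling run_query(select_url("ocr:hathitrust", core="ocr")) would send the request shown in the example above.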

1.4. Getting MARC Records

Users can download the MARC records for a set of volume IDs through the HTRC Solr API. The download is a zip file that contains the MARC records for the specified IDs. Each entry in the zip file is the MARC record for one volume, and the entry's name is the volume ID.

The syntax is:

http://{host}:{port}/solr/MARC/?volumeIDs={id_1|id_2|id_3|...|id_n} 

Multiple IDs are separated by "|".

Example:  

http://chinkapin.pti.indiana.edu:9994/solr/MARC/?volumeIDs=miua.2916929.0001.001|miua.2088345.0001.001

returns a zip file that contains two entries (MARC records), one for miua.2916929.0001.001 and the other for miua.2088345.0001.001.
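A minimal sketch of building the MARC request and unpacking the returned zip file follows. The function names are hypothetical. The "|" separator is left unencoded to match the example URL above; other reserved characters in volume IDs are percent-encoded, which assumes the proxy decodes them in the usual way.

```python
import io
import zipfile
from urllib.parse import quote

MARC_ENDPOINT = "http://chinkapin.pti.indiana.edu:9994/solr/MARC/"


def marc_url(volume_ids):
    """Join the volume IDs with '|' and build the MARC download URL."""
    joined = "|".join(volume_ids)
    return MARC_ENDPOINT + "?volumeIDs=" + quote(joined, safe="|")


def read_marc_zip(data):
    """Map each zip entry name (the volume ID) to its raw record bytes."""
    records = {}
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        for name in archive.namelist():
            records[name] = archive.read(name)
    return records


if __name__ == "__main__":
    print(marc_url(["miua.2916929.0001.001", "miua.2088345.0001.001"]))
```

The bytes passed to read_marc_zip would be the body of the HTTP response for the URL returned by marc_url.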