
Solr Proxy API User Guide

The HTRC Solr Proxy is a thin layer over a Solr service that limits access, audits requests, and provides additional functionality. It runs at http://chinkapin.pti.indiana.edu:9994, which is the address used in the examples below.

Before introducing the queries, the table below lists the frequently used fields in HTRC Solr and how they are processed. Indexed fields are searchable, while the content of stored fields can be returned to users.

Field name       Indexed  Stored  Explanation
id               Y        Y       Field for volume ID
ocr              Y        N       OCR field for full text search
title            Y        Y       Field for book title
author           Y        Y       Author of a volume
oclc             Y        Y       The OCLC Control Number
rptnum           Y        Y
sdrnum           Y        Y
isbn             Y        Y       The International Standard Book Number (ISBN) for a volume
issn             Y        Y       The International Standard Serial Number (ISSN) for a volume
callnumber       Y        Y       The call number for a volume
sudoc            Y        Y       Superintendent of Documents number for a volume. Please refer to http://www.fdlp.gov/cataloging/856-sudoc-classification-scheme?start=3
language         Y        N       Field for the languages in which a volume is written
htsource         Y        Y       Field for the sources of a volume, e.g. "Indiana University"
era              Y        N       The era of the volume
geographic       Y        N       Brief geographic information for a volume, e.g. "pennsylvania"
country_of_pub   Y        N       The publication country of the book
topic            Y        N       Topic of the volume
genre            Y        N       Genre the volume belongs to
publishDate      Y        Y       Publish date of the volume
publisher        Y        Y       Publisher of the volume
edition          Y        Y       Edition of the volume
allfields        N        Y       All bibliographic fields above except the OCR field are indexed in this field. This is the default search field.

 

Note that some fields are indexed but not stored. These fields do not appear in search results, but they are very useful for faceted search.

1.1. HTRC Solr basic queries

All basic queries are allowed; "Update" operations are banned. Detailed instructions can be found at http://wiki.apache.org/solr/SolrQuerySyntax and http://wiki.apache.org/solr/CommonQueryParameters.

Because HTRC has very large index files, distributed search is used to take advantage of more system resources. Previously, "qt=sharding" had to be appended to the REST call to make sure the query was sent to all shards. Users no longer need to do that, because the default "qt" is now "sharding". Users can still specify a query type explicitly to override the default "qt" parameter. What follows are instructions for the most frequently used queries; these queries are sufficient for most uses:

(1) Basic term query:

{field name} : {term}

Example:

title: war

returns volume IDs of all volumes whose title field contains the word "war".

(2) Boolean query:  simple concatenation of two queries by "AND" or "OR"

{field name1} : {term1} AND {field name2} : {term2}

{field name1} : {term1}  OR  {field name2} : {term2}

Example:

title:war AND author: Hill

returns the IDs of volumes that have "war" in the title and an author named "Hill".

(3) Numeric range query: HTRC Solr has only one numeric field, publishDate.

{field name} : [{value1} TO {value2}] // value1 and value2 are included.

Example:

publishDate : [1990 TO 1999]

returns the IDs of all volumes whose publishDate is between 1990 and 1999, inclusive of both 1990 and 1999.

(4) Prefix query: "*" is used here for zero or more characters 

{field name}:{ prefix*}

Example:  

title: chil*   

returns all volume IDs whose title field contains a word starting with "chil". If you want all the volumes in this index, just use "*:*".
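The four query patterns above can be composed programmatically. The following is a minimal sketch in Python; the helper names (term_query, boolean_query, range_query, prefix_query) are hypothetical and are not part of the HTRC Solr Proxy. They only assemble the query string that goes into the "q" parameter.

```python
def term_query(field, term):
    """Basic term query, e.g. title:war."""
    return f"{field}:{term}"


def boolean_query(left, right, operator="AND"):
    """Join two queries with AND or OR."""
    if operator not in ("AND", "OR"):
        raise ValueError("operator must be 'AND' or 'OR'")
    return f"{left} {operator} {right}"


def range_query(field, low, high):
    """Inclusive numeric range query, e.g. publishDate:[1990 TO 1999]."""
    return f"{field}:[{low} TO {high}]"


def prefix_query(field, prefix):
    """Prefix query; '*' matches zero or more trailing characters."""
    return f"{field}:{prefix}*"


# Rebuild the examples from this section.
war_by_hill = boolean_query(term_query("title", "war"),
                            term_query("author", "Hill"))
nineties = range_query("publishDate", 1990, 1999)
print(war_by_hill)
print(nineties)
print(prefix_query("title", "chil"))
```

The strings produced this way still need to be URL-encoded before they are placed in a request.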

To issue a basic query to HTRC Solr Proxy, the full RESTful request will be:

http://{hostname}:{port number}/solr/select/?q={basic query}

Example: 

http://chinkapin.pti.indiana.edu:9994/solr/select/?q=title:war

returns volume IDs that have "war" in the title.

You can also set the number of results you want by setting "rows".

Example:

http://chinkapin.pti.indiana.edu:9994/solr/select/?q=ocr:love&rows=5

returns the top 5 hits among the volumes whose OCR text contains "love".
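Queries that contain spaces or special characters (for example Boolean or range queries) must be URL-encoded before they are sent. The sketch below shows one way to build the request URL with Python's standard library. The function name build_select_url is hypothetical; only the "q" and "rows" parameters and the /solr/select/ path come from this guide.

```python
from urllib.parse import urlencode

# Address used in the examples of this guide.
SELECT_URL = "http://chinkapin.pti.indiana.edu:9994/solr/select/"


def build_select_url(query, rows=None, base_url=SELECT_URL):
    """Return a full request URL for a basic query.

    urlencode percent-encodes the query, so ':' becomes %3A and a
    space becomes '+'; the server decodes these back to the original text.
    """
    params = {"q": query}
    if rows is not None:
        params["rows"] = rows
    return base_url + "?" + urlencode(params)


print(build_select_url("ocr:love", rows=5))
print(build_select_url("title:war AND author:Hill"))
```

The resulting URL can be opened in a browser or fetched with any HTTP client, for example urllib.request.urlopen.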

Ranking is done by Solr itself. Apache Solr's scoring and ranking mechanism combines the Boolean Model (BM) of information retrieval with the Vector Space Model (VSM): documents "approved" by the BM are scored by the VSM.

The Solr query parameters are very flexible. For more details, please refer to http://wiki.apache.org/solr/CommonQueryParameters .

1.2. Facet Query

Facet query syntax is:

http://{hostname}:{port number}/solr/select/?q={basic query}&facet=on&facet.field={field}&qt=sharding

Example:

http://chinkapin.pti.indiana.edu:9994/solr/select/?q=*:*&facet=on&facet.field=genre&qt=sharding

returns all volume IDs in the index and, at the bottom of the result set, lists every genre together with the number of volumes in the search result that belong to it. As mentioned above, "qt=sharding" can be omitted because it is the default.

For more details of facet parameters, please refer to http://wiki.apache.org/solr/SimpleFacetParameters .
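As an illustration, the sketch below builds a facet request and reads the facet counts out of a response. It assumes the response uses Solr's standard XML layout, in which counts appear under facet_counts / facet_fields / {field}; the function names are hypothetical, and the sample response and its counts are made up for the example rather than taken from the HTRC index.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

SELECT_URL = "http://chinkapin.pti.indiana.edu:9994/solr/select/"


def build_facet_url(query, facet_field, base_url=SELECT_URL):
    """Build a facet request; qt=sharding is left out because it is the default."""
    params = {"q": query, "facet": "on", "facet.field": facet_field}
    return base_url + "?" + urlencode(params)


def parse_facet_counts(xml_text, facet_field):
    """Return {facet value: count} for one facet field (assumed XML layout)."""
    root = ET.fromstring(xml_text)
    path = ("./lst[@name='facet_counts']/lst[@name='facet_fields']"
            f"/lst[@name='{facet_field}']/int")
    return {node.get("name"): int(node.text) for node in root.findall(path)}


# Made-up response fragment, for illustration only.
SAMPLE_RESPONSE = """<response>
  <lst name="facet_counts">
    <lst name="facet_fields">
      <lst name="genre">
        <int name="fiction">12</int>
        <int name="poetry">3</int>
      </lst>
    </lst>
  </lst>
</response>"""

genre_counts = parse_facet_counts(SAMPLE_RESPONSE, "genre")
print(build_facet_url("*:*", "genre"))
print(genre_counts)
```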

1.3. Getting MARC Records

Users can download the MARC records for a set of volume IDs through the HTRC Solr API. The download is a zip file that contains the MARC records for the specified IDs. Each entry in the zip file is the MARC record for one volume, and the entry's name is the volume ID.

The syntax is:

http://{host}:{port}/solr/MARC/?volumeIDs={id_1|id_2|id_3|...|id_n} 

Here IDs are separated by "|" for specifying more than one ID.

Example:  

http://chinkapin.pti.indiana.edu:9994/solr/MARC/?volumeIDs=miua.2916929.0001.001|miua.2088345.0001.001

returns a zip file that contains two entries (MARC records), one for miua.2916929.0001.001 and the other for miua.2088345.0001.001.
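The request and the handling of the returned zip file can be sketched as follows. The function names are hypothetical. The sketch percent-encodes the "|" separator as %7C, on the assumption that the proxy decodes it in the usual way, and it treats each zip entry as raw bytes without assuming a particular MARC serialization.

```python
import io
import zipfile
from urllib.parse import quote

MARC_URL = "http://chinkapin.pti.indiana.edu:9994/solr/MARC/"


def build_marc_url(volume_ids, base_url=MARC_URL):
    """Join volume IDs with '|' and build the MARC download URL."""
    joined = "|".join(volume_ids)
    # safe='' makes quote() encode '|' as %7C (assumed to be accepted by the proxy).
    return f"{base_url}?volumeIDs={quote(joined, safe='')}"


def read_marc_zip(zip_bytes):
    """Map each zip entry name (the volume ID) to the raw bytes of its record."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        return {name: archive.read(name) for name in archive.namelist()}


ids = ["miua.2916929.0001.001", "miua.2088345.0001.001"]
print(build_marc_url(ids))
```

After downloading the URL with an HTTP client, pass the response body to read_marc_zip to get one record per volume ID.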