- Access and Services
- Q: How do I obtain an account to access HTRC?
- Q: How do I obtain an account to access HTRC Sandbox?
- Q: How do I access HTRC Sandbox?
- Q: How do I access HTRC Production Stack?
- Q: What are the differences between the Production Stack and the Sandbox?
- Q: What is the HTRC Solr Proxy and how is it different from Apache Solr?
- Q: What is the difference between the HTRC Data API and HathiTrust Data API?
HTRC is the research arm of HathiTrust. It is a partnership between two universities: at Indiana University (IU), the IU Libraries, the Pervasive Technology Institute, and the School of Informatics and Computing; and at the University of Illinois, Urbana-Champaign (UIUC), the UIUC Libraries and the Graduate School of Library and Information Science.
A: We have created a couple of platforms for you to experiment with. The main HTRC services (sometimes referred to as the production stack) give you a Portal and a Workset Builder.
From the Portal you can log in and run analytic algorithms on a set of predefined collections of volumes. These algorithms, powered by the SEASR toolkit, run against the HathiTrust volumes that are in the public domain (close to 3M).
The Workset Builder is a search interface for the HathiTrust public domain corpus. Search results can be saved as a 'workset': a collection of volumes against which the text mining algorithms are run.
In addition to the main services, we also provide a Sandbox stack with the same tools. The sandbox runs against non-Google scanned content (about 260,000 volumes). The advantage of the sandbox is that you can access the index and Data API directly, and so you can write your own algorithms.
A: The HTRC has several overarching paradigms: worksets, algorithms, jobs, and results.
- Worksets are collections of volumes and other data to be processed. Worksets are built using software that functions like many library catalog systems. In the Workset Builder application (often referred to as Blacklight), you will be able to search for, view, and select items that you would like to process.
- Algorithms are research methodologies expressed in executable code; that is, they are programs that run one or more functions against your workset. You can choose from a set of algorithms that have been integrated into the HTRC, and you can customize the parameters for each algorithm.
- Jobs: When you hit submit, you are submitting a job. A job is a set of instructions that is executed by one of the computing resources available to the HTRC. You can view the status of the jobs that you have submitted, and you can delete a job, for example if you find that you made an error in your setup.
- Results: When your job has completed, you can view its results in the HTRC or download them.
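As a rough mental model only, the four concepts above can be sketched as plain data structures: a job applies one algorithm, with chosen parameters, to one workset and eventually carries results. This is not the HTRC API; every class, field, and status value below is hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Workset:
    """A named collection of volume IDs selected for processing."""
    name: str
    volume_ids: List[str]


@dataclass
class Algorithm:
    """An executable analysis with user-adjustable parameters."""
    name: str
    default_params: Dict[str, str] = field(default_factory=dict)


@dataclass
class Job:
    """One submission: an algorithm run against a workset."""
    workset: Workset
    algorithm: Algorithm
    params: Dict[str, str]
    status: str = "submitted"      # hypothetical status value
    results: Optional[str] = None  # stays None until the job completes


def submit(workset: Workset, algorithm: Algorithm, **overrides: str) -> Job:
    """Merge user overrides into the algorithm defaults and create a job."""
    params = {**algorithm.default_params, **overrides}
    return Job(workset=workset, algorithm=algorithm, params=params)


ws = Workset("my-workset", ["example.0001", "example.0002"])  # placeholder IDs
algo = Algorithm("word-count", {"top_n": "100"})              # hypothetical algorithm
job = submit(ws, algo, top_n="50")
print(job.status, job.params)
```

The point of the sketch is the relationship, not the names: customizing parameters changes the job, not the algorithm or the workset, so the same workset can be reused across many jobs.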
A: HTRC currently has the public domain corpus OCR text, along with MARC and METS XML.
A: You may sign up for an account by going to the HTRC Production Portal at http://htrc2.pti.indiana.edu and choosing "Sign up" from the menu.
A: Please send an email to email@example.com (a list read by HTRC internal staff only) to request an account. Include your name and contact information, and indicate that you would like to access the HTRC Sandbox.
A: This table lists the HTRC Sandbox endpoints:
|Service||URL||Description|
|Portal||http://sandbox.htrc.illinois.edu:8080||The portal allows you to browse volume lists and algorithms, execute algorithms, and view results|
|Blacklight||http://sandbox.htrc.illinois.edu:8080/blacklight||The Blacklight search interface allows you to search for volumes and create volume lists that can be used by algorithms. It provides a graphical interface to our Solr index|
|Data API||https://sandbox.htrc.illinois.edu:25443/data-api/||The HTRC Data API provides access to the corpus data and METS XML via a RESTful web service|
|Solr Proxy||http://sandbox.pti.indiana.edu:9994/solr/||The HTRC Solr Proxy provides access to the Solr index|
A: This table lists the HTRC Production Stack endpoints:
|Service||URL||Description|
|Portal||http://htrc2.pti.indiana.edu||The portal allows you to browse volume lists and algorithms, execute algorithms, and view results|
|Blacklight||http://htrc2.pti.indiana.edu/blacklight||The Blacklight search interface allows you to search for volumes and create volume lists that can be used by algorithms. It provides a graphical interface to our Solr index|
A: This table outlines the differences between the Production Stack and the Sandbox:
|Feature||Production Stack||Sandbox|
|Purpose||A distributed, service-oriented cyberinfrastructure to support digital humanities research and text analysis by HTRC members||A community asset, open to the community, where interested users can try things out on a smaller scale|
|Number of machines||9||1|
|Corpus||Full public domain set||Non-Google scanned public domain subset|
|Number of volumes||2.7 million||250,000|
|Compute resource||A separate 128-node cluster||Local on the Sandbox|
|Accounts||Personal account||Pre-defined account pool|
|Account reclamation||No||Yes (reclaimed and reassigned after 30 days of inactivity)|
A: The HTRC Solr Proxy is a thin service that sits in front of the Apache Solr services for security and auditing purposes. It filters incoming requests so that only read-only requests are allowed, which protects our indices from being modified; other than that, it is fully compatible with Apache Solr. Please see the Solr Proxy API User Guide.
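Since the proxy is described as compatible with Apache Solr for read-only requests, a standard Solr `select` query should in principle work against it. The sketch below assumes exactly that and nothing more: the base URL is the Sandbox Solr Proxy endpoint listed earlier, while the `title` field, the optional `core` path segment, and the function names are illustrative assumptions. Consult the Solr Proxy API User Guide for the actual paths and field names.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

SOLR_PROXY = "http://sandbox.pti.indiana.edu:9994/solr"  # Sandbox endpoint from this FAQ


def build_select_url(query, rows=10, core=None, base=SOLR_PROXY):
    """Build a read-only Solr /select URL that asks for JSON output."""
    # Whether a core name is needed in the path is an assumption left to the caller.
    path = f"{base}/{core}/select" if core else f"{base}/select"
    return f"{path}?{urlencode({'q': query, 'rows': rows, 'wt': 'json'})}"


def search(query, rows=10, core=None):
    """Run the query and return the list of matching Solr documents."""
    with urlopen(build_select_url(query, rows, core)) as resp:
        return json.load(resp)["response"]["docs"]


# Only builds the URL; call search(...) to actually contact the server.
print(build_select_url("title:shakespeare", rows=5))  # 'title' is a hypothetical field
```

Update requests are not shown because, per the description above, the proxy rejects anything that would modify the index.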
A: This table outlines the differences between the HTRC Data API and the HathiTrust Data API:
|Feature||HTRC Data API||HT Data API|
|Purpose||To serve high-performance large-scale algorithms and programs||To provide public users some volume retrieval capabilities|
|Bulk retrieval of volumes||Yes||No|
|Metadata available||METS||METS, MARC|
A: The HTRC Sloan Cloud is intended to support large-scale non-consumptive research, but it is still under internal testing.
It will provide the technical infrastructure for such research, in anticipation of access to the remaining 70% of the HathiTrust corpus. ("Non-consumptive" research means that researchers treat the digitized text as data, searching or mining it with algorithms, rather than consuming it the way a reader reads a book.)
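To make the idea of non-consumptive use concrete, here is a small illustration (not HTRC code; the function name and the sample text are placeholders). The analysis reads the text but hands back only derived data, in this case word counts, rather than readable running text.

```python
import re
from collections import Counter


def word_frequencies(pages):
    """Return token counts for a volume instead of its readable text."""
    counts = Counter()
    for page in pages:
        # Lowercase and keep alphabetic runs only; a deliberately crude tokenizer.
        counts.update(re.findall(r"[a-z]+", page.lower()))
    return counts


# Placeholder page text standing in for a volume's OCR.
pages = ["The whale, the sea.", "The ship and the sea."]
print(word_frequencies(pages).most_common(2))
```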
A: Please see the HTRC Data API Users Guide.
A: Worksets are collections of volumes from our collection. There are currently two types of workset: basic and labeled. Basic worksets can be created with the Workset Builder or by uploading a CSV; labeled worksets can only be added by uploading a CSV.
Creating worksets with the Workset Builder
The easiest way to create a basic workset is to use the Workset Builder. The Workset Builder allows you to search across our collection.
All the items that you select are kept in the Workset Builder. To review them, click "selected items" in the navigation bar. This is meant as a workspace for building a volume list for the workset. To save these items as a workset, click "Create/Update workset".
When you save a workset, note that it can be saved as public (viewable by all users) or private. After a workset is saved, it is available in the HTRC Portal for use in analysis or for download.
Building labeled worksets
While a basic workset simply collects volumes in one place, it is possible to add classes to worksets. This allows for use with classification algorithms, such as Naive Bayes.
A labeled workset is uploaded as a CSV file, which can be built in whatever way you prefer. One common approach is to:
- build a basic workset in the Workset Builder
- download the basic workset
- open the workset in the HTRC CSV Editor prototype (or a spreadsheet application of your choosing)
- in the CSV Editor or spreadsheet:
  - add a 'class' column and fill it in
  - optionally, append other CSVs
  - optionally, add volumes manually (by looking up the "Volume_id" in the Workset Builder)
- upload the output of the HTRC CSV Editor, or the saved spreadsheet, to the HTRC Portal
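If you would rather script the labeling step than use an editor, the following is a minimal sketch of adding a 'class' column to a downloaded basic workset. It assumes the download has a header row with a `volume_id` column; that column name, the sample rows, and the labels are placeholders, so check them against your own file.

```python
import csv
import io


def add_class_column(rows, labels, id_column="volume_id", default=""):
    """Return new rows with the volume ID first and the class label second.

    rows   -- iterable of dicts, e.g. from csv.DictReader
    labels -- dict mapping volume ID -> class label
    Labels are looked up by volume ID, never by row position.
    """
    out = []
    for row in rows:
        vol_id = row[id_column]
        rest = {k: v for k, v in row.items() if k not in (id_column, "class")}
        out.append({id_column: vol_id, "class": labels.get(vol_id, default), **rest})
    return out


# Placeholder data standing in for a downloaded basic workset.
basic_csv = "volume_id,title\nexample.001,First Title\nexample.002,Second Title\n"
labels = {"example.001": "AuthorA", "example.002": "AuthorB"}

labeled = add_class_column(csv.DictReader(io.StringIO(basic_csv)), labels)

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(labeled[0].keys()), lineterminator="\n")
writer.writeheader()
writer.writerows(labeled)
print(buffer.getvalue())
```

Looking labels up by volume ID rather than by position keeps the result correct even if the rows arrive in a different order than you expected.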
A labeled workset CSV should use the following format:
- the first line should be a header (or names of each column);
- the first column should be a volume id, and the second column should record the label of the volume.
Below is an example of what the CSV file looks like. Given some volumes, classes are assigned to them based on some criteria. For example, here the labels are the names of the authors of the volumes:
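A minimal sketch of such a file, together with a small check of the two format rules above, might look like this. The volume IDs and author names are made up for illustration, and the header names `volume_id` and `class` are assumptions rather than required values.

```python
import csv
import io

# Made-up volume IDs and author labels, for illustration only.
SAMPLE = """volume_id,class
example.0001,Austen
example.0002,Austen
example.0003,Dickens
"""


def check_labeled_csv(text):
    """Return a list of problems found; an empty list means the rules hold."""
    rows = list(csv.reader(io.StringIO(text)))
    if len(rows) < 2:
        return ["expected a header line followed by at least one volume"]
    problems = []
    # The first line is treated as the header and is not inspected further.
    for line_no, row in enumerate(rows[1:], start=2):
        if len(row) < 2 or not row[0].strip() or not row[1].strip():
            problems.append(f"line {line_no}: need a volume id and a label")
    return problems


print(check_labeled_csv(SAMPLE))
```

Running a check like this before uploading catches rows with a missing label, which are easy to overlook in a spreadsheet.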
Worksets are uploaded in the HTRC Portal, under Worksets > Upload Workset, or with the '+' button in the workset list view. This is an alternative to the Workset Builder, and currently the only way to add labeled worksets.
At present, the portal and the CSV file display the volumes of a workset in different orders. (We are working on a fix for this issue.) Do not assume that the two orders match: if you assign classes to the volumes in your CSV file by referring to the order displayed in the portal, the labels may end up on the wrong volumes.
One way to find the title or content of a book while assigning classes to volumes is to open http://babel.hathitrust.org/cgi/pt?id=mdp.39015033434559;view=1up;seq=1, replacing the volume ID in this URL with the desired volume ID.
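The substitution can be scripted when you have many volumes to look up. This sketch reuses the URL pattern shown above; the function name is hypothetical, and percent-encoding of unusual characters in a volume ID is an assumption about what the viewer accepts.

```python
from urllib.parse import quote

PAGE_TURNER = "http://babel.hathitrust.org/cgi/pt?id={vol_id};view=1up;seq=1"


def page_turner_url(volume_id):
    """Return the page-viewer URL for a volume, opened at the first page."""
    # Keep ':' and '/' readable; percent-encode other unsafe characters (assumption).
    return PAGE_TURNER.format(vol_id=quote(volume_id, safe=":/"))


print(page_turner_url("mdp.39015033434559"))
```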
A: Below are links to some very useful documentation:
A: Yes. All of the HTRC services code modules are open source and are available from SourceForge. Go to http://sourceforge.net/p/htrc/code/ to browse the code, or check out directly from SVN using:
svn co svn://svn.code.sf.net/p/htrc/code/
A: Please join the HTRC Usergroup mailing list.
- Please send an email to firstname.lastname@example.org to subscribe to the list, and
- Use email@example.com to post questions
- For questions that you want to discuss with us privately, please write to firstname.lastname@example.org, a list read by HTRC internal staff only
A: To report a bug, please go to http://jira.htrc.illinois.edu/browse/HTRC. You need to create a JIRA login account if you have not done so already. To provide feedback, you may use the "feedback" tab found on the right-hand side of various portal pages to pop up a form.