Creating and Analyzing Worksets


This guide covers the steps needed to get started with the HTRC Workset Builder and the HTRC Portal, two tools for creating and analyzing worksets. A workset is a collection of HathiTrust materials that you bring together for analysis. For example, you might create a workset of all volumes published in the United States in 1900 whose full text contains the word “automaton”. The resulting workset is the unit of analysis: for example, you might run topic modeling on it in the HTRC Portal.




To date, nearly 3 million public domain volumes form the library you can draw from to build and analyze worksets.


Getting Started


Creating a Workset



  • Refine search by selecting more options



  • Narrow search by clicking facets on the left-hand side of the search results



  • Select Items for Workset: Click “Select items on page” or “Select all search items”



  • Click “Selected Items”


  • Click “Create/Update Workset”



  • Name the workset, describe it, set it to public or private, tag it, and create it


  • Go to HTRC Portal



Notes on Creating a Workset

  • Basic worksets can have labels added to them for use with classification algorithms like Naïve Bayes.

    • Build a workset using the Workset Builder

    • Download the workset

    • Open the workset in a comma-separated values (CSV) editor or a spreadsheet application

      • The first line should be a header (the names of the columns).

      • The first column should be a volume id, and the second column should record the label of the volume.

      • Add a class column (CSV files can be appended)

      • Volumes can be added manually by looking up the "Volume_id" in the Workset Builder

    • Upload the edited workset to HTRC Portal using the Workset menu

      • Note: The portal and the CSV file list the volumes of a workset in different orders. (We are working on a fix for this issue.) Do not assume that the two orders match. If you assign classes to the volumes in your CSV file by following the order displayed in the portal, the classes may end up attached to the wrong volumes.
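
One way to avoid the ordering problem described in the note above is to assign labels by volume ID rather than by row position. The following is a minimal sketch, not part of the HTRC tools: the function name add_labels, the column names "volume_id" and "class", and the volume IDs in the usage example are all assumptions made for illustration. Check the header of your downloaded workset for the actual column names.

```python
import csv
import io


def add_labels(csv_text, labels_by_volume_id,
               id_column="volume_id", label_column="class"):
    """Return CSV text with a label column filled in by volume ID.

    Labels are looked up by ID, so the row order of the file
    (or of the portal display) does not matter.
    Column names are assumptions; adjust them to match your file.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    fieldnames = list(reader.fieldnames)
    if label_column not in fieldnames:
        # Place the label in the second column, right after the volume ID.
        fieldnames.insert(1, label_column)

    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        # Keep any existing label when no new label is given for this volume.
        existing = row.get(label_column) or ""
        row[label_column] = labels_by_volume_id.get(row[id_column], existing)
        writer.writerow(row)
    return out.getvalue()
```

For example, with made-up volume IDs, `add_labels("volume_id,title\nvol.0002,B\nvol.0001,A\n", {"vol.0001": "fiction", "vol.0002": "nonfiction"})` attaches "nonfiction" to vol.0002 and "fiction" to vol.0001 regardless of the order in which the rows appear.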


Analyzing a Workset



  • Click ‘Submit’ to start workset analysis


  • The workset will show one of four statuses: staging, queued, running, or finished



Notes on Analyzing a Workset

  • Multiple worksets can be submitted for analysis.

  • All public worksets are available for analysis in the Worksets menu on the HTRC Portal main page.

  • The time an analysis takes to complete varies with the size of the workset.

  • Depending on the algorithm, different options will be presented when preparing the workset for analysis. For example, the Tagcloud_with_cleaning algorithm lets you supply your own list of stopwords (words that should be excluded from the analysis, e.g. common words like ‘I’, ‘a’, ‘and’).

  • Depending on the algorithm, different options will be presented when viewing results. For example, the topic modeling algorithm provides an XML file of topics and an HTML file that displays topic tag clouds.
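
To illustrate what a stopword list does, here is a minimal sketch of filtering common words out of a word count. It is not the implementation of Tagcloud_with_cleaning. The one-word-per-line list format and the function names parse_stopwords and count_words are assumptions made for this example, and the portal may expect a different format.

```python
from collections import Counter
import re


def parse_stopwords(text):
    """Read one stopword per line, ignoring blank lines and letter case."""
    return {line.strip().lower() for line in text.splitlines() if line.strip()}


def count_words(text, stopwords):
    """Count word frequencies in text, skipping any word in stopwords."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(token for token in tokens if token not in stopwords)


# A small stopword list, assumed here to be one word per line.
stopwords = parse_stopwords("I\na\nand\nthe\n")
counts = count_words("The automaton and I saw a second automaton", stopwords)
print(counts.most_common(1))  # [('automaton', 2)]
```

Without the stopword list, words such as "the" and "and" would be counted alongside "automaton" and would tend to crowd the more informative words out of the results.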