Automatic Partner Login
Staff at the University of Michigan have developed functionality that allows users from partner institutions to be “automatically” logged into HathiTrust when following links from local institutional catalogs or other resources. Permanent links to HathiTrust volumes can now be wrapped with a single sign-on URL that automatically passes users through their own institution’s authentication service. Users who are not already authenticated are prompted to log in. Documentation of the new functionality is available at http://www.hathitrust.org/automatic_login. Thanks to Johns Hopkins University for suggesting this enhancement.
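As a rough illustration of the wrapping step, the Python sketch below builds a login-wrapped link. The login endpoint, the `entityID` and `target` parameter names, and the example identifiers are all assumptions modeled on a typical Shibboleth-style single sign-on service; they are not the documented HathiTrust format, for which the documentation linked above is authoritative.

```python
from urllib.parse import urlencode

# Hypothetical login endpoint; the real one is given in the HathiTrust documentation.
LOGIN_ENDPOINT = "https://sso.example.org/login"


def wrap_with_login(permanent_link: str, idp_entity_id: str) -> str:
    """Return a URL that routes the user through their institution's
    authentication service and then on to the permanent link.

    The parameter names `entityID` and `target` are assumptions.
    """
    query = urlencode({"entityID": idp_entity_id, "target": permanent_link})
    return f"{LOGIN_ENDPOINT}?{query}"


if __name__ == "__main__":
    print(wrap_with_login(
        "http://hdl.handle.net/2027/mdp.39015012345678",  # illustrative volume link
        "https://idp.example.edu/shibboleth",              # illustrative institution ID
    ))
```

A catalog could apply a function like this when rendering its links, so that the wrapped form never has to be stored in the bibliographic record.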
Staff at Michigan continued to work on tools that content depositors can use to create and validate locally-created content packages prior to submission to HathiTrust. The tools will be available to partner institutions in May.
HathiTrust began ingest of Google-digitized content from the University of Illinois in April, bringing in more than 80,000 volumes.
Working Groups and Committees
The Communications Working Group continued regular activities and development of a briefing for the new Board of Governors. New communication initiatives are awaiting the transition to the new Board.
User Experience Advisory Group
The UX Advisory Group revisited issues related to the labeling of PDF download options in PageTurner. The group’s recommended changes aim to clarify when full PDF downloads are or are not available. The changes are under development and will be implemented in May.
User Support Working Group
The table below contains a summary of the issues received by the User Support Working Group in April.
Non-partner Digital Deposit
|Access and Use||112||195|
Print on Demand
Full-PDF or e-copy requests
Data Availability and APIs
Reuse of content
Problems with login specifically
General Questions about login
Partners setting up login
*See User Support Working Group Issue Types for a description of the types of issues included in each category.
Bibliographic Data Management
California Digital Library created prototype exports of the metadata that will be used to populate HathiTrust’s tab-delimited inventory files (“hathifiles”) and bibliographic catalog. Timing tests for these exports were also conducted. The CDL team continued to reconcile bibliographic records in Zephir with records in the current system at the University of Michigan to ensure all the data is accounted for, addressing record discrepancies and ingest errors as encountered. The team has also begun development of a process to sync rights information in Zephir (the new management system) with the HathiTrust rights database.
University of Michigan staff continued work on modifications to the HathiTrust PageTurner to display JATS XML. jPach’s Norm module (see descriptions of all jPach modules) can now extract 15 common components of a journal article, plus embedded media, from a DOCX file and create valid JATS with references to associated media files. A specification for a Submission Information Package for jPach content is nearly complete and will be posted to the jPach website soon. Work has begun on developing wireframes for the Dashboard module. A timeline for the project is available on the HathiTrust jPach project page.
The HTRC completed the agreements necessary with Google to receive Google-digitized public domain volumes from the HathiTrust repository and make them available for computational purposes. With the Google agreements and a Memorandum of Understanding with HathiTrust in place, the HTRC is actively working with staff at Michigan to bring in the complete set of more than 2.9 million public domain volumes in HathiTrust. Preparation for the transfer includes setup of disk storage and compute nodes at Indiana University (IU), which is being done in collaboration with IU Research Technologies. All computation on HathiTrust volumes will be carried out on HTRC machines; the HTRC itself will not make content available for download. Users interested in receiving texts should follow the directions at http://www.hathitrust.org/datasets.
HTRC was represented at the recent Committee on Institutional Cooperation Digital Humanities Summit in Nebraska. Many attendees were already aware that the HTRC was a digital scholarship initiative of HathiTrust; brochures were on hand to provide a deeper level of detail.
The HTRC has created Meandre workflow components (Meandre is part of the SEASR infrastructure) that retrieve texts from the HTRC using the HTRC data API, spell-check the texts, correct OCR errors, and then perform topic modeling on the texts. The HTRC has demonstrated this functionality, creating topic models of all pages returned from the data API from single-word queries on a full-text index of volumes. For example, a search for “dickens” in the non-Google digitized public domain corpus returns more than 100 topics with associated keywords. The diagrams below show tag clouds of keywords for the topics “lady” and “men”.
IMLS Quality Grant
Project staff completed whole-volume review of the first 1,000-volume sample (1,000 English language, pre-1923 volumes digitized by Google), and over 70% of the second 1,000-volume sample (1,000 English language, post-1923 volumes digitized by Google). Approximately 150 volumes (15%) from each of the two samples were coded by two reviewers for quality assurance. The project team decided to perform whole-volume review on the same volumes sampled earlier in the project for page-level review in order to allow for comparison and more in-depth analysis of the data, and yield a better understanding of error within the volumes.
As of the end of April, staff had completed page-level review of approximately 50% of the fourth 1,000-volume sample (1,000 non-Roman language volumes including Korean, Chinese, Japanese, Arabic, Hebrew and Cyrillic).
In the months to come, the focus of the project team will shift away from data collection to data analysis and reporting, and use-case studies research. More information about this research is forthcoming. In May, the team will focus on developing a sub-study to better identify and describe errors in illustrative content. The project website has been updated to report initial findings. See the links under “Quality Review” at http://hathitrust-quality.projects.si.umich.edu/results.htm.
University of Michigan staff deployed the security enhancements described in the Update on March 2012 Activities, and the Data API now supports the use of OAuth 1.0-signed requests. As outlined in the March update, there will be a transition period, ending October 1, 2012, during which signed access to the Data API will be possible but not required. After October 1, all requests to the Data API will need to be properly signed with an access key provided by HathiTrust. HathiTrust has created a Web client that employs a user’s login credentials as a proxy for these keys to facilitate non-programmatic uses. Complete documentation of the security enhancements, methods of obtaining keys, signing requests, and accessing the Web client is forthcoming.
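Until that documentation is published, the Python sketch below shows in general terms what signing a request under OAuth 1.0 (RFC 5849) involves. It assumes the HMAC-SHA1 signature method and "two-legged" use with no token, which the forthcoming documentation may or may not confirm. The function name, the example URL, and the key values are hypothetical.

```python
import base64
import hashlib
import hmac
import secrets
import time
from urllib.parse import quote


def pct(value: str) -> str:
    """Percent-encode as RFC 5849 requires: only unreserved characters stay literal."""
    return quote(value, safe="~")


def sign_request(method, base_url, params, access_key, secret_key,
                 timestamp=None, nonce=None):
    """Return `params` plus the OAuth 1.0 fields, including the signature.

    `base_url` must not contain a query string; pass query values in `params`.
    Assumes two-legged use, so the token-secret half of the signing key is empty.
    """
    oauth = dict(params)
    oauth.update({
        "oauth_consumer_key": access_key,
        "oauth_nonce": nonce or secrets.token_hex(8),
        "oauth_signature_method": "HMAC-SHA1",
        "oauth_timestamp": str(timestamp or int(time.time())),
        "oauth_version": "1.0",
    })
    # Normalize: encode names and values, sort them, join as name=value pairs.
    pairs = sorted((pct(k), pct(str(v))) for k, v in oauth.items())
    normalized = "&".join(f"{k}={v}" for k, v in pairs)
    # Signature base string: METHOD & encoded URL & encoded parameter string.
    base_string = "&".join([method.upper(), pct(base_url), pct(normalized)])
    signing_key = f"{pct(secret_key)}&"
    digest = hmac.new(signing_key.encode(), base_string.encode(),
                      hashlib.sha1).digest()
    oauth["oauth_signature"] = base64.b64encode(digest).decode()
    return oauth


if __name__ == "__main__":
    signed = sign_request("GET", "http://api.example.org/resource",
                          {"v": "1"}, "my-access-key", "my-secret-key")
    print(signed["oauth_signature"])
```

The returned dictionary can be sent as query parameters or packed into an `Authorization` header; which of the two the Data API expects is not stated here.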
Also effective October 1, the host “services.hathitrust.org” will no longer exist for the Data API. The new host will be “babel.hathitrust.org”, the same host as the PageTurner and other HathiTrust services. Calls to the Data API will therefore need to use URLs such as the following (note the additional “cgi” in the path):
HathiTrust released the second phase of advanced full-text search functionality in April. Users can now combine up to four different fields connected by the “AND” or “OR” operators. Search parameters are retained when users click on the “Revise this advanced search” link on the search results page. The advanced search interface also allows complex Boolean expressions in the query box, for example:
If a user enters unbalanced parentheses, quotes, or operators, for example
the application strips out the operators, runs a default Boolean AND search, and displays a message informing the user.
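The fallback behavior can be sketched in Python as follows. This is only one plausible reading of the description, not the production search code: the function names, the balance checks, and the message wording are all assumptions.

```python
import re

OPERATORS = {"AND", "OR"}


def is_balanced(query: str) -> bool:
    """Simplified check: quotes pair up, parentheses nest properly, and no
    operator is leading, trailing, or directly followed by another operator."""
    if query.count('"') % 2:
        return False
    depth = 0
    for ch in query:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:  # a ")" appeared before its "("
                return False
    if depth:
        return False
    tokens = re.sub(r'[()"]', " ", query).split()
    if not tokens:
        return True
    if tokens[0] in OPERATORS or tokens[-1] in OPERATORS:
        return False
    return not any(a in OPERATORS and b in OPERATORS
                   for a, b in zip(tokens, tokens[1:]))


def prepare_query(query: str):
    """Return (query_to_run, message); message is None for well-formed input."""
    if is_balanced(query):
        return query, None
    # Drop parentheses, quotes, and operators, then AND the remaining words.
    words = [t for t in re.sub(r'[()"]', " ", query).split()
             if t not in OPERATORS]
    message = "Your query was not well formed; all words were searched with AND."
    return " AND ".join(words), message


if __name__ == "__main__":
    print(prepare_query("(heart OR cardiac AND surgery"))
```

A well-formed query passes through unchanged, so the fallback only affects input that the parser could not otherwise interpret.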
Several bugs in advanced search were also fixed:
- The reset button now actually clears the form.
- Queries with the characters “<”, “>”, or “&” are now handled correctly.
- The words “and” and “or” are now only interpreted as Boolean operators if the query is in lower case or mixed case and the operators (AND|OR) are all upper case.
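The case rule in the last fix can be illustrated with a short Python sketch. It reflects one reading of the rule (upper-case AND/OR count as operators unless the rest of the query is entirely upper case); the function name is hypothetical and the real implementation may differ.

```python
def boolean_operators(query: str) -> list:
    """Return the tokens in `query` that would be treated as Boolean operators."""
    tokens = query.split()
    others = [t for t in tokens if t not in ("AND", "OR")]
    # If every other word is upper case, "AND"/"OR" may simply be part of
    # an all-caps phrase, so they are treated as ordinary words.
    all_upper = " ".join(others).isupper()
    if all_upper:
        return []
    return [t for t in tokens if t in ("AND", "OR")]


if __name__ == "__main__":
    for q in ("war AND peace", "war and peace", "WAR AND PEACE"):
        print(q, "->", boolean_operators(q))
```

Under this reading, only the first of the three sample queries contains an operator; the lower-case "and" and the all-caps title are both searched as plain words.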
The HathiTrust PageTurner now displays a version for items in the repository (at the bottom of the left column when viewing an item). The version is the date the item was last updated. Items are updated when improvements such as higher quality or more complete scans have been made.
Web Hosting Infrastructure Changes
HathiTrust’s Drupal-based informational website was successfully moved from Michigan library web hosting infrastructure to the existing dedicated HathiTrust web hosting infrastructure. Work continued on the move of HathiTrust’s VuFind-based bibliographic catalog, which is expected to be completed in early May.
No outages were reported in April 2012.
HathiTrust sends notice upon discovery and resolution of unscheduled outages and in advance of scheduled outages and maintenance work that may result in an outage. We welcome and encourage additional recipients for these notices. If your institution is not receiving outage notifications and would like to, please contact email@example.com.
As of May 1:
|Library of Congress||0||89,416|
|North Carolina State University||0||3,196|
|University of North Carolina - Chapel Hill||0||8,088|
|New York Public Library||20||259,557|
|Penn State University||28||43,308|
|University of California||202||3,329,971|
|The University of Chicago||7,251||20,457|
|University of Illinois||80,642||96,146|
|University of Michigan||5,011||4,534,989|
|University of Minnesota||86||95,150|
|University of Wisconsin||1||534,871|
|University of Virginia||1||48,922|
Public Domain (~28%)
* Includes volumes opened through copyright review and rights holder permissions
Jeremy York, Access Services in the Age of Mass Digitization. IviesPlus Conference, University of Chicago, April 20, 2012.
Rebuild the Large Scale Search Solr/Lucene index with CJK (Chinese, Japanese, Korean) indexing improvements; to be completed in May or June