[Download PDF ]
The elections for the HathiTrust Board of Governors are taking place the first two weeks of March. There are twelve candidates in the running, and six will be elected. Each institution or consortia is able to vote for up to six individuals, and each vote will be assigned the full voting weight of the partner. The 12 candidates are as follows:
Staff at Michigan added 5 new fields to HathiTrust’s tab-delimited inventory files : publication date, publication location, language, bibliographic format, and an indication of whether or not a volume has been identified as a U.S. federal government document. A description of the new fields, which are included in the inventory files as of March 1, is available at http://www.hathitrust.org/hathifiles_description .
Effective May 1, support for legacy Data API URLs in the following form will be removed:
After May 1, URLs should be submitted according to the current Data API schema  without the “api” path element:
Michigan staff resumed work on Data API security enhancements, which was postponed in August as staff prepared systems to support access capabilities for users who have print disabilities and access to orphan works. The specification for the new Data API security is available at http://bit.ly/jozHQK.
HathiTrust is now posting reports of public domain and open access volumes in HathiTrust that are available for print on demand. The reports can be found at http://www.hathitrust.org/pod_reports  and will be released on the first of every month beginning in April.
HathiTrust began working with the University of Utah and continued conversations with Northwestern University on ingest of locally-digitized volumes. Staff at Michigan completed ingest of a second set of open access volumes from the Utah State University Press.
Michigan Staff received bibliographic metadata for approximately 180,000 volumes from the University of Illinois. These volumes were part of growth projections that are made yearly by partners, on which annual storage purchases are made. Ingest of the Illinois volumes will begin after the 2012 additional storage is in place, likely in April.
The Executive Committee and SAB have approved the Collection Committee’s recommendations for the treatment of duplicates in HathiTrust; the final report will be posted online shortly. The report discusses various categories of duplicates that exist in the repository and attempts to assess their scope and cost, while also noting some of the difficulties in the precise identification of duplicates. The report recommends that HathiTrust retain all duplicate copies ingested into the repository for the time being, with periodic reassessment. Some categories of duplicates are recommended for permanent retention (e.g. early published books). The SAB has requested that the Committee make further recommendations about the criteria that should be applied in future assessments and identify the future costs and risks of retaining duplicates in the corpus.
The group provided feedback on several interface- or usability-related projects: the addition of a volume version (date of last ingest) in the PageTurner interface, a potential change to the default view in PageTurner, and the best way to encourage the creation of high-quality public collections. Work on all three of these projects will continue in March.
The User Support Working Group decided to postpone submission of its report on recommendations to the Executive Committee until later in the year. This will give more time for changes that have or might be implemented as a result of the group’s recent evaluation process to be assessed and incorporated into more formal recommendations.
The table below contains a summary of the issues received by the User Support Working Group in February.
Non-partner Digital Deposit
|Access and Use||131||79|
Print on Demand
Full-PDF or e-copy requests
Data Availability and APIs
Reuse of content
Problems with login specifically
General Questions about login
Partners setting up login
*See User Support Working Group Issue Types  for a description of the types of issues included in each category.
California Digital Library has nearly completed testing of HathiTrust records received from the University of Michigan in Zephir, the new management system, and is making significant progress on reconciling records ingested in both systems. Staff from Michigan and CDL finalized the minimum record submission standard to be used prospectively for records submitted by partners to HathiTrust. The standard will be integrated into the HathiTrust ingest checklist. CDL and Michigan also worked to address issues related to integration planning and reconciliation of records in Michigan’s system and in Zephir. CDL performed a dry run load test on ingest of records into Zephir.
Staff at Michigan completed the first iteration of a tool that is able to create valid JATS XML from simple DOCX files, and continued development on PageTurner to render JATS XML. Staff clarified the goals of the project to include implementation of a publishing system (allowing management of an editorial workflow) in addition to mechanisms for ingest, display, and discoverability of born-digital journal materials in the HathiTrust repository. More information is available at http://www.hathitrust.org/htpub. 
The HathiTrust Research Center released a beta version 0.1 of the HTRC Data API. The API is a RESTful API through which the HTRC Solr index and volume store are accessed. It cannot be used to download volumes, but can be used to move data to a location where computation takes place. It can also be used to search the Solr index for a set of volume IDs and pass the volume IDs to a service for access and computation. Access to the API will require OpenID authentication and appropriate authorization. The Data API is installed on two sandbox machines, one at UIUC and another at IU, for internal testing. Both sandbox installations work against a small subset of non-Google scanned volumes.
The HTRC technical team prototyped Blacklight (http://projectblacklight.org ), an open source library catalog search and retrieval system, for deployment in the HTRC. Blacklight is designed to support data that is both full text and bibliographic, it is built on Solr, the same technology used to index HTRC data, and Blacklight supports faceted searches, a known need of researchers. The test implementation of Blacklight was deployed on a shared server at UIUC. In the next quarter, the HTRC expects to use the customization options to configure the look and feel of the interface and perhaps extend the functionality to show snippets of the text to help researchers refine their results. Any new functionality that is developed will be shared back with the larger Blacklight community. Blacklight is expected to be a significant component of the public face of the HTRC.
Members of the HTRC performed a study recently on quantifying OCR errors in the HathiTrust corpus. Scholars are interested in doing quality text analysis, but results can be confounded by OCR errors. Information on which books (or pages) in the collection have significant rates of OCR errors could help. The HTRC explored a couple of approaches to OCR error detection and have results for one approach that uses machine-generated and expert-evaluated rules. Starting with a large dictionary of correctly spelled words, HTRC members identified outlier words that were in the HathiTrust corpus but not in the dictionary. As a check on identified words, the rules by which outliers were detected were verified by a human expert. Using this approach, HTRC formulated 48,308 rules that identified outlier words and provided corrections. HTRC members applied the rules to 256,000 non-Google digitized volumes from HathiTrust, which took 4 hours using the National Center for Supercomputing Applications’ Ember supercomputer. The results showed that the probability of a word having an OCR error (detected by the rule set) was 0.20%. The average number of errors per page was 0.57. The average number of errors per volume was 156. The probability that a page had one or more errors on it was 11%. The probability that any volume had one or more errors was 84.9%. Overall, 217,754 of the 256,416 volumes had one or more OCR errors and 7,745,034 of the 69,297,000 pages had one or more errors.
Project staff completed page-level review of a third production sample, consisting of 1,000 volumes digitized by the Internet Archive. More than 85,000 pages were reviewed in all. Approximately 9,400 of these (about 10%) were coded by two reviewers for quality assurance purposes. The focus of project work shifted then to finalizing training materials and data collection systems and procedures for whole-volume error review (review for errors that apply to an entire volume, such as missing, out-of-order, or duplicate pages). Project staff reviewed approximately 300 test volumes in a new whole-volume review interface to surface issues in using the interface and applying the new error model, and to develop an initial training manual. Whole-volume review began in mid-February on the same volumes reviewed in the first page-level production sample (1,000 English language, public domain, Google-digitized volumes).
Although the primary focus of work shifted to whole-volume review, physical review of the volumes sampled in the first production run continued in February. 870 of the 1,000 volumes in the sample were obtained and reviewed by volunteer graduate students at the University of Michigan. Students also began physical review of 600 Michigan volumes included in the second page-level sample (1,000 English language, Google-digitized volumes published post-1923). More than 400 of these volumes were reviewed by month’s end.
Michigan staff made a number of adjustments to the PageTurner application. These included fixing a bug in the RDFa emitted in the PageTurner bibliographic metadata that had prevented license information from being included appropriately; enhancing the access control mechanism for items that are public domain in the United States to better detect whether a user is on U.S. soil when access is proxied; updating the back-end process by which user feedback is submitted from HathiTrust applications (including PageTurner, the HathiTrust bibliographic catalog, Full-text Search, Collection Builder) to the central HathiTrust ticketing system; and implementing a process to detect cases where multiple tickets are submitted on identical HathiTrust items or records.
Michigan staff began work on the next iteration of advanced full-text search, which will allow users to build queries with greater Boolean complexity and enhance the ability to revise advanced searches. Staff made progress as well on plans to improve search results relevance ranking. This work is planned to begin after the next release of advanced full-text search.
California Digital Library staff completed dictionary-building work for the spelling suggester feature. The code can now build a language-sensitive dictionary of unigrams and bigrams from any Lucene index, automatically choosing a frequency cut-off to constrain the size of the dictionary. Focus will now shift to implementing fast-lookup and suggestion ranking.
Page viewing of volumes classified as "Public Domain in the United States" was unavailable on Tue 2-7 from approximately 5:30-9:45pm EST due to a software problem.
HathiTrust sends notice upon discovery and resolution of unscheduled outages and in advance of scheduled outages and maintenance work that may result in an outage. We welcome and encourage additional recipients for these notices. If your institution is not receiving outage notifications and would like to, please contact firstname.lastname@example.org .
As of March 1:
|Library of Congress||0||89,411|
|North Carolina State University||0||3,196|
|University of North Carolina - Chapel Hill||0||8,087|
|New York Public Library||40||259,506|
|Penn State University||316||43,262|
|University of California||36,848||3,329,011|
|The University of Chicago||1,198||12,897|
|University of Illinois||0||14,503|
|University of Michigan||13,190||4,525,854|
|University of Minnesota||1,787||92,368|
|University of Wisconsin||4,995||533,573|
|University of Virginia||1,525||48,921|
*Volume count does not include archival and image materials in the Minnesota Digital Library project
Public Domain (~28%)
*Includes volumes opened through copyright review and rights holder permissions
Tom Burton-West, "HathiTrust Large Scale Search: Scalability meets Usability ". Code4Lib, Feburary 7, 2012.
Jeremy York, "HathiTrust: Issues and Challenges in Preserving the Published Record ". Amigos Online Conference, February 8, 2012.