Monthly updates provide information on HathiTrust's progress towards established objectives [2]. If you would like to receive email notifications for the monthly updates, please sign up for our Updates Google Group [3]. For Repository metadata downloads, click here [4].
[Download PDF] [5]
You can follow HathiTrust on Twitter [6] or Facebook [7], or subscribe to receive email updates [3] (via Google Groups).
HathiTrust released a new service that allows designated proxies at partner institutions to provide access to in-copyright works in HathiTrust to users at their institutions who are certified as having a print disability. See http://www.hathitrust.org/accessibility [8] for more information.
The HathiTrust website, including all web applications, was updated with a unified design and feature set, improving the overall look and functionality of the site. Details are available at http://www.hathitrust.org/hathitrust_new_look [9].
The Board of Governors held an in-person meeting in Chapel Hill, North Carolina on April 28. The Board had a very full agenda, including a review of the Constitutional Convention ballot initiatives [10] and work related to the Program Steering Committee (including the appointment of members).
The survey [11] released last month to institutions that are depositing locally-digitized content will close May 15. The purpose of the survey is to gauge the readiness of institutions to deposit locally-digitized content, and help set a timeline for the development of a next installment of tools to assist in validating and packaging content prior to deposit. HathiTrust continued to provide support for the existing ingest tools and ingest procedures.
The User Experience group continued to review elements of HathiTrust Web applications identified through user feedback and other means as being in need of improvement. The group also helped troubleshoot and make decisions in support of the new interface redesign.
A summary of issues received by the User Support Working Group is given in the table at the end of the update.
Staff from the California Digital Library (CDL) and the University of Michigan agreed on revised requirements and workflow changes to accommodate bibliographic rights determinations being made at the University of Michigan rather than in Zephir (the new system). CDL staff began development reflecting the revised requirements. Staff also updated the project timeline [12]. A parallel phase, to be completed prior to moving to Zephir, is now anticipated to run at the end of the summer.
A summary of the determinations from HathiTrust copyright review activities in April is given below. See CRMS-US [13] and CRMS-World [14] for more information.
| | April | | Overall | |
| | Public Domain Determinations | All Determinations | Public Domain Determinations | All Determinations |
| CRMS-US | 3,565 | 7,576 | 131,880 | 245,305 |
| CRMS-World | 2,691 | 5,179 | 24,154 | 44,746 |
| Total | 6,256 | 12,755 | 156,034 | 290,051 |
HathiTrust Research Center project members assembled in Urbana, Illinois to launch Phase II of HTRC’s efforts. Following the successful release of HTRC software and services on March 30, 2013 (marking the completion of Phase I), HTRC is focusing its limited resources in Phase II along two lines: 1) community engagement and community-driven enhancements to HTRC software and services; and 2) the HTRC Sloan Cloud for non-consumptive research. To the former, the HTRC is pleased to welcome Miao Chen as Assistant Director of Education and Outreach. Miao recently received her PhD in information sciences from Syracuse University. She will be helping to organize the outreach efforts. A significant upcoming event is the 2nd HTRC UnCamp, scheduled for early September. The event will be held at the University of Illinois in Urbana-Champaign, Illinois. Details will follow. The HTRC Sloan Cloud, a project in development that is funded by a grant from the Sloan Foundation, is the technical infrastructure for carrying out text analysis using general university compute resources securely in a way that does not violate terms of copyright.
University of Michigan staff determined functional requirements for relating journal-level MARC records to article-level MARC records, and journal-level “aboutware”.
HathiTrust institutions performed the following work related to applications and Web interfaces:
Staff investigated ways of improving the sorting of serial publications.
Staff implemented a mechanism for automatically deleting registered Data API keys that have not been activated.
Staff explored ways to improve relevance ranking in full-text search results and fixed two bugs in searching materials with Chinese, Japanese, or Korean characters: searches did not always work when quotes were used around characters, and there was a problem in recognizing the Boolean “AND” operator.
Staff implemented a new code-debugging scheme for testing access by rights attribute.
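As a rough illustration of the automatic removal of unactivated Data API keys mentioned above, the following Python sketch keeps only keys that have been activated or are still within a grace period. The update does not describe the actual mechanism, so the `ApiKey` record, the `purge_unactivated` function, and the seven-day grace period are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class ApiKey:
    # Hypothetical record layout; the real Data API key store is not described.
    key_id: str
    registered_at: datetime
    activated: bool


def purge_unactivated(keys, now, grace=timedelta(days=7)):
    """Keep keys that are activated or still inside the (assumed) grace period."""
    return [k for k in keys if k.activated or now - k.registered_at <= grace]


now = datetime(2013, 5, 1)
keys = [
    ApiKey("a", datetime(2013, 4, 1), activated=True),    # activated: kept
    ApiKey("b", datetime(2013, 4, 1), activated=False),   # stale and unactivated: dropped
    ApiKey("c", datetime(2013, 4, 29), activated=False),  # recent: kept for now
]
kept = purge_unactivated(keys, now)
print([k.key_id for k in kept])
```

Running a job like this on a schedule would keep the key table from accumulating registrations that were never confirmed.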
HathiTrust installed new storage at the Indiana and Michigan sites that will both accommodate 2013 volume projections and replace storage scheduled for retirement. Storage due for retirement will be taken offline starting in late May or June.
Pages did not display in HathiTrust on Thursday, April 4 from 3:30 - 4:15pm due to a software problem that was identified and corrected.
Daily full-text indexing was suspended from Wednesday, April 24 to Tuesday, April 30 due to a complication with a storage system software upgrade. The problem was resolved and normal indexing processes resumed.
As of May 1:
| | April | Overall |
| Boston College | 0 | 2,179 |
| Columbia University | 0 | 65,033 |
| Cornell University | 1,279 | 419,804 |
| Duke University | 0 | 4,523 |
| Harvard University | 27 | 236,068 |
| Indiana University | 15 | 195,227 |
| Library of Congress | 1 | 89,724 |
| North Carolina State University | 0 | 3,196 |
| Northwestern University | 18,167 | 33,561 |
| New York Public Library | 28,662 | 288,342 |
| Penn State University | 13,171 | 58,596 |
| Princeton University | 1 | 251,702 |
| Purdue University | 0 | 44,692 |
| Universidad Complutense | 0 | 111,982 |
| University of California | 577 | 3,388,025 |
| The University of Chicago | 1,377 | 30,285 |
| University of Florida | 0 | 2,068 |
| University of Illinois | 3 | 109,314 |
| University of Michigan | 1,793 | 4,636,751 |
| University of Minnesota | 1,590 | 106,275 |
| University of North Carolina, Chapel Hill | 0 | 16,588 |
| University of Wisconsin | 18 | 555,725 |
| University of Virginia | 0 | 50,815 |
| Utah State | 0 | 117 |
| Yale University | 0 | 23,678 |
| Total | 66,681 | 10,724,270 |
Public Domain (~31%)
| Total* | 51,815 | 3,373,522 |
* Includes volumes opened through copyright review and rights holder permissions
| Issue Type | April | March |
| Content | 379 | 382 |
|    Quality | 368 | 373 |
|    Non-partner Digital Deposit | 1 | 1 |
|    Collections | 9 | 8 |
| Cataloging | 140 | 87 |
| Access and Use | 145 | 149 |
|    Copyright | 71 | 77 |
|    Permissions | 13 | 16 |
|    Takedown | 0 | 0 |
|    Print on Demand | 0 | 1 |
|    Inter-library loan | 4 | 4 |
|    Full-PDF or e-copy requests | 21 | 32 |
|    Datasets | 4 | 5 |
|    Data Availability and APIs | 1 | 2 |
|    Reuse of content | 5 | 3 |
| Web applications | 31 | 11 |
|    Functionality problems | 7 | 4 |
|    Problems with login specifically | 1 | 1 |
|    General Questions about Login | 1 | 2 |
|    Partners setting up login | 0 | 2 |
|    Usability issues | 7 | 1 |
|    Feature requests | 2 | 0 |
| Partner Ingest | 3 | 13 |
| General | 76 | 87 |
|    Partnership | 12 | 9 |
|    Infrastructure | 2 | 0 |
|    Miscellaneous | 62 | 78 |
| Total | 774 | 729 |
[Download PDF] [27]
HathiTrust has requested nominations from partner institutions for the HathiTrust Program Steering Committee (PSC). The responsibilities of the Program Steering Committee are described in Article VII, Section 3 of the HathiTrust Bylaws [28]. Among the first areas of work to be undertaken by the PSC are the ballot initiatives [10] passed at the 2011 Constitutional Convention, including expanding access to US government documents and creating infrastructure for shared monograph storage initiatives. Any member of a partner institution may submit nominations for the PSC until April 22 via the form at http://goo.gl/TV0CN [29].
The HathiTrust Research Center reached a development benchmark in its release of production infrastructure to support data mining and textual analysis of volumes in HathiTrust.
The infrastructure includes an entrance portal, search and collection-building tools (using Blacklight [30]), and access to SEASR analysis algorithms that can be run against the HathiTrust public domain corpus (more than 3 million volumes). In addition to the production services, the HTRC offers a development “sandbox”. The sandbox runs against non-Google scanned content (about 260,000 volumes) and provides a test-bed for interested researchers to experiment with writing their own algorithms for use in the HTRC infrastructure.
The production release concludes the first six-month period in Phase 2 of development of the HTRC (Oct 2012-March 2014). Phase 2 will also include the development of the HTRC-Sloan-Cloud – infrastructure that will include additional mechanisms to allow secure, non-consumptive access to the entire HathiTrust corpus – and systems to accommodate the full 10.6 million HathiTrust volumes in the HTRC. For more information on HTRC services and testing of the production infrastructure, please join our HTRC-usergroup-l listserv at https://list.indiana.edu/sympa/subscribe/htrc-usergroup-l [31].
HathiTrust is pleased to announce the hiring of Valerie Glenn to the Government Documents Registry Analyst position [32]. Valerie has served as a Federal Depository Librarian at both the University of Alabama and the University of North Texas, and has managed a variety of projects and activities related to government documents. Valerie brings deep expertise to a two-year initiative to begin to construct a comprehensive registry of U.S. federal government documents. This work is part of a larger HathiTrust effort to expand access to US government documents [33]. More information about the project is available at http://www.hathitrust.org/usgovdocs_registry [34].
HathiTrust distributed a survey created by Syracuse University to gather information about institutions’ experiences with HathiTrust. The survey includes questions about print disabilities services, special collections, digital humanities, use of HathiTrust, and technical implementation issues. The survey is available at http://www.surveymonkey.com/s/9ZZ9KMW [35] until April 26. We encourage all partner institutions to participate. Results will be summarized and made available.
During a March meeting, the Board of Governors reviewed the HathiTrust budget and planned a longer agenda for an in-person meeting in April.
HathiTrust prepared a survey to send to institutions that have indicated they intend to deposit locally-digitized materials. The purpose of the survey is to gauge interest in, and aid in determining a development timeline for, enhanced tools to assist in validating and packaging materials prior to submission to HathiTrust. The survey will be sent out in mid-April. HathiTrust also provided support to several institutions making preparations to deposit locally-digitized content.
Working groups and committees in HathiTrust may have an operational or strategic focus. See http://www.hathitrust.org/working_groups [36] for more information.
The User Experience Advisory Group was pleased to welcome a new member: Matt Morgan, Director of the Website, NYPL Office of Strategic Planning. The group continued to review elements of HathiTrust Web applications identified through user feedback and other means as being in need of improvement.
A summary of issues received by the User Support Working Group is given in the table at the end of the update.
Staff from the California Digital Library (CDL) and the University of Michigan discussed implications of a new requirement that automated bibliographic rights determinations must occur at the University of Michigan rather than at the University of California. The teams expect to have revised requirements finalized in April. CDL staff are determining the impact that the change will have on the development timeline.
A summary of the determinations from HathiTrust copyright review activities in March is given below. See CRMS-US [13] and CRMS-World [14] for information.
| | March | | Overall | |
| | Public Domain Determinations | All Determinations | Public Domain Determinations | All Determinations |
| CRMS-US | 3,376 | 7,267 | 127,958 | 236,977 |
| CRMS-World | 3,082 | 5,590 | 21,289 | 39,212 |
| Total | 6,458 | 12,855 | 149,247 | 276,189 |
Staff at the University of Michigan discussed modifications that are planned to be made to the Collection Builder application in order to use it as a means to navigate from articles in a single journal to the journal’s “aboutware” (information about editorial boards, submission policies, etc.). Staff also discussed issues of discovering journal aboutware through the HathiTrust catalog, full-text search and Collection Builder interfaces, and user pathways for navigating between journal-level catalog records, article-level catalog records, and aboutware. More information about mPach is available at http://www.hathitrust.org/mpach [37].
HathiTrust institutions performed the following work related to applications and Web interfaces:
Staff corrected issues in the display of authors and titles, added an option to remove collection items to a batch Collection Builder tool, and discussed ways of supporting very large collections. Staff also worked on the development of new features to be implemented as part of the Website Redesign (see below).
Staff planned a new back-end strategy for recording content digitization sources and associated access parameters, which are expressed in HathiTrust interfaces.
Staff continued research to improve relevance ranking.
Staff re-engineered a tool for testing and debugging volume access controls.
Staff continued work to implement a redesign of HathiTrust Web interfaces, using a unified framework for application code. Release of the new design is expected in April. Other improvements to be made in conjunction with the redesign include:
Screenshots of some of the redesigned pages are given at the end of the update.
HathiTrust began to install new and replacement storage hardware at the Michigan repository instance as part of its regular purchase and replacement cycle. Installation of new storage and retirement of storage to be replaced will continue in April.
HathiTrust purchased and received new production web servers and new development web and index servers to replace servers scheduled to be retired. The new development servers will make use of virtualization to improve resource utilization and availability, and to reduce acquisition and operational costs. In concert with this upgrade, which is planned for the second quarter of 2013, the Linux distribution in use for the entire server infrastructure is being changed from Red Hat to Debian, to provide better and more manageable infrastructure for deploying Ruby-based applications.
No outages were reported in March.
As of April 1:
| | March | Overall |
| Boston College | 337 | 2,179 |
| Columbia University | 0 | 65,033 |
| Cornell University | 1,563 | 418,525 |
| Duke University | 0 | 4,523 |
| Harvard University | 14 | 236,041 |
| Indiana University | 10 | 195,212 |
| Library of Congress | 0 | 89,723 |
| North Carolina State University | 0 | 3,196 |
| Northwestern University | 2,446 | 15,394 |
| New York Public Library | 8 | 259,680 |
| Penn State University | 10 | 45,425 |
| Princeton University | 1 | 251,701 |
| Purdue University | 10 | 44,692 |
| Universidad Complutense | 0 | 111,982 |
| University of California | 1,152 | 3,387,448 |
| The University of Chicago | 368 | 28,908 |
| University of Florida | 0 | 2,068 |
| University of Illinois | 5 | 109,311 |
| University of Michigan | 5,134 | 4,634,958 |
| University of Minnesota | 435 | 104,685 |
| University of North Carolina, Chapel Hill | 0 | 16,588 |
| University of Wisconsin | 752 | 555,707 |
| University of Virginia | 10 | 50,815 |
| Utah State | 0 | 117 |
| Yale University | 0 | 23,678 |
| Total | 12,255 | 10,657,589 |
Public Domain (~31%)
| Total* | 12,076 | 3,321,707 |
* Includes volumes opened through copyright review and rights holder permissions
| Issue Type | March | February |
| Content | 382 | 430 |
|    Quality | 373 | 421 |
|    Non-partner Digital Deposit | 1 | 1 |
|    Collections | 8 | 6 |
| Cataloging | 87 | 82 |
| Access and Use | 149 | 96 |
|    Copyright | 77 | 51 |
|    Permissions | 16 | 5 |
|    Takedown | 0 | 1 |
|    Print on Demand | 1 | 0 |
|    Inter-library loan | 4 | 0 |
|    Full-PDF or e-copy requests | 32 | 16 |
|    Datasets | 5 | 7 |
|    Data Availability and APIs | 2 | 3 |
|    Reuse of content | 3 | 3 |
| Web applications | 11 | 15 |
|    Functionality problems | 4 | 3 |
|    Problems with login specifically | 1 | 0 |
|    General Questions about Login | 2 | 3 |
|    Partners setting up login | 2 | 2 |
|    Usability issues | 1 | 0 |
|    Feature requests | 0 | 4 |
| Partner Ingest | 13 | 1 |
| General | 87 | 74 |
|    Partnership | 9 | 12 |
|    Infrastructure | 0 | 0 |
|    Miscellaneous | 78 | 62 |
| Total | 729 | 698 |
| Title | #Visits* |
| The United States Strategic Bombing Survey: over-all report (European war) [38] | 5,006 |
| | 3,545 |
| | 2,619 |
| Perfume and flavor materials of natural origin, by Steffen Arctander [39] | 2,044 |
| | 1,680 |
| Bradshaw's handbook for tourists in Great Britain & Ireland, sec.1 1866. [41] | 1,112 |
| | 1,015 |
| | 995 |
| Noblesa catalana : cavallers y burgesos honrats de Rossello y Cerdanya, v.2., by Philippe Lazerme. [44] | 727 |
| Coffee processing technology, v.1., by Michael Sivetz and H. Elliott Foote. [45] | 550 |
* Approximate due to a system configuration change.

[Download PDF] [50]
HathiTrust institutions voted unanimously to accept bylaws [28] put forward by the Board of Governors. The Board will have its first meeting following the acceptance of the bylaws in April.
The HathiTrust Research Center (HTRC) is planning a release of its software infrastructure on March 31, 2013. The release is of two separate cyberinfrastructure stacks, the “HTRC sandbox” and “HTRC production stack”, each loaded with the latest and greatest service components and tools. The “HTRC sandbox” is an open test bed for community experimentation. It contains a subset of public domain volumes. The “HTRC production stack” hosts the full HathiTrust public domain corpus. Both the sandbox and production stack offer the suite of services and tools, with bug fixes and other improvements, that debuted during the September 2012 UnCamp. Looking ahead, a June 2013 release of the HTRC production stack will include the HTRC-Sloan-Cloud in early support of “non-consumptive research”. Non-consumptive research is research in which the content of digital works is read or operated on (“consumed”) in an automated way only; no portions of works are displayed or available to be read by researchers themselves during computation or in computational results.
We welcome anyone to join HTRC email lists for general announcements [51], technical discussion [31] about the HTRC, or announcements specific to the 2012 UnCamp [52] (see http://www.hathitrust.org/htrc [53] for details on each list).
HathiTrust provided support to several institutions on use of the HathiTrust ingest tools [54] and issues related to image conversion and metadata.
HathiTrust ingested new content from Columbia University, the Library of Congress, Penn State, the University of California, the University of Illinois, and the University of North Carolina-Chapel Hill. HathiTrust also loaded bibliographic records from Boston College in preparation for content deposit.
Working groups and committees in HathiTrust may have an operational or strategic focus. See http://www.hathitrust.org/working_groups [36] for more information.
The User Experience Advisory Group continued to review elements of HathiTrust Web applications identified through user feedback and other means as being in need of improvement.
A summary of issues received by the User Support Working Group is given in the table at the end of the update.
Staff from the California Digital Library (CDL) and the University of Michigan continued to plan for the transition to Zephir for HathiTrust bibliographic data management. CDL staff began loading records into the Zephir production database.
A summary of the determinations from HathiTrust copyright review activities in February is given below.
| | February | | Overall | |
| | Public Domain Determinations | All Determinations | Public Domain Determinations | All Determinations |
| CRMS-US | 3,438 | 7,169 | 125,185 | 231,165 |
| CRMS-World | 2,067 | 4,048 | 19,087 | 35,756 |
| Total | 5,505 | 11,217 | 144,272 | 266,921 |
Staff at the University of Michigan made architectural and procedural decisions related to the creation, storage, and revision of “aboutware” (information about editorial boards, submission policies, etc.) for journals deposited via mPach. Staff finalized journal article ingest procedures, and continued discussions about optimal handling of non-textual content that is embedded in articles or submitted as supplementary material. Staff also reviewed and updated the project timeline [37].
HathiTrust received a revised estimated shipping date for high-performance storage that will be used to improve full-text search services. The systems are expected to arrive in March for installation and testing.
Michigan staff made changes to optimize synchronization of the full-text search index from the HathiTrust repository instance in Michigan to its mirror site in Indiana. At the same time, staff performed a full update of the index to be sure that all volumes in the repository were indexed appropriately. Staff developed a process to automatically update the full-text index when print holdings information submitted by partner institutions is updated or added. Staff also continued work to improve relevance ranking of search results.
HathiTrust updated the way URL parameters are sent to Google Analytics in order to improve usage reporting. The change will especially affect full-text searches performed within individual volumes.
HathiTrust finalized and deployed processes for producing PDFs that are optimized for printing on Espresso Book Machines.
HathiTrust began processes to replace production web servers at the Indiana data center, and development index and web servers at the Michigan data center. The server upgrades are planned to begin in late March or April.
Staff developing HathiTrust Web applications continued to implement a unified framework for application code. This work is being done as part of a larger effort to redesign the look and feel of HathiTrust’s Web interfaces. The redesign is scheduled for release in April.
HathiTrust was unavailable for some users from Tuesday, February 26 at 11:45pm to Wednesday, February 27 at 5:45am due to a network problem affecting the entire University of Michigan campus.
As of March 1:
| | February | Overall |
| Boston College | 0 | 1,842 |
| Columbia University | 643 | 65,033 |
| Cornell University | 305 | 416,962 |
| Duke University | 0 | 4,523 |
| Harvard University | 39 | 236,027 |
| Indiana University | 46 | 195,202 |
| Library of Congress | 1 | 89,723 |
| North Carolina State University | 0 | 3,196 |
| Northwestern University | 0 | 12,948 |
| New York Public Library | 94 | 259,672 |
| Penn State University | 653 | 45,415 |
| Princeton University | 46 | 251,700 |
| Purdue University | 45 | 44,682 |
| Universidad Complutense | 54 | 111,982 |
| University of California | 1,099 | 3,386,296 |
| The University of Chicago | 7 | 28,540 |
| University of Florida | 0 | 2,068 |
| University of Illinois | 4,201 | 109,306 |
| University of Michigan | 11,126 | 4,629,824 |
| University of Minnesota | 5 | 104,250 |
| University of North Carolina, Chapel Hill | 755 | 16,588 |
| University of Wisconsin | 3,924 | 554,955 |
| University of Virginia | 6 | 50,805 |
| Utah State | 0 | 117 |
| Yale University | 0 | 23,678 |
| Total | 23,049 | 10,645,334 |
Public Domain (~31%)
| Total* | 11,723 | 3,308,664 |
* Includes volumes opened through copyright review and rights holder permissions
| Issue Type | February | January |
| Content | 430 | 428 |
|    Quality | 421 | 414 |
|    Non-partner Digital Deposit | 1 | 0 |
|    Collections | 6 | 10 |
| Cataloging | 82 | 99 |
| Access and Use | 96 | 148 |
|    Copyright | 51 | 85 |
|    Permissions | 5 | 17 |
|    Takedown | 1 | 0 |
|    Print on Demand | 0 | 1 |
|    Inter-library loan | 0 | 0 |
|    Full-PDF or e-copy requests | 16 | 23 |
|    Datasets | 7 | 10 |
|    Data Availability and APIs | 3 | 0 |
|    Reuse of content | 3 | 4 |
| Web applications | 15 | 27 |
|    Functionality problems | 3 | 2 |
|    Problems with login specifically | 0 | 0 |
|    General Questions about Login | 3 | 4 |
|    Partners setting up login | 2 | 4 |
|    Usability issues | 0 | 1 |
|    Feature requests | 4 | 3 |
| Partner Ingest | 1 | 4 |
| General | 74 | 55 |
|    Partnership | 12 | 20 |
|    Infrastructure | 0 | 0 |
|    Miscellaneous | 62 | 35 |
| Total | 698 | 761 |
[Download PDF] [64]
HathiTrust institutions approved bylaws put forward by the Board of Governors for voting in January. The bylaws are available at http://www.hathitrust.org/documents/hathitrust-bylaws-201302.pdf [28].
HathiTrust hosts a page of resources [65] including handouts and informational sheets created by the Communications Working Group, and links to information about HathiTrust that institutions have posted for their own constituencies. If you have posted resources about HathiTrust (including videos, library guides, etc.) that are not listed, please let us know (feedback@issues.hathitrust.org [66]), and feel free to use and share those that are available.
HathiTrust hosted a conference call with members of several partner institutions that have been working with HathiTrust’s ingest tools [54], to discuss development options for the next iteration of the tools. A summary of the meeting, including next steps, is posted in a Google Group forum on HathiTrust Ingest [67]. Individuals interested in issues surrounding ingest of locally-digitized content into HathiTrust are welcome to join. HathiTrust continued to respond to inquiries about the ingest tools and ingest of locally-digitized materials.
HathiTrust began ingest of new batches of volumes from the University of Florida, the University of Illinois, and the University of North Carolina at Chapel Hill. HathiTrust also loaded bibliographic records for new volumes from Columbia University and Penn State.
Working groups and committees in HathiTrust may have an operational or strategic focus. See http://www.hathitrust.org/working_groups [36] for more information.
The User Experience Advisory Group began to review elements of HathiTrust Web applications that were previously identified as in need of improvement and to review feature requests that have been submitted to the User Support Working Group.
A summary of issues received by the User Support Working Group is given in the table at the end of the update.
California Digital Library (CDL) continued to work with staff at the University of Michigan to test data exports from Zephir and to plan for the upcoming transition to Zephir for HathiTrust bibliographic metadata management. CDL staff made modifications to support rights determinations on bibliographic records that conform to the Resource Description and Access (RDA) standard. CDL is in the process of implementing a new backup strategy for Zephir to ensure the service will be accessible in the event of an outage. An updated timeline for the project is posted at http://www.hathitrust.org/htmms [12].
A summary of the determinations from HathiTrust copyright review activities in January is given below.
| | January | | Overall | |
| | Public Domain Determinations | All Determinations | Public Domain Determinations | All Determinations |
| CRMS-US | 2,433 | 5,028 | 118,442 | 216,831 |
| CRMS-World | 2,198 | 3,689 | 14,202 | 24,710 |
| Total | 4,631 | 8,717 | 132,644 | 241,541 |
Staff at the University of Michigan met to discuss possible changes to the HathiTrust PageTurner, item-level search, and item-level search results to accommodate materials submitted via mPach. Staff began working with sample analytic MARC records and METS objects to determine the scope of anticipated changes. Staff also reviewed the mPach ingest process and discussed workflows for processing issue- and journal-level metadata.
Staff at Michigan continued to work on relevance ranking for full-text search results. Experiments performed in January found that some very short documents were being ranked too highly by Solr’s default ranking algorithm. Preliminary tests indicated that the issue may be resolved by using Solr’s new BM25, DFR and Information Based ranking settings. Experiments will continue in February.
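To illustrate why very short documents can outrank long ones under a classic TF-IDF weighting, and how BM25 tempers that effect, here is a small Python sketch. It is a simplified model, not HathiTrust's Solr configuration: the classic weight is reduced to sqrt(tf) / sqrt(document length), IDF is left out because it is identical for both documents, and the document lengths, term counts, and the `k1` and `b` values are illustrative assumptions.

```python
import math


def classic_tf_norm(term_freq, doc_len):
    """Simplified classic TF-IDF term weight: sqrt(tf) scaled by 1/sqrt(doc length)."""
    return math.sqrt(term_freq) / math.sqrt(doc_len)


def bm25_tf_norm(term_freq, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """BM25 term weight with IDF omitted (it is the same for both documents)."""
    length_factor = 1.0 - b + b * (doc_len / avg_doc_len)
    return term_freq * (k1 + 1.0) / (term_freq + k1 * length_factor)


# Hypothetical corpus statistics: books average 100,000 tokens.
AVG_LEN = 100_000
short_doc = (1, 50)          # one match in a 50-token fragment
long_doc = (200, 100_000)    # 200 matches in a full-length book

classic_short = classic_tf_norm(*short_doc)
classic_long = classic_tf_norm(*long_doc)
bm25_short = bm25_tf_norm(*short_doc, AVG_LEN)
bm25_long = bm25_tf_norm(*long_doc, AVG_LEN)

print(f"classic: short={classic_short:.4f} long={classic_long:.4f}")
print(f"bm25:    short={bm25_short:.4f} long={bm25_long:.4f}")
```

Under these assumptions the simplified classic weight favors the 50-token fragment, while BM25 ranks the full-length book higher because its term-frequency component saturates and its length penalty is bounded.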
Michigan staff received official relevance judgments for full-text search data submitted to the 2012 INEX Prove-IT Book Track [68] (see also the paper "Practical Relevance Ranking for 10 Million Books [69]"). After evaluating the results, staff submitted new data using updated relevance judgments.
Staff at CDL made significant progress on a spelling suggestion feature for full-text search. Staff improved the relevance of results through the creation of a new dictionary of suggestions that uses a bigram index for an entire shard’s worth of documents (about 800,000 volumes). The HathiTrust full-text index is broken into “shards” in Solr’s search architecture. Staff also tuned the algorithm that provides suggestions and made significant changes to suggestion scoring parameters, significantly increasing the quality of results. CDL will now look toward deployment of the new feature.
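The general idea of a bigram-backed suggestion dictionary can be sketched in Python as follows. The update does not say whether the bigrams are character-level or word-level, nor how suggestions are scored, so this sketch assumes character bigrams, Jaccard overlap as the similarity measure, and term frequency as a tie-breaker. The tiny `TERM_COUNTS` dictionary and all function names are hypothetical.

```python
from collections import defaultdict


def char_bigrams(word):
    """Character bigrams of a word, padded so prefixes and suffixes count."""
    padded = f"^{word}$"
    return {padded[i:i + 2] for i in range(len(padded) - 1)}


def build_index(term_counts):
    """Map each bigram to the set of dictionary terms that contain it."""
    index = defaultdict(set)
    for term in term_counts:
        for gram in char_bigrams(term):
            index[gram].add(term)
    return index


def suggest(query, index, term_counts, limit=3):
    """Rank terms sharing bigrams with the query by overlap, then by frequency."""
    query_grams = char_bigrams(query)
    candidates = set()
    for gram in query_grams:
        candidates |= index.get(gram, set())
    scored = []
    for term in candidates:
        if term == query:
            continue
        grams = char_bigrams(term)
        overlap = len(query_grams & grams) / len(query_grams | grams)
        scored.append((overlap, term_counts[term], term))
    scored.sort(reverse=True)
    return [term for _, _, term in scored[:limit]]


# Hypothetical term frequencies standing in for one shard's vocabulary.
TERM_COUNTS = {"library": 900, "liberty": 400, "librarian": 250, "binary": 120}
INDEX = build_index(TERM_COUNTS)
print(suggest("libary", INDEX, TERM_COUNTS))
```

Looking up candidates through the bigram index avoids comparing the query against every term in the vocabulary, which matters when the dictionary is built from hundreds of thousands of volumes.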
Michigan staff made adjustments to the full-text search indexing process to better support experimental indexing runs and to synchronize data between the full-text search index and information in the HathiTrust print holdings [70] database.
HathiTrust completed the purchase of high-performance storage for full-text search. The new systems are expected to be received in late February for installation and testing.
HathiTrust began to implement application-level changes to support access to materials by designated representatives at partner institutions on behalf of users at those institutions who have print disabilities. Designated representatives will need to register and to access HathiTrust using their Shibboleth login from a fixed IP address. Further details on the service will be forthcoming.
HathiTrust completed stylistic changes to messages in mobile PageTurner that appear when special access to materials is granted (e.g., access to volumes that fall under Section 108 conditions or to users who have print disabilities).
HathiTrust completed projections for new and replacement storage needed for HathiTrust in 2013.
Staff at Michigan drafted implementation guidelines for a unified Web application framework for HathiTrust. The framework will simplify the execution of a redesign of the HathiTrust website, which is expected to be completed in April.
HathiTrust was unavailable from 5:00-5:40pm EST on Monday, January 21 due to an error in a software release.
HathiTrust sends notice upon discovery and resolution of unscheduled outages and in advance of scheduled outages and maintenance work that may result in an outage. We welcome and encourage additional recipients for these notices. If your institution is not receiving outage notifications and would like to, please contact feedback@issues.hathitrust.org. [71]
As of February 1:
| | January | Overall |
| Boston College | 0 | 1,842 |
| Columbia University | 0 | 64,390 |
| Cornell University | 1,222 | 416,657 |
| Duke University | 0 | 4,523 |
| Harvard University | 3 | 235,988 |
| Indiana University | 83 | 195,156 |
| Library of Congress | 0 | 89,722 |
| North Carolina State University | 0 | 3,196 |
| Northwestern University | 226 | 12,948 |
| New York Public Library | 4 | 259,578 |
| Penn State University | 30 | 44,762 |
| Princeton University | 3 | 251,654 |
| Purdue University | 8 | 44,637 |
| Universidad Complutense | 27 | 111,928 |
| University of California | 1,942 | 3,385,197 |
| The University of Chicago | 1,813 | 28,533 |
| University of Florida | 60 | 2,068 |
| University of Illinois | 218 | 105,105 |
| University of Michigan | 8,862 | 4,618,698 |
| University of Minnesota | 33 | 104,245 |
| University of North Carolina, Chapel Hill | 7,745 | 15,833 |
| University of Wisconsin | 651 | 551,031 |
| University of Virginia | 0 | 50,799 |
| Utah State | 0 | 117 |
| Yale University | 0 | 23,678 |
| Total | 22,930 | 10,622,285 |
Public Domain (~31%)
| Total* | 18,311 | 3,296,941 |
* Includes volumes opened through copyright review and rights holder permissions
| Issue Type | January | December |
| Content | 428 | 274 |
| - Quality | 414 | 268 |
| - Non-partner Digital Deposit | 0 | 3 |
| - Collections | 10 | 6 |
| Cataloging | 99 | 52 |
| Access and Use | 148 | 95 |
| - Copyright | 85 | 59 |
| - Permissions | 17 | 9 |
| - Takedown | 0 | 0 |
| - Print on Demand | 1 | 0 |
| - Inter-library loan | 0 | 0 |
| - Full-PDF or e-copy requests | 23 | 11 |
| - Datasets | 10 | 5 |
| - Data Availability and APIs | 0 | 0 |
| - Reuse of content | 4 | 2 |
| Web applications | 27 | 16 |
| - Functionality problems | 2 | 5 |
| - Problems with login specifically | 0 | 2 |
| - General Questions about Login | 4 | 1 |
| - Partners setting up login | 4 | 3 |
| - Usability issues | 1 | 1 |
| - Feature requests | 3 | 0 |
| Partner Ingest | 4 | 1 |
| General | 55 | 48 |
| - Partnership | 20 | 10 |
| - Infrastructure | 0 | 0 |
| - Miscellaneous | 35 | 38 |
| Total | 761 | 486 |
[Download PDF] [78]
2012 brought to a close the initial 5-year charter period that HathiTrust was granted by its founding institutions. 5 years later, the collaborative is stronger than ever. More than 70 academic and research institutions from around the world participate in HathiTrust, supporting a digital repository of 10.6 million volumes and a host of shared activities, all geared toward the provision of greater access to the scholarly and cultural record, more secure preservation, and greater research opportunities for our constituencies than we have ever had before. As we launch into a new year, and a new stage of HathiTrust, it is worthwhile to reflect on our progress and achievements in 2012. These include:
A recap of activities in these areas and more can be read below.
About HathiTrust
HathiTrust is an international partnership of academic and research institutions dedicated to ensuring the preservation and accessibility of the vast record of human knowledge. The partnership owns and operates a digital repository containing millions of public domain and in copyright volumes, digitized from partnering institution libraries and other sources. The preserved volumes are made available in accordance with copyright law as a shared scholarly resource for students, faculty, and researchers at the partnering institutions, and as a public good to the world community. For more information, visit HathiTrust.org [79].
Details on each item can be found in the monthly updates from 2012, available at http://www.hathitrust.org/updates [80].
In a decisive victory for libraries and Fair Use, a lawsuit brought against HathiTrust and several participating libraries by the Authors Guild et al. was dismissed. Information [81] about the lawsuit, including responses and analysis from around the Web, can be found on the HathiTrust website.
HathiTrust grew from 66 to 78 partner institutions in 2012. New institutions include:
HathiTrust partners contributed 632,613 volumes to the repository in 2012. Of these, 566,044 are in the public domain. The University of Florida and Boston College were new contributors in 2012. Many others contributed additional content, as shown in the table near the end of the update.
Over the course of 2012, HathiTrust interacted with nearly a dozen institutions regarding ingest of locally-digitized content. We released a first iteration of ingest tools to aid institutions in validating and packaging locally-digitized content prior to submission to HathiTrust. We revised documentation [54] surrounding the tools based on feedback from institutions, and we also began to explore with institutions what the next iteration of the tools would look like. If you are using the tools now, think you might in the future, or are interested in more information, we encourage you to join our HathiTrust Ingest Google Group [67] to participate in discussions.
HathiTrust took bold steps in establishing a new governance model, seating a new Board of Governors [82], establishing an Executive Committee and Executive Committee officers, and drafting a set of bylaws. The bylaws will be put forward to the partnership for voting in early 2013.
The Collections Committee completed a report [83] on handling of duplicate volumes in HathiTrust, recommending that HathiTrust retain all duplicate copies for the time being, with periodic assessment.
The Communications Working Group released announcements related to HathiTrust’s achievement of 10 million volumes [84], the new Board of Governors [85], and the Authors Guild lawsuit [86]. The group also produced a new Resources [65] page for HathiTrust, launched a Pinterest [87] account, coordinated a survey of partners to receive input on the next iteration of partner training sessions, and, in collaboration with the UX Advisory Group, created a blog post on collections in HathiTrust [88].
The User Experience Advisory Group consulted on improvements to the HathiTrust PageTurner, including the addition of a version date for volumes, updated messages regarding download of PDFs, and a new landing page for volumes that are restricted from reading due to copyright, but are nevertheless full-text searchable. The UX Advisory Group also provided feedback on a new site-wide redesign currently in development.
The User Support Working Group submitted recommendations to the Board of Governors on User Support going forward from 2012. A summary of the User Support issues received in 2012 is given at the end of the review.
HathiTrust completed the first phase of improvements to enhance the accessibility of HathiTrust Web applications. With a few minor exceptions that will be addressed in the second phase, HathiTrust interfaces are now compliant with Web Content Accessibility Guidelines (WCAG) 2.0 [89], Level A.
HathiTrust accepted and released a new policy on bibliographic corrections: http://www.hathitrust.org/bib_metadata_correction [90].
HathiTrust initiated a project [34] to build a comprehensive registry of U.S. federal government documents.
HathiTrust began offering lawful access to digital copies of works that are out of print, when print copies owned by partner institutions are brittle or missing. More information is available at http://www.hathitrust.org/out-of-print-brittle [91].
HathiTrust made progress toward the migration of bibliographic data management from the University of Michigan to the California Digital Library’s Zephir system. Major activities in 2012 involved improving record loading processes in Zephir, syncing information between Zephir and other HathiTrust systems, exporting data from Zephir for use in the HathiTrust catalog and “hathifiles [4]”, development of new bibliographic metadata standards [92], development and testing of bibliographic record submission processes with current HathiTrust depositors, and progress toward a Zephir service-level agreement. Migration to Zephir is expected to occur in 2013.
In the early part of the year, the HTRC completed the agreements necessary to receive public domain data from the HathiTrust Repository. It also began to install systems for discovering, retrieving, correcting, and performing computation on OCR text of digital volumes. HTRC systems had their first public demonstration at an enthusiastic and widely successful HTRC “UnCamp” in September, attended by 130 researchers, developers, and librarians from HathiTrust member and non-member institutions. Resources from the UnCamp, including presentations, session materials, Twitter analysis, and pictures, are available on the HTRC wiki [93]. A video produced from the event is available at http://www.hathitrust.org/htrc [53].
The grant project team concluded all data gathering activities, including digital review of four 1,000-volume samples of volumes from HathiTrust and physical review of nearly all volumes in one sample and more than half of the volumes in a second sample (to investigate correlation between physical condition and digitization quality). Two of the samples underwent review more than once, as a new methodology was introduced to discover “whole-volume” errors such as missing and duplicate pages. In the coming months, as part of a no-cost extension, members of the team will conduct user studies to evaluate the results of the quality review performed on the sampled volumes. Initial findings from studies undertaken in the grant can be found at the links below. More results will be posted on the project website [94] as analysis concludes and as articles containing the results are published throughout the coming months.
mPach is a system under development by the University of Michigan Library to publish open access born-digital journal content, along with accompanying data and media files, directly into HathiTrust for perpetual access and preservation. Work in 2012 focused on refining the project’s design principles and requirements [99] and system architecture [100], establishing a timeline [37] for the project, and designing and developing mPach modules [101] and associated workflows to a) create archival XML in JATS format from DOCX files and b) deliver the resulting XML and supplementary files through HathiTrust applications.
Development in 2012 included the following:
All HathiTrust papers and presentations can be accessed at http://www.hathitrust.org/papers [110].
Copyright determinations conducted as part of CRMS-US [13] and CRMS-World [14].
| | 2012 | | Overall | |
| | Public Domain Determinations | All Determinations | Public Domain Determinations | All Determinations |
| CRMS-US | 41,268 | 79,817 | 119,822 | 219,874 |
| CRMS-World | 13,445 | 23,519 | 14,202 | 28,795 |
| Total | 54,713 | 103,336 | 135,777 | 248,669 |
As of January 1, 2013:
| Institution | 2012 | Overall |
| Boston College | 1,842 | 1,842 |
| Columbia University | 214 | 64,390 |
| Cornell University | 31,745 | 415,435 |
| Duke University | 1 | 4,523 |
| Harvard University | 182,545 | 235,985 |
| Indiana University | 8,161 | 195,073 |
| Library of Congress | 311 | 89,722 |
| North Carolina State University | 0 | 3,196 |
| Northwestern University | 7,073 | 12,722 |
| New York Public Library | 121 | 259,574 |
| Penn State University | 1,815 | 44,732 |
| Princeton University | 1,972 | 251,651 |
| Purdue University | 43,741 | 44,629 |
| Universidad Complutense | 3,233 | 111,901 |
| University of California | 95,601 | 3,383,255 |
| The University of Chicago | 16,112 | 26,720 |
| University of Florida | 2,008 | 2,008 |
| University of Illinois | 90,384 | 104,887 |
| University of Michigan | 105,235 | 4,609,836 |
| University of Minnesota | 13,973 | 104,212 |
| University of North Carolina, Chapel Hill | 1 | 8,088 |
| University of Wisconsin | 23,046 | 550,380 |
| University of Virginia | 3,403 | 50,799 |
| Utah State | 71 | 117 |
| Yale University | 4 | 23,678 |
| Total | 632,613 | 10,599,355 |
Public Domain (~31%)
| Total* | 566,044 | 3,278,630 |
| Issue Type | 2012 (Jan-Dec) | 2011 (Mar-Dec) |
| Content | 1,038 | 962 |
| - Quality | 971 | 905 |
| - Non-partner Digital Deposit | 10 | 6 |
| - Collections | 57 | 45 |
| Cataloging | 807 | 238 |
| Access and Use | 969 | 898 |
| - Copyright | 811 | 500 |
| - Permissions | 158 | 151 |
| - Takedown | 11 | 11 |
| - Print on Demand | 8 | 12 |
| - Inter-library loan | 24 | 12 |
| - Full-PDF or e-copy requests | 198 | 175 |
| - Datasets | 38 | 25 |
| - Data Availability and APIs | 9 | 14 |
| - Reuse of content | 25 | 27 |
| Web applications | 220 | 229 |
| - Functionality problems | 61 | 66 |
| - Problems with login specifically | 9 | 19 |
| - General Questions about Login | 21 | 21 |
| - Partners setting up login | 21 | 23 |
| - Usability issues | 20 | 30 |
| - Feature requests | 24 | 37 |
| Partner Ingest | 40 | 25 |
| General | 832 | 316 |
| - Partnership | 126 | 83 |
| - Infrastructure | 4 | 4 |
| - Miscellaneous | 702 | 229 |
| Total | 3,830 | 2,668 |
[Download PDF] [118]
One of the lawful uses of in-copyright works HathiTrust has been pursuing is to provide access on an institutional basis to works that fall under United States Copyright Law Section 108 conditions: works in HathiTrust that are not available on the market at a fair price, and for which print copies owned by HathiTrust member institutions are damaged, deteriorating, lost or stolen. As a part of becoming a member, institutions are required to submit information about their print holdings for fee calculation purposes. We have also been requesting information about the holdings status and condition of works, to facilitate uses of works where permissible by law (specifications for HathiTrust holdings data are available at http://www.hathitrust.org/print_holdings [70]).
As of December 2012, we are using the holdings status and condition information submitted by United States member institutions, in combination with information about the market availability of works stored in the HathiTrust rights database, to determine whether or not access to applicable in-copyright works in HathiTrust is allowed. The specific terms of access are as follows:
A general scenario for how out of print determinations are made and communicated to HathiTrust is available in the HathiTrust rights database documentation: http://www.hathitrust.org/rights_database#op [119]. Additional information on the service is available at http://www.hathitrust.org/out-of-print-brittle [91].
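The way these inputs combine can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Holding` type, the condition labels, and the function name are hypothetical, and the sketch does not reproduce the specific terms of access or the actual holdings data specification.

```python
from __future__ import annotations

from dataclasses import dataclass

# Hypothetical labels for the conditions named in the text; the real
# holdings specification may encode status and condition differently.
QUALIFYING_STATES = {"damaged", "deteriorating", "lost", "stolen"}


@dataclass
class Holding:
    institution: str
    is_us_member: bool
    state: str  # illustrative values such as "ok", "damaged", "lost"


def institutions_with_access(out_of_print: bool, holdings: list[Holding]) -> set[str]:
    """Return institutions whose own print copy qualifies for access.

    Assumes access is decided per institution: the work must be flagged
    as out of print in the rights data, and the institution must be a
    United States member reporting a qualifying status or condition.
    """
    if not out_of_print:
        return set()
    return {
        h.institution
        for h in holdings
        if h.is_us_member and h.state in QUALIFYING_STATES
    }
```

Keeping the market-availability flag separate from the per-institution holdings mirrors the two data sources the text describes: the rights database and partner-submitted holdings data.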
The Board of Governors completed a draft of HathiTrust bylaws, which was distributed to partner institutions in early December for comment. The Board is working on a final version with consideration for partner comments. The final version will be put forward to partners for voting in January.
The Research Center released an informational video, following on the UnCamp that was held earlier in the fall of 2012. The video can be accessed at http://www.hathitrust.org/htrc [53].
This month we are including a new metric in our newsletter: the most accessed works in HathiTrust by pageview count. A table of volumes is included at the end of the update.
Staff at the University of Michigan met to discuss the next steps for HathiTrust’s ingest tools [54], created to aid institutions in validating and packaging locally-digitized content prior to deposit in HathiTrust. A conference call is planned in January, which will include members of several partner institutions that have been working with the existing tools, to discuss possibilities and options for the future. HathiTrust continued discussions about deposit of locally-digitized materials with the University of Illinois, and responded to questions from McGill University.
HathiTrust ingested new content from Penn State University and loaded records for content from the University of Florida and University of North Carolina-Chapel Hill. Ingest of volumes from Florida and UNC, and additional volumes from Penn State, is expected to occur in January.
Working groups and committees in HathiTrust may have an operational or strategic focus. See http://www.hathitrust.org/working_groups [36] for more information.
A summary of issues received by the User Support Working Group is given in the table at the end of the update.
California Digital Library (CDL) continued to work with staff at the University of Michigan on preliminary testing of data exports from Zephir, the new HathiTrust bibliographic management system under development by CDL. CDL and Michigan staff continued to plan for the upcoming period when Zephir and the bibliographic management system at Michigan will be run in parallel, prior to the full transition to Zephir.
A summary of the determinations from HathiTrust copyright review activities in December is given below. The numbers this month reflect a different methodology for aggregating statistics. In previous months, we reported the number of Reviews and the number of reviewed volumes that were Opened. In the majority of cases, volumes are reviewed more than once (by more than one person), so the number of Reviews reported was larger than the number of actual volumes reviewed. Similarly, the number of volumes Opened represented volumes that may have been determined in more than one review to be in the public domain. The table below provides a more accurate representation of the number of volumes where a determination was made, and what the determination was. We will use this representation going forward.
| | December | | Overall | |
| | Public Domain Determinations | All Determinations | Public Domain Determinations | All Determinations |
| CRMS-US | 2,433 | 5,028 | 118,442 | 216,831 |
| CRMS-World | 2,198 | 3,689 | 14,202 | 24,710 |
| Total | 4,631 | 8,717 | 132,644 | 241,541 |
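The difference between counting reviews and counting determinations can be seen in a small Python sketch. The input format, the verdict labels, and the rule that a volume counts as public domain only when all of its reviews agree are assumptions made for illustration. They are not a description of how CRMS resolves reviews.

```python
from collections import defaultdict


def summarize(reviews):
    """Summarize (volume_id, verdict) pairs, one pair per individual review."""
    verdicts_by_volume = defaultdict(list)
    for volume_id, verdict in reviews:
        verdicts_by_volume[volume_id].append(verdict)

    # Old-style figure: every individual review is counted.
    review_count = sum(len(v) for v in verdicts_by_volume.values())
    # New-style figure: each volume is counted once, however many reviews it had.
    determination_count = len(verdicts_by_volume)
    # Hypothetical rule: public domain only if every review of the volume says "pd".
    public_domain_count = sum(
        1 for v in verdicts_by_volume.values() if all(x == "pd" for x in v)
    )
    return {
        "reviews": review_count,
        "determinations": determination_count,
        "public_domain": public_domain_count,
    }
```

With two reviews per volume, the review count is twice the determination count, which is the kind of inflation the new reporting method avoids.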
The project team will present a research poster at ALA Midwinter in Seattle, during the Preservation Administrators Interest Group Meeting on Saturday, January 26. The poster will focus on digitization error related to material characteristics of a book. The project team continues to focus on more complex analyses of the data collected in the past year and also on presentation of the findings. Additional findings and results will be posted on the project website later this month: http://hathitrust-quality.projects.si.umich.edu [120].
Staff at the University of Michigan revised the list of modules [101] for mPach, to reflect recent changes in the planned system architecture. An extensive conceptual workflow for ingest of an mPach Submission Information Package into HathiTrust has been devised and will be finalized soon. Michigan staff finalized plans for modifications to the HathiTrust Data API to support the retrieval via the API of JATS XML, derivative formats, and supplemental materials that may be associated with a JATS XML article.
Staff at the University of Michigan released a bug fix for the Solr edismax query parser and a new index into production in late December (see the Update on November Activities [121] for details). These changes will significantly improve the precision of CJK (Chinese, Japanese, and Korean) search results.
Michigan staff began preliminary analysis of HathiTrust document length statistics. The results of the analysis will aid in designing tests of length normalization features for the new relevance ranking algorithms available in Solr 4.0 [122] (DFR, BM25, IB). Staff built a test index using these algorithms, and experiments using the test index will begin in January.
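To show why document length statistics matter for these algorithms, the sketch below computes the textbook BM25 term-frequency component, in which the parameter `b` controls how strongly a document's length is normalized against the average. It illustrates the general formula rather than Solr's exact implementation; the function name is hypothetical, and `k1=1.2` and `b=0.75` are commonly cited defaults, not values chosen by HathiTrust.

```python
def bm25_tf_component(tf, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """Textbook BM25 term-frequency factor (the IDF factor is omitted).

    tf          -- occurrences of the query term in the document
    doc_len     -- length of the document in tokens
    avg_doc_len -- average document length in the index
    b           -- 0 disables length normalization, 1 applies it fully
    """
    length_norm = 1.0 - b + b * (doc_len / avg_doc_len)
    return tf * (k1 + 1.0) / (tf + k1 * length_norm)
```

For the same term frequency, a document much longer than average scores lower than a shorter one unless `b` is set to 0. This matters for a collection of whole books, where lengths vary widely.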
Staff at Michigan made a final selection of high-performance storage for full-text search and completed pricing negotiations (see the Update on November Activities [123] for background). Purchase of the storage is expected to be complete in January, with installation and testing to follow soon after in late January or early February.
Michigan staff finished moving sensitive information out of source-controlled HathiTrust application code and into designated system-level locations. Staff also completed the separation of privileges for accessing application databases. Different classes of applications now connect as different database users with different privileges.
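One common way to arrange this kind of separation is sketched below in Python. It is only an illustration: the INI file layout, the section names, and the function name are hypothetical, and the update does not describe how HathiTrust actually stores or loads its credentials.

```python
import configparser


def load_db_credentials(config_path, app_class):
    """Read database credentials for one class of application.

    config_path -- a file kept outside the source repository (hypothetical layout:
                   one INI section per application class, each with user/password)
    app_class   -- section name such as "reader" or "ingest" (illustrative)
    """
    parser = configparser.ConfigParser()
    with open(config_path, encoding="utf-8") as handle:
        parser.read_file(handle)
    if not parser.has_section(app_class):
        raise KeyError(f"no credentials configured for application class {app_class!r}")
    section = parser[app_class]
    return {"user": section["user"], "password": section["password"]}
```

Because each application class reads only its own section, a read-only application never holds the credentials of a database user that is allowed to write.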
Michigan staff began to implement improvements to the display of special access messages (e.g., for works that are out of print and brittle) in the mobile version of PageTurner.
The PageTurner scroll view now advances by full pages when the navigation controls are used (e.g., next page button), rather than advancing by half of a page at a time.
The HathiTrust feedback form now detects content and metadata-related feedback submissions by CRMS (Copyright Review Management System) reviewers, pre-filling problem tickets with CRMS-specific information to simplify the management of support requests.
No outages were reported in December.
HathiTrust sends notice upon discovery and resolution of unscheduled outages and in advance of scheduled outages and maintenance work that may result in an outage. We welcome and encourage additional recipients for these notices. If your institution is not receiving outage notifications and would like to, please contact feedback@issues.hathitrust.org. [71]
As of January 1:
| Institution | December | Overall |
| Boston College | 26 | 1,842 |
| Columbia University | 0 | 64,390 |
| Cornell University | 72 | 415,435 |
| Duke University | 0 | 4,523 |
| Harvard University | 0 | 235,985 |
| Indiana University | 177 | 195,073 |
| Library of Congress | 0 | 89,722 |
| North Carolina State University | 0 | 3,196 |
| Northwestern University | 15 | 12,722 |
| New York Public Library | 0 | 259,574 |
| Penn State University | 207 | 44,732 |
| Princeton University | 1 | 251,651 |
| Purdue University | 104 | 44,629 |
| Universidad Complutense | 0 | 111,901 |
| University of California | 1,196 | 3,383,255 |
| The University of Chicago | 57 | 26,720 |
| University of Florida | 974 | 2,008 |
| University of Illinois | 843 | 104,887 |
| University of Michigan | 7,258 | 4,609,836 |
| University of Minnesota | 373 | 104,212 |
| University of North Carolina, Chapel Hill | 0 | 8,088 |
| University of Wisconsin | 106 | 550,380 |
| University of Virginia | 0 | 50,799 |
| Utah State | 0 | 117 |
| Yale University | 0 | 23,678 |
| Total | 11,409 | 10,599,355 |
Public Domain (~31%)
| Total* | 9,401 | 3,278,630 |
* Includes volumes opened through copyright review and rights holder permissions
| Issue Type | December | November |
| Content | 274 | 304 |
| - Quality | 268 | 298 |
| - Non-partner Digital Deposit | 3 | 0 |
| - Collections | 6 | 4 |
| Cataloging | 52 | 86 |
| Access and Use | 95 | 95 |
| - Copyright | 59 | 43 |
| - Permissions | 9 | 4 |
| - Takedown | 0 | 0 |
| - Print on Demand | 0 | 0 |
| - Inter-library loan | 0 | 0 |
| - Full-PDF or e-copy requests | 11 | 15 |
| - Datasets | 5 | 4 |
| - Data Availability and APIs | 0 | 1 |
| - Reuse of content | 2 | 2 |
| Web applications | 16 | 13 |
| - Functionality problems | 5 | 4 |
| - Problems with login specifically | 2 | 0 |
| - General Questions about Login | 1 | 2 |
| - Partners setting up login | 3 | 0 |
| - Usability issues | 1 | 0 |
| - Feature requests | 0 | 3 |
| Partner Ingest | 1 | 3 |
| General | 48 | 141 |
| - Partnership | 10 | 18 |
| - Infrastructure | 0 | 0 |
| - Miscellaneous | 38 | 123 |
| Total | 486 | 642 |
See http://www.hathitrust.org/papers [110] for all papers, presentations, and reports.
[Download PDF] [136]
HathiTrust has received a number of inquiries recently about corrections to bibliographic data. HathiTrust’s general policy on bibliographic data correction is available at http://www.hathitrust.org/bib_metadata_correction [90]. We consider the definitive records for volumes in HathiTrust (which are generally volumes digitized from print originals) to be those held by the depositing institutions. When institutions submit corrections to print records in HathiTrust, these corrections are not automatically propagated to WorldCat. Institutions must update the print records in WorldCat separately.
OCLC creates records in WorldCat for electronic versions of works as they become available in HathiTrust (OCLC uses the hathifiles [4] to identify when new volumes enter the repository and then derives digital master records from the print records identified by the OCLC numbers in the hathifiles). These electronic versions are solely OCLC’s responsibility and under its control. Institutions do not need to, and should not try to, update records for electronic versions. We are working with OCLC to refine the process by which records for e-versions are updated to stay in sync with HathiTrust records and with the records for print versions that institutions update. We will be providing more information on this in future updates. For the present, if you notice a problem with a record in WorldCat for a HathiTrust volume, please notify us at feedback@issues.hathitrust.org [66].
HathiTrust completed changes that will incorporate the “in print” status of volumes (whether or not a volume is in print), as well as holding status and condition information provided by partners in their print holdings data [70], in volume access determinations.
Staff from Texas A&M University contacted HathiTrust to discuss deposit of locally-digitized volumes related to Texas agricultural history. HathiTrust provided ingest support to the University of Iowa, University of Illinois, and University of Utah, including elaboration of content specifications, help in running image validation tools, and assistance in diagnosing errors. The information page [54] about the tools HathiTrust provides for packaging and validating locally-digitized materials has been revised and includes a link to an updated HathiTrust Deposit Form [137], which in turn includes guidelines and specifications for deposit.
The University of North Carolina submitted a sample of bibliographic metadata in anticipation of an upcoming deposit. The University of Florida began deposit of Internet Archive-digitized volumes and anticipates depositing 26,250 items over the next several months. HathiTrust ingested two additional batches of content (totaling nearly 400 volumes) from Penn State, with two more batches to be ingested in December. The University of Illinois deposited more than 800 volumes as part of an ongoing project.
Working groups and committees in HathiTrust may have an operational or strategic focus. See http://www.hathitrust.org/working_groups [36] for more information.
The User Experience Advisory Group was pleased to welcome a new member, Nadaleen Tempelman-Kluit, to the group. Nadaleen is an Instructional Design Librarian at New York University.
A summary of issues received by the User Support Working Group is given in the table at the end of the update.
California Digital Library (CDL) continued to work with staff at the University of Michigan to test processes for exporting bibliographic data from Zephir for use in HathiTrust services. CDL improved the speed at which data could be exported from Zephir. CDL and Michigan continued to plan for the time when the current bibliographic management system at Michigan and the new system (Zephir) will run in parallel. This will occur prior to HathiTrust moving to Zephir as the bibliographic management system for HathiTrust.
A summary of copyright review activities in November is given below.
| | November | | Overall | |
| | Opened | Reviewed | Opened | Reviewed |
| CRMS-US | 4,177 | 8,404 | 178,872 | 338,463 |
| CRMS-World | 4,933 | 8,699 | 15,181 | 30,965 |
| Total | 9,110 | 17,103 | 194,053 | 369,428 |
The project team continued to plan user studies to evaluate and contextualize findings of the grant project. Grant principal investigator Paul Conway traveled to the University of Minnesota to launch the first user study, which will investigate thresholds for error tolerance in digitized volumes among library collections managers. Focus group meetings and other activities for this study will continue through the first quarter of 2013. The team submitted its second narrative report to IMLS, summarizing activities in the past year. The report will be posted soon on the project website [138].
Staff at the University of Michigan continued work on a mockup of changes needed to the PageTurner interface to support navigation of XML-based articles. Staff began to develop functionality to render JATS articles in PDF (for download purposes). Staff also engaged in discussions about the mPach article ingest workflow and proposed modifications to HathiTrust’s Collections feature to facilitate navigation among journal articles.
This past June [139], staff at Michigan discovered a bug [140] in the Solr edismax query parser that made search precision improvements for CJK (Chinese, Japanese, and Korean) materials smaller than expected. In November, conversations between staff at Michigan and Stanford about issues with CJK support led Michigan to contact Solr/Lucene developer and committer Robert Muir for advice. Muir (unaffiliated with Michigan or Stanford), an expert on multilingual issues, wrote and committed a code patch that fixed the bug. Staff at Michigan implemented the patch and have seen orders-of-magnitude improvements: for example, the query [東京スカイツリー] (Tokyo Sky Tree) produced about 450,000 hits without the patch and 16 hits with it. HathiTrust is very grateful for this assistance. Michigan staff made further improvements to indexing, which will be used in a full re-indexing of the full-text index in December. Staff also produced a sample bigram index, which will be used in ongoing work at California Digital Library on a spelling suggestion feature.
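The update does not describe the mechanics of the bug, but the scale of the difference is easier to picture with a small Python sketch. It assumes CJK text is indexed as overlapping character bigrams, and it compares matching on any bigram with matching on all of them. The function names and sample documents are hypothetical, and this is not Solr's code.

```python
def char_bigrams(text):
    """Split text into overlapping two-character units."""
    return [text[i:i + 2] for i in range(len(text) - 1)]


def matches(query, document, require_all):
    """Match a query against a document using character bigrams.

    require_all=False behaves like OR over the query's bigrams;
    require_all=True behaves like AND over them.
    """
    document_bigrams = set(char_bigrams(document))
    hits = [bigram in document_bigrams for bigram in char_bigrams(query)]
    return all(hits) if require_all else any(hits)


def count_hits(query, documents, require_all):
    """Count documents that match the query under the chosen mode."""
    return sum(matches(query, doc, require_all) for doc in documents)
```

Under OR-style matching, any document that merely contains 東京 (Tokyo) would match the query 東京スカイツリー, so hit counts grow quickly. Requiring every bigram keeps only documents that contain the full phrase's components.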
Staff at Michigan reviewed proposals received in response to an RFP issued in October for high-performance storage for full-text search, and are in the process of selecting the final systems for which to negotiate pricing. Installation and testing of the high-performance storage is tentatively scheduled for January.
HathiTrust made a number of updates to Web applications, including:
Programmers for HathiTrust Web applications convened to develop a strategy for implementing a single Cascading Style Sheet (CSS) framework across all applications. A single framework will increase interface consistency and simplify future development, including a planned redesign of the HathiTrust home page and common portions of application interfaces.
On Saturday, November 3, search within a volume was unavailable to some users from 3:00-8:30am and full-text search was unavailable to some users from 6:00-8:00am due to a temporary disk space shortage on a search server at one HathiTrust site.
HathiTrust sends notice upon discovery and resolution of unscheduled outages and in advance of scheduled outages and maintenance work that may result in an outage. We welcome and encourage additional recipients for these notices. If your institution is not receiving outage notifications and would like to, please contact feedback@issues.hathitrust.org [66].
As of December 1:
| Institution | November | Overall |
| Boston College | 0 | 1,816 |
| Columbia University | 204 | 64,390 |
| Cornell University | 3,209 | 415,363 |
| Duke University | 0 | 4,523 |
| Harvard University | 0 | 235,985 |
| Indiana University | 156 | 194,896 |
| Library of Congress | 0 | 89,722 |
| North Carolina State University | 0 | 3,196 |
| Northwestern University | 144 | 12,707 |
| New York Public Library | 0 | 259,574 |
| Penn State University | 390 | 44,525 |
| Princeton University | 0 | 251,650 |
| Purdue University | 70 | 44,525 |
| Universidad Complutense | 0 | 111,901 |
| University of California | 3,665 | 3,382,059 |
| The University of Chicago | 7 | 26,663 |
| University of Florida | 1,034 | 1,034 |
| University of Illinois | 3,033 | 104,044 |
| University of Michigan | 5,608 | 4,602,578 |
| University of Minnesota | 304 | 103,839 |
| University of North Carolina, Chapel Hill | 0 | 8,088 |
| University of Wisconsin | 3,472 | 550,274 |
| University of Virginia | 0 | 50,799 |
| Utah State | 0 | 117 |
| Yale University | 0 | 23,678 |
| Total | 21,296 | 10,587,946 |
Public Domain (~31%)
| Total* | 17,122 | 3,269,229 |
* Includes volumes opened through copyright review and rights holder permissions
| Issue Type | November | October |
| Content | 304 | 310 |
| - Quality | 298 | 297 |
| - Non-partner Digital Deposit | 0 | 1 |
| - Collections | 4 | 6 |
| Cataloging | 86 | 111 |
| Access and Use | 95 | 112 |
| - Copyright | 43 | 58 |
| - Permissions | 4 | 11 |
| - Takedown | 0 | 1 |
| - Print on Demand | 0 | 1 |
| - Inter-library loan | 0 | 4 |
| - Full-PDF or e-copy requests | 15 | 13 |
| - Datasets | 4 | 2 |
| - Data Availability and APIs | 1 | 0 |
| - Reuse of content | 2 | 0 |
| Web applications | 13 | 21 |
| - Functionality problems | 4 | 8 |
| - Problems with login specifically | 0 | 0 |
| - General Questions about Login | 2 | 0 |
| - Partners setting up login | 0 | 0 |
| - Usability issues | 0 | 1 |
| - Feature requests | 3 | 1 |
| Partner Ingest | 3 | 9 |
| General | 141 | 61 |
| - Partnership | 18 | 14 |
| - Infrastructure | 0 | 0 |
| - Miscellaneous | 123 | 47 |
| Total | 642 | 624 |
See http://www.hathitrust.org/papers [110] for all papers, presentations, and reports.
[Download PDF] [144]
The HathiTrust Board of Governors has identified officers for the Executive Committee as follows:
Chair: Brian Schottlaender
Chair-elect/Treasurer: Sarah Michalak
Past Chair: Paul Courant
Chair of the Program Steering Committee: Bob Wolven
Executive Director (ex officio): John Wilkin
More information about the Board of Governors, including the charge and full membership, is available at http://www.hathitrust.org/board_of_governors [82].
The Graduate School of Library and Information Science (GSLIS) and the Illinois Informatics Institute (I3) at the University of Illinois are actively recruiting outstanding doctoral candidates interested in research assistantships with the HathiTrust Research Center (HTRC) to develop the HTRC infrastructure, create mechanisms for outreach and engagement with scholarly communities, and cross-pollinate ideas among HTRC stakeholders. View the full announcement [145] for more information.
HathiTrust altered the semantics of the “out-of-print and brittle” (“opb”) designation in the HathiTrust Rights Database to “out-of-print” (“op”) only, as outlined in last month’s update [146]. Volumes with the “op” designation began appearing in the tab-delimited Hathifiles on November 2. All “op” volumes will be updated in the Hathifiles on November 12. Rights Database documentation, including a sample scenario [147], has been updated to reflect the change.
HathiTrust answered questions from staff at the University of Missouri, University of Utah, and University of Washington about ingest of locally-digitized content, including questions about the new ingest tools [54] for packaging content prior to submission to HathiTrust.
Penn State and Columbia University provided bibliographic records for new sets of Internet Archive-digitized volumes to be ingested. Content from Columbia University is from its Medical Heritage Library. The University of North Carolina contacted HathiTrust staff to begin deposit of a second batch of Internet Archive-digitized volumes. The Getty Research Institute resumed discussions regarding deposit of its IA-digitized materials.
Working groups and committees in HathiTrust may have an operational or strategic focus. See http://www.hathitrust.org/working_groups [36] for more information.
The User Experience Advisory Group provided feedback on a new landing page for Limited (search-only) volumes in HathiTrust and a prototype of a new PageTurner design created by University of Michigan staff.
A summary of issues received by the User Support Working Group is given in the table at the end of the update.
California Digital Library (CDL) staff worked with staff at the University of Michigan to test data exports from Zephir that will be used in HathiTrust services such as bibliographic and full-text search. The testing examined issues of performance in data transfer, as well as the structure of the exports.
CDL staff completed testing of the Zephir bibliographic record submission process with the majority of institutions that are contributing records to HathiTrust on an ongoing basis. CDL and HathiTrust staff met to discuss the process for communicating with institutions about submission of bibliographic data and content once the cutover to Zephir occurs.
A summary of copyright review activities in October is given below.
| | October Opened | October Reviewed | Overall Opened | Overall Reviewed |
| CRMS-US | 4,177 | 8,404 | 178,872 | 338,463 |
| CRMS-World | 4,933 | 8,699 | 15,181 | 30,965 |
| Total | 9,110 | 17,103 | 194,053 | 369,428 |
Members of the project team continued preparations to launch the first of two user studies related to content quality. The first study will use image review exercises and focus groups to examine thresholds of error tolerance in digital volumes for library collection managers. Staff from the University of Michigan and University of Minnesota will participate in the study.
The project team analyzed outcomes of its meeting with imaging scientist Don Williams, which took place in September, and enhanced its catalog of commonly identified illustration errors based on information from the meeting.
The team worked to finalize a data curation profile and produce final datasets of the data collected during the grant project. More information on the project is available on the project website [138].
Staff at the University of Michigan completed a prototype of the Prepper module (see a list of all modules [101]), as well as enhancements to PageTurner to display journal articles encoded in JATS XML, in time for a presentation and demo at the 2012 DLF Forum.
HathiTrust fixed a bug that prevented authentication for users who had certain character entity references (e.g., “é”) in their Shibboleth displayName attribute. HathiTrust also implemented functionality to map users from multiple authentication Identity Providers (IdPs) to a single partner institution. This functionality comes into play when multiple campuses or organizations are members under the aegis of a single institution.
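The many-to-one relationship between IdPs and a partner institution can be pictured as a simple lookup table. The following is a minimal sketch, not HathiTrust's implementation; the entity IDs, the institution code `example-system`, and the function name `institution_for_idp` are all hypothetical.

```python
from typing import Optional

# Hypothetical IdP entity IDs and institution code, for illustration only.
# Two campus IdPs resolve to the same partner institution.
IDP_TO_INSTITUTION = {
    "https://idp.campus-a.example.edu/shibboleth": "example-system",
    "https://idp.campus-b.example.edu/shibboleth": "example-system",
}


def institution_for_idp(entity_id: str) -> Optional[str]:
    """Return the partner institution for an IdP, or None if it is unknown."""
    return IDP_TO_INSTITUTION.get(entity_id)
```

With a table like this, users who authenticate through either campus IdP would be treated as members of the same institution.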
HathiTrust completed final development work associated with supporting OAuth signatures on requests to the Data API. HathiTrust also began work on version 2 of the Data API, and tested new features that will support the delivery of PDFs for print-on-demand purposes, and include improved URI syntax to better support new formats such as JATS XML for mPach.
Staff at the University of Michigan conducted a series of tests to gather technical requirements for an RFP for a new high-performance storage system to improve the response time of full-text search, increase the volume of searches the system can handle, and accommodate the extra load that new relevance ranking features would introduce. The tests resulted in specific numerical requirements that were incorporated as minimum specifications into the RFP, which was completed and released to ten suppliers in October, with proposals due back in early November. Evaluation and final pricing negotiation are expected to continue through November and December, with system installation to take place in early 2013.
Michigan staff made changes to full-text search, as well as the HathiTrust bibliographic catalog, to improve faceting on the Author field for works with multiple authors.
Staff continued research geared toward improving relevance ranking and indexing of works in Chinese, Japanese, and Korean.
Imgsrv is the web application that serves derivatives of HathiTrust’s master images to Web applications such as the PageTurner. HathiTrust made changes to the way Imgsrv constructs PDFs for download to optimize for size. When possible, the original JP2 and TIFF images stored in the repository are included in the PDF. If there is a risk that the final PDF will be over 2GB, a lower resolution derivative is extracted from JP2 images and compressed as a JP2; TIFF images are scaled down and compressed as JPEGs.
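The size rule described above can be sketched as a small decision function. This is an illustration only, not Imgsrv's code: the names `PageImage` and `plan_pdf_images` are hypothetical, "2GB" is read as 2 GiB, and the risk of exceeding the limit is estimated simply as the sum of the master image sizes, which is an assumption the update does not confirm.

```python
from dataclasses import dataclass
from typing import List

# Assumption: "2GB" is interpreted as 2 GiB.
SIZE_LIMIT_BYTES = 2 * 1024**3


@dataclass
class PageImage:
    fmt: str         # "jp2" or "tiff"
    size_bytes: int  # size of the stored master image


def plan_pdf_images(pages: List[PageImage]) -> List[str]:
    """Return one embedding action per page.

    Assumption: the final PDF size is estimated as the sum of the
    master image sizes; the real estimate may differ.
    """
    estimated = sum(p.size_bytes for p in pages)
    if estimated <= SIZE_LIMIT_BYTES:
        # No size risk: include the original images as stored.
        return ["embed original" for _ in pages]

    actions = []
    for p in pages:
        if p.fmt == "jp2":
            actions.append("lower-resolution JP2")
        elif p.fmt == "tiff":
            actions.append("scaled-down JPEG")
        else:
            raise ValueError(f"unexpected image format: {p.fmt}")
    return actions
```

The point of the sketch is that the choice is made per volume first (is the PDF at risk of being too large?) and only then per page format.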
In conjunction with recommendations from the UX Advisory Group, the default view in HathiTrust was changed to “scroll” view. HathiTrust also improved processes for caching images and made modifications to the landing page for the limited (search-only) works.
Over the last several months, University of Michigan UX department staff have been working on new designs for the HathiTrust home page and application interfaces. In October, developers at Michigan began to explore options for a consolidated framework of Cascading Style Sheets (CSS) across HathiTrust applications.
No outages were reported in October.
HathiTrust sends notice upon discovery and resolution of unscheduled outages and in advance of scheduled outages and maintenance work that may result in an outage. We welcome and encourage additional recipients for these notices. If your institution is not receiving outage notifications and would like to, please contact feedback@issues.hathitrust.org [66].
As of October 1:
| Institution | October | Overall |
| Boston College | 0 | 1,816 |
| Columbia University | 2 | 64,184 |
| Cornell University | 3,317 | 408,837 |
| Duke University | 0 | 4,523 |
| Harvard University | 2 | 235,985 |
| Indiana University | 7,057 | 194,740 |
| Library of Congress | 0 | 89,722 |
| North Carolina State University | 0 | 3,196 |
| University of North Carolina - Chapel Hill | 0 | 8,088 |
| Northwestern University | 5,342 | 12,563 |
| New York Public Library | 3 | 259,574 |
| Penn State University | 4 | 44,135 |
| Princeton University | 6 | 251,650 |
| Purdue University | 3,989 | 44,455 |
| Universidad Complutense | 2 | 111,901 |
| University of California | 4,522 | 3,378,394 |
| The University of Chicago | 1,739 | 26,656 |
| University of Illinois | 1 | 101,011 |
| University of Michigan | 14,426 | 4,596,970 |
| University of Minnesota | 919 | 103,535 |
| University of Wisconsin | 1,014 | 546,802 |
| University of Virginia | 9 | 50,799 |
| Utah State | 0 | 117 |
| Yale University | 0 | 23,678 |
| Total | 42,354 | 10,566,650 |
Public Domain (~30%)
| Total* | 42,354 | 3,252,107 |
* Includes volumes opened through copyright review and rights holder permissions
| Issue Type | October | September |
| Content | 310 | 248 |
| Quality | 297 | 242 |
| Non-partner Digital Deposit | 1 | 0 |
| Collections | 6 | 2 |
| Cataloging | 111 | 80 |
| Access and Use | 112 | 116 |
| Copyright | 58 | 71 |
| Permissions | 11 | 5 |
| Takedown | 1 | 2 |
| Print on Demand | 1 | 0 |
| Inter-library loan | 4 | 4 |
| Full-PDF or e-copy requests | 13 | 11 |
| Datasets | 2 | 3 |
| Data Availability and APIs | 0 | 0 |
| Reuse of content | 0 | 1 |
| Web applications | 21 | 12 |
| Functionality problems | 8 | 4 |
| Problems with login specifically | 0 | 0 |
| General Questions about Login | 0 | 0 |
| Partners setting up login | 0 | 0 |
| Usability issues | 1 | 0 |
| Feature requests | 1 | 0 |
| Partner Ingest | 9 | 3 |
| General | 61 | 55 |
| Partnership | 14 | 10 |
| Infrastructure | 0 | 0 |
| Miscellaneous | 47 | 45 |
| Total | 624 | 514 |
See http://www.hathitrust.org/papers [110] for all papers, presentations, and reports.
[Download PDF] [152]
In a decision that will have broad repercussions across libraries, on October 10, 2012 Judge Baer dismissed the lawsuit filed just over a year ago by the Authors Guild et al. against HathiTrust and several participating libraries. HathiTrust has released an official statement [86] on the ruling. Information [81] about the lawsuit, as well as relevant analysis and reactions [153] from around the Web, is available on the HathiTrust website.
The HathiTrust Research Center held its first annual “UnCamp” in Bloomington, IN on September 10-11. 130 researchers, developers, and librarians from HathiTrust member and non-member institutions gathered in Indiana University’s new CyberInfrastructure Building for presentations, demos, and hands-on sessions with the emerging Research Center tools. These included tools both to perform research on the HathiTrust corpus and to create new or customized algorithms and processes for research. Responses to the UnCamp have been very enthusiastic, giving energy to efforts to enable computational access to the incredible body of works in HathiTrust. More information on the UnCamp, including presentations, resources, reactions and responses via tweets, and more can be found on the HathiTrust Research Center Wiki [93]. See also the press release [154] from the University of Illinois.
HathiTrust has initiated a project to build a comprehensive registry of U.S. federal government documents. The Registry is an emerging effort in a broader undertaking by HathiTrust partners to improve access to U.S. federal government documents. Further information and background on the project is available on the Registry project page [34]. A two-year term Government Documents Registry Analyst position [155] for the project was posted in September.
In the coming weeks, HathiTrust will begin making infrastructural changes to incorporate information about the holdings status and condition of volumes at partner institutions into access services. The changes will apply in particular to access on library premises to in-copyright works that fall under Section 108 provisions of the U.S. Copyright Act. One of the infrastructural changes will be altering the semantics of the “out-of-print and brittle” (“opb”) designation in HathiTrust’s rights database to “out-of-print” (“op”) only. This change will be made on November 1, 2012, and will be reflected in HathiTrust interfaces, and services such as the Hathifiles where rights information is made available.
HathiTrust coordinated with the University of Florida on upcoming deposit of volumes, and ingested a new batch of volumes from Penn State.
HathiTrust ingested a new set of volumes from Utah State University Press and began conversations with the University of Delaware about processes and requirements for deposit of locally-digitized content. HathiTrust also corresponded with the University of Iowa about use of the new tools [54] for validating and packaging locally-digitized materials for deposit. Institutions with questions about the new tools should contact feedback@issues.hathitrust.org [66].
Working groups and committees in HathiTrust may have an operational or strategic focus. See http://www.hathitrust.org/working_groups [36] for more information.
The Communications Working Group continued to follow developments in HathiTrust governance, and to evaluate how the communications function in HathiTrust might be improved once the new governance structure is in place. The survey [156] for HathiTrust training and information sessions has closed, and HathiTrust will use the results as a basis for upcoming informational events. If you did not have a chance to submit feedback and would like to, please email responses to the survey to feedback@issues.hathitrust.org [66].
The User Experience Advisory Group continued discussions about a new home page design and provided feedback on mockups created by the University of Michigan.
A summary of issues received by the User Support Working Group is given in the table at the end of the update.
California Digital Library (CDL) is in the final phase of development to bring Zephir into parity with the existing bibliographic management system at the University of Michigan. Once Zephir is in operation, institutions will submit bibliographic records for volumes they plan to deposit to Zephir, and Zephir will produce exports of bibliographic data that will be used in HathiTrust Web services. In October, as part of preparations for integration testing with HathiTrust systems, CDL staff will begin preliminary testing of the Zephir outputs to evaluate system performance and confirm the structure of outputs (that they have the correct metadata fields, etc.). CDL has been contacting institutions that are contributing records to HathiTrust on an ongoing basis to test the process for submitting bibliographic records to Zephir. If your institution is not contributing content to HathiTrust currently but you would like to test the new submission process, please contact feedback@issues.hathitrust.org [66].
A summary of copyright review activities in September is given below.
| | September Opened | September Reviewed | Overall Opened | Overall Reviewed |
| CRMS-US | 4,700 | 9,176 | 174,695 | 330,059 |
| CRMS-World | 3,656 | 7,191 | 10,248 | 22,266 |
| Total | 8,356 | 16,367 | 184,943 | 352,325 |
The project team finalized a catalog of commonly-seen illustration errors in HathiTrust volumes for a sub-study on illustrative error. Donald Williams, a renowned research imaging scientist, analyzed the errors and met with members of the project team to explain the sources of the errors and possibilities for correction.
The project team continued work on the design of user studies to evaluate project findings, collection of data to support the user studies, and administration of the user studies themselves. Team members also discussed ways that quality review interfaces developed during the grant might be modified to support the certification of individual volumes. For more information on the project, please visit the project website [138].
Staff at the University of Michigan created a mockup of PageTurner changes that will be needed to navigate the XML-based journal articles that will be submitted via mPach. Work also continued on modifications to PageTurner to display JATS XML and embedded media and on refinements to the METS specification for mPach Submission Information Packages. Staff completed wireframes and began coding the Dashboard module (see the list of mPach modules [101] for more information). Michigan staff members will present on mPach at the 2012 DLF Forum.
HathiTrust has completed the first phase of improvements to enhance accessibility of HathiTrust Web applications. With a few minor exceptions that will be addressed in the second phase, HathiTrust interfaces are now compliant with Web Content Accessibility Guidelines (WCAG) 2.0 [89], Level A. The second phase will target compliance with WCAG 2.0 Level AA and include usability testing by users who have print disabilities.
As of October 1, all requests to the Data API must be signed with an access key provided by HathiTrust. Details are available at http://www.hathitrust.org/data_api [103].
The Data API is being configured to deliver watermarked image derivatives in JPEG and PNG formats at a range of resolutions. The API currently delivers un-watermarked master images from the repository in TIFF and JP2. The Data API Web client was enhanced to support image derivatives once they become available through the Data API, and to support development-level debugging.
University of Michigan staff modified the full-text search indexing process to prevent volumes from being indexed on more than one shard (section) of the full-text Solr index. Staff also began testing full-text search using Solr 4.0 Beta. Solr 4.0 offers new ranking algorithms that may provide better relevance ranking for long documents (e.g., books). A paper [69] by Michigan developer Tom Burton-West on full-text search relevance ranking in HathiTrust was published in the INEX 2012 pre-proceedings as part of the CLEF Labs Working Notes [157].
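One common way to guarantee that a document lands on exactly one shard is to derive the shard deterministically from the document's identifier. The update does not say how the indexing process was changed, so the following is only a sketch of that general technique; the function name `shard_for_volume` and the shard count used in the example are hypothetical.

```python
import hashlib


def shard_for_volume(volume_id: str, shard_count: int) -> int:
    """Map a volume ID to a single shard number in [0, shard_count).

    Uses a stable hash (SHA-256) so the same ID always maps to the same
    shard across runs; Python's built-in hash() is not stable for strings.
    """
    if shard_count < 1:
        raise ValueError("shard_count must be at least 1")
    digest = hashlib.sha256(volume_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


# Example: with a hypothetical 12-shard index, a volume always maps to
# the same shard, so it cannot be indexed in two places.
shard = shard_for_volume("mdp.39015019203879", 12)
```

An assignment like this only holds while the shard count stays fixed; changing the number of shards would require re-assigning volumes.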
Following several months of informal research, Michigan staff began focused investigation into high-performance storage systems to improve full-text search response time and substantially increase search throughput capacity. An RFP for a new high-performance storage system will be issued in October.
Imgsrv is the web application that serves derivatives of HathiTrust’s master images to Web applications such as the PageTurner. HathiTrust has enhanced Imgsrv to deliver HTML derivatives of born-digital content in support of mPach and JATS XML.
HathiTrust implemented interface improvements designed by Michigan’s User Experience department for cases where special access to HathiTrust materials is available, such as access by users who have print disabilities. The improvements include dismissible notifications when special access is in effect, and updated explanatory text when special access that might be expected is not available (special access cases are described in HathiTrust’s Access and Use Policies [158]). Special access is currently only available as a pilot at the University of Michigan. Extension of special access to other member institutions is still planned. More information will be forthcoming.
HathiTrust’s embeddable PageTurner is now based on the mobile PageTurner interface, which offers improved presentation and greater functionality.
HathiTrust has updated the version information displayed in the PageTurner to include the time a volume was removed from HathiTrust. Volumes may be removed from HathiTrust at the request of the rights holder, or in cases where the volume is wholly unusable or a superior copy is available.
From 1:00pm on Tuesday, September 25 to 8:30am on Friday, September 28, some bibliographic data failed to display in HathiTrust due to an outage of the system at Michigan that manages bibliographic data for HathiTrust.
HathiTrust sends notice upon discovery and resolution of unscheduled outages and in advance of scheduled outages and maintenance work that may result in an outage. We welcome and encourage additional recipients for these notices. If your institution is not receiving outage notifications and would like to, please contact feedback@issues.hathitrust.org [66].
As of September 1:
| Institution | September | Overall |
| Boston College | 0 | 1,816 |
| Columbia University | 0 | 64,184 |
| Cornell University | 82 | 408,837 |
| Duke University | 0 | 4,523 |
| Harvard University | 0 | 235,983 |
| Indiana University | 0 | 187,683 |
| Library of Congress | 0 | 89,722 |
| North Carolina State University | 0 | 3,196 |
| University of North Carolina - Chapel Hill | 0 | 8,088 |
| Northwestern University | 7 | 7,221 |
| New York Public Library | 0 | 259,571 |
| Penn State University | 113 | 44,131 |
| Princeton University | 0 | 251,644 |
| Purdue University | 2,418 | 40,466 |
| Universidad Complutense | 0 | 111,899 |
| University of California | 796 | 3,373,872 |
| The University of Chicago | 238 | 24,917 |
| University of Illinois | 9 | 101,010 |
| University of Michigan | 22,241 | 4,582,544 |
| University of Minnesota | 115 | 102,616 |
| University of Wisconsin | 2,993 | 545,788 |
| University of Virginia | 0 | 50,790 |
| Utah State | 27 | 117 |
| Yale University | 0 | 23,678 |
| Total | 29,039 | 10,524,296 |
Public Domain (~30%)
| Total* | 24,016 | 3,211,760 |
* Includes volumes opened through copyright review and rights holder permissions
| Issue Type | September | August |
| Content | 248 | 286 |
| Quality | 242 | 279 |
| Non-partner Digital Deposit | 0 | 1 |
| Collections | 2 | 3 |
| Cataloging | 80 | 142 |
| Access and Use | 116 | 119 |
| Copyright | 71 | 62 |
| Permissions | 5 | 15 |
| Takedown | 2 | 0 |
| Print on Demand | 0 | 1 |
| Inter-library loan | 4 | 8 |
| Full-PDF or e-copy requests | 11 | 21 |
| Datasets | 3 | 7 |
| Data Availability and APIs | 0 | 1 |
| Reuse of content | 1 | 4 |
| Web applications | 12 | 22 |
| Functionality problems | 4 | 8 |
| Problems with login specifically | 0 | 1 |
| General Questions about Login | 0 | 1 |
| Partners setting up login | 0 | 4 |
| Usability issues | 0 | 0 |
| Feature requests | 0 | 2 |
| Partner Ingest | 3 | 4 |
| General | 55 | 74 |
| Partnership | 10 | 9 |
| Infrastructure | 0 | 0 |
| Miscellaneous | 45 | 65 |
| Total | 514 | 647 |
Tom Burton-West, "Practical Relevance Ranking for 10 Million Books [69]", INEX 2012 pre-proceedings, CLEF Labs Working Notes, September 2012.
HathiTrust UnCamp presentations and resources [93] (via HathiTrust Research Center Wiki), September 10-11, 2012.
Heather Christenson and John Wilkin, "Intellectual Property Rights and the HathiTrust Collection" (forthcoming), UNESCO - The Memory of the World in the Digital Age: Digitization and Preservation, September 26, 2012.
Jeremy York, "A Preservation Infrastructure Built to Last: Preservation, Community, and HathiTrust [159]", UNESCO - The Memory of the World in the Digital Age: Digitization and Preservation, September 26, 2012.
See http://www.hathitrust.org/papers [110] for all papers, presentations, and reports.
[Download PDF] [160]
In the Update on July Activities [161] we distributed a short survey to receive feedback on our next series of HathiTrust information and training sessions. We have received many responses. The deadline for completing the survey is September 21. If you have not already, please take a moment to provide input on the kinds of sessions you would like to attend or lead, and the form you would prefer these sessions to take (e.g., a webinar series, in-person meeting, or a combination of the two). The survey is available at http://tinyurl.com/8n3k9nr [162].
Beginning October 1, all requests to the Data API will need to be signed with an access key provided by HathiTrust. Access keys for programmatic uses of the Data API can be obtained at http://babel.hathitrust.org/cgi/kgs/request [163]. HathiTrust has also created a Web client [164] that employs a user’s login credentials as a proxy for an access key to facilitate non-programmatic uses. Complete documentation of the security enhancements, methods of obtaining keys, how to sign requests, and how to access the Web client is available at http://www.hathitrust.org/data_api [103].
Also effective October 1, the host “services.hathitrust.org” will no longer exist for the Data API. The new host will be “babel.hathitrust.org”, the same host as the PageTurner and other HathiTrust services. Calls to the Data API will therefore need to use URLs such as the following (note the additional “cgi” in the path):
http://babel.hathitrust.org/cgi/htd/meta/mdp.39015019203879 [165]
rather than
http://services.hathitrust.org/htd/meta/mdp.39015019203879 [166]
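For clients that store old-style Data API URLs, the host and path change can be applied mechanically. The following is a minimal sketch, not an official HathiTrust tool; the function name `migrate_data_api_url` is hypothetical, and it assumes every old URL path begins with `/htd/`, as in the example above.

```python
from urllib.parse import urlsplit, urlunsplit

OLD_HOST = "services.hathitrust.org"
NEW_HOST = "babel.hathitrust.org"


def migrate_data_api_url(url: str) -> str:
    """Rewrite an old-host Data API URL to the new host and /cgi path.

    Hypothetical helper: URLs on any other host, or whose path does not
    start with /htd/, are returned unchanged.
    """
    parts = urlsplit(url)
    if parts.netloc != OLD_HOST or not parts.path.startswith("/htd/"):
        return url
    return urlunsplit(
        (parts.scheme, NEW_HOST, "/cgi" + parts.path, parts.query, parts.fragment)
    )
```

Applied to the old URL shown above, this returns the new `babel.hathitrust.org/cgi/htd/...` form; URLs already on the new host pass through untouched. Note that this only rewrites the address: requests still need to be signed with an access key after October 1.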
Later this year, HathiTrust will begin accepting the “library-walk-in” Shibboleth attribute from partner institutions to provide certain member privileges to guest users who do not have an institutional login. For instance, “library-walk-in” users will have the ability to download full-PDFs of all public domain materials in HathiTrust. Partners who wish to use HathiTrust library-walk-in functionality must confirm in writing that they are asserting the library-walk-in affiliation only for users physically present in a library building at the time of session initiation. Please see Shibboleth Login [107] for more information about Shibboleth in HathiTrust.
HathiTrust ingested nearly all of a set of approximately 2,000 volumes from Boston College, and loaded bibliographic records for additional volumes that will be deposited by the University of Illinois. The University of Florida submitted sample bibliographic records to be analyzed in preparation for content ingest.
Working groups and committees in HathiTrust may have an operational or strategic focus. See http://www.hathitrust.org/working_groups [36] for more information.
The Communications Working Group did not meet in August, taking its first break since the group’s formation in May 2010. As the group awaits the solidification of the new HathiTrust governance, group members plan to address the results of their survey on training [162], and look ahead to fall activities and meetings.
The User Experience Advisory Group continued discussions about a new home page design and provided feedback on mockups created by the University of Michigan.
A summary of the issues received by the User Support Working Group is provided at the end of the update.
California Digital Library (CDL) and University of Michigan staff agreed on a data workflow for updating rights information in the HathiTrust rights database when CDL takes responsibility for managing HathiTrust bibliographic data. The CDL team is refining and improving the performance of bibliographic data exports needed to support HathiTrust operations. Analysis continued to address issues with a small percentage of poor quality records.
Michigan staff successfully tested the bibliographic record submission process for Zephir (the new management system) and commented on corresponding submission guidelines. In the coming month CDL will be contacting institutions that are currently, or were in the past, contributors of content to HathiTrust to test the new process for submitting records. The test will be aimed primarily at current content contributors, but all contributors will be invited. Please contact feedback@issues.hathitrust.org [66] if your institution is not contributing content currently but you would like to test.
A summary of copyright review activities in August is given below. For further information on these activities please see CRMS-US [167] and CRMS-World [14].
| | August Opened | August Reviewed | Overall Opened | Overall Reviewed |
| CRMS-US | 5,773 | 11,793 | 169,995 | 320,883 |
| CRMS-World | 2,423 | 5,615 | 6,592 | 15,075 |
| Total | 8,196 | 17,408 | 176,587 | 335,958 |
The HTRC made preparations for its first “UnCamp”, held in Bloomington, Indiana on September 10-11. A full report on the gathering will be forthcoming.
Project staff continued work to finalize the quality review datasets. This included reviewing datasets for completeness, accuracy, and missing data, and performing reliability and validation testing on data for volumes that were double-coded for quality assurance purposes.
The IMLS grant advisory board met for the second time in mid-August. The project team presented its findings to date and advisory board members provided input on work to be completed in the final stages of the project, as well as on future research directions. Over the next several months the project team will focus on completing the design of user studies to further investigate quality in relation to the usefulness of digitized volumes, collecting data to support the user studies, and conducting the user studies themselves.
Efforts continue to develop a framework for certifying the quality of volumes in HathiTrust. This includes the development of a modified data collection Web interface based on the interfaces used in the grant thus far.
For more information on the project, please visit the project website [120].
The mPach team at the University of Michigan updated the project timeline on the HathiTrust project page [37]. Work continued on modifications to the HathiTrust PageTurner to display JATS XML, and on refinements to the METS specification for mPach Submission Information Packages. Michigan staff made progress on enhancements to the Norm tool (part of content preparation), specifically enhancements to normalize bulleted lists, figures with captions, and tables. Wireframes are nearly complete for the Dashboard module (see the list of mPach modules [101] for more information). Michigan staff will be presenting on mPach at the 2012 DLF Forum.
Staff at the University of Michigan continued work to improve general accessibility for HathiTrust Web applications.
Michigan staff extended functionality of the Data API to serve full PDFs of volumes for print-on-demand services on Espresso Book Machines (EBM) via the ExpressNet sales network. Staff also augmented Data API usage monitoring to explicitly track signed requests, and made enhancements that will enable the Data API to deliver dynamically-generated image derivatives (such as PNG images as opposed to TIFF or JP2 images).
Development and testing for the metadata upgrade reported in the Update on July Activities [168] have been completed, and the upgrade will begin in October.
Michigan staff continued to investigate the Solr edismax parser bug that is preventing CJK searching from working properly. Staff confirmed that the bug also affects Solr 4.0 and submitted sample documents and queries demonstrating the problem to the Solr JIRA issue tracking system: see https://issues.apache.org/jira/browse/SOLR-3589 [140]. Staff investigated possible workarounds for this issue, and conducted e-mail discussions with several Blacklight developers who are working on CJK issues.
Staff also made changes to the automated full-text search indexing process so that failures caused by server errors are automatically re-queued.
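The re-queuing behavior described above can be sketched roughly as follows. This is a minimal illustration, not HathiTrust's indexing code: the `ServerError` class, the `run_queue` function, the `index_one` callback, and the `max_attempts` limit are all hypothetical names and assumptions introduced for the example.

```python
from collections import deque


class ServerError(Exception):
    """Hypothetical stand-in for a transient failure reported by an indexing server."""


def run_queue(volume_ids, index_one, max_attempts=3):
    """Index each volume, re-queuing any volume whose attempt fails with a server error.

    `index_one` is a caller-supplied function (an assumption of this sketch)
    that indexes one volume and raises ServerError on a server-side failure.
    """
    queue = deque((vid, 1) for vid in volume_ids)
    indexed, abandoned = [], []
    while queue:
        vid, attempt = queue.popleft()
        try:
            index_one(vid)
            indexed.append(vid)
        except ServerError:
            if attempt < max_attempts:
                # Put the volume at the back of the queue for another attempt.
                queue.append((vid, attempt + 1))
            else:
                # Give up after max_attempts so a permanently failing
                # volume cannot keep the queue running forever.
                abandoned.append(vid)
    return indexed, abandoned
```

Capping the number of attempts is a design choice made for this sketch; the update does not say whether the production process limits retries.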
The INEX (Initiative for the Evaluation of XML Retrieval) Book Track accepted a paper by Michigan developer Tom Burton-West on full-text search relevance ranking in HathiTrust. The paper will be published in the INEX 2012 Pre-proceedings as part of the CLEF Labs Working Notes.
Michigan staff made changes that will make it easier to support new formats in the PageTurner interface. The mPach project will make use of the changes to add support for JATS XML.
HathiTrust was unavailable on Monday, August 13 from 7:30-8am EDT for a security-related database reorganization.
HathiTrust sends notice upon discovery and resolution of unscheduled outages and in advance of scheduled outages and maintenance work that may result in an outage. We welcome and encourage additional recipients for these notices. If your institution is not receiving outage notifications and would like to, please contact feedback@issues.hathitrust.org [66].
As of August 1:
| | August | Overall |
|---|---|---|
| Boston College | 1,816 | 1,816 |
| Columbia University | 0 | 64,184 |
| Cornell University | 5,307 | 408,755 |
| Duke University | 0 | 4,523 |
| Harvard University | 1,637 | 235,983 |
| Indiana University | 14 | 187,683 |
| Library of Congress | 1 | 89,722 |
| North Carolina State University | 0 | 3,196 |
| University of North Carolina - Chapel Hill | 0 | 8,088 |
| Northwestern University | 6 | 7,214 |
| New York Public Library | 8 | 259,571 |
| Penn State University | 35 | 44,018 |
| Princeton University | 781 | 251,644 |
| Purdue University | 10,361 | 38,048 |
| Universidad Complutense | 71 | 111,899 |
| University of California | 26,493 | 3,373,076 |
| The University of Chicago | 2,240 | 24,679 |
| University of Illinois | 823 | 101,001 |
| University of Michigan | 8,533 | 4,560,303 |
| University of Minnesota | 2,105 | 102,501 |
| University of Wisconsin | 3,559 | 542,795 |
| University of Virginia | 1,868 | 50,790 |
| Utah State | 0 | 90 |
| Yale University | 0 | 23,678 |
| Total | 65,658 | 10,495,257 |
Public Domain (~30%)
| Total* | 60,100 | 3,187,744 |
* Includes volumes opened through copyright review and rights holder permissions
| Issue Type | August | July |
|---|---|---|
| Content | 286 | 326 |
| Quality | 279 | 318 |
| Non-partner Digital Deposit | 1 | 0 |
| Collections | 3 | 4 |
| Cataloging | 142 | 113 |
| Access and Use | 119 | 112 |
| Copyright | 62 | 66 |
| Permissions | 15 | 16 |
| Takedown | 0 | 1 |
| Print on Demand | 1 | 4 |
| Inter-library loan | 8 | 6 |
| Full-PDF or e-copy requests | 21 | 16 |
| Datasets | 7 | 4 |
| Data Availability and APIs | 1 | 0 |
| Reuse of content | 4 | 3 |
| Web applications | 22 | 27 |
| Functionality problems | 8 | 3 |
| Problems with login specifically | 1 | 0 |
| General Questions about Login | 1 | 1 |
| Partners setting up login | 4 | 0 |
| Usability issues | 0 | 12 |
| Feature requests | 2 | 2 |
| Partner Ingest | 4 | 2 |
| General | 74 | 108 |
| Partnership | 9 | 7 |
| Infrastructure | 0 | 0 |
| Miscellaneous | 65 | 101 |
| Total | 647 | 688 |
[Download PDF [169]]
HathiTrust has offered webinars in the past to orient new partners and provide updates on HathiTrust services and initiatives. For our next series, or program, we are considering offering information sessions led by staff throughout the partnership on topics of interest. To begin to plan for these sessions, we would like to receive feedback from members of partner institutions on 4 questions related to session topics, venue, and participation. A form with the questions is available at http://tinyurl.com/8n3k9nr [162]. Although feedback is especially sought from partner institutions, others may provide input. If there is sufficient interest we will consider offering sessions open to anyone, whether affiliated with HathiTrust partner institutions or not. Responses are requested by September 21, 2012.
University of Michigan developers, under the guidance of Michigan’s User Experience Department, are in the process of reviewing and making improvements to the accessibility of HathiTrust Web applications. The first phase of the work, which began in July, involves ensuring compliance of HathiTrust interfaces with the Web Content Accessibility Guidelines (WCAG) 2.0 [89], Level A. The second phase will target compliance with WCAG 2.0 Level AA, and include usability testing by users who have print disabilities. It is expected as part of this work that Michigan staff will begin to draft policies and guidelines to ensure that future coding for HathiTrust applications maintains these standards.
Staff from partner institutions with Web accessibility expertise who are interested in being involved with this initiative are encouraged to contact Suzanne Chapman (suzchap@umich.edu [170]).
Over the last several years, staff from HathiTrust partner institutions have been manually reviewing the copyright status of volumes published in the United States from 1923 to 1963, as part of CRMS-US, an IMLS-funded grant project. The grant has come to an end but review by partner institutions continues, and has been expanded through a second IMLS grant, CRMS-World, that is targeting review of non-US-published works, beginning with English-language works published in the United Kingdom, Canada, Australia, and Spain. Through this process tens of thousands of works have been discovered to be in the public domain and opened in HathiTrust to viewers world-wide. Reports on the number of volumes reviewed and opened as of early August are shown below and will be included in future updates.
| | Reviewed | Opened |
|---|---|---|
| CRMS-US | 309,090 | 164,222 |
| CRMS-World | 9,459 | 4,169 |
| Total | 318,549 | 168,391 |
Michigan staff provided support as needed for the new ingest tools [54] that have been made available. Staff who have questions about using the tools, or who would like to initiate deposit of materials should contact feedback@issues.hathitrust.org [66].
HathiTrust began ingest of a first set of Internet Archive-digitized volumes from Penn State and a second set of Internet Archive-digitized volumes from the University of Illinois.
Working groups and committees in HathiTrust may have an operational or strategic focus. See http://www.hathitrust.org/working_groups [36] for more information.
A summary of the issues received by the User Support Working Group in July is provided at the end of the update.
California Digital Library (CDL) staff continued working with staff at the University of Michigan to develop processes to sync rights determination information between Zephir and the HathiTrust rights database. CDL also worked with Michigan to test new bibliographic submission guidelines and a new workflow for submitting bibliographic metadata to Zephir via FTPS. HathiTrust members currently depositing content will soon be asked to participate in a test of this new submission process, which will be put in place when the cutover from the bibliographic management system at the University of Michigan to Zephir takes place. Zephir development is in its final phase, and in the coming months Michigan and CDL will move to an integration phase that will involve extensive testing and operation of the two systems in parallel before a final cutover.
The HTRC UnCamp is a little over a month away. The UnCamp is part informational, part community building, part boot-camp, and part unconference, designed to show the research capabilities that the HathiTrust Research Center can offer and garner feedback from a broad range of interested users. An exciting list of speakers includes John Wilkin, Executive Director of HathiTrust, Colin Allen, Professor of History and Director of the Cognitive Science Program at Indiana University, and Ted Underwood, Associate Professor of English at the University of Illinois. Details on the UnCamp can be found at http://d2i.indiana.edu/htrc/uncamp2012/ [171].
The UnCamp marks the 12-month point in the development of the HTRC, and contributes to a milestone set out in the MOU between the HTRC and HathiTrust of a demonstration of HTRC functionality 12 months into development. The HTRC is scheduled to transition into production in Spring 2013, at the 18-month mark from its inception, so the UnCamp is timely for gathering community input.
The HathiTrust Research Center is pleased to have recently received two allocation awards from XSEDE [172] for computational resources: one for exploratory research and the other in support of educational and outreach activities.
Project staff focused work in July on finalizing the quality review datasets assembled over the course of the grant for analysis, and on developing a framework for reviewing and certifying the quality of volumes in HathiTrust. Staff also finished assembling a catalog of frequently-observed errors in illustrative content for in-depth analysis by an expert in digital conversion errors. The second meeting of the grant Advisory Board and project collaborators is scheduled for the end of August. The project team will present findings at the meeting and solicit feedback and input on direction from the attendees.
University of Michigan staff completed a diagram [100] of the mPach system architecture, which has been added to the mPach website [173]. Staff also refined the schema for mapping bibliographic data from JATS XML (the format to be used for encoded text) to MARC records, which are required for ingest into HathiTrust. Work continued on wireframes for the Dashboard module (see a description of all modules [101]), on the profile for the METS file that will accompany digital objects in Submission Information Packages, and on rendering XML articles in the HathiTrust PageTurner.
University of Michigan staff made changes to the Collection Builder Web application to allow longer collection titles and descriptions, and notify users when entries exceed the allowed lengths.
Michigan staff worked to update documentation of the Data API to reflect changes reported in the Update on April Activities [174], as well as other recent changes. The documentation will be made available in August. Documentation of the interactions of existing clients with the Data API is also being developed. Michigan staff have begun planning to extend the Data API to support requests for derivative forms of content, including scaled individual page images and PDFs. These features will initially serve quality review and print-on-demand applications but are expected to have other uses as well. Access will be subject to Data API authorization requirements.
Programmers at the California Digital Library added language-aware relevance ranking to the search spelling suggestion feature under development for full-text search. Staff also built a regression framework for testing algorithm changes and began to make changes using heuristics to improve spelling suggestion quality for a test set of 100 HathiTrust queries. Over the coming weeks, additional changes to the scoring system for suggestions are expected to further improve the quality of suggestion results.
Michigan staff completed the second phase of planned improvements to indexing and searching of Chinese, Japanese, and Korean (CJK) languages in HathiTrust. The second phase involved re-indexing all volumes in the repository with a new schema to provide better searching over bibliographic data for CJK materials. Staff made corresponding changes in HathiTrust Web application code to take advantage of the new schema. Staff observed that the bug discovered in the first phase of work (reported in the Update on June 2012 Activities [139]) caused the level of improvement to be less than expected in the second phase as well. Work will continue to address this issue. In the meantime, the improvements made to full-text indexing in June reduced the time needed to index all 10.4 million volumes by nearly half – to approximately one week – despite the fact that there were interruptions.
University of Michigan staff made preparations to begin the first repository-wide upgrade of metadata for HathiTrust objects. The upgrade applies primarily to PREMIS metadata, though metadata in other areas of the HathiTrust METS file will be affected as well. In conjunction with this upgrade, HathiTrust will begin moving toward a formalized model of publicly communicating planned full-repository changes.
Deletions from the HathiTrust repository are rare, occurring in instances where
Michigan staff are in the final stages of implementing an automated process (though the process must be initiated manually) to remove items from repository storage, as well as catalog and full-text indexes. In cases where volumes are deleted, a “tombstone” is created for provenance purposes and to maintain permanent links for references users may have created.
HathiTrust may have been unavailable for some users on Monday, July 16 from 8:15pm to 8:30pm due to a database locking problem at one repository instance, and from Monday, July 16 at 11:45pm to Tuesday, July 17 at 8:15am due to a problem with a web server at one instance.
HathiTrust sends notice upon discovery and resolution of unscheduled outages and in advance of scheduled outages and maintenance work that may result in an outage. We welcome and encourage additional recipients for these notices. If your institution is not receiving outage notifications and would like to, please contact feedback@issues.hathitrust.org [66].
Jeremy York: HathiTrust And TRAC [175]. Digital Preservation 2012 (annual meeting of the National Digital Information Infrastructure and Preservation Program and the National Digital Stewardship Alliance), Washington, D.C., July 25, 2012.
Jeremy York: HathiTrust: On TRAC [176]. ICPSR course on Applied Data Science, University of Michigan, July 26, 2012.
See http://www.hathitrust.org/papers [110] for all papers, presentations, and reports.
As of July 1:
| | July | Overall |
|---|---|---|
| Columbia University | 0 | 64,184 |
| Cornell University | 3,492 | 403,448 |
| Duke University | 0 | 4,523 |
| Harvard University | 0 | 234,346 |
| Indiana University | 5 | 187,669 |
| Library of Congress | 305 | 89,721 |
| North Carolina State University | 0 | 3,196 |
| University of North Carolina - Chapel Hill | 0 | 8,088 |
| Northwestern University | 1 | 7,208 |
| New York Public Library | 3 | 259,563 |
| Penn State University | 661 | 43,983 |
| Princeton University | 14 | 250,863 |
| Purdue University | 0 | 27,687 |
| University of California | 6,355 | 3,346,583 |
| The University of Chicago | 408 | 22,439 |
| University of Illinois | 4,027 | 100,178 |
| Universidad Complutense | 0 | 111,828 |
| University of Michigan | 5,302 | 4,551,770 |
| University of Minnesota | 96 | 100,396 |
| University of Wisconsin | 25 | 539,236 |
| University of Virginia | 0 | 48,922 |
| Utah State | 0 | 90 |
| Yale University | 0 | 23,678 |
| Total | 20,694 | 10,429,599 |
Public Domain (~30%)
| Total* | 22,027 | 3,127,644 |
* Includes volumes opened through copyright review and rights holder permissions
| Issue Type | July | June |
|---|---|---|
| Content | 326 | 237 |
| Quality | 318 | 228 |
| Non-partner Digital Deposit | 0 | 0 |
| Collections | 4 | 6 |
| Cataloging | 113 | 31 |
| Access and Use | 112 | 123 |
| Copyright | 66 | 69 |
| Permissions | 16 | 20 |
| Takedown | 1 | 2 |
| Print on Demand | 4 | 0 |
| Inter-library loan | 6 | 0 |
| Full-PDF or e-copy requests | 16 | 20 |
| Datasets | 4 | 0 |
| Data Availability and APIs | 0 | 0 |
| Reuse of content | 3 | 2 |
| Web applications | 27 | 19 |
| Functionality problems | 3 | 2 |
| Problems with login specifically | 0 | 3 |
| General Questions about Login | 1 | 0 |
| Partners setting up login | 0 | 1 |
| Usability issues | 12 | 0 |
| Feature requests | 2 | 6 |
| Partner Ingest | 2 | 3 |
| General | 108 | 72 |
| Partnership | 7 | 14 |
| Infrastructure | 0 | 0 |
| Miscellaneous | 101 | 59 |
| Total | 688 | 485 |
[Download PDF [177]]
HathiTrust has updated its bibliographic metadata specifications and minimum bibliographic metadata requirements in preparation for moving to Zephir (under development by California Digital Library) as the bibliographic metadata management system for HathiTrust. The requirements are in effect immediately for institutions that have not previously deposited content in HathiTrust. Institutions that have already deposited content are requested to meet the minimum requirements, but it is not required (we will continue to accept bibliographic metadata as it has been submitted). The primary difference from previous requirements is that records from new depositors must at a minimum include a MARC Leader, 008 field, title field, and OCLC number in order to be loaded. Further details are provided at http://www.hathitrust.org/bib_specifications [92]. The HathiTrust Ingest Checklist [178] page has been revised in conjunction with these changes.
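A depositor could pre-check records against the minimum elements named above with something like the following sketch. It is illustrative only and is not the validation HathiTrust or Zephir performs. The dict-based record layout and the function name are hypothetical, and two mappings are assumptions of the example rather than statements from the specification: the title is taken to be MARC field 245, and the OCLC number is taken to appear in field 035 with an `(OCoLC)` prefix.

```python
def missing_minimum_fields(record):
    """Return the names of the minimum elements that a record lacks.

    `record` is a hypothetical in-memory form of a MARC record:
    {"leader": str, "fields": {tag: [value, ...]}}.
    """
    fields = record.get("fields", {})
    missing = []
    if not record.get("leader"):
        missing.append("leader")
    if not fields.get("008"):
        missing.append("008")
    # Assumption: the required "title field" is MARC 245.
    if not fields.get("245"):
        missing.append("title")
    # Assumption: the OCLC number is carried in 035 with an "(OCoLC)" prefix.
    if not any(value.startswith("(OCoLC)") for value in fields.get("035", [])):
        missing.append("oclc_number")
    return missing
```

An empty list means the record carries all four elements under these assumptions; the authoritative rules are those on the bibliographic specifications page.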
The University of Michigan made the first iteration of tools available to aid institutions in transforming, validating, and packaging digital content for deposit in HathiTrust. The tools can be downloaded at http://www.hathitrust.org/ingest_tools [54]. Notifications of updated versions of the tools will be sent to a Google Groups email list [67], and we recommend anyone who will be using the tools to subscribe.
by Jeremy York
I am pleased to announce the appointment of Angelina Zaytsev to the position of HathiTrust Project Librarian. Angelina has worked for HathiTrust part-time for the last year, assisting in the coordination of numerous HathiTrust activities including ingest of content from partner institutions, processing permissions to open access to materials, general user support, and multiple duties “as assigned”. Angelina will continue these duties while taking on a greater role in managing and coordinating projects for HathiTrust.
The HathiTrust Research Center opened registration [179] for the HTRC UnCamp, to be held in Bloomington, Indiana on September 10-11, 2012. More information can be found at http://www.hathitrust.org/htrc_uncamp2012 [180].
Michigan staff completed the majority of development necessary to support a new rights status in HathiTrust Web applications. The status will apply to works that were restored to being in copyright in the United States by the General Agreement on Tariffs and Trade (GATT), but are now in the public domain in the rest of the world. An increasing number of these volumes are being identified as part of CRMS-World [14], the IMLS-funded continuation of the CRMS project [181].
HathiTrust continued working with Boston College and began working with Penn State and the University of Illinois on ingest of volumes digitized by the Internet Archive.
Working groups and committees in HathiTrust may have an operational or strategic focus. See http://www.hathitrust.org/working_groups [36] for more information.
The User Experience Advisory Group continued discussions about a new home page design and provided feedback on mockups created by the University of Michigan.
The table below shows a summary of the issues received by the User Support Working Group in June.
| Issue Type | June | May |
|---|---|---|
| Content | 237 | 168 |
| Quality | 228 | 159 |
| Non-partner Digital Deposit | 0 | 0 |
| Collections | 6 | 3 |
| Cataloging | 31 | 51 |
| Access and Use | 123 | 129 |
| Copyright | 69 | 64 |
| Permissions | 20 | 12 |
| Takedown | 2 | 2 |
| Print on Demand | 0 | 1 |
| Inter-library loan | 0 | 0 |
| Full-PDF or e-copy requests | 20 | 22 |
| Datasets | 2 | 4 |
| Data Availability and APIs | 0 | 4 |
| Reuse of content | 2 | 2 |
| Web applications | 19 | 12 |
| Functionality problems | 2 | 3 |
| Problems with login specifically | 3 | 0 |
| General Questions about Login | 0 | 0 |
| Partners setting up login | 1 | 3 |
| Usability issues | 0 | 0 |
| Feature requests | 6 | 1 |
| Partner Ingest | 3 | 2 |
| General | 72 | 81 |
| Partnership | 14 | 6 |
| Infrastructure | 0 | 1 |
| Miscellaneous | 59 | 74 |
| Total | 485 | 443 |
California Digital Library (CDL) staff completed a portion of development needed to sync Zephir with rights information in the HathiTrust rights database. Staff also reloaded records for HathiTrust items into Zephir as part of an iterative process to be sure rights and other necessary administrative metadata are being properly loaded. CDL staff completed documentation of Zephir metadata ingest and workflow guidelines and will be working with HathiTrust project staff to add the information to the HathiTrust website as the launch of Zephir gets nearer.
University of Michigan staff rewrote the list of modules [101] to be included in mPach, a package of tools Michigan is developing for publishing open access journal content in HathiTrust. Staff divided modules into three categories: Editorial Workflow and Peer Review, Content Preparation, and HathiTrust. Work continues to adapt PageTurner to handle full-text XML content and to develop wireframes for the Dashboard module. Staff began developing code to validate the Submission Information Package.
Project staff completed review of the 4th and final 1,000-volume sample, consisting of non-Roman language materials. This concluded the data collection phase of the project. The project team finalized plans for a specialized study of errors in digitized illustrations and began to assemble and review select illustrations from each of the Library of Congress classifications. The study is designed to be an in-depth investigation into errors in digitized illustrations; it is not meant to describe or characterize the extent of errors in HathiTrust as a whole.
Jackie Bronicki presented current findings of the project at the ALA Annual Meeting in June. The new interface for the project website, which has been undergoing a redesign in the last couple of months, was released at the same time.
The project team also made progress on a framework for certifying the quality of volumes in HathiTrust.
Staff at Michigan completed the first phase of work to improve indexing and searching of CJK (Chinese, Japanese, and Korean) languages. The first phase involved re-indexing all 10.4 million volumes in the repository using the new CJKBigramFilter available in Solr 3.6, and a custom unigram filter. The new index was put into production in mid-June. The improvements in search precision for CJK queries turned out to be smaller than anticipated. Investigation revealed the cause to be a bug in Solr’s edismax query processor, and a bug report [140] was filed in the Solr JIRA bug-tracking system. Michigan staff are investigating both temporary workarounds and a long-term fix to the bug.
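For readers unfamiliar with bigram indexing, the general idea can be sketched as below. This is a simplified illustration of overlapping-bigram and unigram tokenization, not Solr's CJKBigramFilter or Michigan's custom filter. The function names and the character ranges are assumptions made for the example, and the update does not describe how the two filters are combined in the production schema, so the sketch simply emits both kinds of token.

```python
def is_cjk(ch):
    """Rough check for CJK ideographs, kana, and Hangul syllables (not exhaustive)."""
    code = ord(ch)
    return (
        0x4E00 <= code <= 0x9FFF      # CJK Unified Ideographs
        or 0x3040 <= code <= 0x30FF   # Hiragana and Katakana
        or 0xAC00 <= code <= 0xD7AF   # Hangul Syllables block
    )


def cjk_tokens(text):
    """Return unigrams followed by overlapping bigrams for each run of CJK characters."""
    runs, current = [], ""
    for ch in text:
        if is_cjk(ch):
            current += ch
        elif current:
            runs.append(current)
            current = ""
    if current:
        runs.append(current)

    tokens = []
    for run in runs:
        tokens.extend(run)  # one token per character
        tokens.extend(run[i:i + 2] for i in range(len(run) - 1))  # overlapping pairs
    return tokens
```

For the three-character string 東京都, this yields the unigrams 東, 京, 都 and the bigrams 東京 and 京都. Because CJK text is written without spaces between words, indexing overlapping pairs lets a query match multi-character words without a dictionary-based word segmenter.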
Michigan staff indexed the INEX Book Track corpus [182] and conducted a series of relevance ranking experiments. From the experiments, 6 “runs” were chosen and submitted to the INEX Book Track “Prove It” task. Three of the runs were designed to simulate users searching the HathiTrust full-text index and three were baseline runs to measure the impact of the default HathiTrust search configuration on queries of different types and lengths. The results from the INEX Book Track will be used to tune relevance ranking in HathiTrust’s repository-wide and single-volume, full-text search.
California Digital Library refined the algorithm used to score spelling suggestions based on queries extracted from HathiTrust log files and improved the way suggestions are made when stop words and words that are inappropriately combined are present in the query. The next step will be to experiment with making suggestions in different languages.
Michigan removed a long-standing bottleneck in the full-text indexing process, effectively doubling throughput. Under ideal conditions, staff believe it should be possible now to index approximately 100,000 documents per hour.
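As a rough check on what that throughput implies, the arithmetic can be worked through as follows. The sketch assumes one indexed document per volume and uninterrupted running at the ideal rate, neither of which is stated in the update; the 10.4 million figure is the repository size reported elsewhere in these updates, and the function name is illustrative.

```python
def hours_to_index(total_documents, documents_per_hour):
    """Hours needed to index a corpus at a constant throughput."""
    return total_documents / documents_per_hour


# Assumption: one document per volume, running continuously at the ideal rate.
hours = hours_to_index(10_400_000, 100_000)
days = hours / 24
print(f"{hours:.0f} hours, about {days:.1f} days")
```

Under these assumptions a full re-index would take 104 hours, a little over four days. That is in the same range as the approximately one week, interruptions included, reported for the full re-index in the following month's update.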
No outages were reported in June. HathiTrust sends notice upon discovery and resolution of unscheduled outages and in advance of scheduled outages and maintenance work that may result in an outage. We welcome and encourage additional recipients for these notices. If your institution is not receiving outage notifications and would like to, please contact feedback@issues.hathitrust.org [66].
As of July 1:
| | June | Overall |
|---|---|---|
| Columbia University | 0 | 64,184 |
| Cornell University | 85 | 399,956 |
| Duke University | 0 | 4,523 |
| Harvard University | 34,607 | 234,346 |
| Indiana University | 0 | 187,664 |
| Library of Congress | 0 | 89,416 |
| North Carolina State University | 0 | 3,196 |
| University of North Carolina - Chapel Hill | 0 | 8,088 |
| Northwestern University | 4 | 7,207 |
| New York Public Library | 1 | 259,560 |
| Penn State University | 0 | 43,322 |
| Princeton University | 0 | 250,849 |
| Purdue University | 2,906 | 27,687 |
| University of California | 3,446 | 3,340,228 |
| The University of Chicago | 1,210 | 22,031 |
| University of Illinois | 0 | 96,151 |
| Universidad Complutense | 1 | 111,828 |
| University of Michigan | 7,100 | 4,546,468 |
| University of Minnesota | 830 | 100,300 |
| University of Wisconsin | 3 | 539,211 |
| University of Virginia | 0 | 48,922 |
| Utah State | 0 | 90 |
| Yale University | 0 | 23,678 |
| Total | 50,193 | 10,408,905 |
Public Domain (~29%)
| Total* | 72,332 | 3,105,587 |
* Includes volumes opened through copyright review and rights holder permissions
See http://www.hathitrust.org/papers [110] for all papers, presentations, and reports.
[Download PDF [184]]
HathiTrust is an international partnership of academic and research institutions dedicated to ensuring the preservation and accessibility of the vast record of human knowledge. The partnership owns and operates a digital repository containing millions of public domain and in-copyright volumes, digitized from partnering institution libraries and other sources. The preserved volumes are made available in accordance with copyright law as a shared scholarly resource for students, faculty, and researchers at the partnering institutions and as a public good to the world community. For more information, visit HathiTrust.org [185].
In the first half of 2012, HathiTrust continued to expand our partnership, to further develop and refine our services, and to benefit from grant funded evaluations and explorations. In this period we’ve also seen momentum building for our HathiTrust Research Center, and the important milestone of the establishment of a new Board of Governors. The following provides detail on the richness of our many activities and accomplishments.
Details on each item can be found in the monthly updates from 2012, available at http://www.hathitrust.org/updates [80].
Two new partners joined HathiTrust in the first half of 2012:
HathiTrust partners contributed nearly 400,000 volumes to HathiTrust from January – June 2012, raising the total number of volumes to 10.4 million (view our Ten Million and Counting [84] blog post and timeline). More than 3 million of this total (about 30%) are in the public domain.
HathiTrust began or continued conversations with several institutions regarding direct ingest of locally-digitized content:
The University of Michigan created a first iteration of tools that partners can use to package their content to HathiTrust specifications prior to submission.
HathiTrust began conversations with the Getty Research Center, Penn State University, and the University of Florida regarding ingest of volumes from the Internet Archive.
HathiTrust ingested large numbers of volumes from the University of Illinois (~80,000 volumes) and Harvard University Library (~150,000 volumes).
Deposits from all institutions are shown in the table below.
| | Volumes Added Since Jan 2012 | Total Volumes |
|---|---|---|
| Columbia University | 8 | 64,184 |
| Cornell University | 16,181 | 399,871 |
| Duke University | 1 | 4,523 |
| Harvard University | 146,299 | 199,739 |
| Indiana University | 752 | 187,664 |
| Library of Congress | 5 | 89,416 |
| North Carolina State University | 0 | 3,196 |
| University of North Carolina - Chapel Hill | 1 | 8,088 |
| Northwestern University | 1,554 | 7,203 |
| New York Public Library | 106 | 259,559 |
| Penn State University | 405 | 43,322 |
| Princeton University | 1,170 | 250,849 |
| Purdue University | 23,894 | 24,781 |
| University of California | 49,128 | 3,336,782 |
| The University of Chicago | 10,213 | 20,821 |
| University of Illinois | 81,468 | 96,151 |
| Universidad Complutense | 3,159 | 111,827 |
| University of Michigan | 34,767 | 4,539,368 |
| University of Minnesota | 9,231 | 99,470 |
| University of Wisconsin | 11,874 | 539,208 |
| University of Virginia | 1,526 | 48,922 |
| Utah State | 44 | 90 |
| Yale University | 4 | 23,678 |
| Total | 392,140 | 10,358,712 |
Public Domain (~30%)
| Total* | 320,629 | 3,033,255 |
* Includes volumes opened through copyright review and rights holder permissions
HathiTrust conducted elections for a new Board of Governors [82] in March and established the Board, composed of both elected and appointed members [186], in April. The proposal to create a Board of Governors was one of the proposals accepted by partners at the HathiTrust Constitutional Convention [187] in October 2011 (view all proposals [10]). The Board took the reins from an Executive Committee, which was established by the founding HathiTrust partners. A report on the Board’s first meeting [188] is posted in the Update on May 2012 Activities.
The Collections Committee released its report on duplicate volumes [83] in HathiTrust, recommending that HathiTrust retain all duplicate copies ingested into the repository for the time being, with periodic reassessment. The Committee also made progress on a process for responding to requests and offers to include additional materials in HathiTrust.
The Communications Working Group produced a Resources [65] page for HathiTrust, containing overview documents, handouts, and guides created by HathiTrust partner libraries, the Communications Working Group, and non-partner sources. The working group released blog posts on HathiTrust's achievement of ten million volumes [84], full-text search enhancements [189], and, in collaboration with University of Michigan staff and the UX Advisory Group, creating collections in HathiTrust [88]. The Communications group launched a Pinterest [87] account for HathiTrust, and submitted a briefing for the new Board of Governors.
The UX Advisory Group made recommendations on improvements to the PageTurner application, including the addition of the version date and new labeling to clarify when full-PDF download is available. The group collaborated with staff at Michigan and the Communications Working Group on a blog post about creating collections in HathiTrust [88], and began to focus attention on a project to redesign the HathiTrust home page.
A summary of the issues received by the User Support Working Group is shown in the table below. The working group made several improvements to its workflow for handling inquiries, especially those related to content quality, but in other areas as well. The group worked on recommendations for a future structure and process for responding to user inquiries, which is one of the responsibilities specified in its charge [190].
| Issue Type | Total |
|---|---|
| Content | 831 |
| Quality | 778 |
| Non-partner Digital Deposit | 5 |
| Collections | 28 |
| Cataloging | 194 |
| Access and Use | 639 |
| Copyright | 377 |
| Permissions | 75 |
| Takedown | 6 |
| Print on Demand | 2 |
| Inter-library loan | 2 |
| Full-PDF or e-copy requests | 83 |
| Datasets | 11 |
| Data Availability and APIs | 7 |
| Reuse of content | 10 |
| Web applications | 89 |
| Functionality problems | 27 |
| Problems with login specifically | 3 |
| General Questions about Login | 15 |
| Partners setting up login | 13 |
| Usability issues | 6 |
| Feature requests | 6 |
| Partner Ingest | 15 |
| General | 578 |
| Partnership | 40 |
| Infrastructure | 4 |
| Miscellaneous | 534 |
| Total | 2,346 |
California Digital Library (CDL) staff loaded all records that are present in the current bibliographic system at the University of Michigan into Zephir, the new HathiTrust bibliographic management system, which is now in final stages of development. The CDL team performed load testing during the ingest of records, and worked to address discrepancies between records in the two systems. Staff created prototype exports of data that will be used to support the HathiTrust bibliographic catalog and "Hathifiles" inventory files. CDL worked with Michigan to finalize a record submission standard, and began to develop documentation and guidelines for submitting bibliographic records to Zephir, and documentation of the reports to be provided to institutions when records are loaded. Details about the submission standard, and additional information to be requested when records are submitted to HathiTrust, will be forthcoming.
The HTRC completed all the agreements necessary to receive Google-digitized materials from the HathiTrust repository. Staff from Indiana University worked with staff at the University of Michigan to begin transferring OCR text files for the more than 3 million public domain volumes in HathiTrust to the HTRC.
The HTRC released a report on its activities [191] from October 2011 to March 2012, detailing a variety of significant technical accomplishments, outreach activities, and strategic initiatives. The HTRC will be holding an “Uncamp [180]” at Indiana University this September. Please visit the HTRC webpage [53] and view the report above for further information about HTRC activities.
The IMLS Quality grant team completed page-level review (sampling within each volume) of three 1,000-volume samples from HathiTrust and reported initial findings (see the links under Quality Review on the results page [138] of the project website). The team developed a new whole-volume review interface to facilitate detection of errors that affect the entire volume (such as missing, duplicate, and out-of-order pages) as well as the severity of page-level errors. Project staff reviewed the first two 1,000-volume samples in this new interface in order to be able to compare results with page-level review.
Project staff completed physical review of ~90% of volumes in the first 1,000-volume sample and 60% of the second 1,000-volume sample to investigate correlation of physical book characteristics with errors in digitized volumes.
The grant team is beginning a sub-study to better describe errors in illustrative content in digitized volumes, and has begun to shift focus to the final, user research portion of the grant.
Staff at the University of Michigan worked on modifications to the HathiTrust PageTurner to display JATS XML and developed the first iteration of a tool that creates valid JATS XML from simple DOCX files. Staff also worked on specifications for a Submission Information Package for mPach content, began development of wireframes for the mPach Dashboard module (see a description of all mPach modules [101]), and composed design principles and requirements [192], as well as a project timeline [37].
Staff at the University of Michigan released several new advanced search features, including operations to search bibliographic metadata in combination with full-text search, limit results to specific publication years, languages, and original formats, revise advanced searches, and search with greater Boolean complexity. These features are described in the Update on April 2012 Activities [193] and a Perspectives from HathiTrust blog post [189].
Michigan staff undertook work to improve indexing of volumes in Chinese, Japanese and Korean, and improve relevance-ranking of results.
Staff at California Digital Library made significant progress on the development of a spelling-suggester feature for full-text search.
Staff at the University of Michigan developed functionality to allow users from partner institutions to be “automatically” logged in [102] to HathiTrust when following links from local institutional catalogs or other resources.
Michigan staff added 5 new fields to HathiTrust’s tab-delimited inventory files (view the files [4] or a description [194]). The new fields include publication date, publication location, language, bibliographic format, and an indication of whether or not a volume has been identified as a U.S. federal government document.
Staff at Michigan developed security enhancements that, beginning October 1, will require developers to use OAuth 1.0 access keys to access the Data API and sign URLs passed to the API with a secret key. Staff also developed a Web client that employs a user’s login credentials as proxy for the keys (users can sign up for a University of Michigan “Friend Account” [195] to log in). Users can register for keys or use the Web client by visiting http://babel.hathitrust.org/cgi/htdc [164]. It is currently possible to use the keys and Web client; use will be required beginning October 1, 2012.
Also beginning October 1, 2012, the host “services.hathitrust.org” will be taken out of service. Calls to the Data API will need to use URLs such as the following (note the additional “cgi” in the path):
http://babel.hathitrust.org/cgi/htd/meta/mdp.39015019203879
rather than
http://services.hathitrust.org/htd/meta/mdp.39015019203879
On May 1, support for legacy Data API URLs in the following form was removed:
http://services.hathitrust.org/api/htd/pathinfo-arguments
URLs should be submitted to the API according to the current Data API schema [103] without the “api” path element:
http://services.hathitrust.org/htd/pathinfo-arguments
Michigan staff deployed a Data API security monitoring and reporting script that runs on a daily basis.
University of Michigan staff implemented processes to track accesses to in-copyright works in cases where access is permitted. The new processes provide a means for HathiTrust to detect problematic activity such as bulk downloading operations, which may, for example, indicate a compromised user account.
Michigan staff made a number of adjustments and improvements to the PageTurner application and interface. These included:
Michigan staff replaced two Web servers in the Michigan repository instance and moved to a new system of load balancing between the Indiana and Michigan repository instances. Load balancing is used routinely to mask maintenance or upgrade processes that require individual servers or an entire site to be taken offline.
Michigan staff installed new storage at the Indiana and Michigan sites. The storage was purchased to accommodate partner projections for content in 2012 and replace storage scheduled for retirement.
Reports of volumes in HathiTrust that are available for print on demand are available at http://www.hathitrust.org/pod_reports [196]. A new report will be posted on the first of each month.
Michigan staff moved HathiTrust’s Drupal-based informational website and VuFind-based catalog from their initial hosting environments on Michigan library infrastructure to dedicated HathiTrust infrastructure. This move consolidates, and will greatly simplify, HathiTrust Web development.
All papers and presentations are listed at http://www.hathitrust.org/papers [197].
[Download PDF [199]]
Read our new blog post [88] about building HathiTrust collections.
The new Board of Governors met in Chicago in conjunction with the ARL membership meeting in May. The group spent some time before the meeting identifying priorities, focusing primarily on the organizational work of the Board. The Board quickly formed an Executive Committee, as stipulated in the Constitutional Convention ballot proposal [200]. The new Executive Committee members include Paul Courant, Carol Diedrichs, Laine Farley, Sarah Michalak and Bob Wolven. Another group chaired by Pat Steele was charged with initiating the process to assemble by-laws. This group will also attend to issues such as the duration of the appointment of the Executive Committee, and expects to conclude its work by the end of November. A third group will be formed to focus on the development of a Charter.
The Board of Governors will meet by teleconference for the next several months, targeting one meeting per month, as the process of developing by-laws moves forward. In these meetings the Board plans to review HathiTrust’s past work, which will include a review of the HathiTrust budget as well as HathiTrust’s committees and working groups. Although it was not able to discuss HathiTrust’s existing committees and working groups in detail in Chicago, the Board expressed a deep appreciation for the work the Strategic Advisory Board, the Collections Committee, and the current operational working groups and committees have done. The Board asked that the existing groups continue their work (with the Board’s enthusiastic support) until and while a review of committees can take place.
We are pleased to announce the new HathiTrust Resources and Guides [65] page, where we bring together overviews, instructional materials, and guides created by HathiTrust partner libraries, the Communications Working Group, and beyond. Materials posted on the page include reusable handouts, a detailed guide to using HathiTrust, lively blogs and dynamic videos. Please use, repurpose, and enjoy!
Have you created HathiTrust user guides or instructional materials? We encourage you to submit them to feedback@issues.hathitrust.org [66].
It is now possible to embed HathiTrust volumes in web pages. Code snippets to do this can be found at http://www.hathitrust.org/embed [106].
Staff at the University of Michigan completed development of the first iteration of tools to help depositors create and validate content packages prior to submission to HathiTrust. The tools will be made available in early June to several partner institutions that are working on ingest of locally-digitized materials.
HathiTrust ingested approximately 150,000 additional public domain volumes from Harvard University Library.
As noted in Top News, the Communications Working Group released a new Web page [65] featuring HathiTrust instructional materials from across the partnership, including guides developed for public services use. In addition, the group submitted a briefing to the new Board of Governors with recommendations for carrying out communications activities in the future. The working group also launched a Pinterest account [87] for HathiTrust.
The User Experience Advisory Group focused its attention on a project being undertaken by University of Michigan staff to redesign the HathiTrust home page (www.hathitrust.org). The group will begin consulting regularly on this project in June.
The table below contains a summary of the issues received by the User Support Working Group in May, with April figures for comparison.
| Issue Type | May | April |
|---|---|---|
| Content | 168 | 231 |
Quality | 159 | 222 |
Non-partner Digital Deposit | 0 | 1 |
Collections | 3 | 4 |
| Cataloging | 51 | 33 |
| Access and Use | 129 | 112 |
Copyright | 64 | 76 |
Permissions | 12 | 7 |
Takedown | 2 | 2 |
Print on Demand | 1 | 0 |
Inter-library loan | 0 | 2 |
Full-PDF or e-copy requests | 22 | 10 |
Datasets | 4 | 4 |
Data Availability and APIs | 4 | 1 |
Reuse of content | 2 | 1 |
| Web applications | 12 | 14 |
Functionality problems | 3 | 4 |
Problems with login specifically | 0 | 1 |
General Questions about Login | 0 | 4 |
Partners setting up login | 3 | 3 |
Usability issues | 0 | 0 |
Feature requests | 1 | 0 |
| Partner Ingest | 2 | 5 |
| General | 81 | 129 |
Partnership | 6 | 5 |
Infrastructure | 1 | 0 |
Miscellaneous | 74 | 124 |
| Total | 443 | 519 |
*See User Support Working Group Issue Types [201] for a description of the types of issues included in each category.
Staff at California Digital Library (CDL) refined the code for loading bibliographic records into Zephir (the new bibliographic management system) and reloaded all HathiTrust records in the test environment. Work continued to code a process to sync rights information in Zephir with the HathiTrust rights database. The CDL team is developing documentation and guidelines for submitting bibliographic records to Zephir, and documentation of reports to be provided to institutions when records are loaded.
University of Michigan staff continued work on modifications to the HathiTrust PageTurner to display JATS XML. Staff began development of wireframes for the Dashboard module and are close to the completion of a specification for mapping JATS metadata elements to MARC fields to create analytic records for journal articles.
Who should attend? The HTRC UnCamp is targeted to the digital humanities tool developers, researchers and librarians of HathiTrust member institutions, and graduate students. Attendance will be capped at 60 participants, so plan to register early!
Travel funds and Registration. HTRC anticipates funding a small number of travel grants that can be used by an attendee to bring along a graduate student or for a HathiTrust member librarian/technologist to bring along a researcher from their organization who is interested in engaging with our research center. The Uncamp will have a minimal registration fee so as to make the Uncamp as affordable as possible for you to attend.
All of the data collection for English language volumes was completed in May, including double-review of subsets of volumes for quality assurance. Review of volumes in the grant’s final 1,000-volume sample, which includes volumes from 6 major non-Roman languages (Chinese, Japanese, Korean, Arabic, Cyrillic and Hebrew), is still in progress. At the end of May, staff had reviewed 77,115 of the total 95,086 pages sampled for review. Data collection is expected to be complete in mid-June.
Work in June will focus on analysis of the collected data, as well as research, development, and data collection for use case studies, which will comprise the final portion of the grant. Staff will also undertake a specialized study of errors in digitized illustrations to try to more accurately describe the types of errors that are observed and their impact on use.
Current findings of the project will be presented at the ALA Annual Meeting in June 2012. The project website [120] is being updated with a new graphic design; further initial findings [138] will be forthcoming. Please see the website for details on the volume samples, error models, and other grant activities.
Staff at the University of Michigan made minor changes to the Data API and the Data API’s key service and Web client to better manage user privileges. A Data API security monitoring and reporting script was also deployed that runs on a daily basis.
Michigan staff undertook work to improve indexing and searching of CJK languages (a discussion of the issues is available on the large-scale search blog [202]). All 10+ million volumes are being re-indexed using the new CJKBigramFilter [203] available in Solr 3.6, and a custom filter that will create a separate unigram index of Han characters (to support queries consisting of a single Han character). Staff revised the Solr indexing schema to eliminate unused fields and filters and to take advantage of upgraded Solr 3.6 filters. Staff also made changes in development to the full-text search and “search within a book” Web applications in preparation for the improved CJK indexing. Testing and production release of the application enhancements and newly-created index are anticipated in early June.
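To illustrate the general idea of bigram and unigram indexing for CJK text, the following is a minimal Python sketch. It is not the Solr CJKBigramFilter or the custom Michigan filter; the function names and the simplified character ranges are assumptions made for illustration only.

```python
def _is_han(ch):
    # Assumption: only the basic CJK Unified Ideographs block counts as Han here.
    return "\u4e00" <= ch <= "\u9fff"


def _is_cjk(ch):
    # Han, plus Hiragana/Katakana and Hangul syllables (simplified ranges).
    return _is_han(ch) or "\u3040" <= ch <= "\u30ff" or "\uac00" <= ch <= "\ud7a3"


def cjk_runs(text):
    """Yield maximal runs of consecutive CJK characters in text."""
    run = []
    for ch in text:
        if _is_cjk(ch):
            run.append(ch)
        elif run:
            yield "".join(run)
            run = []
    if run:
        yield "".join(run)


def bigram_tokens(text):
    """Overlapping two-character tokens per CJK run; a lone character stays a unigram."""
    tokens = []
    for run in cjk_runs(text):
        if len(run) == 1:
            tokens.append(run)
        else:
            tokens.extend(run[i:i + 2] for i in range(len(run) - 1))
    return tokens


def han_unigram_tokens(text):
    """One token per Han character, as a separate single-character index would hold."""
    return [ch for ch in text if _is_han(ch)]


print(bigram_tokens("中华人民"))
print(han_unigram_tokens("中华 abc"))
```

A query consisting of a single Han character cannot match an index that holds only two-character tokens, which is why a separate unigram index is useful for that case.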
Staff at Michigan downloaded and began to index the INEX Book Track [182] “Prove It” task corpus to use as a testbed to investigate various relevance ranking issues in HathiTrust full-text search.
Staff at California Digital Library (CDL) completed development of fast lookup data structures in the language-sensitive dictionary that will support a spelling-suggestion feature in full-text search (last reported on in the Update on February Activities [204]). Staff used probabilistic techniques to fit the massive dictionary into RAM, allowing very fast lookup of bigram and unigram data. Staff also ported code from the CDL-developed XTF system that ranks spelling suggestions to the new structure, though the code is not yet fully functional. Next steps include modifying the ranking algorithm to take advantage of data from the language-sensitive dictionary, and evaluating and revising the algorithm to produce quality suggestions.
Staff at Michigan deployed fixes to the code that allows users to embed PageTurner views in Web pages using an iframe. Staff also added improved wording and an explanatory link to the PageTurner interface, recommended by the UX Advisory group in April, to clarify when full-PDF download of volumes in HathiTrust is or is not available.
Michigan staff completed the steps necessary to retire all storage that was scheduled for replacement in 2012. Staff had completed the installation of replacement and additional storage at the Michigan and Indiana sites in March.
HathiTrust’s VuFind-based bibliographic catalog was successfully moved from University of Michigan Library Web hosting infrastructure to HathiTrust’s Web hosting infrastructure. This completes a migration project that also involved HathiTrust’s Drupal-based informational website and will greatly simplify future Web development.
Full-text search in HathiTrust was unavailable on Wednesday, May 9 from 6:00-8:30am EDT due to a problem with an index server. Shibboleth authentication to HathiTrust was unavailable on Monday, May 21 from 9:23-9:28am EDT due to a problem with a helper service required by Shibboleth.
HathiTrust sends notice upon discovery and resolution of unscheduled outages and in advance of scheduled outages and maintenance work that may result in an outage. We welcome and encourage additional recipients for these notices. If your institution is not receiving outage notifications and would like to, please contact feedback@issues.hathitrust.org [66].
As of June 1:
| Institution | May | Total |
|---|---|---|
| Columbia University | 0 | 64,184 |
| Cornell University | 3,334 | 399,871 |
| Duke University | 0 | 4,523 |
| Harvard University | 146,064 | 199,739 |
| Indiana University | 26 | 187,664 |
| Library of Congress | 0 | 89,416 |
| North Carolina State University | 0 | 3,196 |
| University of North Carolina - Chapel Hill | 0 | 8,088 |
| Northwestern University | 0 | 7,203 |
| New York Public Library | 2 | 259,559 |
| Penn State University | 14 | 43,322 |
| Princeton University | 10 | 250,849 |
| Purdue University | 1 | 24,781 |
| University of California | 6,811 | 3,336,782 |
| The University of Chicago | 364 | 20,821 |
| University of Illinois | 5 | 96,151 |
| Universidad Complutense | 0 | 111,827 |
| University of Michigan | 4,379 | 4,539,368 |
| University of Minnesota | 4,320 | 99,470 |
| University of Wisconsin | 4,337 | 539,208 |
| University of Virginia | 0 | 48,922 |
| Utah State | 0 | 90 |
| Yale University | 0 | 23,678 |
| Total | 169,667 | 10,358,712 |
Public Domain (~28%)
| Total* | 153,218 | 3,033,255 |
* Includes volumes opened through copyright review and rights holder permissions
See http://www.hathitrust.org/papers for all papers, presentations, and reports.
[Download PDF [209]]
Staff at the University of Michigan have developed functionality that allows users from partner institutions to be “automatically” logged into HathiTrust when following links from local institutional catalogs or other resources. Permanent links to HathiTrust volumes can now be wrapped with a single sign-on URL that automatically passes users through their own institution’s authentication service. Users who are not already authenticated are prompted to do so. Documentation of the new functionality is available at http://www.hathitrust.org/automatic_login [102]. Thanks to Johns Hopkins University for suggesting this enhancement.
Staff at Michigan continued to work on tools that content depositors can use to create and validate locally-created content packages prior to submission to HathiTrust. The tools will be available to partner institutions in May.
HathiTrust began ingest of Google-digitized content from the University of Illinois in April, bringing in more than 80,000 volumes.
The Communications Working Group continued regular activities and development of a briefing for the new Board of Governors. New communication initiatives are awaiting the transition to the new Board.
The UX Advisory Group revisited issues related to the labeling of PDF download options in PageTurner. The group’s recommended changes aim to clarify when full PDF downloads are or are not available. The changes are under development and will be implemented in May.
The table below contains a summary of the issues received by the User Support Working Group in April.
| Issue Type | April | March |
|---|---|---|
| Content | 231 | 203 |
Quality | 222 | 193 |
Non-partner Digital Deposit | 1 | 0 |
Collections | 4 | 9 |
| Cataloging | 33 | 49 |
| Access and Use | 112 | 195 |
Copyright | 76 | 137 |
Permissions | 7 | 17 |
Takedown | 2 | 1 |
Print on Demand | 0 | 0 |
Inter-library loan | 0 | 2 |
Full-PDF or e-copy requests | 10 | 19 |
Datasets | 4 | 2 |
Data Availability and APIs | 1 | 2 |
Reuse of content | 1 | 6 |
| Web applications | 14 | 11 |
Functionality problems | 4 | 4 |
Problems with login specifically | 1 | 1 |
General Questions about login | 4 | 3 |
Partners setting up login | 3 | 3 |
Usability issues | 0 | 0 |
Feature requests | 0 | 0 |
| Partner Ingest | 0 | 5 |
| General | 129 | 101 |
Partnership | 5 | 7 |
Infrastructure | 0 | 0 |
Miscellaneous | 124 | 94 |
| Total | 519 | 559 |
*See User Support Working Group Issue Types [201] for a description of the types of issues included in each category.
California Digital Library created prototype exports of the metadata that will be used to populate HathiTrust’s tab-delimited inventory files (“hathifiles [4]”) and bibliographic catalog. Timing tests for these exports were also conducted. The CDL team continued to reconcile bibliographic records in Zephir with records in the current system at the University of Michigan to ensure all the data is accounted for, addressing record discrepancies and ingest errors as encountered. The team has also begun development of a process to sync rights information in Zephir (the new management system) with the HathiTrust rights database.
University of Michigan staff continued work on modifications to the HathiTrust PageTurner to display JATS XML. jPach’s Norm module (see descriptions of all jPach modules [210]) can now extract 15 common components of a journal article, plus embedded media, from a DOCX file and create valid JATS with references to associated media files. A specification for a Submission Information Package for jPach content is nearly complete and will be posted to the jPach website [192] soon. Work has begun on developing wireframes for the Dashboard module. A timeline for the project is available on the HathiTrust jPach project page [211].
The HTRC completed the agreements necessary with Google to receive Google-digitized public domain volumes from the HathiTrust repository and make them available for computational purposes. With the Google agreements and a Memo of Understanding with HathiTrust in place, the HTRC is actively working with staff at Michigan to bring in the complete set of more than 2.9 million public domain volumes in HathiTrust. Preparation for the transfer includes setup of disk storage and compute nodes at Indiana University (IU), which is being done in collaboration with IU Research Technologies. All computation on HathiTrust volumes will be carried out on HTRC machines; the HTRC itself will not make content available for download. Users interested in receiving texts should follow the directions at http://www.hathitrust.org/datasets [212].
HTRC was represented at the recent Committee on Institutional Cooperation Digital Humanities Summit in Nebraska. Many attendees were already aware that the HTRC was a digital scholarship initiative of HathiTrust; brochures were on hand to provide a deeper level of detail.
The HTRC has created Meandre workflow components (Meandre is part of the SEASR [213] infrastructure) that retrieve texts from the HTRC using the HTRC data API, spell-check the texts, correct OCR errors, and then perform topic modeling on the texts. The HTRC has demonstrated this functionality, creating topic models of all pages returned from the data API from single-word queries on a full-text index of volumes. For example, a search for “dickens” in the non-Google digitized public domain corpus returns more than 100 topics with associated keywords. The diagrams below show tag clouds of keywords for the topics “lady” and “men”.

Project staff completed whole-volume review of the first 1,000-volume sample (1,000 English language, pre-1923 volumes digitized by Google), and over 70% of the second 1,000-volume sample (1,000 English language, post-1923 volumes digitized by Google). Approximately 150 volumes (15%) from each of the two samples were coded by two reviewers for quality assurance. The project team decided to perform whole-volume review on the same volumes sampled earlier in the project for page-level review in order to allow for comparison and more in-depth analysis of the data, and yield a better understanding of error within the volumes.
As of the end of April, staff had completed page-level review of approximately 50% of the fourth 1,000-volume sample (1,000 non-Roman language volumes including Korean, Chinese, Japanese, Arabic, Hebrew and Cyrillic).
In the months to come, the focus of the project team will shift away from data collection to data analysis and reporting, and use-case studies research. More information about this research is forthcoming. In May, the team will focus on developing a sub-study to better identify and describe errors in illustrative content. The project website has been updated to report initial findings. See the links under “Quality Review” at http://hathitrust-quality.projects.si.umich.edu/results.htm [138].
University of Michigan staff deployed the security enhancements described in the Update on March 2012 Activities [214], and the Data API now supports the use of OAuth [215] 1.0-signed requests. As outlined in the March update, there will be a transition period, ending October 1, 2012, during which signed access to the Data API will be possible but not required. After October 1, all requests to the Data API will need to be properly signed with an access key provided by HathiTrust. HathiTrust has created a Web client [164] that employs a user’s login credentials as a proxy for these keys to facilitate non-programmatic uses. Complete documentation of the security enhancements, methods of obtaining keys, signing requests, and accessing the Web client is forthcoming.
Also effective October 1, the host “services.hathitrust.org” will no longer exist for the Data API. The new host will be “babel.hathitrust.org”, the same host as the PageTurner and other HathiTrust services. Calls to the Data API will therefore need to use URLs such as the following (note the additional “cgi” in the path):
http://babel.hathitrust.org/cgi/htd/meta/mdp.39015019203879
rather than
http://services.hathitrust.org/htd/meta/mdp.39015019203879
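Client code that stores Data API URLs in the older forms could be updated with a small helper along these lines. This is a sketch only: the function name is hypothetical, and it assumes that the host change, the added “cgi” path element, and the removal of the legacy “api” path element announced in these updates are the only differences between old and new URLs.

```python
from urllib.parse import urlsplit, urlunsplit

OLD_HOST = "services.hathitrust.org"
NEW_HOST = "babel.hathitrust.org"


def migrate_data_api_url(url):
    """Rewrite an old-style Data API URL to the new host and path layout.

    URLs on any other host are returned unchanged.
    """
    parts = urlsplit(url)
    if parts.netloc.lower() != OLD_HOST:
        return url
    path = parts.path
    # Legacy form: /api/htd/... (drop the "api" path element first).
    if path.startswith("/api/htd/"):
        path = path[len("/api"):]
    # Current form: /htd/... becomes /cgi/htd/... on the new host.
    if path.startswith("/htd/"):
        path = "/cgi" + path
    return urlunsplit((parts.scheme, NEW_HOST, path, parts.query, parts.fragment))


print(migrate_data_api_url(
    "http://services.hathitrust.org/htd/meta/mdp.39015019203879"
))
```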
HathiTrust released the second phase of advanced full-text search functionality in April. Users can now combine up to four different fields connected by the “AND” or “OR” operators. Search parameters are retained when users click on the “Revise this advanced search” link on the search results page. The advanced search interface also allows complex Boolean expressions in the query box, for example:
(dog OR cat) AND (food OR drink) [216]
If a user enters unbalanced parentheses, quotes or operators, for example
dog OR OR cat [217]
the application strips out the operators and does a default Boolean AND search and provides a message informing the user.
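The fallback behavior described above can be sketched in Python as follows. This is an illustration of the idea, not the application's actual code: the function names, the message text, and the assumption that operators are recognized only in upper case are all hypothetical.

```python
import re

OPERATORS = {"AND", "OR"}  # assumption: operators are upper-case only
TOKEN_RE = re.compile(r'"[^"]*"|[()]|[^\s()"]+')
FALLBACK_MESSAGE = "The query was malformed; searching for all of the terms instead."


def is_well_formed(query):
    """Check for balanced quotes and parentheses and sensibly placed operators."""
    if query.count('"') % 2:
        return False
    depth = 0
    expect_term = True  # True at the start, after an operator, or after "("
    for tok in TOKEN_RE.findall(query):
        if tok == "(":
            depth += 1
            expect_term = True
        elif tok == ")":
            if expect_term or depth == 0:
                return False
            depth -= 1
        elif tok in OPERATORS:
            if expect_term:
                return False  # leading or doubled operator
            expect_term = True
        else:
            expect_term = False  # a word or a quoted phrase
    return depth == 0 and not expect_term


def normalize_query(query):
    """Return (query_to_run, message); message is None when the query is kept as is."""
    if is_well_formed(query):
        return query, None
    # Strip operators, parentheses and quotes, then AND the remaining words.
    words = [t for t in re.findall(r'[^\s()"]+', query) if t not in OPERATORS]
    return " AND ".join(words), FALLBACK_MESSAGE


print(normalize_query("(dog OR cat) AND (food OR drink)"))
print(normalize_query("dog OR OR cat"))
```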
Several bugs in advanced search were also fixed:
The HathiTrust PageTurner now displays a version for items in the repository (at the bottom of the left column when viewing an item). The version is the date the item was last updated. Items are updated when improvements such as higher quality or more complete scans have been made.
HathiTrust’s Drupal-based informational website was successfully moved from Michigan library web hosting infrastructure to the existing dedicated HathiTrust web hosting infrastructure. Work continued on the move of HathiTrust’s VuFind-based bibliographic catalog, which is expected to be completed in early May.
No outages were reported in April 2012.
HathiTrust sends notice upon discovery and resolution of unscheduled outages and in advance of scheduled outages and maintenance work that may result in an outage. We welcome and encourage additional recipients for these notices. If your institution is not receiving outage notifications and would like to, please contact feedback@issues.hathitrust.org [66].
As of May 1:
| Institution | April | Total |
|---|---|---|
| Columbia University | 1 | 64,184 |
| Cornell University | 4,181 | 396,537 |
| Duke University | 0 | 4,523 |
| Harvard University | 0 | 53,675 |
| Indiana University | 3 | 187,638 |
| Library of Congress | 0 | 89,416 |
| North Carolina State University | 0 | 3,196 |
| University of North Carolina - Chapel Hill | 0 | 8,088 |
| Northwestern University | 383 | 7,203 |
| New York Public Library | 20 | 259,557 |
| Penn State University | 28 | 43,308 |
| Princeton University | 50 | 250,839 |
| Purdue University | 799 | 24,780 |
| University of California | 202 | 3,329,971 |
| The University of Chicago | 7,251 | 20,457 |
| University of Illinois | 80,642 | 96,146 |
| Universidad Complutense | 4 | 111,827 |
| University of Michigan | 5,011 | 4,534,989 |
| University of Minnesota | 86 | 95,150 |
| University of Wisconsin | 1 | 534,871 |
| University of Virginia | 1 | 48,922 |
| Utah State | 0 | 90 |
| Yale University | 0 | 23,678 |
| Total | 98,663 | 10,189,045 |
Public Domain (~28%)
| Total* | 96,091 | 2,880,037 |
* Includes volumes opened through copyright review and rights holder permissions
Stacy Kowalczyk, DLP: The HathiTrust Research Center: An Overview [218], April 4, 2012.
Jeremy York, Access Services in the Age of Mass Digitization [219]. IviesPlus Conference, University of Chicago, April 20, 2012.
Rebuild the Large Scale Search Solr/Lucene index with CJK (Chinese, Japanese, Korean) indexing improvements; to be completed in May or June
[Download PDF [220]]
HathiTrust has announced the members of its new Board of Governors. The full announcement [85], as well as information about the elections process [186], are available on the HathiTrust website. The composition of the Board, which officially begins work April 16, is as follows:
Representatives appointed from the founding partner institutions:
Representatives elected at-large:
Serving 5-year terms from 2012-2016
Serving 4-year terms from 2012-2015
Serving 3-year terms from 2012-2014
The User Support Working Group is seeking nominations from partner institutions for up to 4 new members. Nominations should be sent to Jeremy York (jjyork@umich.edu [221]) and include the name, title, and a short description of current job duties. Additional information that might be relevant to participation in the group may be included as well. User Support members are on call at least one day per week and follow up on inquiries throughout the week, requiring between 2-4 hours of work. Staff that participate on the group will
The charge for the working group is available at http://www.hathitrust.org/wg_user-support_charge [190].
Effective May 1, support for legacy Data API URLs in the following form will be removed:
http://services.hathitrust.org/api/htd/pathinfo-arguments
After May 1, URLs should be submitted according to the current Data API schema [103] without the “api” path element:
http://services.hathitrust.org/htd/pathinfo-arguments
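For developers with stored legacy URLs, the change amounts to dropping the “api” path element. The following is a minimal sketch of that rewrite, assuming Python and the standard library only; the function name is illustrative and is not part of any HathiTrust tooling.

```python
from urllib.parse import urlsplit, urlunsplit

LEGACY_PREFIX = "/api/htd/"   # legacy form, removed after May 1
CURRENT_PREFIX = "/htd/"      # current Data API form


def to_current_data_api_url(url: str) -> str:
    """Rewrite a legacy Data API URL to the current form; leave other URLs unchanged."""
    parts = urlsplit(url)
    is_legacy = (
        parts.netloc == "services.hathitrust.org"
        and parts.path.startswith(LEGACY_PREFIX)
    )
    if not is_legacy:
        return url
    # Keep everything after the legacy prefix (the pathinfo arguments) as-is.
    new_path = CURRENT_PREFIX + parts.path[len(LEGACY_PREFIX):]
    return urlunsplit((parts.scheme, parts.netloc, new_path, parts.query, parts.fragment))


print(to_current_data_api_url("http://services.hathitrust.org/api/htd/pathinfo-arguments"))
```

URLs already in the current form, or pointing at other hosts, pass through untouched, so the function can be applied to a mixed list of links.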
Over the next several months HathiTrust will be implementing security enhancements to the Data API. The enhancements will require developers using the API to acquire an OAuth 1.0 [215] access key that identifies them, and a secret key that must be used to “sign” URLs to retrieve HathiTrust resources via the Data API. HathiTrust will also provide a Web client that employs a user’s login credentials as a proxy for these keys to facilitate non-programmatic uses. In March, staff at the University of Michigan integrated 2-legged [222] OAuth into the Data API and began to develop the Data API client. Once OAuth is released, there will be an approximately 6-month transition period, ending October 1, 2012, during which signed access to the Data API will be possible but not required. After October 1, all requests to the Data API will need to be properly signed with an access key retrieved from HathiTrust. Complete documentation of the security enhancements and methods of obtaining keys and accessing the Web client is forthcoming. OAuth is planned for release in April 2012.
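As a rough illustration of what “signing” a URL involves, the sketch below computes a generic 2-legged OAuth 1.0 HMAC-SHA1 signature using only the Python standard library. It follows the general OAuth 1.0 signing procedure, not HathiTrust's forthcoming documentation: passing the OAuth parameters in the query string, the key values, and the function names are all assumptions made for illustration.

```python
import base64
import hashlib
import hmac
from urllib.parse import quote


def _enc(value) -> str:
    # OAuth 1.0 percent-encoding: only unreserved characters stay literal.
    return quote(str(value), safe="-._~")


def sign_url(base_url, params, access_key, secret_key, nonce, timestamp):
    """Return base_url with OAuth 1.0 parameters and an HMAC-SHA1 signature appended.

    Assumes a GET request and a base_url that is already normalized
    (lowercase scheme and host, no query string).
    """
    all_params = dict(params)
    all_params.update({
        "oauth_consumer_key": access_key,
        "oauth_nonce": nonce,
        "oauth_signature_method": "HMAC-SHA1",
        "oauth_timestamp": str(timestamp),
        "oauth_version": "1.0",
    })
    # Encode names and values, sort them, and join into the normalized parameter string.
    pairs = sorted((_enc(k), _enc(v)) for k, v in all_params.items())
    normalized = "&".join(f"{k}={v}" for k, v in pairs)
    base_string = "&".join(["GET", _enc(base_url), _enc(normalized)])
    # 2-legged OAuth has no token secret, so the signing key ends with a bare "&".
    signing_key = f"{_enc(secret_key)}&"
    digest = hmac.new(signing_key.encode(), base_string.encode(), hashlib.sha1).digest()
    all_params["oauth_signature"] = base64.b64encode(digest).decode()
    query = "&".join(f"{_enc(k)}={_enc(v)}" for k, v in all_params.items())
    return f"{base_url}?{query}"


# Hypothetical keys; real keys would be issued by HathiTrust.
signed = sign_url(
    "http://services.hathitrust.org/htd/pathinfo-arguments",
    {},
    access_key="example-access-key",
    secret_key="example-secret-key",
    nonce="abc123",
    timestamp=1333238400,
)
print(signed)
```

In practice the nonce would be random and the timestamp current for every request; they are fixed here only so the output is reproducible.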
University of Michigan staff are preparing tools that will allow partners to build complete ingest packages for materials they wish to deposit in HathiTrust. The tools will include functionality to remediate images and build METS files to HathiTrust specifications, and validate files prior to submission to HathiTrust. Several institutions have agreed to test the tools in the coming months. It is hoped that over time all partners and other entities that contribute content to HathiTrust will use the tools to create their submission packages, thereby distributing the effort needed to ingest materials produced from different sources.
The Collections Committee’s report on duplicate volumes [83] in HathiTrust is now available. As described in last month’s update, the report recommends that HathiTrust retain all duplicate copies ingested into the repository for the time being, with periodic reassessment. The Strategic Advisory Board has requested that the Committee make further recommendations about the criteria that should be applied in future assessments and identify the future costs and risks of retaining duplicates in the corpus. The Committee also hopes to finalize its recommendations concerning a process for responding to requests and offers within the next several months.
The UX Advisory Group conducted informal usability testing to evaluate the impact of changes proposed to the PageTurner interface to incorporate a volume version (date of last ingest). The group plans to discuss the results and make recommendations on the changes in April, with implementation to follow shortly thereafter.
The table below contains a summary of the issues received by the User Support Working Group in March.
| Issue Type | March | February |
| Content | 203 | 106 |
Quality | 193 | 97 |
Non-partner Digital Deposit | 0 | 3 |
Collections | 9 | 2 |
| Cataloging | 49 | 24 |
| Access and Use | 195 | 131 |
Copyright | 137 | 73 |
Permissions | 17 | 20 |
Takedown | 1 | 1 |
Print on Demand | 0 | 1 |
Inter-library loan | 2 | 0 |
Full-PDF or e-copy requests | 19 | 17 |
Datasets | 2 | 1 |
Data Availability and APIs | 2 | 0 |
Reuse of content | 6 | 0 |
| Web applications | 11 | 22 |
Functionality problems | 4 | 7 |
Problems with login specifically | 1 | 0 |
General Questions about login | 3 | 5 |
Partners setting up login | 3 | 3 |
Usability issues | 0 | 1 |
Feature requests | 0 | 0 |
| Partner Ingest | 5 | 5 |
| General | 101 | 152 |
Partnership | 7 | 11 |
Infrastructure | 0 | 2 |
Miscellaneous | 94 | 139 |
*See User Support Working Group Issue Types [201] for a description of the types of issues included in each category.
California Digital Library achieved a milestone in March, loading all bibliographic records submitted by HathiTrust contributing institutions into the Zephir production environment. The goals of this dry run load were to test the functionality of the new metadata management system (Zephir), to test the production infrastructure, and to compare the production loading time with a previous load on a development server. The metadata management team continued to reconcile bibliographic records in Zephir with those in the current system at the University of Michigan to ensure that all data was accounted for, addressing record discrepancies and ingest errors as they were encountered. The team also began to verify that bibliographic record collation processes in Zephir produced the same record clusters as collation processes at Michigan.
Staff of the University of Michigan formally named the journal publishing platform Michigan will use in conjunction with HathiTrust: jPach. Design principles and requirements for jPach, plus a description of the platform’s modules, are posted on the University of Michigan Library website [192]. The project page [223] on the HathiTrust website now includes a full project timeline.
Michigan staff continued work to generate valid JATS XML from DOCX files, render JATS XML files in PageTurner, and create a METS profile for the jPach Submission Information Package.
The HathiTrust Research Center released a report of its activities [191] over the last 6 months. More information about the Research Center can be found on the HTRC web page [53].
Project staff continued whole-volume review of digital volumes in the first production sample (pre-1923 English-language Google-digitized volumes), looking for errors such as missing, duplicate, and out-of-order pages, as well as generally “bad” pages, defined in relation to the severity scale established for page-level review. Staff also continued page-level review of the project’s 4th 1,000-volume sample, consisting of non-Roman language volumes. Physical review of Michigan volumes sampled in the second production run (post-1923 Google-digitized English-language volumes) continued in March. Students have completed review of 543 of the 600 Michigan volumes present in the 1,000-volume sample. Further information about the grant project is available from the project website [120].
Staff at the University of Michigan completed work on the next iteration of advanced full-text search, which will allow users to build queries with greater Boolean complexity and enhance the ability to revise advanced searches. The new features will be released in early April. Staff made significant progress on plans to improve search results relevance ranking.
Michigan staff installed new storage at the Indiana and Michigan sites that will both accommodate 2012 volume projections and replace storage scheduled for retirement. Storage due for retirement will be taken offline starting in April.
Developers and system administrators at Michigan began preparations to move HathiTrust’s Drupal-based informational website and VuFind-based catalog from their initial hosting environments, currently on Michigan library infrastructure, to dedicated HathiTrust hardware, where they will run alongside other HathiTrust applications. This move will simplify application integration.
No outages were reported in March 2012.
HathiTrust sends notice upon discovery and resolution of unscheduled outages and in advance of scheduled outages and maintenance work that may result in an outage. We welcome and encourage additional recipients for these notices. If your institution is not receiving outage notifications and would like to, please contact feedback@issues.hathitrust.org [66].
As of April 1:
| Institution | March | Total |
| Columbia University | 6 | 64,183 |
| Cornell University | 896 | 392,356 |
| Duke University | 1 | 4,523 |
| Harvard University | 1 | 53,675 |
| Indiana University | 480 | 187,635 |
| Library of Congress | 5 | 89,416 |
| North Carolina State University | 0 | 3,196 |
| University of North Carolina - Chapel Hill | 1 | 8,088 |
| Northwestern University | 554 | 6,820 |
| New York Public Library | 31 | 259,537 |
| Penn State University | 18 | 43,280 |
| Princeton University | 171 | 250,789 |
| Purdue University | 41 | 23,981 |
| University of California | 758 | 3,329,769 |
| The University of Chicago | 309 | 13,206 |
| University of Illinois | 1,001 | 15,504 |
| Universidad Complutense | 3,083 | 111,823 |
| University of Michigan | 4,124 | 4,529,978 |
| University of Minnesota | 2,696 | 95,064 |
| University of Wisconsin | 1,297 | 534,870 |
| University of Virginia | 0 | 48,921 |
| Utah State | 0 | 90 |
| Yale University | 0 | 23,678 |
| Total | 15,473 | 10,090,382 |
Public Domain (~28%)
| Total* | 5,458 | 2,783,946** |
* Includes volumes opened through copyright review and rights holder permissions
** Corrected 5/11/2012. Previous number included 1,389 images from the Minnesota Digital Library
Jeremy York, "HathiTrust: Aspiring to Build the Universal Library [224]". UKSG Annual Conference, March 26, 2012.
Jeremy York, "HathiTrust and the Research Library of the Future [225]". American Antiquarian Society Conference on Needs and Opportunities, March 31, 2012.
Updates are provided in relation to the milestones listed at HathiTrust Research Center Timeline and Deliverables [227]
The HathiTrust Research Center (HTRC) had a productive 6 months as it worked out core issues in Phase I of its development effort. In terms of milestones, we are planning for a public demonstration of functionality, tentatively scheduled for June 2012 in accordance with the MOU between HathiTrust and HTRC. Phase II, in which HTRC is operational, is scheduled to begin January 1, 2013.
HTRC is delighted to report that three legal agreements guiding the Center have been completed at the University level. The MOU between HathiTrust and the HathiTrust Research Center has been signed at IU and UIUC and is with the University of Michigan. The MOU between IU and UIUC has been fully executed. For the Google agreement, UIUC and IU have each entered into a separate agreement with Google on the same terms. The agreements have been signed at the University level and are with Google.
The philosophy behind the technical infrastructure is to use existing services and cyberinfrastructure as much as possible to reduce development and maintenance costs. That philosophy is reflected, for instance, in the recent evaluation of tools such as Blacklight.
The HTRC cyberinfrastructure is up and running on four 4-core virtual machines hosted at IU. We are working out access to more disk space in anticipation of the next steps with execution of the Google agreement. We have a sandbox set up at UIUC to permit broader internal testing. The sandbox consists of a Cassandra NoSQL server v1.0 (for the volume store), a Solr index, and v0.1 of the HTRC Data API. The sandbox holds 68,724 volumes of non-Google scanned content.
Data API: The HathiTrust Research Center released a beta version 0.1 of the HTRC Data API. The API is a RESTful API through which the HTRC Solr index and volume store are accessed. It cannot be used to download volumes, but can be used to move data to a location where computation takes place. It can also be used to search the Solr index for a set of volume IDs and pass the volume IDs to a service for access and computation. Access to the API will require OpenID authentication and appropriate authorization. The Data API is installed on two sandbox machines, one at UIUC and another at IU, for internal testing. Both sandbox installations work against a small subset of non-Google scanned volumes.
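The search-then-pass pattern described above might look roughly like the following sketch. It uses Solr's standard `select` request parameters and JSON response layout; the sandbox host, the `title` query field, and the assumption that volume IDs are stored in a field named `id` are hypothetical, and no actual HTRC Data API calls are shown because the source does not document them.

```python
from typing import List
from urllib.parse import urlencode


def build_solr_query_url(solr_base: str, query: str, rows: int = 100) -> str:
    """Build a Solr select URL that asks only for the (assumed) id field as JSON."""
    params = {"q": query, "fl": "id", "rows": rows, "wt": "json"}
    return f"{solr_base.rstrip('/')}/select?{urlencode(params)}"


def extract_volume_ids(solr_response: dict) -> List[str]:
    """Pull volume IDs out of a parsed Solr JSON response."""
    docs = solr_response.get("response", {}).get("docs", [])
    return [doc["id"] for doc in docs]


# Hypothetical host and query, for illustration only.
url = build_solr_query_url("http://sandbox.example.org/solr", "title:whale")
print(url)

# A hand-written response in Solr's standard JSON shape, standing in for a real reply.
sample_response = {
    "responseHeader": {"status": 0},
    "response": {"numFound": 2, "start": 0, "docs": [{"id": "vol.0001"}, {"id": "vol.0002"}]},
}
print(extract_volume_ids(sample_response))
```

The resulting list of IDs is what would then be handed to a computation service, rather than downloading the volumes themselves.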
Sandbox: HTRC has set up a sandbox at UIUC that consists of a Cassandra volume store, a corresponding Solr index, and a collection of 68,724 volumes of non-Google scanned content. It supports the Data API v0.1, but without security enabled, to encourage testing and exploration.
Data Management components:
Non-Consumptive Research: HTRC received funding from the Alfred P. Sloan Foundation for development of secure infrastructure on which to carry out execution of large-scale parallel tasks on copyrighted data using public compute resources such as FutureGrid or resources at NCSA. The high level design uses a pool of VM images that run in a secure-capsule mode and are deployed onto compute resources. The team is working on a proof of concept deployment process onto an OpenStack platform using Sigiri [229].
Blacklight proof of concept. Searching for data is an important function for digital humanists: finding all of the works with a specific set of concepts, a certain genre, by one or more known authors, and other such criteria is generally the first step in the research process. Rather than developing a new search interface, a time-consuming activity, the HTRC technical team determined that Blacklight [30], an open source library catalog search and retrieval system, would be a reasonable choice for the HTRC. Blacklight is designed to support data that is both full text and bibliographic, the exact type of data that the HTRC has; it is built on Solr, the same technology that we already use to index the HTRC data; and it supports faceted searches, a known need of researchers. We have a test implementation deployed on a shared server at UIUC that we were able to deploy with very few problems. In the next quarter, we expect to use the customization options to configure the look and feel of the interface and perhaps to extend the functionality to show snippets of the text to help researchers refine their results. Any new functionality that we develop will be shared with the larger Blacklight community. We expect Blacklight to be a significant component of the public face of the HTRC.
OCR error detection study: Members of the HTRC undertook a study recently on quantifying OCR errors in the HathiTrust corpus. Scholars are interested in doing quality text analysis, but results can be confounded by OCR errors. Information on which books (or pages) in the collection have significant rates of OCR errors could help. The HTRC explored a couple of approaches to OCR error detection and have results for one approach that uses machine-generated and expert-evaluated rules. Starting with a large dictionary of correctly spelled words, HTRC members identified outlier words that were in the HathiTrust corpus but not in the dictionary.
As a check on identified words, the rules by which outliers were detected were verified by a human expert. Using this approach, HTRC formulated 48,308 rules that identified outlier words and provided corrections. HTRC members applied the rules to 256,000 non-Google digitized volumes from HathiTrust, which took 4 hours using the National Center for Supercomputing Applications (NCSA) Ember supercomputer. The results showed that the probability of a word having an OCR error (detected by the rule set) was 0.20%. The average number of errors per page was 0.57. The average number of errors per volume was 156. The probability that a page had one or more errors on it was 11%. The probability that any volume had one or more errors was 84.9%. Overall, 217,754 of the 256,416 volumes had one or more OCR errors and 7,745,034 of the 69,297,000 pages had one or more errors.
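The rule-based approach and the statistics it yields can be sketched as follows. This is a simplified illustration, not the HTRC code: the two rules, the tokenization by runs of letters, and the tiny sample corpus are all assumptions, and a rule here is just a mapping from a known-bad word to its correction.

```python
import re
from typing import Dict, List

# Hypothetical rules: outlier word -> correction.
RULES = {"tbe": "the", "aud": "and"}


def ocr_error_stats(volumes: Dict[str, List[str]], rules: Dict[str, str]) -> Dict[str, float]:
    """Compute error rates over a corpus given as volume ID -> list of page texts."""
    total_words = total_errors = total_pages = 0
    pages_with_errors = volumes_with_errors = 0
    for pages in volumes.values():
        volume_errors = 0
        for page in pages:
            words = re.findall(r"[a-z]+", page.lower())
            errors = sum(1 for word in words if word in rules)
            total_words += len(words)
            total_pages += 1
            volume_errors += errors
            if errors:
                pages_with_errors += 1
        total_errors += volume_errors
        if volume_errors:
            volumes_with_errors += 1
    return {
        "word_error_rate": total_errors / total_words,
        "errors_per_page": total_errors / total_pages,
        "errors_per_volume": total_errors / len(volumes),
        "page_error_fraction": pages_with_errors / total_pages,
        "volume_error_fraction": volumes_with_errors / len(volumes),
    }


sample = {
    "v1": ["tbe cat aud tbe dog", "a clean page"],
    "v2": ["nothing wrong here"],
}
stats = ocr_error_stats(sample, RULES)
print(stats)
```

The reported percentages follow from the reported counts in the same way: 217,754 of 256,416 volumes is about 84.9%, and 7,745,034 of 69,297,000 pages is about 11%.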
Talk: “Digital Humanities At Scale: Hathi Trust Research Center,” by Beth Plale, University of Maryland, February 29, 2012, hosted by the Maryland Institute for Technology in the Humanities and the University of Maryland Libraries.
Google Digital Humanities Awards Recipient Interview Report. John Unsworth commissioned a study of award recipients of the Google Digital Humanities Award over the period 2010 – 2011. The study, Google Digital Humanities Awards Recipient Interview Report, interviewed recipients of the Google awards to determine what difficulties the recipients encountered when working with the Google corpus. A recurring theme was weak metadata and poor OCR.
OCR Summit, Oct 17-18, 2011, Texas A&M University. Loretta Auvil of UIUC attended the OCR Summit, whose purpose was to bring together experts to work on the problem of Optical Character Recognition for early modern texts, where printing techniques make it difficult for machines to type text by “reading” page images. HTRC's proposed involvement would be to provide post-processing capabilities based on work we have done in the SEASR Services project to correct the Google Ngrams data as well as a corpus of 18th and 19th century novels. http://idhmc.tamu.edu/ocr-summit-meeting/
Digging Into Data: The HTRC is working with one of the recently awarded Digging into Data projects. The Principal Investigators of the “Digging by Debating” project have approached the HTRC for help in extending the infrastructure and developing functionality that would benefit humanities researchers in general and the Digging by Debating project in particular. The investigators (Colin Allen and Katy Börner, Indiana University, Bloomington, NEH; Andrew Ravenscroft, University of East London, Chris Reed, University of Dundee, and David Bourget, University of London, AHRC/ESRC/JISC) are interested in creating interfaces and systems that harvest concepts that cross domains; for example, they want to investigate the process by which concepts in one domain, such as philosophy, are used in another domain, such as physics.
HTRC Web Site: A new HTRC web presence was launched in early December 2011 as part of the HathiTrust website and can be seen at http://www.hathitrust.org/htrc. In addition to the direct access provided via the URL, the HTRC webpages are also available from the HathiTrust site navigation; under the About option, the HTRC is referred to as “Our Research Center.” The HTRC website has entirely new content that covers the governance of the HTRC, technical architecture and organization, policies for access and use, project information with timelines and deliverables, information about collaboration opportunities, and demonstrations of future functionality.
Community Engagement through “Challenges”. HTRC is creating a set of ongoing "challenges" as a mechanism by which community interest and engagement in the research opportunities represented by the HTRC can be piqued and cultivated. These challenges are inspired by successful challenges in other domains, such as TREC (Text Retrieval Conference), Netflix, and the Music Information Retrieval Evaluation eXchange (MIREX). Downie is the founder of MIREX (2004), so he has considerable experience in organizing and running evaluation challenges. As it stands now, four challenges are being sketched out as possible candidates:
The first two challenges are inspired by the problems identified in the Google awardee interview report discussed in Section 2. We believe the creation of challenges around these two topics will pay off in two important ways. It will build active community engagement with the HTRC collections and tools, and it can result in the creation of usable OCR and metadata tools that increase the usability of the HTRC collections for future researchers.
Change in leadership. With the departure of John Unsworth to Brandeis University, J. Stephen Downie has stepped into the role of co-director of HTRC representing UIUC. Stephen is Professor and Associate Dean for Research at the University of Illinois Graduate School of Library and Information Science. He joins the other members of the HathiTrust Research Center Executive Management Team: Beth Plale (co-director and chair); Marshall Scott Poole; Robert McDonald; and John Unsworth, now Vice Provost for Library and Technology Services and Chief Information Officer at Brandeis University.
[Download PDF [230]]
The elections for the HathiTrust Board of Governors are taking place during the first two weeks of March. There are twelve candidates in the running, and six will be elected. Each institution or consortium is able to vote for up to six individuals, and each vote will be assigned the full voting weight of the partner. The 12 candidates are as follows:
Staff at Michigan added 5 new fields to HathiTrust’s tab-delimited inventory files [4]: publication date, publication location, language, bibliographic format, and an indication of whether or not a volume has been identified as a U.S. federal government document. A description of the new fields, which are included in the inventory files as of March 1, is available at http://www.hathitrust.org/hathifiles_description [194].
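As a rough sketch of how the new fields could be used, the example below reads tab-delimited rows and selects volumes flagged as U.S. federal government documents. The column names, their order, the sample rows, and the use of "1" as the flag value are assumptions for illustration only; the actual layout is given in the description linked above.

```python
import csv
from typing import Dict, Iterable, Iterator, List

# Hypothetical column names and order, not the published file layout.
COLUMNS = ["volume_id", "pub_date", "pub_place", "language", "bib_format", "us_gov_doc"]


def read_inventory(lines: Iterable[str], columns: List[str] = COLUMNS) -> Iterator[Dict[str, str]]:
    """Yield one dict per tab-delimited inventory row."""
    reader = csv.DictReader(lines, fieldnames=columns, delimiter="\t", quoting=csv.QUOTE_NONE)
    yield from reader


# Made-up sample rows standing in for lines read from an inventory file.
sample_lines = [
    "example.001\t1895\tmiu\teng\tBK\t0\n",
    "example.002\t1952\tdcu\teng\tSE\t1\n",
]

gov_doc_ids = [row["volume_id"] for row in read_inventory(sample_lines) if row["us_gov_doc"] == "1"]
print(gov_doc_ids)
```

Because the reader accepts any iterable of lines, an open file handle can be passed in place of the sample list.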
Effective May 1, support for legacy Data API URLs in the following form will be removed:
http://services.hathitrust.org/api/htd/pathinfo-arguments
After May 1, URLs should be submitted according to the current Data API schema [103] without the “api” path element:
http://services.hathitrust.org/htd/pathinfo-arguments
Michigan staff resumed work on Data API security enhancements, which was postponed in August as staff prepared systems to support access capabilities for users who have print disabilities and access to orphan works. The specification for the new Data API security is available at http://bit.ly/jozHQK.
HathiTrust is now posting reports of public domain and open access volumes in HathiTrust that are available for print on demand. The reports can be found at http://www.hathitrust.org/pod_reports [196] and will be released on the first of every month beginning in April.
HathiTrust began working with the University of Utah and continued conversations with Northwestern University on ingest of locally-digitized volumes. Staff at Michigan completed ingest of a second set of open access volumes from the Utah State University Press.
Michigan staff received bibliographic metadata for approximately 180,000 volumes from the University of Illinois. These volumes were part of the growth projections that partners make yearly and on which annual storage purchases are based. Ingest of the Illinois volumes will begin after the 2012 additional storage is in place, likely in April.
The Executive Committee and SAB have approved the Collections Committee’s recommendations for the treatment of duplicates in HathiTrust; the final report will be posted online shortly. The report discusses various categories of duplicates that exist in the repository and attempts to assess their scope and cost, while also noting some of the difficulties in the precise identification of duplicates. The report recommends that HathiTrust retain all duplicate copies ingested into the repository for the time being, with periodic reassessment. Some categories of duplicates are recommended for permanent retention (e.g. early published books). The SAB has requested that the Committee make further recommendations about the criteria that should be applied in future assessments and identify the future costs and risks of retaining duplicates in the corpus.
The group provided feedback on several interface- or usability-related projects: the addition of a volume version (date of last ingest) in the PageTurner interface, a potential change to the default view in PageTurner, and the best way to encourage the creation of high-quality public collections. Work on all three of these projects will continue in March.
The User Support Working Group decided to postpone submission of its report on recommendations to the Executive Committee until later in the year. This will give more time for changes that have been or might be implemented as a result of the group’s recent evaluation process to be assessed and incorporated into more formal recommendations.
The table below contains a summary of the issues received by the User Support Working Group in February.
| Issue Type | February | January |
| Content | 106 | 144 |
Quality | 97 | 117 |
Non-partner Digital Deposit | 3 | 0 |
Collections | 2 | 10 |
| Cataloging | 24 | 38 |
| Access and Use | 131 | 79 |
Copyright | 73 | 33 |
Permissions | 20 | 20 |
Takedown | 1 | 0 |
Print on Demand | 1 | 0 |
Inter-library loan | 0 | 0 |
Full-PDF or e-copy requests | 17 | 15 |
Datasets | 1 | 0 |
Data Availability and APIs | 0 | 0 |
Reuse of content | 0 | 1 |
| Web applications | 22 | 24 |
Functionality problems | 7 | 7 |
Problems with login specifically | 0 | 1 |
General Questions about login | 5 | 3 |
Partners setting up login | 3 | 1 |
Usability issues | 1 | 5 |
Feature requests | 0 | 4 |
| Partner Ingest | 5 | 4 |
| General | 152 | 127 |
Partnership | 11 | 7 |
Infrastructure | 2 | 1 |
Miscellaneous | 139 | 119 |
*See User Support Working Group Issue Types [201] for a description of the types of issues included in each category.
California Digital Library has nearly completed testing of HathiTrust records received from the University of Michigan in Zephir, the new metadata management system, and is making significant progress on reconciling records ingested in both systems. Staff from Michigan and CDL finalized the minimum record submission standard to be used prospectively for records submitted by partners to HathiTrust. The standard will be integrated into the HathiTrust ingest checklist. CDL and Michigan also worked to address issues related to integration planning and reconciliation of records in Michigan’s system and in Zephir. CDL performed a dry run load test on ingest of records into Zephir.
Staff at Michigan completed the first iteration of a tool that is able to create valid JATS XML from simple DOCX files, and continued development on PageTurner to render JATS XML. Staff clarified the goals of the project to include implementation of a publishing system (allowing management of an editorial workflow) in addition to mechanisms for ingest, display, and discoverability of born-digital journal materials in the HathiTrust repository. More information is available at http://www.hathitrust.org/htpub [223].
The HathiTrust Research Center released a beta version 0.1 of the HTRC Data API. The API is a RESTful API through which the HTRC Solr index and volume store are accessed. It cannot be used to download volumes, but can be used to move data to a location where computation takes place. It can also be used to search the Solr index for a set of volume IDs and pass the volume IDs to a service for access and computation. Access to the API will require OpenID authentication and appropriate authorization. The Data API is installed on two sandbox machines, one at UIUC and another at IU, for internal testing. Both sandbox installations work against a small subset of non-Google scanned volumes.
The HTRC technical team prototyped Blacklight (http://projectblacklight.org [30]), an open source library catalog search and retrieval system, for deployment in the HTRC. Blacklight is designed to support data that is both full text and bibliographic; it is built on Solr, the same technology used to index HTRC data; and it supports faceted searches, a known need of researchers. The test implementation of Blacklight was deployed on a shared server at UIUC. In the next quarter, the HTRC expects to use the customization options to configure the look and feel of the interface and perhaps extend the functionality to show snippets of the text to help researchers refine their results. Any new functionality that is developed will be shared back with the larger Blacklight community. Blacklight is expected to be a significant component of the public face of the HTRC.
Members of the HTRC performed a study recently on quantifying OCR errors in the HathiTrust corpus. Scholars are interested in doing quality text analysis, but results can be confounded by OCR errors. Information on which books (or pages) in the collection have significant rates of OCR errors could help. The HTRC explored a couple of approaches to OCR error detection and have results for one approach that uses machine-generated and expert-evaluated rules. Starting with a large dictionary of correctly spelled words, HTRC members identified outlier words that were in the HathiTrust corpus but not in the dictionary. As a check on identified words, the rules by which outliers were detected were verified by a human expert. Using this approach, HTRC formulated 48,308 rules that identified outlier words and provided corrections. HTRC members applied the rules to 256,000 non-Google digitized volumes from HathiTrust, which took 4 hours using the National Center for Supercomputing Applications’ Ember supercomputer. The results showed that the probability of a word having an OCR error (detected by the rule set) was 0.20%. The average number of errors per page was 0.57. The average number of errors per volume was 156. The probability that a page had one or more errors on it was 11%. The probability that any volume had one or more errors was 84.9%. Overall, 217,754 of the 256,416 volumes had one or more OCR errors and 7,745,034 of the 69,297,000 pages had one or more errors.
Project staff completed page-level review of a third production sample, consisting of 1,000 volumes digitized by the Internet Archive. More than 85,000 pages were reviewed in all. Approximately 9,400 of these (about 10%) were coded by two reviewers for quality assurance purposes. The focus of project work shifted then to finalizing training materials and data collection systems and procedures for whole-volume error review (review for errors that apply to an entire volume, such as missing, out-of-order, or duplicate pages). Project staff reviewed approximately 300 test volumes in a new whole-volume review interface to surface issues in using the interface and applying the new error model, and to develop an initial training manual. Whole-volume review began in mid-February on the same volumes reviewed in the first page-level production sample (1,000 English language, public domain, Google-digitized volumes).
Although the primary focus of work shifted to whole-volume review, physical review of the volumes sampled in the first production run continued in February. 870 of the 1,000 volumes in the sample were obtained and reviewed by volunteer graduate students at the University of Michigan. Students also began physical review of 600 Michigan volumes included in the second page-level sample (1,000 English language, Google-digitized volumes published post-1923). More than 400 of these volumes were reviewed by month’s end.
Michigan staff made a number of adjustments to the PageTurner application. These included fixing a bug in the RDFa emitted in the PageTurner bibliographic metadata that had prevented license information from being included appropriately; enhancing the access control mechanism for items that are public domain in the United States to better detect whether a user is on U.S. soil when access is proxied; updating the back-end process by which user feedback is submitted from HathiTrust applications (including PageTurner, the HathiTrust bibliographic catalog, Full-text Search, Collection Builder) to the central HathiTrust ticketing system; and implementing a process to detect cases where multiple tickets are submitted on identical HathiTrust items or records.
Michigan staff began work on the next iteration of advanced full-text search, which will allow users to build queries with greater Boolean complexity and enhance the ability to revise advanced searches. Staff made progress as well on plans to improve search results relevance ranking. This work is planned to begin after the next release of advanced full-text search.
California Digital Library staff completed dictionary-building work for the spelling suggester feature. The code can now build a language-sensitive dictionary of unigrams and bigrams from any Lucene index, automatically choosing a frequency cut-off to constrain the size of the dictionary. Focus will now shift to implementing fast-lookup and suggestion ranking.
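The dictionary-building step can be illustrated with a small sketch. It is not the CDL code: it reads plain token lists rather than a Lucene index, ignores language sensitivity, and the function name and the rule for choosing the cut-off (the lowest frequency at which the dictionary fits a size limit) are assumptions.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def build_dictionary(docs: Iterable[List[str]], max_entries: int) -> Tuple[Dict[str, int], int]:
    """Count unigrams and bigrams, then raise the frequency cut-off until the
    dictionary holds at most max_entries entries."""
    counts: Counter = Counter()
    for tokens in docs:
        counts.update(tokens)  # unigrams
        counts.update(" ".join(pair) for pair in zip(tokens, tokens[1:]))  # bigrams
    cutoff = 1
    while sum(1 for c in counts.values() if c >= cutoff) > max_entries:
        cutoff += 1
    kept = {gram: c for gram, c in counts.items() if c >= cutoff}
    return kept, cutoff


docs = [["the", "cat", "sat"], ["the", "cat", "ran"], ["the", "dog"]]
dictionary, cutoff = build_dictionary(docs, max_entries=3)
print(cutoff, dictionary)
```

With this sample, nine distinct unigrams and bigrams are counted, and a cut-off of 2 is the first that brings the dictionary down to three entries.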
Page viewing of volumes classified as "Public Domain in the United States" was unavailable on Tuesday, February 7, from approximately 5:30 to 9:45 pm EST due to a software problem.
HathiTrust sends notice upon discovery and resolution of unscheduled outages and in advance of scheduled outages and maintenance work that may result in an outage. We welcome and encourage additional recipients for these notices. If your institution is not receiving outage notifications and would like to, please contact feedback@issues.hathitrust.org [66].
As of March 1:
| Institution | February | Total |
| Columbia University | 1 | 64,177 |
| Cornell University | 6,855 | 391,460 |
| Duke University | 0 | 4,522 |
| Harvard University | 233 | 53,674 |
| Indiana University | 205 | 187,155 |
| Library of Congress | 0 | 89,411 |
| North Carolina State University | 0 | 3,196 |
| University of North Carolina - Chapel Hill | 0 | 8,087 |
| Northwestern University | 210 | 6,266 |
| New York Public Library | 40 | 259,506 |
| Penn State University | 316 | 43,262 |
| Princeton University | 939 | 250,618 |
| Purdue University | 23,053 | 23,940 |
| University of California | 36,848 | 3,329,011 |
| The University of Chicago | 1,198 | 12,897 |
| University of Illinois | 0 | 14,503 |
| Universidad Complutense | 57 | 108,740 |
| University of Michigan | 13,190 | 4,525,854 |
| University of Minnesota | 1,787 | 92,368 |
| University of Wisconsin | 4,995 | 533,573 |
| University of Virginia | 1,525 | 48,921 |
| Utah State | 44 | 90 |
| Yale University | 4 | 23,678 |
| Total | 91,500 | 10,074,909* |
*Volume count does not include archival and image materials in the Minnesota Digital Library project
Public Domain (~28%)
| Total* | 59,574 | 2,791,223 |
*Includes volumes opened through copyright review and rights holder permissions
Tom Burton-West, "HathiTrust Large Scale Search: Scalability meets Usability [231]". Code4Lib, February 7, 2012.
Jeremy York, "HathiTrust: Issues and Challenges in Preserving the Published Record [232]". Amigos Online Conference, February 8, 2012.
[Download PDF [233]]
We are very pleased to welcome Washington University to the partnership. The full press release is available from the Washington University website [234].
The process of electing and appointing members to the new HathiTrust Board of Governors is proceeding on schedule. According to the Governance ballot proposal [235] accepted by partners at the Constitutional Convention, 6 members of the Board will be appointed by the founding partner institutions and 6 will be elected by the partnership. The full process for the elections, [186] including schedule, as well as the Board of Governors charge, [82] are available on the HathiTrust website. As reported in the January Executive Committee meeting minutes, [236] members appointed to the new Board by the founding institutions include:
University of Michigan staff completed and released the first phase of advanced search functionality for full-text search. New features support a variety of operations for searching bibliographic metadata in combination with full-text. Results can be limited to specific publication years, languages, and original formats. The next iteration of work will begin in February and introduce options for building queries with greater Boolean complexity.
California Digital Library staff continued work on the spelling suggester feature, focusing on automatically building a dictionary (including unigrams with language information and frequencies, and bigrams with frequencies) from a test index of public domain materials.
The changes HathiTrust intended to make to the tab-delimited files [4] (“hathifiles”) beginning February 1 resulted in some unexpected problems, which staff at Michigan are in the process of resolving. We currently plan to roll back the changes so that the files are in their pre-February state and pursue a March 1 date to add a total of 5 new fields to the files. Notification of 3 new fields was included in the Update on December Activities [237]. Two additional fields will be added, so that the tab-delimited files will include new fields for publication date, publication location, language, bibliographic format, and whether or not a volume has been identified as a U.S. federal government document. Updates on the status of the files will be sent via HathiTrust’s account [6] on Twitter, and posted on the tab-delimited files download page [4].
HathiTrust released a Year in Review [238] of its 2011 activities, highlighting achievements in its repository services, partnership, and position in the library community.
HathiTrust discussed deposit of an additional set of locally-digitized volumes with Yale University, and worked with Columbia University on packaging locally-digitized materials to HathiTrust specifications. Penn State University began preparations to deposit Internet Archive-digitized content into HathiTrust, and Getty Research Institute continued discussions with HathiTrust regarding bibliographic data for its Internet Archive-digitized materials.
Working groups and committees in HathiTrust may have an operational or strategic focus. See http://www.hathitrust.org/working_groups [36] for more information.
The Collections Committee made good progress on a process for responding to requests and offers to include additional materials in HathiTrust, among other pending items on its work agenda.
The Communications Working group announced HathiTrust’s major milestone of reaching 10 million volumes in January, and continued its work to develop a public services informational package. The group also engaged in looking for opportunities to highlight HathiTrust within the media and conference landscape.
The User Experience Advisory Group discussed user interface issues related to possible changes to the Pageturner default view, and potential interface improvements to the list of user-created collections.
In addition to regular activity responding to user inquiries, the User Support Working Group has spent the last several months evaluating its processes, workflows, and performance since it began in March 2011. This was done to prepare recommendations on a future structure and processes for responding to user feedback, which is part of the group's charge. [190] A number of ideas to improve efficiency in responding to inquiries and communicating within the group surfaced and have been implemented. The group completed a draft report on recommendations that it expects to submit to the Executive Committee in February.
The table below contains a summary of the issues received by the User Support Working Group in January.
| Issue Type | January | December |
| Content | 144 | 81 |
Quality | 117 | 71 |
Non-partner Digital Deposit | 0 | 2 |
Collections | 10 | 6 |
| Cataloging | 38 | 30 |
| Access and Use | 79 | 107 |
Copyright | 33 | 59 |
Permissions | 20 | 4 |
Takedown | 0 | 2 |
Print on Demand | 2 | 1 |
Inter-library loan | 0 | 2 |
Full-PDF or e-copy requests | 15 | 28 |
Datasets | 0 | 2 |
Data Availability and APIs | 0 | 0 |
Reuse of content | 1 | 1 |
| Web applications | 24 | 18 |
Functionality problems | 7 | 9 |
Problems with login specifically | 1 | 1 |
General Questions about login | 3 | 0 |
Partners setting up login | 1 | 1 |
Usability issues | 5 | 2 |
Feature requests | 4 | 1 |
| Partner Ingest | 4 | 5 |
| General | 127 | 50 |
Partnership | 7 | 7 |
Infrastructure | 1 | 0 |
Miscellaneous | 119 | 43 |
*See User Support Working Group Issue Types [201] for a description of the types of issues included in each category.
The California Digital Library team continued to load and test records in Zephir, the new metadata management system. The team finished a proposal for a minimum record submission standard, and completed work on a refined migration timeline -- both to be reviewed by University of Michigan in early February. CDL also performed a successful test to sync data from the HathiTrust rights database with records in Zephir.
MPublishing staff at the University of Michigan Library created a timeline for work through early 2013. Work continued on a process to convert styled Word documents into JATS XML, focusing on extraction of metadata, and on adaptation of the HathiTrust PageTurner application to display JATS XML.
The primary focus of project staff in January was to complete page-level review of volumes in the third production run, performed on a sample of 1,000 Internet Archive-digitized volumes published pre-1923. As of January 31st, review of more than 97% (over 97,000 digital pages) of the volumes was complete. This included double-review of 10% of the volumes as a check on inter-coder reliability.
Physical review of the volumes sampled in the first production run continued in January. By the end of the month, volunteers from the University of Michigan School of Information had reviewed 848 of the 1,000 volumes.
Project staff at the University of Michigan began testing a beta version of the newly developed quality review interface, targeted specifically for review of volume-level errors such as missing, duplicate, and out-of-order pages. A test sample of known problematic volumes was developed to test the strength of the error model and application. Official data coding of whole-volume errors is expected to begin by the end of February. Please visit the project website [120] for updates.
HathiTrust implemented processes to track accesses to in-copyright works, in cases where access is permitted. The new processes will provide a means for HathiTrust to detect problematic activity such as bulk downloading operations, which may, for example, indicate a compromised user account.
Michigan staff transitioned two new web servers at the Michigan repository instance into service, replacing two older ones. During the same cutover, all Web service was moved to new Web load balancers which, as compared to the previous load balancing mechanism, provide a better distribution of traffic across all servers at both sites, as well as a faster response when individual servers or sites fail. Michigan staff routinely use these load-balancing systems to mask maintenance or upgrade processes that require individual servers or an entire site to be taken offline.
University of Michigan staff received final 2012 volume projections from partners and requested a price quote from Isilon for the purchase of new storage capacity and the annual storage hardware replacement cycle, which since last year have been combined into a single large acquisition. The new capacity is expected to be online in the first quarter of 2012.
The HathiTrust web site, including the bibliographic catalog and full-text search (but excluding page viewing and persistent URL resolution), was down on Friday, January 27 from 8:30-9:00pm EST due to a Drupal software upgrade.
Full-text search web pages may have generated incorrectly from Friday, January 27 at 7:30pm to Saturday, January 28 at 3:10pm due to an accidental, premature release of modifications to the full-text search software related to internationalization support.
HathiTrust sends notice upon discovery and resolution of unscheduled outages and in advance of scheduled outages and maintenance work that may result in an outage. We welcome and encourage additional recipients for these notices. If your institution is not receiving outage notifications and would like to, please contact feedback@issues.hathitrust.org [66].
Jeremy York, Panel Presentation [239]. Session 9. Large Digital Libraries: Beyond Google Books. Modern Language Association Annual Meeting.
Jeremy York, Panel Presentation (remarks only [240]). Session 129. What's Still Missing? What Now? What Next? Digital Archives in American Literature. Modern Language Association Annual Meeting.
John Wilkin, Digital Preservation: A Matter of Trust [241]. Session 444. Preservation Is (Not) Just Another Word for Nothing Left to Lose. Modern Language Association Annual Meeting.
Jeremy York, “HathiTrust: The Elephant in the Library [242]”. Library Issues Vol. 32 No. 3, January 2012.
Sarah Pritchard, “HathiTrust Libraries Map a Shared Path: A Turning Point in Information Access [243]”. Libraries and the Academy Vol. 12 No. 1, January 2012.
All HathiTrust papers, presentations, and reports are available at http://www.hathitrust.org/papers [244].
As of February 1:
| Institution | January | Total |
| Columbia University | 0 | 64,176 |
| Cornell University | 645 | 384,605 |
| Duke University | 0 | 4,522 |
| Harvard University | 1 | 53,441 |
| Indiana University | 38 | 186,950 |
| Library of Congress | 0 | 89,411 |
| North Carolina State University | 0 | 3,196 |
| University of North Carolina - Chapel Hill | 0 | 8,087 |
| Northwestern University | 407 | 6,056 |
| New York Public Library | 13 | 259,466 |
| Penn State University | 29 | 42,946 |
| Princeton University | 0 | 249,679 |
| Purdue University | 0 | 887 |
| University of California | 4,509 | 3,292,163 |
| The University of Chicago | 1,091 | 11,699 |
| University of Illinois | 0 | 14,503 |
| Universidad Complutense | 15 | 108,683 |
| University of Michigan | 8,503 | 4,512,664 |
| University of Minnesota | 342 | 90,581 |
| University of Wisconsin | 1,244 | 528,578 |
| University of Virginia | 0 | 47,396 |
| Utah State | 0 | 46 |
| Yale University | 0 | 23,674 |
| Total | 16,837 | 9,983,409* |
*Volume count does not include archival and image materials in the Minnesota Digital Library project
Public Domain (~27%)
| Total* | 18,803 | 2,731,429 |
*Includes volumes opened through copyright review and rights holder permissions
Continue to work with partners on ingest of locally-digitized materials
Continue working on improvements to advanced full-text search
Resume work on Data API security
[Download PDF [245]]
The close of 2011 marked 4 years since the first formal commitments were made to building HathiTrust, a broad collaborative of academic and research institutions that are working together to ensure the long-term preservation and accessibility of the cultural record.
2011 saw the solidification of the HathiTrust repository's position in the library community, as it received Trustworthy certification from the Center for Research Libraries. It also saw the solidification of HathiTrust services, as a new mobile interface was released, significant enhancements were made to the Full-text search, PageTurner, and Collection Builder applications, and a database of print holdings was incorporated into access systems, providing a mechanism to provide lawful access to in-copyright materials that are held by member institutions.
The HathiTrust partnership achieved a new level of cohesion and stability in 2011 as well, as the member institutions came together in a Constitutional Convention to make collective decisions about the structure and priorities of the initiative going forward. Agreements with a variety of entities (organizations, academic presses and vendors) to expand access to materials in HathiTrust and enhance their discovery further magnified the impact of the partnership’s work in the broader library community.
2011 offered partners the opportunity to reflect on the accomplishments of HathiTrust in its first years, and make collective plans to address the challenges libraries face in stewarding and provisioning the cultural record in years to come. We move into 2012 with optimism, based on what we have been able to achieve, in our ability to collaborate deeply and effectively to address these challenges, and maintain and even enhance the role that libraries play in the new, shared, digital future.
A summary of HathiTrust activities in 2011 is given below:
HathiTrust grew from 52 to 66 partners in 2011. The new institutions that formally announced partnership include:
HathiTrust partners contributed 2,129,874 volumes to the repository in 2011, for a total of 9,966,572. 753,403 of these (2,712,626, or 27% overall) are either in the public domain or volumes that rights holders have given HathiTrust permission to make publicly available. HathiTrust exceeded 10 million volumes in early January 2012 (see the blog post and timeline [84] of repository development).
Having completed a framework for ingesting volumes from varied sources at the end of 2010, in 2011 HathiTrust began to scale up ingest of locally-digitized content from partner institutions. Large-scale deposits continued as well. New institutions contributing content in 2011 included:
Large-scale digitization
Local or in-house digitization
Conversations regarding ingest of locally-digitized materials were initiated with
In October 2011, HathiTrust partners convened a Constitutional Convention to determine directions for the partnership following its first 5-year period, which will conclude at the end of 2012. HathiTrust’s Strategic Advisory Board released a review of the partnership’s activities and progress over its first 3 years prior to the Convention to set the stage for ballot initiatives and partner discussion. 7 ballot initiatives were considered by partners at the Convention. 5 of these were accepted:
Information about the Constitutional Convention, including notes from the convention, ballot initiatives, attendees, and the 3-year review, is available on the Constitutional Convention information page [187]. John Wilkin’s opening remarks and the presentation given by representatives of the Strategic Advisory Board are available on the Papers and Presentations [110] page. John Wilkin’s remarks were also posted on the HathiTrust blog [247].
With the ingest of image content from Minnesota, the establishment of a HathiTrust Research Center, progress to enable HathiTrust as a platform for digital publishing, certification by the Center for Research Libraries for compliance with TRAC, and the establishment of infrastructure to offer access to in-copyright works for users who have print disabilities (see further information on these below), HathiTrust has provided a meaningful deliverable for each of the initial objectives set by the founding partners (see HathiTrust Functional Objectives [2]).
HathiTrust signed agreements with ProQuest, OCLC and EBSCO to make the HathiTrust full-text index searchable through their discovery services.
HathiTrust began to make datasets of public domain materials available on a large scale. See HathiTrust Datasets [212] for more information.
[Download PDF [273]]
View statistics and a timeline on the HathiTrust blog [84].
On February 1, HathiTrust will be adding three additional columns to the tab-delimited inventory files (“hathifiles”) available at http://www.hathitrust.org/hathifiles [4]. The files are frequently used by partners and non-partners as a means to obtain full bibliographic records for HathiTrust items to load into local catalogs (see HathiTrust Data Availability and APIs [274]). The additional columns will identify the publication date and publication location of volumes in HathiTrust, as well as volumes that have been identified as U.S. federal government documents.
Staff at Michigan continued conversations with staff at the University of Florida regarding ingest of locally-digitized materials, and staff at several other institutions regarding ingest of Internet Archive-digitized materials.
Working groups and committees in HathiTrust may have an operational or strategic focus. See http://www.hathitrust.org/working_groups [36] for more information.
The Communications Working Group continued to work on a public services-oriented communications package, as well as announcements for new partners and the major milestone of 10 million volumes.
The User Experience Advisory Group began reviewing the current home page and discussed additions and issues that will need to be addressed in a forthcoming redesign. Group member Jenny Emanuel contributed a "Perspectives from HathiTrust" blog post [275] about the group's persona work that was completed in November.
The User Support Working Group is still seeking nominations for new members. See the Update on November Activities [276] for details.
The table below contains a summary of the issues received by the User Support Working Group in December.
| Issue Type | November | December |
| Content | 107 | 81 |
Quality | 102 | 71 |
Non-partner Digital Deposit | 0 | 2 |
Collections | 5 | 6 |
| Cataloging | 43 | 30 |
| Access and Use | 103 | 107 |
Copyright | 55 | 59 |
Permissions | 10 | 4 |
Takedown | 1 | 2 |
Print on Demand | 2 | 1 |
Inter-library loan | 0 | 2 |
Full-PDF or e-copy requests | 15 | 28 |
Datasets | 1 | 2 |
Data Availability and APIs | 2 | 0 |
Reuse of content | 1 | 1 |
| Web applications | 24 | 18 |
Functionality problems | 5 | 9 |
Problems with login specifically | 1 | 1 |
General Questions about login | 2 | 0 |
Partners setting up login | 3 | 1 |
Usability issues | 2 | 2 |
Feature requests | 5 | 1 |
| Partner Ingest | 3 | 5 |
| General | 47 | 50 |
Partnership | 6 | 7 |
Infrastructure | 0 | 0 |
Miscellaneous | 41 | 43 |
*See User Support Working Group Issue Types [201] for a description of the types of issues included in each category.
Team members from California Digital Library continued work on processes to compare bibliographic records in Zephir, the new metadata management system under development, with records in HathiTrust’s existing system. Zephir team members continued to load and test new records as well, and refine the timeline for migration of bibliographic metadata management services to Zephir in coordination with staff at the University of Michigan.
Staff at the University of Michigan revised the goal statement for HTPub (see the project web page [223]) and plans for system architecture. Staff also began work on establishing a project timeline.
Several changes were made to the HTRC leadership in December. John Unsworth, a key member of the Team at Illinois, accepted a position as vice provost for Library and Technology Services and chief information officer at Brandeis University. He will be leaving the University of Illinois but remain on the Executive Management Team. The Team will keep its base composition of 2 members from the University of Illinois and 2 from Indiana University, so this change will add one new member. Stephen Downie, Associate Dean for Research at the University of Illinois Graduate School of Library and Information Science, will fill the position left by John. Stephen’s research has focused on music information retrieval and data mining. This work has involved building significant infrastructure for research, including grappling with issues of allowing computational access to in-copyright material. Finally, Marshall Scott Poole is stepping aside as co-director of the HTRC for personal reasons, though he will remain on the Executive Management Team. Stephen Downie will take his place as co-director of the HTRC with Beth Plale, who is co-director on the Indiana University side. Beth also chairs the Executive Management Team. The changes are in effect as of January 1, 2012.
In December, project staff completed physical review of more than 90% of the volumes in the first 1,000 volume sample drawn from HathiTrust. Staff are working to arrange on-site review with cooperation from HathiTrust member libraries for the approximately 70 volumes that are not available via inter-library loan due to poor condition, non-circulating collection, or other reason.
Project staff concluded page-level data collection for the second production sample in December (see the Update on September 2011 Activities [277] for details on the composition of the sample). The full dataset will be sent to the project statistician in early January for analysis. Data collection for the third production run began in late December. The third production run focuses on Internet Archive-digitized volumes published pre-1923.
Project staff continue to define requirements for a new quality review interface, targeted specifically for review of volume-level errors such as missing, duplicate, and out-of-order pages. Please visit the project website [120] for updates.
Michigan staff released a new version of the full-text search index in December. The new release corrected an error in the “Original Location” metadata facet and provided additional metadata for advanced search and relevance ranking. It also made it possible for full-text search results and facets to reflect whether or not users from partner institutions are able to view in-copyright items when lawful access is permitted (HathiTrust is currently pursuing providing access to in-copyright works to users who have print disabilities, for preservation uses, and in circumstances where works are copyright-orphaned). Access in these circumstances, which are still pending deployment to partners, is dependent on partner institutions owning or previously owning print copies of works in question and users’ location inside or outside the United States.
Michigan staff continued development on an advanced search feature for full-text search, including preliminary testing of the first working prototype in HathiTrust’s development environment.
California Digital Library continued work on a spelling suggestion feature for full-text search queries. A CDL developer established an account in the HathiTrust development environment and used a sample index of public domain materials to test strategies for automatically building a bigram dictionary of words with different spellings users might enter.
Tom Burton-West's proposed talk on "HathiTrust Large Scale Search: Scalability meets Usability" was accepted by popular vote for the 2012 Code4Lib Conference in Seattle, WA.
Staff at Michigan released a new throttling mechanism for HathiTrust, which allows throttling levels to be set at more granular levels. Users are now less likely to be throttled in the course of normal use as the new throttling policies are applied to specific scenarios such as viewing thumbnail or page images, or downloading PDFs, as opposed to all use generally. Throttling ensures compliance with third-party restrictions on bulk download of materials, and helps to ensure a consistent and reliable experience for all users.
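The idea of applying throttling per scenario rather than to all use can be sketched as follows. This is an illustration only, not HathiTrust's implementation: the class name, the action names, and the numeric limits are all hypothetical, and a sliding time window is just one of several possible policies.

```python
import time
from collections import defaultdict, deque

# Hypothetical limits per action type: (maximum requests, window in seconds).
EXAMPLE_LIMITS = {
    "thumbnail": (300, 60.0),
    "page_image": (60, 60.0),
    "pdf_download": (5, 60.0),
}


class ActionThrottle:
    """Sliding-window limiter that tracks each (user, action) pair separately."""

    def __init__(self, limits, clock=time.monotonic):
        self.limits = limits
        self.clock = clock
        self.history = defaultdict(deque)  # (user, action) -> request timestamps

    def allow(self, user, action):
        max_requests, window = self.limits[action]
        now = self.clock()
        recent = self.history[(user, action)]
        # Forget requests that have aged out of the window.
        while recent and now - recent[0] >= window:
            recent.popleft()
        if len(recent) >= max_requests:
            return False  # this action is throttled; other actions are unaffected
        recent.append(now)
        return True
```

Because each action keeps its own history, a user who hits the limit for PDF downloads can still page through volumes normally, which is the behavior the paragraph above describes.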
In connection with HTPub, Michigan staff continued work to adapt the HathiTrust PageTurner to display XML content.
Michigan Library staff continue to work with central IT security analysts to complete the Risk Assessment that was started in November, and have received the final report of the vulnerability penetration test. The report revealed no vulnerabilities that enabled direct or indirect access to the repository, but noted software issues such as cross-site scripting vulnerability and also made recommendations for increased firewalling at the Michigan site. All software issues noted in the report were addressed in December. A broader firewalling project for the data center where the Michigan instance is hosted is already in progress but not yet complete, and so some provisional steps were taken to tighten security while that effort continues.
HathiTrust services were inaccessible or diminished for several periods in December due to problems related to the release of the new throttling system (all times EST): on Tue, Dec 13 4:25-4:30pm, Wed, Dec 14 11:10am-12:00pm, and Wed, Dec 21 7:30-10:30am, all page viewing was affected, and on Tue, Dec 13 3:45-5:00pm, full-book PDF download was affected. Additionally, page viewing of volumes classified as "Public Domain in the United States" in HathiTrust was intermittently unavailable on Wed, Dec 21 from approximately 1-4:30pm EST due to an apparent outage with an externally-hosted proxy detection system.
HathiTrust sends notice upon discovery and resolution of unscheduled outages and in advance of scheduled outages and maintenance work that may result in an outage. We welcome and encourage additional recipients for these notices. If your institution is not receiving outage notifications and would like to, please contact feedback@issues.hathitrust.org [66].
All HathiTrust papers, presentations, and reports are available at http://www.hathitrust.org/papers [279].
As of January 1:
| Institution | December | Total |
| Columbia University | 4 | 64,176 |
| Cornell University | 9,871 | 383,690 |
| Duke University | 21 | 4,522 |
| Harvard University | 434 | 53,440 |
| Indiana University | 324 | 186,912 |
| Library of Congress | 15,769 | 89,411 |
| North Carolina State University | 0 | 3,196 |
| University of North Carolina - Chapel Hill | 0 | 8,087 |
| Northwestern University | 237 | 5,649 |
| New York Public Library | 76 | 259,453 |
| Penn State University | 1,821 | 42,917 |
| Princeton University | 350 | 249,679 |
| Purdue University | 0 | 887 |
| University of California | 114,906 | 3,287,654 |
| The University of Chicago | 1,730 | 10,608 |
| University of Illinois | 0 | 14,503 |
| Universidad Complutense | 28 | 108,668 |
| University of Michigan | 22,907 | 4,504,601 |
| University of Minnesota | 916 | 90,239 |
| University of Wisconsin | 15,902 | 527,334 |
| University of Virginia | 12 | 47,396 |
| Utah State | 0 | 46 |
| Yale University | 0 | 23,674 |
| Total | 185,311 | 9,966,572 |
Public Domain (~27%)
| Total* | 50,434 | 2,712,626 |
You can follow HathiTrust on Twitter http://www.twitter.com/hathitrust [198]
[Download PDF [280]]
HathiTrust is pleased to welcome Boston College as its newest member. The full announcement is available on the Boston College website [281].
A new “Our Research Center [53]” portion of HathiTrust.org was launched in early December, containing information about the governance, timeline and deliverables, architecture, and access and use policies for the HathiTrust Research Center (HTRC), which is jointly led by Indiana University and the University of Illinois. “Our Research Center” also includes information about research partnerships and a demonstration tool that allows users to create tag clouds and perform limited analysis on a small number of works. The HTRC welcomed two new members from the University of Illinois library to the HTRC technical team in November: Kirk Hess and Harriett Green. Kirk and Harriett bring experience in user interfaces and services, areas that complement the technical strengths of Indiana staff currently working on the HTRC.
A new Perspectives on HathiTrust blog post [258] authored by Suzanne Chapman, chair of the User Experience Advisory Group, was released in early December, highlighting HathiTrust’s new mobile interface.
The “Buy a copy” option has now been expanded to include over 30,000 public domain volumes from the University of California. UC will incrementally add new volumes to the service. UC has partnered with Hewlett-Packard to create the reprints and make them available for purchase via Amazon.com.
The User Support Working Group is very pleased to welcome a new member, Kathryn Stine from the California Digital Library. The working group is seeking nominations from partner institutions for up to 4 additional positions. Nominations should be sent to Jeremy York (jjyork@umich.edu [269]) and include the name, title, and a short description of current job duties. Additional information that might be relevant to participation in the group may be included as well. User Support members are on call at least one day per week and follow up on inquiries throughout the week, requiring between 2 and 4 hours of work. Staff that participate on the group will
The charge for the working group is available at http://www.hathitrust.org/wg_user-support_charge [190].
HathiTrust is seeking a volunteer Lucene developer (from partner institutions or not) to work directly through the Lucene contribution process to improve indexing capabilities for Chinese-, Japanese-, and Korean-language (CJK) materials; more specifically, to add overlapping bigram functionality for CJK languages to the Lucene ICUTokenizer (view the Lucene JIRA ticket [203] for this issue). A new HathiTrust large-scale search blog [202] post on word segmentation for CJK languages provides additional context. Please contact Tom Burton-West [282] for more information.
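For readers unfamiliar with the technique, overlapping bigram tokenization can be sketched in a few lines. This is a simplified illustration and not the Lucene ICUTokenizer: the function names are hypothetical, the character ranges cover only common Han, kana, and Hangul syllable blocks, and non-CJK characters are treated purely as separators.

```python
def is_cjk(ch):
    """Rough CJK test covering common Han, Hiragana/Katakana, and Hangul syllables."""
    code = ord(ch)
    return (
        0x4E00 <= code <= 0x9FFF      # CJK Unified Ideographs
        or 0x3040 <= code <= 0x30FF   # Hiragana and Katakana
        or 0xAC00 <= code <= 0xD7AF   # Hangul Syllables
    )


def overlapping_bigrams(text):
    """Emit overlapping two-character tokens for each unbroken run of CJK characters."""
    tokens = []
    run = []

    def flush():
        # A lone character has no neighbor, so it is emitted as a unigram.
        if len(run) == 1:
            tokens.append(run[0])
        for first, second in zip(run, run[1:]):
            tokens.append(first + second)
        run.clear()

    for ch in text:
        if is_cjk(ch):
            run.append(ch)
        else:
            flush()
    flush()
    return tokens
```

For a three-character run, the function yields two tokens that share the middle character, so a query for either two-character word can match without the indexer needing to know where word boundaries fall.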
On February 1, HathiTrust will be adding additional columns to the tab-delimited inventory files [4] (“hathifiles”). A final description of the changes will be posted in the update on December activities. Proposed additions include the publication date and publication location of volumes, as well as an indication of whether volumes have been identified as U.S. federal government documents.
University of Michigan staff have updated the permissions agreement by which rights holders can open access to their works in HathiTrust. The agreement, now also offered as a fillable PDF, is available at http://www.hathitrust.org/permissions_agreement [283], with instructions on completion and submission.
HathiTrust sent a call to partners in November for projections of volumes to be deposited in 2012. The projections will be used to estimate storage needs and fees for partners in the coming year. A variety of locally-digitized collections were identified for deposit, in addition to volumes digitized through Internet Archive and Google. More information on these and continuing work on ingest will be included in coming months.
HathiTrust has ingested nearly all of approximately 200 rare manuscripts and incunabula from the Universidad Complutense de Madrid. Issues with some of the submitted volumes that prevented ingest will be investigated further by Michigan staff.
Working groups and committees in HathiTrust may have an operational or strategic focus. See http://www.hathitrust.org/working_groups [36] for more information.
The Communications Working Group continued to make progress on a public services-oriented communications package, highlighting ways HathiTrust can be used to address a variety of research and reference inquiries.
The User Experience Advisory Group finalized the user personas it began to develop over the summer. The personas and an accompanying overview of the project are available at http://www.hathitrust.org/personas [268]. The purpose of the personas is to help HathiTrust staff and partners (developers, policy makers, user experience designers and researchers, reference and instruction librarians, etc.) envision different types of HathiTrust users in a more concrete way in order to inform our work. The group welcomes any questions or comments about the personas to be sent to Suzanne Chapman (suzchap@umich.edu [170]), chair of the UX Advisory Group.
The table below contains a summary of the issues received by the User Support Working Group in November.
| Issue Type | October Issues | November Issues |
| Content | 154 | 107 |
| - Quality | 142 | 102 |
| - Non-partner Digital Deposit | 0 | 0 |
| - Collections | 1 | 5 |
| Cataloging | 44 | 43 |
| Access and Use | 136 | 103 |
| - Copyright | 75 | 55 |
| - Permissions | 4 | 10 |
| - Takedown | 4 | 1 |
| - Print on Demand | 2 | 2 |
| - Inter-library loan | 0 | 0 |
| - Full-PDF or e-copy requests | 23 | 15 |
| - Datasets | 1 | 1 |
| - Data Availability and APIs | 2 | 2 |
| - Reuse of content | 2 | 1 |
| Web applications | 29 | 24 |
| - Functionality problems | 6 | 5 |
| - Problems with login specifically | 3 | 1 |
| - General Questions about login | 4 | 2 |
| - Partners setting up login | 1 | 3 |
| - Usability issues | 2 | 2 |
| - Feature requests | 5 | 5 |
| Partner Ingest | 1 | 3 |
| General | 59 | 47 |
| - Partnership | 8 | 6 |
| - Infrastructure | 0 | 0 |
| - Miscellaneous | 51 | 41 |
*See User Support Working Group Issue Types [201] for a description of the types of issues included in each category.
The Collections Committee made several revisions to its draft recommendations for handling duplicates in HathiTrust and has submitted these to the Strategic Advisory Board for approval. Once the revisions have been approved by SAB and endorsed by the Executive Committee, the full document will be posted on the HathiTrust website. The committee is currently discussing mechanisms for responding to user-submitted requests and offers and is formulating plans to address other items on its work agenda. Suggestions for additional work items are always welcome and can be sent to the committee chair, Ivy Anderson
The California Digital Library team worked to develop an infrastructure to compare bibliographic records in Zephir, its core metadata management system, with records in HathiTrust. Some of the challenges are determining when a record has been updated in HathiTrust, and managing multiple (non-HathiTrust) identifiers for volumes. The Zephir team loaded and tested records, and refined the timeline for migrating records into Zephir as they work with staff at the University of Michigan. Further information about the project can be found at http://www.hathitrust.org/htmms [12].
Staff in MPublishing began work in November on a tool to convert DOCX files to JATS XML and worked with broader stakeholders at the University of Michigan Library to specify additional design requirements and agree on a set of design principles for HTPub (available on the HTPub project page [223]). MPublishing staff also reviewed notes from a session at THATCamp Publishing 2011 dedicated to shared infrastructure for publishing, to consider how such an infrastructure might affect the architecture of HTPub tools, and services that might be offered in the future using those tools.
Physical review of volumes in the first 1,000-volume sample continued in November, with volumes requested through inter-library loan continuing to arrive at Michigan. Plans are being made for staff at partner libraries to conduct physical review of volumes in cases where the volumes are not available for inter-library loan. There was an error in the previous update [284] with respect to the timing of results analysis for the quality review performed on the first sample of digital volumes. This will be available at a later time. Please visit the project website [120] for updates.
Quality review on the second sample of 1,000 volumes from HathiTrust was completed in mid-November. Measures to evaluate inter-coder consistency required re-review of some volumes in the sample, as well as individual pages within specific volumes. This review began in late November and should be complete in the first week of December. As review of the second sample of volumes was completed, project staff prepared to begin review of a third sample of 1,000 volumes, which will include pre-1923 English-language monographs digitized by the Internet Archive.
Project staff continued to define requirements for a new quality review interface, targeted specifically for review of volume-level errors such as missing, duplicate, and out-of-order pages. The project developer began coding basic elements of the system. Combining this new interface and procedures with those in the first interface, which was designed to review page-level errors, will lead to a system for comprehensive review that will enable certification of volumes at different quality levels. The project team is in the process of drafting specifications for certifying volumes. The final model will be based on the findings from statistical sampling and manual review at the page and volume levels.
Work continued on the Orphan Works Project pilot phase, which will continue through the end of December. Reviewers from the University of Michigan and the University of California, Los Angeles have now researched the same set of approximately 50 works. Staff from both institutions are looking at the results and reviewing the process for accuracy. The pilot phase of the OWP is intended to serve as a test for an orphan works identification process, through which we will document examples and further define parameters for research.
Staff at Michigan began a re-indexing process in November for all 9.8 million volumes in HathiTrust. The purpose was to correct an error in the “Original Location” metadata facet and to provide additional metadata for advanced search, relevance ranking, and determining the viewability status of volumes (see below). The re-indexing was 98% complete at the end of November and is anticipated to go into production in early December. This re-indexing, and the discovery of a bug in the way Solr processes Boolean queries, slowed development of the advanced search feature that was planned for release in November. A workaround will be used until the bug is fixed. The advanced search feature is now planned for release in January.
As the indexing enhancements were put in place, Michigan staff completed the coding necessary for full-text search results to reflect whether or not a user is able to view items in situations that depend on institutional print holdings and other factors. This will apply to search results that include orphan works (when available), volumes that may be available under Section 108 of U.S. copyright law, and volumes that are accessible to users at partner institutions who have print disabilities. In order to see the availability of these volumes, and access them, users from partner institutions will need to be logged in using their institutional account.
Michigan developers continue to work with staff at the California Digital Library on the development of a spelling suggestion feature. CDL is testing various algorithms on sample HathiTrust data including the Solr/Lucene Levenshtein Automaton and Martin Reynaert’s anagram hashing algorithm. The work is focusing both on the speed and scalability of the algorithms and on the accuracy of the suggestions. Experimental code to extract useful bigrams from existing HathiTrust indexes is in the works, which will obviate the need to maintain multiple indexes to support spelling correction, as is currently the case.
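To give a sense of the general idea behind anagram hashing, the sketch below computes an order-independent numeric key for each word and looks up lexicon words that share the key. It is a simplified illustration, not the code CDL is testing: the key function (a sum of code points raised to a fixed power), the function names, and the tiny lexicon are all assumptions made for this example.

```python
from collections import defaultdict


def anagram_key(word: str, power: int = 5) -> int:
    """Order-independent key: sum of each character's code point raised to `power`."""
    return sum(ord(ch) ** power for ch in word)


def build_index(lexicon):
    """Map each anagram key to the lexicon words that share it."""
    index = defaultdict(list)
    for word in lexicon:
        index[anagram_key(word)].append(word)
    return index


def suggest(word, index):
    """Return lexicon words whose key equals that of `word`, excluding `word` itself."""
    return [w for w in index.get(anagram_key(word), []) if w != word]


# Hypothetical lexicon, for illustration only.
index = build_index(["the", "form", "from", "library"])
suggestions = suggest("teh", index)
```

This covers only transposed letters. Handling insertions, deletions, and substitutions would need more machinery, which the update does not describe.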
HathiTrust has implemented new policies regarding access to in-copyright works, where lawful access is permitted. Access for authorized users at partner institutions who have print disabilities is now only possible from IP addresses within the United States. Access is limited to one user per physical volume held by the user’s institution. Access to in-copyright works is also now recorded in HathiTrust system logs, in accordance with HathiTrust’s privacy policy: http://www.hathitrust.org/privacy [285].
In connection with the HTPub effort, Michigan staff continued work to adapt the HathiTrust PageTurner to display XML content based on initial specifications.
Michigan staff tested and refined application-specific policies for throttling (e.g., in the PageTurner, Full-text search, and Collection Builder applications), and expect to enable the new policies in December.
Michigan staff purchased and began installing two new replacement web servers for HathiTrust in November. These are the last of eight servers targeted for replacement this year (six others were replaced in July).
The new storage brought online in June of this year was discovered by Isilon Systems, the storage provider, to have a subtle hardware issue requiring all drives and some internal components in eight nodes to be removed and re-installed in a new chassis. The upgrade was preventative in nature; the minor symptom caused by the hardware issue had not been observed by HathiTrust. The maintenance was covered under the existing support agreement, and carried out without any interruption to service by Isilon’s field service technicians under close supervision by Michigan staff.
As part of a regular program for continuous improvement in IT security, Michigan Library staff have been working with analysts in University of Michigan central IT to conduct a thorough risk assessment and vulnerability penetration test of the HathiTrust infrastructure. The scope of the risk assessment, which follows a framework developed at the University, consists primarily of servers and storage hardware, but also includes coverage of aspects such as facilities, management practice and policy, and workflows involving sensitive data. The vulnerability test focuses on network security, and is a hands-on exercise conducted by a trained security expert who attempts to discover flaws in network security and evaluate their potential for exploit. Final reports on both analyses are due in December.
No outages were reported in November 2011.
HathiTrust sends notice upon discovery and resolution of unscheduled outages and in advance of scheduled outages and maintenance work that may result in an outage. We welcome and encourage additional recipients for these notices. If your institution is not receiving outage notifications and would like to, please contact feedback@issues.hathitrust.org [66].
All HathiTrust papers, presentations, and reports are available at http://www.hathitrust.org/papers [279].
As of November 1:
| Institution | November | Total |
| Columbia University | 123 | 64,172 |
| Cornell University | 5,833 | 374,089 |
| Duke University | 15 | 4,501 |
| Harvard University | 163 | 53,006 |
| Indiana University | 393 | 186,588 |
| Library of Congress | 0 | 73,642 |
| North Carolina State University | 0 | 3,194 |
| University of North Carolina - Chapel Hill | 0 | 8,087 |
| Northwestern University | 57 | 5,412 |
| New York Public Library | 212 | 259,377 |
| Penn State University | 281 | 41,096 |
| Princeton University | 413 | 249,329 |
| Purdue University | 886 | 887 |
| University of California | 27,759 | 3,172,748 |
| The University of Chicago | 822 | 8,875 |
| University of Illinois | 0 | 14,503 |
| Universidad Complutense | 296 | 108,640 |
| University of Michigan | 34,744 | 4,481,254 |
| University of Minnesota | 728 | 89,323 |
| University of Wisconsin | 6,190 | 511,432 |
| University of Virginia | 54 | 47,384 |
| Utah State | 0 | 46 |
| Yale University | 0 | 23,674 |
| Total | 78,971 | 9,781,261 |
Public Domain (~27%)
| Total* | 6,032 | 2,662,192 |
[Download PDF [288]]
HathiTrust has released a blog post [289] on the outcomes of the October Constitutional Convention. The post includes a link to the official notes [290] from the two-day meeting.
Maliaca Oxnam, an Associate Librarian from the University of Arizona, and current Chair of the Technical Report Archive and Image Library (http://www.technicalreports.org [291]), has begun a sabbatical research project with the goal of improving access to government documents in HathiTrust. Her work has three primary areas: 1) investigating the accurate identification of government documents, 2) analyzing the copyright status of the documents and the reasons for their copyright determinations in HathiTrust, and 3) securing permissions from government agencies to make government publications viewable to the public at large. The sabbatical work will be completed by July 2012 and a report with recommendations for future actions will be presented to the HathiTrust Executive Committee. Questions or comments about the research can be sent to Maliaca Oxnam (oxnamm@u.library.arizona.edu [292]).
The Orphan Works Project (OWP) is in a pilot phase that will continue through the end of December. Researchers from the University of Michigan and the University of California - Los Angeles are conducting a parallel review of approximately 680 volumes in HathiTrust that do not have readily identifiable publisher contacts. Michigan staff have made significant changes to the research process and project tools in order to improve the rigor and reliability of investigation following a reevaluation of the orphan works candidate identification process in October. An overview flowchart of the new procedure is available at http://www.lib.umich.edu/orphan-works/documentation [293]. Michigan staff will add more extensive documentation in the coming months. The pilot phase of the OWP is intended to serve as a test for an orphan works identification process, through which we will document examples and further define parameters for research.
Ingest rates for Google-digitized volumes from all Google partner libraries were low in October due to problems with Google’s download mechanism. Rates are expected to pick up in November.
HathiTrust began ingest of Internet Archive-digitized content from Duke University and the University of North Carolina in October, and worked with the University of Florida toward ingest of its IA-digitized volumes.
Staff at the University of Michigan continued conversations with the University of Pittsburgh and University of Utah regarding bibliographic metadata for those institutions' contributed volumes. Staff at Michigan received the final set of rare manuscripts and incunabula from Universidad Complutense de Madrid and expect to finish ingest of the materials in November.
Working groups and committees in HathiTrust may have an operational or strategic focus. See http://www.hathitrust.org/working_groups [36] for more information.
The Communications Working Group continued work to develop a public services-oriented communications package, highlighting ways HathiTrust can be used to address a variety of research and reference inquiries. The group also made progress on a FAQ for the HathiTrust Research Center, and worked with staff from Indiana University to prepare a presence for the Research Center on HathiTrust.org.
The User Experience Working Group was pleased to welcome a new member, Darcy Duke, to the group. Darcy is the User Experience Librarian and Web Manager at MIT and has been an active member of the HathiTrust UX Special Interest Group. The group worked on finalizing the user personas it drafted over the summer and discussed details regarding a label change for the PDF download link in the PageTurner application.
The table below contains a summary of the issues received by the User Support Working Group in October. Nancy Spiegel, of the University of Chicago, stepped down from the User Support Working Group at the end of the month. The Executive Committee would like to heartily thank Nancy for her work on the group, and her contributions to establishing an ongoing body for user support in HathiTrust. Two positions on the working group are currently open. Nominations and inquiries can be sent to jjyork@umich.edu [269].
| Issue Type | September Issues | October Issues |
| Content | 171 | 154 |
| - Quality | 154 | 142 |
| - Non-partner Digital Deposit | 2 | 0 |
| - Collections | 4 | 1 |
| Cataloging | 25 | 44 |
| Access and Use | 127 | 136 |
| - Copyright | 73 | 75 |
| - Permissions | 12 | 4 |
| - Takedown | 3 | 4 |
| - Print on Demand | 17 | 2 |
| - Inter-library loan | 5 | 0 |
| - Full-PDF or e-copy requests | 24 | 23 |
| - Datasets | 1 | 1 |
| - Data Availability and APIs | 7 | 2 |
| - Reuse of content | 5 | 2 |
| Web applications | 22 | 29 |
| - Functionality problems | 5 | 6 |
| - Problems with login specifically | 0 | 3 |
| - General Questions about login | 2 | 4 |
| - Partners setting up login | 5 | 1 |
| - Usability issues | 6 | 2 |
| - Feature requests | 2 | 5 |
| Partner Ingest | 0 | 1 |
| General | 65 | 59 |
| - Partnership | 12 | 8 |
| - Infrastructure | 0 | 0 |
| - Miscellaneous | 53 | 51 |
*See User Support Working Group Issue Types [201] for a description of the types of issues included in each category.
The Collections Committee submitted its recommendations on the treatment of duplicates in HathiTrust to the Strategic Advisory Board (SAB) in October. The recommendations will be posted to the HathiTrust website following incorporation of feedback from the SAB. The Committee will be turning its attention next to a process for responding to requests and offers to include additional materials in HathiTrust, among other pending items on its work agenda.
The California Digital Library development team worked with staff at the University of Michigan on a workflow and timeline for migrating all bibliographic data from Michigan’s integrated library system to California. CDL's metadata analyst finalized the internal metadata schema to be used in Zephir, the core metadata management system. Further information about the project can be found at http://www.hathitrust.org/htmms [12].
MPublishing staff at the University of Michigan gathered input from colleagues in library-based publishing programs in October as they worked to finalize requirements, architecture, and design principles for the new publishing system, and archival package specifications for the published content. Michigan developers began adapting the HathiTrust PageTurner to display the new content based on initial specifications. Details about the publishing effort are available at http://www.hathitrust.org/htpub [223].
Data collection on the second sample of 1,000 volumes in HathiTrust continued in October; nearly 80% of the sample was reviewed by month’s end. October also saw the launch of the official grant project website, available at http://hathitrust-quality.projects.si.umich.edu/ [120]. The website features an overview of the project and detailed status reports by quarter, from the project’s beginning in January 2011 to the present.
Review of the physical copies of volumes included in the first 1,000-volume sample continued throughout October. The review focuses on capturing bibliographic information and physical characteristics of the volumes that may have an impact on errors observed in the digital volumes. By the end of the month, a volunteer staff of 12 students from the School of Information reviewed 476 volumes, or nearly 50% of the sample. Staff are coordinating inter-library loan requests with member libraries to facilitate efficient receipt of volumes, or on-site review of volumes by member library staff.
Initial analysis of the data from the first 1,000-volume sample was completed by the project statistician and will be available on the project website in November. The second round of data collection is expected to be complete in mid-November.
Indiana University staff worked on implementing the technical security infrastructure for the Research Center in October. The first part of this involved setting up InCommon Federation security, which will allow researchers to log in to the HTRC with the username and password issued by their own institution. Once logged in, researchers will have the ability to access data and analysis tools in ways not available to the public. Authenticated access to the HTRC is expected to be available on a limited basis to HathiTrust partners in spring 2012. As the key architectural pieces of the HTRC are put in place, Indiana staff are examining the adoption of a single API by which researchers can access all pieces of the data infrastructure. The best candidate for this appears to be the HathiTrust Data API. Staff will be making proposed extensions to this API available for comment.
Staff at the University of Michigan improved processes to synchronize bibliographic and rights metadata in the Collection Builder with metadata in the catalog and rights database.
University of Michigan staff re-indexed the full-text search index to add additional bibliographic metadata, including title information that will enable title displays in full-text search results to match those in the bibliographic catalog. Staff also continued work on advanced search, prototyping several designs for the user interface and working to improve relevance ranking of results. Staff expect to release the advanced search feature in November.
Michigan developers continue to work with staff at the California Digital Library on the development of a spelling suggestion feature. Developers at CDL are investigating modifications to traditional spelling suggestion algorithms, which are generally designed for single-language corpora, to accommodate the many languages in HathiTrust, and testing alternative spelling suggestion algorithms against a sample index.
Staff at Michigan made minor changes to the full-text indexing process to automatically receive notifications when volumes need to be removed from the index, and to improve index monitoring.
Michigan staff completed enhancements to BookReader and underlying infrastructure to improve the speed at which images from the repository are rendered on the Web. The image-serving application behind BookReader now estimates dimensions for images and updates them as the images are loaded in the Web browser, rather than inspecting each image prior to making the whole volume available. Further enhancements included better positioning of images in the thumbnail and scrolling views, and improved relative sizing of images when pages within a volume vary dramatically in size.
Staff at Michigan made progress on the development of new throttling mechanisms for the PageTurner and other applications, which will enter an initial internal testing phase in early November.
No outages were reported in October 2011.
HathiTrust sends notice upon discovery and resolution of unscheduled outages and in advance of scheduled outages and maintenance work that may result in an outage. We welcome and encourage additional recipients for these notices. If your institution is not receiving outage notifications and would like to, please contact feedback@issues.hathitrust.org [294].
All HathiTrust papers, presentations, and reports are available at http://www.hathitrust.org/papers [279].
As of October 1:
| Institution | October | Total |
| Columbia University | 7 | 64,049 |
| Cornell University | 110 | 368,256 |
| Duke University | 4,486 | 4,486 |
| Harvard University | 5 | 52,843 |
| Indiana University | 23 | 186,195 |
| Library of Congress | 2,224 | 73,642 |
| North Carolina State University | 0 | 3,194 |
| University of North Carolina - Chapel Hill | 8,087 | 8,087 |
| Northwestern University | 6 | 5,355 |
| New York Public Library | 7 | 259,165 |
| Penn State University | 8 | 40,815 |
| Princeton University | 2 | 248,916 |
| Purdue University | 1 | 1 |
| University of California | 3,646 | 3,144,989 |
| The University of Chicago | 11 | 8,053 |
| University of Illinois | 2 | 14,503 |
| Universidad Complutense | 6 | 108,344 |
| University of Michigan | 195 | 4,446,510 |
| University of Minnesota | 163 | 88,595 |
| University of Wisconsin | 893 | 505,242 |
| University of Virginia | 3 | 47,330 |
| Utah State | 0 | 46 |
| Yale University | 0 | 23,674 |
| Total | 19,885 | 9,702,290 |
Public Domain (~27%)
| Total* | 13,325 | 2,656,160 |
[Download PDF [299]]
On October 8-10, 2011, 130 representatives from 64 HathiTrust partner institutions, including library directors, chief information officers, and senior library administrators, gathered in Washington D.C. for an unprecedented “Constitutional Convention” to reflect on the accomplishments of HathiTrust since its launch in 2008, and determine directions and priorities for the partnership in its next phase. The business portion of the meeting consisted of deliberations and voting on 7 ballot initiatives presented by partner delegations prior to the convention. The final proposals and outcomes are available at http://www.hathitrust.org/constitutional_convention2011 [187]. A large portion of the Convention was also spent in general discussions on a variety of topics including the new pricing model for partner institutions, lawful uses of library-owned materials, and international cooperation. A more complete report on the Convention, its outcomes, and what they mean for the partnership, is forthcoming. The following presentations from the Convention are available on the HathiTrust website:
The University of Miami announced membership [301] in HathiTrust in early October. We are very pleased to welcome Miami to the partnership.
Following a soft release in August, HathiTrust is pleased to formally announce its new mobile interface (visit http://m.hathitrust.org [302]). The interface offers mobile-friendly access to key functionality including searching the HathiTrust catalog and reading HathiTrust “Full view” texts. Users from HathiTrust partner institutions can download texts in PDF or ePub format. Since the mobile interface is web-based, it works on all platforms, and may be viewed either from mobile devices or from desktops and laptops. The interface has special functionality for tablets where there are two ways to read texts: either in the vertical scrolling format, or in a horizontal flip format. Please give the new mobile interface a try and don’t hesitate to send your comments and feedback [294]!
On September 12, the Authors Guild, the Australian Society of Authors, the Union Des Écrivaines et des Écrivains Québécois (UNEQ), and eight individual authors filed a lawsuit against HathiTrust, the University of Michigan, the University of California, the University of Wisconsin, Indiana University, and Cornell University for copyright infringement. The suit was updated on October 8. We believe this is a misguided and unnecessary lawsuit. A full statement [303] by HathiTrust is available online, and links to statements by the University of Michigan and analysis from a variety of sources are available at http://www.hathitrust.org/authors_guild_lawsuit_information [303].
Beginning January 1, 2012, partners joining HathiTrust will need to provide information about their library holdings at the time of joining. The holdings data will be used for partner fee calculations and to offer access on a limited basis to in-copyright materials (see the Holdings Database update in the July newsletter [304] for details). Partners must be configured with Shibboleth [107] for their users to authenticate for partner services in HathiTrust.
University of Michigan staff continued work with several partner institutions on ingest of locally-digitized materials, including Northwestern University, Universidad Complutense de Madrid, the University of Florida, the University of Iowa, the University of North Carolina-Chapel Hill, the University of Pittsburgh, and the University of Utah.
The UX Advisory Group compiled and discussed a list of possible interface features and improvements that have been requested by users and staff at partner institutions. Three improvements were identified as high priority and will be ongoing topics of discussion until solutions are reached which can be passed to the University of Michigan development team. The improvements are:
The following is a summary of the issues received by the User Support Working Group in September.
| Issue Type | August Issues | September Issues |
| Content | 110 | 171 |
| - Quality | 96 | 154 |
| - Non-partner Digital Deposit | 3 | 2 |
| - Collections | 8 | 4 |
| Cataloging | 26 | 25 |
| Access and Use | 111 | 127 |
| - Copyright | 58 | 73 |
| - Permissions | 23 | 12 |
| - Takedown | 2 | 3 |
| - Print on Demand | 6 | 17 |
| - Inter-library loan | 0 | 5 |
| - Full-PDF or e-copy requests | 14 | 24 |
| - Datasets | 1 | 1 |
| - Data Availability and APIs | 1 | 7 |
| - Reuse of content | 7 | 5 |
| Web applications | 27 | 22 |
| - Functionality problems | 5 | 5 |
| - Problems with login specifically | 1 | 0 |
| - General Questions about login | 3 | 2 |
| - Partners setting up login | 4 | 5 |
| - Usability issues | 11 | 6 |
| - Feature requests | 7 | 2 |
| Partner Ingest | 2 | 0 |
| General | 59 | 65 |
| - Partnership | 13 | 12 |
| - Infrastructure | 1 | 0 |
| - Miscellaneous | 45 | 53 |
*See User Support Working Group Issue Types [201] for a description of the types of issues included in each category.
The California Digital Library development team continued to work on improvements to Zephir, the core metadata management system, and adaptations of system components to HathiTrust ingest and management workflows. As part of these improvements, project staff developed a program that doubles the speed of ingest for normalized bibliographic records. The team also worked with University of Michigan staff to identify modifications that have been made to records in HathiTrust over time, part of a broader strategy for managing updates to records in the new system.
A project manager from the University of Michigan joined the team working on HTPub, a two-year project to develop a system that will enable MPublishing at the University of Michigan Library to use HathiTrust as a publishing platform for its journals. The team has refined the project goal and requirements [223] and is formulating design principles, a use case specification, and the system architecture. A full-time software developer has joined MPublishing, focusing on the content ingest and publication management components of this system.
The Communications Working Group began working with staff at Indiana University to create a presence for the HathiTrust Research Center on HathiTrust.org. The new portion of the website is expected to be released in the next several weeks.
In September, staff at the University of Michigan and University of Minnesota completed quality review of a sample of 1,000 public domain volumes selected at random from HathiTrust (the sampling strategy is described in the July newsletter [304]). Data for more than 110,000 pages in all were collected. Two reviewers coded 10% of the sampled volumes as a check on inter-coder reliability. The project statistician is analyzing the data and initial findings will be available in October.
In addition to review of the digital volumes, the project team launched a process to perform physical review on all volumes in the sample. The project programmer created a data collection interface for this review and a volunteer staff of students as well as project staff began to retrieve and evaluate the physical volumes according to a list of specific criteria. The volunteer staff reviewed approximately 10% of the physical volumes by the end of September.
The project team also prepared for and began review of a second sample of 1,000 digital volumes. The second sample focuses on volumes published after 1922 and employs a different within-book sampling methodology. Whereas in the first run 100 pages at most were sampled from each volume, this run will review a number of pages in each volume proportional to the size of the volume. The second round of data collection is expected to be complete in mid-November. Background information on the project can be found at http://www.hathitrust.org/grants [305].
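The difference between the two within-book sampling approaches can be illustrated with a minimal sketch. It assumes pages are chosen at evenly spaced intervals; the function names and the one-page-in-ten ratio used for the proportional case are hypothetical and are not taken from the project's actual tooling.

```python
def systematic_sample(page_count, sample_size):
    """Return 1-based page numbers chosen at evenly spaced intervals."""
    if page_count <= 0 or sample_size <= 0:
        return []
    n = min(page_count, sample_size)
    step = page_count / n
    # Take the page at the midpoint of each of the n equal-width intervals.
    return [int(step * i + step / 2) + 1 for i in range(n)]


def capped_sample(page_count, cap=100):
    """First-run style: at most `cap` pages per volume."""
    return systematic_sample(page_count, cap)


def proportional_sample(page_count, one_in=10):
    """Second-run style: sample size grows with volume length.

    `one_in` is an illustrative ratio (one page in every ten), not the
    project's real parameter.
    """
    size = -(-page_count // one_in)  # integer ceiling division
    return systematic_sample(page_count, size)
```

Under these assumptions a 1,200-page volume contributes 100 pages with the capped approach and 120 with the proportional one, while a 350-page volume contributes 100 and 35 respectively.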
Staff at the University of Michigan implemented a new process for updating rights information for items saved to personal and private collections.
University of Michigan staff made modest modifications to full-text search indexing as part of a revised re-indexing strategy. Re-indexing of the full-text and bibliographic metadata for the entire corpus of 9+ million books began in late September and will be completed in early October. The re-index updates the full-text index to Unicode 6, and includes metadata changes that will improve title displays and provide the metadata needed to support access mechanisms that depend on holdings information (e.g., print disabled users).
Michigan staff developed a prototype for advanced full-text search and performed a preliminary user interaction/usability walkthrough. Michigan developers provided query logs, N-gram data, and term frequency information to staff at the California Digital Library for use in developing and testing a spelling suggestion feature.
University of Michigan staff improved the algorithm used to estimate and update page image sizes for display with BookReader, resulting in faster image display. Staff also added to the thumbnail view the “missing page” placeholder that appears in traditional views of volumes when pages are known to be missing. Pages may be missing from volumes for a variety of reasons, including the pages not being present in the physical volumes that were scanned, and errors in post-scan processing.
Developers at Michigan made progress on new throttling mechanisms that will be implemented at the web application level. Once completed, these mechanisms will make it possible to adjust throttling thresholds depending on the type of content delivered and ultimately reduce the likelihood of users being throttled during normal use.
Michigan staff put additional access controls into place in PageTurner, in anticipation of offering access to orphan works. The controls include limiting access to:
Interface changes were also made to improve display of the copyright status of each work.
No outages were reported in September 2011.
HathiTrust sends notice upon discovery and resolution of unscheduled outages and in advance of scheduled outages and maintenance work that may result in an outage. We welcome and encourage additional recipients for these notices. If your institution is not receiving outage notifications and would like to, please contact feedback@issues.hathitrust.org [294].
All HathiTrust papers, presentations, and reports are available at http://www.hathitrust.org/papers [279].
As of September 1:
| Institution | September | Total |
| Columbia University | 0 | 64,042 |
| Cornell University | 10,815 | 368,146 |
| Harvard University | 24 | 52,838 |
| Indiana University | 33 | 186,172 |
| Library of Congress | 0 | 71,418 |
| North Carolina State University | 240 | 3,194 |
| Northwestern University | 165 | 5,349 |
| New York Public Library | 115 | 259,158 |
| Penn State University | 1,438 | 40,807 |
| Princeton University | 3,132 | 248,914 |
| University of California | 102,280 | 3,141,343 |
| The University of Chicago | 6 | 8,042 |
| University of Illinois | 0 | 14,501 |
| Universidad Complutense | 183 | 108,338 |
| University of Michigan | 14,018 | 4,446,315 |
| University of Minnesota | 181 | 88,432 |
| University of Wisconsin | 6,810 | 504,349 |
| University of Virginia | 19 | 47,327 |
| Utah State | 0 | 46 |
| Yale University | 5,289 | 23,674 |
| Total | 144,748 | 9,682,405 |
Public Domain (~27%)
| Total* | 49,797 | 2,642,832 |
You can follow HathiTrust on Twitter http://www.twitter.com/hathitrust [198]
Updates are provided in relation to the milestones listed at HathiTrust Research Center Timeline and Deliverables [227].
Progress: HTRC is working with a 50,000 volume collection of materials digitized from the IU library and a 250,000 volume collection of non-Google digitized content. Both collections reside at IU stored on a NAS (Network Attached Storage) unit and are regularly synchronized with the main HT collection at Michigan using the Unix tool rsync.
Progress: HTRC received a 3-year grant from the Alfred P. Sloan Foundation to build this prototype system. The system will prove experimentally and theoretically that it is possible to comply with the non-consumptive constraint in computational research. It will serve the community as a platform for development, testing, and execution of new algorithms developed by the broad research community capable of running at scale on the HathiTrust corpus. This research involves Atul Prakash of the University of Michigan.
Progress: HTRC staff set up a development portal for HTRC. The portal is built using the Lift framework [308]. A key aspect of the portal implementation is its support for InCommon [309] identity and access management, which enables a user to log in using their home university credentials. The portal is consequently more secure because HTRC does not need to manage identity itself, and users also benefit from the InCommon management tool as they are not required to remember another user ID and password.
Progress: In this first phase HTRC is working on setting up core infrastructure components including the portal, InCommon sign-on, a service registry, Solr indexes, file system and database storage for the collections. Staff are also working on infrastructure for user-created collections and experimenting with text-mining techniques for improving descriptive metadata across the collections. Finally, a set of 60,000 rules developed with the aid of domain experts is being applied to correct OCR errors across the collection.
Progress: HTRC staff demonstrated SEASR running against a small HTRC collection at the Digital Humanities Conference in June 2011, using a collection of 50,000 volumes from the Indiana University collection in HathiTrust. The content was prepared for this use by flattening the HT internal pairtree and converting bibliographic data to RIS format (http://www.refman.com/support/risformat_intro.asp). Future work includes integrating SEASR into the HTRC portal infrastructure by supporting InCommon identity and access management. Other projects currently in progress include scaling to access a large remote data collection and ensuring algorithm integrity against the copyrighted collection, particularly in the face of users' ability to rewire workflows at will.
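For readers unfamiliar with the pairtree layout that had to be flattened, the following is a minimal sketch of how an identifier maps to a nested directory path under the public Pairtree specification. The function name is hypothetical, and the final directory named after the cleaned identifier is an assumption about the layout rather than something stated in this update.

```python
def pairtree_path(identifier):
    """Map an identifier to a pairtree-style path of 2-character segments."""
    cleaned = []
    for ch in identifier:
        # Hex-encode characters the Pairtree spec treats as unsafe, plus
        # anything outside visible ASCII (encoded byte by byte as UTF-8).
        if ch in '"*+,<=>?\\^|' or not (0x21 <= ord(ch) <= 0x7E):
            cleaned.extend("^%02x" % b for b in ch.encode("utf-8"))
        else:
            cleaned.append(ch)
    # Single-character substitutions: "/" -> "=", ":" -> "+", "." -> ","
    s = "".join(cleaned).translate(str.maketrans("/:.", "=+,"))
    segments = [s[i:i + 2] for i in range(0, len(s), 2)]
    # Assumed convention: the object sits in a directory named by the cleaned id.
    return "/".join(segments + [s])
```

Flattening, in this sense, amounts to replacing the nested path with a single directory level keyed by the cleaned identifier (the last path component above).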
Progress: HTRC has chosen the InCommon framework for trustworthy shared management of access to on-line resources. Researchers have single sign-on convenience using their existing credentials at their host organization, which eliminates the need to create additional accounts. InCommon uses Shibboleth or other SAML-compliant software to exchange attributes with partners, providing only the information necessary to do the authentication and authorization. The InCommon Federation provides the policy and technical framework that makes all of this possible. As of a recent count, all but 15 of the members of HathiTrust are members of the InCommon Federation. We anticipate that membership will grow to 100% of HathiTrust members.
Progress: An official kick-off of the HTRC was held at the Digital Humanities Conference in Palo Alto, CA on June 20, 2011. The HathiTrust Research Center team has given 7 presentations to various other groups and conferences.
Progress: HTRC has co-sponsored three grant proposals with institutions inside and outside the HathiTrust partnership community.
Progress: HTRC has met with Project Bamboo [310] on multiple occasions in continuing discussions.
Progress: The Alfred P. Sloan Foundation award is a step towards sustainability. We are working on a long-term sustainability plan.
[Download PDF [311]]
The University of Connecticut announced membership [312] in HathiTrust in early September, and OCLC [313] and EBSCO [314] announced plans to integrate the HathiTrust full-text index in their discovery offerings. We are very pleased to welcome the University of Connecticut, and to be expanding users' ability to find and use materials in HathiTrust's collections.
Duke, Cornell, Emory, and Johns Hopkins universities [315], and the University of California [316] announced in August that they will begin offering users at their institutions access to orphan works in HathiTrust where print copies of the works are held in their library collections. Approximately 160 orphan works candidates have been identified in HathiTrust to date by the University of Michigan in a pilot effort funded by HathiTrust. The total number of orphan works in HathiTrust is unknown, but John Wilkin, HathiTrust’s Executive Director, has estimated [317] that the total proportion of orphan works could be as high as 50% of the entire collection. The currently identified orphan works candidates are listed in a public online catalog [318] and will be considered to be orphan works 90 days from the time of their posting if they are not claimed by copyright holders. As orphan works identification in HathiTrust moves from a pilot to production phase, it is expected that the review of volumes will be expanded to multiple institutions similar to the existing Copyright Review Management System [13]. More information about the Orphan Works Project is available at http://www.lib.umich.edu/orphan-works [252]. Institutions that have previously announced their intention to offer access to orphan works under the same terms as above include the University of Florida, the University of Michigan, and the University of Wisconsin-Madison.
The Update on July 2011 Activities [319] outlined the general framework under which access to orphan works will be provided, and enhanced access to orphan works and other in-copyright volumes in HathiTrust for users who have print disabilities. Access in both cases is contingent upon print copies of volumes being held currently or at one time by partnering libraries, and is provided to users who are authenticated via Shibboleth. The Shibboleth attribute and particular attribute values that partner institutions must use to enable access for users who have print disabilities or their proxies are available at http://www.hathitrust.org/shibboleth [107]. These were determined by a small working group composed of members from the University of Iowa, University of Illinois, and University of Michigan. Institutions may populate these values effective immediately to gain access for their users.
Staff at the University of Michigan are in the process of making enhancements to the HathiTrust catalog and full-text search applications that will allow users to search volumes based on the volumes’ availability to themselves specifically, or to their particular institution. This work is targeted to be complete in early- to mid-October, in conjunction with the availability to partner institutions of the first orphan works.
Partner institutions are in the process of submitting ballot proposals on a variety of topics for consideration at the HathiTrust Constitutional Convention. Some of the topics include governance structure, content deposit, approval processes for partner initiatives, and a distributed strategy for archiving print monographs. The deadline for submitting proposals for the Convention is September 9, 2011. After a brief period of collation, the proposals will be posted publicly to HathiTrust’s Constitutional Convention web page [187], where further information about the Convention is available.
University of Michigan staff released a beta mobile interface for searching and viewing volumes in HathiTrust: http://m.hathitrust.org/ [187]. It is currently considered to be a "soft release" for testing purposes, but is available without restrictions. Once any issues that arise are worked out, the mobile site will be publicized more broadly and mobile users who visit the regular site will be automatically redirected to the mobile version. Although it is designed for small screens, the mobile interface also works in a regular web browser. If you have any questions or would like to submit comments on the new interface, please send them using the "feedback" link on HathiTrust pages, or email Suzanne Chapman (suzchap@umich.edu [170]).
58 of approximately 217 rare manuscripts and incunabula were ingested from Universidad Complutense de Madrid in August. Staff at Michigan and Madrid continued to work on transfer of the remaining volumes to Michigan, and their subsequent transformation for deposit. Michigan staff coordinated with staff at Northwestern University and the universities of Iowa, Pittsburgh, and Utah on ingest of locally-digitized volumes.
HathiTrust ingested 46 of approximately 300 volumes to be contributed by the Utah State University Press. The USU Press is the second university press to deposit back file publications in HathiTrust on an open access basis. In return for open access, HathiTrust is archiving these volumes free of cost.
HathiTrust began ingest of the first Google-digitized volumes from Northwestern University in August and prepared for ingest of Google-digitized volumes from Purdue University. HathiTrust also ingested a set of nearly 3,000 volumes from North Carolina State University, digitized by the Internet Archive.
The Collections Committee submitted its ballot initiative for a Distributed Print Monographs Archive for the Constitutional Convention in August. A draft recommendation for the treatment of duplicates was slightly delayed by August vacations but will be shared with the Strategic Advisory Board soon for feedback and direction about next steps. The committee will turn its attention next to a process for responding to individual requests and offers to include additional materials in HathiTrust, among other pending items on its work agenda.
The Communications Working Group welcomed Stacy Kowalczyk of Indiana University as a new representative from the HathiTrust Research Center. A small team composed of members of the working group and operational staff from the Research Center met to identify specific avenues for developing the Research Center’s presence on the HathiTrust website and other communication activities.
The Usability Group, now one year old, undertook a review of its progress and discussed opportunities for fine-tuning the group's mission. The result of the review was a decision, approved by the Executive Committee, to change from a "working" group to an "advisory" group. To reflect this development, the name of the group has changed to the "HathiTrust User Experience (UX) Advisory Group." As an advisory group for operational needs, the group will continue to report to the HathiTrust Director. It will also continue to manage the HathiTrust UX-SIG email group, participate in other HathiTrust committees as a liaison for UX-related issues, and advise development staff on user interface designs, development priorities, and usability priorities. The group will not play a role specifically in the implementation of usability studies or interfaces for HathiTrust applications and services.
The UX advisory group continued to review and track feedback received via the User Support Group to help discover issues related to usability. Both the UX Advisory Group and the UX-SIG group were given an opportunity to provide early feedback on the HathiTrust mobile beta interface.
The following is a summary of the issues received by the User Support Working Group in August.
| Issue Type | August Issues | July Issues |
| Content | 110 | 90 |
Quality | 96 | 89 |
Non-partner Digital Deposit | 3 | 1 |
Collections | 8 | 2 |
| Cataloging | 26 | 20 |
| Access and Use | 111 | 81 |
Copyright | 58 | 52 |
Permissions | 23 | 2 |
Takedown | 2 | 0 |
Print on Demand | 6 | 36 |
Inter-library loan | 0 | 9 |
Full-PDF or e-copy requests | 14 | 13 |
Datasets | 1 | 0 |
Data Availability and APIs | 1 | 1 |
Reuse of content | 7 | 2 |
| Web applications | 27 | 23 |
Functionality problems | 5 | 7 |
Problems with login specifically | 1 | 6 |
General Questions about login | 3 | 4 |
Partners setting up login | 4 | 6 |
Usability issues | 11 | 3 |
Feature requests | 7 | 8 |
| Partner Ingest | 2 | 2 |
| General | 59 | 23 |
Partnership | 13 | 6 |
Infrastructure | 1 | 0 |
Miscellaneous | 45 | 17 |
Review of the first sample of 1,000 randomly selected volumes from HathiTrust continued in August. As of late August, over half the sample had been coded by reviewers at the University of Michigan and University of Minnesota, amounting to 54,635 out of 100,000 total pages (systematic sampling is being used to select 100 pages within each volume).
In August, the project team also hired a team leader to coordinate logistics for conducting physical review of all of the volumes in the 1,000-volume sample. The team developed and finalized a short survey that will capture certain physical characteristics of the volumes and confirm bibliographic data. An interface for entering data from the survey is under development, and physical review of the original volumes is expected to commence in mid-September. The grant team also worked to finalize sampling parameters for the second production sample of volumes, which is set to begin in late September. The grant Advisory Board met in late August to provide feedback on the team’s progress to date and guidance on future directions. Additional information on the project can be found at http://www.hathitrust.org/grants [305].
The HathiTrust Research Center (HTRC) has received positive responses to about 70% of the invitations it sent out to form an HTRC advisory board. The advisory board is expected to guide the HTRC in setting resource allocation policies, being a good steward of the research data and outputs, and securing additional resources to make HTRC a stable entity for years to come. HTRC has been in discussions with HathiTrust to develop an integrated web presence, and is working on a draft of policies for access to and use of HTRC resources.
The California Digital Library development team continues to make improvements to the core metadata management system and work on issues related to integration with HathiTrust systems. The team is also preparing a production virtual environment for performance and load testing and has started work on a migration verification strategy that will use Z39.50 to compare bibliographic data that has been loaded from the University of Michigan with the same data in its pre-load state. The newly-hired metadata analyst for the project will start on September 12, 2011. Finally, the team has given a name to the HathiTrust metadata management system software - Zephir.
Preparations to allow access for users who have print disabilities and access to orphan works took precedence over the ongoing HathiTrust Data API security work in August, though this work remains a high priority. The API enhancements are described at http://bit.ly/jozHQK [257]. Interested parties are invited to submit comments and feedback to feedback@issues.hathitrust.org [66].
Staff at Michigan and the California Digital Library (CDL) continued to make progress on the full-text search tasks identified as high priority [270] by the HathiTrust Full-Text Working Group. Michigan successfully replaced the XPat search engine, which had been used since the launch of HathiTrust for searching within a book, with Solr/Lucene. This move has improved the order in which results are displayed for multi-word searches in “within book search,” the group's third highest priority. CDL's work on a spelling suggestion feature continued as well.
Michigan staff made progress on enabling application-level throttling in HathiTrust applications and will continue this work in September. A proof of concept was implemented and shows promise. Application-level throttling will give HathiTrust finer-grained control over when and when not to throttle user access (block access for short periods of time). This will allow HathiTrust to maintain compliance with third party agreements on content and provide a consistent experience for all users, while offering fewer interruptions to routine activities such as browsing thumbnails of content or scrolling quickly through a volume.
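One common way to obtain this kind of finer-grained control is a token bucket kept per user and per content type. The sketch below illustrates that general technique only; the class name, content-type labels, and threshold values are hypothetical and do not describe HathiTrust's actual implementation.

```python
import time


class Throttle:
    """Token-bucket limiter with a separate budget per content type."""

    def __init__(self, limits, clock=time.monotonic):
        # limits maps content type -> (burst capacity, tokens refilled per second)
        self.limits = limits
        self.clock = clock
        self.buckets = {}  # (user, content_type) -> (tokens, last_seen)

    def allow(self, user, content_type):
        """Return True if the request may proceed, False if it is throttled."""
        capacity, rate = self.limits[content_type]
        now = self.clock()
        tokens, last = self.buckets.get((user, content_type), (capacity, now))
        # Refill in proportion to elapsed time, never beyond the burst capacity.
        tokens = min(capacity, tokens + (now - last) * rate)
        if tokens >= 1:
            self.buckets[(user, content_type)] = (tokens - 1, now)
            return True
        self.buckets[(user, content_type)] = (tokens, now)
        return False


# Illustrative thresholds: generous for thumbnails, strict for full-volume downloads.
EXAMPLE_LIMITS = {"thumbnail": (5, 1.0), "full_pdf": (1, 0.5)}
```

With separate budgets, rapid thumbnail browsing draws on its own allowance and does not count against the stricter limit applied to full-volume downloads.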
HathiTrust may have been inaccessible for some users from approximately 5:00pm - 5:20pm EDT on Tuesday, August 16 due to network connectivity problems at the Indianapolis site. The problems were intermittent, preventing normal failover mechanisms from triggering.
HathiTrust sends notice upon discovery and resolution of unscheduled outages and in advance of scheduled outages and maintenance work that may result in an outage. We welcome and encourage additional recipients for these notices. If your institution is not receiving outage notifications and would like to, please contact feedback@issues.hathitrust.org [66].
As of August 1:
| Institution | August | Total |
| Columbia University | 41 | 64,042 |
| Cornell University | 12,237 | 357,331 |
| Harvard University | 87 | 52,814 |
| Indiana University | 1,252 | 186,139 |
| Library of Congress | 0 | 71,418 |
| North Carolina State University | 2,954 | 2,954 |
| Northwestern University | 5,184 | 5,184 |
| New York Public Library | 215 | 259,043 |
| Penn State University | 166 | 39,369 |
| Princeton University | 4,187 | 245,782 |
| University of California | 55,403 | 3,039,063 |
| The University of Chicago | 1,569 | 8,036 |
| University of Illinois | 0 | 14,501 |
| Universidad Complutense | 201 | 108,155 |
| University of Michigan | 27,448 | 4,432,297 |
| University of Minnesota | 606 | 88,251 |
| University of Wisconsin | 9,508 | 497,539 |
| University of Virginia | 4 | 47,308 |
| Utah State | 46 | 46 |
| Yale University | 0 | 18,385 |
| Total | 121,108 | 9,537,657 |
Public Domain (~27%)
| Total* | 84,644 | 2,593,035 |
[Download PDF [320]]
Two new partners announced membership in HathiTrust in July: the University of Notre Dame and the University of Florida. Florida announced additionally that it will be offering students, faculty, and other users of UF libraries access to orphan works in HathiTrust that UF also holds in its print collections. We are very pleased to welcome these new institutions and look forward to the ways they will enrich our partnership. News releases can be found at the following links: University of Notre Dame [321]; University of Florida [322].
The 3-year review conducted by Ithaka S+R with oversight from the Strategic Advisory Board (SAB) was completed in July and is available at http://www.hathitrust.org/constitutional_convention2011 [187], with an introduction from Ed Van Gemert, Deputy Director of Libraries at the University of Wisconsin-Madison and SAB chair. The review had been planned from the time of HathiTrust’s launch in 2008 to provide a meaningful assessment of the partnership’s accomplishments and outlook leading up to the HathiTrust Constitutional Convention, also planned to occur in the 3rd year. Institutions and consortia that were members of HathiTrust as of October 2010 will participate in the Convention this coming October to review HathiTrust sustainability and governance, and set new directions for the partnership. Details on the Convention are available at the link above. Questions or comments regarding the 3-year review should be directed to Ed Van Gemert at evangemert@library.wisc.edu [323].
HathiTrust posted the first set of orphan works candidates to a public catalog [324] in July. These are works for which, following an extensive review process, rights holders could not be found or contacted. As reported in last month’s update [325], works in the public catalog that are not claimed by the rights holder after a period of 90 days will be considered orphan works. The first 90-day period will expire in October (the expiration date for each work is posted in the catalog). At that time, partner institutions that wish to may begin to offer their users access to orphan works in HathiTrust. More information about the orphan works project can be found at http://www.lib.umich.edu/orphan-works/ [326]. Further information on how access will work is included in the Holdings Database newsletter item below.
Staff at the University of Michigan released several enhancements to the HathiTrust Collections list and full-text search application in July. The enhancements to the Collections interface include improved display of collections, the ability to search collections by title and description, and the ability to filter collections by their featured status, last time of update, number of items, and whether or not they belong to the current authenticated user. New full-text search features leverage the addition of bibliographic metadata to the full-text search index to offer faceting (refinement) of search results, and improved search results relevance ranking. These features were the top two prioritized by the HathiTrust Full-text Working Group [270] for implementation. Staff at Michigan and the California Digital Library will continue to work on features in the prioritized list in August. The third feature, improvements to “within book search”, will be released in the next couple of weeks. Please give these new features a try and send feedback to feedback@issues.hathitrust.org [66].
Holdings Database: Update and Lawful Uses of In-Copyright Materials
Early in 2011, HathiTrust began development on a database of holdings information from partner institutions designed a) to support the new cost model that will be implemented for all partners in 2013, b) to form a foundation for the expansion of lawful uses of in-copyright materials to partner institutions (such as access to persons who have print disabilities and access to orphan works), and c) to facilitate collective collection development and management activities among the partnership.
The first iteration of this database, containing data for single part monographs at partner institutions, was put into production in July. Staff at the University of Michigan are in the process of incorporating information from the database into existing applications such as the catalog and PageTurner to begin offering partners access to orphan works in HathiTrust, as well as access to in-copyright volumes for users who have print disabilities. The systems needed to provide access in these scenarios are expected to be in place in late-summer/early-fall.
Access to orphan works
Beginning in October, authenticated users from HathiTrust institutions that have elected to grant their users access to orphan works will see orphan works appear as “Full view” in HathiTrust access systems. Access will be available only to orphan works in HathiTrust that are or had previously been held in the partner institution’s library system.
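The access conditions described above can be summarized as a simple decision rule. The following is a minimal sketch of that rule; the class, field, and function names, and the "Limited" label for the fallback case, are hypothetical and do not reflect the actual holdings database or access systems.

```python
from dataclasses import dataclass, field


@dataclass
class Institution:
    name: str
    offers_orphan_works: bool  # has the institution opted in?
    holdings: set = field(default_factory=set)  # volume ids held now or previously


def orphan_work_view(volume_id, institution, authenticated):
    """Return the view an orphan work would get for a user of `institution`."""
    if (
        authenticated
        and institution.offers_orphan_works
        and volume_id in institution.holdings
    ):
        return "Full view"
    # Hypothetical label for every other case.
    return "Limited"
```

All three conditions must hold together: an authenticated user at an opted-in institution still sees no change for a work the institution has never held.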
Access for users who have print disabilities
Beginning in late summer or early fall, users at partner institutions who are certified as having a print disability will be eligible to view the full text of all in-copyright volumes in HathiTrust that are or had previously been held in the partner institution’s library system. In order to gain access, institutions will need:
Specifics on the syntax of the attribute and any additional information will be disseminated to partners in the coming weeks.
Nominations have been extended for a new member of the HathiTrust User Support working group. Please send nominations to jjyork@umich.edu [269] by August 19, 2011.
Staff at Michigan met with staff from Northwestern University to address questions related to ingest of a set of several hundred locally-digitized volumes. Staff at Universidad Complutense de Madrid began to transfer a second set of locally-digitized manuscripts and incunabula to the University of Michigan for ingest. The first set of locally-digitized materials from Madrid will be ingested in August.
The Collections Committee is putting the finishing touches on two major work items with which it has been occupied for the last several months: a ballot initiative for a Distributed Print Monographs Archive to be put forward at the Constitutional Convention, and a draft recommendation on the treatment of duplicates in HathiTrust. A draft of the print archives proposal was reviewed with a subgroup of the HathiTrust Executive Committee, which sponsored the initiative, in July; the final version will be forwarded shortly to the full Executive Committee for its approval. The draft duplicates paper will be shared with the Strategic Advisory Board in August for feedback and direction about next steps. A big thank you from the chair (Ivy Anderson) to her colleagues on the committee for terrific work in pulling these proposals together (the charge and membership of the group are available at http://www.hathitrust.org/wg_collections_charge [327]). Once these items are finalized, the committee will turn its attention to other pending items on its work agenda, including a process for responding to individual requests and offers to include additional materials in HathiTrust.
In July, the Communications Working Group focused on a number of topics including new partner announcements, a strategy to support public services staff in communicating about HathiTrust, soliciting authors and topics for the HathiTrust blog, and looking ahead to communication needs for the Constitutional Convention. The Communications group invites suggestions from partner institutions and others for topics to be covered in the HathiTrust blog. These should be directed to heather.christenson@ucop.edu [328].
The Usability Working Group discussed and provided feedback on the Collections list and full-text search features that were released in July. The group continued to review and track feedback received via the User Support Group on issues related to usability. The HathiTrust User Experience Special Interest Group (HT UX-SIG) has been active in discussions about feature requests and usability improvements to HathiTrust. The HT UX-SIG email group is open to anyone who is interested. Please contact Felicia Poe (Felicia.Poe@ucop.edu [329]) to join.
The following is a summary of the issues received by the User Support Working Group in July.
| Issue Type | Count |
| Content | 90 |
| - Quality | 89 |
| - Non-partner Digital Deposit | 1 |
| - Collections | 2 |
| Cataloging | 20 |
| Access and Use | 81 |
| - Copyright | 52 |
| - Permissions | 2 |
| - Takedown | 0 |
| - Print on Demand | 36 |
| - Inter-library loan | 9 |
| - Full-PDF or e-copy requests | 13 |
| - Datasets | 0 |
| - Data Availability and APIs | 1 |
| - Reuse of content | 2 |
| Web applications | 23 |
| - Functionality problems | 7 |
| - Problems with login specifically | 6 |
| - General Questions about login | 4 |
| - Partners setting up login | 6 |
| - Usability issues | 3 |
| - Feature requests | 8 |
| Partner Ingest | 2 |
| General | 23 |
| - Partnership | 6 |
| - Infrastructure | 0 |
| - Miscellaneous | 17 |
In July, grant project staff at the University of Michigan and University of Minnesota started to review the first of several production-level samples of volumes in HathiTrust, conducted according to the error type and severity model developed by the grant project team. The first sample includes 1,000 randomly selected volumes published before 1923 and digitized by Google. Staff will review a set of 100 pages, chosen at evenly-distributed intervals, within each of the 1,000 volumes. A subset of volumes will be reviewed by multiple staff members as a check on inter-coder reliability. The corresponding print versions of all volumes in the sample will undergo a physical assessment to identify potentially meaningful characteristics that affect quality, such as tight bindings, condition, and other physical features. A subset of the digital volumes will also be subjected to full-volume review to measure errors such as missing pages. The goals of the first production run are 1) to test the quality review system developed by the project team on a large scale; 2) to assemble a body of statistical data of sufficient size to begin to test the feasibility of sampling as a strategy to accurately describe error within a group of volumes; 3) to begin to explore the correlation of physical characteristics of books with observed errors in the digital scans. Review of the 1,000-volume sample is expected to be completed in mid-September.
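The page-selection step described above (100 pages at evenly distributed intervals within each of 1,000 randomly selected volumes) can be sketched roughly as follows. This is an illustration only, not the project's sampling model: the function names, the midpoint-of-stratum rule, and the seeded random generator are all assumptions.

```python
import random


def select_volumes(volume_ids, sample_size=1000, seed=None):
    """Draw a simple random sample of volume identifiers without replacement."""
    rng = random.Random(seed)  # a seed makes the draw repeatable
    return rng.sample(list(volume_ids), sample_size)


def evenly_spaced_pages(page_count, sample_size=100):
    """Return 1-based page numbers spread evenly across a volume.

    The volume is divided into `sample_size` equal-width strata and the
    page at the midpoint of each stratum is chosen (an assumed rule).
    Volumes shorter than the sample size are reviewed in full.
    """
    if page_count <= 0:
        return []
    if page_count <= sample_size:
        return list(range(1, page_count + 1))
    step = page_count / sample_size
    return [int(step * i + step / 2) + 1 for i in range(sample_size)]
```

For a 1,000-page volume this selects every tenth page, starting at page 6 and ending at page 996.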
The University of Michigan has been examining schema options for representing encoded text journal content in the HathiTrust archival package. An investigation of publisher XML formats has yielded a recommendation to use the Journal Archiving and Interchange Tag Set of JATS (an application of NISO Z39.96) as the XML format for encoded text. UM staff are currently researching Portico’s use of a custom profile of an earlier version of this standard in content normalization.
The HathiTrust Research Center has received a $600,000 award from the Alfred P. Sloan Foundation for the first investigation of non-consumptive research for a major large-scale digitized collection of content. The press release for the award is available at http://newsinfo.iu.edu/news/page/normal/19252.html [330].
The HathiTrust Research Center technical group is working on an end-to-end demonstration test of underlying infrastructure functionality. The test, which is planned to be completed in early September, is being conducted using a subset of the HathiTrust full-text Solr index and Indiana University public domain volumes deposited in HathiTrust. OCR text of the volumes was distributed to the Research Center from HathiTrust and is stored in a noSQL data store to be readily available for research purposes. The test scenario runs as follows: a user logs into the Research Center via an InCommon identity and simple algorithms are executed on the user’s behalf to pull word counts out of the index and do simple pattern-matching. The algorithms and services, which are available to all users, are registered in a web services registry where they can be queried by users. Results in this simple scenario are returned to the user as a URL. The test will allow the HTRC technical group to work out issues related to the HTRC’s core architecture, interfaces, and integrated security model.
Staff at the University of Michigan worked in July to prepare a dataset containing the OCR of approximately 240,000 publicly available non-Google digitized volumes in HathiTrust for distribution to the HathiTrust Research Center. The dataset will be delivered in August and will also be available for public download. The HTRC is awaiting resolution on a data agreement that will allow it to host and use OCR text of the full HathiTrust public domain corpus. Pending that agreement, this dataset will allow the HTRC to conduct testing of its infrastructure on a larger scale.
The California Digital Library (CDL) development team began the integration phase of the project in July, which focuses on adapting the new management system to the HathiTrust workflow. The team ingested bibliographic records into a virtual staging environment where integration testing with HathiTrust systems will occur. CDL has filled the second Metadata Analyst position for the project, advertised in previous updates. The new staff member will begin work in mid-September.
Staff at the University of Michigan continued development on security enhancements to the HathiTrust Data API. The enhancements are described at http://bit.ly/jozHQK [257]. Interested parties are invited to submit comments and feedback to feedback@issues.hathitrust.org [66].
Last February, University of Michigan staff began development on mobile interfaces to the HathiTrust catalog and PageTurner. Development of an initial version of these interfaces is nearly complete and staff hope to release beta versions for testing in September.
Michigan staff installed new database and ingest servers as part of the first periodic server replacement cycle, which keeps server infrastructure current on a 3-to-4-year cycle. The new database servers were installed a little ahead of schedule, but are configured to support the higher transaction rates expected with the introduction of the print holdings database. The new ingest servers are expected to provide significantly increased throughput rates for ingesting volumes into HathiTrust.
Staff at Michigan experimented with ways to improve the speed at which page images load in the new views for scrolling and flipping through books that were implemented in April. The strategy involves estimating the pixel dimensions of all pages in a volume based on a small sample and making adjustments as actual pages are retrieved. Staff will continue to test it throughout August.
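One way to picture this estimate-then-correct approach is the sketch below. It is not the PageTurner code: the class name, the use of the median as the estimate, and the vertical-offset calculation for a scrolling view are assumptions made for illustration.

```python
from statistics import median


class PageLayoutEstimator:
    """Estimate page sizes from a small sample, then correct as pages load."""

    def __init__(self, page_count, sampled_sizes):
        # sampled_sizes maps a 1-based page sequence number to (width, height)
        # in pixels for the few pages measured up front.
        self.page_count = page_count
        self.known = dict(sampled_sizes)
        widths = [w for w, _ in self.known.values()]
        heights = [h for _, h in self.known.values()]
        # Assumed estimate: the median of the sampled dimensions.
        self.default = (median(widths), median(heights))

    def size(self, seq):
        """Return the actual size if known, otherwise the estimate."""
        return self.known.get(seq, self.default)

    def record_actual(self, seq, width, height):
        """Replace the estimate for a page once its image is retrieved."""
        self.known[seq] = (width, height)

    def offset(self, seq):
        """Vertical position of a page in a scrolling view: the summed
        heights of all pages before it."""
        return sum(self.size(s)[1] for s in range(1, seq))
```

A viewer built this way can lay out placeholders for every page immediately and shift them slightly as real dimensions arrive, rather than waiting to measure each image first.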
Michigan staff continued work on a more sophisticated throttling system to improve the experience of using HathiTrust while ensuring compliance with third-party agreements on content and offering equal access for all users to HathiTrust applications. The new system will provide throttling controls at finer levels so that, for example, delivering thumbnail page images to a user in PageTurner does not count as heavily against a user’s access quota and limit their ability to view full-size pages.
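The idea of charging different request types different amounts against a quota can be sketched as below. This is a hypothetical illustration, not HathiTrust's throttling system: the cost values, the fixed-window policy, and all names are assumptions.

```python
import time


class WeightedThrottle:
    """Fixed-window quota in which each request type has its own cost."""

    def __init__(self, quota, window_seconds, costs, clock=time.monotonic):
        self.quota = quota            # units allowed per user per window
        self.window = window_seconds  # window length in seconds
        self.costs = costs            # e.g. {"thumbnail": 1, "full_page": 10}
        self.clock = clock            # injectable for testing
        self._usage = {}              # user -> (window_start, units_used)

    def allow(self, user, kind):
        """Return True and charge the user if the request fits the quota."""
        now = self.clock()
        start, used = self._usage.get(user, (now, 0))
        if now - start >= self.window:
            # The previous window has expired; start a fresh one.
            start, used = now, 0
        cost = self.costs[kind]
        if used + cost > self.quota:
            self._usage[user] = (start, used)
            return False
        self._usage[user] = (start, used + cost)
        return True
```

With costs of 1 for a thumbnail and 10 for a full-size page, a user who browses ten thumbnails has consumed only as much quota as a single full-size page view.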
There were no outages in July.
All HathiTrust papers, presentations, and reports are available at http://www.hathitrust.org/papers [197].
Number of volumes added:
| | July | Total |
| Columbia University | 95 | 64,001 |
| Cornell University | 17,542 | 345,094 |
| Harvard University | 17 | 52,727 |
| Indiana University | 12 | 184,887 |
| Library of Congress | 0 | 71,418 |
| New York Public Library | 136 | 258,828 |
| Penn State University | 29 | 39,174 |
| Princeton University | 2,489 | 241,595 |
| University of California | 489,975 | 2,983,660 |
| The University of Chicago | 164 | 6,467 |
| University of Illinois | 0 | 14,501 |
| University of Madrid | 7 | 107,954 |
| University of Michigan | 36,362 | 4,404,849 |
| University of Minnesota | 376 | 87,645 |
| University of Wisconsin | 15,020 | 488,031 |
| University of Virginia | 1 | 47,304 |
| Yale University Library | 0 | 18,385 |
| Total | 562,225 | 9,416,549 |
Public Domain (~27%)
| Total* | 94,470 | 2,508,391 |
* Includes volumes opened through copyright review or rights holder permissions.
[Download PDF [333]]
Following the announcement in May [334] of a new initiative to identify orphan works in HathiTrust, the University of Michigan announced [335] last month that it would be making orphan works identified in HathiTrust that are also held in its library collections available to Michigan students, faculty, staff, and other visitors to the UM libraries. Works that are identified as orphan candidates through an extensive review process will be posted in a public catalog at UM for 90 days. Works left unclaimed by rights holders after this time will be considered orphans. Michigan expects to begin offering access to orphan works in the fall. Joining Michigan at the initial release will be the University of Wisconsin-Madison; other partner institutions may begin to make these uses in the coming months.
Update on the Briefing Paper on Progress and Opportunities for HathiTrust Prepared by Ithaka S+R for the HathiTrust Strategic Advisory Board (SAB)
By Ed Van Gemert, Chair, SAB
The HathiTrust Strategic Advisory Board received the draft three-year review prepared by Ithaka S+R on 17 June 2011. The SAB along with Ithaka staff is currently working to revise that draft. Following the revision period, the final report from Ithaka is to be delivered to the SAB on 15 July 2011.
The SAB initially charged Ithaka S+R staff to challenge our collective thinking, and the review has certainly done that. The final report, and portions thereof, will be broadly distributed prior to the October 2011 Constitutional Convention. The draft report suggests key areas of focus for additional attention and work, including:
The SAB expects thorough discussions at the upcoming Constitutional Convention around these and other important questions regarding the future shape of HathiTrust and the role that current and future partner libraries will play in governing and sustaining HathiTrust.
The HathiTrust Research Center hosted a reception at the Digital Humanities 2011 Conference held in Palo Alto, California on June 20, 2011. The reception was sponsored by Indiana University and the University of Illinois, the institutions developing the HTRC, and by Google. Opening remarks were given by HTRC directors Beth Plale and John Unsworth, and Google Engineering Director John Orwant. The reception was well attended and well received. The HTRC stressed its receptivity to working with researchers broadly within the scope of available resources to provide computational access to the growing body of HathiTrust materials.
The day before the reception HTRC directors traveled to Oakland, CA to meet with Laine Farley, the HathiTrust Executive Committee liaison to the HTRC, and Heather Christenson, chair of the HathiTrust Communications Working Group. The group was later joined by David Greenbaum of Project Bamboo [336]. Discussions focused on interactions between the HTRC and HathiTrust and ways in which HTRC will collaborate with other projects such as Project Bamboo.
The HTRC is pleased to announce receipt of a $606,000 three-year award from the Alfred P. Sloan Foundation to explore architectural issues around large-scale non-consumptive research. Beth Plale is the PI of the project, with co-PIs Atul Prakash of the University of Michigan and Robert McDonald of Indiana University.
The HTRC wrote letters of support for three proposals to the second round of the Digging into Data Challenge [337].
The Executive Committee is seeking nominations from all partner institutions for a new member of the User Support Working Group. One of the current 8 members will be stepping off the group at the end of July. User Support members are on call to answer inquiries at least one day per week and spend an average of 2-3 hours per week investigating issues and responding to users. Nominations should be sent to Jeremy York (jjyork@umich.edu [269]) before August 1, 2011.
HathiTrust began ingest of the first large set of locally-digitized volumes from Yale University in June. More than 18,000 had been ingested as of July 1.
A draft ballot initiative for a print management proposal intended to be voted on at the Constitutional Convention will be shared with the HathiTrust Executive Committee’s print management subgroup in July. The Committee also expects to submit its draft discussion paper on duplicate volumes in HathiTrust to the Strategic Advisory Board in July for initial feedback. Work on recommendations for a process for responding to user-initiated requests has been put on temporary hold while the first two deliverables are finalized.
The Communications Working Group launched a new HathiTrust blog in June, “Perspectives from HathiTrust”, with its inaugural post [338] by HathiTrust Executive Director John Wilkin. The blog will feature authors from among the partner institutions writing on a variety of topics. The group also released a mid-year update [339] on HathiTrust activities in conjunction with the ALA annual conference.
After careful consideration and consultation with the HathiTrust Strategic Advisory Board, the Discovery Interface Working Group (DIWG) has officially disbanded. The DIWG, initially convened in spring 2009, fulfilled its charge to accomplish the implementation of the HathiTrust WorldCat Local Prototype interface. One important aspect of this project was working with OCLC to get all of the HathiTrust records loaded into WorldCat. Along the way, the DIWG also supervised the first phase of the HathiTrust Full-Text Search Subgroup and delivered a set of requirements to OCLC for the next phase of HathiTrust WorldCat Local catalog development in FY 2012.
At this point, the focus will shift from the group’s original charge to the ongoing maintenance and development of the HathiTrust WorldCat Local catalog. Julia Lovett of the University of Michigan will be the project manager for this effort, and will draw on the expertise of HathiTrust partner colleagues as needed. The DIWG executive team—John Butler, Lee Konrad, and Julia Lovett—would like to thank all the DIWG members for their contributions: Adam Brin, Patricia Martin, Christopher Walker, Lisa German, Kevin Clair, Suzanne Chapman, and Jon Rothman. Special thanks to John Wilkin and to the HathiTrust SAB for providing valuable guidance and input, and to Bill Carney and the OCLC WorldCat Local team for their very hard work on this project.
Work on the development of HathiTrust personas reported in April’s update [340] continued in June. The group has also begun reviewing feedback received via the User Support Group to help discover and track usability issues.
The User Support Working Group and staff at the University of Michigan fielded more than 750 user inquiries from April through June 2011. The break-down of issues received during that time is shown in the table below. We will continue to report these statistics on a monthly basis.
| Issue Type | Count |
| Content | 347 |
| - Quality | 302 |
| - Non-partner Digital Deposit | 2 |
| - Collections | 21 |
| Cataloging | 54 |
| Access and Use | 246 |
| - Copyright | 139 |
| - Permissions | 14 |
| - Takedown | 2 |
| - Print on Demand | 16 |
| - Inter-library loan | 3 |
| - Full-PDF or e-copy requests | 59 |
| - Datasets | 19 |
| - Data Availability and APIs | 10 |
| - Reuse of content | 11 |
| Web applications | 86 |
| - Functionality problems | 30 |
| - Problems with login specifically | 9 |
| - General Questions about login | 8 |
| - Partners setting up login | 3 |
| - Usability issues | 13 |
| - Feature requests | 19 |
| Partner Ingest | 12 |
| General | 68 |
| - Partnership | 30 |
| - Infrastructure | 5 |
| - Miscellaneous | 33 |
See User Support Working Group Issue Types [201] for a description of the types of issues included in each category.
The grant project team’s work in June focused on preparations for production level data collection to begin in early July. These preparations included continuing work to examine and improve inter-coder consistency, incorporating new data from the University of Minnesota review team, and undertaking several small sampling exercises to guide development of a model for systematic random sampling of HathiTrust volumes, and pages within volumes, for quality review. The project team, under the guidance of the Principal Investigator and team statistician, completed a draft of this model in June. The first large sample for production level analysis will be drawn in early July. Additional information on the project can be found at http://www.hathitrust.org/grants [305].
The University of Michigan hired the first of two programmers to work on the HTPub project. Interviews will take place in July for the second opening. Meanwhile, Michigan continued to examine schema options for representing journal content in the HathiTrust archival package, and questions surrounding interoperability of the envisioned HTPub software components with the HathiTrust repository. Details on the project can be found at http://www.hathitrust.org/htpub [223].
The California Digital Library team completed development of the major functionality for the core metadata management system, and on June 14, 2011, demonstrated the core system to staff at the University of Michigan. For initial testing, the system was loaded with approximately 200,000 metadata records from HathiTrust partner institutions. When it is implemented in 2012, the system will initially manage close to eight million records.
The next major development effort is to adapt the new system to the HathiTrust workflow. This includes integrating the system with the HathiTrust rights management database and developing batch export functionality for metadata records. CDL is working with University of Michigan staff to understand the particulars of the HathiTrust workflow.
CDL continues to interview for the open Metadata Analyst position: http://www.cdlib.org/services/d2d/d2d_mda2.html [341].
Further information on the project is available at http://www.hathitrust.org/htmms [12].
University of Michigan staff began to code enhancements to the Collection Builder interface in June. The enhancements will allow users to explore the list of collections more easily using new filtering and searching options. Deployment of the new interface is expected in July.
Michigan staff began development of security enhancements to the HathiTrust Data API in June. The enhancements are described at http://bit.ly/jozHQK [257]. We invite interested parties to submit any comments or feedback to feedback@issues.hathitrust.org [66].
Michigan staff deployed a timestamp-based sentinel file in the development environment to make it easier for the Plack Perl module, which was implemented to support the new PageTurner functionality, to stay up-to-date when changes to Plack-based applications are deployed to production.
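The general pattern behind a timestamp-based sentinel file can be sketched as follows. This is a Python illustration of the idea, not the Perl/Plack code used at Michigan: the class name and the choice to compare file modification times are assumptions.

```python
import os


class SentinelWatcher:
    """Detect deployments by watching the modification time of one file.

    The deployment process is assumed to touch the sentinel file after new
    code is in place; a long-running server checks it between requests and
    reloads its application code when the timestamp changes.
    """

    def __init__(self, path):
        self.path = path
        self._last_seen = self._mtime()

    def _mtime(self):
        try:
            return os.stat(self.path).st_mtime_ns
        except FileNotFoundError:
            return None  # treat a missing sentinel as "no timestamp yet"

    def changed(self):
        """Return True once for each change to the sentinel's timestamp."""
        current = self._mtime()
        if current != self._last_seen:
            self._last_seen = current
            return True
        return False
```

Checking a single file's timestamp is cheap enough to do on every request, which avoids scanning the whole application tree for modified modules.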
Staff at Michigan completed development to replace the XPat search engine with Solr as the mechanism for searching inside individual volumes from PageTurner (details on the change were reported in the Update on May 2011 Activities [342]). Use of the Solr back-end will eliminate differences between the ways that Solr and XPat work currently, which can interfere with searching activities, and improve relevance ranking of page-level results. Michigan staff have begun to test the current Solr configuration and search performance to optimize indexing and query response times. The code supporting the new functionality will undergo final testing for production deployment after the release of the new faceting and relevance-ranking features for full-text search, which is projected for mid-July. The coding for these features, the top two identified by the HathiTrust Full-text Working Group [270], was completed in June, and usability and internal tests are underway in preparation for the mid-July release.
HathiTrust has throttling protections in place to prevent systematic download of materials in the repository for which, due to third-party agreements, this type of activity is not allowed (see the Message from John Wilkin in the Update on September 2010 Activities [343]). Staff at Michigan have started a process to add more sophisticated capabilities to HathiTrust applications (for instance, optimization of thumbnail presentation) that will ensure compliance with such agreements while offering fewer interruptions to use.
Michigan staff upgraded software on both Michigan and Indiana storage instances and added 100TB of new capacity with no service interruption.
There were no outages in June.
All HathiTrust papers, presentations, and reports are available at http://www.hathitrust.org/papers [197].
Number of volumes added:
| | June | Total |
| Columbia University | 0 | 63,906 |
| Cornell University | 16,321 | 327,552 |
| Harvard University | 0 | 52,710 |
| Indiana University | 156 | 184,875 |
| Library of Congress | 0 | 71,418 |
| New York Public Library | 1 | 258,692 |
| Penn State University | 23 | 39,174 |
| Princeton University | 21 | 239,106 |
| University of California | 37,321 | 2,493,685 |
| The University of Chicago | 132 | 6,303 |
| University of Illinois | 0 | 14,501 |
| University of Madrid | 1,203 | 107,947 |
| University of Michigan | 12,603 | 4,368,487 |
| University of Minnesota | 625 | 87,269 |
| University of Wisconsin | 7,598 | 473,011 |
| University of Virginia | 0 | 47,303 |
| Yale University Library | 18,114 | 18,385 |
| Total | 94,118 | 8,854,324 |
Public Domain (~27%)
| Total* | 35,670 | 2,413,921 |
* Includes volumes opened through copyright review or rights holder permissions.
[Download PDF [344]]
HathiTrust is an international partnership of academic and research institutions dedicated to ensuring the preservation and accessibility of the vast record of human knowledge. The partnership owns and operates a digital repository containing millions of public domain and in-copyright volumes digitized from partnering institution libraries. The preserved volumes are made available in accordance with copyright law as a shared scholarly resource for students, faculty, and researchers at the partnering institutions, and as a public good to the world community. For more information, visit HathiTrust.org [79].
HathiTrust announced a new initiative [334] in May to identify orphan works in HathiTrust. In June, the University of Michigan announced that it will begin to make orphan works in HathiTrust that are also held in its library collections available to University of Michigan Library users [335]. Other HathiTrust partners are moving ahead with similar plans.
HathiTrust was pleased to welcome 2 new partners in the first part of 2011: Boston University, and Lafayette College, its first liberal arts college partner. Several other institutions have joined and will be announced in the coming weeks. As of June, HathiTrust has 58 partners.
HathiTrust partners contributed more than 900,000 volumes to the repository between January and June 2011, raising the total number of volumes to over 8.8 million. Nearly 2.5 million volumes are in the public domain. New institutions to contribute content in 2011 include:
HathiTrust has been working with several partners on ingest of locally-digitized volumes. These include the University of Illinois, University of Iowa, Universidad Complutense de Madrid, Northwestern University, the University of Pittsburgh, and Utah State University Press.
The University of Minnesota, the Minnesota Digital Library and the Minnesota Historical Society engaged in a prototype project in 2010 to ingest nearly 60,000 images into HathiTrust. All of the images were successfully ingested in February 2011. More information is available on the HathiTrust MDL project page [272].
HathiTrust was certified in March as a Trustworthy Repository by the Center for Research Libraries. More information can be found at http://www.hathitrust.org/trac [253].
In January, HathiTrust and OCLC announced the release of a collaborative prototype bibliographic catalog. In the next year, this catalog is planned to replace the temporary catalog HathiTrust has had in place since April 2009. More information about the catalog initiative is available in OCLC’s press release [264]. Information about HathiTrust’s discovery strategy more broadly is posted in the first entry [338] in HathiTrust’s new blog: Perspectives from HathiTrust.
Indiana University and the University of Illinois announced the launch of the HathiTrust Research Center [345] in April, a cutting-edge research environment that will provide computational access to the growing body of materials in HathiTrust.
In conjunction with the Research Center initiative, HathiTrust has begun to distribute texts from the repository to researchers for computational analysis. Information on the available datasets and how to obtain them can be found at http://www.hathitrust.org/datasets [212].
As of January, rights holders are able to open access to their works in HathiTrust under Creative Commons licenses. Several hundred volumes have already been opened in this way, including large numbers by the Brooklyn Museum and the Society of American Archivists. Creative Commons licenses can be applied using a permissions agreement [346], which can be downloaded from the HathiTrust website.
The University of Michigan continued work to assemble a database of the print holdings of all partner institutions. Information from the database will form the basis for yearly cost calculations [249] beginning in 2013. It will also provide a foundation for the expansion of lawful uses of in-copyright materials held in HathiTrust by partner institutions, and facilitate collaborative collection management and collection development initiatives.
California Digital Library continued work on the new metadata management system for HathiTrust. The development team reached a major milestone in May, with the completion of the core components of the infrastructure. Information on the project and updates are posted at http://www.hathitrust.org/htmms [12].
The University of Michigan continued work on a new initiative to enable the use of HathiTrust as a platform for publishing. An overview of the project, including development plan, design principles, and proposed architecture, is available at http://www.hathitrust.org/htpub [223].
Work on an IMLS grant led by University of Michigan professor Paul Conway began in January and has progressed rapidly. Background on the project is available at http://www.hathitrust.org/grants [305] and updates can be found in the monthly newsletter beginning in March [347].
The Strategic Advisory Board contracted in March with Ithaka S+R to perform a formal review of HathiTrust. The results of the review will be distributed to the membership for full discussion and review prior to the Constitutional Convention of partners to occur in October 2011. An update on the review process was included in the update on May 2011 Activities [342].
The Executive Committee charged a new User Support working group in March to respond to user inquiries to HathiTrust. The 8-member group raises to more than 40 the number of staff from partner institutions participating officially in HathiTrust working groups and committees. Many more staff, and a growing number, are participating in initiatives such as copyright review, bibliographic management system development, the HathiTrust Research Center, grant projects, pilot efforts around ingest of image and audio content, HathiTrust publishing, content quality and metadata error resolution, and listservs around communications, usability, and HathiTrust usage tracking.
37 of HathiTrust’s 58 partners are now configured for authenticated access to HathiTrust via Shibboleth. Authentication enables full-PDF download of all public domain materials, facilitated access to the Collection Builder feature, and will be a key mechanism for delivering additional partner services, such as those planned in limited circumstances for in-copyright materials. Information about Shibboleth in HathiTrust can be found at http://www.hathitrust.org/shibboleth [107].
Numerous papers and presentations were given in the first part of the year. All are available online at http://www.hathitrust.org/papers [110].
HathiTrust introduced a number of improvements to its PageTurner in April, including new functionality to scroll and flip through volumes, a streamlined interface, and quick-copy links for individual pages.
Enhancements made to Collection Builder in February and March allow users to create full-text searchable collections of arbitrary size out of repository materials.
The Full-text Search Working Group, launched in 2010, released a list of prioritized recommendations [270] for enhancing features of HathiTrust’s full-text search. The first of these, the use of bibliographic metadata in relevance ranking and faceting of search results, will be in place in July.
Staff at the University of Michigan have begun to implement new security features in the HathiTrust Data API to enable a range of new activities by partner institutions and others. These activities and the specifications for the new features are available online at http://bit.ly/jozHQK [257].
HathiTrust installed new hardware in May that will be used to conduct periodic, generalized auditing procedures on the repository, such as routine checksum validation of content, as well as ad-hoc cross-repository analysis, for example investigating specific values or usage of preservation metadata elements.
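Routine checksum validation of the kind mentioned above generally means recomputing a digest for each stored file and comparing it with a recorded value. The sketch below shows that general technique only. The function names, the manifest format, and the choice of SHA-256 are assumptions, not a description of HathiTrust's auditing software.

```python
import hashlib


def file_digest(path, algorithm="sha256", chunk_size=1 << 20):
    """Compute a hex digest of a file, reading it in chunks to bound memory."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def audit(manifest, algorithm="sha256"):
    """Return the paths whose current digest differs from the recorded one.

    `manifest` maps each file path to the hex digest recorded at ingest.
    """
    return [
        path
        for path, expected in manifest.items()
        if file_digest(path, algorithm) != expected
    ]
```

An empty list from `audit` means every file still matches its recorded digest; any path returned would be a candidate for repair from a replica.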
HathiTrust completed its first full storage replacement cycle in May, retiring storage purchased in late 2007. Storage is replaced on a cycle of approximately 3-4 years. Replacement of storage hardware will now be an annual or semi-annual process, shadowing historical patterns of repository growth and storage purchases.
Staff at Michigan are on track in the development of new mobile interfaces for conducting bibliographic searches and reading volumes in HathiTrust. Initial versions of the new interfaces are expected to be released in late July.
[Download PDF [348]]
With funding from HathiTrust, the University of Michigan Library Copyright Office has begun work to identify orphan works – works that are known to be in copyright, but whose rights holders cannot be identified or located – in HathiTrust’s growing repository. The goal of the project is to provide concrete data on the number of orphan works in HathiTrust, which could be used in the creation of legal or policy-based frameworks to allow broader access to orphan works for scholarly and research purposes. The official press release is available on the University of Michigan Library website [334].
The Strategic Advisory Board welcomed two new members from the University of California in May: Todd Grappone, Associate University Librarian for Digital Initiatives and Information Technology at UCLA, and Julia Kochi, Director, Digital Libraries and Collections at UC San Francisco. Todd and Julia take the place of Bernie Hurley, UC Berkeley and Bruce Miller, UC Merced, who are stepping down from their duties on the SAB. The HathiTrust Executive Committee would like to thank Bernie and Bruce for their important contributions to the partnership on this committee.
Michigan staff continued to work out details of ingest with several institutions, including loading bibliographic metadata for additional volumes from Yale, and completing an initial pre-ingest transformation process for locally-digitized content from Universidad Complutense de Madrid.
HathiTrust completed ingest of more than 47,000 volumes contributed by the University of Virginia in May.
The Collections Committee continues to work on its current key deliverables, including recommendations regarding duplicate volumes in HathiTrust, coordinated print management, and responding to user requests to contribute volumes to the repository. The group plans to share a draft discussion paper on duplicates with the Strategic Advisory Board in June or July for initial feedback, and will also review its work on print management with the HathiTrust Executive Committee’s print management subgroup during that same timeframe.
In May, the Communications Working Group continued planning for a HathiTrust Facebook presence, and made progress on the development of new HathiTrust promotional materials. The group also began discussions with the Usability group on working collaboratively to assemble user stories, and other potential synergies between the two groups.
Work on the development of HathiTrust personas reported in last month’s update [340] continued in May. The group continued to solicit and collect real-life user stories, including in particular those based on librarian and patron interactions. The group’s liaison to the Communications group attended the May Communications call to give an update on the progress of the persona project and to discuss collaboration on, and use of, the collection of real-life user stories.
For the last few months the group has been soliciting members for a User Experience Special Interest Group (HT UX-SIG) and has now received around thirty volunteers. This new SIG will be activated in June.
The User Support Working Group assumed responsibility for the wide range of inquiries and feedback received through HathiTrust interfaces and help email addresses in May. The 8-member group has established an on-call rotation throughout the week, including weekends, to address issues in a timely and efficient manner. HathiTrust’s response to user feedback received several positive comments via Twitter during the last month. The working group is committed to maintaining a high level of service to address user comments, feedback, and suggestions.
The grant project team’s work in May focused on preparing materials to orient several newly hired reviewers at the University of Minnesota and bring them on board for data collection and review. This work included updating and streamlining the quality review Web application to allow for efficient remote operation. Training of the staff at Minnesota commenced at the end of May and the new reviewers are set to begin work in June. As data from Minnesota are collected, the project statistician will be examining inter-coder reliability among all reviewers and working to establish a final model for sampling volumes in the repository. Gathering data for analysis will be the focus of the grant team’s efforts in June. Additional information on the project can be found at http://www.hathitrust.org/grants [305].
Staff at Michigan have begun discussing mechanisms to synchronize the text and bibliographic records of public domain materials in HathiTrust to the HathiTrust Research Center. Michigan staff developed an initial model for the data transfer (using rsync) in May, and Indiana University staff began performing tests with sample data.
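A transfer of this kind could be scripted roughly as follows. This is only a sketch of an rsync-based mirror: the source path, destination host, and choice of options are hypothetical and are not taken from the model developed by Michigan and Indiana staff.

```python
import subprocess
from typing import List


def build_rsync_command(source_dir: str, destination: str, dry_run: bool = True) -> List[str]:
    """Build an rsync invocation that mirrors source_dir to destination."""
    command = ["rsync", "--archive", "--compress", "--delete"]
    if dry_run:
        # --dry-run reports what would change without copying anything.
        command.append("--dry-run")
    # A trailing slash on the source copies the directory's contents,
    # not the directory itself.
    command.append(source_dir.rstrip("/") + "/")
    command.append(destination)
    return command


def run_sync(source_dir: str, destination: str, dry_run: bool = True) -> int:
    """Run rsync and return its exit code (requires rsync on the PATH)."""
    return subprocess.run(build_rsync_command(source_dir, destination, dry_run)).returncode


if __name__ == "__main__":
    # Hypothetical path and host, for illustration only.
    print(" ".join(build_rsync_command("/data/public_domain", "research.example.org:/incoming/ht")))
```

Defaulting to a dry run makes it possible to inspect what a synchronization would change before any sample data is actually moved.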
The California Digital Library (CDL) development team is preparing a demo of the Metadata Management core system for staff at the University of Michigan on June 14, 2011. As the first major component of the new HathiTrust Metadata Management System, this is a major milestone and deliverable. The next step will be for the CDL team to address feedback raised in the demo. The team continues to interview for the open Metadata Analyst position. In the meantime, a senior metadata analyst at CDL has conducted a full metadata audit, confirming the validity of the core system design. Further information on the project is available at http://www.hathitrust.org/htmms [12].
University of Michigan staff completed the first draft of requirements for improved security in the Data API. The draft has been made available for comment at http://bit.ly/jozHQK [257]. We ask that interested parties submit comments to feedback@issues.hathitrust.org [66]. Initial coding will begin as feedback is received.
Staff at Michigan implemented a “diff” service as part of support for administration of the HathiTrust Development Environment (HTDE). At the time when new code is staged for testing, the code administrator can now choose to see differences between the last deployed version of the code repository and the version being staged in preparation for the next deployment. Michigan staff also implemented topic branch staging for beta testing. This facilitates testing of code changes on a staged beta testing site without pushing the code branch to the central code repository before it is ready. Parties at partner institutions that are interested in exploring the development environment should contact feedback@issues.hathitrust.org [66].
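The idea behind such a comparison can be illustrated with Python's standard `difflib` module. The sketch below is not the HTDE service itself: it assumes, purely for illustration, that the deployed and staged versions are available as dictionaries mapping file names to file contents.

```python
import difflib
from typing import Dict, List


def staged_diff(deployed: Dict[str, str], staged: Dict[str, str]) -> List[str]:
    """Return unified-diff lines for every file that differs between two snapshots."""
    lines: List[str] = []
    for name in sorted(set(deployed) | set(staged)):
        # A file missing from one snapshot is treated as empty on that side.
        old = deployed.get(name, "").splitlines(keepends=True)
        new = staged.get(name, "").splitlines(keepends=True)
        if old != new:
            lines.extend(
                difflib.unified_diff(
                    old, new, fromfile=f"deployed/{name}", tofile=f"staged/{name}"
                )
            )
    return lines


if __name__ == "__main__":
    # Hypothetical file names and contents.
    deployed = {"search.pl": "print 1;\n"}
    staged = {"search.pl": "print 2;\n", "facets.pl": "print 3;\n"}
    print("".join(staged_diff(deployed, staged)), end="")
```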
HathiTrust full-text search uses the Lucene-based Solr search engine to index content and provide volume-level results. However, when searches are conducted within a single volume, a different search engine known as XPat is used to dynamically index and search the volume and display page-level results. Differences between the ways that Solr and XPat work sometimes cause inconsistencies in the user’s experience. To remedy this, staff at Michigan have started a process to replace XPat with Solr. The majority of this work is to be completed in June, though testing and optimizing may result in a later release date. Accomplishing the change will achieve one of the higher priority features identified by the Full-text Search Working Group: improved results display for multiword searches when searching within a book. Staff will conduct this work in parallel with other full-text search improvements currently underway, including the use of bibliographic metadata for relevance ranking and faceting of search results, and, with development contributions by CDL, a “spelling suggestion” feature. Michigan staff aim to release the relevance ranking and faceting improvements by July 1st.
Michigan filled one of two programmer positions advertised for the new HathiTrust publishing initiative, led by the MPublishing division at the University of Michigan Library. The new hire will start on June 27th. The search continues for the second position, which is posted at http://umjobs.org/job_detail/54579/application_developer [349].
MPublishing recently hired an intern who will be working over the summer to explore potential archival XML schema solutions for electronic journal content.
During their last visit to HathiTrust’s Indianapolis storage facility, Michigan staff installed two new servers that will perform periodic, generalized repository auditing, including checksum validation of repository content, using newly-developed auditing tools. The auditing tools can also perform ad-hoc cross-repository analysis as they run, culling information from the repository using custom one-time scripts. For example, staff may add a custom script to the next auditing run to analyze and report on a specific detail of PREMIS metadata usage.
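Checksum validation of this kind can be sketched in a few lines. The example below is illustrative only and is not the newly developed auditing tooling: the choice of SHA-256 and the in-memory manifest mapping relative paths to digests are assumptions.

```python
import hashlib
from pathlib import Path
from typing import Dict, List


def file_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute a SHA-256 digest, reading in chunks so large files fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> Dict[str, str]:
    """Record a digest for every file under root, keyed by relative path."""
    return {
        path.relative_to(root).as_posix(): file_digest(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def audit(root: Path, manifest: Dict[str, str]) -> List[str]:
    """Return the relative paths that are missing or whose digest has changed."""
    problems: List[str] = []
    for relative_path, expected in sorted(manifest.items()):
        target = root / relative_path
        if not target.is_file() or file_digest(target) != expected:
            problems.append(relative_path)
    return problems
```

A periodic run would call `audit` against a manifest recorded at ingest time and report any paths it returns.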
Enhancements to the new PageTurner views were released in May in response to user feedback. Staff at the University of Michigan added a full-screen viewing mode, optimizing the use of screen space for content display, and improved landscape image viewing, aligning viewing controls to browser window dimensions when scrolling through the image viewport. Staff also researched ways of improving performance for larger books.
Michigan staff have completed security wipes on all recently retired storage equipment. The equipment was returned to the vendor for a credit, completing this (the first) annual replacement cycle. The next cycle is planned for the first quarter of 2012.
There were no outages in May.
In September 2010, California Digital Library’s Discovery and Delivery group released an SFX target for HathiTrust monographs, which was made available to partnering libraries. In May, CDL made the target available to libraries broadly via EL Commons CodeShare, a forum hosted by Ex Libris. The formal announcement is available on the CDL website [350]. Please contact Margery Tibbetts (Margery.Tibbetts@ucop.edu [351]) with questions and inquiries.
All HathiTrust papers, presentations, and reports are available at http://www.hathitrust.org/papers [356].
Number of volumes added:
| Institution | April | Total |
| Columbia University | 5,423 | 63,906 |
| Cornell University | 121 | 311,231 |
| Harvard University | 1 | 52,710 |
| Indiana University | 838 | 184,719 |
| Library of Congress | 0 | 71,418 |
| New York Public Library | 0 | 258,691 |
| Penn State University | 135 | 39,151 |
| Princeton University | 2,051 | 239,085 |
| University of California | 47,637 | 2,456,364 |
| The University of Chicago | 999 | 6,171 |
| University of Illinois | 0 | 14,501 |
| University of Madrid | 2,947 | 106,744 |
| University of Michigan | 17,516 | 4,355,884 |
| University of Minnesota | 1,659 | 86,644 |
| University of Wisconsin | 11,081 | 465,413 |
| University of Virginia | 47,303 | 47,303 |
| Yale University Library | 110 | 271 |
| Total | 137,821 | 8,760,206 |
Public Domain (~27%)
| Total* | 174,061 | 2,378,582 |
* Includes volumes opened through copyright review or rights holder
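The totals and the public domain percentage can be cross-checked with a short script. The figures below are copied from the table above. The variable and function names are illustrative.

```python
from typing import Dict, Tuple

# (volumes added, running total) per institution, copied from the table.
VOLUMES: Dict[str, Tuple[int, int]] = {
    "Columbia University": (5423, 63906),
    "Cornell University": (121, 311231),
    "Harvard University": (1, 52710),
    "Indiana University": (838, 184719),
    "Library of Congress": (0, 71418),
    "New York Public Library": (0, 258691),
    "Penn State University": (135, 39151),
    "Princeton University": (2051, 239085),
    "University of California": (47637, 2456364),
    "The University of Chicago": (999, 6171),
    "University of Illinois": (0, 14501),
    "University of Madrid": (2947, 106744),
    "University of Michigan": (17516, 4355884),
    "University of Minnesota": (1659, 86644),
    "University of Wisconsin": (11081, 465413),
    "University of Virginia": (47303, 47303),
    "Yale University Library": (110, 271),
}
REPORTED_ADDED = 137821
REPORTED_TOTAL = 8760206
PUBLIC_DOMAIN_TOTAL = 2378582


def column_sums(rows: Dict[str, Tuple[int, int]]) -> Tuple[int, int]:
    """Sum the 'added' and 'total' columns across all institutions."""
    added = sum(added_count for added_count, _ in rows.values())
    total = sum(total_count for _, total_count in rows.values())
    return added, total


def public_domain_share(public_domain: int, total: int) -> float:
    """Public domain volumes as a percentage of all volumes."""
    return 100.0 * public_domain / total


if __name__ == "__main__":
    print(column_sums(VOLUMES) == (REPORTED_ADDED, REPORTED_TOTAL))
    print(round(public_domain_share(PUBLIC_DOMAIN_TOTAL, REPORTED_TOTAL), 1))
```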
Ed Van Gemert, for the Strategic Advisory Board
This update is a follow-on to the report on the HathiTrust 2011 Constitutional Convention given in the Update on January 2011 Activities.
HathiTrust contracted in March with Ithaka S+R to conduct a three-year review of HathiTrust’s progress toward meeting the needs of libraries, scholars, students and other users. The review will inform discussion and promote participation at the October 8-10, 2011 Constitutional Convention in Washington, DC. The Strategic Advisory Board (SAB) is providing oversight for the review, working closely with the Ithaka staff assigned to the project.
Ithaka’s efforts include gathering and preparing research on HathiTrust’s existing structure and needs, including background meetings with stakeholders, team members at the University of Michigan, and members of both the Executive Committee and the Strategic Advisory Board.
A survey and review of user needs has followed the initial research. A survey was sent to the 52 HathiTrust Contributing Partners and Sustaining Partners (non-content contributing). Ithaka S+R is also interviewing 20 representatives from libraries that do not currently participate in HathiTrust, along with 12 scholars in the humanities and social sciences. The survey officially closed at the end of the day on 3 June and results are being formulated.
Preliminary indicators from this research process provide useful data and commentary including:
Follow-up interviews by Ithaka S+R staff will probe these and other issues further.
Ithaka S+R is required to submit a draft briefing memo to the SAB on June 17, 2011. Ithaka will then take comments from the SAB and the Executive Committee until July 1. Following a two-week revision period, Ithaka will submit a final report to the SAB on July 15, 2011. The SAB will then distribute Ithaka’s report to the HathiTrust membership for full discussion and comment leading up to the Constitutional Convention in October.
Questions or comments regarding the three-year review can be directed to Ed Van Gemert (evangemert@library.wisc.edu [357]), Deputy Director of Libraries at the University of Wisconsin-Madison and Chair of the HathiTrust Strategic Advisory Board.
[Download PDF [359]]
HathiTrust is an international partnership of academic and research institutions dedicated to ensuring the preservation and accessibility of the vast record of human knowledge. The partnership owns and operates a digital repository containing millions of public domain and in copyright volumes digitized from partnering institution libraries. The preserved volumes are made available in accordance with copyright law as a shared scholarly resource for students, faculty, and researchers at the partnering institutions, and as a public good to the world community. For more information, visit HathiTrust.org [79].
26 institutions joined HathiTrust in 2010, doubling the size of the partnership and making a total of 52 institutions that will participate in a constitutional convention next year. In this convention, partners will review repository governance and sustainability and determine directions for the next phase of HathiTrust. View the press release [360].
HathiTrust partners contributed 2.6 million volumes to the repository in 2010, raising the total number of volumes to more than 7.8 million. Nearly 2 million volumes are in the public domain. New institutions to contribute content in 2010 included:
The Executive Committee approved a new cost model for HathiTrust in February 2010, which will be the basis of costs for all partners beginning in 2013. The new model is based on the overlap of partner institutions’ print collections with the digital volumes in HathiTrust. Institutions that do not have large amounts of content to deposit are able to join under the new model before 2013, and more than a dozen have already done so (view the full list of partnering institutions [361]). A FAQ [249] for the new model is available on the HathiTrust website.
Staff members at the University of California and University of Michigan worked together over a period of months to develop specifications and routines to ingest partner materials from the Internet Archive at scale. Well over 100,000 volumes from the Internet Archive have been deposited in HathiTrust by three institutions to-date, and more are on the way. This was a major step in the expansion of HathiTrust’s ability to accommodate content from a variety of digitization sources.
4 new groups were formed in 2010, reflecting both the growing number of partner institutions and the expanding work of the partnership:
Authenticated users from partner institutions are able to access full PDFs of all public domain volumes in the repository, and use a local sign-on to build permanent public or private collections of volumes. More information about Shibboleth [107] can be found on the HathiTrust website.
Over the summer, staff at Indiana University, the University of Wisconsin, and the University of Minnesota joined in work begun at the University of Michigan to review the copyright status of works in HathiTrust published from 1923 to 1963. More than 90,000 volumes have been reviewed since the project began two years ago and approximately 55% of those reviewed have been determined to be in the public domain.
University of Michigan staff added functionality to the Collection Builder application to enable users to add multiple items from full-text search results to public or private collections.
Staff at the University of Michigan developed the capability to deliver full PDFs of all public domain materials through the HathiTrust PageTurner.
Mechanisms and servers were put in place to achieve full redundancy of the large-scale search index, with copies of the index at both the Michigan and Indiana storage sites.
The Communications working group, in conjunction with the Usability working group and developers at the University of Michigan, combined existing interfaces to create a single portal at HathiTrust.org [79] for accessing repository services and finding information about the HathiTrust partnership, infrastructure, and activities.
Members of a multi-institutional working group completed the work of specifying requirements for, and developing, a collaborative environment for the development and enhancement of HathiTrust applications. Documentation of the new environment will be forthcoming in 2011.
A multi-institutional working group was charged with exploring the value of adding a third instance of storage to HathiTrust’s infrastructure. The working group’s report [362] is available on the HathiTrust website.
Staff at the University of Michigan made enhancements to ingest capabilities, including a general increase in processing throughput, improvements in barcode validation, preparation for PREMIS 2.0 support, cleaner integration with pre-ingest transformation processes (for non-Google-scanned materials), and new controls to automatically manage priority levels for content ingested from multiple sources.
The University of California began development of a new bibliographic metadata management system for HathiTrust in November 2010. The system is projected to be operational by the first quarter of 2012.
HathiTrust hosted several “HathiTrust 101” web- and phone-based discussions for new and existing partners in the summer and fall. More of these discussions and informational sessions are planned in 2011.
Staff at the University of Michigan created an application [363] using only publicly available APIs to demonstrate how the Data API [103] could be used to locate and download complete book packages for public domain volumes not digitized by Google (Google-digitized volumes can be accessed through the Data API one page at a time).
HathiTrust will serve as a testbed for research led by Paul Conway, Associate Professor at the University of Michigan’s School of Information, to develop a framework and methodology for validating the quality of content in large-scale digital repositories. Details can be found in the School of Information news release.
Significant progress was made on developing policies, specifications, and technological infrastructure to facilitate the ingest of locally scanned materials from partner institutions at scale.
Staff at the University of California developed search widgets for HathiTrust that can be embedded in local websites to execute catalog and full-text searches. The widgets are available at http://www.hathitrust.org/widgets [364].
The Center for Research Libraries’ report on HathiTrust compliance with the Trustworthy Repository Audit and Certification [365] criteria (TRAC) is expected in early 2011.
The University of Minnesota, in conjunction with the Minnesota Digital Library (MDL) and the Minnesota Historical Society (MHS), has been working with staff at the University of Michigan to develop a prototype workflow for depositing images and associated metadata into HathiTrust for access, storage, and preservation. The prototype project, which includes tens of thousands of digital images from MDL and MHS, is nearing completion. Further details are available in the HathiTrust Update on October Activities [366].
A prototype of the HathiTrust-OCLC catalog will be released in beta in January.
HathiTrust will soon offer rights holders the option to attach Creative Commons [367] licenses to works they wish to make openly accessible in HathiTrust.
With the ingest of image content from Minnesota, the establishment of a HathiTrust Research Center, progress to enable HathiTrust as a platform for digital publishing, and significant steps towards compliance with TRAC, HathiTrust will fulfill all of the initial objectives set by the founding partners (see http://www.hathitrust.org/objectives [2]).
A new version of PageTurner, including the scroll and flip functionality and other features of the open source BookReader [254] software, will be released in early 2011.
The first full re-indexing of HathiTrust volumes will be completed in January.
The Executive Committee has approved the proposal of Indiana University and the University of Illinois for the creation of a HathiTrust Research Center. Details and an announcement will be forthcoming.
The University of Michigan has finalized the terms of an agreement with Google that will allow HathiTrust to distribute the texts of public domain volumes to researchers for scholarly purposes. Details and an announcement will also be forthcoming.
A group of partners from CIC institutions is at work to develop the legal framework and technical implementation criteria to extend full-text access to both public domain and in copyright materials in HathiTrust to users at partner institutions who have print disabilities. Further reports on this work will be given throughout 2011.
As reported in the HathiTrust Update on October Activities [366], the MPublishing division of the University of Michigan Library has engaged in a 2-year effort to create ingest, management, and presentation tools that will enable the use of HathiTrust as a publishing platform for encoded text and page-image materials. The effort will focus first on journal content, with support for books planned at a later stage.
HathiTrust has begun to assemble a database containing the print holdings of partner institutions. The database will facilitate the calculation of costs under the new cost model (see the new cost model FAQ [249]), as well as broader partner activities around cooperative collection management and development. Work on the database will continue through 2012, to be completed by the time the new model takes effect in 2013.
The HathiTrust partnership will hold a major meeting in October 2011 to conduct a formal review of HathiTrust governance and sustainability and shape future directions for the partnership.
Shared Digital Repository
Update on April 2008 Activities
9 May 2008
This is the second regular update on activities in the Shared Digital Repository (SDR). These updates will be made available monthly, typically on the 2nd Friday of the month, and will provide a variety of information about the general health of the repository and updates on the development of the SDR. Each update will be sent via e-mail to an official representative (typically the library director) of a participating institution, and will be posted on the SDR website. We plan to make an RSS feed for the updates available soon, in order to share the information as broadly as possible. Throughout this update, we refer to the draft Short-Term and Long-Term Functional Objectives (being articulated by the CIC’s SDR committee) as a work item relates to those Objectives.
As of April 30th, the SDR contains:
In a future update, we will provide a link to our draft response to the required elements in the Trustworthy Repositories Audit & Certification (TRAC): Criteria and Checklist. As mentioned in the last update, we coordinated a site visit by a team from the Digital Repository Audit Method Based on Risk Assessment (DRAMBORA) effort in the European Union. Their report, which gives an extremely favorable review of the SDR, should be released publicly soon. (CIC SDR Short-Term Functional Objectives)
We schedule system maintenance work that requires a system outage during time windows (in Eastern time) where academic user activity is generally lowest:
Advance notice for scheduled outages is given on business days and at least 24 hours in advance. Notice of unscheduled outages is given upon discovery, and additional updates are given as appropriate.
Please contact Phyllis White (pmwhite at umich.edu) with email addresses of individuals or groups that should be added to our system outage mailing list.
At this time, the following outages are scheduled:
May 8, 2009 [Download PDF [368]]
Temporary Beta Catalog Released – A major milestone for HathiTrust was reached in April, as a temporary beta catalog for HathiTrust was released on April 24. The catalog provides bibliographic search and faceted browsing of all volumes in HathiTrust, and integrates with the HathiTrust Page Turner to provide access to individual items. It can be accessed at http://catalog.hathitrust.org [369]. Further integration with the HathiTrust Collection Builder, as well as other enhancements, are planned for a second phase of development. This catalog (including phases 1 and 2) is temporary, pending the release of the permanent HathiTrust catalog to be developed by OCLC in conjunction with HathiTrust partners (see the final news item on this page for details). Michigan and California also discovered a strong mutual interest in improving functionality in the HathiTrust Page Turner, which will be explored further in May.
Ingest Started From Indiana University And The University of California – In a month of significant developments for HathiTrust, ingest of content from both Indiana University and the University of California began in April. The loading of bibliographic metadata for the initial set of Indiana volumes was completed and approximately 10,100 volumes had been ingested by May 1. Bibliographic loading continues for the University of California, and ingest started late in April. Several hundred volumes are now available in the repository.
HathiTrust-WorldCat Local Project – In April, the HathiTrust-WorldCat Local Implementation project team, consisting of members from HathiTrust libraries and OCLC, met in Chicago to begin the process of creating a production-level bibliographic discovery interface for HathiTrust. The initial version, due out in the first quarter of 2010, will build on WorldCat Local’s standard functionality and will be tailored to HathiTrust’s entirely digital collection. The design of HathiTrust’s recently released temporary beta catalog (using VuFind) will also inform this project’s requirements and interface design. Some of the project’s overall priorities will be:
For more information, please contact John Butler (j-butl@umn.edu [370]), Lee Konrad (lkonrad@library.wisc.edu [371]) or Bill Carney (carneyb@oclc.org [372]).
Storage – New storage that was purchased in March has been installed and is operational at the Michigan site. The storage at Indiana has been received, and installation is scheduled for mid-May.
Large-scale Search – Michigan and California developers shared experiences and ideas in a fruitful discussion about Lucene-based search engines, XTF and Solr. Investigations into software solutions for improving response times for slow queries led us to add common-gram indexing and searching capabilities to Solr, significantly improving performance of slow phrase queries. Common-grams increase index size, but the difference so far seems to be manageable and worthwhile. We are continuing to refine a hardware configuration to use for Solr servers based on discussions of indexing workflows and continuing research of different indexing algorithms, which have an impact on storage requirements.
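The effect of common-gram indexing can be illustrated with a small tokenizer. This is a simplified sketch of the general technique, not Solr's implementation, and the list of common words is an arbitrary example.

```python
from typing import Iterable, List, Set

# Illustrative list only. A real deployment would derive it from the corpus.
COMMON_WORDS: Set[str] = {"the", "of", "and", "to", "a", "in"}


def common_grams(tokens: Iterable[str], common: Set[str] = COMMON_WORDS) -> List[str]:
    """Emit each token, plus a joined bigram wherever a common word is involved."""
    words = [token.lower() for token in tokens]
    output: List[str] = []
    for index, word in enumerate(words):
        output.append(word)
        if index + 1 < len(words):
            following = words[index + 1]
            if word in common or following in common:
                # The bigram is far rarer than either common word alone,
                # so phrase queries can match it against a short posting list.
                output.append(f"{word}_{following}")
    return output


if __name__ == "__main__":
    print(common_grams("the rights of man".split()))
```

For the phrase "the rights of man" the sketch emits the four original words plus `the_rights`, `rights_of`, and `of_man`. The extra terms are the source of the index growth mentioned above.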
Data API – Useful feedback was received from California Digital Library staff on the first draft of a functional specification for the HathiTrust Data API and a response is in the works. The draft is online (http://www.hathitrust.org/hathitrust_data_api [373]). Coding of an alpha version of the Data API is done, and limited use of the API will start in May; CDL will use it to validate ingest of UC content into the repository.
Development Environment – We are in the early stages of conceiving a new development environment for building and testing repository applications and services. Server hardware has been allocated to this purpose and setup will take place as design discussions progress.
Outages
PLEASE NOTE: Please contact Chris Butchart-Bailey (chrisbu at umich.edu) with email addresses of individuals or groups that should be added to our system outage mailing list to receive information about unscheduled outages.
We schedule system maintenance work that requires a system outage during time windows (in Eastern time) where academic user activity is generally lowest:
Advance notice for scheduled outages is given on business days, at least 24 hours in advance. Notice of unscheduled outages is given upon discovery, and additional updates are given as appropriate.
[Download PDF [374]]
HathiTrust released new functionality for its PageTurner application in April, improving the way volumes in the repository can be viewed and used. Enhancements to the PageTurner include:
Development of the new functionality was initiated by staff at the California Digital Library (CDL) in HathiTrust’s collaborative development environment, and completed by staff at the University of Michigan. The Usability Working Group provided input and feedback on the interface design. The new views were built using Open Library’s open source BookReader [254]. The thumbnail view was created specifically for HathiTrust by CDL staff, and has been incorporated as a standard feature in the core BookReader software.
We welcome comments and feedback on the new PageTurner. Please use the “Feedback” link that appears in the upper right corner of the page when viewing HathiTrust volumes, or email feedback@issues.hathitrust.org [294].
HTPub is an effort of the MPublishing Division of the University of Michigan Library to enable the use of HathiTrust as a platform for publishing open access electronic journals. It was first reported on in the Update on October 2010 Activities [375], and has been in planning stages over the winter. MPublishing recently hired a summer intern who will be working with Michigan staff to define requirements for archival objects produced through HTPub. Michigan is in the process of hiring two full-time positions to support the work of the initiative. More information is available on the HTPub project page [223].
John Butler of the University of Minnesota, John Weise of the University of Michigan, and project consultant Eric Celeste briefed CNI membership at the Spring 2011 Membership Meeting on the Minnesota Digital Library-HathiTrust image content prototype project. A summary of the project and slides for the presentation are available at http://www.hathitrust.org/mdl_images [272]. Access to the images, now in the HathiTrust repository, will be enabled in late May or June. MDL has yet to draw conclusions regarding deposit of images in HathiTrust beyond the prototype phase. However, much has been learned throughout the project, and HathiTrust intends to use the prototype and the experience gained as a base for developing general image ingest specifications that can be used for ingest of images from partner libraries.
HathiTrust has begun to post weekly reports on the ingest status of content submitted by partner institutions. The reports [255] are available on the HathiTrust website, as well as a description [376] of the information the reports include.
Michigan staff worked with Universidad Complutense de Madrid, Yale University, and the University of Illinois in April on ingest of locally-digitized volumes. We expect to begin ingest of volumes from Madrid in May, as well as the full set of volumes from Yale (a sample was ingested in December).
Ingest of an initial set of more than 50,000 volumes from Harvard University was completed in April.
The Collections Committee continues to work on a series of recommendations regarding duplicate volumes in HathiTrust, coordinated print management, and responding to user requests to contribute volumes to the repository. A draft discussion paper on duplicates will be shared with the Strategic Advisory Board in June for initial feedback.
The Communications Working Group finished a round of new partner webinars on April 12 and 15. The webinars were well-attended and generated questions and rich discussion. The webinar slides and audio recording are available on the HathiTrust website. The working group also continued to craft a Facebook presence for HathiTrust, plan for a HathiTrust blog, and develop informational materials for use by partner libraries.
The Usability Working Group made significant progress in April in developing a set of personas for HathiTrust users and scenarios of use. To help inform this draft set, the group has been gathering real life use cases from user feedback, reference interactions with users, and uses of HathiTrust that have been posted in blogs and tweets. It has also been analyzing HathiTrust usage statistics for trends. The personas and scenarios are intended to inform development and policy-making surrounding HathiTrust applications and interfaces. The group anticipates having the draft set of personas and scenarios ready to share with partner institutions and other HathiTrust working groups in May. The personas will be refined over time as additional use cases are assembled and user research conducted.
The Usability Group is still accepting volunteers to join the new User Experience Special Interest Group (UX-SIG), reported in February’s update. Please contact Suzanne Chapman (suzchap@umich.edu [170]) if you are interested in joining this group or have any questions about participation.
During March and April, the chair of the User Support Working Group coordinated with staff members at the University of Michigan who have been handling user feedback for HathiTrust, to configure a partner-wide issue tracking system using JIRA. User Support members began accessing the system in April and observing the preliminary processes that had been put in place. The working group will assume responsibility for responding to issues and directing feedback as appropriate to partner institutions and working groups in May. Michigan staff will continue to play an integral role in addressing issues related to content quality and bibliographic metadata.
The grant project team continued to refine definitions for the preliminary set of quality errors they have identified within volumes, and make improvements to the quality review application interface. The team continued to focus on dual review of volumes (two reviewers coding the same set of volumes) to identify problematic error definitions and refine descriptive wording to better illustrate each error type. The team also revised definitions for the scale of severity that is applied to errors, in order to improve inter-coder consistency. A second sample of 10 public domain volumes was reviewed by project staff to provide sufficient data for the project statistician to develop appropriate sampling techniques for Phase Two of the project: production level coding. The University of Minnesota will be joining in data collection efforts and will begin remote reviewing in the next two months after a series of training sessions with members of the project team. Background information on the project can be found on the grant projects page [305].
The HathiTrust Metadata Management System team completed development of the core database system in April, as well as an API to export bibliographic data in XML format. Approximately 200,000 records have been loaded into the system for initial testing. The team is analyzing MARC records from current content-contributing partner institutions, received from the University of Michigan, looking for irregularities and performing a general survey of the record set. CDL staff continue to interview for a Principal Metadata Analyst. Details on the project are available at http://www.hathitrust.org/htmms [12].
Staff at Michigan have completed a rough draft of requirements for improved security in the Data API based on symmetric key cryptography. The draft will be made available for comment in the near future.
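The draft requirements are not reproduced in this update, so the following is only a minimal sketch of how request signing with a shared (symmetric) key generally works. It assumes HMAC-SHA256 and a hypothetical message layout of method, path, and timestamp; the function names and the layout are illustrative assumptions, not part of the Data API.

```python
import hashlib
import hmac


def sign_request(secret_key: bytes, method: str, path: str, timestamp: int) -> str:
    """Return a hex signature over the request, using a key shared by both sides.

    The message layout (method, path, timestamp joined by newlines) is a
    hypothetical choice for illustration only.
    """
    message = f"{method}\n{path}\n{timestamp}".encode("utf-8")
    return hmac.new(secret_key, message, hashlib.sha256).hexdigest()


def verify_request(secret_key: bytes, method: str, path: str,
                   timestamp: int, signature: str) -> bool:
    """Recompute the signature on the server side and compare in constant time."""
    expected = sign_request(secret_key, method, path, timestamp)
    return hmac.compare_digest(expected, signature)
```

Because both parties hold the same key, the server can check a signature by recomputing it, and a request altered in transit (or signed with a different key) fails verification.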
New MySQL servers installed in the development environment by staff at the University of Michigan have boosted performance of print holdings database operations by an order of magnitude. Similarly-configured servers will be installed in the production environment in May.
Michigan staff began development work on priority features for full-text search as identified in the Full-Text Search Working Group’s report. The implementation team is focusing initially on relevance ranking of search results based on a combination of full-text OCR and bibliographic metadata, and on faceting of results using bibliographic metadata. The goal is to release significant new features that use the bibliographic data to enhance full-text search results by July 1, 2011.
All replacement storage equipment at the Michigan and Indiana storage sites is online and in use. The storage equipment that was replaced is being wiped for security purposes by staff at the University of Michigan and will be traded in for a credit on new storage that will be purchased in June 2011.
There were no outages in April.
All HathiTrust papers, presentations, and reports are available at http://www.hathitrust.org/papers [356].
Number of volumes added:
| | April | Total |
| Columbia University | 3 | 58,483 |
| Cornell University | 40,729 | 311,110 |
| Harvard University | 52,709 | 52,709 |
| Indiana University | 893 | 183,881 |
| Library of Congress | 0 | 71,418 |
| New York Public Library | 0 | 258,691 |
| Penn State University | 18 | 39,016 |
| Princeton University | 8,810 | 237,034 |
| University of California | 41,512 | 2,408,727 |
| The University of Chicago | 0 | 5,172 |
| University of Illinois | 0 | 14,501 |
| University of Madrid | 15,486 | 103,797 |
| University of Michigan | 19,974 | 4,338,368 |
| University of Minnesota | 1,419 | 84,985 |
| University of Wisconsin | 10,602 | 454,332 |
| Yale University Library | 0 | 161 |
| Total | 192,155 | 8,662,385 |
Public Domain (~27%)
| Total* | 181,909 | 2,386,430 |
* This count includes volumes already in the repository to which rights holders have newly opened access
September 12, 2008
Growth
Forecast for September development
Outages
PLEASE NOTE: We do not yet have contact email addresses for institutions for notification. As the service becomes more widely used, this will be an essential means of communication. Please contact Chris Butchart-Bailey (chrisbu at umich.edu) with email addresses of individuals or groups that should be added to our system outage mailing list.
We schedule system maintenance work that requires a system outage during time windows (in Eastern time) where academic user activity is generally lowest:
Advance notice for scheduled outages is given on business days and at least 24 hours in advance. Notice of unscheduled outages is given upon discovery, and additional updates are given as appropriate.
September 11, 2009 [Download PDF [381]]
Working Group On Computational Research Center – The Research Center proposal planning group has made great progress in the last month. The group has continued discussions on the types of research that could utilize the centers, how results might be shared, and what environments/datasets are best suited to which types of research. In bi-weekly calls, subgroup meetings, and individual interviews, the team has been working through difficult issues such as defining non-consumptive research and recognizing hurdles related to the management and publication of research results. Next steps include developing a draft plan for the infrastructure of the centers and marrying legal and security restrictions with that infrastructure. The group aims to have a draft proposal prepared in early October and a full proposal completed later that month.
Working Group on Development 'sandbox' – Based on a general conversation with the working group and a useful discussion of potential use cases with UC staff during their Ann Arbor visit, staff at the University of Michigan have gathered enough information to start building the development environment. The initial goal is to support all of the current development projects in a single place, and provide a large subset of content with which to work. The new environment will be a substantial improvement over current conditions, and should be a building block for additional capabilities later on, including significant partner development. Michigan has racked, cabled, and started operating system installs on the equipment set aside for the project. When further progress has been made on the base installations the full working group will assemble to discuss the provisions of the environment.
University of Michigan Press Backfile and Reprint Purchase Links in HathiTrust – HathiTrust is collaborating with the University of Michigan Scholarly Publishing Office and the University of Michigan Press to open access to the majority of the published backfile of the UM Press in HathiTrust. The volumes, which are being digitized by the Press, will be available in HathiTrust with an option to purchase a print-on-demand copy in mid to late October.
HathiTrust Disaster Preparedness – Over the summer, an IMLS grant-funded intern in digital preservation performed an in-depth evaluation of disaster preparedness in HathiTrust. The report provides detailed information about the strengths of HathiTrust’s current disaster recovery planning, as well as recommendations for improvements in the short-, intermediate-, and long-term. It is available at http://www.hathitrust.org/technical_reports/HathiTrust_DisasterRecovery.pdf [382].
Prototype for New HathiTrust PageTurner — Staff from the University of California and University of Michigan held two teleconferences in August to discuss deeper integration of the UC prototype PageTurner into the existing application. Team members discussed strategies for offering full development capabilities on a limited amount of HathiTrust content in advance of the development ‘sandbox’ environment. A working strategy has been reached and a development space should be available in October. UC has continued in the meantime to improve GnuBook functionality with thumbnail views of page images and the ability to display full-text OCR. Staff at UM are investigating ways to alter current processes that make access-quality images available to the PageTurner, to produce images that can be used by the GnuBook.
METS Profile Available — Staff at the University of Michigan have created a version 1.0 METS profile for HathiTrust content, which can be downloaded at http://www.hathitrust.org/preservation [383]. The profile currently applies only to Google content in HathiTrust, but will be updated to reflect requirements for locally-scanned content and volumes digitized by the Internet Archive.
Returned Duplicates — For several years, Google has been working on ways to reduce duplication in its digitization workflow. In August, it implemented processes that use metadata to detect volumes that have been scanned previously at other institutions so identical volumes will not be scanned again. The number of volumes rejected in this de-duplication effort has raised concerns among HathiTrust institutions about the accuracy of Google’s detection processes. The University of California, the University of Wisconsin, Indiana University, and the University of Michigan have undertaken a review of volumes returned as duplicates to better understand how duplicate determination takes place. The four universities have identified a target set of materials to review and are finalizing methodology to perform a manual evaluation. It is hoped that the results will be available for the Google library partner summit later this month.
Mobile Interface — Michigan made significant progress on the development of a mobile interface to the HathiTrust Catalog in August. The work continues, and staff will next turn their attention to the PageTurner application. Initial development will be followed by user testing for both applications.
Large-scale Search – After additional search performance testing in August, an improved index configuration was established by staff at the University of Michigan using a punctuation filter and a list of 400 common words (see blog post for details: http://www.hathitrust.org/blogs/large-scale-search/tuning-search-perform... [384]). This index configuration will be put into production on the new dedicated server hardware, which was installed in August. Michigan also completed additions to the indexing control software (SLIP) to support distribution of indexing across several servers, each with multiple Solr index shards. A continuous indexing strategy for this distributed system, along with the corresponding storage configuration and scripting, has been implemented, and the first indexing tests will have begun by the time this report is published.
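This update does not say how the list of common words is applied at index time. One technique that uses such a list, assumed here purely for illustration, is "common grams": punctuation is replaced with spaces, and any pair of adjacent words in which at least one is a common word is also indexed as a single combined token. The three-word list below is a hypothetical stand-in for the list of 400.

```python
import string

# Hypothetical stand-in for the list of 400 common words.
COMMON_WORDS = {"the", "of", "and"}

# Map every ASCII punctuation character to a space (a simple punctuation filter).
_PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def tokenize(text: str) -> list[str]:
    """Lowercase the text, strip punctuation, and split on whitespace."""
    return text.lower().translate(_PUNCT_TO_SPACE).split()


def common_grams(tokens: list[str], common: set[str] = COMMON_WORDS) -> list[str]:
    """Emit every token, plus a bigram wherever either neighbour is a common word."""
    out = []
    for i, tok in enumerate(tokens):
        out.append(tok)
        if i + 1 < len(tokens) and (tok in common or tokens[i + 1] in common):
            out.append(f"{tok}_{tokens[i + 1]}")
    return out
```

For example, `common_grams(tokenize("The History of Rome."))` yields `the`, `the_history`, `history`, `history_of`, `of`, `of_rome`, `rome`. The combined tokens are far rarer than the common words themselves, which is why this kind of approach can speed up phrase queries that contain very frequent words.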
Ingest – The number of volumes ingested dropped significantly in August as ingest rates caught up with the rate at which partner content was made available from Google.
Data API – Ed Summers provided insightful and constructive feedback on the HathiTrust Data API in a blog posting in mid-August (http://inkdroid.org/journal/2009/08/13/open-to-view/ [385]). The comments are being reviewed by University of Michigan staff.
Collection Builder – Two new APIs for Collection Builder are being tested by staff at Michigan. The first returns the list of collections owned by a user. The second adds multiple items to a collection. These APIs will support future integration of Collection Builder functionality into other applications, such as the HathiTrust temporary catalog.
Outages – On Wednesday August 5 from 8:15pm to 9:30pm EDT, service was degraded (service may have been unavailable to some users) due to a storage system problem at the Indiana site. From Sunday August 23 at 6:30pm EDT to Monday August 24 at 8:00am EDT, on Wednesday August 26 from 5:00pm to 6:00pm EDT, and on Friday August 28 from 7:25pm to 8:35pm EDT, service was degraded due to network connectivity problems affecting database servers.
Software and firmware upgrades were performed during the weeks of August 10 and 17 at both sites without incident or interruptions in service. The upgrades conducted during the week of August 17 were preventative in nature and addressed a hardware problem, discovered by the storage system provider, that was the underlying cause of the service disruption on August 5.
The cause of the other outages has been thoroughly researched but is still not known; workarounds that eliminate any service impact have been put into place, systems are being monitored, and investigation into the problem continues.
Number of volumes added:
| | August | Total |
| Indiana University | -- | 18,482 |
| University of California | 148,810 | 457,494 |
| University of Michigan | 58,878 | 3,129,152 |
| University of Wisconsin | -- | 215,045 |
| Total | 207,688 | 3,820,173 |
September 10, 2010 [Download PDF [386]]
Princeton University Joins HathiTrust – The full announcement can be found at http://bit.ly/bEbkSb [388] and more information is available at HathiTrust.org [389]. We are very excited to welcome Princeton University Library and look forward to the ways they will strengthen and enrich our partnership.
HathiTrust 101 – Members of the Communications working group and John Wilkin, the Executive Director of HathiTrust, hosted two informal “HathiTrust 101” sessions for working group members and directors of partner libraries in August. The webinars were initiated in connection with the recent growth in partnership and the deepening involvement of member institutions in new working groups and the Collections committee. The purpose was to provide an overview of foundational elements of HathiTrust, including mission, governance, finances, and collections, as well as updates on current activities and areas of focus. A third session is scheduled in September, and plans are being considered to hold similar sessions on a periodic basis to keep partners updated about recent and upcoming developments, answer questions, and receive feedback on partner activities and plans. Slides from the “HathiTrust 101” presentation are available at http://www.hathitrust.org/documents/HathiTrust101-201008.ppt.
September Meeting In Chicago – Staff from a number of partner institutions, including members of the Executive Committee, Strategic Advisory Board, and several HathiTrust working groups, will be meeting in Chicago on September 23 and 24 to discuss a broad array of issues and plans. Some of these, in addition to topics regularly reported in this newsletter, include the new cost model to be implemented in 2013, and the constitutional convention of partners to be convened in 2011. Institutions who join HathiTrust on or before October 31, 2010 will be eligible to participate in this convention, in which partners will conduct a formal review of HathiTrust governance and sustainability and shape new directions for the partnership.
Local Digitization Ingest – Staff members at the University of Michigan continued to work on the first draft of a policy and specifications framework for ingesting locally digitized content into HathiTrust. Staff have begun to use the framework to evaluate a sample of materials submitted by the University of Illinois, and the framework will go out to partner institutions for comment and further trial in September. HathiTrust plans to begin ingest of locally digitized content from Illinois and other CIC institutions in the fall.
Communications – The Communications Working Group continued to discuss issues surrounding the redesign of the HathiTrust website, as well as plans and processes for receiving new partners.
Development Environment – Staff at the University of Michigan have nearly completed migration of the code for HathiTrust applications to the development environment, including establishing the methods and scripts needed to deploy applications into production. Focus has shifted from migrating code to staging, deploying, and testing applications in development and production areas of the new environment. Developers at Michigan have begun to transition to the new environment and system administrators have configured and opened access to additional servers to support this transition. Networking changes to provide access from the integration testing area of the development environment to the full repository were also completed.
Discovery Interface – At the end of August, OCLC had loaded over 3.7 million HathiTrust records into WorldCat. This constitutes 98% of the available HathiTrust records. The Discovery Interface team is planning a beta release of the phase 1 HathiTrust-OCLC catalog at a date to be determined, pending some final adjustments to the interface to be completed by OCLC. The Discovery Interface team, in conjunction with OCLC, is planning usability analysis that will start before the catalog is released and continue throughout the beta release phase.
The Discovery Interface team is also looking forward to a face-to-face meeting in September, during a larger meeting of HathiTrust partners in Chicago. The agenda will include: taking stock of Discovery Interface projects and activities to date, setting the purpose and scope for future work, supporting the Discovery Interface Full Text Search subgroup, and creating a roadmap for phase 2 of the HathiTrust-OCLC catalog.
Usability – The Usability Working Group has begun regular meetings and is in the process of setting priorities and defining member roles in relation to other committees.
Columbia – HathiTrust began ingest of volumes contributed by Columbia University in August, including both Google- and Internet Archive-digitized volumes. This was the first set of Internet Archive-digitized materials to be ingested since the initial deposit by the University of California in April, when specifications for Internet Archive-digitized content in HathiTrust were developed.
Yale – Staff from Yale and the University of Michigan have been working to determine the pre-ingest transformation steps needed for Yale’s Microsoft-digitized volumes and transfer the content to servers at the University of Michigan, where it will be ingested. Both of these tasks are nearly finished, and we hope to begin ingest of Yale’s initial set of volumes by the end of September.
Bibliographic Metadata Management – University of California staff are collaborating with staff at the University of Michigan to produce a series of planning documents for a HathiTrust Metadata Management system to replace the system currently in use. The goal is to prepare a set of documents for in-person review at the September meeting in Chicago. Teams are at work on documents that will codify goals, success criteria, system requirements, development, integration and migration strategies, acceptance testing and project timelines and milestones.
Large-scale Search – Michigan staff continued tests to determine the effects of cache warming on performance. Staff also continued the tests related to scaling strategy and indexing speed that were reported in the Update on June Activities [390].
PageTurner – Staff at Michigan improved the way that PDFs are created for books with landscape-oriented pages.
Storage Upgrade – Michigan staff completed the same upgrade at the Michigan storage site that was completed at the Indiana site in July: adding 160 terabytes of new storage, replacing cluster interconnect switches, reorganizing the equipment layout, and recabling all servers and storage. As reported in the Update on July Activities [391], the usable storage capacity at each site is now 475 terabytes.
Outages – HathiTrust full text search was unavailable on Friday, August 20 from 2:40-2:45pm EDT due to an accidental release of a software module from the new development environment while troubleshooting a full-text indexing problem. Full text search may also have been unavailable for some users from approximately 2:30pm on Friday, August 27 to Monday, August 30 at 3:30pm due to a network file system locking problem at the Michigan site.
UC Validation Tool – Staff at the University of California are developing an automated tool to validate the completeness and correctness of objects ingested into HathiTrust and retrieved through the Data API [103]. The tool will be used initially to validate samples of ingested Google- and Internet Archive-digitized objects in comparison with their pre-ingest originals. A prototype of the tool is scheduled for demonstration by the end of September.
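The design of the UC tool is not described in this update. As a rough sketch of what comparing a retrieved object with its pre-ingest original can involve, the following assumes each object is represented as a mapping from file name to file contents and that SHA-256 digests are used for comparison; both the representation and the function names are hypothetical.

```python
import hashlib


def digest(data: bytes) -> str:
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(data).hexdigest()


def validate_object(original: dict[str, bytes], retrieved: dict[str, bytes]) -> dict[str, list[str]]:
    """Compare a retrieved object against its pre-ingest original.

    Completeness: every original file is present (and nothing unexpected was added).
    Correctness: files present on both sides have identical digests.
    """
    shared = set(original) & set(retrieved)
    return {
        "missing": sorted(set(original) - set(retrieved)),
        "unexpected": sorted(set(retrieved) - set(original)),
        "mismatched": sorted(
            name for name in shared
            if digest(original[name]) != digest(retrieved[name])
        ),
    }
```

An object passes when all three lists are empty. In practice some files are deliberately transformed during ingest, so a real tool would need rules for expected differences; this sketch treats any difference as a finding.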
SFX HathiTrust Target – California staff are packaging code for an SFX HathiTrust target for partners who also license the Ex Libris SFX software. UC expects to announce the availability of the code to partners in late September. The target will be offered through the Ex Libris EL Commons wiki later in the fall.
Number of volumes added:
| | August | Total |
| Columbia University | 56,730 | 56,730 |
| Indiana University | 286 | 177,962 |
| Penn State University | 10,202 | 33,357 |
| University of California | 133,900 | 1,769,227 |
| University of Michigan | 40,866 | 4,130,008 |
| University of Minnesota | 54 | 73,674 |
| University of Wisconsin | 14,167 | 379,111 |
| Total | 199,475 | 6,563,339 |
Public Domain
| Total (~20%) | 55,132 | 1,311,288 |
| HathiTrust 101 | August 5 and 27 |
| University of Iceland | August 5 |
| IFLA 2010 (paper and presentation) | August 15 |
January 9, 2009
Outages:
PLEASE NOTE: Please contact Chris Butchart-Bailey (chrisbu at umich.edu) with email addresses of individuals or groups that should be added to our system outage mailing list.
We schedule system maintenance work that requires a system outage during time windows (in Eastern time) where academic user activity is generally lowest:
Advance notice for scheduled outages is given on business days and at least 24 hours in advance. Notice of unscheduled outages is given upon discovery, and additional updates are given as appropriate.
January 15, 2010 [Download PDF [393]]
Columbia Partnership – HathiTrust is very pleased to welcome Columbia University as its newest partner. A representative of HathiTrust will be travelling to Columbia in late January to give a full introduction to repository operations, current activities, and future plans. We look forward to the experience and expertise that Columbia will bring to the enterprise, and the new possibilities that are opening for HathiTrust as it continues to expand its membership and its collections. A full press release on the new partnership can be read at http://www.columbia.edu/cu/lweb/news/libraries/2009/20091216.hathi.html.
5 Million Volumes – A significant milestone was passed in December as HathiTrust exceeded 5 million volumes in digital holdings. More than 3/4 of a million of these are in the public domain. A steady rate of growth is expected to continue in 2010, and partner collections are projected to grow to more than 8 million volumes.
TRAC Audit – In early December, HathiTrust began a process with the Center for Research Libraries (CRL) to assess the digital repository in relation to the Trustworthy Repositories Audit and Certification (TRAC) criteria. The assessment is scheduled to proceed until mid-February, and the findings will be publicly available. More information about the audit can be found on the CRL website at http://www.crl.edu/archiving-preservation/digital-archives/certification-and-assessment-portico-and-hathitrust.
Bib API – HathiTrust has released a new bibliographic API that enables retrieval of descriptive and rights information for objects in the repository based on standard identification numbers (e.g., ISBN, ISSN, LCCN, OCLC). The API is a replacement for the (now deprecated) Rights API [394] and the specification is available at http://www.hathitrust.org/bib_api.
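As an illustration of an identifier-based lookup, the sketch below builds a request URL for one identifier. The base URL and path layout are assumptions based on the publicly documented form of the Bib API and may not match the service as first released; the specification linked above is authoritative. The helper name `build_bib_url` is hypothetical, and only the identifier types named in this update are accepted.

```python
# Assumed base URL and path layout; check the Bib API specification before use.
BIB_API_BASE = "https://catalog.hathitrust.org/api/volumes"

# Identifier types named in this update.
ID_TYPES = {"isbn", "issn", "lccn", "oclc"}


def build_bib_url(id_type: str, id_value: str, detail: str = "brief") -> str:
    """Build a lookup URL for a single standard identifier.

    `detail` is assumed to be either "brief" or "full".
    """
    id_type = id_type.lower()
    if id_type not in ID_TYPES:
        raise ValueError(f"unsupported identifier type: {id_type!r}")
    if detail not in {"brief", "full"}:
        raise ValueError(f"unsupported detail level: {detail!r}")
    return f"{BIB_API_BASE}/{detail}/{id_type}/{id_value}.json"
```

For example, `build_bib_url("oclc", "424023")` produces a URL ending in `/brief/oclc/424023.json`, which could then be fetched with any HTTP client to obtain descriptive and rights information for matching volumes.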
Discovery Interface – OCLC is completing preparations for the import of HathiTrust data into WorldCat Local (WCL). The installation of a HathiTrust WCL instance is scheduled to be complete in late February, and loading of records into this first version of the joint catalog will begin in March 2010. Looking towards version 2 of the catalog, the HathiTrust-partner working group began reviewing its scope and membership needs as its purview expands beyond bibliographic metadata in the catalog to include the integration of features such as full-text search and the HathiTrust Collection Builder. The group was renamed the HathiTrust Discovery Interface Working Group (from HathiTrust/OCLC Catalog) to reflect this broadening scope. In December, the HathiTrust Executive Committee approved a proposal to have the working group report to the Strategic Advisory Board (SAB), ensuring stronger alignment of the development and delivery of discovery services with future directions in HathiTrust as a whole.
Collaborative Development Environment – Staff at the University of Michigan completed setup of one of the servers that will be used in the initial proof-of-concept partner development environment. The server is configured with all of the tools and software needed to support the PageTurner development that the University of California and Michigan engaged in collaboratively in 2009. A developer at UC has begun to test features of the environment and will be reporting and providing feedback to the working group when the full group is re-engaged in January.
Research Center – The RFP produced by the working group was approved by the Executive Committee in December and is available on the HathiTrust website at http://www.hathitrust.org/documents/hathitrust-research-center-rfp.pdf [395].
Internet Archive Ingest – During the month of December, staff from UC and UM finalized many of the procedures and conventions related to the ingest of Internet Archive-digitized books into HathiTrust. These included file identification, preservation and technical metadata elements, content transformation and validation processes, error logging, and exception handling. UC delivered bibliographic metadata for an initial set of IA-digitized volumes to UM, and UM worked steadily on coding the transformation and validation processes for ingest. An end-to-end pilot test, including download, ingest, and quality review of ingested items will be performed in late-January.
New Programmer For Non-Google Ingest – Applications are still being taken for a programmer to receive and prepare non-Google materials for ingest into HathiTrust. Review of applications and interviews are being conducted simultaneously. The bidding process will close in mid-January, but will be extended again if an applicant is not selected. Full-time and part-time positions are being considered, and it is increasingly likely that one of each may be filled.
Shibboleth – In the near future HathiTrust will be implementing Shibboleth as a mechanism for inter-institutional authentication into HathiTrust. Distributed authentication will make it easier for users to take advantage of personalized services in HathiTrust, such as the Collection Builder. It will also enable the delivery of enhanced services to HathiTrust partner institutions. Staff at UM discussed the implementation strategy for Shibboleth in December and installed the Shibboleth service provider software on development servers to begin the work of integration. A forecast for the timeline of implementation will be included in the next update.
Large-scale Search – Staff at UM continue to refine the daily index update and release workflow, making it more resilient to problems that are sometimes encountered during indexing. New server equipment will soon be purchased for use at the Indiana site, and a schedule projected for continuous new hardware acquisition to maintain performance levels as the size of the index grows. As part of index and query response time testing, UM staff also updated and released a revised cache-warming procedure based on production log analysis. Warming (pre-populating) the cache of completed queries improves search performance.
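The cache-warming procedure itself is not described here. A minimal sketch of the general idea, assuming a log with one query string per line and a caller-supplied `run_query` function (both hypothetical), is to pick the most frequent logged queries and replay them against a fresh index so that their results are already cached when users arrive.

```python
from collections import Counter
from typing import Callable, Iterable


def top_queries(log_lines: Iterable[str], n: int) -> list[str]:
    """Return the n most frequent queries in the log, most frequent first.

    Queries are normalized by trimming whitespace and lowercasing; blank
    lines are ignored.
    """
    counts = Counter(
        line.strip().lower() for line in log_lines if line.strip()
    )
    return [query for query, _ in counts.most_common(n)]


def warm_cache(queries: list[str], run_query: Callable[[str], object]) -> int:
    """Replay each query once so its results are cached; return the number run."""
    for query in queries:
        run_query(query)
    return len(queries)
```

Usage would look like `warm_cache(top_queries(open("queries.log"), 1000), search)`, where `queries.log` and `search` are placeholders for a production query log and the search client.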
Outages – There were no outages in December.
(What is your institution doing with HathiTrust? Let us know!)
UC and SFX – A University of California group has started work on a project to demonstrate proof-of-concept success in exposing HathiTrust public domain books through UC’s UC-eLinks service (SFX). The project is investigating the various HathiTrust APIs capable of supporting this service, and in addition to gathering usage statistics for the new target, will report on the functionality, usefulness, and viability of each of the APIs for future endeavors. The target will eventually be made available to ExLibris so that it can be added to the SFX package for all customers, but will be available to HathiTrust partners who use SFX before then.
Number of volumes added:
| | December | Total |
| Indiana University | 16,923 | 133,482 |
| Penn State University | 233 | 5,016 |
| University of California | 263,089 | 1,155,367 |
| University of Michigan | 230,881 | 3,659,874 |
| University of Wisconsin | 12,137 | 267,353 |
| Total | 516,514 | 5,221,092 |
[Download PDF [396]]
From September through December 2010, the University of Minnesota worked with HathiTrust on a prototype project to add 50,000 image objects and associated metadata from the collections of the Minnesota Digital Library, and another 8,000 from the Minnesota Historical Society. To date, numerous lessons have been learned regarding format standards, identifiers, and rights issues related to image data sourced from different institutions. The project is also expected to shed some light on the costs of archiving image data in HathiTrust relative to that for published books and journals. Completion of the project and release of the final report are expected in the next month. For more information, please contact John Butler (j-butl@umn.edu [370]).
Staff at the University of Michigan incorporated feedback received from a variety of sources in October and November into the policy and specifications framework for scaling ingest of locally-digitized partner materials. The framework was finalized and approved, and is available at http://www.hathitrust.org/ingest [397]. The bulk of the enhancements to ingest systems to support this work were completed as well, and Minnesota images and a sample of Yale content have been ingested in the new ingest environment. The new environment will eventually be used for ingest of all materials, including those downloaded from Google and the Internet Archive.
Developers at Michigan began implementing changes to support Creative Commons licenses in the repository’s rights management scheme. Development is expected to be completed in February. Beginning March 1, CC licenses will be included in the “Rights” and “Rights determination reason code” fields of tab-delimited files [4] HathiTrust makes available for download. These files contain copyright, identifier, and limited bibliographic information for all volumes in the repository.
At the beginning of December, HathiTrust requested information from partners about the print holdings of their respective libraries. The information is being used to assemble a database that will support the new cost model all partners will participate under in 2013, facilitate legal access uses of materials in HathiTrust (e.g., section 108 uses and access for users with print disabilities), and form a base for collaborative collection management and collection development activities among the partnership. Partners are requested to provide this information by the end of February.
The recently-formed HathiTrust Collections Committee is a new standing committee reporting to the Strategic Advisory Board charged with establishing strategic directions related to the collection, including collection building and management (see charge and membership). The Committee held its first meeting in October 2010. Examples of issues currently under consideration include the role of duplicates in HathiTrust, models for shared management of print collections, and a variety of rights-related concerns. A more general area of investigation will be an exploration of specific collection development opportunities that the partnership might pursue and recommendations for how such activities should be prioritized and carried out, including considerations relating to non-book formats and collaboration with other initiatives. The Committee is considering a survey of the membership in order to assemble a better picture of partner expectations and aspirations. Input from other HathiTrust partners is welcomed; feel free to contact Ivy Anderson, chair (ivy.anderson@ucop.edu) or another member of the Committee with comments and questions.
The Communications Working Group continued to craft a marketing and communication plan for 2011, and expects to send a draft to the Executive Committee and Strategic Advisory Board by the end of January.
Despite holiday vacations, December was a busy month as the Discovery Interface Working Group (DIWG) worked with OCLC to take the final steps towards releasing the version 1 prototype catalog. With endorsement by the Strategic Advisory Board, the DIWG is now pleased to announce that the public release will go forward as planned in mid-January. Keep an eye out for the official announcement from OCLC. In addition to planning for the scheduled release, the DIWG is also developing post-release processes for managing user feedback and monitoring the system, an operational responsibility that will be supported by the California Digital Library. A three-month period of user testing will take place post-release, which will provide valuable input and help shape version 2 of this important effort.
The group reviewed a plan for a second round of usability testing for the HathiTrust-OCLC prototype catalog, to be conducted in conjunction with OCLC and the Discovery Interface Working Group. The group also provided feedback on some proposed designs for a new PageTurner and revised home page.
A sample of digitized content from Yale University Library was ingested in December. The content is being reviewed by staff at Yale ahead of full ingest, which is expected to begin in January.
In December 2010, the California Digital Library (University of California) and HathiTrust solidified business arrangements and posted a Principal Metadata Analyst position to support development of the HathiTrust Metadata Management System. The University of Michigan transferred input files and scripts describing current bibliographic metadata transformation practices so work can begin at CDL on developing routines for metadata ingest. Progress is also being made on the development of the core metadata storage system. Project information, including overview, milestones, and timeline, is available at http://www.hathitrust.org/htmms [12].
The original storage equipment purchased in late 2007 for HathiTrust has reached its retirement age of approximately 3 years. HathiTrust uses modular storage, and modules may be removed and replaced without disrupting service. Data migration is handled in the background and is fully automatic, though the process does take time to complete. Michigan staff have developed a plan for the upgrade at the Michigan site, and once in progress, will start a similar upgrade process at the Indiana site. The replacement of storage hardware will now be an annual or semi-annual process, shadowing historical patterns of growth and storage purchases.
Michigan staff have begun developing audit mechanisms to verify the integrity of content stored in the repository. These processes will augment existing features of the storage system that routinely scan, detect, and repair hardware-level data storage errors (commonly referred to as “bit rot”). As part of this initiative, a preliminary integrity check of all repository Zip archives--which are used as containers for image, text, and metadata files--was run. The check revealed an error in one page of one volume resulting from a problem with data synchronization from Michigan to Indiana; this was easily corrected. Developers are now coding and testing a comprehensive set of audit routines to ensure that all items recorded as being present in the repository are stored properly and are fully intact, including checksum validation.
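As a rough illustration of the kind of check described above, the following minimal sketch tests a Zip archive's internal CRCs and then compares each member against an expected digest. It is not HathiTrust's audit code: the function name `audit_zip`, the manifest format (a mapping from member name to hex digest), and the choice of SHA-256 are all assumptions made for this example.

```python
import hashlib
import zipfile


def audit_zip(path, expected_digests):
    """Return a list of problems found in one Zip archive.

    expected_digests is a hypothetical manifest mapping
    member name -> hex SHA-256 digest (the algorithm is an assumption).
    """
    problems = []
    with zipfile.ZipFile(path) as archive:
        # testzip() reads every member and verifies its stored CRC-32;
        # it returns the name of the first bad member, or None.
        bad = archive.testzip()
        if bad is not None:
            problems.append(f"CRC mismatch: {bad}")
            return problems  # members cannot be read reliably past this point

        present = set(archive.namelist())
        for name, digest in expected_digests.items():
            if name not in present:
                # Recorded as present in the repository but absent from the archive.
                problems.append(f"missing: {name}")
                continue
            actual = hashlib.sha256(archive.read(name)).hexdigest()
            if actual != digest:
                problems.append(f"checksum mismatch: {name}")
    return problems
```

An empty return value means the archive passed both checks; anything else lists the members that need attention.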
Work to re-index the full text of all volumes in the repository continues, and after encountering some out-of-memory problems, additional tuning and upgrades were made to Solr servers. Performance is almost an order of magnitude better than expected, owing to new optimizations that are being tested for the first time. This effort is on schedule for completion by the end of January.
Staff at Michigan outlined final steps for integrating BookReader with PageTurner. Changes to the user interface layout and performance testing are the main areas of remaining work. The layout design was completed in December, and will be coded in January.
In November, developers at Michigan made a change to the structure of the URL used to retrieve content and metadata from the repository through the Data API [103]. The old structure will no longer be supported as of March 1. Users of the Data API should consult the URL structure specified in the current Data API documentation [103].
There were no outages in December.
Number of volumes added:
| Institution | December | Total |
| Columbia University | 34 | 57,316 |
| Cornell University | 161,803 | 215,610 |
| Indiana University | 322 | 179,351 |
| New York Public Library | 191 | 258,019 |
| Penn State University | 1,035 | 34,400 |
| Princeton University | 364 | 208,506 |
| University of California | 139,891 | 2,048,246 |
| The University of Chicago | 4 | 2,444 |
| University of Illinois | 0 | 14,428 |
| University of Madrid | 78,256 | 78,256 |
| University of Michigan | 14,956 | 4,249,620 |
| University of Minnesota | 2,507 | 76,371 |
| University of Wisconsin | 6,922 | 413,987 |
| Yale University Library | 144 | 144 |
| Total | 397,429 | 7,836,698 |
Public Domain (~25%)
| Total | 138,527 | 1,959,223 |
HathiTrust and Growth in 2011
HathiTrust grew rapidly in 2010, increasing the relevance of the HathiTrust collection to the management of our print collections. The HathiTrust collection will reach two milestones in early 2011: in January, HathiTrust will reach 2 million public domain volumes and, soon afterwards, the collection as a whole will pass 8 million volumes. Thanks to ongoing collection analysis work by OCLC Research, we know that North American research libraries overlapped with the HathiTrust collection at a median rate of just over 33% and that the median rate of overlap with the Oberlin Group of Libraries was closer to 40%. In each of the last two years, the repository grew by nearly 3 million volumes, and the rate of overlap between ARL libraries and HathiTrust grew at about 1% per 240,000 volumes of HathiTrust growth.
If you manage a rich library collection, you will find a significant percentage of your holdings online in HathiTrust; moreover, because of the size and diversity of the HathiTrust collection, you can add over one million new public domain volumes to your collection through the addition of HathiTrust links to your catalog. It is certainly true that the obstacles to using the in-copyright volumes for the delivery of mainstream library services are immense, but just as certainly this phenomenal collection can help us change the way that we administer the storage of little-used print collections. We can confidently say that in 2010 we made progress on HathiTrust’s mission-related goal [398] “[t]o stimulate redoubled efforts to coordinate shared storage strategies among libraries, thus reducing long-term capital and operating costs of libraries associated with the storage and care of print collections.” With 2011 also comes a change in HathiTrust’s growth trajectory and the need for a better understanding of the challenges and opportunities for future growth as a tool in shaping our collection management. To date, the two largest depositors in HathiTrust have been California and Michigan, representing approximately 26% and 57% of HathiTrust’s total deposits. The growth of new content from these institutions will slow in 2011 and, barring significant changes, HathiTrust’s collection will grow by fewer than 2 million volumes in 2011. Even this more modest growth is good news: it may lead to a 40% overlap between HathiTrust and ARL libraries.
We know from OCLC’s analysis (“Cloud-sourcing Research Collections: Managing Print in the Mass-digitized Library Environment,” by Constance Malpas [399]) that even 33% overlap is of significant value to many of our libraries. Still, HathiTrust needs growth. HathiTrust’s value as a pivotal resource in viewing the aggregation of our collections benefits from growth. Building comprehensive and accessible online collections is a necessary part of our strategy for designing effective print storage and access strategies. This is true, for example, for US federal government publications, and is just as true for the large volume of mid-20th century publishing, much of which languishes in suboptimal off-site storage facilities in our libraries. While a 33% overlap between the HathiTrust collection and the collections of ARL libraries is valuable, 50% and 60% overlap can be a powerful catalyst to major changes in print storage.
Our growth is key for a broad array of library access and management opportunities. The case for HathiTrust as a catalyst for changed print management has become clear to our partners. There are other important reasons as well:
Large numbers of titles appear to be protected by copyright but are in fact in the public domain. Digital availability has been a necessary piece of the strategy that has helped HathiTrust partners open access to 55% of the books published in the US between 1923 and 1963. A new effort will also open access to large numbers of non-US works. Because of our investments to date, adding to the US 1923-1963 collection will also increase what we know to be in the public domain.
Partners are now working to assign resources to securing permissions for use of books and journals now online. Preliminary efforts have opened access to thousands of volumes. Online availability ensures that opening access is merely a matter of flipping a switch once permission is secured, increasing our incentive to work on the problem.
The richness of the collection makes possible important lawful uses of in-copyright materials. Many library volumes are eligible for uses under Section 108 provisions in US copyright law, and the online availability of a volume can help a library provide lawful access to an out-of-print work that is damaged, deteriorated, lost or stolen. HathiTrust partners are poised to follow Michigan’s lead and use the online volumes to provide services for their users with print disabilities. Again, this is an activity that can only happen when the volume in question is online.
Despite the likelihood of lower growth for 2011, the possibility of future HathiTrust growth remains great. Overlap between HathiTrust and ARL libraries will probably grow to “only” 40% in 2011, but based on current prospects, that overlap could grow to 60% in the coming year. The impediments to that growth are significant but tractable. For example, several newer HathiTrust partners who have also invested in local repository infrastructure have millions of volumes of digital content that would enrich the collection. Needless to say, the prior investments by these institutions make the additional cost of deposit in HathiTrust expensive, but because of a predominance of pre-1978 materials, this content is a rich resource for copyright determination work and would significantly increase overlap. Some partners also face legal and contractual obstacles. The majority of volumes digitized from CIC are embargoed under the presumption that they are in copyright. As we have learned through our copyright determination work, significant percentages of this content are actually in the public domain, and again these volumes would also increase overlap.
The growth of HathiTrust and the nature of the collection have created critical opportunities, but we must continue to push toward the goal of a nearly comprehensive digital collection in order to benefit fully from what that collection can offer. Copyright determination work, securing rights, and especially print storage management will all be furthered by growth. We will continue to address existing impediments and urge our partners to help round out HathiTrust’s large and increasingly comprehensive collection.
John Wilkin
Executive Director, HathiTrust
March 13, 2009
Outages:
PLEASE NOTE: Please contact Chris Butchart-Bailey (chrisbu at umich.edu) with email addresses of individuals or groups that should be added to our system outage mailing list.
We schedule system maintenance work that requires a system outage during time windows (in Eastern time) where academic user activity is generally lowest:
Advance notice for scheduled outages is given on business days and at least 24 hours in advance. Notice of unscheduled outages is given upon discovery, and additional updates are given as appropriate.
March 12, 2010 [Download PDF [400]]
In the last several months, the HathiTrust partners have made steady progress in expanding the repository’s ability to support the variety of digital outputs produced at their local institutions. While the bulk of content in HathiTrust currently is the result of Google’s digitization efforts, preserving and delivering content from libraries’ non-Google sources is an important part of HathiTrust’s mission to meet the needs of libraries broadly, and assemble a comprehensive collection of published materials that is co-owned by libraries themselves. Three items in this month’s update highlight our efforts in this area: our progress in ingesting materials digitized by the Internet Archive, the hiring of two new programmers to focus on the transformations and normalizations involved in bringing in diverse content, and the creation of a demonstration application that uses the HathiTrust Data API to deliver master repository content from non-Google sources to users. We will be highlighting developments such as these in the coming months.
Internet Archive Ingest – Ingest of UC volumes digitized by the Internet Archive was delayed in late February due to a validation error that UM staff encountered, but ingest of more than 200 pilot volumes was begun in early March. Following quality review of the volumes by UC staff and the resolution of any associated issues, download of UC’s Internet Archive-digitized volumes will formally begin. Staff at UC and UM are in the process of compiling technical and procedural documentation related to Internet Archive ingest to share with partner institutions and the community at large.
New Programmers for Non-Google Ingest – UM has hired two new programmers, for a total of 1.7 FTE, to concentrate on developing ingest routines and common workflows for non-Google-produced materials. These will include materials digitized by the Internet Archive and through local digitization efforts at partner institutions.
Data API – The interface to the Data API demonstration application that was undertaken by Michigan in January is available at http://www.lib.umich.edu/two-over-threehundred/ [401]. The goal of the application was to use HathiTrust’s Data API [103] to facilitate the location and download of complete book packages for public domain volumes not digitized by Google. The code [402] used to produce the demonstration is also available. The application is still processing the HathiTrust data files, and so will only display a subset of the full data.
Quality, Ingest, and Error Rate – The working group kicked off activities under its recently revised charge [403] in February, and will be meeting on a monthly basis. At this stage, the group is gathering information and planning work items, including building a framework for defining quality principles and developing a varied set of scenarios under which content would be gated from entering HathiTrust. This work will help to spur discussion and identify larger issues that are at play. Members of the group include Paul Fogel (California Digital Library), Peter Gorman (University of Wisconsin), Bryan Skib (University of Michigan), and Paul Soderdahl (University of Iowa).
Discovery Interface – The HathiTrust-OCLC team made significant strides in February towards the version 1 catalog beta implementation, with some adjustments to the projected timeline. Due to changes in OCLC’s product release cycles, the catalog is now expected to be complete in May 2010. The HathiTrust library team is now exploring strategies and requirements for the catalog’s public release, with the guidance of both the HathiTrust Strategic Advisory Board and Executive Committee.
The load of HathiTrust bibliographic metadata to WorldCat remains on schedule. OCLC is currently testing the first batch of records, and large-scale loading will take place throughout the month of March. Preliminary user testing is currently underway at Penn State and will be complete in mid-March, thanks to the collaborative efforts of OCLC and HathiTrust’s usability group.
Collaborative Development Environment – The working group reconvened via conference call in February to discuss strategies for version control. All agreed that the version control tools used should facilitate development at local sites as well as within the environment itself, and allow public availability of the source code. Modern distributed version control systems, including some third-party systems such as GitHub, fit well with these needs, and UM staff will propose an architecture to the group at their next meeting in early March for approval. The group also discussed building logical divisions in the environment to segregate its use for various purposes, such as active code development, integration testing and staging for production release, the presentation of relatively stable “beta” versions of software systems, and replicating and troubleshooting issues live in production.
University of Minnesota – Ingest of content from the University of Minnesota began in February, with nearly 65,000 volumes being deposited. All of these volumes are government documents, and are part of a larger effort [404] of the Committee on Institutional Cooperation (CIC – the Big Ten plus the University of Chicago) in partnership with Google to digitize more than 1 million U.S. Federal Documents from their combined collections. The Minnesota documents themselves can be found by clicking on the University of Minnesota facet in the HathiTrust Catalog [405].
Shibboleth – UM is in the process of finalizing Shibboleth attribute release requirements for HathiTrust applications in coordination with partner institutions, and is registering HathiTrust as a service with the InCommon [406] Shibboleth federation. The release of this enhancement to HathiTrust applications is still planned for a March timeframe.
Large-scale Search – The large-scale search index grew to the point in February that it exceeded the Solr/Lucene limit of 2.1 billion unique terms. Core Lucene developer Michael McCandless graciously provided a patch raising the limit to 274 billion unique terms. Michigan continued performance tests aimed at identifying optimal shard sizes. Staff at Michigan also led team members at CDL on a walk-through of the large-scale search implementation in mid-February.
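The two figures quoted above are consistent with a simple piece of arithmetic, sketched below. The 2.1 billion figure matches the largest value of a signed 32-bit integer. The update does not say how the patch works, so the factor of 128 used here is an assumption: it is the multiplier that reproduces "274 billion", and it happens to equal Lucene's default term index interval.

```python
# Largest value of a signed 32-bit integer: the "2.1 billion" ceiling.
INT32_MAX = 2**31 - 1  # 2,147,483,647

# Assumed multiplier (not stated in the update); 128 reproduces the
# "274 billion" figure quoted for the patched limit.
ASSUMED_FACTOR = 128

raised_limit = 2**31 * ASSUMED_FACTOR  # 274,877,906,944


def in_billions(n):
    """Whole billions in n, truncated, for comparison with the quoted figures."""
    return n // 10**9
```

Under that assumption, `in_billions(raised_limit)` gives 274 and `INT32_MAX / 1e9` rounds to 2.1, matching the numbers in the update.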
Four new redundant servers for index service arrived at Indiana and will be installed once additional power and networking infrastructure work has been completed, probably in late March. Two new servers for index building arrived in Michigan and are tentatively scheduled for March installation as well, pending staff availability.
PageTurner – Michigan revamped the PageTurner code that generates PDFs from the repository in February, optimizing it for high performance delivery of full-book PDF files containing full-resolution page images. The ability to download full PDF files of HathiTrust public domain volumes will be available to partner institutions when Shibboleth is implemented. Michigan also explored pipelines for fast on-the-fly generation of scaled, rotated, and watermarked page images and developed a prototype image server. Once completed, it will serve all individual page images not encapsulated in PDF.
Outages – There were no outages in February.
Number of volumes added:
| Institution | February | Total |
| Indiana University | 23,066 | 174,882 |
| Penn State University | 128 | 5,144 |
| University of California | 5,976 | 1,162,315 |
| University of Michigan | 50,873 | 3,781,841 |
| University of Minnesota | 64,966 | 64,966 |
| University of Wisconsin | 35,683 | 303,727 |
| Total | 108,977 | 5,434,537 |
[Download PDF [407]]
The HathiTrust Communications Working Group has scheduled a second webinar, following the HathiTrust 101 webinar offered last summer, to review basic elements of the partnership (including the business model, collections and services), discuss current activities and future directions, and answer questions from participants. The webinar is targeted specifically toward new partners, but is open to members of all partner institutions. The same webinar will be held at three different times in order to provide more opportunities for participation: Wednesday March 23, 1:30-3:00pm, Tuesday April 12, 12:30-2:00pm, and Friday April 15, 12:30-2:00pm (all Eastern Daylight Time). If you plan to attend, please RSVP to Jeremy York as soon as possible before each webinar: jjyork@umich.edu [269]. Please also include any questions or issues you would like the presenters to address (a week in advance will give time to prepare, though we are interested in receiving questions and feedback at any time).
HathiTrust is pleased to announce the availability of public domain texts on a large scale for computational research purposes. Approximately 120,000 texts are freely available; up to 2 million more can be obtained with institutional sponsorship through an agreement with Google. More information, including the Google agreement and directions for obtaining texts, is available at http://www.hathitrust.org/datasets [212]. Unlocking the research potential of the collections assembled in HathiTrust is an ongoing goal of HathiTrust partners, and we are excited to take this step in enabling new forms of discovery and analysis.
HathiTrust is in the process of defining a new working group to respond to questions and issues received from users on a variety of topics, including searching and accessing content, copyright, quality, access to datasets, and more. A call for participants was sent to HathiTrust partner institutions in February; membership in the group will be finalized and the charge posted in the coming month.
All of the nearly 60,000 images and associated metadata involved in the prototype project between HathiTrust, the University of Minnesota, the Minnesota Digital Library and the Minnesota Historical Society have been successfully ingested into HathiTrust. Public access to the image content is pending approval of a formal agreement. Project members John Butler, John Weise, and Eric Celeste will give a project briefing at the upcoming CNI Spring 2011 Membership Meeting. More information about the project can be accessed at http://www.hathitrust.org/mdl_images [272].
With the initial policies, specifications, and technical framework in place, HathiTrust is ready to begin to scale ingest of locally-digitized book and journal content from partner institutions. HathiTrust has begun working with institutions of the Committee on Institutional Cooperation (CIC) and will broaden its scope throughout the coming year. Partners with digital book and journal content should review the deposit guidelines and content deposit form available at http://www.hathitrust.org/ingest [397], to be apprised of ingest requirements and preparations of content that may be needed prior to submission.
HathiTrust has enabled support for Creative Commons licenses. The Brooklyn Museum has posted an entry on its blog [248] about the volumes it has opened. If you hold the rights to a volume or volumes preserved in HathiTrust and would like to open access using a Creative Commons license, you can do this by filling out and submitting a permission form [346].
The Collections Committee is working on draft recommendations for the treatment of duplicate scans in HathiTrust, which it hopes to have ready for SAB consideration in late March or April. The group has also begun preliminary work on a print management proposal for the Executive Committee in advance of the Constitutional Convention. Another project the Committee will be taking up is a process for responding to requests to add specific content to HathiTrust. There has been one membership change on the Committee: Tom Teper (University of Illinois) has recently stepped in to replace Kim Armstrong (Committee on Institutional Cooperation) and will be serving as a formal liaison to the Executive Committee for the print management work item.
The Communications Working Group is pleased to welcome two new members: Robin Bedenbaugh from Texas A&M University, and Oya Rieger from Cornell University. The departure of one member earlier this year left a vacancy in the group, and because of the expanding work of the group and the excellent pool of nominees submitted by partner institutions, the Executive Committee decided to approve two new appointments. We are pleased to welcome Robin and Oya and add their knowledge and expertise to our communication efforts.
A draft of the working group’s Communications and Marketing Plan for 2011 was reviewed by the Strategic Advisory Board and the Executive Committee in February, and the group is now incorporating feedback into a final version. The working group also made progress on the development of a second webinar (see announcement above) and on a handout designed to communicate the basics of HathiTrust to a broad audience.
The Discovery Interface Working Group (DIWG) has begun to balance its efforts between advancing the full implementation of the HathiTrust WorldCat Local catalog, and enhancing HathiTrust Full-text Search. The DIWG-OCLC team is currently developing a list of desired enhancements to the functionality and interface for a second version of the HathiTrust WorldCat Local catalog. The HathiTrust Full-Text Working Group has continued to meet weekly, and is finalizing a list of features and functions to be deployed in the initial short-term phase of the Full-text Search enhancements.
User experience experts from the DIWG and OCLC have finalized a WorldCat Local Prototype usability test, which will run for about 2 weeks during March.
The Usability Group continues to participate in other committees via liaison roles. Two group members are actively participating in the Full-text Search working group and another continues to be actively involved in the Discovery Interface Working Group.
The Usability Group is establishing a User Experience Special Interest Group (UX-SIG). Our intention is to find people at partner institutions with some experience or interest in user experience topics, including usability & interface design. In addition to being a place for user experience (UX) related discussions, this group will provide a base for the solicitation of volunteers to participate in various short-term activities related to the HathiTrust user interface (e.g., contribute to personae and use cases, provide feedback on proposed site changes, join a task force project). There is no implied commitment in joining the group unless a member chooses to participate in a project. Membership in the UX-SIG will provide an interesting opportunity to connect with your UX colleagues across the HathiTrust partnership! Please contact Suzanne Chapman (suzchap@umich.edu [170]) if you are interested in joining this group.
Staff at California Digital Library have completed development of the core file system, the first major component of the new HathiTrust Metadata Management System. The development team is now reviewing existing workflows for receiving bibliographic data from each HathiTrust content-contributing institution. This work includes testing record import and transformation functions and performance. Development of the next major component, the core database for the system, has begun, and CDL continues to interview candidates for a Principal Metadata Analyst position for the project. Ongoing project information is posted at http://www.hathitrust.org/htmms.
Staff at Michigan have begun modifications to Collection Builder that will allow the creation of permanent, full-text-searchable collections of HathiTrust volumes of arbitrary size. The revised design leverages the Solr index used in Full-text Search instead of relying on a dedicated Collection Builder index. In the new configuration, items added to collections of less than 1,000 volumes will be full-text searchable immediately on inclusion. Full-text indexing of collections of more than 1,000 items will be slightly delayed - generally completed within 48 hours. Very large collections of more than 20,000 items will require staff mediation. While 98% of collections contain fewer than 100 items, there has been increasing demand from users for collections with tens and potentially hundreds of thousands of items. The necessary enhancements will be completed in March.
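The size thresholds described above can be summarized in a short sketch. This is only an illustration of the stated policy, not Collection Builder code: the function name and return labels are hypothetical, and because the update does not say how a collection of exactly 1,000 items is treated, the sketch assumes it is indexed immediately.

```python
def indexing_mode(item_count):
    """Classify a collection by how its full-text indexing would be handled.

    Thresholds follow the update; treating exactly 1,000 items as
    "immediate" is an assumption, since the update leaves that case open.
    """
    if item_count > 20_000:
        return "staff-mediated"  # very large collections
    if item_count > 1_000:
        return "delayed"         # generally completed within 48 hours
    return "immediate"           # searchable on inclusion
```

For example, `indexing_mode(100)` falls in the "immediate" tier that covers the 98% of collections with fewer than 100 items.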
Work that was underway at Michigan to design and implement Data API security enhancements is temporarily on hold, with staff focusing on enhancements to Collection Builder. Michigan staff did create a simple API, however, to supply access and use statements to the HathiTrust OAI feed based on a combination of volume rights and source attribute values. This is not formally part of the Data API, and at this point is intended for internal use only.
Tests were done that confirmed the viability of the plan to make Collection Builder reliant upon the full-text search index, described above.
Integration of BookReader into PageTurner was largely completed in February and the code is ready for production deployment. However, initial testing revealed that performance of the new interface could be increased significantly through the installation of the Plack (http://plackperl.org/) Perl module. Plack is now being deployed on HathiTrust web servers and production deployment of PageTurner with BookReader is expected in April.
A bug related to proper ID representation was fixed in PageTurner’s COinS implementation. COinS support was also added to PageTurner search results. COinS is an embeddable format that provides bibliographic metadata to citation tools such as Zotero.
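For readers unfamiliar with the format, a COinS record is an empty HTML `span` with class `Z3988` whose `title` attribute carries an OpenURL key/value string. The sketch below builds one for a book. It is a generic illustration rather than PageTurner's implementation: the function name, the choice of fields, and the example identifier are hypothetical.

```python
from html import escape
from urllib.parse import urlencode


def coins_span(title, author, year, identifier):
    """Build a COinS span describing a book (illustrative field selection)."""
    fields = [
        ("ctx_ver", "Z39.88-2004"),
        ("rft_val_fmt", "info:ofi/fmt:kev:mtx:book"),
        ("rft.btitle", title),
        ("rft.au", author),
        ("rft.date", year),
        ("rft_id", identifier),  # the item's identifier, URL-encoded below
    ]
    # URL-encode the key/value pairs, then HTML-escape the result so that
    # the "&" separators are valid inside an attribute value.
    query = urlencode(fields)
    return f'<span class="Z3988" title="{escape(query)}"></span>'


span = coins_span("Moby Dick", "Melville, Herman", "1851",
                  "http://example.org/volume/123")  # placeholder identifier
```

Citation tools that support COinS scan the page for `Z3988` spans and decode the `title` attribute to recover the metadata.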
Michigan staff have completed half of the storage replacement work at the Michigan and Indiana storage sites with no service interruptions or other issues, and are continuing replacement work in March, starting in Michigan. The process for securely purging data from retired storage nodes has been finalized and put in place.
HathiTrust remained available during an extended scheduled outage of the main campus data center at the University of Michigan from approximately 2:00pm EST on Friday, February 18 until approximately 2:00pm EST on Sunday, February 20. There were no issues resulting from the maintenance.
Number of volumes added:
| Institution | February | Total |
| Columbia University | 1,051 | 58,465 |
| Cornell University | 23,371 | 239,010 |
| Indiana University | 1,889 | 181,895 |
| New York Public Library | 482 | 258,565 |
| Penn State University | 2,653 | 37,174 |
| Princeton University | 10,910 | 219,466 |
| University of California | 224,373 | 2,304,411 |
| The University of Chicago | 765 | 3,227 |
| University of Illinois | 0 | 14,428 |
| University of Madrid | 6,125 | 85,537 |
| University of Michigan | 26,878 | 4,303,356 |
| University of Minnesota | 2,909 | 79,498 |
| University of Wisconsin | 8,074 | 431,524 |
| Yale University Library | 0 | 140 |
| Total | 309,480 | 8,216,700 |
Public Domain (~26%)
| Total | 125,144 | 2,098,494 |
February 13, 2009
Sample 1: The first sample will be composed of 5,000 texts, which may be requested in one of three bundles. Texts in all bundles are pre-1923 (pre-1869 for works published outside of the United States) and are as follows:
Sample 2 - Digging Into Data: A second sample of 50,000 texts will be made available for participants in the Digging into Data Challenge [337]. The corpus represents a mix of dates (as above, all pre-1923, and pre-1869 for materials published outside the United States), countries of origin, languages, character sets, and formats (i.e., some serial literature in a body of mostly monographic literature).
More information about these datasets, as well as specifications of file formats and modes of access, will be posted soon on HathiTrust.org.
Outages:
PLEASE NOTE: Please contact Chris Butchart-Bailey (chrisbu at umich.edu) with email addresses of individuals or groups that should be added to our system outage mailing list.
We schedule system maintenance work that requires a system outage during time windows (in Eastern time) where academic user activity is generally lowest:
Advance notice for scheduled outages is given on business days and at least 24 hours in advance. Notice of unscheduled outages is given upon discovery, and additional updates are given as appropriate.
February 12, 2010 [Download PDF [408]]
New Cost Model – The HathiTrust executive committee approved a new cost model for partnership in December that will be adopted by all partners beginning in 2013. In the new model, partners will share in the cost of public domain and open access volumes preserved in HathiTrust, and in the cost of in-copyright volumes that they hold, or have held, in their physical collections. The model will distribute the costs of curating and managing the digital collections in a way that more accurately reflects the benefits each partner receives from deposited volumes. It will also allow institutions that do not necessarily have content to deposit, but that wish to support and benefit from the long-term curation and access services HathiTrust provides, to join HathiTrust. Such institutions are eligible for partnership effective immediately, and do not need to wait for the 2013 general adoption. Details of the new cost model are available at http://www.hathitrust.org/documents/hathitrust-cost-rationale-2013.pdf [409]. Please contact hathitrust-info@umich.edu [410] for additional information and inquiries about partnership.
Disaster Recovery Planning – Following an evaluation of disaster preparedness [382] performed last summer by an IMLS-funded intern, and the hiring of a preservation librarian in November, the University of Michigan is taking steps to formalize and expand HathiTrust’s policies and practices relating to disaster recovery. The UM preservation librarian is leading a process to form a Disaster Recovery Planning Committee and, with support of a winter intern from the UM School of Information, has begun to gather key inventory, personnel, and workflow documentation. Guided by industry standards such as TRAC [411] and best practices in the digital preservation community, the committee will ensure a high level of preparedness for known and unknown risks to the long-term integrity and use of materials in the repository. A preliminary meeting of key staff will occur in February, and membership in the Disaster Recovery Planning Committee will be finalized soon thereafter.
Digital Library Profile – As part of its participation in an NSF EAGER grant awarded in September 2009, HathiTrust completed a technological profile of its repository based on two frameworks developed by Johns Hopkins University. The profile can be found at http://www.hathitrust.org/technology [256].
Quality – In July 2009, the Strategic Advisory Board (SAB) assembled a working group to investigate issues surrounding the quality of partner institution volumes downloaded from Google. The working group was asked to research and provide recommendations on a quality threshold HathiTrust uses to limit ingest of poor quality volumes. The working group presented its recommendations to the SAB in January and the SAB decided to continue the working group with a revised and expanded charge. The new charge is to a) develop a set of quality principles for HathiTrust, b) monitor quality control as related to user experience, c) track developments in a separate quality working group established by Google and Google library partners following the Google partner summit in October, and d) evaluate HathiTrust practices with regard to thresholding or limiting ingested content. Membership in the new group, called the HathiTrust Quality Ingest and Error Rate Working Group, is currently being determined.
Discovery Interface – With the version 1 catalog beta release only a few months away, the Discovery Interface Working Group is turning its focus to the usability of the catalog and its integration with existing HathiTrust Digital Library services (Collection Builder, Page Turner, and Full-Text Search). The Working Group formed a usability subgroup, which will collaborate with staff at OCLC to begin usability testing of the catalog before it is released. Testing will also be performed in post-release phases. Aspects of the pre-release analysis will include verifying accurate functionality and fulfillment of agreed-upon requirements.
In preparation for loading HathiTrust volumes into WorldCat for the version 1 release, staff at UM provided an API that will allow OCLC to display HathiTrust volume information in WorldCat records.
Collaborative Development Environment – UM staff have been gathering specific topics for the working group to discuss when it reconvenes (now planned for late February), and have developed a draft timeline for the steps ahead. A message to reassemble the group was sent in early February, and scheduling is underway. The area the group will address first is the design of a version control system. UM staff have also begun to research the GlusterFS cluster file system as a storage back-end for the environment.
Storage – The working group tasked with making recommendations on a third instance of storage for HathiTrust presented its final report to the Executive Committee in January. The group concluded that although there were significant benefits to implementing a third instance of storage, given the high level of preservation confidence in HathiTrust and the absence of economic conditions favorable for acquiring and operating new storage, there was no urgency in establishing a new instance. The group noted, however, that HathiTrust should be prepared to establish a third instance of storage if such a course becomes more economically feasible.
The Executive Committee would like to solicit broader feedback from partner institutions regarding these recommendations (especially from a collection development perspective), and requests that thoughts on the report and a third instance of storage be sent by email to hathitrust-info@umich.edu. Those who wish to remain anonymous should indicate this in their email. The full report of the working group is available at http://www.hathitrust.org/projects#wg_storage [412].
General – Ingest rates were low in January, due in part to challenges UC experienced in retrieving bibliographic records from one of its systems. UM loaded the first set of bibliographic records for Minnesota, but could not begin ingest because of problems with Google’s delivery of the content files. Ingest numbers from other institutions were also low because HathiTrust caught up with the rate that partner volumes were made available from Google.
Internet Archive Ingest – In January, UM began testing validation routines on a batch of 200 Internet Archive-digitized volumes from the University of California. The teams are revising validation strategies based on the findings of these tests and the results of quality assurance performed by UC staff on transformed, but not yet ingested, objects. UM and UC will proceed with the ingest pilot in February, testing all aspects of bibliographic and content loading, validation, and access. Completion of the pilot is projected for late February.
New Programmer For Non-Google Ingest – UM extended the bidding period for the new programmer position through mid-January, and several new qualified candidates have been interviewed. UM staff are in the final stages of selecting candidates, and expect to have a new full-time staff member and a new part-time staff member on board by the end of February.
Shibboleth – Shibboleth implementation in HathiTrust is nearly complete. Major portions of the code are in place and UM staff have begun to contact partner institutions to exchange information that will allow individuals from partner institutions to authenticate into HathiTrust. The initial benefit to partners will be increased facility in creating personal collections in Collection Builder, though non-partners will still have the ability to create collections using the University of Michigan “friend account” [413] system. Within the next couple of months, however, full-PDF download of all public domain volumes will also be available to partners. In the long term, HathiTrust hopes to use Shibboleth to extend services such as enhanced access for users with print disabilities and U.S. Copyright Section 108 uses to all member institutions. Deployment of Shibboleth is planned in March.
Data API – In January, staff at the University of Michigan began work on a web application that will use the Data API to facilitate the location and download of complete book packages for public domain volumes not digitized by Google. The application is being created entirely with data and services available to the general public and is meant to demonstrate uses that can be made of the API. The first step of crawling the repository for eligible volumes is in progress, and release of a beta version of the application is expected in February.
Large-scale Search – UM improved logging and log analysis in January, enabling staff to monitor search performance in a way that more closely resembles the user’s experience. UM staff documented changes to large-scale search hardware in a new blog post entitled “Scaling up Large Scale Search from 500,000 volumes to 5 Million volumes and beyond” [414].
New index servers were ordered for the Indiana site and are scheduled to be in service before the end of March. The current index release process already synchronizes an updated version of the index to be stored in Indiana on a daily basis. Acquisition of the new hardware will provide full redundancy of the large-scale search application servers as well. Two additional servers that will be used exclusively for index building are on their way to the Michigan site, and one server originally purchased for production service is being re-purposed for testing and development.
PageTurner – PageTurner development was slowed in January but will pick up in February and March as staff time devoted to the ingest of materials from the Internet Archive decreases.
Outages – There were no outages in January.
Number of volumes added:
| Institution | January | Total |
| Indiana University | 38,344 | 151,816 |
| Penn State University | 0 | 5,016 |
| University of California | 972 | 1,156,339 |
| University of Michigan | 71,904 | 3,730,968 |
| University of Wisconsin | 691 | 268,044 |
| Total | 104,342 | 5,312,183 |
[Download PDF [415]]
The HathiTrust Discovery Interface Working Group is pleased to report the availability of a prototype HathiTrust catalog. This new interface is the result of a partnership between OCLC and HathiTrust, leveraging our collective expertise to facilitate discovery of the materials held in the HathiTrust Digital Library. One of the project’s main goals is to situate HathiTrust’s multi-institutional holdings within the larger world of library holdings represented in WorldCat. The new prototype catalog, accessible at http://hathitrust.worldcat.org [267], is built on OCLC’s WorldCat Local platform. HathiTrust and OCLC are eager to receive user feedback to inform the design of a next version of this catalog. Feedback can be submitted to HathiTrust via http://www.hathitrust.org/feedback [416]. For more details about this project, see OCLC’s press release at http://www.oclc.org/news/releases/2011/20114.htm [264].
From September through December 2010, HathiTrust worked with the University of Minnesota (UMN) and its partner, the Minnesota Historical Society (MHS) to add digital images from the state-wide Minnesota Digital Library and MHS collections to HathiTrust as a preservation archive. This prototype project was intended to begin addressing HathiTrust’s long-term functional objective to “support formats beyond books and journals.” Nearly 60,000 images and associated metadata were involved in this ingest project, providing a testbed for the evaluation of numerous technical, economic, and policy-related considerations now underway. Conclusions have yet to be drawn, but the report of one of the independent consultants for the prototype ingest effort is available at http://eric.clst.org/wupl/MDL/MDL-HT-report-110126.pdf [417]. For additional information, please contact John Butler (j-butl@umn.edu [370]).
The University of Michigan Library’s User Experience (UX) Department will begin work in February on the development of mobile interfaces for HathiTrust, focusing primarily on interfaces for reading volumes and bibliographic searching. The Department will contribute the time of a mobile developer and two User Experience Specialists for the next 7 months to conduct research and design and develop the interfaces. The UX Department staff will be consulting both the Discovery Interface and the Usability Working Groups throughout the development process. Anyone interested in contributing to this project should contact Suzanne Chapman (suzchap@umich.edu [170]).
HathiTrust now offers rightsholders the ability to open access to their works under Creative Commons [367] (CC) licenses. The first CC licenses will go live in HathiTrust on March 1, at which time the license designations will also begin appearing in HathiTrust’s tab-delimited metadata files [4] and OAI feed (information at http://www.hathitrust.org/data [274]). The metadata files contain bibliographic and identifier information for every volume in HathiTrust.
As of the end of January, users at three new partner institutions can log in to HathiTrust to take advantage of additional services: the University of California-Los Angeles, the University of Utah, and the University of Washington. Current services include full-PDF download of all public domain materials and the ability to create permanent collections in HathiTrust’s Collection Builder [418] using a local sign-on. HathiTrust uses Shibboleth [419] to enable partner authentication. In order to be configured for Shibboleth, institutions must release required attributes to the HathiTrust Shibboleth Service Provider (see http://www.hathitrust.org/shibboleth [107]).
We continue to urge partners to configure Shibboleth to work with HathiTrust so that the full (and growing) array of services can be delivered to every partner institution. The institutions listed below are configured, and we are in the process of working with three other institutions (Utah State University, the University of California-Berkeley, and the University of Madrid) to enable access. If your institution is not on this list, we would appreciate your help in making the appropriate connections to enable login via Shibboleth for your institution.
HathiTrust will be holding informational webinars in the second half of March, geared specifically toward new partner institutions. Additional details will be disseminated soon. Please contact Heather Christenson (heather.chistenson@ucop.edu [420]) or Julie Bobay (bobay@indiana.edu [421]) for more information.
As noted in the Update on December Activities [422], partners are requested to provide information about their print holdings by the end of this month. Please contact Julia Lovett (jalovett@umich.edu [423]) with any questions.
Members of the Collections Committee met with representatives from DLF, OCLC and others at ALA Midwinter to discuss the DLF/OCLC Registry of Digital Masters. The Committee has agreed to provide use cases and additional input for an assessment project that DLF is planning to mount to chart the future of the Registry. Discussions continue on several key work items, including the role of duplicates in HathiTrust and opportunities for shared print collection management.
The announcement of a number of new developments occupied the Communications working group in January; in particular, the rollout of the prototype OCLC WorldCat Local interface. The group also drafted a prioritized communications and marketing plan for 2011. Among the high priorities in the plan are repurposable materials for librarians to use in explaining HathiTrust to their constituencies, internal communications mechanisms for use among HathiTrust partners, and an introductory webinar for new partner institutions (look for an announcement soon).
In January, the Discovery Interface Working Group (DIWG) reached an important milestone in the release of the HathiTrust WorldCat Local prototype catalog [424]. Now that the prototype has been released, the DIWG’s work will focus on gathering user feedback on the catalog and conducting formal usability testing.
The Strategic Advisory Board would like to take this opportunity to thank everyone in the working group for their dedication to the catalog project: John Butler, co-chair (University of Minnesota), Lee Konrad, co-chair (University of Wisconsin), Julia Lovett, project manager (University of Michigan), Suzanne Chapman (University of Michigan), Kevin Clair (Pennsylvania State University), Lisa German (Pennsylvania State University), Patti Martin (California Digital Library), Jon Rothman (University of Michigan), Christopher Walker (Pennsylvania State University). Adam Brin (California Digital Library) is no longer with the group but his contributions during the requirements phase were vital to the group’s success.
The Strategic Advisory Board and DIWG would also like to thank OCLC’s team for their very hard work, particularly Bill Carney, who served as OCLC’s project manager. In addition to the creation of the prototype interface, the collaborative process itself proved to be important in helping both organizations understand the inherent benefits and challenges to working on large-scale projects across disparate types of institutions. The processes that were developed for the coordination of communication, project management, design, user testing, metadata, and systems work will serve the DIWG and HathiTrust well in future projects and partnerships.
January was an important month for the newly formed Full-Text Search working group, a subgroup reporting to the DIWG. The group held its first two meetings, and will continue to meet on a weekly basis. The group is currently developing a list of features and functions that will have a high impact value for users, and can be supported in the existing technology framework.
The Usability group continues to participate in other committees via liaison roles. Two group members recently joined the Full-Text Search subgroup to discuss the future of full-text search. The group also provided feedback on proposed designs for the improvements to PageTurner. The Usability group has begun to identify areas across HathiTrust that are in need of further development, usability research, or new design solutions.
Development at California Digital Library (CDL) on the core system for the new HathiTrust Metadata Management System progressed in January. CDL staff also consulted with staff at Michigan on documentation for the transformations involved in ingesting bibliographic records from partner institutions. CDL is in the process of hiring a Principal Metadata Analyst for the project. Ongoing project information is posted at http://www.hathitrust.org/htmms [12].
Developers at the University of Michigan updated the Data API [103] in January to support Creative Commons licenses, return access and use statements [425] for retrieved volumes, and provide access to coordinate OCR contained within volume packages.
Michigan staff made improvements to the development environment to facilitate testing of new code prior to release.
Over the last two months, staff at Michigan worked to rebuild the entire full-text index of HathiTrust materials, currently composed of more than 8 million volumes. The new index is in production and will be updated as new volumes are ingested. The rebuilding process included an upgrade of the Solr search engine. This upgrade, coupled with a number of strategic modifications to the way the index is constructed, has resulted in faster indexing (staff originally estimated that re-indexing would take up to 40 days, but it was completed in 10), a smaller index, improved handling of non-Latin scripts (e.g., CJK, Thai, Devanagari), and the inclusion of additional catalog metadata.
Michigan developers made considerable progress on integrating BookReader [254] into HathiTrust’s PageTurner application. Page layout modifications specified in December were implemented, leaving performance testing as the final area of work. Performance testing will be conducted in February and the enhanced PageTurner is planned for release in early March. The current interface to PageTurner will remain the default for the initial release, with BookReader functionality introduced as a “New” feature for users to try.
Staff at Michigan also began work to include Creative Commons licensing information as RDFa in PageTurner application output. Coding will be completed in February. CC licensing information will appear in the PageTurner bibliographic metadata display.
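One common way to express a license in RDFa is a link that carries `rel="license"`, as in the Creative Commons markup recommendations. The sketch below is illustrative only: the function name, the wrapping element, and the sample license URL are assumptions, and it does not represent the markup PageTurner actually emits.

```python
from html import escape


def license_rdfa(license_url: str, label: str) -> str:
    """Build an HTML fragment that marks a link as the page's license via RDFa."""
    return (
        '<span class="license">'
        f'<a rel="license" href="{escape(license_url, quote=True)}">{escape(label)}</a>'
        "</span>"
    )


# Hypothetical example values; any Creative Commons license URL could be used.
print(license_rdfa("http://creativecommons.org/licenses/by/3.0/", "CC BY 3.0"))
```

Escaping both the URL and the label keeps the fragment well-formed even if either value contains characters such as `&` or `"`.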
Michigan staff completed [correction] half of the storage replacement work described in last month’s update [422] at the Michigan site, and are beginning the replacement process at the site in Indiana. Staff expect all storage replacement to be completed by the end of March. While the process is non-disruptive and both sites remain in live service during the replacement process, staff have paused ingest and full-text indexing work at crucial moments to be prepared to respond to unexpected problems. In conjunction with this work, staff are testing a process for purging data from retired storage nodes for security purposes before those nodes are decommissioned.
Outages
There were no outages in January.
New Growth
| Institution | January | Total |
| Columbia University | 98 | 57,414 |
| Cornell University | 29 | 215,639 |
| Indiana University | 655 | 180,006 |
| New York Public Library | 64 | 258,083 |
| Penn State University | 121 | 34,521 |
| Princeton University | 50 | 208,566 |
| University of California | 31,792 | 2,080,038 |
| The University of Chicago | 18 | 2,462 |
| University of Illinois | 0 | 14,428 |
| University of Madrid | 1,156 | 79,412 |
| University of Michigan | 26,858 | 4,276,478 |
| University of Minnesota | 218 | 76,589 |
| University of Wisconsin | 218 | 423,450 |
| Yale University Library | 0 | 144 |
| Total | 70,666 | 7,907,220 |
Public Domain (~25%)
| Total | 14,127 | 1,973,350 |
Over the past three years, HathiTrust has assisted research libraries in moving more than 8 million scanned volumes online. From an initial group of CIC libraries and the University of California System, HathiTrust has grown to include more than 50 partner libraries, including a small but growing number of international participants. Together, these contributions to HathiTrust represent a significant slice of the world’s research holdings. As HathiTrust’s library network and content base grow, the partnership will likely have new and different needs for governance, sustainability, and technology.
To address these needs, the SAB is finalizing an agreement with a consultant to provide the membership with an independent, thorough review prior to the October 2011 Constitutional Convention.
The consultant’s review will evaluate HathiTrust’s progress to date, using the functional objectives as guideposts. The SAB is also requesting a forward-looking view of the next steps that will be needed to sustain and grow the digital library. The SAB identified these questions as the most important to address in the review:
The review will be completed in time to allow discussion and comment from the membership. It is anticipated that the review document will play a crucial role at the HathiTrust Constitutional Convention in October 2011.
Please direct questions or comments to any SAB member: John Butler, University of Minnesota; Trisha Cruse, California Digital Library; Bernie Hurley, University of California-Berkeley; Bruce Miller, University of California-Merced; Sarah Pritchard, Northwestern University; Paul Soderdahl, University of Iowa; Ed Van Gemert, University of Wisconsin-Madison (chair); John Wilkin, University of Michigan (ex-officio); and Bob Wolven, Columbia University.
8 August 2008
This is the fifth regular update on activities in the HathiTrust, previously referred to as the Shared Digital Repository (SDR). These updates are distributed monthly, typically on the 2nd Friday of the month, and provide a variety of information about the general health of the repository and updates on the development of the HathiTrust. Each update will be sent via e-mail to the Library Director and CIO at each participating institution. We will soon release a website for the HathiTrust initiative, and will post all updates on that site. We plan to make an RSS feed for the updates available in order to share the information as broadly as possible.
Throughout this update, we refer to the draft Short-Term and Long-Term Functional Objectives (being articulated by the CIC’s SDR committee) as a work item relates to those Objectives. We plan to restructure future updates to provide an update on activities along with a review of work on those objectives.
Growth of the HathiTrust
As of August 1st, the HathiTrust contains:
Archival certification
We have completed a draft response to the required elements in the “Trustworthy Repositories Audit & Certification (TRAC): Criteria and Checklist.” The draft will be published online as part of the release of the HathiTrust website. We have had preliminary discussions with the Center for Research Libraries about the possibility of a formal review of the HathiTrust repository.
As mentioned in an earlier update, we coordinated a site visit by a team from the Digital Repository Audit Method Based on Risk Assessment (DRAMBORA) effort in the European Union. The DRAMBORA team plans to make their report public. (CIC SDR Short-Term Functional Objectives)
Infrastructure Development
Service Development
Forecasting August development
HathiTrust Governance
Status/availability of the HathiTrust
We schedule system maintenance work that requires a system outage during time windows (in Eastern time) where academic user activity is generally lowest:
Please contact Phyllis White (pmwhite at umich.edu) with email addresses of individuals or groups that should be added to our system outage mailing list.
Service was unavailable on Thursday July 31 from 7:00-7:30am EDT for a storage system software upgrade.
At this time, the following outages are scheduled:
August 14, 2009 [Download PDF [427]]
UC Staff Visit Ann Arbor – HathiTrust project leads from the California Digital Library joined staff at the University of Michigan for two days of intense and fruitful discussion and planning from July 20-21. The teams consulted on a variety of forward-looking topics including a roadmap for the ingest of content digitized by the Internet Archive, strategies for future bibliographic metadata management, the challenges of providing help and feedback to users in a virtual library with multiple constituencies and stakeholders, HathiTrust PageTurner development, and creating infrastructure for collaborative development efforts. Several new planning efforts were initiated as a result of these discussions and both partners came away believing the visit had helped them to further coordinate efforts and was instrumental to continuing their successes in the future.
New HathiTrust Working Group On Storage – A new working group has been convened to explore the possibility of securing a third instance of storage for HathiTrust in the western United States. The working group members include Stephen Abrams, California Digital Library (co-chair), John Kunze, California Digital Library (co-chair), Luc Declerck, University of California San Diego, Rob Lowden, Indiana University, David Minor, University of California San Diego, and Cory Snavely, University of Michigan. If a third instance of storage is recommended, the group will investigate a variety of technical, management, and organizational issues involved in implementation.
Working Group On Computational Research Center – The Research Centers working group has been hard at work over the last month. The participants (please see the June update) have been engaging in a series of conference calls discussing issues related to the creation of the centers, including the types of research that will be done, the environment needed to support such research, and legal restrictions surrounding the use of the data. The group will continue to discuss these issues and others, such as funding sources and derivative research resulting from HathiTrust data use, in calls throughout August and September.
Working Group on Development 'sandbox' – The Development Environment working group convened for the first time in mid-July via teleconference to discuss the scope of the environment, the contexts in which development will occur (remote development versus local, specific use cases and desired features), and working group logistics. The group identified current applications such as the HathiTrust PageTurner and Collection Builder, along with GROOVE, HathiTrust’s ingest mechanism, as priority systems to be made available in the development space, and conferred about particular ways that work will be done, such as code versioning. The development environment was a focus of one of the sessions during the meeting between California Digital Library and University of Michigan staff mentioned above, where further discussion on these issues took place. In the coming weeks, team members at Michigan will prepare hardware that has been set aside for the project and do preliminary configuration of the environment on that hardware.
Prototype for New HathiTrust PageTurner — Collaboration between the California Digital Library and the University of Michigan to enhance the HathiTrust Page Turner with GnuBook functionality continued in July, primarily in the form of discussions about division of labor and the establishment of a basic collaborative work environment. A new planning and development team with staff from both institutions met in mid-August to kick off the next phase of GnuBook and PageTurner development.
HathiTrust-OCLC Catalog Project – The HathiTrust WorldCat Local Implementation team is nearing completion of a high-level requirements document for the version 1 catalog, with a target deadline of August 31, 2009. The team also began to document usability issues and suggestions for the proposed interface. OCLC has begun working on the e-content synchronization process that will bring HathiTrust’s records into WorldCat Local.
In striving to create a consistent user experience of HathiTrust, the team has turned to user feedback on the temporary beta catalog (http://catalog.hathitrust.org/ [405]).
HathiTrust Statistics — Member institutions have identified the need to make statistics about how HathiTrust is being used more broadly available within the partnership. As a provisional measure, access statistics gathered by Google Analytics are being provided to representatives at these institutions. While these analytics will be useful in the short-term, there is a need for a reporting tool that will provide more granular information, such as usage by institution and by format, in the future.
Large-scale Search – University of Michigan staff investigated the indexing problems with the beta large-scale search that were reported in the last update. The problems were due to a shortage of available memory, but staff decided to wait for new hardware to be deployed before taking further action. The new hardware, purchased in June to support large-scale search, was received in July and is currently being prepared for testing and use. With the new hardware in place, the plan is to offer full-text search of all volumes in HathiTrust by October 1.
UM staff made refinements to the custom punctuation filter for large scale search, and ran tests only to discover the filter did not provide the performance boost anticipated. The punctuation filter has been set aside temporarily, but has potential for future implementation. Tests conducted by staff to compare response times for common-grams Solr indexes in various configurations resulted in a new emphasis being placed on the importance of a well-tuned list of common words. A new program that evaluates the total number of term occurrences for the most frequently occurring words in an index was created to aid in the selection of common words for this list. Additional details can be found on the HathiTrust Large Scale Search Blog (http://www.hathitrust.org/blogs/large-scale-search/ [428]). Four new posts were added to the blog in July.
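The kind of program described above, one that totals term occurrences to help choose common words, can be sketched roughly as follows. This is a minimal illustration only, not the program Michigan staff wrote: the function names, the simple regular-expression tokenizer, and the in-memory sample documents are all assumptions made for the example, and a real tool would read term statistics from the Solr index itself.

```python
import re
from collections import Counter


def total_term_occurrences(documents):
    """Count every occurrence of every term across all documents.

    Hypothetical helper: tokenization is a plain lowercase
    alphanumeric split, not the analysis chain used in Solr.
    """
    counts = Counter()
    for text in documents:
        counts.update(re.findall(r"[a-z0-9]+", text.lower()))
    return counts


def candidate_common_words(documents, top_n):
    """Return the top_n (term, total occurrences) pairs, most frequent first."""
    return total_term_occurrences(documents).most_common(top_n)


# Invented sample text standing in for OCR from a few volumes.
sample_docs = [
    "the history of the library",
    "a library of the people",
    "the people and the book",
]

for term, occurrences in candidate_common_words(sample_docs, 3):
    print(term, occurrences)
```

Ranking by total occurrences rather than by the number of documents containing a term reflects the wording of the update, which refers to "the total number of term occurrences".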
Ingest – Ingest was slowed in July by the discovery that Google was making volumes available for ingest that did not contain the required descriptive metadata. Google addressed the problem and ingest continued as normal after these volumes were re-ingested.
Data API – University of Michigan staff responded to feedback received from California Digital Library on the Data API and discussion of the API continued when CDL visited Michigan. Key issues that have arisen are security and determining how much functionality should be built into the baseline API.
Collection Builder – Michigan explored solutions for integrating Collection Builder functionality into the temporary HathiTrust Catalog. Planned improvements would allow users to save multiple items to a public or private collection directly from a search results or bibliographic record listing in the catalog.
Outages – At 8:15pm EDT, Wednesday, August 5th, an incident (that we are currently investigating) at the Indianapolis data center caused HathiTrust storage at that site to be unavailable for 1 hour and 15 minutes. During that time the entire Ann Arbor node of HathiTrust as well as web servers at the Indianapolis node continued to be available for users. Our current load balancing and failover strategy does not adequately account for this sort of partial failure. In the worst case, a user whose browser was directed to the Indianapolis site may have been unable to view books in the repository during the period from 8:15-9:30pm EDT. For most users, however, load balancing would have directed their browsers to the Ann Arbor site during this period. In the coming year, we will be replacing mechanisms that currently handle load balancing and failover, and will devote attention to developing a more nuanced failover strategy.
Number of volumes added:
| Institution | July | Total |
| Indiana University | 601 | 18,482 |
| University of California | 109,403 | 308,648 |
| University of Michigan | 187,903 | 3,070,274 |
| University of Wisconsin | 3,707 | 215,045 |
| Total | 301,614 | 3,612,449 |
August 13, 2010 [Download PDF [429]]
Yale University Joins HathiTrust – We are pleased to announce that Yale University Library has joined HathiTrust. Yale will initially be contributing close to 30,000 volumes digitized with support from the Yale Provost’s Office and Microsoft, and will be identifying further materials over time. HathiTrust will benefit from Yale’s deep experience and expertise in all areas of library collections and services. More information about Yale’s new partnership can be found at http://www.hathitrust.org.
New Collections Committee and Usability Working Group – Two new HathiTrust groups were charged in July: a Collections Committee, charged by the Strategic Advisory Board (SAB), and a Usability Working Group, charged by the Executive Director. The Collections Committee is a standing committee in HathiTrust, created to make recommendations about the content in HathiTrust, including the activities, policies, tools, and services needed for partners to manage collections, as well as the processes by which collection development and management decisions should be made. Members of the committee include Ivy Anderson (chair, California Digital Library), Kim Armstrong (Committee on Institutional Cooperation), Sharon Farb (University of California, Los Angeles), Bryan Skib (University of Michigan), Claire Stewart (Northwestern University), Ann Thornton (New York Public Library), and Robert Wolven (SAB liaison, Columbia University). The full charge can be found at http://www.hathitrust.org/wg_collections_charge [327].
The Usability Working Group is charged with coordinating and overseeing usability activities across all of HathiTrust’s public interfaces, including web and mobile devices. Working group members include Suzanne Chapman (chair, University of Michigan), Jenny Emmanuel (University of Illinois), Felicia Poe (California Digital Library), and Matthew Sheehy (New York Public Library). The charge of the group is available at http://www.hathitrust.org/wg_usability_charge [430].
Local Digitization Ingest Progress – In July, a group of staff members at the University of Michigan drafted a policy and specifications framework to facilitate ingest of content from a variety of digitization sources into HathiTrust. Staff will be working in August to refine the framework using samples of locally digitized content from Committee on Institutional Cooperation (CIC) institutions. HathiTrust plans to begin ingest of content from some of these institutions in the fall and increase the scope and scale of local digitization ingest in the following months and year.
Website Redesign/Usability Exercise – Over the next couple of months staff from partner institutions, coordinated by the Communications Working Group and in consultation with the Usability Working Group, will be redesigning HathiTrust’s web presence. Their goal is to integrate the current informational (HathiTrust.org [389]) and access (Catalog.HathiTrust.org [369]) portions of HathiTrust into a single location and interface at HathiTrust.org. Staff at the University of Michigan conducted a card sorting usability exercise in July in conjunction with this redesign, to help improve the architecture of the site and general categorization. The exercise was completed by staff members across the partnership. The website redesign is targeted for completion by the end of October 2010.
Communications – In its July meeting, the Communications Working Group discussed the October website redesign, including overall goals and audience for HathiTrust.org. The group plans several new content pieces for the site, including a statement on quality and a “What is the HathiTrust” primer. Group members are planning for an in-person meeting in September where they will do substantial work on an overall communications and marketing plan for HathiTrust.
Development Environment – Staff at the University of Michigan continued to move active development of HathiTrust applications and services into the new development environment in July, and were able to run HathiTrust applications successfully there for the first time. Steps for completing migration to the new environment include integrating code that has been developed during the migration process, performing additional tests on migrated code, and developing scripts that will move the new code into production. UM staff are also configuring the environment for use by developers. Current areas of focus are establishing virtual web service and database resources on a per-developer basis, and establishing logical separations in the environment from which core developers will be able to do integration testing against the full repository.
Loading of Bibliographic Data from Illinois, Columbia – Staff at Michigan finished loading bibliographic data for content digitized both by Google and the Internet Archive (IA) from Columbia University. Ingest of this content is set to begin in mid-August. Staff also received and loaded bibliographic data for IA-digitized content from the University of Illinois.
Large-scale Search – A server dedicated to testing indexing processes and performance for large-scale search was deployed by Michigan staff in July. Specific tests mentioned in the June report remain the focus of activity and are greatly facilitated by the new testing server.
PageTurner – Developers at Michigan put a new page-image service into production in July. The service interfaces with master images in the repository to deliver access images on the Web in real time. It is being used initially to generate full-book PDF files, and will eventually serve individual page images directly to the HathiTrust PageTurner. Significant effort went into optimizing performance of the image service, which is important both for PDF generation and for serving images to applications such as GnuBook. Work to integrate GnuBook with PageTurner is in progress.
Collection Builder – Staff at the University of Michigan deployed new functionality in July that allows users to add items returned in full-text search results to a collection. More information on the new functionality and Collection Builder in general can be found under “Building Collections in Collection Builder” in the HathiTrust FAQ [431].
Storage Upgrade – Michigan staff completed the installation of 160 terabytes of new storage at the Indiana site in July. In the process, staff also upgraded cluster interconnect switches and, as the result of a data center reorganization project, relocated and re-cabled all storage and server equipment. Similar installation work is scheduled during August in Michigan. The new storage will bring the usable storage capacity at each site to 475 terabytes.
Improvements to Ingest – Architectural improvements to the ingest system are in planning and early development stages. Major enhancements include a general increase in processing throughput, improvements in barcode validation, preparation for PREMIS 2.0 support, cleaner integration with pre-ingest transformation processes (for non-Google-scanned materials), and new controls to automatically manage priority levels for content ingested from multiple sources. Suggestions and ideas in this improvement process are welcome. Please contact hathitrust-info@umich.edu [410].
Database Problem Resolved – HathiTrust was unavailable for a total of 2 hours during June and July due to a database problem in which heavy usage patterns caused rapid consumption of disk space and, ultimately, an outage for users of HathiTrust. A fix developed and released in July has resolved this problem.
Outages – HathiTrust services were unavailable on Wednesday, July 21 from 1:00-1:30pm EDT due to exhausted storage capacity on database servers in both data centers; services were intermittently unavailable to some users from 8:00pm EDT on Wednesday, July 21 to 10:00am on Thursday, July 22 due to a web server not being restarted properly by staff following scheduled storage work in Indiana.
Number of volumes added:
| Institution | July | Total |
| Indiana University | 343 | 177,676 |
| Penn State University | 331 | 23,155 |
| University of California | 126,158 | 1,543,677 |
| University of Michigan | 32,307 | 4,081,191 |
| University of Minnesota | 34 | 73,620 |
| University of Wisconsin | 11,305 | 364,944 |
| Total | 170,478 | 6,363,864 |
| Public Domain Total (~20%) | 47,805 | 1,256,156 |
11 July 2008
This is the fourth regular update on activities in HathiTrust, previously referred to as the Shared Digital Repository (SDR). These updates are distributed monthly, typically on the 2nd Friday of the month, and provide a variety of information about the general health of the repository and updates on the development of HathiTrust. Each update will be sent via e-mail to the Library Director and CIO at each participating institution. We will soon release a website for the initiative, and will post all updates on that site. We plan to make an RSS feed for the updates available in order to share the information as broadly as possible.
Throughout this update, we refer to the draft Short-Term and Long-Term Functional Objectives (being articulated by the CIC’s SDR committee) wherever a work item relates to those Objectives. We plan to restructure future updates to provide specific reports on the CIC’s short-term and long-term functional objectives.
HathiTrust Governance
Growth of HathiTrust
As of July 1st, HathiTrust contains:
Archival certification
We have completed a draft response to the required elements in the “Trustworthy Repositories Audit & Certification (TRAC): Criteria and Checklist” and are currently reviewing the draft for public release in July or early August.
As mentioned in an earlier update, we coordinated a site visit by a team from the Digital Repository Audit Method Based on Risk Assessment (DRAMBORA) effort in the European Union. Their report, an extremely favorable review of the repository, should be released publicly soon. (CIC SDR Short-Term Functional Objectives)
Infrastructure Development
Service Development
Forecasting July development
Status/availability of HathiTrust
We schedule system maintenance work that requires a system outage during time windows (in Eastern time) where academic user activity is generally lowest:
Advance notice for scheduled outages is given on business days and at least 24 hours in advance. Notice of unscheduled outages is given upon discovery, and additional updates are given as appropriate.
Please contact Phyllis White (pmwhite at umich.edu) with email addresses of individuals or groups that should be added to our system outage mailing list.
There were no interruptions in service in June.
At this time, the following outages are scheduled:
July 10, 2009 [Download PDF [432]]
New Working Group on Computational Research Center – June was an exciting month for HathiTrust, both in terms of repository development and in terms of deepening collaboration among the HathiTrust Partners. Calls sent out in May for participation in two HathiTrust working groups were answered, and membership in both the Research Center and Development ‘sandbox’ groups was finalized. Members of the Research Center working group, which will develop a proposal for a Research Center to be created under the terms of the Google Settlement, include Steven Abney (University of Michigan), Jack Bernard (University of Michigan), Geoffrey Fox (Indiana University), David Goldberg (University of California Irvine), Robert McDonald (Indiana University), Qiaozhu Mei (University of Michigan), John Ober (California Digital Library), Beth Plale (Indiana University), Scott Poole (University of Illinois), Sarah Shreeves (University of Illinois), and John Unsworth (University of Illinois). The group will be coordinated by Kat Hagedorn, with project support to be provided by Jeremy York from HathiTrust.
Working Group on Development 'sandbox' – The Development ‘sandbox’ working group, which will work to create a development environment for partners to build and test repository applications and services, includes Stephen Abrams (California Digital Library), Albert Bertram (University of Michigan), Lynne Cameron (California Digital Library), Kaylea Champion (University of Chicago), Stephanie Collett (California Digital Library), Steve DiDomenico (Northwestern University), Bill Dueber (University of Michigan), Mike Durbin (Indiana University), Phil Farber (University of Michigan), Paul Fogel (California Digital Library), Eric Hetzner (California Digital Library), Sebastien Korner (University of Michigan), John Kunze (California Digital Library), David Loy (California Digital Library), Andy Mardesich (California Digital Library), Mairéad Martin (Pennsylvania State University), Jon Miller (University of Chicago), David Minor (San Diego Supercomputer Center), Bill Parod (Northwestern University), and Cory Snavely (University of Michigan).
Prototype for New HathiTrust PageTurner — As plans for the Development environment moved forward, the University of Michigan and California Digital Library (CDL) continued to explore possibilities for integrating the GnuBook reader into the current HathiTrust PageTurner to expand PageTurner’s features and capabilities. The California Digital Library created a prototype GnuBook-integrated page turner application with repository code and a sample volume made available by the University of Michigan. Staff at the University of Michigan are currently testing the functionality of the prototype and will work with CDL in July to determine the next steps for development. This collaboration is exciting not only because of the enhancements it will bring to the existing PageTurner application, but because it demonstrates the way that shared development will enhance the services and capabilities HathiTrust is able to offer.
CDL staff to visit Ann Arbor — Collaborating to enhance services and capabilities is the major theme of a visit that HathiTrust team members from the California Digital Library will make to the University of Michigan in July. Staff from both institutions will discuss a range of topics including the ingest of Internet Archive and other non-Google content, development of the HathiTrust PageTurner, communication about HathiTrust, and future development directions in a series of focused meetings from July 20th to 21st.
HathiTrust-OCLC Catalog Project — June was an important month for discussions about the HathiTrust-OCLC catalog, particularly regarding metadata and holdings information functions and display. At each juncture, the project team has prioritized meeting the unique needs of HathiTrust’s all-digital catalog while maintaining consistency across the entire WorldCat database. For example, the team recently discussed how to accommodate viewability levels (e.g., search only, full-text, or a mix of the two in multi-volume sets) that do not occur in any other WorldCat records. The team has also focused on strategies for displaying and faceting on HathiTrust’s many contributing institutions, in a way that would differentiate this information from print holdings.
In striving to create a consistent user experience of HathiTrust, the team has turned to user feedback on the temporary beta catalog (http://catalog.hathitrust.org/ [405]). Future months will see increased focus on display and interface concerns as well as functionality issues.
HathiTrust.org Website Reorganization – Due to the evolving nature of HathiTrust and the additional information it has been necessary to incorporate on the HathiTrust website, staff at the University of Michigan undertook a comprehensive reorganization of the website. The website retains the same look and feel, but information about areas such as preservation, rights management, partnership, and access has been more clearly separated and defined so that it is easier to locate. As part of the changes, additional information has been added about the requirements and benefits of becoming a partner, how to become a partner, and the costs of partnership.
Strategic Advisory Board Meeting Minutes — The HathiTrust Strategic Advisory Board (SAB) met for the first time on June 17. The minutes of this meeting are posted on the HathiTrust website at http://www.hathitrust.org/sab [433]. Future minutes of the SAB will be posted here as well.
Large-scale Search – University of Michigan staff ordered additional servers to support large-scale search in June, and prepared space for them in the MACC data center in Ann Arbor. UM also continued to explore the use of common-grams in large-scale search with a focus on refining the set of common terms in order to strike a balance between Solr index size and performance. Performance testing that was conducted in the process generated unexpected results that led to the discovery of a bug in a custom Solr punctuation filter. The bug was fixed, and tests will be conducted again in July.
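The trade-off between index size and performance mentioned above can be illustrated with a simplified sketch of the common-grams idea: wherever a token sits next to a word from the common-word list, a joined two-word token is indexed in addition to the single words, so phrase queries containing very frequent words can match the much rarer pair. The function name, the underscore separator, and the sample phrase are assumptions chosen for illustration; this is not Solr's implementation and does not reproduce its exact output.

```python
def common_grams(tokens, common_words):
    """Return each token plus a joined bigram wherever either neighbor is common.

    Simplified, hypothetical illustration of the common-grams idea;
    real analysis filters also track token positions.
    """
    output = []
    for i, token in enumerate(tokens):
        output.append(token)
        if i + 1 < len(tokens):
            following = tokens[i + 1]
            if token in common_words or following in common_words:
                output.append(f"{token}_{following}")
    return output


tokens = "the rain in spain".split()

# A longer common-word list produces more bigrams, hence a larger index.
print(common_grams(tokens, set()))
print(common_grams(tokens, {"the"}))
print(common_grams(tokens, {"the", "in"}))
```

Each word added to the common-word list adds bigram entries to the index, which is why the choice of list affects both index size and query speed.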
The large-scale search team has also encountered a problem when building full-text indexes for the beta large-scale search (http://babel.hathitrust.org/cgi/ls [434]), in which indexing stops when memory errors are encountered after about a day and a half of indexing. This problem will be investigated further in July.
Ingest – As the numbers for New Growth show, more than 375,000 new volumes were added to the repository in June. This large amount is due both to an increase in digitized volumes available from partner institutions and an increase in ingest capacity gained by bringing a server online that had been held as a spare, boosting ingest rates by up to 25%.
Data API – University of Michigan staff have completed a response to feedback received from California Digital Library on the Data API following the release of the Data API specification in April (http://www.hathitrust.org/data_api [103]). This response will be shared with CDL in early July. In the meantime, CDL continues to test end user functionality of the Data API alpha release. The Data API allows metadata for volumes in the repository, as well as OCR text and images of the volumes themselves, to be retrieved from the repository. Although it may have many uses, the Data API is intended to facilitate the development of custom applications by HathiTrust partners and others for delivering and using content in the repository.
Changes to Google Metadata – Over the last several weeks, Google library partners have worked with Google to incorporate improvements Google has made to the metadata it returns to partners with their digitized volumes. These include the addition of descriptive metadata, manual image auditing information, calibration information, and more. HathiTrust has accepted much of this new information into the HathiTrust METS metadata package that accompanies volumes in the repository. The changes occurred seamlessly and had no effect on the delivery of volumes through the PageTurner application.
Number of volumes added:
| Institution | June | Total |
| Indiana University | 5,136 | 17,881 |
| University of California | 113,139 | 199,245 |
| University of Michigan | 223,460 | 2,882,371 |
| University of Wisconsin | 37,473 | 211,338 |
| Total | 379,208 | 3,310,827 |
July 9, 2010 [Download PDF [435]]
Shibboleth and Full-PDF Download – HathiTrust released Shibboleth as a mechanism for partner authentication in June. Authenticated users can now download full-PDFs of all public domain volumes in HathiTrust, and access the Collection Builder feature through local sign-on. Shibboleth also lays the groundwork for future augmented services to partner institutions, potentially including the ability to make uses of digital volumes allowed by Section 108 of U.S. copyright law and to allow full access to in-copyright volumes for users with print disabilities.
Full-PDF Download: The release of Shibboleth was made in conjunction with improvements to PageTurner that enabled delivery of high-resolution PDF files with embedded OCR for entire volumes. While only individuals at member institutions have access to this service across the repository, all public domain volumes that were not digitized by Google are available for full-PDF download to members and non-members alike. Right now these include nearly 100,000 Internet Archive-digitized volumes that have been contributed by the University of California, and thousands of volumes digitized locally by the University of Michigan. The partners are poised to significantly increase the amount of non-Google-digitized content preserved in HathiTrust in the near future, making many more public domain volumes freely available for download and distribution.
SEASR – HathiTrust is in the process of investigating SEASR, the Software Environment for the Advancement of Scholarly Research, as a means to provide computational access to materials stored in the repository. Staff at the University of Michigan began installation of SEASR in the HathiTrust development environment in June, and expect to gain more knowledge about SEASR and what would be involved in applying it to HathiTrust over the next several weeks.
Discovery Interface – As of the end of June, there are nearly 3.1 million HathiTrust records in WorldCat. Record loading is now continuing at a quicker pace, and is nearly complete. Meanwhile, the working group is in the process of configuring the HathiTrust-OCLC catalog interface to make branding and design consistent with the existing HathiTrust Digital Library system. OCLC is also making several alterations to the catalog’s functionality to fully meet HathiTrust’s requirements. This work is expected to extend into early August, after which time the interface will be reviewed for public beta release.
Collaborative Development Environment – University of Michigan staff continued the migration of HathiTrust applications into the new development environment in June, and also tested and configured the GlusterFS distributed file system that will be used as the storage back-end for the environment. Michigan staff are in the process of setting up and testing the virtual MySQL and web service provisions of the new environment. An initial version of the development environment is being used currently by staff at California and at Michigan to make improvements to the existing PageTurner application. When configuration is complete, the environment will support HathiTrust development efforts broadly across the partnership.
Quality, Ingest, and Error Rate – The quality working group is still working through a set of scenarios for gating volumes of poor quality from entering HathiTrust, and developing a justification and recommendation for the best approach to follow. A set of larger issues around quality has also been identified, some of which deal with larger policy considerations.
Large-scale Search – The full text search index in Indiana was put into production by Michigan staff in early June, making the infrastructure for full text search fully redundant. Two new index build servers were also put into production in Michigan. All of the new systems have been functioning well, and the new build servers have substantially improved the performance of index building and maintenance.
Michigan staff began running tests in June to determine the effects of cache-warming on performance, as well as tests relating to scaling strategy and indexing speed. The goal of scaling tests is to determine the optimum size to use for index shards, or sections of the search index, that are stored on each index server, the optimum number of shards per server, and optimum memory allocation per server. Indexing speed is of critical importance for deploying new searching features, which often requires the entire search index to be rebuilt.
PageTurner – Additional progress was made on GnuBook integration with the current HathiTrust PageTurner. In particular, Michigan investigated ways to optimize the serving of thumbnails. Performance optimization for the new page image server also continued, with a focus on common CGI performance mechanisms, including FastCGI.
Collection Builder – Integration of Collection Builder functionality with large-scale search is in the final stages of testing and will be deployed in July.
Storage Upgrade – Michigan staff have ordered and received additional storage for the Indiana and Michigan sites and will be putting it into service during July and August. The upgrade requires the installation of a new, larger storage network switch, so staff will be using the opportunity to introduce a new cabling layout for the entire system. In Indiana, the upgrade and recabling work will be combined with a recommended relocation of all server equipment to another area of the data center for improvements in air handling and a transition to high-voltage power distribution. No outage is expected for this maintenance work.
Outages – HathiTrust services were unavailable on Monday, June 7 from 7:10-10:00am and on Tuesday, June 8 from 5:00-5:30pm due to a connectivity problem with one of the web servers; and on Saturday, June 25 from 8:30-10:00am due to a database server disk space shortage.
Number of volumes added:
| Institution | June | Total |
| Indiana University | 236 | 177,333 |
| Penn State University | 328 | 22,824 |
| University of California | 616 | 1,509,169 |
| University of Michigan | 34,605 | 4,056,835 |
| University of Minnesota | 173 | 73,856 |
| University of Wisconsin | 10,073 | 353,639 |
| Total | 46,031 | 6,193,386 |
| Public Domain Total (~20%) | 32,805 | 1,208,351 |
| ALA BISG/NISO Forum | June 25 |
Shared Digital Repository
March 2008 Update
11 April 2008
This is the first regular update on activities in the Shared Digital Repository (SDR). These updates will be made available monthly, typically on the 2nd Friday of the month, and will provide a variety of information about the general health of the repository and updates on the development of the SDR. Each update will be sent via e-mail to an official representative (typically the library director) of a participating institution, and will be posted on the SDR website. We plan to make an RSS feed for the updates available soon, in order to share the information as broadly as possible.
As of April 11th, the SDR contains:
No certification process currently exists to ascertain a digital repository’s fitness for long-term curatorial responsibility. We are, nevertheless, hard at work on ensuring a high degree of transparency about the SDR’s compliance on issues related to archiving responsibility. Content that we ingest is intensively reviewed to ensure that it is valid and has not been affected by transmission; we are working to develop regular routines that re-validate using stored checksums. We have also undertaken efforts to communicate our readiness or fitness for long-term archiving responsibility. First, we have completed a draft response to the required elements in the Trustworthy Repositories Audit & Certification (TRAC): Criteria and Checklist, and will post a preliminary version of our response on the SDR website in the relatively near future. Second, we coordinated a site visit by a team from the Digital Repository Audit Method Based on Risk Assessment (DRAMBORA) effort in the European Union, and they will make their report, which provides an extremely favorable review of the SDR, public soon.
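A routine that re-validates content against stored checksums, as mentioned above, might look roughly like the following sketch. It is illustrative only and not the SDR's actual code: the function names, the choice of SHA-256, and the dictionary of stored checksums keyed by file path are all assumptions, since the update does not say which algorithm or storage layout is used.

```python
import hashlib
from pathlib import Path


def file_checksum(path, algorithm="sha256", chunk_size=1 << 20):
    """Compute a hex digest of a file, reading it in chunks to limit memory use.

    The default algorithm is an assumption; the update does not name one.
    """
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def revalidate(stored_checksums, algorithm="sha256"):
    """Return the paths that are missing or whose current checksum differs.

    stored_checksums is a hypothetical mapping of file path to the
    checksum recorded when the file was ingested.
    """
    failures = []
    for path, expected in stored_checksums.items():
        if not Path(path).is_file() or file_checksum(path, algorithm) != expected:
            failures.append(path)
    return failures
```

Run on a schedule, a check like this would flag any file that has changed or disappeared since its checksum was recorded; an empty result means every listed file still matches.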
Each month in this update, we will provide information on planned outages so that scheduled activities (e.g., classes and presentations) can work around these times. When it is necessary to interrupt availability of the SDR, we will schedule:
We will collect email addresses for people who should receive advance notification of planned outages as well as.
HathiTrust
Update on March 2009 Activities
April 10, 2009
Outages:
PLEASE NOTE: Please contact Chris Butchart-Bailey (chrisbu at umich.edu) with email addresses of individuals or groups that should be added to our system outage mailing list.
We schedule system maintenance work that requires a system outage during time windows (in Eastern time) where academic user activity is generally lowest:
Advance notice for scheduled outages is given on business days and at least 24 hours in advance. Notice of unscheduled outages is given upon discovery, and additional updates are given as appropriate.
Internet Archive Ingest – Staff at the University of California completed quality review of the pilot set of Internet Archive-digitized volumes in March, and submitted a set of final issues to team members at the University of Michigan. These have largely been resolved. Michigan staff also determined the cause of the validation error reported in last month’s update. Correcting the error led to further revisions of the preservation metadata schema and re-evaluation of the validation routines put in place for Internet Archive-digitized content. Updates to these routines are currently being implemented. California sent bibliographic records for a set of 97,000 Internet Archive-digitized volumes to be loaded into HathiTrust. As soon as the updates to the ingest process and fixes for issues raised in QA are in place, download of these volumes will begin.
Local Digitization – The University of Michigan has begun to receive locally-digitized content from several partner institutions for ingest into HathiTrust. Two programmers hired by Michigan in February have started to evaluate the material, determining needs and requirements for ingest, both in terms of digital package specifications and content transformation routines. Jessica Feeman, a programmer at Michigan and the original developer of the data validation and ingest system for HathiTrust, left her position at the end of March to start a family (congratulations, Jessica!). A new position will be opening in April.
Discovery Interface – OCLC loaded test batches of HathiTrust bibliographic records into WorldCat in March. After the batches were reviewed by OCLC and the HathiTrust team, OCLC initiated full-scale loading. At the end of March, 1.1 million HathiTrust records had been added to WorldCat through OCLC’s eContent Synchronization mechanism, and the loading process continues.
HathiTrust and OCLC recently completed a first round of usability testing for the version 1 HathiTrust catalog, involving five participants in individual one-hour sessions. Members of the OCLC and HathiTrust teams are currently analyzing the results of the testing, particularly in relation to HathiTrust’s requirements for the version 1 catalog. Special thanks in this effort are due to the HathiTrust colleagues at Penn State University, where the testing took place, as well as to OCLC for providing gift cards as incentives to participants.
Collaborative Development Environment – Michigan staff are in the process of designing the architecture for the new development environment according to the general direction set by the working group. The design incorporates practicalities such as directory naming conventions that will be compatible with the version control strategy. Staff are also discussing initial provisions for virtualization within the environment, including one virtual web and database environment for each developer, one for pre-release integration testing, and numerous instances for public “beta” exhibition and review of new features. The group is working to transition active HathiTrust development at Michigan to the new environment in April.
Shibboleth – Staff at the University of Michigan have been discussing the most appropriate set of attributes to request for release to HathiTrust applications via Shibboleth, consulting with experts at partner institutions, including Michigan’s central Information and Technology Services, which will coordinate Shibboleth federation interactions for HathiTrust. Shibboleth will be a mechanism by which HathiTrust is able to provide specialized services, such as full-PDF download of repository volumes, to partners. The final attributes to be requested are eduPersonAffiliation, eduPersonScopedAffiliation, eduPersonTargetedID, and displayName. Registration of the HathiTrust Service Provider is in progress and we hope to release the service in April.
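As a rough illustration of how an application might use the four requested attributes, the sketch below checks which attributes were released and whether a scoped affiliation identifies the user as a member of a partner institution. The dictionary layout (attribute name mapped to a list of values), the `member` affiliation rule, the example scope, and the function names are all assumptions made for this example; they are not HathiTrust's actual access policy or code.

```python
# The four attributes named in the update.
REQUESTED_ATTRIBUTES = (
    "eduPersonAffiliation",
    "eduPersonScopedAffiliation",
    "eduPersonTargetedID",
    "displayName",
)


def missing_attributes(released):
    """List requested attributes that the identity provider did not release."""
    return [name for name in REQUESTED_ATTRIBUTES if not released.get(name)]


def is_partner_member(released, partner_scopes):
    """Hypothetical rule: allow users with a 'member@<scope>' value for a partner scope."""
    for value in released.get("eduPersonScopedAffiliation", []):
        affiliation, _, scope = value.partition("@")
        if affiliation.lower() == "member" and scope.lower() in partner_scopes:
            return True
    return False


# Example session attributes (invented values for illustration only).
session = {
    "eduPersonAffiliation": ["member", "staff"],
    "eduPersonScopedAffiliation": ["member@umich.edu"],
    "displayName": ["Example User"],
}
print(missing_attributes(session))               # attributes not released
print(is_partner_member(session, {"umich.edu"}))  # eligibility under the assumed rule
```

Checking for unreleased attributes separately from the eligibility decision keeps configuration problems at an identity provider distinct from genuine access denials.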
Large-scale Search – Programmers at the University of Michigan continue to investigate queries taking longer than 30 seconds to execute. The present theory is that certain components of the hardware (network cards) are causing intermittent problems that disrupt communication with the Solr server. The focus is on isolating and replacing the problematic cards.
HathiTrust team members from Michigan and Indiana are coordinating on the installation of new servers in Indianapolis to make the large-scale search service redundantly hosted at the Indiana and Michigan sites. This work has required the installation of new electrical and networking capacity in Indiana, which is almost complete. The setup and configuration of the new servers is expected to be fairly simple, as it is a near-replica of the architecture already in place in Michigan.
PageTurner – University of Michigan developers continued to improve the performance of a new service that will deliver full-volume PDFs of public domain materials to users at HathiTrust partner institutions. The service will be available to partner institutions via Shibboleth authentication. Michigan also began development on a new method of delivering individual page images to the HathiTrust PageTurner, one that will scale, rotate, and watermark images on the fly. Development is about 75% complete, and the new method is already being used in the collaborative development environment as part of the University of California’s work to integrate GnuBook into the HathiTrust PageTurner. Michigan and California are working together on enhancements to the existing PageTurner interface to incorporate the GnuBook improvements.
Outages – Large-scale search service was unavailable from 10am-1pm EST on March 25 while software and firmware upgrades were applied to the storage systems in Michigan and Indiana. The upgrades did not result in outages for production systems. The large-scale search application is considered beta pending redundant hosting of the service in Indiana. In the future, we will work to communicate planned outages for services like large-scale search despite their beta status.
Number of volumes added:
| | March | Total |
| Indiana University | 138 | 175,020 |
| Penn State University | 1,469 | 6,613 |
| University of California | 1,940 | 1,164,225 |
| University of Michigan | 72,227 | 3,860,817 |
| University of Minnesota | 880 | 65,876 |
| University of Wisconsin | 11,923 | 315,650 |
| Total | 88,798 | 5,588,311 |
[Download PDF [439]]
HathiTrust has been certified by the Center for Research Libraries (CRL) for compliance with the Trustworthy Repository Audit and Certification (TRAC) [365] criteria for digital repositories. This important certification has been a key aim of the partnership since the repository’s founding in 2008, and one we intend to uphold in coming years. The full audit report is posted on the CRL website [440]. HathiTrust posted a news release [265] on the certification and updated documentation [253] on HathiTrust’s compliance with TRAC elements. In conjunction with this announcement, we have included a spotlight on HathiTrust technology below, posted also at http://www.hathitrust.org/technology [256].
Partners from across the country attended the HathiTrust new partners webinar on March 23. A variety of topics were addressed, including HathiTrust’s organizational structure and costs, our collections and services, and future directions. Partners also had an opportunity for Q&A with the presenters. The webinar will be offered on two additional dates: Tuesday April 12, 12:30-2:00pm, and Friday April 15, 12:30-2:00pm (both Eastern Daylight Time). If you would like to attend, please RSVP to Jeremy York as soon as possible before each webinar: jjyork@umich.edu [269]. Please also include any questions or issues you would like the presenters to address.
Due to a high level of interest expressed by non-HathiTrust partner institutions, an open webinar will be held on May 3 and May 5 from 11am-12pm Eastern time. This webinar will be open to the public. As above, if you would like to attend, please RSVP to Jeremy York as soon as possible before each webinar (jjyork@umich.edu [269]), and include any questions you would like the presenters to address.
In 2010, the Institute of Museum and Library Services granted the University of Michigan and Associate Professor Paul Conway funding to research quality in large-scale digital repositories. The grant project is using HathiTrust as a test-bed for review. Work on Phase One of the project commenced in late January 2011 with the creation of a project team to focus on defining error types and levels of severity, statistical analysis processes, a web application for data entry, and project management procedures. By the end of March, the team had identified initial project needs and accomplished the following: identified twelve initial error types including scales of severity, hired and trained two data coders, coded an initial random sample of 15 volumes from the public collection, analyzed variance in coding within the sample, and produced a first draft of procedures for quality evaluation. The team also connected with project members at the University of Minnesota who will be participating in the grant, sharing initial documentation and results. For further information regarding progress and updates, please see the HathiTrust grant projects webpage [305].
HathiTrust has been working to design and populate a database of information representing the print holdings of all partner institutions. This database will serve a number of important functions:
To date, approximately 119.5 million rows of data have been received from partners, with each row representing one copy of a single-volume monographic print item that is (or previously was) held at a partner institution. At this point, we have outgrown the hardware where initial database testing and development took place. When new HathiTrust development environment hardware becomes available in early April, a new version of the database will be created, all of the data we have received will be loaded, and we can begin generating statistics and preliminary cost modeling data. At the same time, we will be working toward a near-term production release of the database to support services to users with print disabilities at partner institutions.
Upcoming development work will focus on improved duplicate detection and clustering mechanisms on two fronts: we are working with OCLC on the development of tools that will provide improved identification of potential duplicate bibliographic records; and we will be ramping up our work on duplicate detection/matching mechanisms for the parts of multi-part works to allow expansion of the print holdings database to include serials and multi-part monographs.
Staff members at the University of Michigan are currently investigating a sample of rare volumes digitized from Universidad Complutense de Madrid for deposit. Staff are also performing final evaluation of approximately 600 locally-digitized volumes submitted by Northwestern University.
Ingest of an initial set of more than 70,000 volumes from the Library of Congress, digitized in partnership with the Internet Archive, was completed in March.
The Collections committee continued to work on recommendations regarding duplicate volumes in HathiTrust, coordinated print management, and responding to users’ requests to contribute volumes to the repository.
HathiTrust figured prominently in the news in March, and the working group was in high gear to disseminate announcements regarding the Google Settlement ruling [266], HathiTrust’s agreement with Summon [441], and the positive outcome of the TRAC audit [265]. The group also began setting up a HathiTrust Facebook presence and conducted the first of three new partner webinars.
New, more powerful MySQL servers were installed in the development environment to support the additional performance requirements of the partner holdings database. The new servers are being synchronized in real time with the old in preparation for a cutover planned for early April.
The WorldCat Local Prototype usability test reported in last month’s update [442] ran for a few weeks in March. User experience experts from the Discovery Interface Working Group (DIWG) and OCLC are analyzing the data and drafting a report of findings for review. The Full-Text Search Subgroup, charged to “identify and prioritize features and functions anticipated to have immediate high-impact value to users that can be reasonably afforded by the existing technology framework,” presented its analysis and recommendations [270] to the DIWG, which fully endorsed them.
Usability Working Group members continued their work as liaisons in other HathiTrust committees in March. The group also began to develop a set of personas and use cases to inform development and policy-making surrounding HathiTrust applications and interfaces. The Usability Group is still looking for people to join the new User Experience Special Interest Group (UX-SIG), reported in last month’s update [443]. Please contact Suzanne Chapman (suzchap@umich.edu [170]) if you are interested in joining this group or have any questions about participation.
The charge of the User Support Working Group was approved by the Executive Committee and is posted online [190]. The group plans to schedule its first call in April, and will become the primary body responsible for addressing user inquiries submitted through HathiTrust interfaces and the HathiTrust contact address.
The Metadata Management System development team at California Digital Library (CDL) continued development of the core database system in March. The team continues to review workflows for receiving bibliographic data from HathiTrust content-contributing partners, and has responded to changes in bibliographic processing at the University of Michigan by adjusting processes in the new system to mirror those changes. Team members continue to benchmark data loading performance and adjust computing resources for optimum results. Interviewing continues for a Principal Metadata Analyst. The position opening is posted on the CDL website [444