Navigation

Update on December 2011 Activities

January 13, 2012 Syndicate content

[Download PDF]

Late Breaking News


HathiTrust Passes 10 Million Volumes

View statistics and a timeline on the HathiTrust blog.

Top News

Changes to Tab-delimited Files

On February 1, HathiTrust will be adding three additional columns to the tab-delimited inventory files (“hathifiles”) available at http://www.hathitrust.org/hathifiles. The files are frequently used by partners and non-partners as a means to obtain full bibliographic records for HathiTrust items to load into local catalogs (see HathiTrust Data Availability and APIs). The additional columns will identify the publication date and publication location of volumes in HathiTrust, as well as volumes that have been identified as U.S. federal government documents.

Ingest

Works Digitized Locally and by Internet Archive

Staff at Michigan continued conversations with staff at the University of Florida regarding ingest of locally-digitized materials, and staff at several other institutions regarding ingest of Internet Archive-digitized materials.

Working Groups and Committees

Working groups and committees in HathiTrust may have an operational or strategic focus. See http://www.hathitrust.org/working_groups for more information.

Operational

Communications Working Group

The Communications Working Group continued to work on on a public services-oriented communications package, as well as announcements for new partners and the major milestone of 10 million volumes.

User Experience Advisory Group

The User Experience Advisory Group began reviewing the current home page and discussed additions and issues that will need to be addressed in a forthcoming redesign. Group member Jenny Emanuel contributed a "Perspectives from HathiTrust" blog post about the group's persona work that was completed in November. 

User Support Working Group

The User Support Working Group is still seeking nominations for new members. See the Update on November Activities for details.

The table below contains a summary of the issues received by the User Support Working Group in December.

Issue Type November December
Content 107
81

Quality

10271

Non-partner Digital Deposit

02

Collections

56
Cataloging 4330
Access and Use 103107

Copyright

55

59

Permissions

104

Takedown

12

Print on Demand

21

Inter-library loan

02

Full-PDF or e-copy requests

1528

Datasets

12

Data Availability and APIs

20

Reuse of content

11
Web applications 2418

Functionality problems

59

Problems with login specifically

11

General Questions about login

20

Partners setting up login

31

Usability issues

22

Feature requests

51
Partner Ingest 35
General 4750

Partnership

67

Infrastructure

00

Miscellaneous

4143

*See User Support Working Group Issue Types for a description of the types of issues included in each category.

Projects

Bibliographic Data Management System

Team members from California Digital Library continued work on processes to compare bibliographic records in Zephir, the new metadata management system under development, with records in HathiTrust’s existing system. Zephir team members continued to load and test new records as well, and refine the timeline for migration of bibliographic metadata management services to Zephir in coordination with staff at the University of Michigan.

HathiTrust Publishing (HTPub)

Staff at the University of Michigan revised the goal statement for HTPub (see the project web page) and plans for system architecture. Staff also began work on establishing a project timeline.

HathiTrust Research Center

Several changes were made to the HTRC leadership in December. John Unsworth, a key member of the Team at Illinois, accepted a position as vice provost for Library and Technology Services and chief information officer at Brandeis University. He will be leaving the University of Illinois but remain on the Executive Management Team. The Team will keep its base composition of 2 members from the University of Illinois and 2 from Indiana University, so this change will add one new member. Stephen Downie, Associate Dean for Research at the University of Illinois Graduate School of Library and Information Science, will fill the position left by John. Stephen’s research has focused on music information retrieval and data mining. This work has involved building significant infrastructure for research, including grappling with issues of allowing computational access to in-copyright material. Finally, Marshall Scott Poole is stepping aside as co-director of the HTRC for personal reasons, though he will remain on the Executive Management Team. Stephen Downie will take his place as co-director of the HTRC with Beth Plale, who is co-director on the Indiana University side. Beth also chairs the Executive Management Team. The changes are in effect as of January 1, 2012.

IMLS Quality Grant

In December, project staff completed physical review of more than 90% of the volumes in the first 1,000 volume sample drawn from HathiTrust. Staff are working to arrange on-site review with cooperation from HathiTrust member libraries for the approximately 70 volumes that are not available via inter-library loan due to poor condition, non-circulating collection, or other reason.

Project staff concluded page-level data collection for the second production sample in December (see the Update on September 2011 Activities for details on the composition of the sample). The full dataset will be sent to the project statistician in early January for analysis. Data collection for the third production run began in the late December. The third production run focuses on Internet Archive-digitized volumes published pre-1923.

Project staff continue to define requirements for a new quality review interface, targeted specifically for review of volume-level errors such as missing, duplicate, and out-of-order pages. Please visit the project website for updates.

Development Updates

Full-text Search

Michigan staff released a new version of the full-text search index in December. The new release corrected an error in the “Original Location” metadata facet and provided additional metadata for advanced search and relevance ranking. It also made it possible for full-text search results and facets to reflect whether or not users from partner institutions are able to view in-copyright items when lawful access is permitted (HathiTrust is currently pursuing providing access to in-copyright works to users who have print disabilities, for preservation uses, and in circumstances where works are copyright-orphaned). Access in these circumstances, which are still pending deployment to partners, is dependent on partner institutions owning or previously owning print copies of works in question and users’ location inside or outside the United States.

Michigan staff continued development on an advanced search feature for full-text search, including preliminary testing of the first working prototype in HathiTrust’s development environment.

California Digital Library continued work on a spelling suggestion feature for full-text search queries. A CDL developer established an account in the HathiTrust development environment and used a sample index of public domain materials to test strategies for automatically building a bigram dictionary of words with different spellings users might enter.

Tom Burton-West's proposed talk on "HathiTrust Large Scale Search: Scalability meets Usability", was accepted by popular vote for the 2012 Code4Lib Conference in Seattle, WA.

Throttling

Staff at Michigan released a new throttling mechanism for HathiTrust, which allows throttling levels to be set at more granular levels. Users are now less likely to be throttled in the course of normal use as the new throttling policies are applied to specific scenarios such as viewing thumbnail or page images, or downloading PDFs, as opposed to all use generally. Throttling ensures compliance with third-party restrictions on bulk download of materials, and helps to ensure a consistent and reliable experience for all users.

PageTurner

In connection with HTPub, Michigan staff continued work to adapt the HathiTrust PageTurner to display XML content.

Security Risk Assessment and Vulnerability Test

Michigan Library staff continue to work with central IT security analysts to complete the Risk Assessment that was started in November, and have received the final report of the vulnerability penetration test. The report revealed no vulnerabilities that enabled direct or indirect access to the repository, but noted software issues such as cross-site scripting vulnerability and also made recommendations for increased firewalling at the Michigan site. All software issues noted in the report were addressed in December. A broader firewalling project for the data center where the Michigan instance is hosted is already in progress but not yet complete, and so some provisional steps were taken to tighten security while that effort continues.

Outages

HathiTrust services were inaccessible or diminished for several periods in December due to problems related to the release of the new throttling system (all times EST): on Tue, Dec 13 4:25-4:30pm, Wed, Dec 14 11:10am-12:00pm, and Wed 12-21 7:30-10:30am, all page viewing was affected, and on Tue, Dec 13 3:45-5:00pm, full-book PDF download was affected. Additionally, page viewing of volumes classified as "Public Domain in the United States" in HathiTrust was intermittently unavailable on Wed 12-21 from approximately 1-4:30pm EST due to an apparent outage with an externally-hosted proxy detection system.

HathiTrust sends notice upon discovery and resolution of unscheduled outages and in advance of scheduled outages and maintenance work that may result in an outage. We welcome and encourage additional recipients for these notices. If your institution is not receiving outage notifications and would like to, please contact feedback@issues.hathitrust.org.

Papers & Presentations


All HathiTrust papers, presentations, and reports are available at http://www.hathitrust.org/papers.

New Growth


As of November 1:

  December Total
Columbia University 4
64,176
Cornell University 9,871
383,690
Duke University 21 4,522
Harvard University 434 53,440
Indiana University 324
186,912
Library of Congress 15,769 89,411
North Carolina State University 0 3,196
University of North Carolina - Chapel Hill 0 8,087
Northwestern University 237 5,649
New York Public Library 76
259,453
Penn State University 1,821
42,917
Princeton University 350
249,679
Purdue University 0
887
University of California 114,906
3,287,654
The University of Chicago 1,730
10,608
University of Illinois 0 14,503
Universidad Complutense 28 108,668
University of Michigan 22,907 4,504,601
University of Minnesota 916
90,239
University of Wisconsin 15,902 527,334
University of Virginia 12 47,396
Utah State 0 46
Yale University 0 23,674
Total 185,311 9,966,572

Public Domain (~27%)

Total* 50,434 2,712,626

January Forecast


  • Continue work on the advanced search feature for full-text search

You can follow HathiTrust on Twitter http://www.twitter.com/hathitrust