HathiTrust Updates

2011 Year In Review

The close of 2011 marked 4 years since the first formal commitments were made to building HathiTrust, a broad collaborative of academic and research institutions that are working together to ensure the long-term preservation and accessibility of the cultural record.

2011 saw the solidification of the HathiTrust repository's position in the library community, as it received Trustworthy certification from the Center for Research Libraries. It also saw the solidification of HathiTrust services: a new mobile interface was released; significant enhancements were made to the Full-text search, PageTurner, and Collection Builder applications; and a database of print holdings was incorporated into access systems, creating a mechanism for lawful access to in-copyright materials that are held by member institutions.

The HathiTrust partnership achieved a new level of cohesion and stability in 2011 as well, as the member institutions came together in a Constitutional Convention to make collective decisions about the structure and priorities of the initiative going forward. Agreements with a variety of entities (organizations, academic presses and vendors) to expand access to materials in HathiTrust and enhance their discovery further magnified the impact of the partnership’s work in the broader library community.

2011 offered partners the opportunity to reflect on the accomplishments of HathiTrust in its first years, and make collective plans to address the challenges libraries face in stewarding and provisioning the cultural record in years to come. We move into 2012 with optimism, based on what we have been able to achieve, in our ability to collaborate deeply and effectively to address these challenges, and maintain and even enhance the role that libraries play in the new, shared, digital future.

A summary of HathiTrust activities in 2011 is given below:

New Partners

HathiTrust grew from 52 to 66 partners in 2011. The new institutions that formally announced partnership include:

  • Boston College
  • Boston University
  • Lafayette College
  • University of Arizona
  • University of Connecticut
  • University of Florida
  • University of Notre Dame
  • University of Miami
  • University of Missouri

New Content

HathiTrust partners contributed 2,129,874 volumes to the repository in 2011, for a total of 9,966,572. Of the volumes added in 2011, 753,403 are either in the public domain or volumes that rights holders have given HathiTrust permission to make publicly available; this brings the number of such volumes to 2,712,626, or 27% of the repository overall. HathiTrust exceeded 10 million volumes in early January 2012 (see the blog post and timeline of repository development).

Having completed a framework for ingesting volumes from varied sources at the end of 2010, in 2011 HathiTrust began to scale up ingest of locally-digitized content from partner institutions. Large-scale deposits continued as well. New institutions contributing content in 2011 included:

Large-scale digitization

  • Library of Congress
  • Harvard University
  • University of Virginia
  • Northwestern University
  • Purdue University
  • North Carolina State University
  • Duke University
  • University of North Carolina-Chapel Hill

Local or in-house digitization

  • Universidad Complutense de Madrid
  • University of Minnesota
  • Utah State University Press
  • Yale University

Conversations regarding ingest of locally-digitized materials were initiated with

  • Columbia University
  • Northwestern University
  • University of Florida
  • University of Illinois
  • University of Iowa
  • University of North Carolina – Chapel Hill
  • University of Utah
  • University of Pittsburgh

Constitutional Convention

In October 2011, HathiTrust partners convened a Constitutional Convention to determine directions for the partnership following its first 5-year period, which will conclude at the end of 2012. Prior to the Convention, HathiTrust’s Strategic Advisory Board released a review of the partnership’s activities and progress over its first 3 years to set the stage for ballot initiatives and partner discussion. Seven ballot initiatives were considered by partners at the Convention. Five of these were accepted:

  • To establish a distributed archive of print monograph volumes from partner institutions,
  • To establish an approval process for HathiTrust development initiatives,
  • To establish a new governance structure (to be in place by mid-April, 2012 – see HathiTrust Governance for more information),
  • To initiate coordinated action to expand and enhance access to U.S. federal government documents, and
  • To establish a fee-for-service model of content deposit from non-partner entities.

Information about the Constitutional Convention, including notes from the convention, ballot initiatives, attendees, and the 3-year review, is available on the Constitutional Convention information page. John Wilkin’s opening remarks and the presentation given by representatives of the Strategic Advisory Board are available on the Papers and Presentations page. John Wilkin’s remarks were also posted on the HathiTrust blog.

Fulfillment of Functional Objectives

With the ingest of image content from Minnesota, the establishment of a HathiTrust Research Center, progress to enable HathiTrust as a platform for digital publishing, certification by the Center for Research Libraries for compliance with TRAC, and the establishment of infrastructure to offer access to in-copyright works for users who have print disabilities (see further information on these below), HathiTrust has provided a meaningful deliverable for each of the initial objectives set by the founding partners (see HathiTrust Functional Objectives).

Lawful Uses of In-copyright Materials

  • In 2011, Utah State University Press and Duke University Press agreed to open back file publications in HathiTrust in exchange for perpetual archiving of the deposited volumes.
  • HathiTrust added support for Creative Commons licenses in the repository, giving rights holders the ability to use these licenses to open access to materials. Support includes the inclusion of Creative Commons licensing information as RDFa in the PageTurner application. The Brooklyn Museum, Society of American Archivists, University of Texas at Austin and many others were early adopters.
  • HathiTrust released the first iteration of a database of print holdings information from partner institutions. The database will act as the basis for the new pricing model to be implemented in 2013, and expanded access to in-copyright materials for members of partner institutions. See the Update on July 2011 Activities for more information. Note: expanded access has not yet been released.
  • HathiTrust leveraged work in the University of Michigan’s IMLS-funded Copyright Review Management System to begin to identify orphan works in the repository. Michigan, the University of Wisconsin, Cornell, Duke, Johns Hopkins, Emory University, and the University of California announced plans to provide access to orphan works identified through this process on a limited basis to faculty, students, and staff at their institutions. Information about the terms of access is available on the HathiTrust website. See also the Orphan Works Project page on the University of Michigan Library website.
  • The Authors Guild and others filed a lawsuit against HathiTrust alleging copyright infringement. HathiTrust partners are convinced of the value and legality of preserving in-copyright materials and continue their work as the lawsuit proceeds. Further information about the lawsuit is available on the HathiTrust website.

Discovery

HathiTrust signed agreements with ProQuest, OCLC and EBSCO to make the HathiTrust full-text index searchable through their discovery services.

Datasets

HathiTrust began to make datasets of public domain materials available on a large scale. See HathiTrust Datasets for more information.

Repository

  • HathiTrust was certified by the Center for Research Libraries as a Trustworthy Digital Repository in March 2011.
  • University of Michigan staff, with feedback from the UX Advisory group and significant contributions from California Digital Library, made a number of enhancements to the Collection Builder, PageTurner, and Full-text search applications. These included:
    • Collection Builder
      • The application was re-architected to leverage the full-text search index, allowing users to create collections of any size.
      • Interface enhancements were made to improve browsing and discovery of collections, including the abilities to search collections by title and description, and filter collections by their featured status, last time of update, number of items, and whether or not they belong to the current authenticated user.
    • PageTurner
      • Views were added to allow users to scroll through volumes, flip pages similar to a physical book, and view thumbnail images of all pages in a volume. The views were accomplished through backend enhancements and the integration of the Internet Archive’s open source BookReader in the PageTurner. Initial work on BookReader integration, including the development of thumbnail views was completed by staff at California Digital Library.
      • The interface was reorganized and streamlined to improve use. New features include more prominent display of copyright status, re-positioning of navigation features and volume information, and the ability to expand the viewing area to full-screen.
      • Quick-copy links to volume pages and permanent volume URLs were added.
      • A progress bar was added to improve the user experience for full-volume PDF downloads.
    • Full-text Search
      • The full-text search index was enhanced to include bibliographic metadata, allowing for improved relevance ranking and faceted display of full-text search results. The engine used to search inside a book was upgraded from XPAT to Solr, improving results display for multiword searches within a book. Michigan staff developed a prototype for advanced full-text search and performed a preliminary user interaction/usability walkthrough. California Digital Library completed substantial work toward the implementation of a full-text search spelling suggestion feature.
  • HathiTrust began posting weekly reports on the ingest of partner volumes.
  • Michigan staff completed the first cycle of storage replacement for HathiTrust, on storage purchased in 2007, as well as the first replacement of HathiTrust database and ingest servers (see HathiTrust Technology for more information on storage and replacement).
  • University of Michigan staff drafted new security specifications for the Data API which will allow additional access to members of partner institutions for specific purposes.
  • Staff at Michigan implemented new procedures to perform periodic, generalized auditing of the repository, including checksum validation of deposited volumes and other types of analysis, such as investigation into usage of preservation or other associated metadata.
  • Michigan implemented improvements to repository throttling mechanisms to minimize interruptions to normal use while ensuring compliance with third-party restrictions on bulk download of materials.
  • Michigan staff designed and developed a mobile-friendly interface to the catalog search and PageTurner (read the blog post).
  • Staff at Michigan installed new infrastructure in the HathiTrust Development Environment to support performance requirements of the new print holdings database. Michigan also implemented services that allow code administrators to see differences between the last deployed versions of repository code when staging new code for development, and improve the process by which developers stage new code for testing. Partners interested in exploring the development environment should contact feedback@issues.hathitrust.org.
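
The checksum validation mentioned in the auditing item above can be illustrated with a minimal sketch. It assumes a manifest that maps file paths to expected digests. The manifest layout, the choice of SHA-256, and the function names are assumptions made for illustration and do not describe the audit tooling actually used at Michigan.

```python
import hashlib
from pathlib import Path


def file_digest(path, algorithm="sha256", chunk_size=1 << 20):
    """Return the hex digest of a file, read in chunks to bound memory use."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def audit(root, manifest):
    """Compare stored files against expected digests.

    `manifest` is a hypothetical dict of {relative_path: expected_hex_digest}.
    Returns a dict of {relative_path: "ok" | "mismatch" | "missing"}.
    """
    results = {}
    for relative_path, expected in manifest.items():
        target = Path(root) / relative_path
        if not target.is_file():
            results[relative_path] = "missing"
        elif file_digest(target) == expected:
            results[relative_path] = "ok"
        else:
            results[relative_path] = "mismatch"
    return results
```

Reporting "missing" separately from "mismatch" keeps a lost file distinguishable from a corrupted one, which usually call for different follow-up.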

Governance, Working Groups, and Committees

  • Strategic Advisory Board
    • The Strategic Advisory Board welcomed two new members from the University of California in May 2011: Todd Grappone, Associate University Librarian for Digital Initiatives and Information Technology at UCLA, and Julia Kochi, Director, Digital Libraries and Collections at UC San Francisco. Todd and Julia took the place of Bernie Hurley, UC Berkeley and Bruce Miller, UC Merced.
  • Collections
    • The Collections Committee submitted a proposal to establish a distributed print monographs archive to the Constitutional Convention. This proposal was the first submitted and served as a model for subsequent proposals. It was accepted by the partners and is a foundational piece in HathiTrust’s strategy to coordinate shared storage strategies among the partnership. The Collections Committee also completed a report on treatment of duplicates in HathiTrust that is currently being reviewed by the Strategic Advisory Board, and began to consider processes for handling user requests to contribute volumes to the repository.
    • Tom Teper of the University of Illinois joined the group in 2011, stepping in for Kim Armstrong of the Committee on Institutional Cooperation.
  • Communications
    • The Communications Working Group organized a webinar, given several times in the spring, targeted towards members of the large number of new partner institutions that joined HathiTrust at the end of 2010. The webinar reviewed basic elements of the partnership and discussed current activities and future work. The group also organized a webinar for non-partner institutions interested in learning more about HathiTrust. Other working group activities included:
    • Three new members joined the group in 2011: Robin Bedenbaugh from Texas A&M University, Oya Rieger from Cornell University, and Stacy Kowalczyk from Indiana University, joining as a representative from the HathiTrust Research Center.
    • The group begins 2012 continuing work on a public services-oriented communications package, highlighting ways HathiTrust can be used to address a variety of research and reference inquiries.
  • Discovery Interface Working Group
    • OCLC released a prototype of the HathiTrust-OCLC catalog in beta in January. The catalog was the result of nearly two years of collaborative work between OCLC and HathiTrust, coordinated by the Discovery Interface Working Group. The effort included the loading of all HathiTrust records into WorldCat. After the catalog was released, the DIWG moved its focus to usability testing of the new prototype system and defining areas for subsequent improvement.
    • The DIWG launched a Full-text search subgroup to develop a prioritized list of features to implement in the full-text search application (see below). 
    • The DIWG fulfilled the work in its charge and was disbanded in June 2011.
  • User Experience
    • The User Experience Advisory group provided feedback on the interface improvements mentioned above for the HathiTrust PageTurner, Collection Builder, and Full-text search applications.
    • The group also released a set of HathiTrust User Personas to help staff working on HathiTrust learn more about HathiTrust users, discover how to better meet their needs, and identify areas in which to do more in-depth research.
    • One new member, Darcy Duke from MIT, joined the advisory group in 2011. Darcy had been an active contributor to the UX discussion list, which was launched by the group in 2011.
  • User Support
    • In March, the Executive Committee launched a new User Support Working Group to respond to feedback submitted through HathiTrust’s help email addresses and user interfaces. Staff at Michigan who had been managing the process previously helped to establish a new partner-wide ticketing system. An 8-member group began an on-call rotation to address user issues beginning in April. The group has posted statistics on inquiries in the monthly newsletter since that time. As of the end of December, 3 members (Nancy Spiegel and Todd Ito of the University of Chicago and Bob Kackley of the University of Maryland) have had to leave the group. The group was pleased to welcome Kathryn Stine from the California Digital Library in November and is open to nominations from partner institutions. Please contact Jeremy York (jjyork@umich.edu) for information.
  • Full-text search
    • The full-text search subgroup of the Discovery Interface Working Group researched and estimated technical feasibility and implementation effort for potential new full-text search features. The group’s report, which includes a prioritized list of full-text search enhancements, is available online. 3 of the top enhancements have been implemented (1a, 2, 3b) and 3 are currently under development (5a, 7, 1b).

Special Initiatives

  • HathiTrust Research Center (HTRC)
    • The HTRC initiative was formally launched by Indiana University and the University of Illinois in July 2011. The research center will offer computational access for nonprofit and educational users to public domain and, in the future, in-copyright works in HathiTrust.
    • The Research Center received a 3-year, $600,000 grant from the Sloan Foundation in July to investigate "non-consumptive" research on the full-text of materials in the HathiTrust corpus. Particularly in relation to in-copyright works, "non-consumptive" research is research that allows computation of results about a body of works but not significant reading or "consumption" of the works.
    • The HTRC technical team worked throughout 2011 to establish core cyberinfrastructure and data analysis tools for the Research Center, and develop access policies. A full demonstration of the HTRC is scheduled to be available in July 2012.
    • Michigan staff developed an initial model for transferring data from HathiTrust to the HTRC (using rsync) and Indiana University staff began performing tests with sample data. Transfer of public domain texts to the HTRC will occur in 2012.
    • More information about the HTRC is available on the HathiTrust website.
  • Government Documents and Copyright
    • Maliaca Oxnam of the University of Arizona initiated research in collaboration with HathiTrust to improve access to U.S. Federal government documents. Details on her work are available in the Update on October 2011 Activities.
  • IMLS Quality Grant
    • HathiTrust began its participation as a testbed for research in a 2-year IMLS-funded project to investigate quality in large-scale digital repositories. In its first year the project established a preliminary review interface and methodology for reviewing the quality of digital volumes, and procedures for reviewing the quality of corresponding physical volumes in order to correlate results. The project team completed digital review of two samples of 1,000 volumes drawn from HathiTrust. These were the first of several samples to be taken throughout the project to investigate quality within different date ranges and languages and from different digitization sources. The team is performing an additional physical review of volumes in the first sample to investigate relationships between the physical condition of volumes and errors observed in their digital surrogates. This review is largely complete. The project is working toward a distributed system that will allow partners to review and certify the quality of volumes in HathiTrust. This certification will inform and facilitate a number of partner activities, including handling of duplicate volumes in the repository and partners’ local and collaborative strategies for managing their print collections.
  • University of California Print on Demand
    • The University of California began offering reprints of UC-digitized public domain materials via HathiTrust.
  • Minnesota Digital Library Image Preservation Prototype Project
    • Nearly 60,000 images and associated metadata from the University of Minnesota and its statewide partners, the Minnesota Digital Library and Minnesota Historical Society, were ingested into HathiTrust in a project to develop a prototype workflow for depositing images and associated metadata into HathiTrust for access and preservation. More about the project can be found on the HathiTrust project page.
  • Bibliographic data management
    • California Digital Library made significant progress toward the establishment of a new bibliographic metadata management system for HathiTrust. CDL staff completed development of the core system, Zephir, and began to load bibliographic metadata for volumes transferred from the metadata management system at the University of Michigan. CDL worked closely with Michigan to understand existing processes for transforming and managing records, and requirements for generating various outputs, including metadata for the HathiTrust catalog, OAI feed, and tab-delimited inventory files. CDL and Michigan have established ongoing processes for syncing of bibliographic data from Michigan to CDL, and will begin to prepare for integration testing and deployment in 2012.
  • HTPub
    • HTPub, an effort of the University of Michigan Library’s MPublishing Division to develop mechanisms for ingest, display, and discovery of born-digital materials in HathiTrust, moved from initial planning to design and development stages in 2011. Team members at Michigan hired two new staff members to support the initiative, defined goals, requirements, and design principles for the project, began to design system architecture and determine archival package specifications for deposited content, and began initial development of content transformation tools and mechanisms for displaying content. Information about the project is available on the HTPub project page.

Update on December 2011 Activities

January 13, 2012

Late Breaking News


HathiTrust Passes 10 Million Volumes

View statistics and a timeline on the HathiTrust blog.

Top News

Changes to Tab-delimited Files

On February 1, HathiTrust will be adding three additional columns to the tab-delimited inventory files (“hathifiles”) available at http://www.hathitrust.org/hathifiles. The files are frequently used by partners and non-partners as a means to obtain full bibliographic records for HathiTrust items to load into local catalogs (see HathiTrust Data Availability and APIs). The additional columns will identify the publication date and publication location of volumes in HathiTrust, as well as volumes that have been identified as U.S. federal government documents.
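
For partners who load these files into local systems, a minimal sketch of reading a tab-delimited inventory file and filtering on the new government-document column might look like the following. The field names, their order, the absence of a header row, and the "1" flag value are all assumptions made for illustration; the actual layout should be taken from the file description published with the hathifiles.

```python
import csv
import io


# Hypothetical column names and order; replace with the published layout.
FIELDS = ["volume_id", "title", "pub_date", "pub_place", "us_gov_doc"]


def read_inventory(stream, fields=FIELDS):
    """Yield one dict per tab-delimited row, keyed by the supplied field names."""
    reader = csv.reader(stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        # Pad short rows so files without the newer columns still parse.
        padded = row + [""] * (len(fields) - len(row))
        yield dict(zip(fields, padded))


def government_documents(records):
    """Keep records flagged as U.S. federal government documents ("1" assumed)."""
    return [r for r in records if r["us_gov_doc"] == "1"]
```

Padding short rows lets the same reader handle files produced before and after the additional columns appear, provided the new columns are appended at the end, which is itself an assumption here.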

Ingest

Works Digitized Locally and by Internet Archive

Staff at Michigan continued conversations with staff at the University of Florida regarding ingest of locally-digitized materials, and staff at several other institutions regarding ingest of Internet Archive-digitized materials.

Working Groups and Committees

Working groups and committees in HathiTrust may have an operational or strategic focus. See http://www.hathitrust.org/working_groups for more information.

Operational

Communications Working Group

The Communications Working Group continued to work on a public services-oriented communications package, as well as announcements for new partners and the major milestone of 10 million volumes.

User Experience Advisory Group

The User Experience Advisory Group began reviewing the current home page and discussed additions and issues that will need to be addressed in a forthcoming redesign. Group member Jenny Emanuel contributed a "Perspectives from HathiTrust" blog post about the group's persona work that was completed in November. 

User Support Working Group

The User Support Working Group is still seeking nominations for new members. See the Update on November Activities for details.

The table below contains a summary of the issues received by the User Support Working Group in December.

Issue Type*                          November   December
Content                                   107         81
  Quality                                 102         71
  Non-partner Digital Deposit               0          2
  Collections                               5          6
Cataloging                                 43         30
Access and Use                            103        107
  Copyright                                55         59
  Permissions                              10          4
  Takedown                                  1          2
  Print on Demand                           2          1
  Inter-library loan                        0          2
  Full-PDF or e-copy requests              15         28
  Datasets                                  1          2
  Data Availability and APIs                2          0
  Reuse of content                          1          1
Web applications                           24         18
  Functionality problems                    5          9
  Problems with login specifically          1          1
  General Questions about login             2          0
  Partners setting up login                 3          1
  Usability issues                          2          2
  Feature requests                          5          1
Partner Ingest                              3          5
General                                    47         50
  Partnership                               6          7
  Infrastructure                            0          0
  Miscellaneous                            41         43

*See User Support Working Group Issue Types for a description of the types of issues included in each category.

Projects

Bibliographic Data Management System

Team members from California Digital Library continued work on processes to compare bibliographic records in Zephir, the new metadata management system under development, with records in HathiTrust’s existing system. Zephir team members continued to load and test new records as well, and refine the timeline for migration of bibliographic metadata management services to Zephir in coordination with staff at the University of Michigan.

HathiTrust Publishing (HTPub)

Staff at the University of Michigan revised the goal statement for HTPub (see the project web page) and plans for system architecture. Staff also began work on establishing a project timeline.

HathiTrust Research Center

Several changes were made to the HTRC leadership in December. John Unsworth, a key member of the Team at Illinois, accepted a position as vice provost for Library and Technology Services and chief information officer at Brandeis University. He will be leaving the University of Illinois but remain on the Executive Management Team. The Team will keep its base composition of 2 members from the University of Illinois and 2 from Indiana University, so this change will add one new member. Stephen Downie, Associate Dean for Research at the University of Illinois Graduate School of Library and Information Science, will fill the position left by John. Stephen’s research has focused on music information retrieval and data mining. This work has involved building significant infrastructure for research, including grappling with issues of allowing computational access to in-copyright material. Finally, Marshall Scott Poole is stepping aside as co-director of the HTRC for personal reasons, though he will remain on the Executive Management Team. Stephen Downie will take his place as co-director of the HTRC with Beth Plale, who is co-director on the Indiana University side. Beth also chairs the Executive Management Team. The changes are in effect as of January 1, 2012.

IMLS Quality Grant

In December, project staff completed physical review of more than 90% of the volumes in the first 1,000 volume sample drawn from HathiTrust. Staff are working to arrange on-site review with cooperation from HathiTrust member libraries for the approximately 70 volumes that are not available via inter-library loan due to poor condition, non-circulating collection, or other reason.

Project staff concluded page-level data collection for the second production sample in December (see the Update on September 2011 Activities for details on the composition of the sample). The full dataset will be sent to the project statistician in early January for analysis. Data collection for the third production run began in late December. The third production run focuses on Internet Archive-digitized volumes published pre-1923.

Project staff continue to define requirements for a new quality review interface, targeted specifically for review of volume-level errors such as missing, duplicate, and out-of-order pages. Please visit the project website for updates.

Development Updates

Full-text Search

Michigan staff released a new version of the full-text search index in December. The new release corrected an error in the “Original Location” metadata facet and provided additional metadata for advanced search and relevance ranking. It also made it possible for full-text search results and facets to reflect whether or not users from partner institutions are able to view in-copyright items when lawful access is permitted (HathiTrust is currently pursuing providing access to in-copyright works to users who have print disabilities, for preservation uses, and in circumstances where works are copyright-orphaned). Access in these circumstances, which are still pending deployment to partners, is dependent on partner institutions owning or previously owning print copies of works in question and users’ location inside or outside the United States.

Michigan staff continued development on an advanced search feature for full-text search, including preliminary testing of the first working prototype in HathiTrust’s development environment.

California Digital Library continued work on a spelling suggestion feature for full-text search queries. A CDL developer established an account in the HathiTrust development environment and used a sample index of public domain materials to test strategies for automatically building a bigram dictionary of words with different spellings users might enter.
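
The update does not describe CDL's approach in detail. As a general illustration only, one common way to use bigrams for spelling suggestions is to index each vocabulary word by its character bigrams and rank candidates by how many bigrams they share with the query. The sketch below shows that technique; the function names, the boundary markers, and the Jaccard scoring are assumptions, not a description of the feature under development.

```python
from collections import defaultdict


def char_bigrams(word):
    """Return the set of character bigrams, with boundary markers added."""
    padded = f"^{word.lower()}$"
    return {padded[i:i + 2] for i in range(len(padded) - 1)}


def build_index(vocabulary):
    """Map each bigram to the set of vocabulary words containing it."""
    index = defaultdict(set)
    for word in vocabulary:
        for gram in char_bigrams(word):
            index[gram].add(word)
    return index


def suggest(query, index, limit=3):
    """Rank indexed words by Jaccard similarity of bigram sets to the query."""
    query_grams = char_bigrams(query)
    candidates = set()
    for gram in query_grams:
        candidates |= index.get(gram, set())
    scored = []
    for word in candidates:
        grams = char_bigrams(word)
        score = len(query_grams & grams) / len(query_grams | grams)
        scored.append((score, word))
    # Highest similarity first; ties broken alphabetically for stable output.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [word for _, word in scored[:limit]]
```

With an index built from the words "library", "liberty", "literary", and "archive", the misspelled query "libary" ranks "library" first, because the two share the most bigrams relative to their combined set.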

Tom Burton-West's proposed talk, "HathiTrust Large Scale Search: Scalability meets Usability", was accepted by popular vote for the 2012 Code4Lib Conference in Seattle, WA.

Throttling

Staff at Michigan released a new throttling mechanism for HathiTrust, which allows throttling levels to be set at more granular levels. Users are now less likely to be throttled in the course of normal use as the new throttling policies are applied to specific scenarios such as viewing thumbnail or page images, or downloading PDFs, as opposed to all use generally. Throttling ensures compliance with third-party restrictions on bulk download of materials, and helps to ensure a consistent and reliable experience for all users.
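
The idea of applying separate limits to separate kinds of use can be sketched with a token bucket kept per client and per action. This is an illustration under stated assumptions, not the mechanism deployed at Michigan: the class name, the action names, and the capacities and refill rates are all hypothetical.

```python
import time


class ActionThrottle:
    """Token-bucket limiter keyed by (client, action).

    `limits` maps an action name to (capacity, refill_per_second). The
    action names and numbers passed in are illustrative assumptions.
    """

    def __init__(self, limits, clock=time.monotonic):
        self.limits = limits
        self.clock = clock
        self.buckets = {}  # (client, action) -> (tokens, last_seen)

    def allow(self, client, action):
        """Return True and consume a token if the request is within limits."""
        capacity, rate = self.limits[action]
        now = self.clock()
        tokens, last = self.buckets.get((client, action), (capacity, now))
        # Refill in proportion to elapsed time, never beyond capacity.
        tokens = min(capacity, tokens + (now - last) * rate)
        allowed = tokens >= 1
        if allowed:
            tokens -= 1
        self.buckets[(client, action)] = (tokens, now)
        return allowed
```

Because each action has its own bucket, a reader paging quickly through images draws only on a page-image allowance, while a stricter PDF-download allowance constrains bulk retrieval without affecting ordinary reading.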

PageTurner

In connection with HTPub, Michigan staff continued work to adapt the HathiTrust PageTurner to display XML content.

Security Risk Assessment and Vulnerability Test

Michigan Library staff continue to work with central IT security analysts to complete the Risk Assessment that was started in November, and have received the final report of the vulnerability penetration test. The report revealed no vulnerabilities that enabled direct or indirect access to the repository, but noted software issues such as a cross-site scripting vulnerability and made recommendations for increased firewalling at the Michigan site. All software issues noted in the report were addressed in December. A broader firewalling project for the data center where the Michigan instance is hosted is already in progress but not yet complete, so some provisional steps were taken to tighten security while that effort continues.

Outages

HathiTrust services were inaccessible or diminished for several periods in December due to problems related to the release of the new throttling system (all times EST): on Tue, Dec 13 4:25-4:30pm, Wed, Dec 14 11:10am-12:00pm, and Wed, Dec 21 7:30-10:30am, all page viewing was affected, and on Tue, Dec 13 3:45-5:00pm, full-book PDF download was affected. Additionally, page viewing of volumes classified as "Public Domain in the United States" in HathiTrust was intermittently unavailable on Wed, Dec 21 from approximately 1:00-4:30pm EST due to an apparent outage with an externally-hosted proxy detection system.

HathiTrust sends notice upon discovery and resolution of unscheduled outages and in advance of scheduled outages and maintenance work that may result in an outage. We welcome and encourage additional recipients for these notices. If your institution is not receiving outage notifications and would like to, please contact feedback@issues.hathitrust.org.

Papers & Presentations


All HathiTrust papers, presentations, and reports are available at http://www.hathitrust.org/papers.

New Growth


As of November 1:

  December Total
Columbia University 4 64,176
Cornell University 9,871 383,690
Duke University 21 4,522
Harvard University 434 53,440
Indiana University 324 186,912
Library of Congress 15,769 89,411
North Carolina State University 0 3,196
University of North Carolina - Chapel Hill 0 8,087
Northwestern University 237 5,649
New York Public Library 76 259,453
Penn State University 1,821 42,917
Princeton University 350 249,679
Purdue University 0 887
University of California 114,906 3,287,654
The University of Chicago 1,730 10,608
University of Illinois 0 14,503
Universidad Complutense 28 108,668
University of Michigan 22,907 4,504,601
University of Minnesota 916 90,239
University of Wisconsin 15,902 527,334
University of Virginia 12 47,396
Utah State 0 46
Yale University 0 23,674
Total 185,311 9,966,572

Public Domain (~27%)

Total* 50,434 2,712,626
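
The rounded percentage in the heading can be checked against the totals reported in the table. A minimal sketch, using only the December figures given above (the variable names are illustrative):

```python
# Figures copied from the December growth table.
total_volumes = 9_966_572          # "Total" row, Total column
public_domain_volumes = 2_712_626  # public domain "Total*" row, Total column

# Share of the repository that is in the public domain, as a percentage.
share_percent = round(public_domain_volumes / total_volumes * 100, 1)
print(f"Public domain share: {share_percent}%")
```

The result is consistent with the "~27%" shown in the heading.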

January Forecast


  • Continue work on the advanced search feature for full-text search

You can follow HathiTrust on Twitter http://www.twitter.com/hathitrust

Update on November 2011 Activities

December 9, 2011

Late Breaking News


Boston College Joins HathiTrust

HathiTrust is pleased to welcome Boston College as its newest member. The full announcement is available on the Boston College website.

Research Center on HathiTrust.org

A new “Our Research Center” portion of HathiTrust.org was launched in early December, containing information about the governance, timeline and deliverables, architecture, and access and use policies for the HathiTrust Research Center (HTRC), which is jointly led by Indiana University and the University of Illinois. “Our Research Center” also includes information about research partnerships and a demonstration tool that allows users to create tag clouds and perform limited analysis on a small number of works. The HTRC welcomed two new members from the University of Illinois library to the HTRC technical team in November: Kirk Hess and Harriett Green. Kirk and Harriett bring experience in user interfaces and services, areas that complement the technical strengths of Indiana staff currently working on the HTRC.

Is that the library in your pocket?

A new Perspectives on HathiTrust blog post authored by Suzanne Chapman, chair of the User Experience Advisory Group, was released in early December, highlighting HathiTrust’s new mobile interface.

Top News

Expansion of “Buy a Copy”

The “Buy a copy” option has now been expanded to include over 30,000 public domain volumes from the University of California. UC will incrementally add new volumes to the service. UC has partnered with Hewlett-Packard to create the reprints and make them available for purchase via Amazon.com.

User Support: New member and call for nominations

The User Support Working Group is very pleased to welcome a new member, Kathryn Stine from the California Digital Library. The working group is seeking nominations from partner institutions for up to 4 additional positions. Nominations should be sent to Jeremy York (jjyork@umich.edu) and include the nominee's name, title, and a short description of current job duties. Additional information that might be relevant to participation in the group may be included as well. User Support members are on call at least one day per week and follow up on inquiries throughout the week, requiring between 2 and 4 hours of work. Staff who participate in the group will:

  • Gain knowledge about HathiTrust’s user base, typical problems and questions that are raised and how they are resolved.
  • Become aware of new ways HathiTrust is being used, and features and functionality that users desire.
  • Gain knowledge of HathiTrust organizational and technical infrastructure, and policies and procedures relating to copyright, access, collection development, deposit of materials, and preservation.

The charge for the working group is available at http://www.hathitrust.org/wg_user-support_charge.

Volunteer Lucene Developer

HathiTrust is seeking a volunteer Lucene developer (from partner institutions or not) to work directly through the Lucene contribution process to improve indexing capabilities for Chinese-, Japanese-, and Korean-language (CJK) materials; more specifically, to add overlapping bigram functionality for CJK languages to the Lucene ICUTokenizer (view the Lucene JIRA ticket for this issue). A new HathiTrust large-scale search blog post on word segmentation for CJK languages provides additional context. Please contact Tom Burton-West for more information.
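
To illustrate what "overlapping bigrams" means for CJK text, here is a minimal Python sketch. It is not the Lucene ICUTokenizer and does not reflect how Lucene implements the feature; the function names and the simplified set of Unicode ranges are assumptions made for illustration only.

```python
def is_cjk(ch: str) -> bool:
    """Rough check against a few common CJK ranges (an illustrative simplification)."""
    cp = ord(ch)
    return (
        0x4E00 <= cp <= 0x9FFF      # CJK Unified Ideographs
        or 0x3040 <= cp <= 0x30FF   # Hiragana and Katakana
        or 0xAC00 <= cp <= 0xD7AF   # Hangul syllables
    )


def overlapping_bigrams(text: str) -> list[str]:
    """Emit overlapping two-character tokens for each run of CJK characters.

    A run of one character is emitted as a single-character token.
    Non-CJK characters only act as separators in this sketch.
    """
    tokens: list[str] = []
    run: list[str] = []

    def flush() -> None:
        if len(run) == 1:
            tokens.append(run[0])
        else:
            tokens.extend(run[i] + run[i + 1] for i in range(len(run) - 1))
        run.clear()

    for ch in text:
        if is_cjk(ch):
            run.append(ch)
        elif run:
            flush()
    if run:
        flush()
    return tokens


print(overlapping_bigrams("東京大学"))
```

Because each character appears in two adjacent tokens, a query for a two-character word can match regardless of where word boundaries fall, which is the motivation for bigram indexing of languages written without spaces.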

Changes to Tab-delimited files

On February 1, HathiTrust will be adding additional columns to the tab-delimited inventory files (“hathifiles”). A final description of the changes will be posted in the update on December activities. Proposed additions include the publication date and publication location of volumes, as well as an indication of whether volumes have been identified as U.S. federal government documents.
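
Consumers of the tab-delimited files may want to tolerate rows with either the old or the new number of columns during the transition. The following Python sketch shows one way to do that. The column names and order, and the sample rows, are hypothetical placeholders; the final layout is whatever the forthcoming description specifies.

```python
import csv
import io

# Hypothetical column layout: the last three names stand in for the proposed
# additions (publication date, publication location, U.S. federal document flag).
FIELDS = ["volume_id", "title", "pub_date", "pub_place", "us_gov_doc"]


def read_inventory(handle):
    """Read tab-delimited rows into dicts, padding rows that lack the new columns."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    rows = []
    for raw in reader:
        padded = raw + [""] * (len(FIELDS) - len(raw))  # older rows get empty values
        rows.append(dict(zip(FIELDS, padded)))
    return rows


# Made-up sample data: one row in the new layout, one in the old layout.
sample = io.StringIO("vol.1\tA Title\t1895\tBoston\t1\nvol.2\tAnother Title\n")
records = read_inventory(sample)
print(records[0]["pub_place"], repr(records[1]["pub_date"]))
```

Padding short rows means scripts written against the new layout keep working on files produced before February 1.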

Updated Permissions Agreement

University of Michigan staff have updated the permissions agreement by which rights holders can open access to their works in HathiTrust. The agreement, now also offered as a fillable PDF, is available at http://www.hathitrust.org/permissions_agreement, with instructions on completion and submission.

Ingest

Volume Projections for 2012

HathiTrust sent a call to partners in November for projections of volumes to be deposited in 2012. The projections will be used to estimate storage needs and fees for partners in the coming year. A variety of locally-digitized collections were identified for deposit, in addition to volumes digitized through Internet Archive and Google. More information on these and continuing work on ingest will be included in coming months.

Local Digitization

HathiTrust has ingested nearly all of approximately 200 rare manuscripts and incunabula from the Universidad Complutense de Madrid. Issues with some of the submitted volumes that prevented ingest will be investigated further by Michigan staff.

Working Groups and Committees

Working groups and committees in HathiTrust may have an operational or strategic focus. See http://www.hathitrust.org/working_groups for more information.

Operational

Communications Working Group

The Communications Working Group continued to make progress on a public services-oriented communications package, highlighting ways HathiTrust can be used to address a variety of research and reference inquiries.

User Experience Advisory Group

The User Experience Advisory Group finalized the user personas it began to develop over the summer. The personas and an accompanying overview of the project are available at http://www.hathitrust.org/personas. The purpose of the personas is to help HathiTrust staff and partners (developers, policy makers, user experience designers and researchers, reference and instruction librarians, etc.) envision different types of HathiTrust users in a more concrete way in order to inform our work. The group welcomes questions or comments about the personas, which can be sent to Suzanne Chapman (suzchap@umich.edu), chair of the UX Advisory Group.

User Support Working Group

The table below contains a summary of the issues received by the User Support Working Group in November.

Issue Type October Issues November Issues
Content 154 107
  Quality 142 102
  Non-partner Digital Deposit 0 0
  Collections 1 5
Cataloging 44 43
Access and Use 136 103
  Copyright 75 55
  Permissions 4 10
  Takedown 4 1
  Print on Demand 2 2
  Inter-library loan 0 0
  Full-PDF or e-copy requests 23 15
  Datasets 1 1
  Data Availability and APIs 2 2
  Reuse of content 2 1
Web applications 29 24
  Functionality problems 6 5
  Problems with login specifically 3 1
  General Questions about login 4 2
  Partners setting up login 1 3
  Usability issues 2 2
  Feature requests 5 5
Partner Ingest 1 3
General 59 47
  Partnership 8 6
  Infrastructure 0 0
  Miscellaneous 51 41

*See User Support Working Group Issue Types for a description of the types of issues included in each category.

Strategic

Collections Committee

The Collections Committee made several revisions to its draft recommendations for handling duplicates in HathiTrust and has submitted these to the Strategic Advisory Board (SAB) for approval. Once the revisions have been approved by the SAB and endorsed by the Executive Committee, the full document will be posted on the HathiTrust website. The committee is currently discussing mechanisms for responding to user-submitted requests and offers and is formulating plans to address other items on its work agenda. Suggestions for additional work items are always welcome and can be sent to the committee chair, Ivy Anderson.

Projects

Bibliographic Data Management System

The California Digital Library team worked to develop an infrastructure to compare bibliographic records in Zephir, its core metadata management system, with records in HathiTrust. Some of the challenges are determining when a record has been updated in HathiTrust, and managing multiple (non-HathiTrust) identifiers for volumes. The Zephir team loaded and tested records, and refined the timeline for migrating records into Zephir as they work with staff at the University of Michigan. Further information about the project can be found at http://www.hathitrust.org/htmms.

HathiTrust Publishing (HTPub)

Staff in MPublishing began work in November on a tool to convert DOCX files to JATS XML and worked with broader stakeholders at the University of Michigan Library to specify additional design requirements and agree on a set of design principles for HTPub (available on the HTPub project page). MPublishing staff also reviewed notes from a session at THATCamp Publishing 2011 dedicated to shared infrastructure for publishing, to consider how such an infrastructure might affect the architecture of HTPub tools, and services that might be offered in the future using those tools.

IMLS Quality Grant

Physical review of volumes in the first 1,000-volume sample continued in November, with volumes requested through inter-library loan continuing to arrive at Michigan. Plans are being made for staff at partner libraries to conduct physical review of volumes in cases where the volumes are not available for inter-library loan. The previous update contained an error regarding the timing of results analysis for the quality review performed on the first sample of digital volumes; the analysis will be available at a later time. Please visit the project website for updates.

Quality review on the second sample of 1,000 volumes from HathiTrust was completed in mid-November. Measures to evaluate inter-coder consistency required re-review of some volumes in the sample, as well as individual pages within specific volumes. This review began in late November and should be complete in the first week of December. As review of the second sample of volumes was completed, project staff prepared to begin review of a third sample of 1,000 volumes, which will include pre-1923 English-language monographs digitized by the Internet Archive.

Project staff continued to define requirements for a new quality review interface, targeted specifically at review of volume-level errors such as missing, duplicate, and out-of-order pages. The project developer began coding basic elements of the system. Combining this new interface and its procedures with those of the first interface, which was designed to review page-level errors, will lead to a system for comprehensive review that will enable certification of volumes at different quality levels. The project team is in the process of drafting specifications for certifying volumes. The final model will be based on the findings from statistical sampling and manual review at the page and volume levels.

Orphan Works

Work continued on the Orphan Works Project pilot phase, which will continue through the end of December. Reviewers from the University of Michigan and the University of California, Los Angeles have now researched the same set of approximately 50 works. Staff from both institutions are looking at the results and reviewing the process for accuracy. The pilot phase of the OWP is intended to serve as a test for an orphan works identification process, through which we will document examples and further define parameters for research.

Development Updates

Full-text Search

Staff at Michigan began a re-indexing process in November for all 9.8 million volumes in HathiTrust. The purpose was to correct an error in the “Original Location” metadata facet, and to provide additional metadata for advanced search, relevance ranking, and determining the viewability status of volumes (see below). The re-indexing was 98% complete at the end of November and is anticipated to go into production in early December. This re-indexing, and the discovery of a bug in the way Solr processes Boolean queries, slowed development of the advanced search feature that was planned for release in November. A workaround will be used until the bug is fixed. The advanced search feature is now planned for release in January.

As the indexing enhancements were put in place, Michigan staff completed the coding necessary for full-text search results to reflect whether or not a user is able to view items in situations that depend on institutional print holdings and other factors. This will apply to search results that include orphan works (when available), volumes that may be available under Section 108 of U.S. copyright law, and volumes that are accessible to users at partner institutions who have print disabilities. In order to see the availability of these volumes, and access them, users from partner institutions will need to be logged in using their institutional account.

Michigan developers continue to work with staff at the California Digital Library on the development of a spelling suggestion feature. CDL is testing various algorithms on sample HathiTrust data including the Solr/Lucene Levenshtein Automaton and Martin Reynaert’s anagram hashing algorithm. The work is focusing both on the speed and scalability of the algorithms and on the accuracy of the suggestions. Experimental code to extract useful bigrams from existing HathiTrust indexes is in the works, which will obviate the need to maintain multiple indexes to support spelling correction, as is currently the case.
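
For readers unfamiliar with edit-distance-based spelling suggestion, the Python sketch below shows the basic idea in brute-force form. It is not the Solr/Lucene Levenshtein Automaton and not Reynaert's anagram hashing; both exist to avoid the term-by-term comparison done here, which would not scale to an index of HathiTrust's size. The vocabulary, frequencies, and function names are made up for illustration.

```python
def edit_distance(a: str, b: str) -> int:
    """Classic dynamic-programming Levenshtein distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = cur
    return prev[-1]


def suggest(query: str, vocabulary: dict[str, int], max_distance: int = 2, limit: int = 3) -> list[str]:
    """Rank index terms within max_distance edits: closest first, then most frequent."""
    scored = []
    for term, freq in vocabulary.items():
        d = edit_distance(query, term)
        if 0 < d <= max_distance:
            scored.append((d, -freq, term))
    scored.sort()
    return [term for _, _, term in scored[:limit]]


# Hypothetical term frequencies standing in for values drawn from an index.
vocab = {"library": 100, "binary": 50, "liberty": 20, "zebra": 5}
print(suggest("libary", vocab))
```

Ranking by distance and then by term frequency reflects the two concerns named above: accuracy of the suggestion and, indirectly, cost, since every extra candidate term is another distance computation.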

PageTurner

HathiTrust has implemented new policies regarding access to in-copyright works, where lawful access is permitted. Access for authorized users at partner institutions who have print disabilities is now only possible from IP addresses within the United States. Access is limited to one user per physical volume held by the user’s institution. Access to in-copyright works is also now recorded in HathiTrust system logs, in accordance with HathiTrust’s privacy policy: http://www.hathitrust.org/privacy.
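
The per-volume limit can be pictured as a small bookkeeping problem. The Python sketch below assumes the limit applies to simultaneous users and that print holdings counts are available per institution and volume; the class, method names, and identifiers are hypothetical and do not describe HathiTrust's actual implementation.

```python
class ConcurrentAccessLimiter:
    """Allow at most as many simultaneous readers as print copies held (illustrative)."""

    def __init__(self, holdings):
        # holdings maps (institution, volume_id) -> number of print copies held
        self.holdings = holdings
        self.active = {}  # (institution, volume_id) -> set of user ids currently reading

    def acquire(self, institution, volume_id, user):
        key = (institution, volume_id)
        users = self.active.setdefault(key, set())
        if user in users:
            return True   # this user already holds a slot
        if len(users) >= self.holdings.get(key, 0):
            return False  # no copies held, or all slots are in use
        users.add(user)
        return True

    def release(self, institution, volume_id, user):
        self.active.get((institution, volume_id), set()).discard(user)


limiter = ConcurrentAccessLimiter({("example-univ", "vol.1"): 1})
first = limiter.acquire("example-univ", "vol.1", "alice")
second = limiter.acquire("example-univ", "vol.1", "bob")
print(first, second)
```

An institution that holds no print copy of a volume has no entry in the holdings table, so every request for that volume is refused.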

In connection with the HTPub effort, Michigan staff continued work to adapt the HathiTrust PageTurner to display XML content based on initial specifications.

Throttling

Michigan staff tested and refined application-specific policies for throttling (e.g., in the PageTurner, Full-text search, and Collection Builder applications), and expect to enable the new policies in December.
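
This update does not describe the throttling mechanism itself. As general background, one common way to express application-specific limits is a token bucket per client and application, sketched below in Python. The policy numbers and names are invented for illustration and are not HathiTrust's settings.

```python
class TokenBucket:
    """Token bucket: up to `capacity` requests in a burst, refilled at `rate` per second."""

    def __init__(self, rate, capacity, now=0.0):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.stamp = now

    def allow(self, now):
        # Refill in proportion to elapsed time, never exceeding capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.stamp) * self.rate)
        self.stamp = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


# Hypothetical per-application policies: (requests per second, burst size).
POLICIES = {"pageturner": (1.0, 2), "fulltext_search": (0.5, 5)}

rate, capacity = POLICIES["pageturner"]
bucket = TokenBucket(rate, capacity)
decisions = [bucket.allow(0.0), bucket.allow(0.0), bucket.allow(0.0), bucket.allow(1.0)]
print(decisions)
```

Passing the current time in as an argument, rather than reading a clock inside the class, keeps the sketch deterministic and easy to test.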

New Web Servers

Michigan staff purchased and began installing two new replacement web servers for HathiTrust in November. These are the last of eight servers targeted for replacement this year (six others were replaced in July).

Storage Hardware Upgrade

The new storage brought online in June of this year was discovered by Isilon Systems, the storage provider, to have a subtle hardware issue requiring all drives and some internal components in eight nodes to be removed and re-installed in a new chassis. The upgrade was preventative in nature; the minor symptom caused by the hardware issue had not been observed by HathiTrust. The maintenance was covered under the existing support agreement, and carried out without any interruption to service by Isilon’s field service technicians under close supervision by Michigan staff.

Security Risk Assessment and Vulnerability Test

As part of a regular program for continuous improvement in IT security, Michigan Library staff have been working with analysts in University of Michigan central IT to conduct a thorough risk assessment and vulnerability penetration test of the HathiTrust infrastructure. The scope of the risk assessment, which follows a framework developed at the University, consists primarily of servers and storage hardware, but also includes coverage of aspects such as facilities, management practice and policy, and workflows involving sensitive data. The vulnerability test focuses on network security, and is a hands-on exercise conducted by a trained security expert who attempts to discover flaws in network security and evaluate their potential for exploit. Final reports on both analyses are due in December.

Outages

No outages were reported in November 2011.

HathiTrust sends notice upon discovery and resolution of unscheduled outages and in advance of scheduled outages and maintenance work that may result in an outage. We welcome and encourage additional recipients for these notices. If your institution is not receiving outage notifications and would like to, please contact feedback@issues.hathitrust.org.

Papers & Presentations


All HathiTrust papers, presentations, and reports are available at http://www.hathitrust.org/papers.

New Growth


As of November 1:

  November Total
Columbia University 123 64,172
Cornell University 5,833 374,089
Duke University 15 4,501
Harvard University 163 53,006
Indiana University 393 186,588
Library of Congress 0 73,642
North Carolina State University 0 3,194
University of North Carolina - Chapel Hill 0 8,087
Northwestern University 57 5,412
New York Public Library 212 259,377
Penn State University 281 41,096
Princeton University 413 249,329
Purdue University 886 887
University of California 27,759 3,172,748
The University of Chicago 822 8,875
University of Illinois 0 14,503
Universidad Complutense 296 108,640
University of Michigan 34,744 4,481,254
University of Minnesota 728 89,323
University of Wisconsin 6,190 511,432
University of Virginia 54 47,384
Utah State 0 46
Yale University 0 23,674
Total 78,971 9,781,261

Public Domain (~27%)

Total* 6,032 2,662,192

December Forecast


  • Complete orphan works project pilot phase
  • Enable updated throttling policies
  • Complete installation of two replacement servers

You can follow HathiTrust on Twitter http://www.twitter.com/hathitrust

Update on October 2011 Activities

November 11, 2011

Late Breaking News


Constitutional Convention

HathiTrust has released a blog post on the outcomes of the October Constitutional Convention. The post includes a link to the official notes from the two-day meeting.

Top News


Government Documents Copyright

Maliaca Oxnam, an Associate Librarian at the University of Arizona and current Chair of the Technical Report Archive and Image Library (http://www.technicalreports.org), has undertaken a sabbatical research project with the goal of improving access to government documents in HathiTrust. The three primary areas of her work are 1) investigating the accurate identification of government documents, 2) analyzing the copyright status of the documents and the reasons for their copyright determinations in HathiTrust, and 3) securing permissions from government agencies to make government publications viewable to the public at large. The sabbatical work will be completed by July 2012, and a report with recommendations for future actions will be presented to the HathiTrust Executive Committee. Questions or comments about the research can be sent to Maliaca Oxnam (oxnamm@u.library.arizona.edu).

The Orphan Works Project

The Orphan Works Project (OWP) is in a pilot phase that will continue through the end of December. Researchers from the University of Michigan and the University of California - Los Angeles are conducting a parallel review of approximately 680 volumes in HathiTrust that do not have readily identifiable publisher contacts. Michigan staff have made significant changes to the research process and project tools in order to improve the rigor and reliability of investigation following a reevaluation of the orphan works candidate identification process in October. An overview flowchart of the new procedure is available at http://www.lib.umich.edu/orphan-works/documentation. Michigan staff will add more extensive documentation in the coming months. The pilot phase of the OWP is intended to serve as a test for an orphan works identification process, through which we will document examples and further define parameters for research.

Ingest


Google Digitization

Ingest rates for Google-digitized volumes from all Google partner libraries were low in October due to problems with Google’s download mechanism. Rates are expected to pick up in November.

Internet Archive Digitization

HathiTrust began ingest of Internet Archive-digitized content from Duke University and the University of North Carolina in October, and worked with the University of Florida toward ingest of its IA-digitized volumes.

Local Digitization Ingest

Staff at the University of Michigan continued conversations with the University of Pittsburgh and University of Utah regarding bibliographic metadata for those institutions' contributed volumes. Staff at Michigan received the final set of rare manuscripts and incunabula from Universidad Complutense de Madrid and expect to finish ingest of the materials in November.

Working Groups and Committees


Working groups and committees in HathiTrust may have an operational or strategic focus. See http://www.hathitrust.org/working_groups for more information.

Operational

Communications

The Communications Working Group continued work to develop a public services-oriented communications package, highlighting ways HathiTrust can be used to address a variety of research and reference inquiries. The group also made progress on a FAQ for the HathiTrust Research Center, and worked with staff from Indiana University to prepare a presence for the Research Center on HathiTrust.org.

User Experience Advisory Group

The User Experience Advisory Group was pleased to welcome a new member, Darcy Duke. Darcy is the User Experience Librarian and Web Manager at MIT and has been an active member of the HathiTrust UX Special Interest Group. The group worked on finalizing the user personas it drafted over the summer and discussed details regarding a label change for the PDF download link in the PageTurner application.

User Support Working Group

The table below contains a summary of the issues received by the User Support Working Group in October. Nancy Spiegel, of the University of Chicago, stepped down from the User Support Working Group at the end of the month. The Executive Committee would like to heartily thank Nancy for her work on the group, and her contributions to establishing an ongoing body for user support in HathiTrust. Two positions on the working group are currently open. Nominations and inquiries can be sent to jjyork@umich.edu.

Issue Type September Issues October Issues
Content 171 154
  Quality 154 142
  Non-partner Digital Deposit 2 0
  Collections 4 1
Cataloging 25 44
Access and Use 127 136
  Copyright 73 75
  Permissions 12 4
  Takedown 3 4
  Print on Demand 17 2
  Inter-library loan 5 0
  Full-PDF or e-copy requests 24 23
  Datasets 1 1
  Data Availability and APIs 7 2
  Reuse of content 5 2
Web applications 22 29
  Functionality problems 5 6
  Problems with login specifically 0 3
  General Questions about login 2 4
  Partners setting up login 5 1
  Usability issues 6 2
  Feature requests 2 2
Partner Ingest 0 1
General 65 59
  Partnership 12 8
  Infrastructure 0 0
  Miscellaneous 53 51

*See User Support Working Group Issue Types for a description of the types of issues included in each category.

Strategic

Collections

The Collections Committee submitted its recommendations on the treatment of duplicates in HathiTrust to the Strategic Advisory Board (SAB) in October. The recommendations will be posted to the HathiTrust website following incorporation of feedback from the SAB. The Committee will next turn its attention to a process for responding to requests and offers to include additional materials in HathiTrust, among other pending items on its work agenda.

Projects


Bibliographic Data Management

The California Digital Library development team worked with staff at the University of Michigan on a workflow and timeline for migrating all bibliographic data from Michigan’s integrated library system to California. CDL's metadata analyst finalized the internal metadata schema to be used in Zephir, the core metadata management system. Further information about the project can be found at http://www.hathitrust.org/htmms.

HathiTrust Publishing

MPublishing staff at the University of Michigan gathered input from colleagues in library-based publishing programs in October as they worked to finalize requirements, architecture, and design principles for the new publishing system, and archival package specifications for the published content. Michigan developers began adapting the HathiTrust PageTurner to display the new content based on initial specifications. Details about the publishing effort are available at http://www.hathitrust.org/htpub.

IMLS Quality Grant

Data collection on the second sample of 1,000 volumes in HathiTrust continued in October; nearly 80% of the sample was reviewed by month’s end. October also saw the launch of the official grant project website, available at http://hathitrust-quality.projects.si.umich.edu/. The website features an overview of the project and detailed status reports by quarter, from the project’s beginning in January 2011 to the present.

Review of the physical copies of volumes included in the first 1,000-volume sample continued throughout October. The review focuses on capturing bibliographic information and physical characteristics of the volumes that may have an impact on errors observed in the digital volumes. By the end of the month, a volunteer staff of 12 students from the School of Information reviewed 476 volumes, or nearly 50% of the sample. Staff are coordinating inter-library loan requests with member libraries to facilitate efficient receipt of volumes, or on-site review of volumes by member library staff.

Initial analysis of the data from the first 1,000-volume sample was completed by the project statistician and will be available on the project website in November. The second round of data collection is expected to be complete in mid-November.

HathiTrust Research Center

Indiana University staff worked on implementing the technical security infrastructure for the Research Center in October. The first part of this involved setting up InCommon Federation security, which will allow researchers to log in to the HTRC with the username and password issued by their own institution. Once logged in, researchers will have the ability to access data and analysis tools in ways not available to the public. Authenticated access to the HTRC is expected to be available on a limited basis to HathiTrust partners in spring 2012. As the key architectural pieces of the HTRC are put in place, Indiana staff are examining the adoption of a single API by which researchers can access all pieces of the data infrastructure. The best candidate for this appears to be the HathiTrust Data API. Staff will be making proposed extensions to this API available for comment.

Development Updates


Collection Builder

Staff at the University of Michigan improved processes to synchronize bibliographic and rights metadata in the Collection Builder with metadata in the catalog and rights database.

Full-text Search

University of Michigan staff re-indexed the full-text search index to add additional bibliographic metadata, including title information that will enable title displays in full-text search results to match those in the bibliographic catalog. Staff also continued work on advanced search, prototyping several designs for the user interface and working to improve relevance ranking of results. Staff expect to release the advanced search feature in November.

Michigan developers continue to work with staff at the California Digital Library on the development of a spelling suggestion feature. Developers at CDL are investigating modifications to traditional spelling suggestion algorithms, which are generally designed for single-language corpora, to accommodate the many languages in HathiTrust, and testing alternative spelling suggestion algorithms against a sample index.

Staff at Michigan made minor changes to the full-text indexing process to automatically receive notifications when volumes need to be removed from the index, and to improve index monitoring.

PageTurner

Michigan staff completed enhancements to BookReader and its underlying infrastructure to improve the speed at which images from the repository are rendered on the Web. The image-serving application behind BookReader now estimates dimensions for images and updates them as the images are loaded in the Web browser, rather than inspecting each image prior to making the whole volume available. Further enhancements included better positioning of images in the thumbnail and scrolling views, and improved relative sizing of images when pages within a volume vary dramatically in size.

Throttling

Staff at Michigan made progress on the development of new throttling mechanisms for the PageTurner and other applications, which will enter an initial internal testing phase in early November.

Outages

No outages were reported in October 2011.

HathiTrust sends notice upon discovery and resolution of unscheduled outages and in advance of scheduled outages and maintenance work that may result in an outage. We welcome and encourage additional recipients for these notices. If your institution is not receiving outage notifications and would like to, please contact feedback@issues.hathitrust.org.

Presentations


All HathiTrust papers, presentations, and reports are available at http://www.hathitrust.org/papers.

New Growth


As of October 1:

  October Total
Columbia University 7 64,049
Cornell University 110 368,256
Duke University 4,486 4,486
Harvard University 5 52,843
Indiana University 23 186,195
Library of Congress 2,224 73,642
North Carolina State University 0 3,194
University of North Carolina - Chapel Hill 8,087 8,087
Northwestern University 6 5,355
New York Public Library 7 259,165
Penn State University 8 40,815
Princeton University 2 248,916
Purdue University 1 1
University of California 3,646 3,144,989
The University of Chicago 11 8,053
University of Illinois 2 14,503
Universidad Complutense 6 108,344
University of Michigan 195 4,446,510
University of Minnesota 163 88,595
University of Wisconsin 893 505,242
University of Virginia 3 47,330
Utah State 0 46
Yale University 0 23,674
Total 19,885 9,702,290

Public Domain (~27%)

Total* 13,325 2,656,160

November Forecast


  • Complete HathiTrust user personas
  • Release results from first 1,000-volume quality review sample
  • Release full-text advanced search feature
  • Begin testing new mechanisms for throttling

You can follow HathiTrust on Twitter http://www.twitter.com/hathitrust

Update on September 2011 Activities

October 14, 2011

Late Breaking News


Constitutional Convention

On October 8-10, 2011, 130 representatives from 64 HathiTrust partner institutions, including library directors, chief information officers, and senior library administrators, gathered in Washington, D.C. for an unprecedented “Constitutional Convention” to reflect on the accomplishments of HathiTrust since its launch in 2008, and to determine directions and priorities for the partnership in its next phase. The business portion of the meeting consisted of deliberations and voting on 7 ballot initiatives presented by partner delegations prior to the convention. The final proposals and outcomes are available at http://www.hathitrust.org/constitutional_convention2011. A large portion of the Convention was also spent in general discussions on a variety of topics including the new pricing model for partner institutions, lawful uses of library-owned materials, and international cooperation. A more complete report on the Convention, its outcomes, and what they mean for the partnership is forthcoming. The following presentations from the Convention are available on the HathiTrust website:

  • Opening remarks (view text or presentation): John Wilkin, Executive Director, HathiTrust
  • Report on HathiTrust 3-year review and Q&A (view presentation): Ed Van Gemert and Trisha Cruse, HathiTrust Strategic Advisory Board

University of Miami Joins HathiTrust

The University of Miami announced membership in HathiTrust in early October. We are very pleased to welcome Miami to the partnership.

HathiTrust Mobile

Following a soft release in August, HathiTrust is pleased to formally announce its new mobile interface (visit http://m.hathitrust.org). The interface offers mobile-friendly access to key functionality including searching the HathiTrust catalog and reading HathiTrust “Full view” texts. Users from HathiTrust partner institutions can download texts in PDF or ePub format. Since the mobile interface is web-based, it works on all platforms, and may be viewed either from mobile devices or from desktops and laptops. The interface has special functionality for tablets where there are two ways to read texts: either in the vertical scrolling format, or in a horizontal flip format. Please give the new mobile interface a try and don’t hesitate to send your comments and feedback!

Top News


Authors Guild Lawsuit

On September 12, the Authors Guild, the Australian Society of Authors, the Union Des Écrivaines et des Écrivains Québécois (UNEQ), and eight individual authors filed a lawsuit against HathiTrust, the University of Michigan, the University of California, the University of Wisconsin, Indiana University, and Cornell University for copyright infringement. The suit was updated on October 8. We believe this is a misguided and unnecessary lawsuit. A full statement by HathiTrust is available online, and links to statements by the University of Michigan and analysis from a variety of sources are available at http://www.hathitrust.org/authors_guild_lawsuit_information.

Requirements for New Partners

Beginning January 1, 2012, partners joining HathiTrust will need to provide information about their library holdings at the time of joining. The holdings data will be used for partner fee calculations and to offer access on a limited basis to in-copyright materials (see the Holdings Database update in the July newsletter for details). Partners must be configured with Shibboleth for their users to authenticate for partner services in HathiTrust. 

Ingest


Local Digitization Ingest

University of Michigan staff continued work with several partner institutions on ingest of locally-digitized materials, including Northwestern University, Universidad Complutense de Madrid, the University of Florida, the University of Iowa, the University of North Carolina-Chapel Hill, the University of Pittsburgh, and the University of Utah.

Working Groups


User Experience Advisory Group

The UX Advisory Group compiled and discussed a list of possible interface features and improvements that have been requested by users and staff at partner institutions. Three improvements were identified as high priority and will be ongoing topics of discussion until solutions are reached which can be passed to the University of Michigan development team. The improvements are:

  • Redesigning the page turner “landing page” for Limited (search-only) items to better communicate available options
  • Revising PDF download link labels in page turner to better communicate when a full PDF is available without login
  • Adding explicit page numbers or page status to page turner interface

User Support Working Group

The following is a summary of the issues received by the User Support Working Group in September.

Issue Type August Issues September Issues
Content 110 171
  Quality 96 154
  Non-partner Digital Deposit 3 2
  Collections 8 4
Cataloging 26 25
Access and Use 111 127
  Copyright 58 73
  Permissions 23 12
  Takedown 2 3
  Print on Demand 6 17
  Inter-library loan 0 5
  Full-PDF or e-copy requests 14 24
  Datasets 1 1
  Data Availability and APIs 1 7
  Reuse of content 7 5
Web applications 27 22
  Functionality problems 5 5
  Problems with login specifically 1 0
  General Questions about login 3 2
  Partners setting up login 4 5
  Usability issues 11 6
  Feature requests 7 2
Partner Ingest 2 0
General 59 65
  Partnership 13 12
  Infrastructure 1 0
  Miscellaneous 45 53

*See User Support Working Group Issue Types for a description of the types of issues included in each category.

Projects


Bibliographic Data Management

The California Digital Library development team continued to work on improvements to Zephir, the core metadata management system, and adaptations of system components to HathiTrust ingest and management workflows. As part of these improvements, project staff developed a program that doubles the speed of ingest for normalized bibliographic records. The team also worked with University of Michigan staff to identify modifications that have been made to records in HathiTrust over time, part of a broader strategy for managing updates to records in the new system. 

HTPub

A project manager from the University of Michigan joined the team working on HTPub, a two-year project to develop a system that will enable MPublishing at the University of Michigan Library to use HathiTrust as a publishing platform for its journals. The team has refined the project goal and requirements and is formulating design principles, a use case specification, and the system architecture. A full-time software developer has joined MPublishing, focusing on the content ingest and publication management components of this system. 

HathiTrust Research Center

The Communications Working Group began working with staff at Indiana University to create a presence for the HathiTrust Research Center on HathiTrust.org. The new portion of the website is expected to be released in the next several weeks.

IMLS Quality Grant

In September, staff at the University of Michigan and University of Minnesota completed quality review of a sample of 1,000 public domain volumes selected at random from HathiTrust (the sampling strategy is described in the July newsletter). Data for more than 110,000 pages in all were collected. Two reviewers coded 10% of the sampled volumes as a check on inter-coder reliability. The project statistician is analyzing the data and initial findings will be available in October. 

In addition to review of the digital volumes, the project team launched a process to perform physical review on all volumes in the sample. The project programmer created a data collection interface for this review and a volunteer staff of students as well as project staff began to retrieve and evaluate the physical volumes according to a list of specific criteria. The volunteer staff reviewed approximately 10% of the physical volumes by the end of September. 

The project team also prepared for and began review of a second sample of 1,000 digital volumes. The second sample focuses on volumes published after 1922 and employs a different within-book sampling methodology. Whereas in the first run 100 pages at most were sampled from each volume, this run will review a number of pages in each volume proportional to the size of the volume. The second round of data collection is expected to be complete in mid-November. Background information on the project can be found at http://www.hathitrust.org/grants
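The difference between the two within-book sampling schemes can be sketched in Python. This is an illustration only: the function names, the 10% sampling fraction, and the use of simple random selection are assumptions made for the example, not details of the project's methodology.

```python
import random

def capped_sample_size(page_count, cap=100):
    """First-run style: review at most `cap` pages per volume."""
    return min(page_count, cap)

def proportional_sample_size(page_count, fraction=0.10):
    """Second-run style: review a number of pages proportional to volume size.

    `fraction` is a hypothetical rate; the newsletter does not state one.
    """
    return max(1, round(page_count * fraction))

def sample_pages(page_count, size, seed=None):
    """Pick `size` distinct 1-based page numbers, returned in reading order."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(1, page_count + 1), size))
```

Under these assumed settings, an 800-page volume would have 100 pages reviewed in the capped scheme and 80 in the proportional scheme, while a 2,000-page volume would have 100 and 200 respectively.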

Development Updates


Collection Builder

Staff at the University of Michigan implemented a new process for updating rights information for items saved to personal and private collections. 

Full-text Search

University of Michigan staff made modest modifications to full-text search indexing as part of a revised re-indexing strategy. Re-indexing of the full-text and bibliographic metadata for the entire corpus of 9+ million books began in late September and will be completed in early October. The re-index updates the full-text index to Unicode 6, and includes metadata changes that will improve title displays and provide the metadata needed to support access mechanisms that depend on holdings information (e.g., print disabled users). 

Michigan staff developed a prototype for advanced full-text search and performed a preliminary user interaction/usability walkthrough. Michigan developers provided query logs, N-gram data, and term frequency information to staff at the California Digital Library for use in developing and testing a spelling suggestion feature. 

PageTurner

University of Michigan staff worked on improvements to the algorithm used to estimate and update page image sizes for display with BookReader, resulting in faster image display. Staff also added to the thumbnail view the “missing page” placeholder that appears in traditional views of volumes when pages are known to be missing. Pages may be missing from volumes for a variety of reasons, including the pages not being present in the physical volumes that were scanned, and errors in post-scan processing. 

Developers at Michigan made progress on new throttling mechanisms that will be implemented at the web application level. Once completed, these mechanisms will make it possible to adjust throttling thresholds depending on the type of content delivered and ultimately reduce the likelihood of users being throttled during normal use. 

Michigan staff put additional access controls into place in PageTurner, in anticipation of offering access to orphan works. The controls include limiting access to: 

  • One simultaneous user per print copy held by the user’s institution 
  • One page at a time download 
  • Only authenticated users on US soil 
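As a rough illustration of how the first and third controls in this list might combine, consider the toy model below. It is a sketch under stated assumptions, not HathiTrust's implementation: the class name, the holdings dictionary, and the boolean `authenticated` and `in_us` flags are all hypothetical, and the one-page-at-a-time download limit is not modeled.

```python
class OrphanWorkAccess:
    """Toy model: one simultaneous reader per print copy held (hypothetical)."""

    def __init__(self, print_copies):
        # print_copies: {(institution, volume_id): number of print copies held}
        self.print_copies = print_copies
        self.readers = {}  # (institution, volume_id) -> set of active user ids

    def open_volume(self, user_id, institution, volume_id, authenticated, in_us):
        """Grant a reading session only if every modeled control is satisfied."""
        if not (authenticated and in_us):
            return False
        key = (institution, volume_id)
        active = self.readers.setdefault(key, set())
        if user_id in active:
            return True  # this user already holds a session
        if len(active) >= self.print_copies.get(key, 0):
            return False  # every held copy is already in use
        active.add(user_id)
        return True

    def close_volume(self, user_id, institution, volume_id):
        """Release the user's session so another reader can open the volume."""
        self.readers.get((institution, volume_id), set()).discard(user_id)
```

In this model a volume that the institution does not hold in print has zero available sessions, so access is always refused.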

Interface changes were also made to improve display of the copyright status of each work. 

Outages

No outages were reported in September 2011.

HathiTrust sends notice upon discovery and resolution of unscheduled outages and in advance of scheduled outages and maintenance work that may result in an outage. We welcome and encourage additional recipients for these notices. If your institution is not receiving outage notifications and would like to, please contact feedback@issues.hathitrust.org.

Presentations

All HathiTrust papers, presentations, and reports are available at http://www.hathitrust.org/papers.

New Growth

As of September 1:

  September Total
Columbia University 0 64,042
Cornell University 10,815 368,146
Harvard University 24 52,838
Indiana University 33 186,172
Library of Congress 0 71,418
North Carolina State University 240 3,194
Northwestern University 165 5,349
New York Public Library 115 259,158
Penn State University 1,438 40,807
Princeton University 3,132 248,914
University of California 102,280 3,141,343
The University of Chicago 6 8,042
University of Illinois 0 14,501
Universidad Complutense 183 108,338
University of Michigan 14,018 4,446,315
University of Minnesota 181 88,432
University of Wisconsin 6,810 504,349
University of Virginia 19 47,327
Utah State 0 46
Yale University 5,289 23,674
Total 144,748 9,682,405

Public Domain (~27%)

Total* 49,797 2,642,832

October Forecast


  • Release advanced full-text search
  • Re-index entire corpus to support advanced search and to improve relevance ranking
  • Continue work on the spelling suggestion feature

You can follow HathiTrust on Twitter http://www.twitter.com/hathitrust

Update on July 2011 Activities

August 12, 2011

Top News


New partners

Two new partners announced membership in HathiTrust in July: the University of Notre Dame and the University of Florida. Florida announced additionally that it will be offering students, faculty, and other users of UF libraries access to orphan works in HathiTrust that UF also holds in its print collections. We are very pleased to welcome these new institutions and look forward to the ways they will enrich our partnership. News releases can be found at the following links: University of Notre Dame; University of Florida.

3-Year Review

The 3-year review conducted by Ithaka S+R with oversight from the Strategic Advisory Board (SAB) was completed in July and is available at http://www.hathitrust.org/constitutional_convention2011, with an introduction from Ed Van Gemert, Deputy Director of Libraries at the University of Wisconsin-Madison and SAB chair. The review had been planned from the time of HathiTrust’s launch in 2008 to provide a meaningful assessment of the partnership’s accomplishments and outlook leading up to the HathiTrust Constitutional Convention, also planned to occur in the 3rd year. Institutions and consortia that were members of HathiTrust as of October 2010 will participate in the Convention this coming October to review HathiTrust sustainability and governance, and set new directions for the partnership. Details on the Convention are available at the link above. Questions or comments regarding the 3-year review should be directed to Ed Van Gemert at evangemert@library.wisc.edu.

Orphan Works Candidates

HathiTrust posted the first set of orphan works candidates to a public catalog in July. These are works for which, following an extensive review process, rights holders could not be found or contacted. As reported in last month’s update, works in the public catalog that are not claimed by the rights holder after a period of 90 days will be considered orphan works. The first 90-day period will expire in October (the expiration date for each work is posted in the catalog). At that time, partner institutions that wish to do so may begin to offer their users access to orphan works in HathiTrust. More information about the orphan works project can be found at http://www.lib.umich.edu/orphan-works/. Further information on how access will work is included in the Holdings Database newsletter item below.

Collection List and Search Enhancements

Staff at the University of Michigan released several enhancements to the HathiTrust Collections list and full-text search application in July. The enhancements to the Collections interface include improved display of collections, the ability to search collections by title and description, and the ability to filter collections by their featured status, last time of update, number of items, and whether or not they belong to the current authenticated user. New full-text search features leverage the addition of bibliographic metadata to the full-text search index to offer faceting (refinement) of search results, and improved search results relevance ranking. These features were the top two prioritized by the HathiTrust Full-text Working Group for implementation. Staff at Michigan and the California Digital Library will continue to work on features in the prioritized list in August. The third feature, improvements to “within book search”, will be released in the next couple of weeks. Please give these new features a try and send feedback to feedback@issues.hathitrust.org.

Holdings Database: Update and Lawful Uses of In-Copyright Materials

Early in 2011, HathiTrust began development on a database of holdings information from partner institutions designed a) to support the new cost model that will be implemented for all partners in 2013, b) to form a foundation for the expansion of lawful uses of in-copyright materials to partner institutions (such as access to persons who have print disabilities and access to orphan works), and c) to facilitate collective collection development and management activities among the partnership.

The first iteration of this database, containing data for single part monographs at partner institutions, was put into production in July. Staff at the University of Michigan are in the process of incorporating information from the database into existing applications such as the catalog and PageTurner to begin offering partners access to orphan works in HathiTrust, as well as access to in-copyright volumes for users who have print disabilities. The systems needed to provide access in these scenarios are expected to be in place in late-summer/early-fall.

Access to orphan works

Beginning in October, authenticated users from HathiTrust institutions that have elected to grant their users access to orphan works will see orphan works appear as “Full view” in HathiTrust access systems. Access will be available only to orphan works in HathiTrust that are held, or were previously held, in the partner institution’s library system.

Access for users who have print disabilities

Beginning in late summer or early fall, users at partner institutions who are certified as having a print disability will be eligible to view the full text of all in-copyright volumes in HathiTrust that are held, or were previously held, in the partner institution’s library system. In order to gain access, institutions will need:

  • To be configured for authentication to HathiTrust via Shibboleth (see http://www.hathitrust.org/shibboleth)
  • To have provided HathiTrust with information about their print holdings
  • To have a local process by which eligible users have been certified as having a print disability
  • To convey certification status through a new Shibboleth eduPersonEntitlement attribute

Specifics on the syntax of the attribute and any additional information will be disseminated to partners in the coming weeks.

Call for new member of User Support Working Group

The nomination period for a new member of the HathiTrust User Support Working Group has been extended. Please send nominations to jjyork@umich.edu by August 19, 2011.

Ingest


Local Digitization Ingest

Staff at Michigan met with staff from Northwestern University to address questions related to ingest of a set of several hundred locally-digitized volumes. Staff at Universidad Complutense de Madrid began to transfer a second set of locally-digitized manuscripts and incunabula to the University of Michigan for ingest. The first set of locally-digitized materials from Madrid will be ingested in August.

Working Groups


Collections

The Collections Committee is putting the finishing touches on two major work items with which it has been occupied for the last several months: a ballot initiative for a Distributed Print Monographs Archive to be put forward at the Constitutional Convention, and a draft recommendation on the treatment of duplicates in HathiTrust. A draft of the print archives proposal was reviewed with a subgroup of the HathiTrust Executive Committee, which sponsored the initiative, in July; the final version will be forwarded shortly to the full Executive Committee for its approval. The draft duplicates paper will be shared with the Strategic Advisory Board in August for feedback and direction about next steps. A big thank you from the chair (Ivy Anderson) to her colleagues on the committee for terrific work in pulling these proposals together (the charge and membership of the group are available at http://www.hathitrust.org/wg_collections_charge). Once these items are finalized, the committee will turn its attention to other pending items on its work agenda, including a process for responding to individual requests and offers to include additional materials in HathiTrust. 

Communications

In July, the Communications Working Group focused on a number of topics including new partner announcements, a strategy to support public services staff in communicating about HathiTrust, soliciting authors and topics for the HathiTrust blog, and looking ahead to communication needs for the Constitutional Convention. The Communications group invites suggestions from partner institutions and others for topics to be covered in the HathiTrust blog. These should be directed to heather.christenson@ucop.edu.

Usability

The Usability Working Group discussed and provided feedback on the Collections list and full-text search features that were released in July. The group continued to review and track feedback received via the User Support Group on issues related to usability. The HathiTrust User Experience Special Interest Group (HT UX-SIG) has been active in discussions about feature requests and usability improvements to HathiTrust. The HT UX-SIG email group is open to anyone who is interested. Please contact Felicia Poe (Felicia.Poe@ucop.edu) to join.

User Support Working Group

The following is a summary of the issues received by the User Support Working Group in July.

Issue Type Count
Content 90
  Quality 89
  Non-partner Digital Deposit 1
  Collections 2
Cataloging 20
Access and Use 81
  Copyright 52
  Permissions 2
  Takedown 0
  Print on Demand 36
  Inter-library loan 9
  Full-PDF or e-copy requests 13
  Datasets 0
  Data Availability and APIs 1
  Reuse of content 2
Web applications 23
  Functionality problems 7
  Problems with login specifically 6
  General Questions about login 4
  Partners setting up login 6
  Usability issues 3
  Feature requests 8
Partner Ingest 2
General 23
  Partnership 6
  Infrastructure 0
  Miscellaneous 17

Projects


IMLS Quality Grant

In July, grant project staff at the University of Michigan and University of Minnesota started to review the first of several production-level samples of volumes in HathiTrust, conducted according to the error type and severity model developed by the grant project team. The first sample includes 1,000 randomly selected volumes published before 1923 and digitized by Google. Staff will review a set of 100 pages, chosen at evenly-distributed intervals, within each of the 1,000 volumes. A subset of volumes will be reviewed by multiple staff members as a check on inter-coder reliability. The corresponding print versions of all volumes in the sample will undergo a physical assessment to identify potentially meaningful characteristics that affect quality, such as tight bindings, condition, and other physical features. A subset of the digital volumes will also be subjected to full-volume review to measure errors such as missing pages. The goals of the first production run are 1) to test the quality review system developed by the project team on a large scale; 2) to assemble a body of statistical data of sufficient size to begin to test the feasibility of sampling as a strategy to accurately describe error within a group of volumes; 3) to begin to explore the correlation of physical characteristics of books with observed errors in the digital scans. Review of the 1,000-volume sample is expected to be completed in mid-September.
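Choosing pages at evenly distributed intervals can be sketched as follows. This is a minimal Python illustration, assuming that one page is taken from the midpoint of each of 100 equal-width slices of the volume; the function name and the midpoint rule are assumptions, since the newsletter does not describe the exact selection procedure.

```python
def evenly_spaced_pages(page_count, sample_size=100):
    """Return up to `sample_size` 1-based page numbers spread evenly through a volume."""
    if page_count <= sample_size:
        # Short volumes are reviewed in full.
        return list(range(1, page_count + 1))
    step = page_count / sample_size
    # Take the page at the midpoint of each of `sample_size` equal-width intervals.
    return [int(i * step + step / 2) + 1 for i in range(sample_size)]
```

For a 1,000-page volume this selects every tenth page, starting at page 6 and ending at page 996.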

HTPub

The University of Michigan has been examining schema options for representing encoded text journal content in the HathiTrust archival package. An investigation of publisher XML formats has yielded a recommendation to use the Journal Archiving and Interchange Tag Set of JATS (an application of NISO Z39.96) as the XML format for encoded text. UM staff are currently researching Portico’s use of a custom profile of an earlier version of this standard in content normalization.

HathiTrust Research Center

The HathiTrust Research Center has received a $600,000 award from the Alfred P. Sloan Foundation for the first investigation of non-consumptive research for a major large-scale digitized collection of content. The press release for the award is available at http://newsinfo.iu.edu/news/page/normal/19252.html.

The HathiTrust Research Center technical group is working on an end-to-end demonstration test of underlying infrastructure functionality. The test, which is planned to be completed in early September, is being conducted using a subset of the HathiTrust full-text Solr index and Indiana University public domain volumes deposited in HathiTrust. OCR text of the volumes was distributed to the Research Center from HathiTrust and is stored in a noSQL data store to be readily available for research purposes. The test scenario runs as follows: a user logs into the Research Center via an InCommon identity and simple algorithms are executed on the user’s behalf to pull word counts out of the index and do simple pattern-matching. The algorithms and services, which are available to all users, are registered in a web services registry where they can be queried by users. Results in this simple scenario are returned to the user as a URL. The test will allow the HTRC technical group to work out issues related to the HTRC’s core architecture, interfaces, and integrated security model.

Staff at the University of Michigan worked in July to prepare a dataset containing the OCR of approximately 240,000 publicly available non-Google digitized volumes in HathiTrust for distribution to the HathiTrust Research Center. The dataset will be delivered in August and also be available for public download. The HTRC is awaiting resolution on a data agreement that will allow it to host and use OCR text of the full HathiTrust public domain corpus. Pending that agreement, this dataset will allow the HTRC to conduct testing of its infrastructure on a larger scale.

Development Updates


Bibliographic Data Management

The California Digital Library (CDL) development team began the integration phase of the project in July, which focuses on adapting the new management system to the HathiTrust workflow. The team ingested bibliographic records into a virtual staging environment where integration testing with HathiTrust systems will occur. CDL has filled the second Metadata Analyst position for the project, advertised in previous updates. The new staff member will begin work in mid-September.

Data API

Staff at the University of Michigan continued development on security enhancements to the HathiTrust Data API. The enhancements are described at http://bit.ly/jozHQK. Interested parties are invited to submit comments and feedback to feedback@issues.hathitrust.org.

Mobile

Last February, University of Michigan staff began development on mobile interfaces to the HathiTrust catalog and PageTurner. Development of an initial version of these interfaces is nearly complete and staff hope to release beta versions for testing in September.

New Database and Ingest Servers

Michigan staff installed new database and ingest servers as part of the first periodic server replacement cycle, which keeps server infrastructure current on a 3-to-4-year cycle. The new database servers are a little ahead of schedule, but configured to support the higher transactional rates expected with the introduction of the print holdings database. The new ingest servers are expected to provide significantly increased throughput rates for ingesting volumes into HathiTrust.

PageTurner

Staff at Michigan experimented with ways to improve the speed at which page images are loaded in the new views for scrolling and flipping through books that were implemented in April. Throughout August, staff will continue to test the strategy, which involves estimating the pixel dimensions of all pages in a volume based on a small sample and making adjustments as actual pages are retrieved.
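The estimate-then-adjust idea might look something like the following Python sketch. The class name, the choice of the median as the estimate, and the data layout are assumptions made for illustration; the newsletter does not say which statistic or data structures are actually used.

```python
from statistics import median

class PageSizeEstimator:
    """Estimate page image sizes from a small sample, correcting as real sizes arrive."""

    def __init__(self, page_count, sampled_sizes):
        # sampled_sizes: {page_number: (width, height)} for a few measured pages.
        # At least one sampled page is required.
        self.page_count = page_count
        self.known = dict(sampled_sizes)
        self._refresh_estimate()

    def _refresh_estimate(self):
        # Assumed rule: use the median of all sizes measured so far.
        widths = [w for w, _ in self.known.values()]
        heights = [h for _, h in self.known.values()]
        self.estimate = (median(widths), median(heights))

    def size_of(self, page):
        """Measured size if known, otherwise the current estimate."""
        return self.known.get(page, self.estimate)

    def record_actual(self, page, width, height):
        """Store a retrieved page's true size and update the estimate."""
        self.known[page] = (width, height)
        self._refresh_estimate()
```

A viewer could lay out placeholders for every page using `size_of` immediately, then call `record_actual` as each image arrives so that later placeholders become more accurate.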

Michigan staff continued work on a more sophisticated throttling system to improve the experience of using HathiTrust while ensuring compliance with third-party agreements on content and offering equal access for all users to HathiTrust applications. The new system will provide throttling controls at finer levels so that, for example, delivering thumbnail page images to a user in PageTurner does not count as heavily against a user’s access quota and limit their ability to view full-size pages. 
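One way to picture throttling at finer levels is a quota in which each type of content carries its own cost. The sketch below is a hypothetical Python illustration: the cost values, the fixed-window design, and all names are assumptions and do not describe the system under development.

```python
import time

# Hypothetical per-request costs: cheaper content consumes less of the quota.
REQUEST_COST = {"thumbnail": 1, "page_image": 10, "pdf_page": 20}

class WeightedThrottle:
    """Fixed-window quota where each request type has its own cost."""

    def __init__(self, quota, window_seconds, clock=time.monotonic):
        self.quota = quota
        self.window = window_seconds
        self.clock = clock
        self.usage = {}  # user_id -> (window_start, cost_used)

    def allow(self, user_id, content_type):
        """Return True and charge the user if the request fits in the quota."""
        now = self.clock()
        start, used = self.usage.get(user_id, (now, 0))
        if now - start >= self.window:
            start, used = now, 0  # window expired: start a fresh one
        cost = REQUEST_COST[content_type]
        if used + cost > self.quota:
            self.usage[user_id] = (start, used)
            return False
        self.usage[user_id] = (start, used + cost)
        return True
```

With these assumed costs, browsing ten thumbnails uses the same share of the quota as viewing a single full-size page, so thumbnail browsing does little to limit later page views.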

Outages

There were no outages in July.

Presentations


All HathiTrust papers, presentations, and reports are available at http://www.hathitrust.org/papers.

New Growth


Number of volumes added:

  June Total
Columbia University 95 64,001
Cornell University 17,542 345,094
Harvard University 17 52,727
Indiana University 12 184,887
Library of Congress 0 71,418
New York Public Library 136 258,828
Penn State University 29 39,174
Princeton University 2,489 241,595
University of California 489,975 2,983,660
The University of Chicago 164 6,467
University of Illinois 0 14,501
University of Madrid 7 107,954
University of Michigan 36,362 4,404,849
University of Minnesota 376 87,645
University of Wisconsin 15,020 488,031
University of Virginia 1 47,304
Yale University Library 0 18,385
Total 562,225 9,416,549

Public Domain (~27%)

Total* 94,470 2,508,391

* Includes volumes opened through copyright review or rights holder permissions.

August Forecast


  • Release updated search-within-a-book feature
  • Finalize proposal for a collaborative print archiving strategy.

Update on June 2011 Activities

July 8, 2011

Top News


Access to Orphan Works

Following the announcement in May of a new initiative to identify orphan works in HathiTrust, the University of Michigan announced last month that it would be making orphan works identified in HathiTrust that are also held in its library collections available to Michigan students, faculty, staff, and other visitors to the UM libraries. Works that are identified as orphan candidates through an extensive review process will be posted in a public catalog at UM for 90 days. Works left unclaimed by rights holders after this time will be considered orphans. Michigan expects to begin offering access to orphan works beginning in the fall. Joining Michigan at the initial release will be the University of Wisconsin-Madison; other partner institutions may begin to make these uses in the coming months.

3-Year Review

Update on the Briefing Paper on Progress and Opportunities for HathiTrust Prepared by Ithaka S+R for the HathiTrust Strategic Advisory Board (SAB)

By Ed Van Gemert, Chair, SAB

The HathiTrust Strategic Advisory Board received the draft three-year review prepared by Ithaka S+R on 17 June 2011. The SAB along with Ithaka staff is currently working to revise that draft. Following the revision period, the final report from Ithaka is to be delivered to the SAB on 15 July 2011.

The SAB initially charged Ithaka S+R staff to challenge our collective thinking and the review has certainly done that. The final report, and portions thereof, will be broadly distributed prior to the October 2011 Constitutional Convention. The draft report suggests additional attention and work in several key areas, including:

  • Clearly defining objectives for the next 3-5 years, possibly mapping out a rationale for those objectives in the context of a revised mission statement.
  • Enhancing information about HathiTrust’s strategic priorities to partner libraries.
  • Discussing the advantages and disadvantages of a membership-driven governance structure.
  • Demonstrating to partner libraries the sustainability and feasibility of the new cost model for HathiTrust.
  • Making decisions based on the most pressing goals and objectives for HathiTrust about how large the membership for the initiative needs to grow.

The SAB expects thorough discussions at the upcoming Constitutional Convention around these and other important questions regarding the future shape of HathiTrust and the role that current and future partner libraries will play in governing and sustaining HathiTrust.

HathiTrust Research Center (HTRC)

The HathiTrust Research Center hosted a reception at the Digital Humanities 2011 Conference held in Palo Alto, California, on June 20, 2011. The reception was sponsored by Indiana University and the University of Illinois, the institutions developing the HTRC, and by Google. Opening remarks were given by HTRC directors Beth Plale and John Unsworth, and Google Engineering Director John Orwant. The reception was well attended and well received. The HTRC stressed its receptivity to working with researchers broadly within the scope of available resources to provide computational access to the growing body of HathiTrust materials.

The day before the reception HTRC directors traveled to Oakland, CA to meet with Laine Farley, the HathiTrust Executive Committee liaison to the HTRC, and Heather Christenson, chair of the HathiTrust Communications Working Group. The group was later joined by David Greenbaum of Project Bamboo. Discussions focused on interactions between the HTRC and HathiTrust and ways in which HTRC will collaborate with other projects such as Project Bamboo.

The HTRC is pleased to announce receipt of a $606,000 three-year award from the Alfred P. Sloan Foundation to explore architectural issues around large-scale non-consumptive research. Beth Plale is the PI of the project, with co-PIs Atul Prakash of the University of Michigan and Robert McDonald of Indiana University.  

The HTRC wrote letters of support for three proposals to the second round of the Digging into Data Challenge.

Call for New Member of the User Support Working Group

The Executive Committee is seeking nominations from all partner institutions for a new member of the User Support Working Group. One of the current 8 members will be stepping off the group at the end of July. User Support members are on call to answer inquiries at least one day per week and spend an average of 2-3 hours per week investigating issues and responding to users. Nominations should be sent to Jeremy York (jjyork@umich.edu) before August 1, 2011.

Ingest


Local Digitization Ingest

HathiTrust began ingest of the first large set of locally-digitized volumes from Yale University in June. More than 18,000 had been ingested as of July 1.

Working Groups


Collections

A draft ballot initiative for a print management proposal, intended to be voted on at the Constitutional Convention, will be shared with the HathiTrust Executive Committee’s print management subgroup in July. The Committee also expects to submit its draft discussion paper on duplicate volumes in HathiTrust to the Strategic Advisory Board in July for initial feedback. Work on recommendations for a process for responding to user-initiated requests has been put on temporary hold while the first two deliverables are finalized.

Communications

The Communications Working Group launched a new HathiTrust blog in June, “Perspectives from HathiTrust”, with its inaugural post by HathiTrust Executive Director John Wilkin. The blog will feature authors from among the partner institutions writing on a variety of topics. The group also released a mid-year update on HathiTrust activities in conjunction with the ALA annual conference.

Discovery Interface

After careful consideration and consultation with the HathiTrust Strategic Advisory Board, the Discovery Interface Working Group (DIWG) has officially disbanded. The DIWG, initially convened in spring 2009, fulfilled its charge to accomplish the implementation of the HathiTrust WorldCat Local Prototype interface. One important aspect of this project was working with OCLC to get all of the HathiTrust records loaded into WorldCat. Along the way, the DIWG also supervised the first phase of the HathiTrust Full-Text Search Subgroup and delivered a set of requirements to OCLC for the next phase of HathiTrust WorldCat Local catalog development in FY 2012.

At this point, the focus will shift from the group’s original charge to the ongoing maintenance and development of the HathiTrust WorldCat Local catalog. Julia Lovett of the University of Michigan will be the project manager for this effort, and will draw on the expertise of HathiTrust partner colleagues as needed. The DIWG executive team—John Butler, Lee Konrad, and Julia Lovett—would like to thank all the DIWG members for their contributions: Adam Brin, Patricia Martin, Christopher Walker, Lisa German, Kevin Clair, Suzanne Chapman, and Jon Rothman. Special thanks to John Wilkin and to the HathiTrust SAB for providing valuable guidance and input, and to Bill Carney and the OCLC WorldCat Local team for their very hard work on this project.

Usability

Work on the development of HathiTrust personas reported in April’s update continued in June. The group has also begun reviewing feedback received via the User Support Group to help discover and track usability issues.

User Support Working Group

The User Support Working Group and staff at the University of Michigan fielded more than 750 user inquiries from April through June 2011. The breakdown of issues received during that time is shown in the table below. We will continue to report these statistics on a monthly basis.

Issue Type Count
Content 347
  Quality 302
  Non-partner Digital Deposit 2
  Collections 21
Cataloging 54
Access and Use 246
  Copyright 139
  Permissions 14
  Takedown 2
  Print on Demand 16
  Inter-library loan 3
  Full-PDF or e-copy requests 59
  Datasets 19
  Data Availability and APIs 10
  Reuse of content 11
Web applications 86
  Functionality problems 30
  Problems with login specifically 9
  General Questions about login 8
  Partners setting up login 3
  Usability issues 13
  Feature requests 19
Partner Ingest 12
General 68
  Partnership 30
  Infrastructure 5
  Miscellaneous 33

See User Support Working Group Issue Types for a description of the types of issues included in each category.

Projects


IMLS Quality Grant

The grant project team’s work in June focused on preparations for production level data collection to begin in early July.  These preparations included continuing work to examine and improve inter-coder consistency, incorporating new data from the University of Minnesota review team, and undertaking several small sampling exercises to guide development of a model for systematic random sampling of HathiTrust volumes, and pages within volumes, for quality review. The project team, under the guidance of the Principal Investigator and team statistician, completed a draft of this model in June. The first large sample for production level analysis will be drawn in early July. Additional information on the project can be found at http://www.hathitrust.org/grants.
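The project's sampling model is not described in this update. As a general illustration of systematic random sampling only, the following Python sketch draws a random starting offset and then takes items at a fixed interval. The function name and the volume identifiers are hypothetical and are not part of the grant project's tools.

```python
import random


def systematic_sample(items, sample_size, rng=None):
    """Return sample_size items chosen by systematic random sampling.

    A random starting offset is drawn within the first interval, then
    items are taken at a fixed interval from that offset.
    """
    if sample_size <= 0 or sample_size > len(items):
        raise ValueError("sample_size must be between 1 and len(items)")
    rng = rng or random.Random()
    interval = len(items) / sample_size      # fractional step keeps the spread even
    start = rng.random() * interval          # random offset in [0, interval)
    last = len(items) - 1
    # min() guards against a floating-point rounding edge case at the very end.
    return [items[min(int(start + i * interval), last)] for i in range(sample_size)]


# Hypothetical volume identifiers, for illustration only.
volumes = [f"vol{i}" for i in range(100)]
sample = systematic_sample(volumes, 10, random.Random(1))
```

The same function could be applied a second time to the pages of each sampled volume to sample pages within volumes.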

HTPub

The University of Michigan hired the first of two programmers to work on the HTPub project. Interviews will take place in July for the second opening. Meanwhile, Michigan continued to examine schema options for representing journal content in the HathiTrust archival package, and questions surrounding interoperability of the envisioned HTPub software components with the HathiTrust repository. Details on the project can be found at http://www.hathitrust.org/htpub.

Development Updates


Bibliographic Data Management

The California Digital Library team completed development of the major functionality for the core metadata management system, and on June 14, 2011, demonstrated the core system to staff at the University of Michigan. For initial testing, the system was loaded with approximately 200,000 metadata records from HathiTrust partner institutions. When it is implemented in 2012, the system will initially manage close to eight million records.

The next major development effort is to adapt the new system to the HathiTrust workflow. This includes integrating the system with the HathiTrust rights management database and developing batch export functionality for metadata records. CDL is working with University of Michigan staff to understand the particulars of the HathiTrust workflow.

CDL continues to interview for the open Metadata Analyst position: http://www.cdlib.org/services/d2d/d2d_mda2.html.

Further information on the project is available at http://www.hathitrust.org/htmms.

Collection Builder

University of Michigan staff began to code enhancements to the Collection Builder interface in June. The enhancements will allow users to explore the list of collections more easily using new filtering and searching options. Deployment of the new interface is expected in July.

Data API

Michigan staff began development of security enhancements to the HathiTrust Data API in June. The enhancements are described at http://bit.ly/jozHQK. We invite interested parties to submit any comments or feedback to feedback@issues.hathitrust.org.

Development Environment

Michigan staff deployed a timestamp-based sentinel file in the development environment to make it easier for the Plack Perl module, which was implemented to support the new PageTurner functionality, to stay up-to-date when changes to Plack-based applications are deployed to production.
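The details of the sentinel mechanism are not given here, and the actual implementation is in Perl with Plack. As a rough illustration of the general idea only, the following Python sketch treats a file's modification time as a deployment marker: a long-running process compares the current timestamp with the one it last saw and reloads when they differ. The class name and file path are hypothetical.

```python
import os


class SentinelWatcher:
    """Track a sentinel file's modification time to detect new deployments."""

    def __init__(self, sentinel_path):
        self.sentinel_path = sentinel_path
        self.last_seen = self._mtime()

    def _mtime(self):
        # A missing sentinel is treated as "nothing deployed yet".
        try:
            return os.stat(self.sentinel_path).st_mtime_ns
        except FileNotFoundError:
            return None

    def changed(self):
        """Return True once each time the sentinel's timestamp changes."""
        current = self._mtime()
        if current != self.last_seen:
            self.last_seen = current
            return True
        return False
```

In this sketch a deployment script would update the sentinel's timestamp after installing new code, and the application would call `changed()` before handling a request to decide whether to reload.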

Full-text Search

Staff at Michigan completed development to replace the XPat search engine with Solr as the mechanism for searching inside individual volumes from PageTurner (details on the change were reported in the Update on May 2011 Activities). Use of the Solr back-end will eliminate differences between the ways that Solr and XPat work currently, which can interfere with searching activities, and improve relevance ranking of page-level results. Michigan staff have begun to test the current Solr configuration and search performance to optimize indexing and query response times. The code supporting the new functionality will undergo final testing for production deployment after the release of the new faceting and relevance-ranking features for full-text search, which is projected for mid-July. The coding for these features, the top two identified by the HathiTrust Full-text Working Group, was completed in June, and usability and internal tests are underway in preparation for the mid-July release.

PageTurner

HathiTrust has throttling protections in place to prevent systematic download of materials in the repository for which, due to third-party agreements, this type of activity is not allowed (see the Message from John Wilkin in the Update on September 2010 Activities). Staff at Michigan have started a process to add more sophisticated capabilities to HathiTrust applications (for instance, optimization of thumbnail presentation) that will ensure compliance with such agreements while offering fewer interruptions to use.

New Storage

Michigan staff upgraded software on both Michigan and Indiana storage instances and added 100TB of new capacity with no service interruption.

Outages

There were no outages in June.

Presentations


All HathiTrust papers, presentations, and reports are available at http://www.hathitrust.org/papers.

New Growth


Number of volumes added:

  June Total
Columbia University 0 63,906
Cornell University 16,321 327,552
Harvard University 0 52,710
Indiana University 156 184,875
Library of Congress 0 71,418
New York Public Library 1 258,692
Penn State University 23 39,174
Princeton University 21 239,106
University of California 37,321 2,493,685
The University of Chicago 132 6,303
University of Illinois 0 14,501
University of Madrid 1,203 107,947
University of Michigan 12,603 4,368,487
University of Minnesota 625 87,269
University of Wisconsin 7,598 473,011
University of Virginia 0 47,303
Yale University Library 18,114 18,385
Total 94,118 8,854,324

Public Domain (~27%)

Total* 35,670 2,413,921

* Includes volumes opened through copyright review or rights holder permissions.

July Forecast


  • Release new faceting and relevance ranking features for full-text search
  • SAB to receive 3-year review report from Ithaka S+R

Update on May 2011 Activities

June 10, 2011

[Download PDF]

Top News


Orphan Works Research

With funding from HathiTrust, the University of Michigan Library Copyright Office has begun work to identify orphan works – works that are known to be in copyright, but whose rights holders cannot be identified or located – in HathiTrust’s growing repository. The goal of the project is to provide concrete data on the number of orphan works in HathiTrust, which could be used in the creation of legal or policy-based frameworks to allow broader access to orphan works for scholarly and research purposes. The official press release is available on the University of Michigan Library website.

New Members on the SAB

The Strategic Advisory Board welcomed two new members from the University of California in May: Todd Grappone, Associate University Librarian for Digital Initiatives and Information Technology at UCLA, and Julia Kochi, Director, Digital Libraries and Collections at UC San Francisco. Todd and Julia take the place of Bernie Hurley, UC Berkeley and Bruce Miller, UC Merced, who are stepping down from their duties on the SAB. The HathiTrust Executive Committee would like to thank Bernie and Bruce for their important contributions to the partnership on this committee.

Ingest


Local Digitization Ingest

Michigan staff continued to work out details of ingest with several institutions, including loading bibliographic metadata for additional volumes from Yale, and completing an initial pre-ingest transformation process for locally-digitized content from Universidad Complutense de Madrid.

University of Virginia

HathiTrust completed ingest of more than 47,000 volumes contributed by the University of Virginia in May.

Working Groups


Collections

The Collections Committee continues to work on its current key deliverables, including recommendations regarding duplicate volumes in HathiTrust, coordinated print management, and responding to user requests to contribute volumes to the repository. The group plans to share a draft discussion paper on duplicates with the Strategic Advisory Board in June or July for initial feedback, and will also review its work on print management with the HathiTrust Executive Committee’s print management subgroup during that same timeframe.

Communications

In May, the Communications Working Group continued planning for a HathiTrust Facebook presence, and made progress on the development of new HathiTrust promotional materials. The group also began discussions with the Usability group on working collaboratively to assemble user stories, and other potential synergies between the two groups.

Usability

Work on the development of HathiTrust personas reported in last month’s update continued in May. The group continued to solicit and collect real-life user stories, including in particular those based on librarian and patron interactions. The group’s liaison to the Communications group attended the May Communications call to give an update on the progress of the persona project and to discuss collaboration on, and use of, the collection of real-life user stories.

For the last few months the group has been soliciting members for a User Experience Special Interest Group (HT UX-SIG) and has now received around thirty volunteers. This new SIG will be activated in June.

User Support Working Group

The User Support Working Group assumed responsibility for the wide range of inquiries and feedback received through HathiTrust interfaces and help email addresses in May. The 8-member group has established an on-call rotation throughout the week, including weekends, to address issues in a timely and efficient manner. HathiTrust’s response to user feedback received several positive comments via Twitter during the last month. The working group is committed to maintaining a high level of service to address user comments, feedback, and suggestions.

Projects


IMLS Quality Grant

The grant project team’s work in May focused on the preparation of materials to orient and bring several newly-hired reviewers at the University of Minnesota on board to data collection and review. This work included updating and streamlining the quality review Web application to allow for efficient remote operation. Training of the staff at Minnesota commenced at the end of May and the new reviewers are set to begin work in June. As data from Minnesota are collected, the project statistician will be examining inter-coder reliability among all reviewers and working to establish a final model for sampling volumes in the repository. Gathering data for analysis will be the focus of the grant team’s efforts in June. Additional information on the project can be found at http://www.hathitrust.org/grants.

HathiTrust Research Center

Staff at Michigan have begun discussing mechanisms to synchronize the text and bibliographic records of public domain materials in HathiTrust to the HathiTrust Research Center. Michigan staff developed an initial model for the data transfer (using rsync) in May, and Indiana University staff began performing tests with sample data.

Development Updates


Bibliographic Data Management

The California Digital Library (CDL) development team is preparing a demo of the Metadata Management core system for staff at the University of Michigan on June 14, 2011. As the first major component of the new HathiTrust Metadata Management System, this is a major milestone and deliverable. The next step will be for the CDL team to address feedback raised in the demo. The team continues to interview for the open Metadata Analyst position. In the meantime, a senior metadata analyst at CDL has conducted a full metadata audit, confirming the validity of the core system design. Further information on the project is available at http://www.hathitrust.org/htmms.

Data API

University of Michigan staff completed the first draft of requirements for improved security in the Data API. The draft has been made available for comment at http://bit.ly/jozHQK. We ask that interested parties submit comments to feedback@issues.hathitrust.org. Initial coding will begin as feedback is received.

Development Environment

Staff at Michigan implemented a “diff” service as part of support for administration of the HathiTrust Development Environment (HTDE). At the time when new code is staged for testing, the code administrator can now choose to see differences between the last deployed version of the code repository and the version being staged in preparation for the next deployment. Michigan staff also implemented topic branch staging for beta testing. This facilitates testing of code changes on a staged beta testing site without pushing the code branch to the central code repository before its desired time. Parties at partner institutions that are interested in exploring the development environment should contact feedback@issues.hathitrust.org.
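The HTDE diff service itself is not described in detail here. As a small illustration of the underlying idea, comparing the last deployed version of a file with the version being staged, the following Python sketch uses the standard library's `difflib`. The function name and the `deployed/` and `staged/` labels are assumptions made for this example.

```python
import difflib


def staged_diff(deployed_text, staged_text, name="file"):
    """Return a unified diff between the deployed and staged versions of a file."""
    diff = difflib.unified_diff(
        deployed_text.splitlines(keepends=True),
        staged_text.splitlines(keepends=True),
        fromfile=f"deployed/{name}",
        tofile=f"staged/{name}",
    )
    # An empty string means the two versions are identical.
    return "".join(diff)
```

A real service would more likely ask the version control system for the differences between two revisions of the whole repository; the sketch shows only the per-file comparison.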

Full-text Search

HathiTrust full-text search uses the Lucene-based Solr search engine to index content and provide volume-level results. However, when searches are conducted within a single volume, a different search engine known as XPat is used to dynamically index and search the volume and display page-level results. Differences between the ways that Solr and XPat work sometimes cause inconsistencies in the user’s experience. To remedy this, staff at Michigan have started a process to replace XPat with Solr. The majority of this work is to be completed in June, though testing and optimizing may result in a later release date. Accomplishing the change will achieve one of the higher priority features identified by the Full-text Search Working Group: improved results display for multiword searches when searching within a book. Staff will conduct this work in parallel with other full-text search improvements currently underway, including the use of bibliographic metadata for relevance ranking and faceting of search results, and, with development contributions by CDL, a “spelling suggestion” feature. Michigan staff aim to release the relevance ranking and faceting improvements by July 1st.

HTPub

Michigan filled one of two programmer positions advertised for the new HathiTrust publishing initiative, led by the MPublishing division at the University of Michigan Library. The new hire will start on June 27th. The search continues for the second position, which is posted at http://umjobs.org/job_detail/54579/application_developer.

MPublishing recently hired an intern who will be working over the summer to explore potential archival XML schema solutions for electronic journal content.

New Auditing Process and Servers

During their last visit to HathiTrust’s Indianapolis storage facility, Michigan staff installed two new servers that will perform periodic, generalized repository auditing, including checksum validation of repository content, using newly-developed auditing tools. The auditing tools can also perform ad-hoc cross-repository analysis as they run, culling information from the repository using custom one-time scripts. For example, staff may add a custom script to the next auditing run to analyze and report on a specific detail of PREMIS metadata usage.
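The auditing tools are not described further in this update. As a minimal sketch of what checksum validation involves, the following Python example recomputes each file's digest and compares it with a recorded value. The use of SHA-256, the function names, and the manifest layout (a mapping from file path to expected hex digest) are all assumptions made for illustration.

```python
import hashlib


def file_checksum(path, algorithm="sha256", chunk_size=1 << 20):
    """Compute a file's hex digest, reading in chunks to bound memory use."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def audit(manifest):
    """Return the paths whose current checksum differs from the recorded one.

    manifest maps each file path to its expected hex digest. Files that
    cannot be read are reported as failures as well.
    """
    failures = []
    for path, expected in manifest.items():
        try:
            actual = file_checksum(path)
        except OSError:
            actual = None
        if actual != expected:
            failures.append(path)
    return failures
```

An ad-hoc analysis of the kind mentioned above could be added as an extra function called on each path inside the same loop, so that the repository is read only once per run.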

PageTurner

Enhancements to the new PageTurner views were released in May in response to user feedback. Staff at the University of Michigan added a full-screen viewing mode, optimizing the use of screen space for content display, and improved landscape image viewing, aligning viewing controls to browser window dimensions when scrolling through the image viewport. Staff also researched ways of improving performance for larger books.

Storage Replacement Cycle

Michigan staff have completed security wipes on all recently retired storage equipment. The equipment was returned to the vendor for a credit, completing this (the first) annual replacement cycle. The next cycle is planned for the first quarter of 2012.

Outages

There were no outages in May.

Partner News


CDL Opens HathiTrust SFX Target to Broader SFX Community

In September 2010, California Digital Library’s Discovery and Delivery group released an SFX target for HathiTrust monographs, which was made available to partnering libraries. In May, CDL made the target available to libraries broadly via EL Commons CodeShare, a forum hosted by Ex Libris. The formal announcement is available on the CDL website. Please contact Margery Tibbetts (Margery.Tibbetts@ucop.edu) with questions and inquiries.

Presentations


All HathiTrust papers, presentations, and reports are available at http://www.hathitrust.org/papers.

New Growth


Number of volumes added:

  May Total
Columbia University 5,423 63,906
Cornell University 121 311,231
Harvard University 1 52,710
Indiana University 838 184,719
Library of Congress 0 71,418
New York Public Library 0 258,691
Penn State University 135 39,151
Princeton University 2,051 239,085
University of California 47,637 2,456,364
The University of Chicago 999 6,171
University of Illinois 0 14,501
University of Madrid 2,947 106,744
University of Michigan 17,516 4,355,884
University of Minnesota 1,659 86,644
University of Wisconsin 11,081 465,413
University of Virginia 47,303 47,303
Yale University Library 110 271
Total 137,821 8,760,206

Public Domain (~27%)

Total* 174,061 2,378,582

* Includes volumes opened through copyright review or rights holder permissions.

June Forecast


  • Begin development of Data API security features
  • Release first wave of new full-text search features
  • Begin to implement improvements to the Collection Builder list of collections

Report on HathiTrust 3-Year Review


Ed Van Gemert, for the Strategic Advisory Board

This update is a follow-on to the report on the HathiTrust 2011 Constitutional Convention given in the Update on January 2011 Activities.

HathiTrust contracted in March with Ithaka S+R to conduct a three-year review of HathiTrust’s progress toward meeting the needs of libraries, scholars, students and other users. The review will inform discussion and promote participation at the October 8-10, 2011 Constitutional Convention in Washington DC. The Strategic Advisory Board (SAB) is providing oversight for the review, working closely with the Ithaka staff assigned to the project.

Ithaka’s efforts include gathering and preparing research on HathiTrust’s existing structure and needs, including background meetings with stakeholders, team members at the University of Michigan, and members of both the Executive Committee and the Strategic Advisory Board.  

A survey and review of user needs has followed the initial research. A survey was sent to the 52 HathiTrust Contributing Partners and Sustaining Partners (non-content contributing). Ithaka S+R is also interviewing 20 representatives from libraries that do not currently participate in HathiTrust, along with 12 scholars in the humanities and social sciences. The survey officially closed at the end of the day on 3 June and results are being formulated.  

Preliminary indicators from this research process provide useful data and commentary including:

  • Evidence of progress in meeting functional objectives.
  • Commentary on the value of HathiTrust to the Contributing and Sustaining Partners, including perspectives on cost savings, cost avoidance, and the continued need for clarity on the new cost model which is scheduled to go into effect in 2013.
  • Projected levels of contributed library staff support for development that will help to inform prioritization of projects requiring the right balance between centralized and decentralized staffing or expertise.
  • Questions about who will govern as the partnership expands, with data and commentary indicating a need for a clear method for input into executive decisions.
  • Interest in the greater environment for HathiTrust and connections with other initiatives. The community is curious to know how HathiTrust may be connected with the Digital Public Library of America (DPLA).

Follow-up interviews by Ithaka S+R staff will probe these and other issues further.

Ithaka S+R is required to submit a draft briefing memo to the SAB on June 17, 2011. Ithaka will then take comments from the SAB and the Executive Committee until July 1. Following a two-week revision period, Ithaka will submit a final report to the SAB on July 15, 2011. The SAB will then distribute Ithaka’s report to the HathiTrust membership for full discussion and comment leading up to the Constitutional Convention in October.

Questions or comments regarding the three-year review can be directed to Ed Van Gemert (evangemert@library.wisc.edu), Deputy Director of Libraries at the University of Wisconsin-Madison and Chair of the HathiTrust Strategic Advisory Board.

Update on April 2011 Activities

May 13, 2011

[Download PDF]

Top News


New PageTurner

HathiTrust released new functionality for its PageTurner application in April, improving the way volumes in the repository can be viewed and used. Enhancements to the PageTurner include:

  • New views that allow users to scroll through volumes, flip pages similar to a physical book, and view thumbnail images of all pages in a volume
  • Reorganized and streamlined interface including prominent display of copyright status, and re-positioning of navigation features
  • Quick-copy links to volume pages in addition to permanent volume URLs
  • Improved user experience for full book PDF downloads

Development of the new functionality was initiated by staff at the California Digital Library (CDL) in HathiTrust’s collaborative development environment, and completed by staff at the University of Michigan. The Usability Working Group provided input and feedback on the interface design. The new views were built using Open Library’s open source BookReader. The thumbnail view was created specifically for HathiTrust by CDL staff, and has been incorporated as a standard feature in the core BookReader software.

We welcome comments and feedback on the new PageTurner. Please use the “Feedback” link that appears in the upper right corner of the page when viewing HathiTrust volumes, or email feedback@issues.hathitrust.org.

Support for Publishing

HTPub is an effort of the MPublishing Division of the University of Michigan Library to enable the use of HathiTrust as a platform for publishing open access electronic journals. It was first reported on in the Update on October 2010 Activities, and has been in planning stages over the winter. MPublishing recently hired a summer intern who will be working with Michigan staff to define requirements for archival objects produced through HTPub. Michigan is in the process of hiring two full-time positions to support the work of the initiative. More information is available on the HTPub project page.

MDL Images

John Butler of the University of Minnesota, John Weise of the University of Michigan, and project consultant Eric Celeste briefed CNI membership at the Spring 2011 Membership Meeting on the Minnesota Digital Library-HathiTrust image content prototype project. A summary of the project and slides for the presentation are available at http://www.hathitrust.org/mdl_images. Access to the images, now in the HathiTrust repository, will be enabled in late May or June. MDL has yet to draw conclusions regarding deposit of images in HathiTrust beyond the prototype phase. However, much has been learned throughout the project, and HathiTrust intends to use the prototype and the experience gained as a base for developing general image ingest specifications that can be used for ingest of images from partner libraries.

Ingest Reports

HathiTrust has begun to post weekly reports on the ingest status of content submitted by partner institutions. The reports, along with a description of the information they include, are available on the HathiTrust website.

Ingest


Local Digitization Ingest

Michigan staff worked with Universidad Complutense de Madrid, Yale University, and the University of Illinois in April on ingest of locally-digitized volumes.  We expect to begin ingest of volumes from Madrid in May, as well as the full set of volumes from Yale (a sample was ingested in December).

Harvard University

Ingest of an initial set of more than 50,000 volumes from Harvard University was completed in April.

Working Groups


Collections

The Collections Committee continues to work on a series of recommendations regarding duplicate volumes in HathiTrust, coordinated print management, and responding to user requests to contribute volumes to the repository. A draft discussion paper on duplicates will be shared with the Strategic Advisory Board in June for initial feedback.

Communications

The Communications Working Group finished a round of new partner webinars on April 12 and 15. The webinars were well attended and generated questions and rich discussion. The webinar slides and audio recording are available on the HathiTrust website. The working group also continued to craft a Facebook presence for HathiTrust, plan for a HathiTrust blog, and develop informational materials for use by partner libraries.

Usability

The Usability Working Group made significant progress in April in developing a set of personas for HathiTrust users and scenarios of use. To help inform this draft set, the group has been gathering real life use cases from user feedback, reference interactions with users, and uses of HathiTrust that have been posted in blogs and tweets. It has also been analyzing HathiTrust usage statistics for trends. The personas and scenarios are intended to inform development and policy-making surrounding HathiTrust applications and interfaces. The group anticipates having the draft set of personas and scenarios ready to share with partner institutions and other HathiTrust working groups in May. The personas will be refined over time as additional use cases are assembled and user research conducted.

The Usability Group is still accepting volunteers to join the new User Experience Special Interest Group (UX-SIG), reported in February’s update. Please contact Suzanne Chapman (suzchap@umich.edu) if you are interested in joining this group or have any questions about participation.

User Support Working Group

During March and April, the chair of the User Support Working Group coordinated with staff members at the University of Michigan who have been handling user feedback for HathiTrust to configure a partner-wide issue tracking system using JIRA. User Support members began accessing the system in April and observing the preliminary processes that had been put in place. In May, the working group will assume responsibility for responding to issues and directing feedback as appropriate to partner institutions and working groups. Michigan staff will continue to play an integral role in addressing issues related to content quality and bibliographic metadata.

Projects


IMLS Quality Grant

The grant project team continued to refine definitions for the preliminary set of quality errors they have identified within volumes, and make improvements to the quality review application interface. The team continued to focus on dual review of volumes (two reviewers coding the same set of volumes) to identify problematic error definitions and refine descriptive wording to better illustrate each error type. The team also revised definitions for the scale of severity that is applied to errors, in order to improve inter-coder consistency. A second sample of 10 public domain volumes was reviewed by project staff to provide sufficient data for the project statistician to develop appropriate sampling techniques for Phase Two of the project: production level coding. The University of Minnesota will be joining in data collection efforts and will begin remote reviewing in the next two months after a series of training sessions with members of the project team. Background information on the project can be found on the grant projects page.

Development Updates


Bibliographic Data Management

The HathiTrust Metadata Management System team completed development of the core database system in April, as well as an API to export bibliographic data in XML format. Approximately 200,000 records have been loaded into the system for initial testing. The team is analyzing MARC records from current content-contributing partner institutions, received from the University of Michigan, looking for irregularities and performing a general survey of the record set. CDL staff continue to interview for a Principal Metadata Analyst. Details on the project are available at http://www.hathitrust.org/htmms.

Data API

Staff at Michigan have completed a rough draft of requirements for improved security in the Data API based on symmetric key cryptography. The draft will be made available for comment in the near future.
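As a rough illustration of what request authentication built on symmetric key cryptography can look like, the sketch below signs a request with an HMAC computed over a shared secret. It does not reflect the drafted requirements: the function names, the signed fields (request path and timestamp), and the choice of HMAC-SHA256 are all assumptions made for this example.

```python
import hashlib
import hmac


def sign_request(secret: bytes, path: str, timestamp: int) -> str:
    """Return a hex HMAC-SHA256 signature over path and timestamp.

    The message layout (path, newline, timestamp) is a hypothetical scheme,
    not the Data API's actual design.
    """
    message = f"{path}\n{timestamp}".encode("utf-8")
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def verify_request(secret: bytes, path: str, timestamp: int, signature: str) -> bool:
    """Recompute the signature with the shared secret and compare in constant time."""
    expected = sign_request(secret, path, timestamp)
    return hmac.compare_digest(expected, signature)
```

Because both parties hold the same secret, the server can verify a request without the secret ever being sent over the wire; a real design would also need to reject stale timestamps to limit replay.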

Development Environment

New MySQL servers installed in the development environment by staff at the University of Michigan have boosted performance of print holdings database operations by an order of magnitude. Similarly-configured servers will be installed in the production environment in May.

Full-text Search

Michigan staff began development work on priority features for full-text search as identified in the Full-Text Search Working Group’s report. The implementation team is focusing initially on relevance ranking of search results based on a combination of full-text OCR and bibliographic metadata, and on faceting of results using bibliographic metadata. The goal is to release significant new features that use the bibliographic data to enhance full-text search results by July 1, 2011.

Storage Replacement Cycle

All replacement storage equipment at the Michigan and Indiana storage sites is online and in use. The storage equipment that was replaced is being wiped for security purposes by staff at the University of Michigan and will be traded in for a credit on new storage that will be purchased in June 2011.

Outages

There were no outages in April.


Papers & Presentations

All HathiTrust papers, presentations, and reports are available at http://www.hathitrust.org/papers.

New Growth


Number of volumes added:

Institution April Total
Columbia University 3 58,483
Cornell University 40,729 311,110
Harvard University 52,709 52,709
Indiana University 893 183,881
Library of Congress 0 71,418
New York Public Library 0 258,691
Penn State University 18 39,016
Princeton University 8,810 237,034
University of California 41,512 2,408,727
The University of Chicago 0 5,172
University of Illinois 0 14,501
University of Madrid 15,486 103,797
University of Michigan 19,974 4,338,368
University of Minnesota 1,419 84,985
University of Wisconsin 10,602 454,332
Yale University Library 0 161
Total 192,155 8,662,385

Public Domain (~27%)

Total* 181,909 2,386,430

* This count includes volumes already in the repository to which rights holders have newly opened access

May Forecast


  • Continue work on the Data API security requirements
  • Continue work on full-text search enhancements

Update on March 2011 Activities

April 8, 2011

Top News


TRAC Certification

HathiTrust has been certified by the Center for Research Libraries (CRL) for compliance with the Trustworthy Repository Audit and Certification (TRAC) criteria for digital repositories. This important certification has been a key aim of the partnership since the repository’s founding in 2008, and one we intend to uphold in coming years. The full audit report is posted on the CRL website. HathiTrust posted a news release on the certification and updated documentation on HathiTrust’s compliance with TRAC elements. In conjunction with this announcement, we have included a spotlight on HathiTrust technology below, posted also at http://www.hathitrust.org/technology.

HathiTrust Webinar

Partners from across the country attended the HathiTrust new partners webinar on March 23. A variety of topics were addressed, including HathiTrust’s organizational structure and costs, our collections and services, and future directions. Partners also had an opportunity for Q&A with the presenters. The webinar will be offered on two additional dates: Tuesday April 12, 12:30-2:00pm, and Friday April 15, 12:30-2:00pm (both Eastern Daylight Time). If you would like to attend, please RSVP to Jeremy York as soon as possible before each webinar: jjyork@umich.edu. Please also include any questions or issues you would like the presenters to address.

Open Webinar

Due to a high level of interest expressed by non-HathiTrust partner institutions, an open webinar will be held on May 3 and May 5 from 11am-12pm Eastern time. This webinar will be open to the public. As above, if you would like to attend, please RSVP to Jeremy York as soon as possible before each webinar (jjyork@umich.edu), and include any questions you would like the presenters to address.

IMLS Quality Grant

In 2010, the Institute of Museum and Library Services granted the University of Michigan and Associate Professor Paul Conway funding to research quality in large-scale digital repositories. The grant project is using HathiTrust as a test-bed for review. Work on Phase One of the project commenced in late January 2011 with the creation of a project team to focus on defining error types and levels of severity, statistical analysis processes, a web application for data entry, and project management procedures. By the end of March, the team had identified initial project needs and accomplished the following: identified twelve initial error types including scales of severity, hired and trained two data coders, coded an initial random sample of 15 volumes from the public collection, analyzed variance in coding within the sample, and produced a first draft of procedures for quality evaluation. The team also connected with project members at the University of Minnesota who will be participating in the grant, sharing initial documentation and results. For further information regarding progress and updates, please see the HathiTrust grant projects webpage.

Print Holdings Database

HathiTrust has been working to design and populate a database of information representing the print holdings of all partner institutions. This database will serve a number of important functions:

  • It will support analysis of the overlap of institutions’ print holdings with digital holdings in HathiTrust – this information is required in order to implement the new financial model.
  • It will form a foundation for the expansion of legal uses of materials in HathiTrust (e.g. services to users with print disabilities) by partner institutions.
  • It will facilitate collaborative collection development and management operations.

To date, approximately 119.5 million rows of data have been received from partners, with each row representing one copy of a single volume monographic print item that is (or previously was) held at a partner institution. At this point, we have outgrown the hardware where initial database testing and development took place. When new HathiTrust development environment hardware becomes available in early April, a new version of the database will be created, all of the data we have received will be loaded, and we can begin generating statistics and preliminary cost modeling data. At the same time, we will be working toward a near-term production release of the database to support services to users with print disabilities at partner institutions.

Upcoming development work will focus on improved duplicate detection and clustering mechanisms on two fronts: we are working with OCLC on the development of tools that will provide improved identification of potential duplicate bibliographic records; and we will be ramping up our work on duplicate detection/matching mechanisms for the parts of multi-part works to allow expansion of the print holdings database to include serials and multi-part monographs.

Ingest


Local Digitization Ingest

Staff members at the University of Michigan are currently investigating a sample of rare volumes digitized from Universidad Complutense de Madrid for deposit. Staff are also performing final evaluation of approximately 600 locally-digitized volumes submitted by Northwestern University.

Library of Congress

Ingest of an initial set of more than 70,000 volumes from the Library of Congress, digitized in partnership with the Internet Archive, was completed in March.

Working Groups


Collections

The Collections Committee continued to work on recommendations regarding duplicate volumes in HathiTrust, coordinated print management, and responding to user requests to contribute volumes to the repository.

Communications

HathiTrust figured prominently in the news in March, and the working group was in high gear to disseminate announcements regarding the Google Settlement ruling, HathiTrust’s agreement with Summon, and the positive outcome of the TRAC audit. The group also began setting up a HathiTrust Facebook presence and conducted the first of three new partner webinars.

Development Environment

New, more powerful MySQL servers were installed in the development environment to support the additional performance requirements of the partner holdings database. The new servers are being synchronized in real time with the old in preparation for a cutover planned for early April.

Discovery Interface

The WorldCat Local Prototype usability test reported in last month’s update ran for a few weeks in March. User experience experts from the Discovery Interface Working Group (DIWG) and OCLC are analyzing the data and drafting a report of findings for review. The Full-Text Search Subgroup, charged to “identify and prioritize features and functions anticipated to have immediate high-impact value to users that can be reasonably afforded by the existing technology framework,” presented its analysis and recommendations to the DIWG, where they received full endorsement.

Usability

Usability Working Group members continued their work as liaisons in other HathiTrust committees in March. The group also began to develop a set of personas and use cases to inform development and policy-making surrounding HathiTrust applications and interfaces. The Usability Group is still looking for people to join the new User Experience Special Interest Group (UX-SIG), reported in last month’s update. Please contact Suzanne Chapman (suzchap@umich.edu) if you are interested in joining this group or have any questions about participation.

User Support Working Group

The charge of the User Support Working Group was approved by the Executive Committee and is posted online. The group plans to schedule its first call in April, and will become the primary body responsible for addressing user inquiries submitted through HathiTrust interfaces and the HathiTrust contact address.

Development Updates


Bibliographic Data Management

The Metadata Management System development team at California Digital Library (CDL) continued development of the core database system in March. The team continues to review workflows for receiving bibliographic data from HathiTrust content-contributing partners, and has responded to changes in bibliographic processing at the University of Michigan by adjusting processes in the new system to mirror those changes. Team members continue to benchmark data loading performance and adjust computing resources for optimum results. Interviewing continues for a Principal Metadata Analyst. The position opening is posted on the CDL website. Interested individuals are invited to apply.

Collection Builder

Staff at Michigan completed modifications to Collection Builder, enabling it to support the creation of permanent, full-text-searchable collections of arbitrary size. Details on the modifications were reported in the February update. The first real-world test, the creation of a collection of more than 50,000 volumes, was completed without issue.

Data API

Now that the Collection Builder enhancements are done, Michigan staff will return to design and implementation of Data API security enhancements.

Full-text Search

Staff at the University of Michigan researched and estimated technical feasibility and implementation effort for potential new large scale search features for the Full-Text Search Working Group’s report. The Michigan implementation team began to mock up and prototype likely new features.

OAI

In conjunction with other changes to support Creative Commons licenses in HathiTrust, staff at the University of Michigan modified the Michigan OAI provider to include records for open access and Creative Commons-licensed items. Records for these items are available in the “hathitrust” and “hathitrust:pd” sets. Please see HathiTrust Data Availability and APIs for more information about OAI in HathiTrust.
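For readers unfamiliar with OAI-PMH, the sketch below shows how a harvester might build a ListRecords request restricted to one of the sets named above. The `verb`, `metadataPrefix`, and `set` parameters are standard OAI-PMH; the base URL is a placeholder, since the provider's address is not given in this update, and `list_records_url` is a hypothetical helper.

```python
from urllib.parse import urlencode

# Placeholder endpoint (assumption): substitute the provider's real base URL.
BASE_URL = "https://oai.example.org/OAI"


def list_records_url(set_spec: str, metadata_prefix: str = "oai_dc") -> str:
    """Build an OAI-PMH ListRecords request URL for a single set."""
    query = urlencode(
        {"verb": "ListRecords", "metadataPrefix": metadata_prefix, "set": set_spec}
    )
    return f"{BASE_URL}?{query}"


print(list_records_url("hathitrust:pd"))
```

The colon in the set name is percent-encoded in the query string, which is why `hathitrust:pd` appears as `hathitrust%3Apd` in the resulting URL.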

PageTurner

Michigan staff reconfigured the single volume “Search in this text” mechanism to properly handle German double quotes. They also revised the rights algorithm to allow full book PDF download for Creative Commons-licensed volumes without authentication. For volumes where full-book PDF download is not allowed, an appropriately informative message is now displayed to the user.

More progress was made toward the release of an updated PageTurner with integrated BookReader functionality. Specifically, Apache2 and Plack were installed by staff at Michigan and tested in the Development Environment, and work is underway to do the same in production. Michigan remains on track for full production deployment of PageTurner with BookReader in April.

Storage Replacement Cycle

Michigan staff began the second half of storage replacement work at the Indiana and Michigan sites in March. All new equipment at the Michigan site is online and operational. The trip to complete replacement work at the Indiana site was delayed by several weeks while staff waited for a new database server and two new validation and dataset preparation servers to arrive; work on both projects will be combined and completed in a single trip.

Outages

HathiTrust Collection Builder was unavailable from 5:00pm to 6:20pm EDT on Wednesday, March 16 while the underlying search engine was changed to work against the full-text index.

New Growth


Number of volumes added:

Institution March Total
Columbia University 15 58,480
Cornell University 31,371 280,381
Indiana University 1,093 182,988
Library of Congress 71,418 71,418
New York Public Library 126 258,691
Penn State University 1,824 38,998
Princeton University 8,758 228,224
University of California 62,804 2,367,215
The University of Chicago 1,945 5,172
University of Illinois 73 14,501
University of Madrid 2,774 88,311
University of Michigan 15,038 4,318,394
University of Minnesota 4,068 83,566
University of Wisconsin 12,206 443,370
Yale University Library 21 161
Total 213,534 8,430,230

Public Domain (~26%)

Total 231,171* 2,204,521

* This count includes volumes already in the repository to which rights holders have newly opened access

April Forecast


  • Deploy new version of PageTurner with BookReader
  • Complete storage replacement work
  • Draft a specification for Data API security enhancements

HathiTrust Technology Spotlight


Cory Snavely, Head, Library IT Core Services, University of Michigan Library

HathiTrust is intended to provide persistent and high-availability storage for deposited files. In order to facilitate this, the partnership uses a storage architecture with a rich set of features designed for fault tolerance and long-term data retention.

Central to the storage architecture is the use of two synchronized instances of storage with wide geographic separation (located in Ann Arbor, MI and Indianapolis, IN) and an encrypted tape backup with 6 months of previous-version retention (located in a separate data center several miles from the Ann Arbor storage instance). All storage is physically secure, locked in racks within data centers that are accessible only to authorized IT personnel.

The need for continuous integrity checking is fundamental to HathiTrust’s data management strategy and underlies the choice of online (spinning magnetic disk) media for primary storage. Internally, each storage instance uses N+3 Reed-Solomon parity redundancy, which is analogous to but more fault-tolerant than conventional RAID 5 storage due to the additional parity redundancy. The storage system internally performs in-flight data integrity checks as well as periodic integrity checks of all at-rest data, and makes use of parity redundancy to permanently repair any errors encountered. External to the storage system, HathiTrust also conducts periodic validation of data with stored checksums to ensure that data has been ingested correctly and remains intact.
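The external validation step described above, comparing files against stored checksums, can be sketched as follows. This is a minimal illustration rather than HathiTrust's actual tooling: the function names, the manifest layout (relative path mapped to hex digest), and the use of SHA-256 are assumptions.

```python
import hashlib
from pathlib import Path
from typing import List, Mapping


def file_checksum(path: Path, algorithm: str = "sha256", chunk_size: int = 1 << 20) -> str:
    """Hash a file in fixed-size chunks so large files need not fit in memory."""
    digest = hashlib.new(algorithm)
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_failed_files(stored: Mapping[str, str], root: Path) -> List[str]:
    """Return relative paths that are missing or whose checksum differs from the stored value."""
    failures = []
    for relative_path, expected in sorted(stored.items()):
        target = root / relative_path
        if not target.is_file() or file_checksum(target) != expected:
            failures.append(relative_path)
    return failures
```

Running such a check periodically catches both files damaged after ingest and files that were never written correctly in the first place, independent of the integrity checks the storage system performs internally.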

Storage equipment replacement is an ongoing annual process and assumes that equipment has a useful lifetime of 3-4 years. The storage system is modular and virtualized, with files split into blocks that are distributed across nodes of a cluster and automatically redistributed as needed to balance storage utilization equally. Storage replacement therefore requires no manual movement of data, as this balancing is a normal housekeeping function of the system. Storage nodes that have reached retirement age may be removed from the cluster with an administrative command, and new nodes may be added, with all movement of data managed internally while employing the in-flight integrity checks described earlier. The remove and add processes neither disrupt services nor diminish the N+3 redundancy.

The following links provide more detailed information about our storage, backup, and disaster planning:

Update on February 2011 Activities

March 11, 2011

Top News


HathiTrust Webinar

The HathiTrust Communications Working Group has scheduled a second webinar, following the HathiTrust 101 webinar offered last summer, to review basic elements of the partnership (including the business model, collections and services), discuss current activities and future directions, and answer questions from participants. The webinar is targeted specifically toward new partners, but is open to members of all partner institutions. The same webinar will be held at three different times in order to provide more opportunities for participation: Wednesday March 23, 1:30-3:00pm, Tuesday April 12, 12:30-2:00pm, and Friday April 15, 12:30-2:00pm (all Eastern Daylight Time). If you plan to attend, please RSVP to Jeremy York as soon as possible before each webinar: jjyork@umich.edu. Please also include any questions or issues you would like the presenters to address (a week in advance will give time to prepare, though we are interested in receiving questions and feedback at any time).

Public Domain Distribution

HathiTrust is pleased to announce the availability of public domain texts on a large scale for computational research purposes. Approximately 120,000 texts are freely available; up to 2 million more can be obtained with institutional sponsorship through an agreement with Google. More information, including the Google agreement and directions for obtaining texts, is available at http://www.hathitrust.org/datasets. Unlocking the research potential of the collections assembled in HathiTrust is an ongoing goal of HathiTrust partners, and we are excited to take this step in enabling new forms of discovery and analysis. 

New User Support Working Group

HathiTrust is in the process of defining a new working group to respond to questions and issues received from users on a variety of topics, including searching and accessing content, copyright, quality, access to datasets, and more. A call for participants was sent to HathiTrust partner institutions in February; membership in the group will be finalized and the charge posted in the coming month. 

Minnesota Image Ingest

All of the nearly 60,000 images and associated metadata involved in the prototype project between HathiTrust, the University of Minnesota, the Minnesota Digital Library and the Minnesota Historical Society have been successfully ingested into HathiTrust. Public access to the image content is pending approval of a formal agreement. Project members John Butler, John Weise, and Eric Celeste will give a project briefing at the upcoming CNI Spring 2011 Membership Meeting. More information about the project can be accessed at http://www.hathitrust.org/mdl_images.

Local Digitization Ingest

With the initial policies, specifications, and technical framework in place, HathiTrust is ready to begin to scale ingest of locally-digitized book and journal content from partner institutions. HathiTrust has begun working with institutions of the Committee on Institutional Cooperation (CIC) and will broaden its scope throughout the coming year. Partners with digital book and journal content should review the deposit guidelines and content deposit form available at http://www.hathitrust.org/ingest, to be apprised of ingest requirements and preparations of content that may be needed prior to submission.

Creative Commons Licenses

HathiTrust has enabled support for Creative Commons licenses. The Brooklyn Museum has posted an entry on its blog about the volumes it has opened. If you hold the rights to a volume or volumes preserved in HathiTrust and would like to open access using a Creative Commons license, you can do this by filling out and submitting a permission form.

Working Groups


Collections

The Collections Committee is working on draft recommendations for the treatment of duplicate scans in HathiTrust, which it hopes to have ready for SAB consideration in late March or April. The group has also begun preliminary work on a print management proposal for the Executive Committee in advance of the Constitutional Convention. Another project the Committee will be taking up is a process for responding to requests to add specific content to HathiTrust. There has been one membership change on the Committee: Tom Teper (University of Illinois) has recently stepped in to replace Kim Armstrong (Committee on Institutional Cooperation) and will be serving as a formal liaison to the Executive Committee for the print management work item.

Communications

The Communications Working Group is pleased to welcome 2 new members: Robin Bedenbaugh from Texas A&M University, and Oya Rieger from Cornell University. The departure of one member earlier this year left a vacancy in the group, and because of the expanding work of the group and excellent pool of nominees submitted by partner institutions, the Executive Committee decided to approve two new appointments. We are pleased to welcome Robin and Oya and add their knowledge and expertise to our communication efforts.

A draft of the working group’s Communications and Marketing Plan for 2011 was reviewed by the Strategic Advisory Board and the Executive Committee in February, and the group is now incorporating feedback into a final version. The working group also made progress on the development of a second webinar (see announcement above) and on a handout designed to communicate the basics of HathiTrust to a broad audience.

Discovery Interface

The Discovery Interface Working Group (DIWG) has begun to balance its efforts between advancing the full implementation of the HathiTrust WorldCat Local catalog, and enhancing HathiTrust Full-text Search. The DIWG-OCLC team is currently developing a list of desired enhancements to the functionality and interface for a second version of the HathiTrust WorldCat Local catalog. The HathiTrust Full-Text Working Group has continued to meet weekly, and is finalizing a list of features and functions to be deployed in the initial short-term phase of the Full-text Search enhancements. 

User experience experts from the DIWG and OCLC have finalized a WorldCat Local Prototype usability test, which will run for about 2 weeks during March.

Usability

The Usability Group continues to participate in other committees via liaison roles. Two group members are actively participating in the Full-text Search working group and another continues to be actively involved in the Discovery Interface Working Group.

The Usability Group is establishing a User Experience Special Interest Group (UX-SIG). Our intention is to find people at partner institutions with some experience or interest in user experience topics, including usability & interface design. In addition to being a place for user experience (UX) related discussions, this group will provide a base for the solicitation of volunteers to participate in various short-term activities related to the HathiTrust user interface (e.g., contribute to personae and use cases, provide feedback on proposed site changes, join a task force project). There is no implied commitment in joining the group unless a member chooses to participate in a project. Membership in the UX-SIG will provide an interesting opportunity to connect with your UX colleagues across the HathiTrust partnership! Please contact Suzanne Chapman (suzchap@umich.edu) if you are interested in joining this group.

Development Updates


Bibliographic Data Management

Staff at California Digital Library have completed development of the core file system, the first major component of the new HathiTrust Metadata Management System. The development team is now reviewing existing workflows for receiving bibliographic data from each HathiTrust content-contributing institution. This work includes testing record import and transformation functions and performance. Development of the next major component, the core database for the system, has begun, and CDL continues to interview candidates for a Principal Metadata Analyst position for the project. Ongoing project information is posted at http://www.hathitrust.org/htmms.

Collection Builder

Staff at Michigan have begun modifications to Collection Builder that will allow the creation of permanent, full-text-searchable collections of HathiTrust volumes of arbitrary size. The revised design leverages the Solr index used in Full-text Search instead of relying on a dedicated Collection Builder index. In the new configuration, items added to collections of fewer than 1,000 volumes will be full-text searchable immediately on inclusion. Full-text indexing of collections of more than 1,000 items will be slightly delayed, and will generally be completed within 48 hours. Very large collections of more than 20,000 items will require staff mediation. While 98% of collections contain fewer than 100 items, there has been increasing demand from users for collections with tens and potentially hundreds of thousands of items. The necessary enhancements will be completed in March.
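The size thresholds described above amount to a three-way classification, sketched below. The function and constant names are hypothetical and do not come from the Collection Builder code. The text does not say how a collection of exactly 1,000 items is handled, so treating it as immediately searchable is an assumption.

```python
IMMEDIATE_LIMIT = 1_000        # more than this: full-text indexing is delayed
STAFF_MEDIATED_LIMIT = 20_000  # more than this: staff mediation is required


def indexing_tier(collection_size: int) -> str:
    """Classify a collection by how its full-text indexing would be handled."""
    if collection_size < 0:
        raise ValueError("collection size cannot be negative")
    if collection_size > STAFF_MEDIATED_LIMIT:
        return "staff-mediated"
    if collection_size > IMMEDIATE_LIMIT:
        return "delayed"  # generally completed within 48 hours
    # Assumption: a collection of exactly 1,000 items falls in the immediate tier.
    return "immediate"
```

For example, `indexing_tier(50_000)` returns `"staff-mediated"`, matching the treatment of very large collections described above.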

Data API

Work that was underway at Michigan to design and implement Data API security enhancements is temporarily on hold, with staff focusing on enhancements to Collection Builder. Michigan staff did create a simple API, however, to supply access and use statements to the HathiTrust OAI feed based on a combination of volume rights and source attribute values. This is not formally part of the Data API, and at this point is intended for internal use only.

Full-text Search

Tests confirmed the viability of the plan, described above, to make Collection Builder rely on the full-text search index.

PageTurner

Integration of BookReader into PageTurner was largely completed in February and the code is ready for production deployment. However, initial testing revealed that performance of the new interface could be increased significantly through the installation of the Plack (http://plackperl.org/) Perl module. Plack is now being deployed on HathiTrust web servers and production deployment of PageTurner with BookReader is expected in April.

A bug related to proper ID representation was fixed in PageTurner’s COinS implementation. COinS support was also added to PageTurner search results. COinS is an embeddable format that provides bibliographic metadata to citation tools such as Zotero.
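To show what a COinS element looks like, the sketch below builds one for a book. COinS embeds an OpenURL ContextObject, encoded as key/value pairs, in the `title` attribute of an empty `span` with class `Z3988`. The helper name and the particular fields chosen here are assumptions for illustration, not PageTurner's actual output.

```python
from html import escape
from urllib.parse import urlencode


def coins_span(title: str, author: str, date: str) -> str:
    """Return an empty COinS <span> describing a book (hypothetical helper)."""
    fields = [
        ("ctx_ver", "Z39.88-2004"),
        ("rft_val_fmt", "info:ofi/fmt:kev:mtx:book"),
        ("rft.btitle", title),
        ("rft.au", author),
        ("rft.date", date),
    ]
    # URL-encode the key/value pairs, then HTML-escape the result so the
    # ampersands between pairs are valid inside an attribute value.
    context_object = urlencode(fields)
    return f'<span class="Z3988" title="{escape(context_object, quote=True)}"></span>'


print(coins_span("Moby Dick", "Melville, Herman", "1851"))
```

Citation tools scan the page for `Z3988` spans and decode the `title` attribute to recover the metadata, so the span itself carries no visible content.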

Storage Replacement Cycle Continues

Michigan staff have completed half of the storage replacement work at the Michigan and Indiana storage sites with no service interruptions or other issues, and are continuing replacement work in March, starting in Michigan. The process for securely purging data from retired storage nodes has been finalized and put in place.

Outages

HathiTrust remained available during an extended scheduled outage of the main campus data center at the University of Michigan from approximately 2:00pm EST on Friday, February 18 until approximately 2:00pm EST on Sunday, February 20. There were no issues resulting from the maintenance.

New Growth


Number of volumes added:

Institution February Total
Columbia University 1,051 58,465
Cornell University 23,371 239,010
Indiana University 1,889 181,895
New York Public Library 482 258,565
Penn State University 2,653 37,174
Princeton University 10,910 219,466
University of California 224,373 2,304,411
The University of Chicago 765 3,227
University of Illinois 0 14,428
University of Madrid 6,125 85,537
University of Michigan 26,878 4,303,356
University of Minnesota 2,909 79,498
University of Wisconsin 8,074 431,524
Yale University Library 0 140
Total 309,480 8,216,700

Public Domain (~26%)

Total 125,144 2,098,494

March Forecast


  • Deploy new version of PageTurner with BookReader
  • Complete modifications to Collection Builder
  • Draft a specification for Data API security enhancements

Update on January 2011 Activities

February 11, 2011

Top News


WorldCat Local Prototype

The HathiTrust Discovery Interface Working Group is pleased to report the availability of a prototype HathiTrust catalog. This new interface is the result of a partnership between OCLC and HathiTrust, leveraging our collective expertise to facilitate discovery of the materials held in the HathiTrust Digital Library. One of the project’s main goals is to situate HathiTrust’s multi-institutional holdings within the larger world of library holdings represented in WorldCat. The new prototype catalog, accessible at http://hathitrust.worldcat.org, is built on OCLC’s WorldCat Local platform. HathiTrust and OCLC are eager to receive user feedback to inform the design of a next version of this catalog. Feedback can be submitted to HathiTrust via http://www.hathitrust.org/feedback. For more details about this project, see OCLC’s press release at http://www.oclc.org/news/releases/2011/20114.htm.

Minnesota Image Ingest

From September through December 2010, HathiTrust worked with the University of Minnesota (UMN) and its partner, the Minnesota Historical Society (MHS), to add digital images from the state-wide Minnesota Digital Library and MHS collections to HathiTrust as a preservation archive. This prototype project was intended to begin addressing HathiTrust’s long-term functional objective to “support formats beyond books and journals.” Nearly 60,000 images and associated metadata were involved in this ingest project, providing a testbed for the evaluation, now underway, of numerous technical, economic, and policy-related considerations. Conclusions have yet to be drawn, but the report of one of the independent consultants for the prototype ingest effort is available at http://eric.clst.org/wupl/MDL/MDL-HT-report-110126.pdf. For additional information, please contact John Butler (j-butl@umn.edu).

Mobile Development

The University of Michigan Library’s User Experience (UX) Department will begin work in February on the development of mobile interfaces for HathiTrust, focusing primarily on interfaces for reading volumes and bibliographic searching. The Department will contribute the time of a mobile developer and two User Experience Specialists for the next 7 months to conduct research and design and develop the interfaces. The UX Department staff will be consulting both the Discovery Interface and the Usability Working Groups throughout the development process. Anyone interested in contributing to this project should contact Suzanne Chapman (suzchap@umich.edu).

CC licenses

HathiTrust now offers rightsholders the ability to open access to their works under Creative Commons (CC) licenses. The first CC licenses will go live in HathiTrust on March 1, at which time the license designations will also begin appearing in HathiTrust’s tab-delimited metadata files and OAI feed (information at http://www.hathitrust.org/data). The metadata files contain bibliographic and identifier information for every volume in HathiTrust.

Shibboleth

As of the end of January, users at three new partner institutions are able to log in to HathiTrust to take advantage of additional services: the University of California-Los Angeles, the University of Utah, and the University of Washington. Current services include full-PDF download of all public domain materials and the ability to create permanent collections in HathiTrust’s Collection Builder using a local sign-on. HathiTrust uses Shibboleth to enable partner authentication. In order to be configured for Shibboleth, institutions must release required attributes to the HathiTrust Shibboleth Service Provider (see http://www.hathitrust.org/shibboleth).

We continue to urge partners to configure Shibboleth to work with HathiTrust so that the full (and growing) array of services can be delivered to every partner institution. The institutions listed below are configured, and we are in the process of working with three other institutions (Utah State University, the University of California-Berkeley, and the University of Madrid) to enable access. If your institution is not on this list, we would appreciate your help in making the appropriate connections to enable login via Shibboleth for your institution.  

  • Baylor University
  • Columbia University
  • Cornell University
  • Dartmouth College
  • Indiana University
  • Johns Hopkins University
  • Michigan State University
  • Northwestern University
  • Pennsylvania State University
  • Princeton University
  • Purdue University
  • Stanford University
  • Texas A&M University
  • University of California-Los Angeles
  • University of California-San Diego
  • University of Chicago
  • University of Illinois at Urbana-Champaign
  • University of Iowa
  • University of Michigan
  • University of Minnesota
  • University of Utah
  • University of Washington
  • University of Wisconsin-Madison

New Partner Webinars

HathiTrust will be holding informational webinars in the second half of March, geared specifically toward new partner institutions. Additional details will be disseminated soon. Please contact Heather Christenson (heather.chistenson@ucop.edu) or Julie Bobay (bobay@indiana.edu) for more information.

Print Holdings Data

As noted in the Update on December Activities, partners are requested to provide information about their print holdings by the end of this month. Please contact Julia Lovett (jalovett@umich.edu) with any questions.

Working Groups


Collections

Members of the Collections Committee met with representatives from DLF, OCLC and others at ALA Midwinter to discuss the DLF/OCLC Registry of Digital Masters. The Committee has agreed to provide use cases and additional input for an assessment project that DLF is planning to mount to chart the future of the Registry. Discussions continue on several key work items, including the role of duplicates in HathiTrust and opportunities for shared print collection management.

Communications

The announcement of a number of new developments occupied the Communications working group in January; in particular, the rollout of the prototype OCLC WorldCat Local interface. The group also drafted a prioritized communications and marketing plan for 2011. Among the high priorities in the plan are repurposable materials for librarians to use in explaining HathiTrust to their constituencies, internal communications mechanisms for use among HathiTrust partners, and an introductory webinar for new partner institutions (look for an announcement soon).

Discovery Interface

In January, the Discovery Interface Working Group (DIWG) reached an important milestone in the release of the HathiTrust WorldCat Local prototype catalog. Now that the prototype has been released, the DIWG’s work will focus on gathering user feedback on the catalog and conducting formal usability testing.  

The Strategic Advisory Board would like to take this opportunity to thank everyone in the working group for their dedication to the catalog project: John Butler, co-chair (University of Minnesota), Lee Konrad, co-chair (University of Wisconsin), Julia Lovett, project manager (University of Michigan), Suzanne Chapman (University of Michigan), Kevin Clair (Pennsylvania State University), Lisa German (Pennsylvania State University), Patti Martin (California Digital Library), Jon Rothman (University of Michigan), Christopher Walker (Pennsylvania State University). Adam Brin (California Digital Library) is no longer with the group but his contributions during the requirements phase were vital to the group’s success. 

The Strategic Advisory Board and DIWG would also like to thank OCLC’s team for their very hard work, particularly Bill Carney, who served as OCLC’s project manager. In addition to the creation of the prototype interface, the collaborative process itself proved to be important in helping both organizations understand the inherent benefits and challenges to working on large-scale projects across disparate types of institutions. The processes that were developed for the coordination of communication, project management, design, user testing, metadata, and systems work will serve the DIWG and HathiTrust well in future projects and partnerships.

Full-text Search

January was an important month for the newly formed Full-Text Search working group, a subgroup reporting to the DIWG. The group held its first two meetings, and will continue to meet on a weekly basis. The group is currently developing a list of features and functions that will have a high impact value for users, and can be supported in the existing technology framework.

Usability

The Usability group continues to participate in other committees via liaison roles. Two group members recently joined the Full-Text Search subgroup to discuss the future of full-text search. The group also provided feedback on proposed designs for the improvements to PageTurner. The Usability group has begun to identify areas across HathiTrust that are in need of further development, usability research, or new design solutions.

Development Updates 


Bibliographic Data Management

Development at California Digital Library (CDL) on the core system for the new HathiTrust Metadata Management System progressed in January. CDL staff also consulted with staff at Michigan on documentation for the transformations involved in ingesting bibliographic records from partner institutions. CDL is in the process of hiring a Principal Metadata Analyst for the project. Ongoing project information is posted at http://www.hathitrust.org/htmms.

Data API

Developers at the University of Michigan updated the Data API in January to support Creative Commons licenses, return access and use statements for retrieved volumes, and provide access to coordinate OCR contained within volume packages.

Development Environment

Michigan staff made improvements to the development environment to facilitate testing of new code prior to release.

Full-text Search

Over the last 2 months, staff at Michigan worked to rebuild the entire full-text index of HathiTrust materials, composed currently of more than 8 million volumes. The new index is in production and will be updated as new volumes are ingested. The rebuilding process included an upgrade of the Solr search engine. This upgrade, coupled with a number of strategic modifications to the way the index is constructed, has resulted in faster indexing time (staff originally estimated re-indexing would take up to 40 days, but it was completed in 10), smaller index size, improved handling of non-Latin scripts (e.g., CJK, Thai, Devanagari), and the inclusion of additional catalog metadata.

PageTurner

Michigan developers made considerable progress on integrating BookReader into HathiTrust’s PageTurner application. Page layout modifications specified in December were implemented, leaving performance testing as the final area of work. Performance testing will be conducted in February and the enhanced PageTurner is planned for release in early March. The current interface to PageTurner will remain the default for the initial release, with BookReader functionality introduced as a “New” feature for users to try. 

Staff at Michigan also began work to include Creative Commons licensing information as RDFa in PageTurner application output. Coding will be completed in February. CC licensing information will appear in the PageTurner bibliographic metadata display. 

Storage Replacement Cycle Continues

Michigan staff completed [correction] half of the storage replacement work described in last month’s update at the Michigan site, and are beginning the replacement process at the site in Indiana. Staff expect all storage replacement to be completed by the end of March. While the process is non-disruptive and both sites remain in live service during the replacement process, staff have paused ingest and full-text indexing work at crucial moments to be prepared to respond to unexpected problems. In conjunction with this work, staff are testing a process for purging data from retired storage nodes for security purposes before those nodes are decommissioned.

Outages 

There were no outages in January.

New Growth


Number of volumes added:

Institution                   January      Total
Columbia University                98     57,414
Cornell University                 29    215,639
Indiana University                655    180,006
New York Public Library            64    258,083
Penn State University             121     34,521
Princeton University               50    208,566
University of California       31,792  2,080,038
The University of Chicago          18      2,462
University of Illinois              0     14,428
University of Madrid            1,156     79,412
University of Michigan         26,858  4,276,478
University of Minnesota           218     76,589
University of Wisconsin           218    423,450
Yale University Library             0        144
Total                          70,666  7,907,220

Public Domain (~25%)

Total                          14,127  1,973,350

February Forecast


  • Test and possibly deploy the new version of PageTurner with BookReader
  • Draft a specification for Data API security enhancements
  • Finalize preparations to support CC licenses

Report on 2011 HathiTrust Constitutional Convention


Ed Van Gemert, for the Strategic Advisory Board

Over the past three years, HathiTrust has assisted research libraries in moving more than 8 million scanned volumes online. From an initial group of CIC libraries and the University of California System, HathiTrust has grown to include more than 50 partner libraries, including a small but growing number of international participants. Together, these contributions to HathiTrust represent a significant slice of the world’s research holdings. As HathiTrust’s library network and content base grow, the partnership will likely have new and different needs for governance, sustainability, and technology.

In order to address these needs, the SAB is finalizing an agreement with a consultant to provide the membership with an independent, thorough review prior to the October 2011 Constitutional Convention.

The consultant’s review will evaluate HathiTrust’s progress to date, using the functional objectives as guideposts. The SAB is also requesting a forward-looking view of the next steps that will be needed to sustain and grow the digital library. The SAB identified these questions as the most important to address in the review:

  • What do participating libraries value from HathiTrust, and what unmet needs do they have? 
  • What new services will draw non-participating libraries, including those that have little or no digitized content to contribute, into the HathiTrust collaboration?
  • Is the digital library appropriately designed to meet the needs of end-users, including academic researchers?  
  • In what ways can HathiTrust differentiate itself from other digital libraries and content hosting solutions, and how should it plan its future investments accordingly?
  • Does HathiTrust’s governance structure give partner libraries a great enough voice in the strategic direction of the digital library? What is the optimal balance between governance by a consortium of libraries and independent decision-making by HathiTrust’s project team?
  • Will the existing staffing and the nascent HathiTrust cost model position the initiative for growth?

The review will be completed in time to allow discussion and comment from the membership. It is anticipated that the review document will play a crucial role at the HathiTrust Constitutional Convention in October 2011. 

Please direct questions or comments to any SAB member: John Butler, University of Minnesota; Trisha Cruse, California Digital Library; Bernie Hurley, University of California-Berkeley; Bruce Miller, University of California-Merced; Sarah Pritchard, Northwestern University; Paul Soderdahl, University of Iowa; Ed Van Gemert, University of Wisconsin-Madison (chair); John Wilkin, University of Michigan (ex-officio); and Bob Wolven, Columbia University.

 

Update on December 2010 Activities

January 14, 2011

Top News


Minnesota Image Ingest

From September through December 2010, the University of Minnesota worked with HathiTrust on a prototype project to add 50,000 image objects and associated metadata from the collections of the Minnesota Digital Library, and another 8,000 from the Minnesota Historical Society. To date, numerous lessons have been learned regarding format standards, identifiers, and rights issues related to image data sourced from different institutions. The project is also expected to shed some light on the costs of archiving image data in HathiTrust relative to that for published books and journals.  Completion of the project and release of the final report are expected in the next month.  For more information, please contact John Butler (j-butl@umn.edu).

Local Digitization Ingest

Staff at the University of Michigan incorporated feedback received from a variety of sources in October and November into the policy and specifications framework for scaling ingest of locally-digitized partner materials. The framework was finalized and approved, and is available at http://www.hathitrust.org/ingest. The bulk of the enhancements to ingest systems to support this work were completed as well, and Minnesota images and a sample of Yale content have been ingested in the new ingest environment. The new environment will eventually be used for ingest of all materials, including those downloaded from Google and the Internet Archive.

CC licenses

Developers at Michigan began implementing changes to support Creative Commons licenses in the repository’s rights management scheme. Development is expected to be completed in February. Beginning March 1, CC licenses will be included in the “Rights” and “Rights determination reason code” fields of tab-delimited files HathiTrust makes available for download. These files contain copyright, identifier, and limited bibliographic information for all volumes in the repository.
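As a rough illustration of how a consumer of the tab-delimited files might pick out CC-licensed volumes, the following Python sketch filters rows on a rights field. The column names (`volume_id`, `rights`, `reason_code`, `title`), the sample identifiers, and the code values (`pd`, `ic`, `cc-by`) are hypothetical placeholders chosen for the example; the actual file layout and codes should be taken from the documentation at http://www.hathitrust.org/data.

```python
import csv
import io

# Hypothetical sample data: column names and rights codes are assumptions
# made for this sketch, not the published file specification.
SAMPLE = (
    "volume_id\trights\treason_code\ttitle\n"
    "vol.0001\tpd\tbib\tAn Example Public Domain Title\n"
    "vol.0002\tcc-by\tcon\tAn Example Openly Licensed Title\n"
    "vol.0003\tic\tbib\tAn Example In-Copyright Title\n"
)


def cc_licensed_rows(tsv_text, rights_column="rights", prefix="cc"):
    """Return the rows whose rights value starts with the given prefix."""
    reader = csv.DictReader(
        io.StringIO(tsv_text),
        delimiter="\t",
        quoting=csv.QUOTE_NONE,  # treat quote characters in titles as plain text
    )
    return [row for row in reader if row[rights_column].startswith(prefix)]


if __name__ == "__main__":
    for row in cc_licensed_rows(SAMPLE):
        print(row["volume_id"], row["rights"], row["title"])
```

Reading the file with a header-aware reader keeps the filter independent of column order, which may matter if fields are added to the files over time.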

Print Holdings Information

At the beginning of December, HathiTrust requested information from partners about the print holdings of their respective libraries. The information is being used to assemble a database that will support the new cost model all partners will participate under in 2013, facilitate lawful access to and uses of materials in HathiTrust (e.g., section 108 uses and access for users with print disabilities), and form a base for collaborative collection management and collection development activities among the partnership. Partners are requested to provide this information by the end of February.

Working Groups


Collections

The recently-formed HathiTrust Collections Committee is a new standing committee reporting to the Strategic Advisory Board charged with establishing strategic directions related to the collection, including collection building and management (see charge and membership). The Committee held its first meeting in October 2010. Examples of issues currently under consideration include the role of duplicates in HathiTrust, models for shared management of print collections, and a variety of rights-related concerns. A more general area of investigation will be an exploration of specific collection development opportunities that the partnership might pursue and recommendations for how such activities should be prioritized and carried out, including considerations relating to non-book formats and collaboration with other initiatives. The Committee is considering a survey of the membership in order to assemble a better picture of partner expectations and aspirations. Input from other HathiTrust partners is welcomed; feel free to contact Ivy Anderson, chair (ivy.anderson@ucop.edu) or another member of the Committee with comments and questions.

Communications

The Communications Working Group continued to craft a marketing and communication plan for 2011, and expects to send a draft to the Executive Committee and Strategic Advisory Board by the end of January.

Discovery Interface

Despite holiday vacations, December was a busy month as the Discovery Interface Working Group (DIWG) worked with OCLC to take the final steps towards releasing the version 1 prototype catalog. With endorsement by the Strategic Advisory Board, the DIWG is now pleased to announce that the public release will go forward as planned in mid-January. Keep an eye out for the official announcement from OCLC. In addition to planning for the scheduled release, the DIWG is also developing post-release processes for managing user feedback and monitoring the system, an operational responsibility that will be supported by the California Digital Library. A three-month period of user testing will take place post-release, which will provide valuable input and help shape version 2 of this important effort.  

Usability

The group reviewed a plan for a second round of usability testing for the HathiTrust-OCLC prototype catalog to be conducted in conjunction with OCLC and the Discovery Interface Group. The group also provided feedback on some proposed designs for a new PageTurner and revised home page.

Ingest


Content From Yale

A sample of digitized content from Yale University Library was ingested in December. The content is being reviewed by staff at Yale ahead of full ingest, which is expected to begin in January.

Development Updates


Bibliographic Metadata Management

In December 2010, the California Digital Library (University of California) and HathiTrust solidified business arrangements and posted a Principal Metadata Analyst position to support development of the HathiTrust Metadata Management System. The University of Michigan transferred input files and scripts describing current bibliographic metadata transformation practices so work can begin at CDL on developing routines for metadata ingest. Progress is also being made on the development of the core metadata storage system. Project information, including overview, milestones, and timeline, is available at http://www.hathitrust.org/htmms.

First Storage Replacement Cycle Begins

The original storage equipment purchased in late 2007 for HathiTrust has reached its retirement age of approximately 3 years. HathiTrust uses modular storage, and modules may be removed and replaced without disrupting service. Data migration is handled in the background and is fully automatic, though the process does take time to complete. Michigan staff have developed a plan for the upgrade at the Michigan site and, once that work is in progress, will start a similar upgrade process at the Indiana site. The replacement of storage hardware will now be an annual or semi-annual process, shadowing historical patterns of growth and storage purchases.

Repository Auditing

Michigan staff have begun developing audit mechanisms to verify the integrity of content stored in the repository. These processes will augment existing features of the storage system that routinely scan, detect, and repair hardware-level data storage errors (commonly referred to as “bit rot”). As part of this initiative, a preliminary integrity check of all repository Zip archives--which are used as containers for image, text, and metadata files--was run. The check revealed an error in one page of one volume resulting from a problem with data synchronization from Michigan to Indiana; this was easily corrected. Developers are now coding and testing a comprehensive set of audit routines to ensure that all items recorded as being present in the repository are stored properly and are fully intact, including checksum validation.
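The kind of check described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the audit code used in the repository: the function names, the in-memory sample archive, the member file names, and the choice of MD5 as the checksum are all hypothetical. The sketch first verifies the CRC-32 values stored in the Zip archive, then compares each member against a manifest of expected checksums.

```python
import hashlib
import io
import zipfile


def audit_zip(data: bytes, expected_md5: dict) -> list:
    """Return a list of problems found in one Zip archive (empty if intact)."""
    problems = []
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        # testzip() re-reads every member and checks its stored CRC-32;
        # it returns the name of the first bad member, or None.
        bad_member = archive.testzip()
        if bad_member is not None:
            problems.append(f"CRC mismatch: {bad_member}")
            return problems
        names = set(archive.namelist())
        # Compare each expected member against the (assumed) MD5 manifest.
        for name, digest in expected_md5.items():
            if name not in names:
                problems.append(f"missing: {name}")
            elif hashlib.md5(archive.read(name)).hexdigest() != digest:
                problems.append(f"checksum mismatch: {name}")
    return problems


def build_sample_zip(files: dict) -> bytes:
    """Build a small in-memory Zip archive for demonstration purposes."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for name, content in files.items():
            archive.writestr(name, content)
    return buffer.getvalue()


if __name__ == "__main__":
    files = {"00000001.txt": b"page one", "00000002.txt": b"page two"}
    manifest = {n: hashlib.md5(c).hexdigest() for n, c in files.items()}
    print(audit_zip(build_sample_zip(files), manifest))
```

Running the same routine against the copy of an archive held at each storage site would be one way to surface a synchronization problem like the one mentioned above, since a page that differs between sites would fail the manifest comparison at one of them.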

Full-text Search

Work to re-index the full text of all volumes in the repository continues. After staff encountered some out-of-memory problems, they tuned and upgraded the Solr servers. Performance is almost an order of magnitude better than expected, owing to new optimizations that are being tested for the first time. This effort is on schedule for completion by the end of January.

PageTurner

Staff at Michigan outlined final steps for integrating BookReader with PageTurner. Changes to the user interface layout and performance testing are the main areas of remaining work. The layout design was completed in December, and will be coded in January. 

Data API

In November, developers at Michigan made a change to the structure of the URL used to retrieve content and metadata from the repository through the Data API. The old structure will no longer be supported as of March 1. Users of the Data API should consult the URL structure specified in the current Data API documentation.

Outages

There were no outages in December.

New Growth


Number of volumes added:

Institution                  December      Total
Columbia University                34     57,316
Cornell University            161,803    215,610
Indiana University                322    179,351
New York Public Library           191    258,019
Penn State University           1,035     34,400
Princeton University              364    208,506
University of California      139,891  2,048,246
The University of Chicago           4      2,444
University of Illinois              0     14,428
University of Madrid           78,256     78,256
University of Michigan         14,956  4,249,620
University of Minnesota         2,507     76,371
University of Wisconsin         6,922    413,987
Yale University Library           144        144
Total                         397,429  7,836,698

Public Domain (~25%)

Total                         138,527  1,959,223

January Forecast

  • Complete Minnesota image ingest pilot
  • Complete draft of marketing and communications plan
  • Complete full-text re-indexing
  • Continue work on BookReader integration

Letter from the Executive Director


HathiTrust and Growth in 2011

HathiTrust grew rapidly in 2010, increasing the relevance of the HathiTrust collection to the management of our print collections. The HathiTrust collection will reach two milestones in early 2011: in January, HathiTrust will reach 2 million public domain volumes and, soon afterwards, the collection as a whole will pass 8 million volumes. Thanks to ongoing collection analysis work by OCLC Research, we know that North American research libraries overlapped with the HathiTrust collection at a median rate of just over 33% and that the median rate of overlap with the Oberlin Group of Libraries was closer to 40%. In each of the last two years, the repository grew by nearly 3 million volumes, and the rate of overlap between ARL libraries and HathiTrust grew at about 1% per 240,000 volumes of HathiTrust growth.  

If you manage a rich library collection, you will find a significant percentage of your holdings online in HathiTrust; moreover, because of the size and diversity of the HathiTrust collection, you can add over one million new public domain volumes to your collection through the addition of HathiTrust links to your catalog. It is certainly true that the obstacles to using the in-copyright volumes for the delivery of mainstream library services are immense, but just as certainly this phenomenal collection can help us change the way that we administer the storage of little-used print collections. We can confidently say that in 2010 we made progress on HathiTrust’s mission-related goal “[t]o stimulate redoubled efforts to coordinate shared storage strategies among libraries, thus reducing long-term capital and operating costs of libraries associated with the storage and care of print collections.” With 2011 also comes a change in HathiTrust’s growth trajectory and the need for a better understanding of the challenges and opportunities for future growth as a tool in shaping our collection management. To date, the two largest depositors in HathiTrust have been California and Michigan, representing approximately 26% and 57% of HathiTrust’s total deposits, respectively. The growth of new content from these institutions will slow in 2011 and, barring significant changes, HathiTrust’s collection will grow by fewer than 2 million volumes in 2011. Even this more modest growth is good news: it may lead to a 40% overlap between HathiTrust and ARL libraries.
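The overlap figures above can be tied together with a little arithmetic. The Python sketch below is a back-of-the-envelope projection only: it assumes that the observed rate of about 1% of overlap per 240,000 volumes continues to hold linearly, and it uses the 33% median as a starting point. The function names are invented for this illustration.

```python
def projected_overlap(current_pct, added_volumes, volumes_per_point=240_000):
    """Linear projection: each `volumes_per_point` volumes adds one point."""
    return current_pct + added_volumes / volumes_per_point


def volumes_needed(current_pct, target_pct, volumes_per_point=240_000):
    """Volumes of growth required to move overlap from current to target."""
    return (target_pct - current_pct) * volumes_per_point


if __name__ == "__main__":
    # Growth of just under 2 million volumes adds roughly 8 points to 33%.
    print(round(projected_overlap(33.0, 2_000_000), 1))
    # Volumes needed to reach 60% overlap under the same linear assumption.
    print(int(volumes_needed(33.0, 60.0)))
```

Under these assumptions, growth of somewhat less than 2 million volumes lands at roughly 40% overlap, while reaching 60% would take on the order of 6.5 million additional volumes.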

We know from OCLC’s analysis (“Cloud-sourcing Research Collections: Managing Print in the Mass-digitized Library Environment,” by Constance Malpas) that even 33% overlap is of significant value to many of our libraries. Still, HathiTrust needs growth. HathiTrust’s value as a pivotal resource in viewing the aggregation of our collections benefits from growth. Building comprehensive and accessible online collections is a necessary part of our strategy for designing effective print storage and access strategies. This is true, for example, for US federal government publications, and is just as true for the large volume of mid-20th century publishing, much of which languishes in suboptimal off-site storage facilities in our libraries. While a 33% overlap between the HathiTrust collection and the collections of ARL libraries is valuable, 50% and 60% overlap can be a powerful catalyst to major changes in print storage.

Our growth is key for a broad array of library access and management opportunities. The case for HathiTrust as a catalyst for changed print management has become clear to our partners. There are other important reasons as well:

Large numbers of titles appear to be protected by copyright but are in fact in the public domain. Digital availability has been a necessary piece of the strategy that has helped HathiTrust partners open access to 55% of the books published in the US between 1923 and 1963. A new effort will also open access to large numbers of non-US works. Because of our investments to date, adding to the US 1923-1963 collection will also increase what we know to be in the public domain.

Partners are now working to assign resources to securing permissions for use of books and journals now online. Preliminary efforts have opened access to thousands of volumes. Online availability ensures that opening access is merely a matter of flipping a switch once permission is secured, increasing our incentive to work on the problem.

The richness of the collection makes possible important lawful uses of in-copyright materials. Many library volumes are eligible for uses under Section 108 provisions in US copyright law, and the online availability of a volume can help a library provide lawful access to an out-of-print work that is damaged, deteriorated, lost or stolen. HathiTrust partners are poised to follow Michigan’s lead and use the online volumes to provide services for their users with print disabilities. Again, this is an activity that can only happen when the volume in question is online.

Despite the likelihood of lower growth for 2011, the possibility of future HathiTrust growth remains great. Overlap between HathiTrust and ARL libraries will probably grow to “only” 40% in 2011, but based on current prospects, that overlap could grow to 60% in the coming year. The impediments to that growth are significant but tractable. For example, several newer HathiTrust partners who have also invested in local repository infrastructure have millions of volumes of digital content that would enrich the collection. Needless to say, the prior investments by these institutions make the additional cost of deposit in HathiTrust expensive, but because of a predominance of pre-1978 materials, this content is a rich resource for copyright determination work and would significantly increase overlap.  Some partners also face legal and contractual obstacles. The majority of volumes digitized from CIC are embargoed under the presumption that they are in copyright.  As we have learned through our copyright determination work, significant percentages of this content are actually in the public domain, and again these volumes would also increase overlap.

The growth of HathiTrust and the nature of the collection have created critical opportunities, but we must continue to push toward the goal of a nearly comprehensive digital collection in order to benefit fully from what that collection can offer. Copyright determination work, securing rights, and especially print storage management will all be furthered by growth. We will continue to address existing impediments and urge our partners to help round out HathiTrust’s large and increasingly comprehensive collection. 

John Wilkin

Executive Director, HathiTrust

2010 Year In Review

January 7, 2011

HathiTrust is an international partnership of academic and research institutions dedicated to ensuring the preservation and accessibility of the vast record of human knowledge. The partnership owns and operates a digital repository containing millions of public domain and in-copyright volumes digitized from partnering institution libraries. The preserved volumes are made available in accordance with copyright law as a shared scholarly resource for students, faculty, and researchers at the partnering institutions, and as a public good to the world community. For more information, visit HathiTrust.org.

Highlighted Achievements and Activities


New partners and finalization of membership for 2011 constitutional convention

Twenty-six institutions joined HathiTrust in 2010, doubling the size of the partnership and making a total of 52 institutions that will participate in a constitutional convention next year. In this convention, partners will review repository governance and sustainability and determine directions for the next phase of HathiTrust. View the press release.

New content from partners

HathiTrust partners contributed 2.6 million volumes to the repository in 2010, raising the total number of volumes to more than 7.8 million. Nearly 2 million volumes are in the public domain. New institutions to contribute content in 2010 included:

  • Columbia University
  • Cornell University
  • New York Public Library
  • Princeton University
  • The University of Chicago
  • University of Illinois
  • University of Madrid
  • Yale University

Approval of new cost model

The Executive Committee approved a new cost model for HathiTrust in February 2010, which will be the basis of costs for all partners beginning in 2013. The new model is based on the overlap of partner institutions’ print collections with the digital volumes in HathiTrust. Institutions that do not have large amounts of content to deposit are able to join under the new model before 2013, and more than a dozen have already done so (view the full list of partnering institutions). A FAQ for the new model is available on the HathiTrust website.

Ingest of content from Internet Archive

Staff members at the University of California and University of Michigan worked together over a period of months to develop specifications and routines to ingest partner materials from the Internet Archive at scale. Well over 100,000 volumes from the Internet Archive have been deposited in HathiTrust by three institutions to date, and more are on the way. This was a major step in the expansion of HathiTrust’s ability to accommodate content from a variety of digitization sources.

Formation of multi-institutional groups to address key operational and strategic activities

4 new groups were formed in 2010, reflecting both the growing number of partner institutions and the expanding work of the partnership:

  • Communications working group - operational, reports to the Executive Director
  • Usability working group - operational, reports to the Executive Director
  • Full-text search working group - strategic, reports to the Strategic Advisory Board
  • Collections Committee - strategic, reports to the Strategic Advisory Board

Implementation of inter-institutional authentication via Shibboleth

Authenticated users from partner institutions are able to access full PDFs of all public domain volumes in the repository, and use a local sign-on to build permanent public or private collections of volumes. More information about Shibboleth can be found on the HathiTrust website.

Expansion of copyright review work to new institutions

Over the summer, staff at Indiana University, the University of Wisconsin, and the University of Minnesota joined in work begun at the University of Michigan to review the copyright status of works in HathiTrust published from 1923 to 1963. More than 90,000 volumes have been reviewed since the project began two years ago and approximately 55% of those reviewed have been determined to be in the public domain.

Collection Builder improvements

University of Michigan staff added functionality to the Collection Builder application to enable users to add multiple items from full-text search results to public or private collections.

Full PDF download

Staff at the University of Michigan developed the capability to deliver full PDFs of all public domain materials through the HathiTrust PageTurner.

Redundancy of large-scale search

Mechanisms and servers were put in place to achieve full redundancy of the large-scale search index, with copies of the index at both the Michigan and Indiana storage sites.

Single web access portal

The Communications working group, in conjunction with the Usability working group and developers at the University of Michigan, combined existing interfaces to create a single portal at HathiTrust.org for accessing repository services and finding information about the HathiTrust partnership, infrastructure, and activities.

Collaborative Development Environment

Members of a multi-institutional working group completed the work of specifying requirements for, and developing, a collaborative environment for the development and enhancement of HathiTrust applications. Documentation of the new environment will be forthcoming in 2011.

Final report of working group on HathiTrust Storage

A multi-institutional working group was charged with exploring the value of adding a third instance of storage to HathiTrust’s infrastructure. The working group’s report is available on the HathiTrust website.

Other Activities


Improvements to ingest

Staff at the University of Michigan made enhancements to ingest capabilities, including a general increase in processing throughput, improvements in barcode validation, preparation for PREMIS 2.0 support, cleaner integration with pre-ingest transformation processes (for non-Google-scanned materials), and new controls to automatically manage priority levels for content ingested from multiple sources.

New Bibliographic Metadata Management System

The University of California began development of a new bibliographic metadata management system for HathiTrust in November 2010. The system is projected to be operational by the first quarter of 2012.

Discussions with the partnership

HathiTrust hosted several “HathiTrust 101” web- and phone-based discussions for new and existing partners in the summer and fall. More of these discussions and informational sessions are planned in 2011.

Demonstration application for the HathiTrust Data API

Staff at the University of Michigan created an application using only publicly available APIs to demonstrate how the Data API could be used to locate and download complete book packages for public domain volumes not digitized by Google (Google-digitized volumes can be accessed through the Data API one page at a time).

Participation in IMLS grant to Validate Quality

HathiTrust will serve as a testbed for research led by Paul Conway, Associate Professor at the University of Michigan’s School of Information, to develop a framework and methodology for validating the quality of content in large-scale digital repositories. Details can be found in the School of Information news release.

Framework for scalable ingest of locally-digitized materials

Significant progress was made on developing policies, specifications, and technological infrastructure to facilitate the ingest of locally scanned materials from partner institutions at scale.

Search widgets

Staff at the University of California developed search widgets for HathiTrust that can be embedded in local websites to execute catalog and full-text searches. The widgets are available at http://www.hathitrust.org/widgets.

Partner initiatives

  • Object Validation Tool - Staff at the University of California completed development of a tool to validate the completeness and correctness of volumes ingested into HathiTrust and retrieved through the Data API.
  • SFX Target for HathiTrust - UC Staff developed an SFX target for HathiTrust monographs. The target is available to partner institutions who also license the Ex Libris SFX software. A copy of the code can be obtained from the California Digital Library: email CDL-SFX-Tech-l@ucop.edu.

Upcoming Highlights


TRAC certification report

The Center for Research Libraries’ report on HathiTrust compliance with the Trustworthy Repository Audit and Certification criteria (TRAC) is expected in early 2011.

Minnesota image ingest

The University of Minnesota in conjunction with the Minnesota Digital Library (MDL) and the Minnesota Historical Society (MHS) have been working with staff at the University of Michigan to develop a prototype workflow for depositing images and associated metadata into HathiTrust for access, storage, and preservation. The prototype project, which includes tens of thousands of digital images from MDL and MHS, is nearing completion. Further details are available in the HathiTrust Update on October Activities.

OCLC catalog

A prototype of the HathiTrust-OCLC catalog will be released in beta in January.

Creative Commons licenses

HathiTrust will soon offer rights holders the option to attach Creative Commons licenses to works they wish to open access to in HathiTrust.

Fulfillment of functional objectives

With the ingest of image content from Minnesota, the establishment of a HathiTrust Research Center, progress to enable HathiTrust as a platform for digital publishing, and significant steps towards compliance with TRAC, HathiTrust will fulfill all of the initial objectives set by the founding partners (see http://www.hathitrust.org/objectives).

Integration of BookReader into PageTurner

A new version of PageTurner, including the scroll and flip functionality and other features of the open source BookReader software, will be released in early 2011.

Full re-indexing for full-text search

The first full re-indexing of HathiTrust volumes will be completed in January.

Approval for HathiTrust Research Center

The Executive Committee has approved the proposal of Indiana University and the University of Illinois for the creation of a HathiTrust Research Center. Details and an announcement will be forthcoming.

Distribution of public domain texts for scholarly research purposes

The University of Michigan has finalized the terms of an agreement with Google that will allow HathiTrust to distribute the texts of public domain volumes to researchers for scholarly purposes. Details and an announcement will also be forthcoming.

Future Highlights


Framework for extending access for users with print disabilities

A group of partners from CIC institutions is working to develop the legal framework and technical implementation criteria to extend full-text access to both public domain and in-copyright materials in HathiTrust to users at partner institutions who have print disabilities. Further reports on this work will be given throughout 2011.

Developing support for publishing

As reported in the HathiTrust Update on October Activities, the MPublishing division of the University of Michigan Library has engaged in a 2-year effort to create ingest, management, and presentation tools that will enable the use of HathiTrust as a publishing platform for encoded text and page-image materials. The effort will focus first on journal content, with support for books planned at a later stage.

Print holdings database

HathiTrust has begun to assemble a database containing the print holdings of partner institutions. The database will facilitate the calculation of costs under the new cost model (see the new cost model FAQ), as well as broader partner activities around cooperative collection management and development. Work on the database will continue through 2012, to be completed by the time the new model takes effect in 2013.

Constitutional Convention

The HathiTrust partnership will hold a major meeting in October 2011 to conduct a formal review of HathiTrust governance and sustainability and shape future directions for the partnership.

Update on November 2010 Activities

December 10, 2010

[Download PDF]

Top News

Extensions to Copyright Scheme – Staff at the University of Michigan outlined specifications and implementation details for supporting Creative Commons licenses in the repository’s rights management scheme, including system APIs. The CC licenses are a work in progress, but HathiTrust hopes to allow these additional CC options for rights holders to open access to their works early next year. In conjunction with this work, the full range of access and use statements for HathiTrust materials was revised to be more useful to end users. Please visit HathiTrust Access and Use Policies for more details.

Bibliographic Data Management – The California Digital Library kicked off development of the new metadata management system for HathiTrust in November. The system is expected to be operational by the first quarter of 2012.

Working Groups

Communications – The group focused on work with many new partners to create press releases and announcements, and on the announcement of a major milestone: the finalization of the partnership that will participate in next year’s constitutional convention.

Development Environment – The new development environment continues to function well. Work on expanding storage capacity is ongoing, and new MySQL servers are on order, to be installed in late December or early January. Although the group may schedule additional discussions if issues emerge, the primary work of the development environment working group is now complete.

Discovery Interface – Changes requested by the Discovery Interface Working Group (DIWG) to the HathiTrust-OCLC catalog were implemented by OCLC in early November and reviewed by the working group. The DIWG is working with OCLC and the Communications working group to prepare publicity on a prospective release of the prototype catalog. These preparations include a strategy for linking to the prototype from the current HathiTrust search portal page. The DIWG is planning a period of user testing, feedback gathering, and analysis following release. 

Ingest

New Partner Ingest – Ingest of volumes from Cornell University began in November, and ingest testing was performed on a sample of Yale University volumes. Full ingest from Yale will begin in December.

Development Updates

Large-scale Search – Staff at the University of Michigan prepared in November to re-index the full text of all volumes in the repository, a process that is estimated to take 40 days. HathiTrust has been adding incrementally to the existing full text search index as new content has been deposited. This process will exercise for the first time the capability of the indexing system to operate in a dual-mode configuration, maintaining the currency of the production full text index while simultaneously building the new one. The new index will include several changes such as improved handling of non-Latin scripts (e.g. CJK, Thai, Devanagari) and additional cataloging metadata. Re-indexing is expected to start in early December and the new index is scheduled to be available by the end of January.

Further improvements to full-text search included the implementation of optimization and integrity checks as part of the daily index-building routine. Optimization increases consistency in query response times, and integrity checks prevent a corrupted index from being released into active service. 
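The idea of an integrity check that gates release can be sketched as follows. This is a minimal illustration, not the routine used at Michigan: the `IndexBuild` class, the function names, and the use of a SHA-256 digest as the integrity test are all assumptions chosen to keep the example self-contained.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class IndexBuild:
    """A hypothetical index build: its bytes plus the digest recorded at build time."""
    name: str
    data: bytes
    digest: str


def passes_integrity_check(build: IndexBuild) -> bool:
    """Return True if the build's bytes still match the digest recorded for it."""
    return hashlib.sha256(build.data).hexdigest() == build.digest


def release(current: IndexBuild, candidate: IndexBuild) -> IndexBuild:
    """Return the build that should serve queries.

    The candidate replaces the current build only if it passes the check;
    otherwise the current build stays in service.
    """
    return candidate if passes_integrity_check(candidate) else current


def make_build(name: str, data: bytes) -> IndexBuild:
    """Helper for the example: record the digest at the moment of building."""
    return IndexBuild(name, data, hashlib.sha256(data).hexdigest())


live = make_build("monday", b"index-v1")
good = make_build("tuesday", b"index-v2")
# Simulate corruption: the bytes change after the digest was recorded.
bad = IndexBuild("wednesday", b"index-v3-damaged", make_build("x", b"index-v3").digest)
```

In this sketch, `release(live, good)` returns the new build, while `release(live, bad)` keeps the existing one, so a damaged index never reaches active service.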

PageTurner – In preparation for the deposit of historical photographs from the Minnesota Digital Library and Minnesota Historical Society, developers at Michigan modified PageTurner to support repository objects that do not have plain text OCR for some or all provided images. Michigan staff also made significant progress on BookReader integration with PageTurner. Next steps include additional modifications to the integrated interface, and performance testing.

Outages – There were no outages in November.

Partner News

CDL Object Validation Tool – The Object Validation tool that CDL began to develop in August is nearly complete. Next steps will be to share with the HathiTrust community and enlist partner institutions to participate in using and evaluating the tool. Interested partners should contact HOVA-L@listserv.ucop.edu.

New Growth

Number of volumes added:

Institution                   November         Total
Columbia University                  3        57,282
Cornell University              53,807        53,807
Indiana University                 695       179,029
New York Public Library          1,066       257,828
Penn State University                2        33,365
Princeton University           130,648       208,142
University of California        92,153     1,917,335
The University of Chicago            0         2,440
University of Illinois               0        14,428
University of Michigan          54,814     4,234,664
University of Minnesota            141        73,864
University of Wisconsin          6,774       407,035
Total                          340,107     7,439,273

Public Domain (~24%)

Total                          148,420     1,820,696

Presentations

Yale University               November 3

December Forecast

  • Continue work on BookReader integration, full-text re-indexing, and Data API security enhancements
  • Finalize framework for ingest of locally-digitized content and establish technical systems for routine ingest
  • Begin ingest of Yale volumes and content from new partner institutions

Update on October 2010 Activities

November 12, 2010

[Download PDF]

Top News

New Partners – HathiTrust is pleased to announce that membership for the 2011 Constitutional Convention has been finalized. More than fifty institutions have joined HathiTrust and will take part in a collective process next year to determine the governance structure for HathiTrust in its next phase, and shape future directions for the partnership. The official announcement of partners will be made in the coming week. 

Minnesota Image Ingest – The University of Minnesota and its statewide partners – the Minnesota Digital Library (MDL) and the Minnesota Historical Society (MHS) – are working with the University of Michigan technical team to lead a project to develop a prototype workflow for depositing images and associated metadata into the HathiTrust system for access, storage, and preservation purposes. This effort will help HathiTrust meet one of its key functional objectives, support for non-book/non-journal digital content.  The project, scheduled to run from September 2010 to December 2010, involves a variety of content types, from simple continuous tone images, to compound objects made up of a series of images in a specified structural relationship. Demonstration content will include several tens of thousands of images from the MDL database and a 10,000 image subset of the MHS collection management system. Significant progress has been made in defining and testing the METS, PREMIS, and XMP data that are required for ingest into HathiTrust. Consultants to the project are Eric Celeste and Katherine Skinner, who are working very closely with the Minnesota and Michigan technical teams. Partner colleagues from Wisconsin and Northwestern are also involved in providing input and review to the project.  For more information, please contact John Butler <j-butl@umn.edu>.

Bibliographic Data Management – The UC project team has begun the first phase of work on the new metadata management system, with an overall goal of fostering a transparent model for HathiTrust metadata management. The project will address decoupling of the HathiTrust and University of Michigan production systems to enable the HathiTrust development environment to support the development, integration and deployment of a major HathiTrust Repository component. This endeavor exemplifies how sub-projects may be delegated and managed within the HathiTrust collaborative structure.

IMLS Grant for Validating Quality – HathiTrust will be participating in a grant received by Paul Conway, Associate Professor at the University of Michigan School of Information, from the Institute of Museum and Library Services to validate the quality of products produced in large-scale digitization projects. See the full announcement for more details.

Developing Support for Publishing – The University of Michigan is beginning development of tools that will enable the use of HathiTrust as a publishing platform for both encoded-text and page-image content. This 2-year effort will include the creation of ingest, management, and presentation tools for journals; support for books is planned as a later stage of development. Published materials will be included in large-scale search and viewable in a new interface that can be configured to reflect individual brands. The MPublishing division of the University of Michigan Library will partner in the development of this platform with the intention of making it the permanent host of MPublishing’s Open Access journals.

Search Widgets – California Digital Library has developed a search box that can be placed on any web page to search HathiTrust directly from websites, learning management systems, library guides and more.  There are a number of different versions to choose from, all variations on bibliographic search and full text search. The code for the search box is available at http://www.hathitrust.org/widgets. Anyone may take this code and embed it in a web page. The search box may be especially useful in contexts where users are seeking to discover full-text materials published before 1923, government documents, and historical information.

Website Redesign – The HathiTrust.org website was launched in October 2008, when the new partnership was officially announced. At that time, HathiTrust did not have its own bibliographic catalog, and the ability to search the full text of materials in the repository was more than a year away. HathiTrust.org was launched as a separate project site with the expectation that it would one day be integrated with HathiTrust repository and search services in a single interface. That time has arrived, and with significant coordination between the Communications and Usability working groups, and developers at the University of Michigan, the project website has been restructured and redesigned, and the interface integrated with  HathiTrust applications to provide a single portal for all HathiTrust activity. Please visit the new website at HathiTrust.org.

Partner Local Digitization – Comments were received from a number of institutions on drafts of HathiTrust’s policy and specifications framework for accepting content from a variety of digitization sources. The comments are being collated and incorporated into the framework, which staff at the University of Michigan hope to finalize by the end of November. At that time, staff also expect to have the technical systems fully in place to begin routine ingest of locally digitized content.

Working Groups

Communications – The group continued their work on evaluating a draft of the new website design and content, and on announcing many new partners.

Development Environment – The new development environment is now being used actively for all HathiTrust development, testing, and production release processes at Michigan. The working group, which met less frequently during the intensive migration process, will now be discussing the current provisions of the environment and welcoming partners interested in using it to contact us for access. Incremental improvements to the environment will continue. Major planned refinements include increasing storage capacity, upgrading the MySQL service to new servers, and formalizing processes for refreshing sample content.

Discovery Interface – With the beta release of the phase 1 HathiTrust-OCLC prototype catalog quickly approaching, the Discovery Interface working group (DIWG) is working with OCLC and the Communications working group to prepare a public announcement of the catalog. HathiTrust staff members are also making adjustments to the current HathiTrust interface to accommodate the new beta catalog.

As of the end of October, the Full-Text Search Working Group, a subgroup reporting to the DIWG, now has a finalized charge and membership. This subgroup, to be chaired by Tom Burton-West and involving members from four partner institutions, will have a two-part focus: 1) the group will work on short-term improvements to the HathiTrust full text search, and 2) it will evaluate user needs and draft a long-term work plan for the full text search functionality and interface. The subgroup is expected to begin work in November.

Usability – The Usability Working Group has been actively participating in other committees via liaison roles. The Communications group liaison helped refine the information architecture of the HathiTrust.org website. The Discovery Interface Working Group liaison began discussions with OCLC for a second round of usability testing to evaluate WorldCat Local for HathiTrust. The group also consulted on a usability issue regarding login and made recommendations for improvements.

Ingest

New Partner Ingest – Ingest began in October of content from Princeton and the University of Chicago.

Development Updates

Large-scale Search – Staff at the University of Michigan are preparing to regenerate the complete full-text index to take advantage of new and improved functionality in Solr 1.4.1, extend metadata support in the schema, improve journal volume metadata, and improve overall performance with better pre-filtering to reduce the number of unique terms. The actual process of re-indexing is expected to take approximately 40 days and is targeted for completion by the end of January.

A temporary solution was put in place to solve the problem of too many unique terms described in a Large-scale Search Blog post in February. A more permanent solution is in the works. Details are posted in the Large Scale Search blog from October 5.

PageTurner – Michigan made progress on the integration of the Internet Archive BookReader software (formerly known as the GnuBook)  into the HathiTrust PageTurner. Building on the advanced working prototype developed by the California Digital Library, the present task is to fully integrate the code for production use. Some enhancements, such as the display of OCR text, are in the works.  During October, the University of California wrapped up its involvement in integration of the BookReader with the HathiTrust PageTurner and handed off development to staff at Michigan. UC’s resources will be redeployed to the HathiTrust Metadata Management System development. John Wilkin praised UC’s involvement as “incredibly helpful in moving forward this critical piece of our environment”.

Collection Builder – Developers at Michigan discussed development options for proposed changes to the Collection Builder application interface. The immediate objective is to make the list of collections easier to use, based on guidance from the Usability Working Group. Preliminary conversations among Michigan staff about how to support full text search of significantly larger collections were also initiated.

Data API – Michigan staff began work in October to specify a security layer for the HathiTrust Data API, and enhanced the API to support dissemination of coordinate OCR.

Outages – HathiTrust's search-within-a-book feature  was unavailable from Tuesday, October 12 at 3:00pm EDT to Wednesday, October 13 at 12:00pm EDT due to a software change that had an unanticipated impact on this functionality.

Partner News

CDL Object Validation Tool – Staff at CDL demonstrated the Object Validation Tool to the team members at Michigan in late October. HathiTrust Teams at CDL and Michigan discussed strategies for generalizing the tool for other partner use. CDL is currently planning to check the code into the new code repositories in the HathiTrust development environment and solicit participation from another HathiTrust partner to experiment with running the tool.

New Growth

Number of volumes added:

Institution                    October         Total
Columbia University                511        57,529
Indiana University                 232       178,334
New York Public Library        175,534       256,762
Penn State University                6        33,363
Princeton University            77,494        77,494
University of California        18,107     1,825,202
The University of Chicago        2,440         2,440
University of Illinois               0        14,428
University of Michigan          30,022     4,179,850
University of Minnesota             29        73,723
University of Wisconsin         16,636       400,291
Total                          318,571     7,096,726

Public Domain

Total (~24%)                   252,970     1,672,276

Presentations

LuceneRevolution              October 7
ARL Fall Forum                October 15

November Forecast

  • Continue work on BookReader integration into PageTurner, full-text re-indexing, and Data API security enhancements
  • Finalize framework for ingest of locally-digitized content and establish technical systems for routine ingest
  • Continue to work toward redesign of the Hathitrust.org website
  • Begin ingest of Yale volumes and content from new partner institutions

Update on August 2010 Activities

September 10, 2010

[Download PDF]

Late Breaking News

Princeton University Joins HathiTrust – The full announcement can be found at http://bit.ly/bEbkSb and more information is available at HathiTrust.org. We are very excited to welcome Princeton University Library and look forward to the ways they will strengthen and enrich our partnership.

Top News

HathiTrust 101 – Members of the Communications working group and John Wilkin, the Executive Director of HathiTrust, hosted two informal “HathiTrust 101” sessions for working group members and directors of partner libraries in August. The webinars were initiated in connection with the recent growth in partnership and the deepening involvement of member institutions in new working groups and the Collections committee. The purpose was to provide an overview of foundational elements of HathiTrust, including mission, governance, finances, and collections, as well as updates on current activities and areas of focus. A third session is scheduled in September, and plans are being considered to hold similar sessions on a periodic basis to keep partners updated about recent and upcoming developments, answer questions, and receive feedback on partner activities and plans. Slides from the “HathiTrust 101” presentation are available at http://www.hathitrust.org/documents/HathiTrust101-201008.ppt.

September Meeting In Chicago – Staff from a number of partner institutions, including members of the Executive Committee, Strategic Advisory Board, and several HathiTrust working groups, will be meeting in Chicago on September 23 and 24 to discuss a broad array of issues and plans. Some of these, in addition to topics regularly reported in this newsletter, include the new cost model to be implemented in 2013, and the constitutional convention of partners to be convened in 2011. Institutions who join HathiTrust on or before October 31, 2010 will be eligible to participate in this convention, in which partners will conduct a formal review of HathiTrust governance and sustainability and shape new directions for the partnership.

Local Digitization Ingest – Staff members at the University of Michigan continued to work on the first draft of a policy and specifications framework for ingesting locally digitized content into HathiTrust. Staff have begun to use the framework to evaluate a sample of materials submitted by the University of Illinois, and the framework will go out to partner institutions for comment and further trial in September. HathiTrust plans to begin ingest of locally digitized content from Illinois and other CIC institutions in the fall.

Working Groups

Communications – The Communications Working Group continued to discuss issues surrounding the redesign of the HathiTrust website, as well as plans and processes for receiving new partners.

Development Environment – Staff at the University of Michigan have nearly completed migration of the code for HathiTrust applications to the development environment, including establishing the methods and scripts needed to deploy applications into production. Focus has shifted from migrating code to staging, deploying, and testing applications in development and production areas of the new environment. Developers at Michigan have begun to transition to the new environment and system administrators have configured and opened access to additional servers to support this transition. Networking changes to provide access from the integration testing area of the development environment to the full repository were also completed.

Discovery Interface – At the end of August, OCLC had loaded over 3.7 million HathiTrust records into WorldCat. This constitutes 98% of the available HathiTrust records. The Discovery Interface team is planning a beta release of the phase 1 HathiTrust-OCLC catalog at a date to be determined, pending some final adjustments to the interface to be completed by OCLC. The Discovery Interface team, in conjunction with OCLC, is planning usability analysis that will start before the catalog is released and continue throughout the beta release phase.

The Discovery Interface team is also looking forward to a face-to-face meeting in September, during a larger meeting of HathiTrust partners in Chicago. The agenda will include: taking stock of Discovery Interface projects and activities to date, setting the purpose and scope for future work, supporting the Discovery Interface Full Text Search subgroup, and creating a roadmap for phase 2 of the HathiTrust-OCLC catalog.

Usability – The Usability Working Group has begun regular meetings and is in the process of setting priorities and defining member roles in relation to other committees.

Ingest

Columbia – HathiTrust began ingest of volumes contributed by Columbia University in August, including both Google- and Internet Archive-digitized volumes. This was the first set of Internet Archive-digitized materials to be ingested since the initial deposit by the University of California in April, when specifications for Internet Archive-digitized content in HathiTrust were developed.

Yale – Staff from Yale and the University of Michigan have been working to determine the pre-ingest transformation steps needed for Yale’s Microsoft-digitized volumes and transfer the content to servers at the University of Michigan, where it will be ingested. Both of these tasks are nearly finished, and we hope to begin ingest of Yale’s initial set of volumes by the end of September.

Development Updates

Bibliographic Metadata Management – University of California staff are collaborating with staff at the University of Michigan to produce a series of planning documents for a HathiTrust Metadata Management system to replace the system currently in use. The goal is to prepare a set of documents for in-person review at the September meeting in Chicago. Teams are at work on documents that will codify goals, success criteria, system requirements, development, integration and migration strategies, acceptance testing and project timelines and milestones.

Large-scale Search – Michigan staff continued tests to determine the effects of cache warming on performance. Staff also continued the tests related to scaling strategy and indexing speed that were reported in the Update on June Activities.

PageTurner – Staff at Michigan improved the way that PDFs are created for books with landscape-oriented pages.

Storage Upgrade – Michigan staff completed the same upgrade at the Michigan storage site that was completed at the Indiana site in July: adding 160 terabytes of new storage, replacing cluster interconnect switches, reorganizing the equipment layout, and recabling all servers and storage. As reported in the Update on July Activities, the usable storage capacity at each site is now 475 terabytes.

Outages – HathiTrust full text search was unavailable on Friday, August 20 from 2:40-2:45pm EDT due to an accidental release of a software module from the new development environment while troubleshooting a full-text indexing problem. Full text search may also have been unavailable for some users from approximately 2:30pm on Friday, August 27 to Monday, August 30 at 3:30pm due to a network file system locking problem at the Michigan site.

Partner News

UC Validation Tool – Staff at the University of California are developing an automated tool to validate the completeness and correctness of objects ingested into HathiTrust and retrieved through the Data API. The tool will be used initially to validate samples of ingested Google- and Internet Archive-digitized objects in comparison with their pre-ingest originals. A prototype of the tool is scheduled for demonstration by the end of September.
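
As an illustration of the kind of comparison described above, the following is a minimal sketch and not the UC tool itself. It assumes, hypothetically, that a pre-ingest original and a retrieved object are each available as a local directory and that their files are expected to be byte-identical; the source does not state this, and ingest may legitimately transform content. The function names `file_digests` and `compare_packages` are invented for this example.

```python
import hashlib
from pathlib import Path


def file_digests(root: Path) -> dict[str, str]:
    """Map each file path (relative to root) to its SHA-256 hex digest."""
    digests = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            name = path.relative_to(root).as_posix()
            digests[name] = hashlib.sha256(path.read_bytes()).hexdigest()
    return digests


def compare_packages(original: Path, retrieved: Path) -> dict[str, list[str]]:
    """Report files missing from, added to, or altered in the retrieved copy."""
    before = file_digests(original)
    after = file_digests(retrieved)
    shared = set(before) & set(after)
    return {
        # Completeness: every original file should still be present.
        "missing": sorted(set(before) - set(after)),
        "unexpected": sorted(set(after) - set(before)),
        # Correctness (under the byte-identical assumption): digests should match.
        "changed": sorted(name for name in shared if before[name] != after[name]),
    }
```

An object would pass this hypothetical check when all three lists in the report are empty.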

SFX HathiTrust Target – California staff are packaging code for an SFX HathiTrust target for partners who also license the Ex Libris SFX software. UC expects to announce the availability of the code to partners in late September. The target will be offered through the Ex Libris EL Commons wiki later in the fall.

New Growth

Number of volumes added:

                              August        Total
Columbia University           56,730       56,730
Indiana University               286      177,962
Penn State University         10,202       33,357
University of California     133,900    1,769,227
University of Michigan        40,866    4,130,008
University of Minnesota           54       73,674
University of Wisconsin       14,167      379,111
Total                        199,475    6,563,339

Public Domain
Total (~20%)                  55,132    1,311,288

Presentations

HathiTrust 101                          August 5 and 27
University of Iceland                   August 5
IFLA 2010 (paper and presentation)      August 15
  • Please see http://www.hathitrust.org/papers for links to all HathiTrust presentations, papers, and reports.

September Forecast

  • Hold committee and working group meeting in Chicago September 23-24
  • Add progress bar for full-book PDF generation to the PageTurner application
  • Improve PageTurner handling of volumes without OCR 
  • Finalize draft of policies and procedures for ingest of locally digitized content
  • Test procedures with content from CIC institutions and prepare for ingest
  • Continue work to redesign HathiTrust website

Update on September 2010 Activities

October 8, 2010

Late Breaking News

HathiTrust Welcomes TRLN and Dartmouth – The Triangle Research Libraries Network (TRLN) and Dartmouth College have joined HathiTrust. TRLN will be contributing public domain volumes digitized through in-house initiatives and partnerships with the Internet Archive. Dartmouth joins HathiTrust as the first partner under HathiTrust’s new cost model. Visit http://www.hathitrust.org for more details.

Top News

September Meeting In Chicago – Staff from a number of HathiTrust institutions gathered in Chicago on September 23 and 24 for meetings of HathiTrust’s governing committees, its operational and planning working groups, and teams working on specific projects including ingest of locally digitized partner content, full text search, user interface collaboration, and others. Activities in the meetings are reported throughout the newsletter, and will be posted in the Executive Committee and Strategic Advisory Board meeting minutes.

Local Digitization Ingest – The first draft of a policy and specifications framework for receiving content from a broad array of digitization sources and workflows into HathiTrust was completed in September and shared with several partner institutions. The framework is now available publicly online in two parts: the HathiTrust Guidelines for Digital Object Deposit and the HathiTrust Deposit Form, which includes detailed specifications for submitted content. We would like to formally request comments and feedback on the framework from partner institutions and interested parties. To be included in our review and revisions, please send comments to hathitrust-info@umich.edu by October 31, 2010.

Copyright Review – For the last two years, staff at the University of Michigan have been reviewing volumes in HathiTrust that were published in the United States from 1923 to 1963, opening materials that did not comply with U.S. copyright formalities and are therefore in the public domain. Over the summer, this work was expanded to additional HathiTrust partner institutions, and staff at Indiana University, the University of Wisconsin, and the University of Minnesota were trained to work as reviewers. As of September 1st, 18 staff members from the four institutions are contributing to the project. The increase in staff has resulted in a larger number of volumes being opened up in HathiTrust on a monthly basis, from 470 volumes in June to over 2500 volumes in September. Approximately 85,000 of 188,000 current candidate volumes in HathiTrust have been reviewed since the project began. Close to 50,000 of these, or about 55%, have been determined to be in the public domain and opened in HathiTrust.

Shibboleth – Four new partner institutions configured access to HathiTrust via Shibboleth in September. Logging into HathiTrust provides students, faculty, and other affiliates at partner institutions the ability to download a full-PDF of all public domain materials. It also enables use of HathiTrust’s Collection Builder tool with a local sign-on. HathiTrust plans to use Shibboleth to offer additional features and services to partner institutions in the future.

October 31 Partnership Deadline – A number of partners have joined HathiTrust in the last several weeks, and we will be announcing several more throughout October. Institutions are joining ahead of an October 31 deadline, by which institutions must become members in order to participate in a constitutional convention that HathiTrust will hold in 2011. In this convention partners will conduct a formal review of HathiTrust governance and sustainability and shape future directions for the partnership.

Working Groups

Communications – The Communications Working Group’s activities in the past month continued to focus on plans and processes for receiving new partners. In its in-person meeting in Chicago on September 24, the group made progress towards a communications and marketing plan and provided feedback to the website redesign project in an interactive session. An additional “HathiTrust 101” presentation was held on September 9. Slides from the presentation are available at http://www.hathitrust.org/documents/HathiTrust101-201008.ppt.

Development Environment – The transition of all ongoing HathiTrust development to the new development environment is in its final stages. Work in September focused on developing testing and release processes, and the transition is expected to be complete by mid-October.

Discovery Interface – The Discovery Interface Working Group (DIWG) convened in Chicago on September 24th, as part of the larger two-day face-to-face meeting of HathiTrust partners. The group had a productive discussion, largely focused on re-structuring the working group, scoping its future work, and clarifying the DIWG’s relationship to other HathiTrust working groups. Three areas of future focus are phase 2 of the HathiTrust-OCLC catalog, full-text search services, and usability for both of these projects (in collaboration with the Usability WG). Several other areas of potential development falling under the topic of “end-user services” were identified for further investigation. As a follow up to this meeting, the chair of the group took the DIWG’s ideas and questions to the HathiTrust Strategic Advisory Board, to whom the working group reports, for consultation. Meanwhile, a beta release of the phase 1 HathiTrust-OCLC catalog is expected by the end of 2010.

Usability – The Usability Working Group also met in a face-to-face meeting in Chicago in September. The group has been working on forming connections with other working groups, understanding existing HathiTrust interfaces and functionality, and determining the scope of the work the group will undertake. Formal liaisons to the Communications and Discovery Interface working groups were established in September.

Ingest

NYPL and Illinois – HathiTrust began ingest of content from New York Public Library and the University of Illinois in September, including more than 80,000 Google-scanned volumes from NYPL and more than 14,000 Internet Archive-scanned volumes from Illinois. Illinois is the third institution following the University of California and Columbia University to contribute volumes digitized by the Internet Archive. Ingest of content from Yale University will begin in October.

Development Updates

Bibliographic Metadata Management – During the month of September, HathiTrust teams in Michigan and California focused on producing a planning document for a HathiTrust Metadata Management Service to be developed, hosted, and run by the University of California. The document codifies the goals, success criteria, assumptions, and requirements of the system, as well as strategies for migration, integration, and acceptance testing. The planning document provided a focus for face to face meetings in Chicago on September 23-24. The University of California expects to begin developing the system later in the fall.

Large-scale Search – Staff at the University of Michigan conducted additional testing in September to better understand scalability and memory issues in full text indexing and to tune the searching and indexing processes. As a result of the testing, Michigan developers were able to solve issues related to the size of the index and memory use, improving the speed of full text searches.

PageTurner – Michigan staff completed work to add a progress bar for full-volume PDF generation in the PageTurner application. The new feature will be put into production in October. Staff at Michigan also began light experimentation with the coordinate OCR text format to investigate possibilities for use.

Collection Builder – Staff members at Michigan and California Digital Library discussed improvements that could be made to the collection builder interface.

Improvements to Ingest – Work on architectural improvements to ingest that was reported in the Update on July 2010 Activities is nearly complete. The major areas of enhancement are more thorough barcode validation, generalization of routines that create METS and PREMIS markup, an improved logging framework, and the use of XPath for XML validation. Along with these changes, a regression testing methodology is being developed to exercise all validation logic.
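
To illustrate what XPath-based XML validation can look like, here is a minimal sketch using Python's standard library. It is not HathiTrust's ingest code. The namespace URI is the published METS namespace, but the two rules in `REQUIRED_PATHS` and the function name `missing_paths` are assumptions chosen only for this example; the actual validation rules are not described in this update.

```python
import xml.etree.ElementTree as ET

# Namespace prefix map used when evaluating the XPath expressions.
NS = {"mets": "http://www.loc.gov/METS/"}

# Hypothetical rules: each path must match at least one element
# relative to the document's root <mets:mets> element.
REQUIRED_PATHS = [
    "./mets:fileSec/mets:fileGrp/mets:file",
    "./mets:structMap/mets:div",
]


def missing_paths(xml_text: str) -> list[str]:
    """Return the required paths that match nothing in the document."""
    root = ET.fromstring(xml_text)
    return [path for path in REQUIRED_PATHS if root.find(path, NS) is None]
```

An empty result means every rule matched. Expressing rules as a list of paths keeps each check declarative, which also makes it straightforward to exercise every rule in a regression test.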

Outages – HathiTrust was unavailable on Tuesday, September 7 from 4:20pm to 5:00pm due to a software error that was undetected during release testing.

Partner News

SFX HathiTrust Target – As reported in the Update on August Activities, the University of California has created an SFX “target” to link to the HathiTrust Digital Library. HathiTrust partner libraries using Ex Libris’ SFX scholarly linking who implement the new target will be able to include a link to HathiTrust books in their SFX menu window. Library users will be able to see immediately whether a HathiTrust book is available electronically and if so, link to the full text in the HathiTrust Digital Library. For a copy of the code, email Margery Tibbetts, California Digital Library, at CDL-SFX-Tech-l@ucop.edu.

New Growth

Number of volumes added:

                           September        Total
Columbia University               38       54,983*
Indiana University               140      178,102
New York Public Library       81,228       81,228
Penn State University              0       33,357
University of California      37,868    1,807,095
University of Illinois        14,428       14,428
University of Michigan        19,820    4,149,828
University of Minnesota           20       73,694
University of Wisconsin        4,544      383,655
Total                        158,086    6,778,155

Public Domain
Total (~21%)                 108,028    1,419,306

*Incorrectly reported as 56,730 in the previous update

Presentations

HathiTrust 101          September 9
Indiana University      September 27
Library of Congress     September 27

October Forecast

  • Finalize membership for the 2011 constitutional convention
  • Continue to work toward redesign of the Hathitrust.org website
  • Begin ingest of Yale content
  • Complete transition to new development environment
  • Receive feedback on ingest policies and specifications

Special Message On Security and HathiTrust

From: John Wilkin
To: HathiTrust Partners

Dear Colleagues,

In recent months, we have seen an increase in the number of incidents of large-scale downloading of HathiTrust resources, and even the availability of applications to aid in downloading and circumventing limitations on access. HathiTrust is strongly committed to openness; even so, we occasionally encounter issues related to security. For example, overly aggressive crawlers can consume such large amounts of system resources that they affect access for typical users. External agents may have a negative effect intentionally (a sort of “denial of service attack”) but most are simply poorly designed or the persons running them do not understand why limits have been put in place.

In addition to general system resource concerns, HathiTrust holds many resources that have contractual obligations for limiting some types of systematic or large-scale downloading. The most famous example of this is Google-digitized content, which requires the participating library and HathiTrust to prevent uncontrolled robotic activity, e.g., “to implement technological measures (e.g., through use of the robots.txt protocol) to restrict automated access to any part of such entity’s website where substantial portions of such Digital Copies are available” (http://www.lib.umich.edu/files/services/mdp/Amendment-to-Cooperative-Agreement.pdf). Publicly available publisher resources may also have these types of constraints.

HathiTrust uses strategies to enforce some limits on use, but balances this with strategies to provide more open access. As a general preventative measure, HathiTrust employs forms of “throttling” as one mechanism to protect the system from malicious external forces. To be frank, this sort of approach is fairly coarse and is insensitive to whether the user’s activities are appropriate or inappropriate; we hope to increase the sophistication of these mechanisms as time goes on to better distinguish and permit legitimate heavy use. One approach we take to give more generous access is through user authentication at partner institutions. In this case, Google-digitized content, which can typically only be viewed one page at a time, is downloadable by the authenticated user as a whole volume (or parts of a volume). We can also use the Data API as a vehicle for broader access. In cooperation with a partner institution, we can, for example, permit a specific IP address for authorized larger-scale uses. We also hope to add functionality to the Data API to enable authorized uses, perhaps even of in-copyright materials. For most partner-digitized resources, where limitations are not required, the Data API is an excellent tool, and we have deployed a demonstration tool that facilitates large-scale downloading (temporarily available at http://www.lib.umich.edu/two-over-threehundred/).

We want to remind partner institutions that responsible management of the repository requires us to collaborate in managing types and levels of access. Many factors, including copyright law and contracts, come into play in guiding our strategies.  Institutional representatives may be asked to investigate and help resolve issues of apparent violations. We can also work together to provide broader access where possible.  I hope you’ll be able to aid us in managing this balance.

Sincerely,

John Wilkin, Executive Director
HathiTrust Digital Library

Update on June 2010 Activities

July 9, 2010

Top News

Shibboleth and Full-PDF Download – HathiTrust released Shibboleth as a mechanism for partner authentication in June. Authenticated users can now download full-PDFs of all public domain volumes in HathiTrust, and access the Collection Builder feature through local sign-on. Shibboleth also lays the groundwork for future augmented services to partner institutions, potentially including the ability to make uses of digital volumes allowed by Section 108 of U.S. copyright law and to allow full access to in-copyright volumes for users with print disabilities.

Full-PDF Download: The release of Shibboleth was made in conjunction with improvements to PageTurner that enabled delivery of high-resolution PDF files with embedded OCR for entire volumes. While only individuals at member institutions have access to this service across the repository, all public domain volumes that were not digitized by Google are available for full-PDF download to members and non-members alike. Right now these include nearly 100,000 Internet Archive-digitized volumes that have been contributed by the University of California, and thousands of volumes digitized locally by the University of Michigan. The partners are poised to significantly increase the amount of non-Google-digitized content preserved in HathiTrust in the near future, making many more public domain volumes freely available for download and distribution.

SEASR – HathiTrust is in the process of investigating SEASR, the Software Environment for the Advancement of Scholarly Research, as a means to provide computational access to materials stored in the repository. Staff at the University of Michigan began installation of SEASR in the HathiTrust development environment in June, and expect to gain more knowledge about SEASR and what would be involved in applying it to HathiTrust over the next several weeks.

Working Groups

Discovery Interface – As of the end of June, there are nearly 3.1 million HathiTrust records in WorldCat. Record loading is now continuing at a quicker pace, and is nearly complete. Meanwhile, the working group is in the process of configuring the HathiTrust-OCLC catalog interface to make branding and design consistent with the existing HathiTrust Digital Library system. OCLC is also making several alterations to the catalog’s functionality to fully meet HathiTrust’s requirements. This work is expected to extend into early August, after which time the interface will be reviewed for public beta release.

With the working group’s charge expanding to include development of the HathiTrust Full Text Search, the group plans to restructure its membership in order to specifically target different areas of focus. While the new structure is still being finalized, the goal is to form various task forces to address different aspects of the HathiTrust Discovery Interface: full text search, bibliographic data management, and the HathiTrust-OCLC catalog interface.

Collaborative Development Environment – University of Michigan staff continued the migration of HathiTrust applications into the new development environment in June, performing testing and configuration of the GlusterFS distributed file system that will be used as the storage back-end for the environment as well. Michigan staff are in the process of setting up and testing the virtual MySQL and web service provisions of the new environment. An initial version of the development environment is being used currently by staff at California and at Michigan to make improvements to the existing PageTurner application. When configuration is complete, the environment will support HathiTrust development efforts broadly across the partnership.

Quality, Ingest, and Error Rate – The quality working group is still working through a set of scenarios for gating volumes of poor quality from entering HathiTrust, and developing a justification and recommendation for the best approach to follow. A set of larger issues around quality has also been identified, some of which deal with larger policy considerations.

Development Updates

Large-scale Search – The full text search index in Indiana was put into production by Michigan staff in early June, making the infrastructure for full text search fully redundant. Two new index build servers were also put into production in Michigan. All of the new systems have been functioning well, and the new build servers have substantially improved the performance of index building and maintenance.

Michigan staff began running tests in June to determine the effects of cache-warming on performance, as well as tests relating to scaling strategy and indexing speed. The goal of scaling tests is to determine the optimum size to use for index shards, or sections of the search index, that are stored on each index server, the optimum number of shards per server, and optimum memory allocation per server. Indexing speed is of critical importance for deploying new searching features, which often requires the entire search index to be rebuilt.

Michigan staff also developed a Lucene utility in June (Solr uses Lucene) to read an index and print out the total number of occurrences of a term. The code has been contributed and committed to the stable Lucene development branch (3.x).
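
The quantity such a utility reports can be illustrated with a toy inverted index. The sketch below is plain Python and does not use or reproduce the Lucene API or the contributed code; `build_index`, `total_occurrences`, and `document_frequency` are hypothetical names. It shows the difference between the total number of occurrences of a term across an index and the number of documents that contain it.

```python
from collections import Counter, defaultdict


def build_index(docs: dict[str, str]) -> dict[str, dict[str, int]]:
    """Build a toy inverted index: term -> {doc_id: occurrences in that doc}."""
    index: dict[str, dict[str, int]] = defaultdict(dict)
    for doc_id, text in docs.items():
        # Naive tokenization: lowercase and split on whitespace.
        for term, count in Counter(text.lower().split()).items():
            index[term][doc_id] = count
    return index


def total_occurrences(index: dict[str, dict[str, int]], term: str) -> int:
    """Sum the per-document counts for a term across the whole index."""
    return sum(index.get(term, {}).values())


def document_frequency(index: dict[str, dict[str, int]], term: str) -> int:
    """Count how many documents contain the term at least once."""
    return len(index.get(term, {}))
```

For a common word that appears many times on each page, the two figures can differ widely, which is why a total-occurrence count is useful in addition to the document count when studying index size.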

PageTurner – Additional progress was made on GnuBook integration with the current HathiTrust PageTurner. Michigan investigated in particular ways to optimize the serving of thumbnails. Performance optimization for the new page image server also continued, with a focus on common CGI performance mechanisms, including FastCGI.

Collection Builder – Integration of Collection Builder functionality with large-scale search is in the final stages of testing and will be deployed in July.

Storage Upgrade – Michigan staff have ordered and received additional storage for the Indiana and Michigan sites and will be putting it into service during July and August. The upgrade requires the installation of a new, larger storage network switch, so staff will be using the opportunity to introduce a new cabling layout for the entire system. In Indiana, the upgrade and recabling work will be combined with a recommended relocation of all server equipment to another area of the data center for improvements in air handling and a transition to high-voltage power distribution. No outage is expected for this maintenance work.

Outages – HathiTrust services were unavailable on Monday, June 7 from 7:10-10:00am and on Tuesday, June 8 from 5:00-5:30pm due to a connectivity problem with one of the web servers; and on Saturday, June 25 from 8:30-10:00am due to a database server disk space shortage.

New Growth

Number of volumes added:

                                June        Total
Indiana University               236      177,333
Penn State University            328       22,824
University of California         616    1,509,169
University of Michigan        34,605    4,056,835
University of Minnesota          173       73,856
University of Wisconsin       10,073      353,639
Total                         46,031    6,193,386

Public Domain
Total (~20%)                  32,805    1,208,351

Presentations

ALA BISG/NISO Forum     June 25
  • Please see http://www.hathitrust.org/papers for links to all HathiTrust presentations, papers, and reports.

July Forecast

  • Explore capabilities and requirements of SEASR
  • Continue configuration of the new development environment and migration of current development activities
  • Install storage upgrade at Indiana site

Update on May 2010 Activities

June 11, 2010

Top News

NYPL Partnership – We are pleased to announce New York Public Library as the newest partner in HathiTrust Digital Library. The New York Public Library is recognized around the world for its distinctive collections and services to users, and will bring valuable content and perspective to the partnership. NYPL will be contributing materials digitized in collaboration with Google, the Internet Archive, and Kirtas. The press release for the partnership announcement can be read at http://www.nypl.org/press/press-release/2010/05/24/nypl-takes-giant-step-preserving-its-digitized-collections.

6 Million Volumes, 1 Million Public Domain – As of May 26, HathiTrust preserves and provides access to more than 6 million volumes, over 1 million of which are in the public domain. These significant milestones draw attention to the growing value of HathiTrust as more and more volumes, representing an increasingly comprehensive collection of published literature, are contributed by partners, made available to users, and securely stored for generations to come.

Shibboleth – Implementation of authentication via Shibboleth was tested and finalized in May, and a formal release of the service is scheduled for June 8. When it is released, users at partner institutions that have provided Shibboleth attributes to HathiTrust will be able to download full-PDFs of public domain volumes, and use their institutional sign-ons to access the HathiTrust Collection Builder. More information about Shibboleth in HathiTrust, including attributes, terms of use, and privacy, is available at http://www.hathitrust.org/shibboleth.

New Communications Working Group – The Executive Committee has formed a new working group to address an array of communication needs in HathiTrust as the partnership and user base continue to expand. Information about the communications group, including goals and specific areas of focus, can be found in the formal charge at http://www.hathitrust.org/wg_communications_charge.

Partner Local Digitization – Staff at the University of Michigan continue work to establish specifications and guidelines for ingest of non-Google- and non-Internet Archive-digitized materials from partner institutions. By August, staff hope to have a clear and efficient framework defined to begin to scale up ingest of content from local digitization efforts.

Working Groups

Development Environment – Michigan staff are working on migrating active development of repository applications and services including PageTurner, Collection Builder, Large-scale Search, and Ingest, to the new development environment. The design is being adapted on an ongoing basis in response to issues encountered along the way. Michigan ordered new network hardware to enable limited access from the development environment to content in the production repository for integration testing and troubleshooting (a subsection of the repository has been copied and made available to the environment to meet the majority of development needs). The working group continues to have regular conference calls to discuss progress on the transition to the new environment.

Discovery Interface – On May 23, OCLC successfully installed the version 1 HathiTrust WorldCat Local instance. The catalog has been made available internally to the Discovery Interface Working Group, and is being tested and evaluated by both OCLC and HathiTrust. OCLC is now close to completing a full load of HathiTrust records into WorldCat, with just under 2.9 million records loaded. After the initial record load, OCLC will move to loading periodic HathiTrust update files.

The working group also recently drafted a charge document for its work on developing the HathiTrust Full Text Search. Some of the main goals of this project will be: charting a course of service refinement to meet scholarly need; contextualizing each of HathiTrust’s search services through interface design and presentation; recommending pathways from HathiTrust search to other services essential to patterns of scholarly workflows; and evaluating the effectiveness of the HathiTrust full-text search. The group is currently working on outlining a timeline and strategy for these efforts, as well as the full-text search membership.

Development Updates

Large-scale Search – New servers were installed and configured at the Indiana site by staff from the University of Michigan, and the process for releasing daily large-scale search index updates was developed and run in a test mode. The search service running on these new servers will be put into production by Michigan staff on June 8, making the full-text search service redundant in Michigan and Indiana. Two new index building servers were put into production in May, providing a substantial increase in index building performance and freeing one server to be repurposed for development and testing of index processes.

PageTurner – Michigan explored strategies for optimizing performance of the newly constructed image server, particularly in conjunction with its use in the GnuBook book viewer. Speedy extraction of image dimensions for an entire book and delivery of thumbnails are among the challenges. Performance optimization work will continue in June.

Outages – The beta* large-scale search service was unavailable on Monday, May 3 from 9:00-10:15am to apply security updates and on Thursday, May 20 from 9:00-11:25am to install new networking hardware.

*Beta services are typically non-redundant and/or volatile, and while we strive to minimize down time and report any that occurs, we do not attempt to adhere to non-peak outage windows for maintenance.

New Growth

Number of volumes added:

                                 May        Total
Indiana University               262      177,097
Penn State University          5,222       22,496
University of California     304,997    1,508,553
University of Michigan        82,508    4,022,230
University of Minnesota          348       73,413
University of Wisconsin       13,522      343,566
Total                        406,859    6,147,355

Public Domain
Total (~19%)                 272,027    1,175,546

June Forecast

  • Continue performance optimization for GnuBook
  • Continue configuration of the new development environment and migration of current development activities
  • Begin work on increasing the development environment’s available storage

Updates on April 2010 Activities

May 14, 2010

Top News

Internet Archive Ingest – Through the collaborative efforts of California Digital Library (CDL) and the University of Michigan, University of California volumes digitized by the Internet Archive began flowing into HathiTrust in April. This achievement is significant because of the number of additional public domain volumes UC will make available and preserve in HathiTrust (approximately 200,000), and because of the new channel it opens for partner ingest. HathiTrust has offered deposit of Google-digitized materials at no cost to partners since its inception in 2008. This ingest service is now extended to partner institutions’ Internet Archive-digitized volumes. The collaboration has also significantly advanced HathiTrust’s progress towards repository-wide standards for digital object packages and guidelines for deposit. The first Internet Archive-digitized volume to enter HathiTrust was entitled “The Dawn of All”, appropriately marking this major step in HathiTrust’s ability to preserve the wide variety of non-Google-digitized content produced by library partners. More than 90,000 of UC’s Internet Archive-digitized volumes are in HathiTrust as of the time of newsletter release.

Partner Local Digitization – A group of staff members at the University of Michigan are developing a formalized process for receiving, analyzing, and preparing locally-digitized content from partner institutions for deposit. Leveraging experience gained with Internet Archive ingest, the group will be working over the next several weeks to publish content guidelines that will aid partners in assessing the readiness of their content for deposit. The guidelines will also provide consistent benchmarks for UM staff to use in performing the transformations and normalizations that may be necessary to assemble coherent and consistent archival packages.

Bibliographic Management System – The University of California is engaged in the process of designing a new bibliographic management system for HathiTrust. HathiTrust team members at CDL and the University of Michigan engaged in multiple teleconferences throughout April to define the scope of services and functions provided in the current management system at UM (Ex Libris’ Aleph product), and by the systems that it supports such as HathiTrust’s temporary catalog and Bibliographic API. Development of the new system will diversify the management of support systems in HathiTrust. It also presents a valuable opportunity to revisit current practices and assumptions and reengineer existing processes to be more efficient. Team members have worked through a number of architectural issues in the design and May discussions will focus on strategies for transition to the new system. Development of the system has not yet begun and there is no timeline currently for implementation.

Website Changes – Staff from UM and UC took part in a usability exercise in April geared towards improving the navigability of the HathiTrust.org “About” website. Improvements are ongoing, and UM staff have implemented the first in a series of changes. Top navigation has changed and sub-navigation is now provided on the left side of the interface. Search functionality has also been added. New content and additional architectural changes will be made in the coming weeks as documentation assembled for HathiTrust’s audit with CRL for compliance with TRAC is made available, and additional usability tests are conducted.

Working Groups

Discovery Interface – OCLC loaded more than 1 million HathiTrust records into WorldCat in April, bringing the total number of records to close to 2.3 million by the end of the month. These constitute over 60% of the total HathiTrust records that are currently available. Loading of these records will continue throughout May. In mid-May, OCLC will release version 1 of the HathiTrust catalog internally to the Discovery Interface Working Group. Staff at the University of Michigan are preparing to make necessary changes to HathiTrust websites to accommodate the new catalog. OCLC provided a first glimpse of the new catalog to the working group in a recent WebEx session.

In parallel to finalizing the version 1 catalog, the HathiTrust team is turning its attention to the full-text search application. The Discovery Interface Working Group will assume responsibility for further developing HathiTrust’s full-text search service, and is in the process of finalizing a formal charge and roadmap for this project.

Collaborative Development Environment – Michigan staff, in consultation with working group members, have completed an initial design of the development environment. The design includes specific plans and conventions for version control, file layout and naming, virtualization provisions for developers, and multiple test and beta instances. UM staff are now planning the details of migrating current development into the new scheme as well as configuring and building out the server resources for the environment.

Ingest

Penn State – HathiTrust recently received updated bibliographic records from Penn State University for several hundred PSU-contributed volumes. The volumes, which are all in the public domain, had received an incorrect bibliographic rights determination in HathiTrust because of problems with metadata, including a missing flag indicating that volumes were government documents. With the corrected records, all of these volumes are now freely accessible. HathiTrust will be monitoring rights determinations for volumes from institutions depositing public domain-only materials. If metadata corrections are required to make volumes available, HathiTrust will notify the institutions appropriately.

UC Delivery of Bibliographic Records – UC recently modified the way it makes bibliographic records available to HathiTrust for volumes that are being scanned on an ongoing basis. Records will now be pulled from UC by HathiTrust, rather than pushed by UC, which will simplify the flow of ingest for these materials.

Development Updates

Shibboleth – Based on input from partners, staff at UM have further refined the final list of attributes needed to provide Shibboleth services. These attributes and other information about Shibboleth in HathiTrust are available at http://www.hathitrust.org/shibboleth. UM staff also registered HathiTrust as a service provider with the InCommon Federation. Two early adopter institutions successfully tested access to HathiTrust development systems via Shibboleth, and HathiTrust is planning to release the service publicly in mid-May.

Large-scale Search – University of Michigan staff developed a pair of index tools for reporting specific statistics about a Solr index in April. These tools help to identify frequently occurring terms, which can be used to improve performance. The tools have been committed to the Solr code base and will be part of future Solr releases.
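As a rough illustration of the kind of statistic involved (this is not the tooling contributed to Solr, only a sketch over a toy in-memory corpus, and the function and variable names are hypothetical), the following counts how many documents each term appears in and lists the most frequent terms:

```python
from collections import Counter

def top_terms_by_doc_freq(docs, n=3):
    """Return the n terms that appear in the most documents."""
    doc_freq = Counter()
    for text in docs:
        # Count each term once per document (document frequency).
        doc_freq.update(set(text.lower().split()))
    # Sort by descending frequency, then alphabetically for stable output.
    return sorted(doc_freq.items(), key=lambda kv: (-kv[1], kv[0]))[:n]

# Toy corpus standing in for indexed volumes (illustrative only).
toy_docs = [
    "the dawn of all",
    "the history of the university",
    "a history of printing",
]
print(top_terms_by_doc_freq(toy_docs))
```

In a real index the same ranking would be computed from the index's term dictionary rather than by re-reading the text.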

Work by UM team members continues on installing new servers at the Indiana site. New electrical and networking capacity has been installed, and firewalling is being reconfigured to support new remote administration and installation capabilities. Once complete, the new servers will be configured and brought online, anticipated for late May.

Collection Builder – Staff at UM have developed functionality that allows a user to add multiple items at once to a Collection from the full-text search results. Cosmetic changes to the user interface are required to complete this effort.

PageTurner – Developers from UM and CDL have been collaborating to integrate new image serving capabilities at UM with the GnuBook reader. A prototype application combining these services has been developed and next steps will involve merging the prototype with the current PageTurner application. Code developed by CDL to produce thumbnail views of volumes in GnuBook has been incorporated into the mainline code, maintained by the Internet Archive.

Outages – The beta* large-scale search service was unavailable from 10:00am - 12:40pm EDT on Friday, April 9 to troubleshoot a hardware problem on an index server.

*Beta services are typically non-redundant and/or volatile, and while we strive to minimize down time and report any that occurs, we do not attempt to adhere to non-peak outage windows for maintenance.

Partner News

UC-eLinks (SFX) – HathiTrust books free of copyright restrictions, contributed by both UC and other HathiTrust partners, are now available via a link in UC’s UC-eLinks (SFX) menu window. CDL has developed a target for SFX that exposes HathiTrust public domain books using the HathiTrust Bibliographic API. CDL plans to review statistics for the new target and work with HathiTrust staff to measure the load placed on HathiTrust APIs by UC usage. There are plans to share the target with HathiTrust partners, and in the future potentially contribute it back to Ex Libris. UC Davis has also created a test implementation of this functionality within Aleph.

New Growth

Number of volumes added:

Institution                   April        Total
Indiana University            1,815      176,835
Penn State University        10,661       17,274
University of California     39,079    1,203,556
University of Michigan       79,027    3,939,722
University of Minnesota       7,189       73,065
University of Wisconsin      26,317      330,004
Total                       164,008    5,740,496

Public Domain                48,729      903,519 (~17%)

Presentations

Princeton Forum on Preservation                        April 9
Bilkent University, Ankara, Turkey                     April 20
CDL Resource Liaisons and Users Council Webinars       April 20 and 30
Lucid Imagination Webinar                              April 29
Association of Research Libraries                      April 30
  • Please see http://www.hathitrust.org/papers for links to all HathiTrust presentations, papers, and reports.

May Forecast

  • Deploy redundant hosting of large-scale search service at Indiana site
  • Release Shibboleth authentication and full-PDF public domain download for HathiTrust partner institutions

Update on March 2010 Activities

April 9, 2010 [Download PDF]

Partner Ingest 

Internet Archive Ingest – Staff at the University of California completed quality review of the pilot set of Internet Archive-digitized volumes in March, and submitted a set of final issues to team members at the University of Michigan. These have largely been resolved. Michigan staff also determined the cause of the validation error reported in last month’s update. Correcting the error led to further revisions of the preservation metadata schema and re-evaluation of the validation routines put in place for Internet Archive-digitized content. Updates to these routines are currently being implemented. California sent bibliographic records for a set of 97,000 Internet Archive-digitized volumes to be loaded into HathiTrust. As soon as the updates to the ingest process and fixes for issues raised in QA are in place, download of these volumes will begin.

Local Digitization – The University of Michigan has begun to receive locally-digitized content from several partner institutions for ingest into HathiTrust. Two programmers hired by Michigan in February have started to evaluate the material, determining needs and requirements for ingest, both in terms of digital package specifications and content transformation routines. Jessica Feeman, a programmer at Michigan and the original developer of the data validation and ingest system for HathiTrust, left her position at the end of March to start a family (congratulations, Jessica!). A new position will be opening in April.

Working Groups

Discovery Interface – OCLC loaded test batches of HathiTrust bibliographic records into WorldCat in March. After the batches were reviewed by OCLC and the HathiTrust team, OCLC initiated full-scale loading. At the end of March, 1.1 million HathiTrust records had been added to WorldCat through OCLC’s eContent Synchronization mechanism, and the loading process continues.

HathiTrust and OCLC recently completed a first round of usability testing for the version 1 HathiTrust catalog, involving five participants in individual one-hour sessions. Members of the OCLC and HathiTrust teams are currently analyzing the results of the testing, particularly in relation to HathiTrust’s requirements for the version 1 catalog. Special thanks in this effort are due to the HathiTrust colleagues at Penn State University, where the testing took place, as well as to OCLC for providing gift cards as incentives to participants.

Collaborative Development Environment – Michigan staff are in the process of designing the architecture for the new development environment according to the general direction set by the working group. The design incorporates practicalities such as directory naming conventions that will be compatible with the version control strategy. Staff are also discussing initial provisions for virtualization within the environment, including one virtual web and database environment for each developer, one for pre-release integration testing, and numerous instances for public “beta” exhibition and review of new features. The group is working to transition active HathiTrust development at Michigan to the new environment in April.

Development Updates 

Shibboleth – Staff at the University of Michigan have been discussing the most appropriate set of attributes to request for release to HathiTrust applications via Shibboleth, consulting with experts at partner institutions, including Michigan’s central Information and Technology Services, which will coordinate Shibboleth federation interactions for HathiTrust. Shibboleth will be a mechanism by which HathiTrust is able to provide specialized services, such as full-PDF download of repository volumes, to partners. The final attributes to be requested are eduPersonAffiliation, eduPersonScopedAffiliation, eduPersonTargetedID, and displayName. Registration of the HathiTrust Service Provider is in progress and we hope to release the service in April.
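To make the role of these attributes concrete, here is a minimal sketch of how an application might use them once an identity provider has released them. Only the four attribute names come from the list above. The dictionary shape, the function names, the example scopes, and the access rule itself are hypothetical illustrations, not HathiTrust’s actual policy or code.

```python
# The four attributes named in the update above.
REQUESTED_ATTRIBUTES = (
    "eduPersonAffiliation",
    "eduPersonScopedAffiliation",
    "eduPersonTargetedID",
    "displayName",
)

def missing_attributes(released):
    """List requested attributes that were not released (or are empty)."""
    return [name for name in REQUESTED_ATTRIBUTES if not released.get(name)]

def is_partner_member(released, partner_scopes):
    """Hypothetical rule: any 'member@<scope>' value whose scope is a partner."""
    for value in released.get("eduPersonScopedAffiliation", []):
        affiliation, _, scope = value.partition("@")
        if affiliation == "member" and scope in partner_scopes:
            return True
    return False

# Illustrative data only; scopes and identifiers are made up.
partner_scopes = {"example-partner.edu"}
released = {
    "eduPersonAffiliation": ["member", "staff"],
    "eduPersonScopedAffiliation": ["member@example-partner.edu"],
    "eduPersonTargetedID": ["opaque-id-123"],
    "displayName": ["A. Reader"],
}
print(missing_attributes(released), is_partner_member(released, partner_scopes))
```

A check of this kind is one way a service such as full-PDF download could be limited to users from partner institutions.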

Large-scale Search – Programmers at the University of Michigan continue to investigate queries taking longer than 30 seconds to execute. The present theory is that certain components of the hardware (network cards) are causing intermittent problems that disrupt communication with the Solr server. The focus is on isolating and replacing the problematic cards.

HathiTrust team members from Michigan and Indiana are coordinating on the installation of new servers in Indianapolis to make the large-scale search service redundantly hosted at the Indiana and Michigan sites. This work has required the installation of new electrical and networking capacity in Indiana, which is almost complete. The setup and configuration of the new servers is expected to be fairly simple, as it is a near-replica of the architecture already in place in Michigan.

PageTurner – University of Michigan developers continued to improve the performance of a new service that will deliver full-volume PDFs of public domain materials to users at HathiTrust partner institutions. The service will be available to partner institutions via Shibboleth authentication. Michigan also began development on a new method of delivering individual page images to the HathiTrust PageTurner, that will scale, rotate, and watermark images on the fly. Development is about 75% complete, and the new method is already being used in the collaborative development environment as part of the University of California’s work to integrate GnuBook into the HathiTrust PageTurner. Michigan and California are working together on enhancements to the existing PageTurner interface to incorporate the GnuBook improvements.

Outages – Large-scale search service was unavailable from 10am-1pm EST on March 25 while software and firmware upgrades were applied to the storage systems in Michigan and Indiana. The upgrades did not result in outages for production systems. The large-scale search application is considered beta pending redundant hosting of the service in Indiana. In the future, we will work to communicate planned outages for services like large-scale search despite their beta status. 

New Growth 

Number of volumes added:

Institution                   March        Total
Indiana University              138      175,020
Penn State University         1,469        6,613
University of California      1,940    1,164,225
University of Michigan       72,227    3,860,817
University of Minnesota         880       65,876
University of Wisconsin      11,923      315,650
Total                        88,798    5,588,311
  • 34,904 public domain volumes were added in March, bringing the total number of public domain volumes to 854,790 (approximately 15% of total content).

April Forecast 

  • Complete and deploy Collection Builder Integration with large-scale search
  • Deploy redundant hosting of large-scale search service at Indiana site
  • Register HathiTrust Shibboleth Service Provider with the InCommon Federation

Update on February 2010 Activities

March 12, 2010 [Download PDF]

Top News           

In the last several months, the HathiTrust partners have made steady progress in expanding the repository’s ability to support the variety of digital outputs produced at their local institutions. While the bulk of content in HathiTrust currently is the result of Google’s digitization efforts, preserving and delivering content from libraries’ non-Google sources is an important part of HathiTrust’s mission to meet the needs of libraries broadly, and assemble a comprehensive collection of published materials that is co-owned by libraries themselves. Three items in this month’s update highlight our efforts in this area: our progress in ingesting materials digitized by the Internet Archive, the hiring of two new programmers to focus on the transformations and normalizations involved in bringing in diverse content, and the creation of a demonstration application that uses the HathiTrust Data API to deliver master repository content from non-Google sources to users. We will be highlighting developments such as these in the coming months.

Internet Archive Ingest – Ingest of UC volumes digitized by the Internet Archive was delayed in late February due to a validation error that UM staff encountered, but ingest of more than 200 pilot volumes was begun in early March. Following quality review of the volumes by UC staff and the resolution of any associated issues, download of UC’s Internet Archive-digitized volumes will formally begin. Staff at UC and UM are in the process of compiling technical and procedural documentation related to Internet Archive ingest to share with partner institutions and the community at large.

New Programmer For Non-Google Ingest – UM has hired two new programmers, for a total of 1.7 FTE, to concentrate on developing ingest routines and common workflows for non-Google-produced materials. These will include materials digitized by the Internet Archive and through local digitization efforts at partner institutions.

Data API – The interface to the Data API demonstration application that was undertaken by Michigan in January is available at http://www.lib.umich.edu/two-over-threehundred/. The goal of the application was to use HathiTrust’s Data API to facilitate the location and download of complete book packages for public domain volumes not digitized by Google. The code used to produce the demonstration is also available. The application is still processing the HathiTrust data files, and so will only display a subset of the full data.

Working Groups

Quality, Ingest, and Error Rate – The working group kicked off activities under its recently revised charge in February, and will be meeting on a monthly basis. At this stage, the group is gathering information and planning work items, including building a framework for defining quality principles and developing a varied set of scenarios under which content would be gated from entering HathiTrust. This work will help to spur discussion and identify larger issues that are at play. Members of the group include Paul Fogel (California Digital Library), Peter Gorman (University of Wisconsin), Bryan Skib (University of Michigan), and Paul Soderdahl (University of Iowa).

Discovery Interface – The HathiTrust-OCLC team made significant strides in February towards the version 1 catalog beta implementation, with some adjustments to the projected timeline. Due to changes in OCLC’s product release cycles, the catalog is now expected to be complete in May 2010. The HathiTrust library team is now exploring strategies and requirements for the catalog’s public release, with the guidance of both the HathiTrust Strategic Advisory Board and Executive Committee.

The load of HathiTrust bibliographic metadata to WorldCat remains on schedule. OCLC is currently testing the first batch of records, and large-scale loading will take place throughout the month of March. Preliminary user testing is currently underway at Penn State and will be complete in mid-March, thanks to the collaborative efforts of OCLC and HathiTrust’s usability group.

Collaborative Development Environment – The working group reconvened via conference call in February to discuss strategies for version control. All agreed that the version control tools used should facilitate development at local sites as well as within the environment itself, and allow public availability of the source code. Modern distributed version control systems, including some third-party systems such as GitHub, fit well with these needs, and UM staff will propose an architecture to the group at their next meeting in early March for approval. The group also discussed building logical divisions in the environment to segregate its use for various purposes, such as active code development, integration testing and staging for production release, the presentation of relatively stable “beta” versions of software systems, and replicating and troubleshooting issues live in production.

Ingest

University of Minnesota – Ingest of content from the University of Minnesota began in February, with nearly 65,000 volumes being deposited. All of these volumes are government documents, and are part of a larger effort of the Committee on Institutional Cooperation (CIC – the Big Ten plus the University of Chicago) in partnership with Google to digitize more than 1 million U.S. Federal Documents from their combined collections. The Minnesota documents themselves can be found by clicking on the University of Minnesota facet in the HathiTrust Catalog.

Development Updates

Shibboleth – UM is in the process of finalizing Shibboleth attribute release requirements for HathiTrust applications in coordination with partner institutions, and is registering HathiTrust as a service with the InCommon Shibboleth federation. The release of this enhancement to HathiTrust applications is still planned for a March timeframe.

Large-scale Search – The large-scale search index grew to the point in February that it exceeded the Solr/Lucene limit of 2.1 billion unique terms. Core Lucene developer Michael McCandless graciously provided a patch raising the limit to 274 billion unique terms. Michigan continued performance tests aimed at identifying optimal shard sizes. Staff at Michigan also led team members at CDL on a walk-through of the large-scale search implementation in mid-February.
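For a sense of scale, the 2.1 billion figure is consistent with the largest value of a signed 32-bit integer. That is an assumption about where the limit comes from, since the update does not say. The short sketch below shows the arithmetic and the approximate factor by which the patch raised the limit. The variable names are illustrative.

```python
# Largest signed 32-bit integer (assumed origin of the "2.1 billion" limit).
INT32_MAX = 2**31 - 1
print(f"{INT32_MAX:,}")

# Raised limit as reported in the update.
RAISED_LIMIT = 274_000_000_000

# Approximate factor by which the patch raised the limit.
ratio = RAISED_LIMIT / INT32_MAX
print(round(ratio))
```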

Four new redundant servers for index service arrived at Indiana and will be installed once additional power and networking infrastructure work has been completed, probably in late March. Two new servers for index building arrived in Michigan and are tentatively scheduled for March installation as well, pending staff availability.

PageTurner – Michigan revamped the PageTurner code that generates PDFs from the repository in February, optimizing it for high performance delivery of full-book PDF files containing full-resolution page images. The ability to download full PDF files of HathiTrust public domain volumes will be available to partner institutions when Shibboleth is implemented. Michigan also explored pipelines for fast on-the-fly generation of scaled, rotated, and watermarked page images and developed a prototype image server. Once completed, it will serve all individual page images not encapsulated in PDF.

Outages – There were no outages in February.

New Growth

Number of volumes added:

Institution                February        Total
Indiana University           23,066      174,882
Penn State University           128        5,144
University of California      5,976    1,162,315
University of Michigan       50,873    3,781,841
University of Minnesota      64,966       64,966
University of Wisconsin      35,683      303,727
Total                       108,977    5,434,537
  • 54,555 public domain volumes were added in February, bringing the total number of public domain volumes to 818,886 (approximately 15% of total content).

March Forecast

  • Deploy the new page image server and related changes to the HathiTrust PageTurner
  • Release Shibboleth authentication support
  • Continue large-scale search performance monitoring
  • Complete quality assurance processes for pilot ingest of Internet Archive-digitized materials
  • Begin ingest of all UC's Internet Archive-digitized volumes

Update on January 2010 Activities

February 12, 2010 [Download PDF]

Top News           

New Cost Model – The HathiTrust Executive Committee approved a new cost model for partnership in December that will be adopted by all partners beginning in 2013. In the new model, partners will share in the cost of public domain and open access volumes preserved in HathiTrust, and in the cost of in-copyright volumes that they hold, or have held, in their physical collections. The model will distribute the costs of curating and managing the digital collections in a way that more accurately reflects the benefits each partner receives from deposited volumes. It will also allow institutions to join HathiTrust that do not necessarily have content to deposit, but that wish to support and benefit from the long-term curation and access services that HathiTrust provides. Such institutions are eligible for partnership effective immediately, and do not need to wait for the 2013 general adoption. Details of the new cost model are available at http://www.hathitrust.org/documents/hathitrust-cost-rationale-2013.pdf. Please contact hathitrust-info@umich.edu for additional information and inquiries about partnership.

Disaster Recovery Planning – Following an evaluation of disaster preparedness performed last summer by an IMLS-funded intern, and the hiring of a preservation librarian in November, the University of Michigan is taking steps to formalize and expand HathiTrust’s policies and practices relating to disaster recovery. The UM preservation librarian is leading a process to form a Disaster Recovery Planning Committee and, with support of a winter intern from the UM School of Information, has begun to gather key inventory, personnel, and workflow documentation. Guided by industry standards such as TRAC and best practices in the digital preservation community, the committee will ensure a high level of preparedness for known and unknown risks to the long-term integrity and use of materials in the repository. A preliminary meeting of key staff will occur in February, and membership in the Disaster Recovery Planning Committee will be finalized soon thereafter.

Digital Library Profile – As part of its participation in an NSF EAGER grant awarded in September 2009, HathiTrust completed a technological profile of its repository based on two frameworks developed by Johns Hopkins University. The profile can be found at http://www.hathitrust.org/technology.

Working Groups

Quality – In July 2009, the Strategic Advisory Board (SAB) assembled a working group to investigate issues surrounding the quality of partner institution volumes downloaded from Google. The working group was asked to research and provide recommendations on a quality threshold HathiTrust uses to limit ingest of poor quality volumes. The working group presented its recommendations to the SAB in January and the SAB decided to continue the working group with a revised and expanded charge. The new charge is to a) develop a set of quality principles for HathiTrust, b) monitor quality control as related to user experience, c) track developments in a separate quality working group established by Google and Google library partners following the Google partner summit in October, and d) evaluate HathiTrust practices with regard to thresholding or limiting ingested content. Membership in the new group, called the HathiTrust Quality Ingest and Error Rate Working Group, is currently being determined.

Discovery Interface – With the version 1 catalog beta release only a few months away, the Discovery Interface Working Group is turning its focus to the usability of the catalog and its integration with existing HathiTrust Digital Library services (Collection Builder, PageTurner, and Full-Text Search). The Working Group formed a usability subgroup, which will collaborate with staff at OCLC to begin usability testing of the catalog before it is released. Testing will also be performed in post-release phases. Aspects of the pre-release analysis will include verifying accurate functionality and fulfillment of agreed-upon requirements.

In preparation for loading HathiTrust volumes into WorldCat for the version 1 release, staff at UM provided an API that will allow OCLC to display HathiTrust volume information in WorldCat records.

Collaborative Development Environment – UM staff have been gathering specific topics for the working group to discuss when it reconvenes (now planned for late February), and have developed a draft timeline for the steps ahead. A message to reassemble the group was sent in early February, and scheduling is underway. The area the group will address first is the design of a version control system. UM staff have also begun to research the GlusterFS cluster file system as a storage back-end for the environment.

Storage – The working group tasked with making recommendations on a third instance of storage for HathiTrust presented its final report to the Executive Committee in January. The group concluded that although there were significant benefits to implementing a third instance of storage, given the high level of preservation confidence in HathiTrust and the absence of economic conditions favorable for acquiring and operating new storage, there was no urgency in establishing a new instance. The group noted, however, that HathiTrust should be prepared to establish a third instance of storage if such a course becomes more economically feasible.

The Executive Committee would like to solicit broader feedback from partner institutions regarding these recommendations (especially from a collection development perspective), and requests that thoughts on the report and a third instance of storage be sent by email to hathitrust-info@umich.edu. Those who wish to remain anonymous should indicate this in their email. The full report of the working group is available at http://www.hathitrust.org/projects#wg_storage.

Ingest

General – Ingest rates were low in January, due in part to challenges UC experienced in retrieving bibliographic records from one of its systems. UM loaded the first set of bibliographic records for Minnesota, but could not begin ingest because of problems with Google’s delivery of the content files. Ingest numbers from other institutions were also low because HathiTrust caught up with the rate that partner volumes were made available from Google.

Internet Archive Ingest – UM began testing validation routines on a batch of 200 Internet Archive-digitized volumes from the University of California in January. The teams are revising validation strategies based on the findings of these tests and the results of quality assurance performed by UC staff on transformed but not yet ingested objects. UM and UC will proceed with the ingest pilot in February, testing all aspects of bibliographic and content loading, validation, and access. Completion of the pilot is projected for late February.

New Programmer For Non-Google Ingest – UM extended the bidding period for the new programmer position through mid-January, and several new qualified candidates have been interviewed. UM staff are in the final stages of selecting candidates, and expect to have a new full-time staff member and a new part-time staff member on board by the end of February.

Development Updates

Shibboleth – Shibboleth implementation in HathiTrust is nearly complete. Major portions of the code are in place and UM staff have begun to contact partner institutions to exchange information that will allow individuals from partner institutions to authenticate into HathiTrust. The initial benefit to partners will be increased facility in creating personal collections in Collection Builder, though non-partners will still have the ability to create collections using the University of Michigan “friend account” system. Within the next couple of months, however, full-PDF download of all public domain volumes will also be available to partners. In the long term, HathiTrust hopes to use Shibboleth to extend services such as enhanced access for users with print disabilities and U.S. Copyright Section 108 uses to all member institutions. Deployment of Shibboleth is planned in March.

Data API – In January, staff at the University of Michigan began work on a web application that will use the Data API to facilitate the location and download of complete book packages for public domain volumes not digitized by Google. The application is being created entirely with data and services available to the general public and is meant to demonstrate uses that can be made of the API. The first step of crawling the repository for eligible volumes is in progress, and release of a beta version of the application is expected in February.

Large-scale Search – UM improved logging and log analysis in January, enabling staff to monitor search performance in a way that more closely resembles the user’s experience. UM staff documented changes to large-scale search hardware in a new blog post entitled “Scaling up Large Scale Search from 500,000 volumes to 5 Million volumes and beyond”.

New index servers were ordered for the Indiana site and are scheduled to be in service before the end of March. The current index release process already synchronizes an updated version of the index to be stored in Indiana on a daily basis. Acquisition of the new hardware will provide full redundancy of the large-scale search application servers as well. Two additional servers that will be used exclusively for index building are on their way to the Michigan site, and one server originally purchased for production service is being re-purposed for testing and development.

PageTurner – PageTurner development was slowed in January but will pick up in February and March as staff time devoted to the ingest of materials from the Internet Archive decreases.

Outages – There were no outages in January.

New Growth

Number of volumes added:

  January Total
Indiana University 38,344 151,816
Penn State University 0 5,016
University of California 972 1,156,339
University of Michigan 71,904 3,730,968
University of Wisconsin 691 268,044
Total 104,342 5,312,183
  • 5,384 public domain volumes were added in January, bringing the total number of public domain volumes to 764,331 (approximately 14% of total content).

February Forecast

  • Complete and deploy Shibboleth authentication support
  • Complete quality assurance processes for pilot of UC’s Internet Archive-digitized materials and begin ingest into the repository
  • Continue large-scale search performance monitoring
  • Make progress toward the integration of Collection Builder functionality in full-text search results

Update on December 2009 Activities

January 15, 2010 [Download PDF]

Top News           

Columbia Partnership – HathiTrust is very pleased to welcome Columbia University as its newest partner. A representative of HathiTrust will be travelling to Columbia in late January to give a full introduction to repository operations, current activities, and future plans. We look forward to the experience and expertise that Columbia will bring to the enterprise, and the new possibilities that are opening for HathiTrust as it continues to expand its membership and its collections. A full press release on the new partnership can be read at http://www.columbia.edu/cu/lweb/news/libraries/2009/20091216.hathi.html.

5 Million Volumes – A significant milestone was passed in December as HathiTrust exceeded 5 million volumes in digital holdings. More than three-quarters of a million of these are in the public domain. A steady rate of growth is expected to continue in 2010, and partner collections are projected to grow to more than 8 million volumes.

TRAC Audit – In early December, HathiTrust began a process with the Center for Research Libraries (CRL) to assess the digital repository in relation to the Trustworthy Repositories Audit and Certification (TRAC) criteria. The assessment is scheduled to proceed until mid-February, and the findings will be publicly available. More information about the audit can be found on the CRL website at http://www.crl.edu/archiving-preservation/digital-archives/certification-and-assessment-portico-and-hathitrust.

Bib API – HathiTrust has released a new bibliographic API that enables retrieval of descriptive and rights information for objects in the repository based on standard identification numbers (e.g., ISBN, ISSN, LCCN, OCLC). The API is a replacement for the (now deprecated) Rights API and the specification is available at http://www.hathitrust.org/bib_api.

Working Groups

Discovery Interface – OCLC is completing preparations for the import of HathiTrust data into WorldCat Local (WCL). The installation of a HathiTrust WCL instance is scheduled to be complete in late February, and loading of records into this first version of the joint catalog will begin in March 2010. Looking towards version 2 of the catalog, the HathiTrust-partner working group began reviewing its scope and membership needs as its purview expands beyond bibliographic metadata in the catalog to include the integration of features such as full-text search and the HathiTrust Collection Builder. The group was renamed the HathiTrust Discovery Interface Working Group (from HathiTrust/OCLC Catalog) to reflect this broadening scope. In December, the HathiTrust Executive Committee approved a proposal to have the working group report to the Strategic Advisory Board (SAB), ensuring stronger alignment of the development and delivery of discovery services with future directions in HathiTrust as a whole.

Collaborative Development Environment – Staff at the University of Michigan completed setup of one of the servers that will be used in the initial proof-of-concept partner development environment. The server is configured with all of the tools and software needed to support the PageTurner development that the University of California and Michigan engaged in collaboratively in 2009. A developer at UC has begun to test features of the environment and will be reporting and providing feedback to the working group when the full group is re-engaged in January.

Research Center – The RFP produced by the working group was approved by the Executive Committee in December and is available on the HathiTrust website at http://www.hathitrust.org/documents/hathitrust-research-center-rfp.pdf.

Ingest

Internet Archive Ingest – During the month of December, staff from UC and UM finalized many of the procedures and conventions related to the ingest of Internet Archive-digitized books into HathiTrust. These included file identification, preservation and technical metadata elements, content transformation and validation processes, error logging, and exception handling. UC delivered bibliographic metadata for an initial set of IA-digitized volumes to UM, and UM worked steadily on coding the transformation and validation processes for ingest. An end-to-end pilot test, including download, ingest, and quality review of ingested items, will be performed in late January.

New Programmer For Non-Google Ingest – Applications are still being taken for a programmer to receive and prepare non-Google materials for ingest into HathiTrust. Review of applications and interviews are being conducted simultaneously. The bidding process will close in mid-January, but will be extended again if an applicant is not selected. Full-time and part-time positions are being considered, and it is increasingly likely that one of each may be filled.

Development Updates

Shibboleth – In the near future HathiTrust will be implementing Shibboleth as a mechanism for inter-institutional authentication into HathiTrust. Distributed authentication will make it easier for users to take advantage of personalized services in HathiTrust, such as the Collection Builder. It will also enable the delivery of enhanced services to HathiTrust partner institutions. Staff at UM discussed the implementation strategy for Shibboleth in December and installed the Shibboleth service provider software on development servers to begin the work of integration. A forecast for the timeline of implementation will be included in the next update.

Large-scale Search – Staff at UM continue to refine the daily index update and release workflow, making it more resilient to problems that are sometimes encountered during indexing. New server equipment will soon be purchased for use at the Indiana site, and a schedule will be projected for continuous new hardware acquisition to maintain performance levels as the size of the index grows. As part of index and query response time testing, UM staff also updated and released a revised cache-warming procedure based on production log analysis. Warming (pre-populating) the cache of completed queries improves search performance.
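The idea of cache warming from log analysis can be illustrated with a minimal Python sketch. It is not the procedure UM staff released; the log format (one query per line), the function names, and the sample queries are all assumptions made for illustration.

```python
from collections import Counter


def top_queries(log_lines, limit=3):
    """Return the most frequent queries in a log, most common first.

    Assumes one query per log line (a hypothetical format).
    """
    counts = Counter(line.strip().lower() for line in log_lines if line.strip())
    return [query for query, _ in counts.most_common(limit)]


def warm_cache(queries, run_query):
    """Run each query once so its result is cached before users ask for it.

    run_query stands in for whatever function sends a query to the search
    service; here the returned dict plays the role of the cache.
    """
    return {query: run_query(query) for query in queries}


# Made-up sample log for illustration only.
sample_log = ["civil war", "Civil War", "darwin", "civil war", "darwin", "botany"]
warm_list = top_queries(sample_log, limit=2)
warmed = warm_cache(warm_list, run_query=len)
```

The design choice is to spend warming time only on the queries that production logs show to be most common, since those are the ones most likely to be repeated.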

Outages – There were no outages in December.

Partner News

(What is your institution doing with HathiTrust? Let us know!)

UC and SFX – A University of California group has started work on a project to demonstrate proof-of-concept success in exposing HathiTrust public domain books through UC’s UC-eLinks service (SFX). The project is investigating the various HathiTrust APIs capable of supporting this service, and in addition to gathering usage statistics for the new target, will report on the functionality, usefulness, and viability of each of the APIs for future endeavors. The target will eventually be made available to ExLibris so that it can be added to the SFX package for all customers, but will be available to HathiTrust partners who use SFX before then.  

New Growth

Number of volumes added:

  December Total
Indiana University 16,923 133,482
Penn State University 233 5,016
University of California 263,089 1,155,367
University of Michigan 230,881 3,659,874
University of Wisconsin 12,137 267,353
Total 516,514 5,221,092
  • 41,006 public domain volumes were added in December, bringing the total number of public domain volumes to 758,947 (approximately 15% of total content).

January Forecast

  • Staff visit to Columbia
  • Begin Internet Archive ingest pilot
  • Discuss the development of a validation mechanism for repository content using the Data API
  • Begin to explore ingest and delivery of born-digital objects
  • Finalize a draft report and recommendation on a third instance of HathiTrust storage

Update on November 2009 Activities

December 11, 2009 [Download PDF]

Top News

Release of Large-scale Search Application – On November 19, HathiTrust launched a new service enabling full-text search of all volumes in the repository. Indexing of newly ingested volumes is ongoing, but the release of the first production index (containing approximately 4.6 million volumes) is the culmination of more than a year of research and benchmark testing conducted by staff at the University of Michigan. This new service dramatically changes the way researchers are able to use our collections and, along with the release of the bibliographic catalog in May, demonstrates HathiTrust’s commitment to providing sophisticated ways of accessing and using collections preserved in the digital repository. The official news release is available at http://www.ns.umich.edu/htdocs/releases/story.php?id=7426. More can be read about large-scale search in the Development Updates section below.

Development Opportunities – This month we provide the second in a series of ‘columns’ about development opportunities in HathiTrust. These are opportunities that have been identified by HathiTrust partners, and are available to HathiTrust partners, to create key systems or services that will benefit the partnership as a whole. Each month we will provide a brief description of one of these opportunities, give a sense of the level of priority that it has, and provide additional information about what might be involved in developing and supporting it. The opportunities are also listed on the HathiTrust website at http://www.hathitrust.org/projects. The opportunity described this month is usage reporting.

Usage reporting

Description: A clearer sense of the level of use of library materials in HathiTrust will help shape extended activities such as collection management and further digitization. Volumes in HathiTrust may, in some cases, be read in their entirety, while in other cases they may only be searched. To what extent are search-only materials viewed?  Which works that are fully viewable are displayed? Where does that access originate? As HathiTrust introduces authentication, to what extent do users authenticate to get access to a fuller array of services? How frequently is the HathiTrust catalog searched, and how does that use compare to the use of full text indexes? These are some of the questions that an improved service for usage reporting will  help to answer.

Resources available: HathiTrust retains raw log data and registers some uses through Google analytics.

Priority: moderate

Additional details: An institution that undertakes this work must:

  • clearly outline a commitment to undertake appropriate measures with regard to user privacy (e.g., with regard to IP addresses and, at such time that HathiTrust implements Shibboleth, user authentication information). Such efforts should include secure storage of sensitive data, appropriate aggregation of data so as to anonymize use by specific individuals, and a commitment to not transfer private user data to a third  party;
  • outline a process for design and specifications with a group of interested partner libraries;
  • give consideration to producing reports consistent with appropriate library community standards (e.g., COUNTER and SUSHI).
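One way to read the aggregation requirement above is sketched below in Python. It groups page views by network block rather than by individual IP address. The log layout, the function names, and the choice of a /24 block are assumptions for illustration, and truncating addresses is not by itself a guarantee of anonymity.

```python
import ipaddress
from collections import Counter


def network_of(ip_text, prefix=24):
    """Map an IPv4 address to its enclosing network block, e.g. a /24."""
    network = ipaddress.ip_network(f"{ip_text}/{prefix}", strict=False)
    return str(network)


def aggregate_views(log_entries, prefix=24):
    """Count views per (network block, volume) instead of per individual IP.

    log_entries is an iterable of (ip_address, volume_id) pairs, which is a
    hypothetical log layout.
    """
    counts = Counter()
    for ip_text, volume_id in log_entries:
        counts[(network_of(ip_text, prefix), volume_id)] += 1
    return dict(counts)


# Addresses come from the ranges reserved for documentation.
entries = [
    ("192.0.2.10", "vol.001"),
    ("192.0.2.77", "vol.001"),
    ("198.51.100.5", "vol.002"),
]
report = aggregate_views(entries)
```

A report built this way can still answer where access originates at the level of an institution or network, while the raw addresses stay in secure storage or are discarded.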

Working Group on Computational Research Center – Working group members provided final feedback on the Call for Proposals for a HathiTrust Research Center that will be distributed to HathiTrust institutions in December. The Research Center will make textual and image data in HathiTrust available for a wide variety of computational research and analysis purposes, including research in areas of digital humanities, linguistics, automated translation, and searching and indexing techniques.

Working Group on Collaborative Development Environment – Additional effort devoted to the release of Large-scale Search in November delayed further progress on the development environment, but it is now a prime area of focus. The first milestone, a preliminary proof-of-concept environment that supports current development efforts, will be ready for developers at the University of Michigan and the University of California to begin testing in the first half of December. Once this milestone is reached, the working group will be re-engaged to discuss the current provisions of the development environment and explore next steps.

New Programmer For Non-Google Ingest – Staff at the University of Michigan received and reviewed applications for a position to aid in the transformation and modification of non-Google content for ingest into HathiTrust. Five candidates were interviewed by phone in November and three were invited for in-person interviews. After a period of review, the search committee decided to continue the search and repost the position. An additional avenue, hiring one or more part-time student employees who would work under close supervision, is also being considered.

Internet Archive Ingest – Much progress was made toward the ingest of content digitized by the Internet Archive in November. The University of California shared specifications for a preferred set of files to be downloaded into HathiTrust with the broader community of Internet Archive digitization partners, and received constructive feedback from the group. In continued weekly calls, staff at UC and UM discussed procedures and conventions for content transformation, file identification, and preservation and technical metadata, as well as error logging, exception handling, and policy issues surrounding the deposit of digital objects. The ingest team is working to have practices surrounding many of these issues finalized by mid-December, when UC will deliver bibliographic metadata for an initial set of IA-digitized volumes to UM. Once the transformation and validation processes for ingest have been finalized and coded, UM will conduct a pilot test, downloading and ingesting this initial set of volumes. It is hoped that the full pilot, including quality review of ingested volumes, will be completed by mid-January.

Changes to Tab-delimited HathiTrust Metadata Files –  As of December 1, rights determination reason codes are included in the metadata files available for download at http://www.hathitrust.org/hathifiles. Please see the file specification at http://www.hathitrust.org/hathifiles_metadata for updated information.

Development Updates

Large-scale Search – The launch of HathiTrust’s large-scale search application was postponed in October in order to acquire additional hardware to accommodate new index growth. Due to a variety of factors including a delay in hardware delivery, staff at the University of Michigan altered their index storage strategy and reconfigured the Solr index servers at Michigan to use the Isilon storage system as a back-end. In addition to solving issues related to the size of the index, moving from existing direct-attached storage to the Isilon network-attached storage more readily accommodates the significant index growth that occurs during routine index optimization. The move to Isilon is a temporary strategy, however, and staff at UM will be investigating alternative options for storing the large-scale search index over the long term.

After the storage reorganization, a small backlog of indexing was completed and a new automatic daily indexing process was developed. The University of Michigan launched the full-text service in mid-November and it is performing well.

With an eye toward achieving full redundancy of the search service, staff at UM implemented a nightly synchronization of the index to the Indiana site. Work toward redundancy is ongoing, however, and will involve further research to determine the optimal size of index shards. The size of index shards will help to determine the optimal number of index servers to deploy to guarantee adequate search performance, as well as the additional server deployments and workflows needed to support continuing testing of the search system, routine indexing, and volume re-indexing. Once complete, additional equipment will be purchased and installed at both the Michigan and Indiana sites as appropriate to establish full redundancy.

In additional ongoing work, staff at UM performed analysis of post-release query logs to improve performance testing and cache warming.

HathiTrust/OCLC Catalog – On November 20th in Chicago, the HathiTrust Discovery Interface team met with the corresponding OCLC-WorldCat Local implementation project team for a productive visioning session of the HathiTrust catalog beyond version 1 due in April 2010. Each group shared its long-term vision for the project, and together began to identify areas of common interest and commitment for the year of work following the release of version 1. The HathiTrust team’s draft vision document is available for review and comment at http://www.hathitrust.org/documents/hathitrust-discovery-vision.pdf.

Ingest – The University of California sent shipments of bibliographic data from its Santa Cruz and San Diego campuses to the University of Michigan for ingest in November, totaling approximately 400,000 volumes. Ingest of these volumes, in addition to 200,000 more that are expected from UC’s North Regional Library Facility, will bring HathiTrust to more than 5 million volumes by the end of the year. UM received an initial shipment of bibliographic metadata from the University of Minnesota in November as well. As these and subsequent records from Minnesota are loaded in HathiTrust, ingest of the digital volumes will begin.

Fewer new volumes were ingested into HathiTrust in November than expected because a large number of volumes were re-processed and made available by Google. Google continually re-processes images and OCR of volumes to make improvements and corrections, and these volumes enter a single queue with newly processed volumes for ingest.

Collection Builder –  Following the meeting with OCLC staff in Chicago, the focus of Collection Builder integration in the temporary catalog has shifted to integration in the full-text search application. This move sidesteps cross-site linking issues that were encountered, and will provide useful experience on which to build Collection Builder inclusion in the HathiTrust Catalog at a later time.

Outages – There were no outages in November. 

New Growth

Number of volumes added:

  November Total
Indiana University 32,427 116,559
Penn State University 108 4,783
University of California 105,864 892,278
University of Michigan 11,729 3,428,993
University of Wisconsin 12,511 255,216
Total 115,890 4,697,829
  • 15,980 public domain volumes were added in November, bringing the total number of public domain volumes to 717,941 (approximately 15% of total content).

December Forecast

  • Refine indexing methods, including frequency of complete index optimization and best index shard size
  • Develop processes for rebuilding the entire index
  • Finalize specifications for content digitized by the Internet Archive and prepare for ingest pilot
  • Add Collection Builder functionality to the HathiTrust full-text search interface

Update on October 2009 Activities

November 13, 2009 [Download PDF]

Top News

Development Opportunities – This is the first in a regular ‘column’ of development opportunities for HathiTrust. System and software development in HathiTrust is carried out through contributions from HathiTrust partners. Although many HathiTrust systems and services must sit on central servers, our initiative relies on open systems and modularity, making it possible for partner institutions to develop key pieces of functionality. In this new column, each month we will provide a brief description of a system or service that has been proposed by HathiTrust partners, attempt to give a sense of the level of priority for that system or service, and provide additional information about what might be involved in developing and supporting it. These services will also be listed on the HathiTrust website at http://www.hathitrust.org/projects. This month we focus on an opportunity that has arisen directly from the expansion of HathiTrust day-to-day operations and the needs of new partners:

Ingest reporting

Description: The deposit of digital volumes and associated metadata into HathiTrust, referred to as “ingest,” involves a significant number of updates to administrative systems — bibliographic records added, digital volumes ingested, and access rights established. Many data elements will be of interest to the contributing institution, and each institution may drive local processes based on the current status of content in the repository (e.g., the percentage of in-copyright works may highlight the value of performing copyright determination work, or a low number of items available in the Google Return Interface may stimulate exploratory discussions with Google). A system that combines all of the available streams of administrative data into a simple web-based reporting system may have considerable value not only for transparency but also for local decision-making.

Resources available: Staff at the University of Michigan and the University of California have assembled a table of relevant data feeds with a brief description of each in the following document: http://bit.ly/2Jk5mm.

Priority: moderate

Additional details: An institution that undertakes this work must:

  • outline a process for design and specifications with a group of interested HathiTrust partner libraries;
  • in consultation with partner libraries, give consideration to authentication and authorization needs for this system.

Upcoming Opportunities

  • Usage reporting
  • Print holdings database
  • Ingest transformation

HathiTrust participates in grant from Mellon Foundation – With support from the Andrew W. Mellon Foundation, Associate Professor Paul Conway of the University of Michigan is leading a one-year research and planning project to find and test new procedures for validating the quality and usefulness of digital objects in HathiTrust. The short-term goal of the project is to prepare and submit a funding proposal to a federal granting agency to explore possibilities for validating these characteristics through manual and automated methods. The long-term goal is to develop criteria and methods to brand the trustworthiness of volumes in HathiTrust and other digital repositories for fulfilling specific purposes (e.g., reading, printing volumes on demand, and performing computational research). Such a branding or certification process would give assurance that content within a repository is worthy of preservation, and increase the value of that content in broader discussions about storage and management solutions for both digital and print collections.  

Google Summit – At a periodic meeting between Google and partner libraries, HathiTrust members worked with Google on issues related to the ingest of materials digitized by Google. Some topics discussed included strategies for improved metrics with regard to the quality of materials, and volumes rejected as duplicates from Google’s scanning workflow. The metrics discussed around quality could potentially be used to characterize or filter content that enters the repository (e.g., in the case of poor quality, to prevent ingest). The duplicate analysis conducted by HathiTrust partners is now being factored into Google’s continuing development of duplicate detection and return. Evaluators at the University of Michigan will continue to examine volumes returned as duplicates throughout the semester.

Working Group on Computational Research Center – The working group submitted its final report to the HathiTrust Executive Committee in October, containing specifications for a HathiTrust Research Center and a request for proposals from interested HathiTrust institutions to build and host the Research Center. The Executive Committee has reviewed the document and, pending final edits from the working group, will distribute the RFP to the partner institutions in November.

Working Group on Collaborative Development Environment – Michigan staff observed a problem with a hard drive in one of the nodes in the development environment cluster and spent time in October troubleshooting the problem and investigating other potential options for hard drive configuration on the nodes. As a result of this investigation, the system BIOS on all nodes will be upgraded and one of the nodes will need to be rebuilt. Work continues on setting up a preliminary development environment on the first node.

New Programmer For Non-Google Ingest – Applications for a programmer position at the University of Michigan to aid in the transformation and normalization of content to be ingested from a variety of digitization sources have been received and reviewed. UM has started the interview process and hopes to have the new programmer in place as soon as possible. The partners made the decision to centralize this ingest functionality initially in order to expedite the inclusion of non-Google content in the repository. Over time it is expected that individual partners will take a greater role in validating and preparing their content for ingest, leveraging tools and processes that result from this initial investment.

Internet Archive Ingest – Weekly conversations centered on the ingest of content digitized by the Internet Archive continued in October between staff at the University of Michigan and University of California. Particular focus was placed on determining the standard identifier scheme that should be used for the content when it is ingested into HathiTrust. The University of California’s ARK identifiers, which exist for nearly all of its Internet Archive volumes, appear to be the most promising. Staff at UM have begun to test these identifiers in repository processes to detect any issues that may arise.

The University of California revised its set of preferred files to be downloaded from the Internet Archive for inclusion in the HathiTrust ingest package. The spec will be distributed to other IA partners in the near future for comments. UC also engaged in analysis of bibliographic data of IA-digitized files from its different campuses and continued development of an approach to authoritatively identify an institution’s volumes in the Internet Archive.

Upcoming Changes to Tab-delimited HathiTrust Metadata Files – As reported in last month’s update, beginning with the full metadata file produced on December 1, 2009, additional fields will be added to the tab-delimited HathiTrust metadata files that are provided at http://www.hathitrust.org/hathifiles (a description of the files is available at http://www.hathitrust.org/hathifiles_metadata). Fields to be added include the copyright determination reason code and the date the database entry was last updated. With this data included, the tab-delimited files will become an ongoing accessible source for information on how and when rights determinations are made. The new tab-delimited fields will be added to the end of the current record structure in order to minimize any potential disruption for existing users of these files.
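Why appending fields to the end of each record limits disruption can be seen in a short Python sketch of a reader that addresses fields by position. The column names and sample values below are hypothetical and are not taken from the file specification; consult the description at the URL above for the actual layout.

```python
import csv
import io

# Hypothetical names for the leading columns; the real layout is defined
# in the HathiTrust file specification, not here.
KNOWN_FIELDS = ["volume_id", "access", "rights"]


def read_records(handle, known_fields=KNOWN_FIELDS):
    """Yield one dict per tab-delimited line.

    Fields are matched to names by position, so columns appended after the
    known ones do not shift anything; they are collected under "extra".
    """
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        record = dict(zip(known_fields, row))
        record["extra"] = row[len(known_fields):]
        yield record


# Made-up sample lines: the second has two fields appended at the end.
old_file = io.StringIO("vol.001\tallow\tpd\n")
new_file = io.StringIO("vol.001\tallow\tpd\texample-reason\t2009-12-01\n")

old_records = list(read_records(old_file))
new_records = list(read_records(new_file))
```

A reader written this way returns the same values for the original columns before and after the change, which is the property the append-at-end decision is meant to preserve.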

Development Updates

Large-scale Search – Staff at the University of Michigan successfully indexed all volumes in HathiTrust using the newly acquired hardware. However, the official launch of the large-scale search application was postponed in order to acquire additional hardware to accommodate new index growth. The original estimate of storage requirements turned out to be low once common-grams technology was introduced. Common-grams offer significantly better search performance but result in an increased index size. The very large number of volumes ingested into the repository in October contributed to the immediate need for more indexing space as well. Optimization of the index, a process occurring at regular intervals, requires storage space of as much as 3 times the size of the index shard being optimized.
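The trade-off between search speed and index size can be illustrated with a simplified Python sketch of the common-grams idea. This is not Solr's implementation: the function name, the underscore separator, and the tiny list of common words are assumptions for illustration.

```python
def common_grams(tokens, common_words):
    """Emit every token, plus a two-word term wherever a common word is
    next to another word.

    A phrase query containing a common word can then look up the rarer
    two-word term instead of scanning every occurrence of the common word.
    """
    output = []
    for i, token in enumerate(tokens):
        output.append(token)
        if i + 1 < len(tokens):
            following = tokens[i + 1]
            if token in common_words or following in common_words:
                output.append(f"{token}_{following}")
    return output


tokens = "the origin of species".split()
with_grams = common_grams(tokens, common_words={"the", "of"})
# with_grams holds the four original tokens plus three two-word terms.
```

In this small example four tokens become seven indexed terms, which shows in miniature why the technique increases the size of the index.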

Faceting of search results, a feature supported by Solr, was further explored in October. Faceting requires the addition of bibliographic data to the full-text index. A faceted index was built across two shards to look for potential problems in scaling. Early indications are that performance is only affected slightly with the facets employed.

HathiTrust/OCLC Catalog – After finalizing metadata requirements for the version 1 catalog in September, the HathiTrust/OCLC Catalog team turned its attention in October to interface requirements. The team is currently finalizing interface requirements for version 1 of the catalog and has agreed to engage in collaborative usability testing during the first quarter of 2010. Meanwhile, OCLC’s e-content synchronization work for HathiTrust remains on schedule, and is expected to be completed by the end of the calendar year.

Ingest – HathiTrust ingested a record 553,963 volumes in October. These included nearly 5,000 volumes from Penn State and initial loads of volumes from the University of California’s Santa Cruz and San Diego campuses. Ingest of volumes from Penn State will continue in November. Subsequent shipments of metadata for up to 600,000 additional volumes from UC campuses are expected in November. Ingest of these volumes will begin shortly thereafter.

Prototype for New HathiTrust PageTurner –  Enhancements to the HathiTrust PageTurner application and integration with the open source GnuBook were on hold in October as development efforts at Michigan focused on large-scale search and initial configuration of the collaborative development environment. The collaborative environment will enable staff at the University of California to fully test and troubleshoot GnuBook functionality in production conditions. Development of an “image API” is still needed to deliver page images from the repository for display in GnuBook.

Collection Builder – Michigan further explored integration of Collection Builder functionality into the temporary catalog search interface. Some difficulty was encountered due to cross-site linking restrictions, but options will continue to be explored.

Outages – There were no outages in October. 

New Growth

Number of volumes added:

October Total
Indiana University 64,614 84,132
Penn State University 4,675 4,675
University of California 264,710 786,414
University of Michigan 206,283 3,417,264
University of Wisconsin 20,430 242,705
Total 553,963 4,535,190
  • 60,791 public domain volumes were added in October, bringing the total number of public domain volumes to 701,961 (approximately 15% of total content).

November Forecast

  • Fully deploy comprehensive full-text search
  • Continue to explore facets in full-text search
  • Continue to research solutions for adding Collection Builder functionality to the HathiTrust catalog search interface
  • Begin to develop HathiTrust METS specifications for content digitized by the Internet Archive
  • Begin preparations to conduct usability testing on the HathiTrust/OCLC catalog interface

Update on September 2009 Activities

October 9, 2009 [Download PDF]

Top News

HathiTrust participates in grant from NSF – Sayeed Choudhury of Johns Hopkins University, John Wilkin of the University of Michigan, and Amy Friedlander of the Council on Library and Information Resources (CLIR) are co-PIs in an NSF EAGER grant to determine the needs and requirements for developing an open-access repository for publications arising from NSF-funded research. The PIs will leverage Johns Hopkins’ experience in evaluating digital repositories, HathiTrust’s experience with large-scale infrastructure and ingest of digital objects, and CLIR’s experience and facility in bringing together groups of experts to determine next steps and directions on targeted issues. CLIR will host a series of workshops focusing on technical requirements, business and policy concerns, and organization and operations issues relating to the open-access repository. Johns Hopkins and HathiTrust will evaluate various technical systems based on the recommendations from the workshops. The creation of a sustainable, efficient, and scalable model to deliver the products of NSF-funded research to users at no cost will have a transformative impact on the dissemination and use of this valuable work.

University of Michigan Press Backfile and "Buy a Reprint" Links in HathiTrust – HathiTrust has begun ingest of the majority of the published backfile of the University of Michigan Press. More than 350 volumes are now available in the temporary catalog and the HathiTrust PageTurner, with an option to purchase print copies of many of the volumes in the PageTurner. The collection is the first of what is hoped will become many collections or bibliographies in HathiTrust that are maintained by official sources such as organizations, faculty, and librarians. The partners are still working on a name for these types of collections. More information about the Press partnership, including links to the official press release and the collection itself, is available at http://press.umich.edu/digital/hathi. Full-text search is available inside the UM Press collection and all other HathiTrust collections (see the Collection Builder home at http://babel.hathitrust.org/cgi/mb?a=listcs;colltype=pub).

Returned Duplicates — The University of California, the University of Wisconsin, Indiana University, and the University of Michigan have undertaken a review of volumes returned by Google as duplicates to better understand how duplicate determination takes place. During the month of September, staff members evaluated materials that were rejected by Google in August, identifying matches and potential mismatches. Results are currently being compiled and analyzed, and will be presented at the Google Partner Summit.

Working Group On Computational Research Center – The Research Center advisory group has completed its initial round of discussions on the demand, structure, content inclusion, legal considerations and funding of the Research Center. A report on that work will be submitted to the Executive Committee in the coming weeks. The group identified the need for additional strategies to gather specific information about the composition and ongoing use and support of the Research Center. A plan to assemble and incorporate this information should be in place in October as well.

Working Group on Storage – A series of teleconferences has led to the construction and refinement of a table defining the important decision criteria for adding a third instance of HathiTrust storage. By mid-October the group will develop a version of these criteria with institution-specific weighting factors. It will then work to reconcile the weightings and develop a final recommendation.

Working Group on Collaborative Development Environment – Michigan staff have completed operating system installs on the initial development environment equipment. Staff will next configure one of the development servers with the base set of software required to support known demands on the environment, including shared development with staff at the University of California on the HathiTrust PageTurner. The initial configuration will be documented and discussed with the working group for further revisions and enhancements.

New Programmer for Non-Google Ingest – In the near future the HathiTrust partners will hire a developer dedicated to receiving non-Google materials from their respective institutions and preparing them for ingest into the repository. The new hire will speed the addition of these materials to HathiTrust and develop specifications and processes that will be applicable to content from new partners in the future.

Internet Archive Ingest — Staff members from the University of California, the University of Michigan, and the University of Illinois held a teleconference in late September to discuss the file formats for Internet Archive-digitized content that will be included in the HathiTrust book package. The partners are working to build consensus on a package that will meet the needs of all institutions contributing this content. The University of Michigan and University of California held two teleconferences in September to discuss issues surrounding ingest itself, such as book package identifiers and ways of preparing ingested OCR for use in full-text searching and viewing applications.

Upcoming Changes to Tab-delimited HathiTrust Metadata Files — Beginning with the full metadata file produced on December 1, 2009, additional fields will be added to the tab-delimited HathiTrust metadata files that are provided at http://www.hathitrust.org/hathifiles (a description of the files is available at http://www.hathitrust.org/hathifiles_metadata).

Fields to be added include the rights determination reason code and the date of last rights determination. With this data included, the tab-delimited files will become an ongoing accessible source for information on how and when rights determinations are made. The new tab-delimited fields will be added to the end of the current record structure in order to minimize any potential disruption for existing users of these files. More details on this change will be included on the website as they become available.
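
The reason for appending the new fields to the end of the record can be shown with a short sketch: a reader that maps columns by position and ignores anything past the columns it knows about keeps working when fields are added. The column names and sample values below are made up for illustration; the actual layout is the one described on the hathifiles pages.

```python
import csv
import io

# Hypothetical column names, not the real file layout.
KNOWN_FIELDS = ["volume_id", "access", "rights"]

def read_records(handle, known_fields=KNOWN_FIELDS):
    """Yield one dict per line, keeping only the leading fields we know about."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        # zip stops at the shorter sequence, so extra trailing columns are ignored.
        yield dict(zip(known_fields, row))

# Made-up sample lines: the second has two extra fields appended at the end.
old_style = "mdp.001\tallow\tpd\n"
new_style = "mdp.001\tallow\tpd\tbib\t2009-12-01\n"

old = list(read_records(io.StringIO(old_style)))
new = list(read_records(io.StringIO(new_style)))
print(old == new)
```

A reader written this way sees the same values before and after the change. A reader that checks for an exact column count would need to be updated.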

Development Updates

Large-scale Search Launch October 19 – In September, the University of Michigan worked to revise and debug production index-building routines to support a comprehensive index of HathiTrust volumes. This index is distributed across five servers with two Solr shards, or index fragments, on each server. In the process of running the routines it was confirmed that Logical Volume Manager (LVM) snapshots could be used effectively to deploy index updates. Concurrent testing of the indexes in the new search environment showed a significant improvement in performance over the current environment, as had been expected. The new full-text search service is targeted for release on October 19. When it is live, the full text of the more than 4 million volumes in HathiTrust will be searchable by anyone with a Web browser. At that time, a new portal interface will replace the current page at http://catalog.hathitrust.org, providing access to full-text search, bibliographic search, and linking to custom collections in the Collection Builder.
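
For readers unfamiliar with sharded Solr indexes, the following sketch shows one way a layout of five servers with two shards each could be addressed in a single search. Solr's distributed search accepts a shards request parameter listing each shard as host:port/path; the host names and shard paths here are hypothetical, and only the five-by-two layout comes from this report.

```python
from urllib.parse import urlencode

# Hypothetical host names; only the 5 x 2 layout comes from the report.
SERVERS = [f"search-{n}.example.org:8983" for n in range(1, 6)]
SHARDS_PER_SERVER = 2

def shard_addresses(servers=SERVERS, per_server=SHARDS_PER_SERVER):
    """List every shard as host:port/path, the form the shards parameter takes."""
    return [
        f"{server}/solr/shard{slot}"   # shard path naming is an assumption
        for server in servers
        for slot in range(1, per_server + 1)
    ]

def distributed_query(text):
    """Build the query string for a search that fans out to all shards."""
    params = {"q": text, "shards": ",".join(shard_addresses())}
    return urlencode(params)

print(len(shard_addresses()))
print(distributed_query("moby dick"))
```

The server that receives such a request queries every listed shard and merges the results, so each shard only has to search its own fragment of the index.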

With the release of full-text search on the horizon, HathiTrust has begun exploring options for offering faceted browsing of content in conjunction with full-text search. The University of Michigan has built and performed preliminary testing on an index of 500,000 volumes that includes metadata suitable for faceting of search results. The tests suggest that the impact of faceting on full-text search performance will be tolerable in the new environment.

Principal developers for the open source Solr software integrated Michigan’s contribution of common-grams code into the Solr code base. It is now a permanent feature of Solr and, of course, the HathiTrust indexing process.
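
The idea behind common-grams can be sketched in a few lines: wherever a very common word sits next to another word, the pair is also indexed as a single joined token, so a phrase query can match the much rarer pair instead of reading the long list of occurrences of the common word. The function below is a simplified illustration of that idea, not the Solr code, and the three-word list is a stand-in for a tuned list of common words.

```python
def common_grams(tokens, common_words):
    """Return the tokens plus a joined bigram wherever a common word is adjacent."""
    output = []
    for i, token in enumerate(tokens):
        output.append(token)
        if i + 1 < len(tokens):
            nxt = tokens[i + 1]
            # Pair up neighbours only when at least one of them is a common word.
            if token in common_words or nxt in common_words:
                output.append(f"{token}_{nxt}")
    return output

COMMON = {"the", "of", "to"}  # tiny illustrative list
print(common_grams("the rain in spain".split(), COMMON))
```

In this example only the pair beginning with "the" is joined, because none of the other words are on the list.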

HathiTrust/OCLC Catalog – The HathiTrust/OCLC Catalog team recently reached an agreement on metadata requirements for the version 1 catalog. To finalize these requirements, input was sought from catalogers both within and outside of the regular group. The team is also in the process of finalizing user interface requirements. A face-to-face meeting between OCLC and HathiTrust is being planned for November, where the group will begin to lay out a vision and timeline for version 2 of the catalog.

Ingest – Ingest rates were low in September for two reasons: HathiTrust remained caught up with the content made available by Google, and a metadata encoding issue required Google to reprocess a number of volumes before they could be downloaded. Ingest of metadata from Penn State is expected to begin in October, with ingest of content to begin immediately after.

Prototype for New HathiTrust PageTurner — Staff at the University of Michigan investigated ways of altering the current process by which images are transformed for access, in order to produce images that can be used by the GnuBook. A conclusion was not reached and investigation will continue in October. Development work at the University of California will also continue in October, as staff prepare a feature that will allow users to view thumbnail images of the pages in a volume.

Collection Builder – As mentioned above, the UM Press volumes will form the first officially sponsored collection in Collection Builder. Another new feature of the Collection Builder, and the PageTurner application as well, is that users accessing HathiTrust from partnering institution campuses will see the name of their institution in the bottom left corner of the screen. This note will let users know that their institution is supporting the effort to make this content available and ensure its preservation over the long-term. Staff at the University of Michigan continue to work to integrate Collection Builder functionality into the temporary catalog. Negotiating authentication requirements between the two applications has introduced some complications, but options continue to be explored.

Outages – There were no outages in September.

Presentations

iPRES "HathiTrust: Preservation As A Platform For Collaboration and Expanded User Services", October 6 - Jeremy York, Suzanne Chapman, Heather Christenson, and Paul Fogel
PASIG "From Ingest To Access: A Day In The Life Of A HathiTrust Digital Object", October 8 - Jeremy York
NISO Forum "Seamless Sharing: NYU, HathiTrust, ReCAP and the Cloud Library", October 9 - Kat Hagedorn

New Growth

Number of volumes added:

September Total
Indiana University 1,036 19,518
University of California 64,210 521,704
University of Michigan 81,829 3,210,981
University of Wisconsin 7,230 222,275
Total 154,305 3,981,227
  • 36,400 public domain volumes were added in September, bringing the total number of public domain volumes to 641,170 (16% of the total content).

October Forecast

  • Create and maintain full-text indexing and search services in the new production environment.
  • Continue to explore the addition of facets in full-text search. Facets have introduced metadata to the full-text index, and therefore new sorting options, including weighted relevance, will need attention.
  • Continue to investigate potential solutions to the problem of dynamically serving images to GnuBook.

Update on August 2009 Activities

September 11, 2009 [Download PDF]

Top News

Working Group On Computational Research Center – The Research Center proposal planning group has made great progress in the last month. The group has continued discussions on the types of research that could utilize the centers, how results might be shared, and what environments/datasets are best suited to which types of research. In bi-weekly calls, subgroup meetings, and individual interviews, the team has been working through difficult issues such as defining non-consumptive research and recognizing hurdles related to the management and publication of research results. Next steps include developing a draft plan for the infrastructure of the centers and marrying legal and security restrictions with that infrastructure. The group aims to have a draft proposal prepared in early October and a full proposal completed later that month.

Working Group on Development 'sandbox' – Based on a general conversation with the working group and a useful discussion of potential use cases with UC staff during their Ann Arbor visit, staff at the University of Michigan have gathered enough information to start building the development environment. The initial goal is to support all of the current development projects in a single place, and provide a large subset of content with which to work. The new environment will be a substantial improvement over current conditions, and should be a building block for additional capabilities later on, including significant partner development. Michigan has racked, cabled, and started operating system installs on the equipment set aside for the project. When further progress has been made on the base installations the full working group will assemble to discuss the provisions of the environment.

University of Michigan Press Backfile and Reprint Purchase Links in HathiTrust – HathiTrust is collaborating with the University of Michigan Scholarly Publishing Office and the University of Michigan Press to open access to the majority of the published backfile of the UM Press in HathiTrust. The volumes, which are being digitized by the Press, will be available in HathiTrust with an option to purchase a print-on-demand copy in mid to late October.

HathiTrust Disaster Preparedness – Over the summer, an IMLS grant-funded intern in digital preservation performed an in-depth evaluation of disaster preparedness in HathiTrust. The report provides detailed information about the strengths of HathiTrust’s current disaster recovery planning, as well as recommendations for improvements in the short-, intermediate-, and long-term. It is available at http://www.hathitrust.org/technical_reports/HathiTrust_DisasterRecovery.pdf.

Prototype for New HathiTrust PageTurner — Staff from the University of California and University of Michigan held two teleconferences in August to discuss deeper integration of the UC prototype PageTurner into the existing application. Team members discussed strategies for offering full development capabilities on a limited amount of HathiTrust content in advance of the development ‘sandbox’ environment. A working strategy has been reached and a development space should be available in October. UC has continued in the meantime to improve GnuBook functionality with thumbnail views of page images and the ability to display full-text OCR. Staff at UM are investigating ways to alter current processes that make access-quality images available to the PageTurner, to produce images that can be used by the GnuBook.

METS Profile Available — Staff at the University of Michigan have created a version 1.0 METS profile for HathiTrust content, which can be downloaded at http://www.hathitrust.org/preservation. The profile currently applies only to Google content in HathiTrust, but will be updated to reflect requirements for locally-scanned content and volumes digitized by the Internet Archive.

Returned Duplicates — For several years, Google has been working on ways to reduce duplication in its digitization workflow. In August, it implemented processes that use metadata to detect volumes that have been scanned previously at other institutions so identical volumes will not be scanned again. The number of volumes rejected in this de-duplication effort has raised concerns among HathiTrust institutions about the accuracy of Google’s detection processes. The University of California, the University of Wisconsin, Indiana University, and the University of Michigan have undertaken a review of volumes returned as duplicates to better understand how duplicate determination takes place. The four universities have identified a target set of materials to review and are finalizing methodology to perform a manual evaluation. It is hoped that the results will be available for the Google library partner summit later this month.

Mobile Interface — Michigan made significant progress on the development of a mobile interface to the HathiTrust Catalog in August. The work continues, and staff will next turn their attention to the PageTurner application. Initial development will be followed by user testing for both applications.

Development Updates

Large-scale Search – After additional search performance testing in August, an improved index configuration was established by staff at the University of Michigan using a punctuation filter and a list of 400 common words (see blog post for details: http://www.hathitrust.org/blogs/large-scale-search/tuning-search-perform...). This index configuration will be put into production on the new dedicated server hardware, which was installed in August. Michigan also completed additions to the indexing control software (SLIP) to support distribution of indexing across several servers, each with multiple Solr index shards. A continuous indexing strategy for this distributed system and corresponding requirements for storage configuration and scripting have been implemented, and the first indexing tests will have begun by the time this report is published.

Ingest – The number of volumes ingested dropped significantly in August as ingest rates caught up with the rate at which partner content was made available from Google.

Data API – Ed Summers provided insightful and constructive feedback on the HathiTrust Data API in a blog posting in mid-August (http://inkdroid.org/journal/2009/08/13/open-to-view/). The comments are being reviewed by University of Michigan staff.

Collection Builder – Two new APIs for Collection Builder are being tested by staff at Michigan. The first returns the list of collections owned by a user. The second adds multiple items to a collection. These APIs will support future integration of Collection Builder functionality into other applications, such as the HathiTrust temporary catalog.

Outages – On Wednesday August 5 from 8:15pm to 9:30pm EDT, service was degraded (service may have been unavailable to some users) due to a storage system problem at the Indiana site. Service was also degraded due to network connectivity problems to database servers during three periods: from Sunday August 23 at 6:30pm EDT to Monday August 24 at 8:00am EDT, on Wednesday August 26 from 5:00pm to 6:00pm EDT, and on Friday August 28 from 7:25pm to 8:35pm EDT.

Software and firmware upgrades were performed during the weeks of August 10 and 17 at both sites without incident or interruptions in service. The upgrades conducted during the week of August 17 were preventative in nature and addressed a hardware problem, discovered by the storage system provider, that was the underlying cause of the service disruption on August 5.

The cause of the other outages has been thoroughly researched but is still not known; workarounds that eliminate any service impact have been put into place, systems are being monitored, and investigation into the problem continues.

New Growth

Number of volumes added:

August Total
Indiana University -- 18,482
University of California 148,810 457,494
University of Michigan 58,878 3,129,152
University of Wisconsin -- 215,045
Total 207,688 3,820,173
  • 23,434 public domain volumes were added in August, bringing the total number of public domain volumes to 604,770 (16% of the total content).

September Forecast

  • Test large scale search performance on new dedicated server hardware.
  • Begin working with facets in large-scale search and continue testing performance variables including common-grams and punctuation.
  • Add reprint purchase links to the HathiTrust interface for UM Press items.
  • Continue development of mobile interfaces for the temporary catalog and PageTurner.
  • Establish a collaborative development environment for the HathiTrust PageTurner.

Update on July 2009 Activities

August 14, 2009 [Download PDF]

Top News

UC Staff Visit Ann Arbor – HathiTrust project leads from the California Digital Library joined staff at the University of Michigan for two days of intense and fruitful discussion and planning from July 20-21. The teams consulted on a variety of forward-looking topics including a roadmap for the ingest of content digitized by the Internet Archive, strategies for future bibliographic metadata management, the challenges of providing help and feedback to users in a virtual library with multiple constituencies and stakeholders, HathiTrust PageTurner development, and creating infrastructure for collaborative development efforts. Several new planning efforts were initiated as a result of these discussions and both partners came away believing the visit had helped them to further coordinate efforts and was instrumental to continuing their successes in the future.

New HathiTrust Working Group On Storage – A new working group has been convened to explore the possibility of securing a third instance of storage for HathiTrust in the western United States. The working group members include Stephen Abrams, California Digital Library (co-chair), John Kunze, California Digital Library (co-chair), Luc Declerck, University of California San Diego, Rob Lowden, Indiana University, David Minor, University of California San Diego, and Cory Snavely, University of Michigan. If a third instance of storage is recommended, the group will investigate a variety of technical, management, and organizational issues involved in implementation.

Working Group On Computational Research Center – The Research Centers working group has been hard at work over the last month. The participants (please see the June update) have been engaging in a series of conference calls discussing issues related to the creation of the centers, including the types of research that will be done, the environment needed to support such research, and legal restrictions surrounding the use of the data. The group will continue to discuss these issues and others, such as funding sources and derivative research resulting from HathiTrust data use, in calls throughout August and September.

Working Group on Development 'sandbox' – The Development Environment working group convened for the first time in mid-July via teleconference to discuss the scope of the environment, the contexts in which development will occur (remote development versus local, specific use cases and desired features), and working group logistics. The group identified current applications such as the HathiTrust PageTurner and Collection Builder, and GROOVE, HathiTrust’s ingest mechanism, as priority systems to be made available in the development space, and conferred about particular ways that work will be done, such as code versioning. The development environment was a focus of one of the sessions during the meeting between California Digital Library and University of Michigan staff mentioned above, where further discussion on these issues took place. In the coming weeks, team members at Michigan will prepare hardware that has been set aside for the project and do preliminary configuration of the environment on that hardware.

Prototype for New HathiTrust PageTurner – Collaboration between the California Digital Library and the University of Michigan to enhance the HathiTrust PageTurner with GnuBook functionality continued in July, primarily in the form of discussions about division of labor and the establishment of a basic collaborative work environment. A new planning and development team with staff from both institutions met in mid-August to kick off the next phase of GnuBook and PageTurner development.

HathiTrust-OCLC Catalog Project – The HathiTrust WorldCat Local Implementation team is nearing the completion of a high-level requirements document for the version 1 catalog, with a target deadline of August 31, 2009. The team also began to document usability issues and suggestions for the proposed interface. OCLC has begun working on the e-content synchronization process that will bring HathiTrust’s records into WorldCat Local.

In striving to create a consistent user experience of HathiTrust, the team has turned to user feedback on the temporary beta catalog (http://catalog.hathitrust.org/).

HathiTrust Statistics — Member institutions have identified the need to make statistics about how HathiTrust is being used more broadly available within the partnership. As a provisional measure, access statistics gathered by Google Analytics are being provided to representatives at these institutions. While these analytics will be useful in the short-term, there is a need for a reporting tool that will provide more granular information, such as usage by institution and by format, in the future.

Development Updates

Large-scale Search – University of Michigan staff investigated the indexing problems with the beta large-scale search that were reported in the last update. The problems were due to a shortage of available memory. However, a decision was taken to wait for new hardware to be deployed before taking further action. The new hardware, purchased in June to support large-scale search, was received in July, and is currently being prepared for testing and use. With the new hardware in place, the plan is to offer full-text search of all volumes in HathiTrust by October 1.

UM staff made refinements to the custom punctuation filter for large scale search, and ran tests only to discover the filter did not provide the performance boost anticipated. The punctuation filter has been set aside temporarily, but has potential for future implementation. Tests conducted by staff to compare response times for common-grams Solr indexes in various configurations resulted in a new emphasis being placed on the importance of a well-tuned list of common words. A new program that evaluates the total number of term occurrences for the most frequently occurring words in an index was created to aid in the selection of common words for this list. Additional details can be found on the HathiTrust Large Scale Search Blog (http://www.hathitrust.org/blogs/large-scale-search/). Four new posts were added to the blog in July.
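
The selection step described above, ranking words by their total number of occurrences, can be sketched as follows. The program mentioned in this update evaluates term occurrences in a Solr index; the sketch below counts over a couple of in-memory strings as a stand-in, and its function name and sample texts are invented for the example.

```python
from collections import Counter

def top_terms(documents, limit):
    """Count total occurrences of each term and return the most frequent ones."""
    counts = Counter()
    for text in documents:
        # Lowercase and split on whitespace; a real analyzer would do more.
        counts.update(text.lower().split())
    return counts.most_common(limit)

# Invented sample texts standing in for indexed volumes.
docs = ["the whale and the sea", "the sea of ice"]
print(top_terms(docs, 2))
```

Words at the top of such a ranking are the candidates for the common words list, since they are the ones whose occurrence lists are most expensive to read at query time.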

Ingest – Ingest was slowed in July by the discovery that Google was making volumes available for ingest that did not contain the required descriptive metadata. Google addressed the problem and ingest continued as normal after these volumes were re-ingested.

Data API – University of Michigan staff responded to feedback received from California Digital Library on the Data API and discussion of the API continued when CDL visited Michigan. Key issues that have arisen are security and determining how much functionality should be built into the baseline API.

Collection Builder – Michigan explored solutions for integ