
Zephir

Zephir, the HathiTrust Metadata Management System

In an effort that has highlighted the modularity of the HathiTrust repository and the capacity for distributed development of repository infrastructure, the University of California's California Digital Library has developed a new bibliographic metadata management system for HathiTrust. The new system, called Zephir, is custom-designed for the functional metadata needs of HathiTrust and provides a range of back-end services. Zephir was launched in the fall of 2013.

Background

Bibliographic information is critically important to HathiTrust. Bibliographic records provide general descriptions of items (title, author, publisher, date), as well as summaries, descriptions and additional information that are often not available in the item itself (e.g., author death date, subject headings, government document status). This information is crucial for helping users find what they are looking for. It also allows HathiTrust to make an initial automated rights determination about volumes (with close to 11 million volumes, HathiTrust could not otherwise provide the broad access it does) and to perform manual investigations into the copyright or in-print status of items. It is also used in inventories of HathiTrust items, to trigger ingest, and to troubleshoot problems.

On October 30, 2013, HathiTrust released a new bibliographic metadata management system, called Zephir, to store, manage, and export to other HathiTrust systems the bibliographic records accompanying digital items deposited in HathiTrust's digital repository. Zephir performs a wide range of functions, including record ingest and updating, general management of records, record versioning, and reporting on record loading, including error reporting. Zephir integrates seamlessly into HathiTrust workflows, providing metadata that is used for all the purposes above and is primarily accessible through HathiTrust's online catalog, datafeeds, and APIs. Existing public-facing metadata exposure services (http://www.hathitrust.org/data) will remain available, but will now include metadata sourced from Zephir.

Zephir includes a number of additional functions:

  • Zephir stores all records successfully submitted by contributing partner institutions, allowing for access to a broad range of complementary bibliographic metadata for future use.  
  • Zephir exports a preferred base record with holdings to be made available through the HathiTrust public access catalog. Zephir uses a scoring algorithm to weight the presence or absence of MARC fields and field values to determine base record selection. Base records exported from Zephir can change based on adjustments to the scoring algorithm or based on the quality of updated or additional records.
  • Zephir supports versioning and retains all successfully submitted or updated copies of records describing digitized resources in the HathiTrust repository. The system has been designed to include "shadow" records, which preserve the integrity of originally submitted metadata should changes need to be made by HathiTrust rather than the contributing partner institution. Shadow records are currently applied only to make critical changes affecting rights, per the HathiTrust Bibliographic Correction Policy: http://www.hathitrust.org/bib_metadata_correction
  • Metadata field values critical for management and analysis from all contributed records are mapped to a HathiTrust-specific metadata schema, and stored in a SQL database and as XML files.
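The base record selection described above can be illustrated with a minimal sketch. The weights, the `Record` type, and the function names below are hypothetical and chosen only for illustration; the actual Zephir scoring rules are not given here, and this sketch weighs only the presence of MARC fields, not their values.

```python
from dataclasses import dataclass, field

# Hypothetical weights for illustration only; not the real Zephir scoring rules.
FIELD_WEIGHTS = {
    "245": 5,  # title statement
    "100": 3,  # main entry, personal name
    "260": 2,  # publication information
    "650": 1,  # subject heading (topical term)
}


@dataclass
class Record:
    """A contributed record, reduced to the set of MARC tags it contains."""
    record_id: str
    fields: set[str] = field(default_factory=set)


def score(record: Record) -> int:
    """Sum the weights of the weighted MARC tags present in the record."""
    return sum(weight for tag, weight in FIELD_WEIGHTS.items() if tag in record.fields)


def select_base_record(records: list[Record]) -> Record:
    """Return the highest-scoring record; ties go to the earliest in the list."""
    return max(records, key=score)


candidates = [
    Record("a", {"245"}),
    Record("b", {"245", "100", "650"}),
    Record("c", {"245", "260"}),
]
base = select_base_record(candidates)
```

Because the score is recomputed from the weights each time, changing a weight or loading a richer record can change which record is selected, which matches the behavior described for exported base records.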

In preparation for launching Zephir, HathiTrust implemented revised bibliographic metadata specifications (http://www.hathitrust.org/bib_specifications) as well as a new bibliographic metadata submission process (http://www.hathitrust.org/bib_data_submission), which provides contributing partners with feedback about their records through a series of reports generated for each file submitted.

California Digital Library

The University of California (UC) is a founding member of HathiTrust, and the California Digital Library (CDL) has traditionally been a locus of coordination and technical development for the UC Libraries.  A team at the CDL developed Zephir to support specific HathiTrust requirements by working in consultation with staff at the University of Michigan who have managed HathiTrust bibliographic metadata since HathiTrust's inception. The work of designing and implementing Zephir has highlighted the modularity of the HathiTrust repository, the potential for collaboration between HathiTrust partner institutions in developing components of the infrastructure, and the capacity for distributed development of repository infrastructure in addition to the core systems and services provided by the University of Michigan. 

Project Goals

  • Provide equivalent metadata management functionality to the University of Michigan Aleph-based system.
  • Provide improved update, match and merge record management functionality to the HathiTrust.
  • Provide a flexible framework for the management of metadata at many levels (e.g., work, manifestation, item).
  • Position the HathiTrust to respond to metadata management challenges raised by duplicate and surrogate records.

Project Team

The launch of Zephir was a multi-year effort involving a wide range of staff at the CDL and the University of Michigan. Core team members included:

University of California

Lynne Cameron, HathiTrust Co-Technical Lead (core development team)
Heather Christenson, HathiTrust Project Manager
Stephanie Collett, Technical Project Lead (core development team)
Paul Fogel, HathiTrust Co-Technical Lead
Patricia Martin, Director, Discovery and Delivery Team
Kathryn Stine, Metadata Analyst & Project Manager (core development team)
Michael Thwaites, Programmer & Testing Coordinator (core development team)
Lena Zentall, former Project Manager (core development team)

University of Michigan

Bill Dueber, Library System Programmer
Tim Prettyman, Senior Library Applications Programmer
Jon Rothman, Head of Library Systems Office
Jeremy York, Assistant Director, HathiTrust 

Executive Sponsors and Coordinators

Laine Farley, Executive Director, California Digital Library; HathiTrust Executive Committee member
John Wilkin, Executive Director, HathiTrust

Zephir System Documentation:

Prior to metadata processing, contributor records are submitted to Zephir via FTPS.

Metadata processing

Submitted contributor metadata is processed by a record loader script (written in Perl) and subsequent import/update scripts (written in Ruby). Metadata processing includes validation against the MARC standard and HathiTrust bibliographic metadata specification (http://www.hathitrust.org/bib_specifications). During processing, some metadata values are normalized, put into consistent locations, removed, or added.

Storage (File systems and database)

Zephir (written in Ruby) stores the original volume-level records as submitted by HathiTrust contributing institutions and as processed during loading. All files are stored in a file system, and when metadata is updated, Zephir maintains a complete history of all changes to a record (using Git and a Pairtree file structure). Selected data elements are stored and indexed in a database (using MySQL), which also includes bibliographic and volume-level records.
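For readers unfamiliar with Pairtree, the following minimal sketch shows how the Pairtree specification maps an identifier to a directory path: reserved characters are hex-encoded, three characters are substituted, and the result is split into two-character segments. The function names are hypothetical, the sample identifier is the one used in the Pairtree specification rather than a Zephir identifier, and how Zephir itself lays out its file system is not described here.

```python
import re

# Characters that Pairtree hex-encodes as ^hh, plus anything outside
# visible ASCII (0x21-0x7e).
_HEX_ENCODE = re.compile(r'[^\x21-\x7e]|["*+,<=>?\\^|]')
# Single-character substitutions applied after hex-encoding.
_SUBSTITUTIONS = str.maketrans({"/": "=", ":": "+", ".": ","})


def clean_identifier(identifier: str) -> str:
    """Apply Pairtree character cleaning to an identifier."""
    encoded = _HEX_ENCODE.sub(
        lambda m: "".join(f"^{b:02x}" for b in m.group().encode("utf-8")),
        identifier,
    )
    return encoded.translate(_SUBSTITUTIONS)


def pairtree_path(identifier: str) -> str:
    """Split the cleaned identifier into two-character directory segments."""
    cleaned = clean_identifier(identifier)
    pairs = [cleaned[i:i + 2] for i in range(0, len(cleaned), 2)]
    return "/".join(["pairtree_root", *pairs])


path = pairtree_path("ark:/13030/xt12t3")
# path == "pairtree_root/ar/k+/=1/30/30/=x/t1/2t/3"
```

Because the path is derived from the identifier alone, a record's files can be located without a database lookup.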

Export

Zephir exports bibliographic records and management data for HathiTrust workflows and services (with Ruby). Zephir also has the capacity for exporting volume-level records as well as specific metadata elements for analysis and reporting. 

Zephir interactions with other HathiTrust systems:

Export to HathiTrust Ingest Framework (Feed)

On a daily basis, Zephir exports a list of volume identifiers and additional metadata to the University of Michigan for newly loaded records from contributors. The HathiTrust Ingest Framework (Feed) utilizes this information in the digital repository ingest process.

Import from HathiTrust Ingest Framework (Feed)

On a daily basis, Zephir receives a list of volume identifiers from the University of Michigan (originating in the HathiTrust Ingest Framework (Feed)) representing digital objects ingested to the HathiTrust repository the previous day.

Export to HathiTrust Catalog

To determine which bibliographic records to include in its daily export, Zephir uses the list of volume identifiers received from the University of Michigan, which represents digital objects ingested to the HathiTrust repository the previous day. The export is sent to the University of Michigan for further processing and use in HathiTrust’s access systems.
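The selection step in this daily exchange can be sketched as a lookup of the ingested volume identifiers against the stored records. This is only an illustration under stated assumptions: the function name, the dictionary-based record store, and the sample identifiers are hypothetical, and the actual export is implemented in Ruby.

```python
from typing import Iterable


def select_for_export(
    ingested_ids: Iterable[str],
    records_by_volume: dict[str, dict],
) -> list[dict]:
    """Pick the stored records matching the previous day's ingested volumes.

    Identifiers with no stored record are skipped, and an identifier that
    appears more than once is exported only once.
    """
    selected = []
    seen = set()
    for volume_id in ingested_ids:
        if volume_id in seen:
            continue
        seen.add(volume_id)
        record = records_by_volume.get(volume_id)
        if record is not None:
            selected.append(record)
    return selected


# Hypothetical sample data, not real HathiTrust identifiers.
store = {
    "vol-001": {"volume_id": "vol-001", "title": "First"},
    "vol-002": {"volume_id": "vol-002", "title": "Second"},
}
export = select_for_export(["vol-002", "vol-999", "vol-002"], store)
```

Driving the export from the ingest list means that only records whose digital objects are already in the repository reach the access systems.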


  
  

Interactions with backup infrastructure:

Three environments are maintained for Zephir: production, staging, and development.

The production environment consists of two SLES Linux virtual machines (FTPS VM, Zephir VM) and a high-availability database server (Zephir DB) on Solaris. A virtual machine snapshot of the FTPS and Zephir machines (FTPS VM Weekly Snapshot, Zephir VM Weekly Snapshot) and a database snapshot (Zephir DB Weekly Snapshot) are taken weekly. Four snapshots of each component are stored in the production environment at the UCOP data center, and a copy is held in the UC San Diego data center for disaster recovery. In addition to the snapshots kept for system recovery, all original record files submitted to the FTPS server are permanently archived in both data centers (Zephir Input Records File Archive).

The staging and development environments are located at the UC Berkeley data center. These environments employ the same virtual machine and database technology as production. The primary purpose of these systems is to develop features and fixes for the Zephir system (development environment) and deploy these changes at scale (staging environment) before rolling them out to production.

Timeline

Milestones, with progress in parentheses:

  • Planning phase, April - October 2010 (Completed)
  • Project officially launches, November 2010
  • Complete business arrangements & funds transfer (Completed)
  • Ongoing procedures for receiving input files and pre-ingest transformation procedures in place (Completed)
  • Core file system in place (Completed)
  • Core database in place (Completed)
  • Import (transformation): routines in place to normalize and transform bibliographic data submitted by current content contributing partners (Completed)
  • Milestone: Generic core system in place (Completed; demo'ed 6/14/11)
  • Named the system "Zephir" (Completed)
  • (Preliminary) load and test records (Completed)
  • Reconcile differences between original contributor records and HathiTrust records (Completed)
  • Confirm ingest standards and workflows for contributing records (minimum submission standard, record correction policies & handling) (Completed)
  • Process (rights, daylight, preferred record score) (Completed):
      • Rights routines developed to record rights determination to support incorporation of the HTMMS into existing HathiTrust Rights workflows.
      • Daylighting routines developed to determine when a HathiTrust object is fully processed and vetted and ready for incorporation into discovery and delivery systems.
      • Preferred record score script developed to heuristically score records and save the results in the core system. The preferred record score means the base record can be identified at the point of export.
  • Process (batch exports): routines in place to produce VuFind batch exports from the system (Completed)
  • Process (batch exports): routines in place to produce Hathifiles batch exports from the system (Completed)
  • System adapted to HathiTrust workflow (Completed)
  • Load records:
      • Development environment load target: early June 2013
      • Staging environment load target: early July 2013
      • Production environment load target: late July 2013
  • Functional and performance testing of system (In progress)
  • Integration testing (In progress)
  • System acceptance - Run systems in parallel (Systems running in parallel through mid-October 2013)
  • Cutover to Zephir - System in production with the HathiTrust (Fall 2013)

Monthly Project Updates

December 2010

January 2011

February 2011

March 2011

April 2011

May 2011

June 2011

July 2011

August 2011

September 2011

October 2011

November 2011

December 2011

January 2012

February 2012

March 2012

April 2012

May 2012

June 2012

July 2012

August 2012

September 2012

October 2012

November 2012

December 2012

January 2013

February 2013

March 2013

April 2013

May 2013

June 2013

July 2013

August 2013

 

Contact

Stephanie Collett, Technical Lead

Kathryn Stine, Project Manager