Zephir, the HathiTrust Metadata Management System
In an effort that has highlighted the modularity of the HathiTrust repository and the capacity for distributed development of repository infrastructure, the University of California's California Digital Library has developed a new bibliographic metadata management system for HathiTrust. The new system, called Zephir, is custom-designed for the functional metadata needs of HathiTrust and provides a range of back-end services. Zephir was launched in the fall of 2013.
Bibliographic information is critically important to HathiTrust. Bibliographic records provide general descriptions of items (title, author, publisher, date), as well as summaries, descriptions and additional information that are often not available in the item itself (e.g., author death date, subject headings, government document status). This information is crucial for helping users find what they are looking for, for allowing HathiTrust to make an initial automated rights determination about volumes (with close to 11 million volumes, HathiTrust could not otherwise provide the broad access it does), and for performing manual investigations into the copyright or in-print status of items. It is also used in inventories of HathiTrust items, to trigger ingest, and to troubleshoot problems.
On October 30, 2013, HathiTrust released a new bibliographic metadata management system, called Zephir, to store, manage, and export to other HathiTrust systems the bibliographic records accompanying digital items deposited in HathiTrust's digital repository. Zephir performs a wide range of functions, including record ingest and updating, general management of records, record versioning, and reporting on record loading, including error reporting. Zephir integrates seamlessly into HathiTrust workflows, providing metadata that is used for all the purposes above and is primarily accessible through HathiTrust's online catalog, datafeeds, and APIs. Existing public-facing metadata exposure services (http://www.hathitrust.org/data) will remain available, but will now include metadata sourced from Zephir.
Zephir includes a number of additional functions:
- Zephir stores all records successfully submitted by contributing partner institutions, allowing for access to a broad range of complementary bibliographic metadata for future use.
- Zephir exports a preferred base record with holdings to be made available through the HathiTrust public access catalog. Zephir uses a scoring algorithm to weight the presence or absence of MARC fields and field values to determine base record selection. Base records exported from Zephir can change based on adjustments to the scoring algorithm or based on the quality of updated or additional records.
- Zephir supports versioning and retains all successfully submitted or updated copies of records describing digitized resources in the HathiTrust repository. The system has been designed to include "shadow" records, which preserve the integrity of originally submitted metadata when changes need to be made by HathiTrust rather than by the contributing partner institution. Shadow records are currently applied only to effect critical changes affecting rights, per the HathiTrust Bibliographic Correction Policy: http://www.hathitrust.org/bib_metadata_correction
- Metadata field values critical for management and analysis from all contributed records are mapped to a HathiTrust-specific metadata schema, and stored in a SQL database and as XML files.
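The base record selection described in the list above can be illustrated with a minimal Python sketch. The field tags, weights, penalty, tie-breaking rule, and dictionary-based record representation are all hypothetical, chosen only for illustration; Zephir's actual scoring rules are not given here.

```python
from typing import Dict, List

# Hypothetical weights: points awarded when a MARC field is present and non-empty.
# These tags and values are illustrative assumptions, not Zephir's configuration.
FIELD_WEIGHTS: Dict[str, int] = {
    "245": 5,  # title statement
    "100": 3,  # main entry, personal name
    "260": 2,  # publication information
    "650": 1,  # subject heading
}

# Hypothetical penalty applied when a field is absent.
MISSING_PENALTY: Dict[str, int] = {"245": 10}


def score_record(record: Dict[str, List[str]]) -> int:
    """Score a record represented as {MARC tag: [field values]}."""
    score = 0
    for tag, weight in FIELD_WEIGHTS.items():
        has_value = any(value.strip() for value in record.get(tag, []))
        if has_value:
            score += weight
        else:
            score -= MISSING_PENALTY.get(tag, 0)
    return score


def select_base_record(records: Dict[str, Dict[str, List[str]]]) -> str:
    """Return the ID of the highest-scoring record; ties go to the lowest ID."""
    return min(records, key=lambda rid: (-score_record(records[rid]), rid))


candidates = {
    "rec-a": {"245": ["A history of printing"], "260": ["London, 1901"]},
    "rec-b": {"245": ["A history of printing"], "100": ["Smith, J."], "650": ["Printing"]},
}
best = select_base_record(candidates)
```

Because selection is recomputed from the scores, changing a weight or loading a fuller record for the same item can change which record is chosen, which mirrors the behavior described for exported base records.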
In preparation for launching Zephir, HathiTrust implemented revised bibliographic metadata specifications (http://www.hathitrust.org/bib_specifications) as well as a new bibliographic metadata submission process (http://www.hathitrust.org/bib_data_submission), which provides contributing partners with feedback about their records through a series of reports generated for each file submitted.
California Digital Library
The University of California (UC) is a founding member of HathiTrust, and the California Digital Library (CDL) has traditionally been a locus of coordination and technical development for the UC Libraries. A team at the CDL developed Zephir to support specific HathiTrust requirements by working in consultation with staff at the University of Michigan who have managed HathiTrust bibliographic metadata since HathiTrust's inception. The work of designing and implementing Zephir has highlighted the modularity of the HathiTrust repository, the potential for collaboration between HathiTrust partner institutions in developing components of the infrastructure, and the capacity for distributed development of repository infrastructure in addition to the core systems and services provided by the University of Michigan.
- Provide metadata management functionality equivalent to that of the University of Michigan's Aleph-based system.
- Provide improved update, match, and merge record management functionality for HathiTrust.
- Provide a flexible framework for the management of metadata at many levels (e.g., work, manifestation, item).
- Position HathiTrust to respond to metadata management challenges raised by duplicate and surrogate records.
The launch of Zephir was a multi-year effort involving a wide range of staff at the CDL and the University of Michigan. Core team members included:
University of California
Lynne Cameron, HathiTrust Co-Technical Lead (core development team)
Heather Christenson, HathiTrust Project Manager
Stephanie Collett, Technical Project Lead (core development team)
Paul Fogel, HathiTrust Co-Technical Lead
Patricia Martin, Director, Discovery and Delivery Team
Kathryn Stine, Metadata Analyst & Project Manager (core development team)
Michael Thwaites, Programmer & Testing Coordinator (core development team)
Lena Zentall, former Project Manager (core development team)
University of Michigan
Bill Dueber, Library System Programmer
Tim Prettyman, Senior Library Applications Programmer
Jon Rothman, Head of Library Systems Office
Jeremy York, Assistant Director, HathiTrust
Executive Sponsors and Coordinators
Laine Farley, Executive Director, California Digital Library; HathiTrust Executive Committee member
John Wilkin, Executive Director, HathiTrust
Zephir System Documentation:
Prior to metadata processing, contributor records are submitted to Zephir via FTPS.
Submitted contributor metadata is processed by a record loader script (written in Perl) and subsequent import/update scripts (written in Ruby). Metadata processing includes validation against the MARC standard and HathiTrust bibliographic metadata specification (http://www.hathitrust.org/bib_specifications). During processing, some metadata values are normalized, put into consistent locations, removed, or added.
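As a rough illustration of the validate-and-normalize step described above, the following Python sketch checks a record for required fields and cleans up its values. Zephir's loader is written in Perl and Ruby, so this is not its code; the required tags, the dictionary-based record representation, and the specific normalization rules are assumptions made for the example.

```python
from typing import Dict, List, Tuple

Record = Dict[str, List[str]]

# Hypothetical required tags; the HathiTrust specification defines its own rules.
REQUIRED_TAGS = ("245", "008")


def normalize(record: Record) -> Record:
    """Drop empty values and collapse whitespace in variable data fields."""
    cleaned: Record = {}
    for tag, values in record.items():
        if tag.startswith("00"):
            # Control fields (001-009) are positional, so spacing is preserved.
            kept = [v for v in values if v.strip()]
        else:
            kept = [" ".join(v.split()) for v in values if v.strip()]
        if kept:
            cleaned[tag] = kept
    return cleaned


def validate(record: Record) -> List[str]:
    """Return a list of human-readable error messages (empty if valid)."""
    errors = []
    for tag in REQUIRED_TAGS:
        if not any(v.strip() for v in record.get(tag, [])):
            errors.append(f"missing required field {tag}")
    for tag in record:
        # Simplified check: MARC 21 tags are three numeric digits.
        if not (len(tag) == 3 and tag.isascii() and tag.isdigit()):
            errors.append(f"malformed tag {tag!r}")
    return errors


def process(record: Record) -> Tuple[Record, List[str]]:
    """Normalize first, then validate the normalized record."""
    normalized = normalize(record)
    return normalized, validate(normalized)


raw = {
    "245": ["  A  history of   printing "],
    "008": ["130101s1901    enk"],
    "650": ["", "Printing"],
}
processed, errors = process(raw)
```

Returning the error list alongside the cleaned record is one way to feed the kind of per-file feedback reports that the submission process provides to contributors.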
Storage (File systems and database)
Zephir (written in Ruby) stores the original volume-level records as submitted by HathiTrust contributing institutions and as processed during loading. All files are stored in a file system, and when metadata is updated, Zephir maintains a complete history of all changes to a record (using Git and a Pairtree file structure). Selected data elements are stored and indexed in a database (using MySQL), which also includes bibliographic and volume-level records.
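For readers unfamiliar with Pairtree, the sketch below shows how an identifier is mapped to a directory path under the Pairtree convention: certain characters are hex-encoded, three common characters are substituted, and the result is split into two-character segments. The sample identifier is the one used in the Pairtree specification. How Zephir itself names its objects is not stated here, so this illustrates only the general convention, written in Python rather than Zephir's Ruby.

```python
# Characters that Pairtree hex-encodes as ^hh, in addition to any byte
# outside the visible ASCII range (0x21 to 0x7e).
HEX_ENCODE = set('"*+,<=>?\\^|')

# Single-character substitutions applied after hex-encoding.
SUBSTITUTIONS = str.maketrans("/:.", "=+,")


def clean_identifier(identifier: str) -> str:
    """Apply the two-step Pairtree identifier cleaning."""
    out = []
    for byte in identifier.encode("utf-8"):
        ch = chr(byte)
        if byte < 0x21 or byte > 0x7E or ch in HEX_ENCODE:
            out.append(f"^{byte:02x}")
        else:
            out.append(ch)
    return "".join(out).translate(SUBSTITUTIONS)


def pairtree_path(identifier: str) -> str:
    """Split the cleaned identifier into two-character directory segments."""
    cleaned = clean_identifier(identifier)
    segments = [cleaned[i:i + 2] for i in range(0, len(cleaned), 2)]
    return "/".join(segments)


path = pairtree_path("ark:/13030/xt12t3")
```

Here `path` is `ar/k+/=1/30/30/=x/t1/2t/3`. Since each object gets its own predictable directory, a version-control tool such as Git can track changes to the files stored there.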
Zephir exports bibliographic records and management data for HathiTrust workflows and services (with Ruby). Zephir also has the capacity for exporting volume-level records as well as specific metadata elements for analysis and reporting.
Zephir interactions with other HathiTrust systems:
Export to HathiTrust Ingest Framework (Feed)
On a daily basis, Zephir exports a list of volume identifiers and additional metadata to the University of Michigan for newly loaded records from contributors. The HathiTrust Ingest Framework (Feed) utilizes this information in the digital repository ingest process.
Import from HathiTrust Ingest Framework (Feed)
On a daily basis, Zephir receives a list of volume identifiers from the University of Michigan (originating in the HathiTrust Ingest Framework (Feed)) representing digital objects ingested to the HathiTrust repository the previous day.
Export to HathiTrust Catalog
To determine which bibliographic records to include in its daily export, Zephir uses the list of volume identifiers received from the University of Michigan representing digital objects ingested into the HathiTrust repository the previous day. The export is sent to the University of Michigan for further processing and use in HathiTrust’s access systems.
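The daily catalog export selection can be sketched as a simple lookup: given the identifiers of volumes ingested the previous day, find the bibliographic records they belong to. The following Python sketch assumes an in-memory mapping from volume identifier to bibliographic record ID. The function names, the mapping, and all identifiers shown are made up for illustration; in Zephir this information would come from its database.

```python
from typing import Dict, Iterable, List


def select_for_export(ingested_ids: Iterable[str],
                      volume_to_bib: Dict[str, str]) -> List[str]:
    """Return the sorted, de-duplicated bib record IDs for ingested volumes."""
    bib_ids = {volume_to_bib[v] for v in ingested_ids if v in volume_to_bib}
    return sorted(bib_ids)


def unmatched_volumes(ingested_ids: Iterable[str],
                      volume_to_bib: Dict[str, str]) -> List[str]:
    """Return ingested volume IDs that have no known bib record."""
    return sorted(set(ingested_ids) - set(volume_to_bib))


# Hypothetical data: two volumes share one bibliographic record.
volume_to_bib = {
    "mdp.39015000000001": "bib-1",
    "mdp.39015000000002": "bib-1",
    "uc1.b000000001": "bib-2",
}
ingested_yesterday = [
    "mdp.39015000000002",
    "uc1.b000000001",
    "nyp.33433000000000",
]

to_export = select_for_export(ingested_yesterday, volume_to_bib)
missing = unmatched_volumes(ingested_yesterday, volume_to_bib)
```

Collecting the unmatched identifiers separately is a design choice of this sketch; it makes it easy to report volumes that were ingested without a corresponding record.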
Interactions with backup infrastructure:
Three environments are maintained for Zephir: production, staging, and development.
The production environment comprises two SLES Linux virtual machines (FTPS VM, Zephir VM) and a high-availability database server (Zephir DB) on Solaris. A virtual machine snapshot of the FTPS and Zephir machines (FTPS VM Weekly Snapshot, Zephir VM Weekly Snapshot) and a database snapshot (Zephir DB Weekly Snapshot) are taken weekly. Four snapshots of each component are stored in the production environment at the UCOP data center, and a copy is held in the UC San Diego data center for disaster recovery. In addition to these system recovery snapshots, all original record files submitted to the FTPS server are permanently archived in both data centers (Zephir Input Records File Archive).
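The weekly snapshot retention described above can be modeled with a fixed-size queue, as in the Python sketch below. It assumes that the four retained snapshots are the four most recent and that the oldest is discarded when a new one is taken; that rotation policy is an assumption, and the dates are arbitrary.

```python
from collections import deque
from datetime import date, timedelta

RETAINED_SNAPSHOTS = 4  # four snapshots of each component are kept

# A deque with maxlen drops its oldest entry automatically when full.
history = deque(maxlen=RETAINED_SNAPSHOTS)

first_snapshot = date(2013, 11, 3)  # arbitrary illustrative start date
for week in range(6):
    history.append(first_snapshot + timedelta(weeks=week))

retained = list(history)
```

After six weekly snapshots, `retained` holds only the dates for the last four weeks.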
The staging and development environments are located at the UC Berkeley data center. These environments employ the same virtual machine and database technology as production. The primary purpose of these systems is to develop features and fixes for the Zephir system (development environment) and deploy these changes at scale (staging environment) before rolling them out to production.
|Planning phase April - October 2010||Completed|
|Project officially launches November 2010|
Complete business arrangements & funds transfer
Ongoing procedures for receiving input files and pre-ingest transformation procedures in place
|Core file system in place|
|Core database in place|
|Milestone: Generic core system in place||Completed; demoed 6/14/11|
|Named the system "Zephir"||Completed|
|(Preliminary) load and test records||Completed|
|Reconcile differences between original contributor records and HathiTrust records||Completed|
|Confirm ingest standards and workflows for contributing records (minimum submission standard, record correction policies & handling)||Completed|
Process (rights, daylight, preferred record score):
Process (batch exports)
|System adapted to HathiTrust workflow||Completed|
Development environment load target: early June 2013
Staging environment load target: early July 2013
Production environment load target: late July 2013
|Functional and performance testing of system||In progress|
|Integration testing||In progress|
|System acceptance - Run systems in parallel||Systems running in parallel through mid-October 2013|
|Cutover to Zephir - System in production with the HathiTrust|
Monthly Project Updates
Stephanie Collett, Technical Lead
Kathryn Stine, Project Manager