Hathifiles Description

The “hathifiles” are a standard metadata format we use at HathiTrust to distribute information about items in the HathiTrust collection. They include information derived from the bibliographic record (e.g., title, publisher, language, and commonly used identifiers), rights and access codes, and information about the source of the item.

Hathifiles are made available as TSV files for the entire collection on the hathifiles page, or in TSV and JSON formats for personal collections.

Many of the fields described below are extracted from the MARC record, a format commonly used in library catalogs to describe a work. The descriptions below note which MARC fields and subfields each data element includes. See “What is a MARC record, and why is it important?” to learn more about the MARC format.

Notes:

  • Multiple occurrences of OCLC number, ISBN, ISSN, or LCCN are comma-delimited within the appropriate data element.

  • If there is no corresponding data for a field, the field will be empty.

  • The fields are provided in the hathifiles in the order described below.

  • When new elements are added to the hathifiles, they are added to the end of the row.
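
The notes above can be turned into a small parsing routine. The following is a minimal sketch, not an official HathiTrust tool: it assumes each line is plain tab-separated text with no quoting and that the field names are read from the header file. The function name `parse_hathifile_line` and the sample values are hypothetical.

```python
# Fields that the notes describe as comma-delimited when repeated.
MULTI_VALUED = {"oclc_num", "isbn", "issn", "lccn"}


def parse_hathifile_line(header_line, data_line):
    """Map one tab-separated hathifile row onto the header's field names."""
    names = header_line.rstrip("\n").split("\t")
    values = data_line.rstrip("\n").split("\t")
    # Defensive padding (an assumption, not documented behavior): if a row
    # has fewer values than the header has names, treat the rest as empty.
    values += [""] * (len(names) - len(values))
    record = {}
    for name, value in zip(names, values):
        if name in MULTI_VALUED:
            # An empty field means "no data", so it becomes an empty list.
            record[name] = [v for v in value.split(",") if v]
        else:
            record[name] = value
    return record


# Hypothetical header and row, trimmed to four fields for illustration.
header = "htid\taccess\toclc_num\tisbn"
row = "mdp.39015013764785\tallow\t123,456\t"
record = parse_hathifile_line(header, row)
print(record["oclc_num"])  # ['123', '456']
print(record["isbn"])      # []
```

Looking fields up by name rather than by position keeps the code working when new elements are appended to the end of the row.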

Table 1 describes the fields available in the downloadable hathifiles.
Data Element Field Name in Header File Description
Volume Identifier htid This is the permanent HathiTrust item identifier. Each item identifier is unique. This identifier can be used to construct a persistent handle URL or other link that directs users to the item. Handles can be constructed as follows: https://hdl.handle.net/2027/volume_identifier For example: https://hdl.handle.net/2027/mdp.39015013764785
Access access An access code that describes whether or not users can view the item. The access code is derived from the rights attribute. Permitted values include: allow - end users can view the item; deny - end users cannot view the item. Notes: Items with a copyright status of “public domain in the United States” (i.e., only users within the United States can view the item) have the value “allow”. Items with a copyright status of “in-copyright in the United States” (i.e., only users outside the United States can view the item) also have the value “allow”. Also see the “Rights code” and “Access profile” data elements below.
Rights code rights A code (also referred to as “rights attribute”) that describes the copyright status, license or access. See the full list of codes: https://www.hathitrust.org/the-collection/preservation/rights-database/#attributes
HathiTrust record number ht_bib_key HathiTrust's record number for the associated bibliographic record. HathiTrust record numbers are not permanent and can change over time. URLs to HathiTrust catalog records can be constructed as follows: https://catalog.hathitrust.org/Record/record_number For example: https://catalog.hathitrust.org/Record/001285647
Enumeration/Chronology description Enumeration (e.g., “vol.1”) and chronology (e.g., “1883”, “Jun-Oct 1927”) data for this item.
Source source Code identifying the source of the bibliographic record. Currently, this is the NUC code of the originating library.
Source institution record number source_bib_num Local bibliographic record number used in the catalog of the library that contributed the item.
OCLC numbers oclc_num OCLC number(s) for the bibliographic record. Multiple values are separated by a comma.
ISBNs isbn ISBN(s) for the bibliographic record. Multiple values are separated by a comma.
ISSNs issn ISSN(s) for the bibliographic record. Multiple values are separated by a comma.
LCCNs lccn LCCN(s) for the bibliographic record. Multiple values are separated by a comma.
Title title The title of the work. May include an author if provided in the MARC field 245 $c. Includes all subfields of the 245 MARC field.
Publishing information imprint The name of the publisher and the date of publication. Includes subfields b and c of the 260 MARC field.
Rights determination reason code rights_reason_code This code describes how the “Rights” code was set. See the full list of Reason Codes.
Date of last update rights_timestamp This date may change when any of the following activities occur: the item was newly deposited into the collection; a new copy of the digital item overrode the previous copy; the rights and access status has changed; a new bibliographic record was provided by the contributor.
Government Document us_gov_doc_flag United States federal government document indicator. Permitted values include: 1 - the item is a US federal government document; 0 - the item is not a US federal government document.
Publication Date rights_date_used Derived publication date of the item. The date is derived from data provided in the 008 field of the MARC record and the enumeration/chronology data for the item. In cases where the date of the item could not be easily determined by HathiTrust processes, the date will be listed in the hathifiles as 9999.
Publication Place pub_place The place of publication for the work. The codes included in this data element were originally provided in bytes 15-17 of the 008 MARC field. See the full list of country codes in the “MARC Code List for Countries.”
Language lang The primary language of the work. The codes included in this data element were originally provided in bytes 35-37 of the 008 MARC field. See the full list of language codes in the “MARC Code List for Languages.”
Bibliographic Format bib_fmt Bibliographic format of the work. Definitions of format values can be found on the Library of Congress website. Permitted values include: BK - monographic book; SE - serial, continuing resources (e.g., journals, newspapers, periodicals); CF - computer files and electronic resources; MP - maps, including atlases and sheet maps; MU - music, including sheet music; VM - visual material; MX - mixed materials.
Collection Code collection_code An administrative code used to share information between Zephir and the HathiTrust repository.*
Content Provider Code content_provider_code The institution that originally contributed the content. Codes used are listed at https://www.hathitrust.org/institution_identifiers.*
Responsible Entity Code responsible_entity_code The institution that took responsibility for accessioning the content into HathiTrust, in cases where the content provider was not a member of HathiTrust. Codes used are listed at https://www.hathitrust.org/institution_identifiers.*
Digitization Source digitization_agent_code The organization that digitized the content. Codes used are listed at https://www.hathitrust.org/rights_database#Sources.*
Access profile access_profile_code Access profiles indicate whether an item has view or download restrictions. They work in combination with the rights codes (included in the hathifiles in data element “rights”) to determine user access. Permitted values include: open - Items with this value do not have any download restrictions. google - Items with this value have some download restrictions. Any user anywhere can download one page at a time. Member-affiliated users can download the full PDF. page - Items with this value can be viewed on the HathiTrust website. Users can download individual pages but cannot download the full PDF, regardless of member affiliation. page+lowres - Users can download the item only in a lower resolution with a watermark.
Author author The name of the person, company or meeting that created the work. Author names are typically in authorized format, meaning that the name is provided in a standardized form used across multiple catalogs and databases. Includes the following fields from the MARC record: 100 $a $b $c $d - Name of the person who authored the work; 110 $a $b $c $d - Name of a corporation or organization that authored the work; 111 $a $c $d - Name of a meeting or conference that is responsible for creating the work.

*For more information about codes used in HathiTrust internal processes, see the page at https://www.hathitrust.org/internal_codes.
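
As a rough illustration of the link patterns given for `htid` and `ht_bib_key` in Table 1, and of the 9999 placeholder used in `rights_date_used`, the sketch below builds both URLs and converts the publication date to a number. The function names are hypothetical, no percent-encoding is applied to the identifiers, and the date is assumed to be a plain year string; none of these choices is confirmed by HathiTrust.

```python
HANDLE_PREFIX = "https://hdl.handle.net/2027/"
CATALOG_PREFIX = "https://catalog.hathitrust.org/Record/"
UNKNOWN_DATE = "9999"  # value listed when the date could not be determined


def handle_url(htid):
    """Persistent handle link for an item, following the htid row."""
    return HANDLE_PREFIX + htid


def catalog_url(ht_bib_key):
    """Catalog record link; note that record numbers can change over time."""
    return CATALOG_PREFIX + ht_bib_key


def publication_year(rights_date_used):
    """Return the year as an int, or None when it is empty or 9999.

    Assumes the field holds only digits when it is not empty.
    """
    if not rights_date_used or rights_date_used == UNKNOWN_DATE:
        return None
    return int(rights_date_used)


print(handle_url("mdp.39015013764785"))
# https://hdl.handle.net/2027/mdp.39015013764785
print(catalog_url("001285647"))
# https://catalog.hathitrust.org/Record/001285647
print(publication_year("9999"))  # None
```

Because HathiTrust record numbers are not permanent, the handle URL is the safer value to store when a long-lived link is needed.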
