Available Indexes

Hathifiles Description

The "hathifiles" are tab-delimited text files that describe every item in the HathiTrust collection. They include information derived from the bibliographic record (e.g., title, publisher, language, and commonly used identifiers), rights and access codes, and information about the source of the item. See the bottom of this page for potential use cases.

The files are available for download on the “hathifiles” page. A header file is also provided at the top of the list of hathifiles. 

Many of the fields below are extracted from the MARC record, a format commonly used in library catalogs to describe a work. The descriptions below note which MARC fields and subfields each data element includes. See “What is a MARC record, and why is it important?” to learn more about the MARC format.

File format:

The file is tab-delimited. Multiple occurrences of OCLC number, ISBN, ISSN, or LCCN are comma-delimited within the appropriate data element. If there is no corresponding data for that element, the data element will be empty. The fields are provided in the hathifiles in the order described below. When new elements are added to the hathifiles, they are added to the end of the row. 
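The format described above can be read with a few lines of Python. This is a minimal sketch, not an official parser: the field order is copied from the data elements listed on this page, the names `FIELDS`, `MULTI_VALUE_FIELDS`, `parse_row`, and `read_hathifile` are illustrative, and the file is assumed to be uncompressed UTF-8 text. In practice the column names are better read from the header file, since new columns are appended over time.

```python
# Field names in the order this page describes them (an assumption; prefer
# the header file, because new columns are added to the end of the row).
FIELDS = [
    "htid", "access", "rights", "ht_bib_key", "description", "source",
    "source_bib_num", "oclc_num", "isbn", "issn", "lccn", "title", "imprint",
    "rights_reason_code", "rights_timestamp", "us_gov_doc_flag",
    "rights_date_used", "pub_place", "lang", "bib_fmt", "collection_code",
    "content_provider_code", "responsible_entity_code",
    "digitization_agent_code", "access_profile_code", "author",
]
# Data elements that may hold several comma-delimited values.
MULTI_VALUE_FIELDS = ("oclc_num", "isbn", "issn", "lccn")


def parse_row(line, fields=FIELDS):
    """Split one hathifile line into a dict keyed by field name."""
    values = line.rstrip("\r\n").split("\t")
    row = dict(zip(fields, values))
    for name in MULTI_VALUE_FIELDS:
        # An empty or missing data element becomes an empty list.
        row[name] = [v for v in row.get(name, "").split(",") if v]
    return row


def read_hathifile(path):
    """Yield one parsed row per line (assumes uncompressed UTF-8 text)."""
    with open(path, encoding="utf-8", newline="") as handle:
        for line in handle:
            yield parse_row(line)


# Invented sample line: only the first eight columns are present.
sample = "xyz.0001\tallow\tpd\t000000001\t\tMIU\t\t12345,67890"
row = parse_row(sample)
print(row["oclc_num"])  # ['12345', '67890']
print(row["isbn"])      # []
```

Splitting on tabs directly, rather than using a quoting-aware CSV reader, avoids misreading quotation marks that appear inside titles.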

Each entry below gives the data element, its field name in the header file, and a description.
Volume Identifier htid

This is the permanent HathiTrust item identifier. Each item identifier is unique.

This identifier can be used to construct a persistent handle url or other link that directs users to the item.

Handles can be constructed as follows: https://hdl.handle.net/2027/volume_identifier

For example: https://hdl.handle.net/2027/mdp.39015013764785
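As a small illustration of the pattern above, the handle can be built by appending the volume identifier to the prefix. The function name `handle_url` is hypothetical, and the sketch assumes the identifier is used as-is, with no escaping applied.

```python
HANDLE_PREFIX = "https://hdl.handle.net/2027/"


def handle_url(htid):
    """Return the handle URL for a HathiTrust volume identifier."""
    # Assumption: the identifier is appended unchanged.
    return HANDLE_PREFIX + htid


print(handle_url("mdp.39015013764785"))
# https://hdl.handle.net/2027/mdp.39015013764785
```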

Access access

An access code that describes whether or not users can view the item. The access code is derived from the rights attribute.

Permitted values include:

  • allow - end users can view the item
  • deny - end users cannot view the item

Notes:

  • Items with a copyright status of “public domain in the United States” (i.e., only users within the United States can view the item) have the value of “allow”. 
  • Items with a copyright status of “in-copyright in the United States” (i.e., only users outside the United States can view the item) have the value of “allow”.

Also see “Rights” and “Access Profile” data elements below.

Rights code rights

A code (also referred to as “rights attribute”) that describes the copyright status, license or access. See the full list of codes.

HathiTrust record number ht_bib_key

HathiTrust's record number for the associated bibliographic record. HathiTrust record numbers are not permanent and can change over time.

URLs to HathiTrust catalog records can be constructed as follows:

https://catalog.hathitrust.org/Record/record_number

For example:  https://catalog.hathitrust.org/Record/001285647

Enumeration/Chronology description

Enumeration (e.g., “vol.1”) and chronology (e.g., “1883”, “Jun-Oct 1927”) data for this item.

Source source

Code identifying the source of the bibliographic record. Currently, the NUC code of the originating library is used for the code.

Source institution record number source_bib_num

Local bibliographic record number used in the catalog of the library that contributed the item.

OCLC numbers oclc_num

OCLC number(s) for the bibliographic record. Multiple values are separated by a comma.

ISBNs isbn

ISBN(s) for the bibliographic record. Multiple values are separated by a comma.

ISSNs issn

ISSN(s) for the bibliographic record. Multiple values are separated by a comma.

LCCNs lccn

LCCN(s) for the bibliographic record. Multiple values are separated by a comma.

Title title

The title of the work. May include an author if provided in the MARC field 245 $c.

Includes all subfields of the 245 MARC field.

Publishing information imprint

The name of the publisher and the date of publication.

Includes subfields b and c of the 260 MARC field.

Rights determination reason code rights_reason_code

This code describes how the “Rights” code was set. See the full list of Reason Codes.

Date of last update rights_timestamp

This date may change when any of the following activities occur:

  • the item was newly deposited into the collection
  • a new copy of the digital item overrode the previous copy
  • the rights and access status has changed
  • a new bibliographic record was provided by the contributor

Government Document us_gov_doc_flag

United States federal government document indicator.

Permitted values include:

  • 1 - the item is a US federal government document
  • 0 - the item is not a US federal government document

Publication Date rights_date_used

Derived publication date of the item. The date is derived from data provided in the 008 field of the MARC record and the enumeration/chronology data for the item. In cases where the date of the item could not be easily determined by HathiTrust processes, the date will be listed in the hathifiles as 9999.
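Because 9999 stands for an undetermined date, it is usually worth converting it to a missing value before doing any date arithmetic. The following is a minimal sketch under that assumption; the function name `publication_year` is hypothetical, and treating empty or non-numeric values as missing is a design choice of this example, not something the hathifiles specify.

```python
UNKNOWN_DATE = "9999"


def publication_year(rights_date_used):
    """Return the year as an int, or None when it is empty, 9999, or non-numeric."""
    value = rights_date_used.strip()
    if not value or value == UNKNOWN_DATE or not value.isdigit():
        return None
    return int(value)


print(publication_year("1883"))  # 1883
print(publication_year("9999"))  # None
```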

Publication Place pub_place

The place of publication for the work. The codes included in this data element were originally provided in bytes 15-17 of the 008 MARC field. See the full list of country codes in the “MARC Code List for Countries.”

Language lang

The primary language of the work. The codes included in this data element were originally provided in bytes 35-37 of the 008 MARC field. See the full list of language codes in the “MARC Code List for Languages.”

Bibliographic Format bib_fmt

Bibliographic format of the work.

Permitted values include:

  • BK - monographic book
  • SE - serial, continuing resources (e.g., journals, newspapers, periodicals)
  • CF - computer files and electronic resources
  • MP - maps, including atlases and sheet maps
  • MU - music, including sheet music
  • VM - visual material
  • MX - mixed materials

Collection Code collection_code

An administrative code used to share information between Zephir and the HathiTrust repository.*

Content Provider Code content_provider_code

The institution that originally contributed the content. Codes used are listed at https://www.hathitrust.org/institution_identifiers.*

Responsible Entity Code responsible_entity_code

The institution that took responsibility for accessioning the content into HathiTrust, in cases where the content provider was not a member of HathiTrust. Codes used are listed at https://www.hathitrust.org/institution_identifiers.*

Digitization Agent Code digitization_agent_code

The organization that digitized the content. Codes used are listed at https://www.hathitrust.org/institution_identifiers.*

Access profile access_profile_code

*ADDED 7/1/2018*

Access profiles indicate whether an item has view or download restrictions. They work in combination with the rights codes (included in the hathifiles in data element “rights”) to determine user access.

Permitted values include:

  • open - Items with this value do not have any download restrictions. 
  • google - Items with this value have some download restrictions. Any user anywhere can download one page at a time. Member-affiliated users can download the full PDF.
  • page - Items with this value can be viewed on the HathiTrust website. Users can download individual pages but cannot download the full PDF, regardless of member affiliation.
  • page+lowres - Users can download the item only in a lower resolution with a watermark.

Author author

*ADDED 7/1/2018*

The name of the person, company or meeting that created the work. Author names are typically in authorized format, meaning that the name is provided in a standardized form used across multiple catalogs and databases.

Includes the following fields from the MARC record:

  • 100 $a $b $c $d - Name of the person who authored the work
  • 110 $a $b $c $d - Name of a corporation or organization that authored the work
  • 111 $a $c $d - Name of a meeting or conference that is responsible for creating the work

*For more information about codes used in HathiTrust internal processes, see the page at https://www.hathitrust.org/internal_codes.

Potential use cases

For research

Researchers can use the hathifiles to do basic analysis against the fields provided. For example, a user who wants to create a HathiTrust Research Center workset or request a HathiTrust dataset can work with the data in the files to create a list of item identifiers that are fully viewable (Access: allow), within a specific date range (Publication Date: xxxx), and in a certain language (Language: xxx).
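The selection described above might look like the following sketch. It assumes each row has already been read into a dict keyed by the header field names; the function name `select_htids`, its parameters, and the sample rows are all invented for illustration.

```python
def select_htids(rows, language, first_year, last_year):
    """Return identifiers of viewable items in one language and date range."""
    selected = []
    for row in rows:
        if row.get("access") != "allow":
            continue
        if row.get("lang") != language:
            continue
        date = row.get("rights_date_used", "")
        # 9999 marks a date that could not be determined; skip those rows.
        if not date.isdigit() or date == "9999":
            continue
        if first_year <= int(date) <= last_year:
            selected.append(row["htid"])
    return selected


# Invented rows, for illustration only.
rows = [
    {"htid": "xyz.0001", "access": "allow", "lang": "eng", "rights_date_used": "1900"},
    {"htid": "xyz.0002", "access": "deny", "lang": "eng", "rights_date_used": "1900"},
    {"htid": "xyz.0003", "access": "allow", "lang": "eng", "rights_date_used": "9999"},
]
print(select_htids(rows, "eng", 1850, 1922))  # ['xyz.0001']
```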

For libraries

Libraries and vendors of discovery products can use the hathifiles to add links to HathiTrust items when they already have bibliographic records in their catalogs. For example, a library can compare OCLC numbers, ISBNs, and ISSNs in the hathifiles against their own catalog and construct a link to the HathiTrust item or catalog record.
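A rough sketch of such a comparison by OCLC number follows. It assumes rows are dicts keyed by the header field names with `oclc_num` still in its raw comma-delimited form, and that the local catalog's OCLC numbers are normalized the same way as those in the hathifiles (any prefix or leading-zero handling is left out). The function name `match_by_oclc` and the sample values are hypothetical.

```python
HANDLE_PREFIX = "https://hdl.handle.net/2027/"
CATALOG_PREFIX = "https://catalog.hathitrust.org/Record/"


def match_by_oclc(hathi_rows, local_oclc_numbers):
    """Map each matching local OCLC number to HathiTrust item and record links."""
    local = set(local_oclc_numbers)
    links = {}
    for row in hathi_rows:
        for number in row.get("oclc_num", "").split(","):
            if number and number in local:
                links.setdefault(number, []).append({
                    "item": HANDLE_PREFIX + row["htid"],
                    "record": CATALOG_PREFIX + row["ht_bib_key"],
                })
    return links


# Invented rows and local numbers, for illustration only.
rows = [
    {"htid": "xyz.0001", "ht_bib_key": "000000001", "oclc_num": "12345,67890"},
    {"htid": "xyz.0002", "ht_bib_key": "000000002", "oclc_num": ""},
]
print(match_by_oclc(rows, ["67890", "11111"]))
```

Because HathiTrust record numbers can change over time, the item link built from the volume identifier is the more durable of the two.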

A library can also use the hathifiles to acquire records through other methods:

  • The OCLC identifier can be used to retrieve records either via Connexion or from the OCLC Z39.50 server using USE attribute 12.
  • The source institution's record number can be used in obtaining records directly from that institution. Contact the source institution directly for further information about access to their data.

The hathifiles can also be used in combination with other data feeds that we provide, such as the Bib API or the OAI feed. For example, a non-member library interested in all books (Bibliographic format: BK) that are in the English language (Language: eng), that aren’t US federal government documents (Government Document: 0), and that have been opened with a Creative Commons license (Rights: all codes beginning with “cc”) could use the resulting list of rows to retrieve full MARC records from the Bib API.
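The filter in that example can be sketched as a predicate over parsed rows. This is an illustration only: rows are assumed to be dicts keyed by the header field names, the function names `is_cc_english_book` and `record_numbers` are hypothetical, and the Bib API request itself is left out. The sketch collects the distinct HathiTrust record numbers that a later lookup could use.

```python
def is_cc_english_book(row):
    """True for English-language books that are not US federal documents
    and whose rights code begins with "cc"."""
    return (
        row.get("bib_fmt") == "BK"
        and row.get("lang") == "eng"
        and row.get("us_gov_doc_flag") == "0"
        and row.get("rights", "").startswith("cc")
    )


def record_numbers(rows):
    """Return distinct ht_bib_key values of matching rows, in first-seen order."""
    seen = set()
    ordered = []
    for row in rows:
        key = row.get("ht_bib_key", "")
        if key and key not in seen and is_cc_english_book(row):
            seen.add(key)
            ordered.append(key)
    return ordered
```

Several items can share one bibliographic record, which is why the record numbers are de-duplicated before any records are requested.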