Beyond Google Books: Getting Locally-Digitized Material into HathiTrust
June 18, 2015
By Aaron Elkiss, University of Michigan Library, Cross-posted from the University of Michigan Library Tech Talk Blog
HathiTrust was founded in 2008 as a partnership of the (at the time) 13 universities of the Committee on Institutional Cooperation plus the University of California, all of whom were digitizing or had plans to digitize volumes through the Google Books program. Even those original partners had a variety of past and present digitization projects beyond Google Books, though, and a goal from early on was to support preservation and access to a wide array of digitized book material. Not all material is suitable for digitization with Google Books, and not all institutions with material to digitize are Google Books partners!
Early Attempts at Non-Google Content in HathiTrust
In the 2009–2010 time frame, HathiTrust started experimenting with non-Google content. In particular, we added support for content digitized by the Internet Archive; we also made some early attempts at migrating content from Michigan's existing repository (see our previous blog post on this effort). Additionally, we added support for an impressive collection of incunabula digitized in-house by the Universidad Complutense de Madrid and volumes digitized at Yale University with support from Microsoft.
We quickly found that it would not be sustainable to add support for every new source of content coming into HathiTrust, especially when each source contributed only a small number of volumes. Even when content from a source was relatively well prepared and homogeneous, it usually differed enough from existing content in HathiTrust to require a great deal of investigation and craft work. Because HathiTrust started out with only material from Google Books, the process for getting material into HathiTrust was initially very specific to Google Books. The format for packaging content was essentially tied to what Google Books was providing, and the specifications for images in HathiTrust were based on digitization specifications geared towards technically sophisticated and well-resourced institutions newly digitizing content. There are stringent technical specifications for the images, as well as specific descriptive metadata that image creators need to embed in the files. These specifications are in place for good reason: to support both preservation of and access to all material, the same standards must be applied to all digitized book content entering the repository. The HathiTrust book viewing application generates images on the fly from the preservation copies in the repository, so the less variation in the repository, the easier it is to assure correct operation. Additionally, less variation means that (in theory) any future format migration or other preservation actions would have fewer issues to consider.
Our first attempt to solve the problem of getting content into HathiTrust from disparate sources was to distribute a version of the toolkit we had developed to handle material from the Internet Archive, Yale, Madrid, and Michigan. The toolkit contained functionality to fix metadata embedded in image files, generate Metadata Encoding & Transmission Standard (METS) files, and validate submission packages. However, this toolkit could only work for a limited audience. Just installing the toolkit was non-trivial because of the large number of external dependencies. We tried to minimize the number of assumptions the toolkit made about its environment, but it still required Linux and a lot of prerequisite software. Also, the toolkit wasn't a complete out-of-the-box solution unless the content happened to be exactly like something we'd already handled. It required time and programming expertise to write code describing a new package format and the steps needed to transform the package into something meeting HathiTrust's requirements. Because of these high technical requirements, it took a fair amount of our time to support the few institutions who tried making use of it. There were a few successes with the toolkit, most notably with content from Texas A&M University, but not many other institutions had the resources to make successful use of it.
A Simpler HathiTrust Submission Package
After these setbacks, we thought about what else we could do to make it easier for partner institutions to submit content. We didn't really have the resources to make the toolkit easier to install or use, so instead we came up with a way to relax the requirements for submitting content to HathiTrust while maintaining the same high standards in the repository. We already had code to bridge the gap between the HathiTrust specifications and images coming from the Internet Archive, Michigan's existing repository, and other sources, as well as to create METS files from external metadata.
So, we made specifications for a simpler submission package format that eased the requirements in several ways:
- Rather than requiring a METS file, partners can just create a simple YAML file with some basic metadata and a checksum.md5 file listing the files and their MD5 checksums.
- Instead of requiring JPEG 2000 images prepared in a specific way, partners can submit nearly any black-and-white, greyscale, or full-color red/green/blue TIFF file, and we will automatically create the preservation JPEG 2000 copies.
- Metadata no longer needs to be present in specific places in the images. Partners can provide it in the YAML file and we will insert it in the preservation copies of the images.
- Partners can submit content to a Box share rather than needing to make individualized arrangements to transfer content to HathiTrust.
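To give a feel for the format, the metadata file is just a handful of YAML key/value pairs. The field names below are illustrative only, not the authoritative list; the published specification defines the actual requirements:

```yaml
# meta.yml -- a hypothetical sketch; check the published HathiTrust
# specification for the actual required fields and values.
capture_date: 2015-01-15T10:00:00-05:00   # when the volume was scanned
scanner_make: Acme                        # illustrative capture details
scanning_order: left-to-right             # physical page order
reading_order: left-to-right              # text reading order
```

The accompanying checksum.md5 is a plain text file with one md5sum-style line per file: an MD5 hex digest, two spaces, and the filename.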
There are some issues that the simple package format doesn't solve; they are core digitization issues that any institution doing digital preservation would need to consider. Partners will need to:
- Decide what individual objects will consist of in HathiTrust. For monographs, the object in HathiTrust is just the individual book, but for serials, multiple items bound together, and other more complicated cases, institutions must make some decisions.
- Scan images at appropriate resolutions. If the images aren’t high enough resolution to meet HathiTrust’s preservation standards (300 pixels per inch for greyscale or full-color images, 600 pixels per inch for black-and-white), then the only option might be to re-scan those volumes.
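If you're unsure whether existing scans meet the resolution floor, the TIFF header itself records the capture resolution. The sketch below (standard-library Python only; it assumes resolution lives in the first image file directory's XResolution/YResolution tags and that the unit is inches, since it does not check the ResolutionUnit tag) reads those tags and demonstrates on a tiny synthetic TIFF directory:

```python
import struct

def tiff_resolution(data):
    """Read (x_dpi, y_dpi) from the first IFD of a TIFF byte string.

    A minimal sketch: assumes resolution is recorded in the standard
    XResolution/YResolution RATIONAL tags (282/283) and that the unit
    is inches (tag 296, ResolutionUnit, is not checked here).
    """
    endian = "<" if data[:2] == b"II" else ">"
    magic, ifd_offset = struct.unpack_from(endian + "HI", data, 2)
    if magic != 42:
        raise ValueError("not a TIFF file")
    (count,) = struct.unpack_from(endian + "H", data, ifd_offset)
    dpi = {}
    for i in range(count):
        tag, typ, n, value = struct.unpack_from(endian + "HHII", data,
                                                ifd_offset + 2 + 12 * i)
        if tag in (282, 283) and typ == 5:  # RATIONAL: value is an offset
            num, den = struct.unpack_from(endian + "II", data, value)
            dpi[tag] = num / den
    return dpi.get(282), dpi.get(283)

# Demo: build a tiny in-memory TIFF directory claiming 300 DPI.
entries = (struct.pack("<HHII", 282, 5, 1, 38) +   # XResolution -> rational at 38
           struct.pack("<HHII", 283, 5, 1, 46))    # YResolution -> rational at 46
demo = (b"II" + struct.pack("<HI", 42, 8) +        # header: first IFD at offset 8
        struct.pack("<H", 2) + entries + struct.pack("<I", 0) +
        struct.pack("<II", 300, 1) + struct.pack("<II", 300, 1))
print(tiff_resolution(demo))  # (300.0, 300.0)
```

A real scan's first IFD can be laid out many ways, so treat this as a spot-check rather than a validator; a full-featured image library will be more robust.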
There’s also still some work institutions might need to do up front. Institutions must:
- Convert non-TIFF/JPEG 2000 images to greyscale or full-color red/green/blue TIFFs before submission. We can't currently handle image formats other than TIFF or JPEG 2000, images that are neither red/green/blue nor greyscale, or images with 16-bit color depth.
- Perform optical character recognition and produce one plain text file per page image. Most OCR software should be able to do this out of the box.
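For the OCR step, one concrete (hypothetical) route is the open-source Tesseract engine, which already produces one plain text file per input image: given an output base name, it appends .txt itself. A small wrapper, assuming page images named *.tif and the tesseract binary on your PATH, might look like:

```python
import shutil
import subprocess
from pathlib import Path

def text_name(image_name):
    """Map a page image name to its OCR text name, e.g. 00000001.tif -> 00000001.txt."""
    return Path(image_name).stem + ".txt"

def ocr_pages(package_dir):
    """OCR every page TIFF in a package directory with Tesseract.

    Assumes the `tesseract` command-line tool is installed; given an
    output base name, Tesseract writes <base>.txt itself.
    """
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract not found on PATH")
    for tif in sorted(Path(package_dir).glob("*.tif")):
        out_base = tif.with_suffix("")  # tesseract adds the .txt extension
        subprocess.run(["tesseract", str(tif), str(out_base)], check=True)
```

Running `ocr_pages("volume1")` would leave 00000001.txt next to 00000001.tif, matching the one-text-file-per-page-image requirement.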
The requirements are much less technically demanding than using the ingest toolkit, though. So far 7 institutions have successfully submitted over 4,000 locally-digitized volumes to HathiTrust in this new format, and several others have started experimenting with it.
Get in touch with us at firstname.lastname@example.org if your institution is interested in submitting content to HathiTrust, or has any questions or comments about the process!
We’ve made available the specifications for the simple submission package format and an example YAML file.
Additionally, there are a number of free tools that can help with preparing submission packages. If your images are not already TIFF or JPEG 2000 images, take a look at ImageMagick. In most cases you can convert images to TIFFs just by running (for example):
convert infile.jpg outfile.tif
Editors like Notepad++ and TextWrangler can help with creating YAML files. YAML can be a little fussy about syntax and spacing, so there are online YAML validators available to check the files you create.
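The checksum.md5 file itself can come from standard tools (`md5sum * > checksum.md5` on Linux, for example) or from a few lines of Python. This sketch assumes md5sum-style lines (a digest, two spaces, then the filename) and a flat package directory; check the published specification for the exact expected layout:

```python
import hashlib
from pathlib import Path

def write_checksums(package_dir):
    """Write checksum.md5 listing every file in the package and its MD5 digest.

    Uses md5sum-style lines (`<hex digest>  <filename>`). Whether this exact
    layout matches the HathiTrust requirements should be confirmed against
    the published specification.
    """
    package = Path(package_dir)
    lines = []
    for path in sorted(package.iterdir()):
        if path.is_file() and path.name != "checksum.md5":
            digest = hashlib.md5(path.read_bytes()).hexdigest()
            lines.append(f"{digest}  {path.name}")
    (package / "checksum.md5").write_text("\n".join(lines) + "\n")
```

For large TIFFs you would want to hash in chunks rather than reading whole files into memory, but for typical page images this one-pass version keeps the sketch short.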