Electronic Access and The “Collective Collection”

June 17, 2016

Written by Mike Furlough

Note: this is the lightly edited text of a talk presented at the 2016 CRL Collections Forum, @Risk: Stewardship, Due Diligence, and the Future of Print in Chicago, IL on April 14, 2016.

The audio for this talk is available on YouTube.

About twenty years ago, while I was serving a reference desk shift at the University of Virginia Library, a student approached me, looking for “old magazines.” He was working on a well-worn freshman composition assignment that required the student to find and analyze advertisements from issues of Time, Saturday Evening Post, or other popular magazines from the mid-twentieth century.  I showed him the still relatively new web-based library catalog, instructed him on how to search for specific magazine titles, and gave him a guide to the stacks so he could find them by call number.   Just as we wrapped up he asked: “Can’t I see the magazines on the computer?”

That was when I knew that users’ expectations had changed forever, and I had no immediate expectation that we could meet them. At that time, Google didn’t yet exist. Large-scale digitization in libraries had begun, with a lot of support from the Andrew W. Mellon Foundation, but focused mostly on nineteenth-century materials. JSTOR was already available but was still a young product. Science publishers were beginning to smell a new source of income from their own backfiles. Ambitions were so high that we could imagine tens of thousands, maybe a hundred thousand books and journals online in a few years.

Google blew those expectations away in 2004 when they announced their digitization partnership with five libraries[1]. If that student showed up today and bothered to ask for help, I’d show him how to use Google to find back issues of Life Magazine. I could also show him how to recall the magazines from the remote storage facility for twenty-four-hour delivery, but I suspect that I’d be met with greater incredulity than twenty years ago.

A lot has changed in those twenty years since I showed the student how to find the APs, and those twenty years went by awfully fast.  Twenty years, a generational timeframe, seems to be something of a magic number for preservation.  The Digital Preservation Network (DPN) has done business modeling and forecasting on a twenty-year timescale, and twenty years is also the minimum length of time DPN commits to preserving the bits deposited by its members. Twenty years is a common term for shared print retention agreements, the closest thing to a permanent commitment many of us feel we can make, given the unknowns of the future.   Twenty years is as close to forever as we frequently can get.

Many of our discussions around shared print are driven by economics and opportunity costs: can we reduce the cost of maintaining our print collection by sharing the burden with others? Can we free up space for other purposes by doing so? The original ballot proposal for HathiTrust’s shared print monograph archiving program cited a survey of library directors who overwhelmingly agreed that “withdrawal of print books would be an important future strategy…if a robust digital alternative were available” and that “they would be more likely to withdraw their print book collections if their library could provide guaranteed on-demand access to print versions through a sharing network such as HathiTrust.” There’s nothing wrong with responding to near-term needs: it’s the reality of the academic world we are living in, and we’ve premised much of our work at HathiTrust on being able to do things more affordably at scale. (More about our own shared print program later in this talk.)

But we know these decisions have long-term implications, and thus at today’s meeting and elsewhere we as a community are turning to slightly different questions:  How can we ensure future access to the print record?  I propose that it would be useful if we spent some time discussing questions such as:  What is our vision for the state of the print record in the year 2036?

I think we would want to see a sound, robust, continental network of print archives and interlinking access services that include digital access. Maybe there would be a few “mega-repositories” to serve “mega-regions.”[2]  I think that we’d want to have some confidence that our successors could make choices for the next 20, 30, or 40 years with better information than we ourselves have. We would like for our successors in twenty years to be able to account for a significant majority—75, 85, 95%? —of what we know to have been collected by research libraries. We would want to have confidence that we’ve also been able to account for a wider range of print materials, often unique and at risk, that document topics, histories, cultures, and lives that are not now well represented in our research collections, much less our digital collections.  Such materials will have been digitized and the digital files preserved; and multiple libraries will have committed to the long-term retention of each.  The copies will be geographically distributed and will be accessible in appropriate formats that support the needs of the users.

Is it too much to expect that over 20 years we could both account for the totality of North American research collections and ensure that we have well-documented print retention commitments for that totality? I’m usually the one who reminds people that preservation is about coming to terms with loss, not saving everything: “Look, I know that withdrawing a book reminds you of your own mortality, but we really don’t need to keep this one.” So I’m not saying these are even the correct questions. We all know that we will have to work through existing constraints to get to such a point, but sometimes some visioning can be an energizing exercise. I’m not proposing that the few sentences I’ve laid out are such a vision or that I have it right. I’m only suggesting we think about what we want, in addition to where we are now.

What is required for us to have greater certainty about what exists, what is retained, and its suitability for future use?

Obviously we have a very long way to go to reach anything like the very rough sketch I just offered. CRL’s analysis of the data in the PAPR database and last summer’s PAPR II summit both highlight some of the difficulties facing us. OCLC Research has posited that we are seeing the beginnings of “system-level” thinking about the collective collection, but CRL points out that there has been limited coordination among existing serials archiving programs. Their work speculates that perhaps as little as 2% of the existing journal titles held by North American libraries are currently covered in our archiving programs. High-quality holdings data is hard to come by, and so the analysis is only as good as the data we have.

HathiTrust has been focusing all of our planning around shared print support on monographs archiving.  It is practically encoded in our organizational DNA. Two of the goals stated in our bylaws announce that we will

  • …develop partnerships and services that ensure preservation of the materials in HathiTrust and the entire print and digital scholarly record.
  • …reduce long-term capital and operating costs of storage and care of print collections through redoubled efforts to coordinate shared storage strategies among libraries.[3]

Initial planning for this program lays out a goal to eventually confirm retention commitments for the monograph titles with a digital copy in HathiTrust (our current, non-de-duplicated count is about 7 million titles).[4]  Those retention commitments will be made by member libraries and publicly disclosed in commonly used resources, such as WorldCat, as well as other knowledge bases. A robust access system would be a part of the program.  We are really only getting started with this work, but I would like to think that we can use this as an opportunity to help move us toward the vision I sketched earlier.

Although we are not primarily focused on serials print archiving, it would only make sense for HathiTrust to work with CRL and others to support serials print archiving based on the HathiTrust collection. The Keepers Registry has been tracking our holdings, treating us as a serials archive even though it is not our primary focus. Anecdotally I know that libraries consider title inclusion in HathiTrust as a factor when developing retention criteria; but through speaking with various experts I have not heard of any cases where a library has withdrawn solely because the title is in HathiTrust. Of course, our metadata is not perfect. Our un-de-duplicated serials count from the daily statistics is 370,000. OCLC, during last June’s summit, refined this count and put it at around 290,000 titles. Only about 95,000 of the records we hold for serials have an ISSN. We receive an annual report of holdings from each of our member libraries. Even so, it has proven very challenging to match our members’ reported serials holdings data with our digital collection because of the different ways that libraries record holdings for serials and express the enumeration and chronology for these titles.

We do much better for monographs, and we can readily see that there is a great deal of duplication among our members’ physical collections. But we know there are gaps in our knowledge due to a few variables.   These include how easy or hard it is to produce such data from a given ILS system[5] and in some cases whether a library has undertaken a reclamation project with OCLC.[6]  The data may not be as up-to-date or may be missing some information we request.   This of course will inhibit our shared print work to a degree, so improving the quality and extent of our holdings data will be important for shared print programs and operations.

Returning to digitization:  Shared print projects are influenced by two decades of digitization, driven heavily by publishers or others in the commercial sector. Obviously HathiTrust’s collection is one outcome of the mass digitization work begun by Google and to some extent by Internet Archive over a decade ago.  It’s important to note that HathiTrust’s primary collection strategy has been to aggregate materials that have been digitized from our member libraries’ collections in general.  Our focus is on unlocking additional utility from those collections for the entire partnership.  We do this to build a public good while also creating services that benefit the members. We occasionally have collected materials from non-members where the material can be made full view, meets our specifications, and is highly valued.  But these are the exceptions:  the core of our collection is what you will find in circulating and special collections research libraries in North America.   It is primarily published materials in bound form:  books.

Much of this digitization has been done in partnership with Google or Internet Archive. Google has been focused on breadth of material, getting as close as it can to “completeness.” Partner libraries have latitude in what they send to Google, and Google has prioritized some types of materials by request of its partner libraries. But there are well-known limitations in what Google has been able to scan (think foldout maps and large-format items such as newspapers), and such materials are not generally found in HathiTrust.

HathiTrust is not a perfect representation of what has been scanned by Google or anyone else. There are Google partner schools that have not deposited all (or any) of their Google-digitized material into HathiTrust. While there is significant overlap between HathiTrust and print research collections, it is clear that it is hardly complete. From conversing with Constance Malpas at OCLC Research I know that the median overlap rate of ARL schools with HathiTrust is still hovering in the mid-30 percent range, and the maximum does not seem to exceed 50%.

With those caveats in mind, where should we focus our efforts, and how do those align with CRL?

I think it should be obvious that there is too much for any one organization to do. Duplication of effort is not a winning strategy when resources are scarce. And when there is too much to do you should focus on exploiting your strengths. I think that the many libraries that are members of both CRL and HathiTrust (and others who are not) would expect us to pursue complementarity. That is just obvious to me.

The results of a recent survey of our membership, which are still preliminary, offer some additional guideposts.[7] It’s clear that above all else our members highly value the work we’ve done with published textual materials. They care about improving the quality of the corpus and filling gaps in sets and in subjects more than they do about expanding our scope into other formats. This might include a focus on material that Google has not scanned or that has been missed.

There is continued strong support for our US Federal Documents Initiative, which is the one area we have identified as a collection/digitization priority. There we are attempting to identify a corpus of existing federal documents that can be scanned and added to HathiTrust. This program, our shared print monograph program, and improving and “completing” our existing corpus all will require a much clearer sense of the characteristics of our digital corpus, the existing collective print corpus, and the relationship between the two. It’s clear our interest here overlaps with CRL’s, as well as OCLC’s, and we can be working together to advance both digital access and print archiving.

Stepping away from HathiTrust specifically, I believe that just as our “community” long-term vision for shared print is hazy, so is our community long-term vision for digitization.  We have all operated in the last decade with the knowledge that Google has been scanning materials found in circulating collections, but the extent of that work is not widely understood.  Appropriately we see some strategies developed around certain types of materials, such as federal documents within the HathiTrust membership, or the National Digital Newspaper Program from the National Endowment for the Humanities and the Library of Congress. Foundations and national funding agencies have generally shifted their priorities for digitization funding, if they have any, towards scholar-driven selection and commitment to use.  Local strategies wisely focus on what is unique and distinctive from their collections. There are attempts to coordinate as we become aware of opportunities to do so.  For example, CRL has focused on state and international documents, which for HathiTrust are of secondary interest compared with US federal documents.

To the extent that there is alignment among these different programs, it is far less intentional than it might be. All of these loosely coordinated programs have accomplished a great deal, but again I wonder if we can begin to more intentionally craft a vision that could be used as a guide to our collective future investments in digitization. Earlier I linked digitization to print archiving as a part of what a twenty-year vision might include. A twenty-year vision for digitization might assume that we will have achieved the creation of a comprehensive corpus of reformatted items, but would also assume the longevity of print as an important and necessary mode of access that exists symbiotically with digital. This implies that we must continue digitizing materials from the early 2000s onward if existing digital versions cannot be located, and in turn this implies that library digitization strategies rely heavily on the fair use decisions we have seen in the Authors Guild cases to include in-copyright materials. It would promote more strategic uses of digitized in-copyright materials (for example, non-consumptive, computational access should be a standard mode of access, just like on-screen reading).[8] And it should lay out principles of engagement with the commercial sector to help evaluate opportunities for partnership and licensing that will arise.

In the interest of ensuring the Future of Print, is it reasonable to treat digitization as a necessary, required step in print archiving?  Here I am imagining a concerted effort to identify sets of materials that should be retained in print, followed by a commitment to digitize them.[9] In fact, I think that CRL and HathiTrust are both well positioned to examine how we could better link print archiving and digitization going forward, given our respective commitments to these issues and the large number of member libraries we share in common.

To link print archiving to a commitment to digitize would, at the large scale, require significant effort to identify the corpus of material that should be retained and registered as such. To do it well would require significant coordination across multiple existing programs. And significant funding too. Here is where reality comes back to keep us in our corners. We think we have come a very long way in digitization and print archiving, but when we begin to really look closely we can see that we have an even longer way to go. If we could take some time to step back and define where we want to end up in our print archiving work, I think we’d stand a better chance of finding our way there.

But I come back to where I began: what do you want that student in twenty years to be able to do with library collections?


[1] The “Google Five” were Michigan, Stanford, Harvard, New York Public, and Oxford.  See the New York Times story “Google is Adding Major Libraries to its Database,” December 14, 2004.  http://www.nytimes.com/2004/12/14/technology/google-is-adding-major-libraries-to-its-database.html

[2] See OCLC Research’s 2012 report Print Management at “Mega-scale”: A Regional Perspective on Print Book Collections in North America, written by Brian Lavoie, Constance Malpas, and JD Shipengrover, online at http://www.oclc.org/content/dam/research/publications/library/2012/2012-….

[3] See the HathiTrust bylaws, specifically Article I – Purpose online at https://www.hathitrust.org/bylaws

[4] The preliminary planning for this program is summarized in the Final Report of the HathiTrust Print Monograph Archive Planning Task Force, published with commentary at https://www.hathitrust.org/files/sharedprintreport.pdf

[5] For example, some members have a great deal of difficulty extracting holdings data from their ILS systems, either because the system was not designed with this reporting in mind, or the library lacks staff resources to spend a lot of time on it.

[6] HathiTrust relies on OCLC IDs to match records between a library’s local holdings and the volumes in HathiTrust’s digital collection.

[7] We expect to publish the results of the survey in summer 2016.

[8] For us, enabling computational analysis for the HathiTrust collection is a critical mode of access, which we support through the HathiTrust Research Center at Indiana and Illinois. We’ve seen some promising work using content analysis that could contribute to improvements or enhancements in the descriptive metadata of our corpus. We’ve also learned that researchers pose questions that are corpus agnostic. The HathiTrust collection is great, but they also want to compute against several other collections, and it is complicated for them to do that. Extending Research Center services across other corpora, such as CRL’s, could be helpful both to scholars and to libraries.

[9] The partnership between CRL and Linda Hall Library is a model for this. Together they’ve committed to a joint partnership to preserve and develop historical research collections in the fields of science, technology and engineering. They have developed a joint collection strategy, and have recently announced plans to begin digitizing pre-1950 titles. See https://www.crl.edu/news/crl-and-linda-hall-library-digitize-historical-serials