The Bibliography and the Index: Dr. Ana Lučić’s Quest to Automate Peritextual Identification

HTRC

July 12, 2024

Index, Table of Contents, Dedication, Bibliography: beyond literary wayfinding, this text, known as peritext, can be both a valuable source for scholars researching texts at scale in areas such as publishing trends or popular references within a certain time period, and a hindrance when they need to focus on “just the text.” Using the robust collection of the HathiTrust Digital Library and the tools for computational analysis provided by the HathiTrust Research Center, it is theoretically possible to query peritextual data. But there’s a catch: you need to be able to identify which content comes from which parts of the text. There is currently no way for researchers to computationally discern between these textual elements, but Dr. Ana Lučić and her research team aim to change that.

Dr. Lučić recently received a National Endowment for the Humanities (NEH) Digital Humanities Advancement grant to pursue a project to automatically detect peritext and differentiate it from the core body of work. This would allow researchers to engage in research projects targeting peritextual elements of volumes in the HathiTrust Digital Library. We spoke with Ana recently to learn more about her project and its potential impact on computational scholarship using the HathiTrust Digital Library.

Dr. Ana Lučić is a Staff Research Scientist at the Illinois Applied Research Institute at the University of Illinois Urbana-Champaign, and a PhD graduate of its School of Information Sciences, which co-hosts the HathiTrust Research Center. Ana specializes in natural language processing techniques, literature-based discovery, data engineering and data management, and deriving new knowledge from old data. In 2017, she and a team of researchers received an Advanced Collaborative Support award from the HathiTrust Research Center to pursue the Computational Support for Reading Chicago Reading project.

 

[Image: two pages from The House of Mirth, with “New York” circled on the title page and on an interior page.]
Ana’s project aims to automatically differentiate between text appearing in front or back matter and the same text within the core work.

 

Tell me about your new research project funded through the National Endowment for the Humanities and what exactly you’ll be investigating through this work.

We have proposed to build a data set of approximately 1,000 fiction and nonfiction works that are available through the HathiTrust Digital Library. We would like to use these books, published across different decades of the twentieth century, to identify the boundaries of the front matter and back matter. Front matter includes elements such as a title page, a preface, and an introduction; back matter includes a bibliography, maybe an index of terms, perhaps a conclusion or acknowledgements.

All of these elements are what some consider to be an essential part of the work, and they may or may not require separation from the core work during analysis. Additionally, it can sometimes be challenging to discern whether a part of the work was created or added by the author or by the publisher. But some would say that, for particular types of analyses, separating front and back matter from the core work can be beneficial. So we would like to manually identify in this data set the boundaries of the front matter and the back matter, and then determine where the core work starts and where it ends. That would provide us with the basis for building a supervised predictive model that would then, hopefully, allow us to predict these boundaries algorithmically in other works in the HathiTrust Digital Library.
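To make that pipeline concrete, here is a minimal sketch of how hand-labeled pages could train such a boundary classifier. It assumes invented per-page features and scikit-learn; it is an illustration only, not the team’s actual implementation:

```python
# Minimal sketch (not the project's actual pipeline): train a page-level
# classifier on hand-labeled pages, then predict front/core/back labels
# for the pages of an unseen volume. All feature values are hypothetical.
from sklearn.ensemble import RandomForestClassifier

# Hypothetical per-page features: relative position in the volume,
# token count, digit ratio, and capitalized-token ratio.
X_train = [
    [0.01, 40, 0.02, 0.60],   # e.g. a title page
    [0.50, 320, 0.01, 0.12],  # e.g. a core-text page
    [0.97, 180, 0.15, 0.45],  # e.g. an index page
]
y_train = ["front", "core", "back"]  # manual annotations

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)

# Label each page of a new volume; runs of "front" at the start and
# "back" at the end approximate the structural boundaries.
new_pages = [[0.02, 55, 0.03, 0.55], [0.45, 300, 0.01, 0.10]]
print(clf.predict(new_pages))
```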

 

Why is the distinction between the front matter, back matter, and core text important? 

When you compare the core content and the peritextual elements based on the features extracted from them, you can see that peritext can be modeled as an outlier of the core work. This prompted us [to ask] if we can identify these peritextual elements in an automated way. For the types of analyses that we would like to do, it would be helpful to know exactly which structural part of the work an extracted feature came from, because we think that distinction is important.

If, for example, you are analyzing how many times a place name is mentioned in the text and you include the bibliography in that analysis, you might end up with lots of mentions of New York, because New York is the mecca of the publishing world. A lot of publishers, and a lot of works that were mentioned in the book, might be referenced at the end. You can then get the sense that this book contains a lot of references to New York, when actually they are mostly in the bibliography and not really in the core text. What you would like to know is whether New York is mentioned in the core text, too, and if so, how it is being mentioned. That impacts a scholar’s understanding and analysis of a title or a group of titles.
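A toy sketch of the skew Ana describes, with invented page texts, shows how including back matter inflates a place-name count:

```python
# Toy illustration: counting "New York" with and without the
# bibliography pages. All page texts here are invented.
pages = [
    ("core", "The train left New York at dawn..."),
    ("core", "She had never loved the city."),
    ("back", "Scribner's, New York, 1905."),
    ("back", "Appleton & Co., New York, 1910."),
    ("back", "Macmillan, New York, 1913."),
]

total = sum(text.count("New York") for _, text in pages)
core_only = sum(text.count("New York") for part, text in pages if part == "core")
print(f"All pages: {total} mentions; core text only: {core_only}")
# All pages: 4 mentions; core text only: 1
```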

 

Why use the HathiTrust Digital Library over another digital library? 

The titles [in HathiTrust] come from a lot of libraries throughout the United States and elsewhere and reflect the holdings of those libraries in some way, but those holdings include a very wide variety of works. We’ve decided that the HathiTrust collection is a very comprehensive resource in terms of the number of fiction and nonfiction titles that it holds, and we have used it in the past. It has the tools and environments that we are used to working with, like the secure data capsule, the HathiFiles, and the digital library catalog.

 

How will you go about selecting titles for your study? 

We are not the only ones who are examining these questions. Other scholars are also interested in questions such as: how can we identify whether a work is fictional or primarily nonfictional? Ted Underwood, for example, has done a lot of work on automatically identifying fiction works featured in the HathiTrust Digital Library. He has released a list of works that his predictive method determined to be fiction. This is a useful list for us, because we are currently using it to randomly select works published in the twentieth century that we could possibly include in our data set.
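As a minimal sketch of that selection step (with invented records and field names, since the list’s actual schema isn’t described here), a date-filtered random sample might look like:

```python
# Toy sketch: filter a metadata list to twentieth-century titles and
# draw a random sample. Records and field names are invented.
import random

records = [
    {"htid": "uc1.b0000001", "year": 1923},
    {"htid": "mdp.39015000000001", "year": 1876},
    {"htid": "nyp.33433000000002", "year": 1954},
]

c20 = [r for r in records if 1900 <= r["year"] <= 1999]
random.seed(42)  # reproducible sample
sample = random.sample(c20, k=min(2, len(c20)))
print([r["htid"] for r in sample])
```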

We are also going to use the help of the UIUC library to translate some of the categories into Library of Congress subject headings so that we can query the HathiTrust Digital Library. When selecting these 1,000 works, we are going more for a randomized search, because we think that will be more in line with what a user might see when they search the digital library.

We’ll be selecting both fiction and nonfiction. Some of the works are currently in the public domain, but I think that we will have a lot more works that are still under copyright, and we are actually relying on HathiTrust to allow us access to [that work] so that we can demarcate the boundaries of front matter and back matter. In addition to extracting features from the full text, we also plan to use the non-consumptive features that are available through the Extracted Features dataset created by the HathiTrust Research Center team. So, we will be building two models: one that relies on the features extracted from full text and another that relies on non-consumptive features available through the Extracted Features dataset.
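For readers unfamiliar with the Extracted Features dataset, here is a brief sketch of pulling page-level features that such a model could consume. It assumes the open-source htrc-feature-reader Python library, and the volume ID is a placeholder:

```python
# Sketch assuming the open-source htrc-feature-reader library
# (pip install htrc-feature-reader). The volume ID below is a
# placeholder; any HathiTrust ID with Extracted Features would work.
from htrc_features import Volume

vol = Volume("mdp.39015012345678")  # hypothetical volume ID

# Per-page token counts: a simple non-consumptive signal, since front
# and back matter pages often differ in token density from core pages.
print(vol.tokens_per_page().head())

# Page-level token counts by part of speech: a richer feature source
# for a boundary model.
print(vol.tokenlist(pages=True, pos=True).head())
```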

We are excited to observe what the model will be able to tell us. We believe that we will probably be able to see some trends and some indication of where it is likely that the front matter ends and back matter begins. 

 

Give an overview of the phases and tools you’ll be using for your research.

We are still in the early phases of the project, although we have identified the works that will go in the dataset. We have also developed a user interface inside a secure data capsule that will facilitate the annotation of the selected works and their digitized pages.

We are currently at the annotation phase of the project, which begins with establishing inter-annotator agreement between different annotators. This process will help us establish how much agreement there is between annotators on the task of marking front matter, core work, and back matter. The annotation phase is expected to be complete by the end of the summer.
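One common way to quantify that agreement is Cohen’s kappa; the sketch below computes it with scikit-learn over invented page labels (the team’s actual agreement metric isn’t specified in the interview):

```python
# Sketch of measuring inter-annotator agreement with Cohen's kappa.
# The page labels from two annotators are invented for illustration.
from sklearn.metrics import cohen_kappa_score

annotator_a = ["front", "front", "core", "core", "core", "back", "back"]
annotator_b = ["front", "core",  "core", "core", "core", "back", "back"]

kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"Cohen's kappa: {kappa:.2f}")  # 1.0 would be perfect agreement
```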

After the annotation phase, we will have the structural boundaries of each work, based on which we will be able to extract features from the front matter, core work, and back matter and start training the model to learn the separation boundary between these elements of the work.

We also plan to create visualizations to assist users in seeing what they can expect from this work, such as, “Hey, this work has a lot of back matter” or “This work has a lot at the front.” Then there would be a period of manual evaluation and validation of our results, and the final stage would involve establishing whether any of these methods can potentially be integrated with the HathiTrust Digital Library. So this is the [ultimate] goal, and our hypothesis is that this visualization would be useful to other readers and researchers who apply computational analysis methods to the works inside the HathiTrust Digital Library.
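One plausible form for such a visualization, sketched here with matplotlib and invented proportions (not the project’s actual design), is a stacked bar showing each volume’s share of front matter, core work, and back matter:

```python
# Sketch of the kind of visualization described: per-volume proportions
# of front matter, core work, and back matter. Numbers are invented.
import matplotlib.pyplot as plt

volumes = ["Vol. A", "Vol. B", "Vol. C"]
front = [0.10, 0.04, 0.15]
core = [0.80, 0.94, 0.55]
back = [0.10, 0.02, 0.30]

fig, ax = plt.subplots()
ax.bar(volumes, front, label="front matter")
ax.bar(volumes, core, bottom=front, label="core work")
ax.bar(volumes, back, bottom=[f + c for f, c in zip(front, core)],
       label="back matter")
ax.set_ylabel("proportion of pages")
ax.legend()
plt.show()
```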

 

What will be a possible application of this model?

I believe that these methods can allow researchers, for example, to analyze the history of the book and the history of publishing. If you had a method that could accurately identify front matter, you could then identify a workset from the HathiTrust Digital Library and observe how publishers have decorated the book across decades or centuries, and in what ways these practices fluctuated.

[These methods] can help investigate questions about how we access anything that we read and what surrounds it. I think this is an important question. It still requires lots of research into what surrounds the book and what we actually read in the book, because we frequently tend to skip certain pages. This is something that I would like to do in the future, and I’m hoping others will be able to as well.

 

Conclusion

In attempting to create an automated method for identifying peritext, one that other researchers can use off the shelf for investigations in the HathiTrust Digital Library, Dr. Lučić’s team will contribute to the common good of computational literary analysis. With more than 18 million titles in the collection and an advanced research environment at the HathiTrust Research Center that includes copyrighted text for non-consumptive research, HathiTrust is uniquely positioned to serve as the repository source for this grant-funded research.

 
