HathiTrust Research Center’s First Hackathon: What We Learned and Built


July 8, 2024

From May 21 through May 23, the HathiTrust Research Center (HTRC) held its first-ever hackathon at the University of Illinois at Urbana-Champaign.

We held the event as part of Tools for Open Research and Computation with HathiTrust: Leveraging Intelligent Text Extraction (TORCHLITE), a project funded by the National Endowment for the Humanities. TORCHLITE uses HTRC’s new Extracted Features (EF) API. An API, or application programming interface, is a way for computer programs to communicate with each other, allowing access to data in a standardized way. The EF API allows programmatic access to metadata and annotated tokens (words) for more than 17 million volumes from the HathiTrust Digital Library collection, including in-copyright material.

[Image: A group of people sit and stand around a table where one person shows something on a white laptop computer.]

The event brought together a diverse group of researchers, including digital humanists, computer scientists, librarians, and graduate students from various disciplines, to use this new API to develop visualizations, code notebooks (think Google Docs but for code), educational material, and web applications. Participants had a wide range of expertise in literature, data science, programming, history, languages, and more. The diversity of expertise led to many interesting and successful collaborations. Participant and University of Kentucky Data Services Librarian Isaac Wink said he enjoyed “getting to connect with a wide range of HTRC users. We all came from very different places in terms of research interests, technical skills, and points in our careers, but the common point of interest in HTRC tools enabled us to engage with each other’s expertise and build new connections.”

[Images: Participants working on computers and talking with each other.]

The first day of the event started with a keynote address by Ben Lee and a handful of presentations about the Extracted Features data and the API to help participants orient themselves. We also did a speed networking icebreaker activity, where people circulated around the room to introduce themselves and talk about their interests. The next two days were dedicated to forming groups and using the API to create data visualizations and other applications. HTRC staff had outlined several potential projects in advance and led the groups that took them on. Other groups formed organically and got to work. Throughout the hackathon, we held check-ins where groups presented their progress informally.

By the end of the three days, participants had created a dozen projects. The projects include a mapping visualization of publication information, a code notebook for searching for specific terms across a collection of HathiTrust documents, and a code notebook that visualizes the most prominent words in a collection of documents.
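
As a rough illustration of the kind of analysis these notebooks perform, the sketch below counts the most prominent words in a small collection and searches it for a specific term. It does not call the EF API and is not the participants' code. The sample data is invented, its shape (each page mapping a token to part-of-speech counts) is a simplifying assumption loosely modeled on Extracted Features data, and the function names are hypothetical.

```python
from collections import Counter

# Invented sample data (an assumption, not real HathiTrust content):
# volume ID -> list of pages, each page mapping token -> {POS tag: count}.
SAMPLE_COLLECTION = {
    "vol-001": [
        {"whale": {"NN": 3}, "sea": {"NN": 2}, "sail": {"VB": 1, "NN": 1}},
        {"sea": {"NN": 4}, "ship": {"NN": 2}},
    ],
    "vol-002": [
        {"garden": {"NN": 5}, "sea": {"NN": 1}},
    ],
}


def volume_token_counts(pages):
    """Sum token counts over all pages of one volume, ignoring POS tags."""
    counts = Counter()
    for page in pages:
        for token, pos_counts in page.items():
            counts[token.lower()] += sum(pos_counts.values())
    return counts


def top_words(collection, n=3):
    """Return the n most frequent tokens across every volume."""
    total = Counter()
    for pages in collection.values():
        total += volume_token_counts(pages)
    return total.most_common(n)


def find_term(collection, term):
    """Return {volume ID: count} for each volume that contains the term."""
    term = term.lower()
    hits = {}
    for vol_id, pages in collection.items():
        count = volume_token_counts(pages)[term]
        if count:
            hits[vol_id] = count
    return hits


if __name__ == "__main__":
    print(top_words(SAMPLE_COLLECTION))
    print(find_term(SAMPLE_COLLECTION, "sea"))
```

Working from per-page token counts rather than full text is what makes this style of analysis possible for in-copyright volumes: the counts support questions about word frequency without exposing the readable text.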

Overall, the hackathon was collegial, collaborative, and easygoing, with lots of meals, snacks, and coffee to fuel the work. Rafia Mirza, Digital Scholarship Librarian at Southern Methodist University, said that her favorite part of the event was “how it was set up to bring together people with varying levels of technical skill to play around with the API to create something interesting (and hopefully useful) by the end of the hackathon. It was a very supportive and collegial atmosphere, and it was very neat to have made something by the end.”

[Image: A graph showing five bands of multi-colored data, illustrating the type of work explored during the hackathon, including this representation of word occurrences in Marcel Proust.]
Visualization created during the hackathon that graphs parts of speech across pages in a single HathiTrust volume. https://github.com/gworthey/TORCHLITE_PoS/

The HTRC staff learned many valuable lessons from the hackathon. We were able to see what kinds of questions and visualizations are compelling to researchers, what non-HathiTrust data they are interested in using to augment our data, and how they tackle coding projects. We also learned a great deal about our own data and what could be improved in future iterations of the Extracted Features data and the TORCHLITE dashboard.

Based on participant feedback, we know participants got a lot out of the hackathon as well. Isaac stated that he learned “what collaborative coding looks like” and that the secret to success is “to be able to frame a project well and divide up tasks modularly so that everyone’s work can fit together nicely.” Rafia is looking forward to using HTRC in workshops on text mining and has already started working on incorporating the new HTRC resources and tools into SMU’s fall workshop series.

The next step for the TORCHLITE project is to create a production version of our dashboard and API. Currently, the development API is open to the public but is still in beta. We hope to release production versions of both the API and dashboard by the end of 2024. All the TORCHLITE documentation, including educational materials such as code notebooks, is available on our hackathon website: https://htrc.github.io/torchlite-hackathon/. Questions about the API, TORCHLITE, our documentation, and HTRC generally can be directed to htrc-help@hathitrust.org.