Usage examples of HathiTrust datasets

Researchers who have received access to a dataset from HathiTrust have produced the work listed below.

Publications, presentations, and dissertations


Krewson, Stephen. “Extracting Illustrated Pages from Digital Libraries with Python.” The Programming Historian 8 (January 14, 2019). https://programminghistorian.org/en/lessons/extracting-illustrated-pages.


Dubnicek, Ryan, Ted Underwood, and J. Stephen Downie. “Creating A Disability Corpus for Literary Analysis: Pilot Classification Experiments.” IConference 2018 Proceedings, July 12, 2018. http://hdl.handle.net/2142/100252.

Martin, Cathie Jo. “Imagine All the People: Literature, Society, and Cross-National Variation in Education Systems.” World Politics 70, no. 3 (June 19, 2018): 398–442. https://doi.org/10.1017/s0043887118000023.

Martin, Shawn. “Textual Analysis and the History of Scholarly Communication.” Proceedings of the Association for Information Science and Technology 54, no. 1 (October 24, 2017): 752–53. https://doi.org/10.1002/pra2.2017.14505401143.

McConnaughey, Lara, Jennifer Dai, and David Bamman. “The Labeled Segmentation of Printed Books.” Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, September 2017, 737–47. https://doi.org/10.18653/v1/d17-1077.

Underwood, Ted. “The Historical Significance of Textual Distances.” 2nd Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature (LaTeCH-CLfL 2018), June 30, 2018. https://arxiv.org/abs/1807.00181.

Underwood, Ted. “Why Literary Time Is Measured in Minutes.” ELH 85, no. 2 (2018): 341–65. https://doi.org/10.1353/elh.2018.0013.

Underwood, Ted, David Bamman, and Sabrina Lee. “The Transformation of Gender in English-Language Fiction.” Journal of Cultural Analytics, February 13, 2018. https://doi.org/10.22148/16.019.


Bamman, David, Michelle Carney, Jon Gillick, Cody Hennesy, and Vijitha Sridhar. “Estimating the Date of First Publication in a Large-Scale Digital Library.” 2017 ACM/IEEE Joint Conference on Digital Libraries - JCDL 17, June 1, 2017. https://doi.org/10.1109/jcdl.2017.7991569.

Page, Kevin, Terhi Nurmikko-Fuller, Timothy Cole, and J. Stephen Downie. “Building Worksets for Scholarship by Linking Complementary Corpora.” Digital Humanities DH2017, 2017. https://dh2017.adho.org/abstracts/606/606.pdf.

Reagan, Andrew J. “Towards a Science of Human Stories: Using Sentiment Analysis and Emotional Arcs to Understand the Building Blocks of Complex Social Systems.” Order No. 10266462, The University of Vermont and State Agricultural College, 2017. https://search.proquest.com/docview/1889207124?accountid=14667.


Alex, Beatrice, Claire Grover, Jon Oberlander, Tara Thomson, Miranda Anderson, James Loxley, Uta Hinrichs, and Ke Zhou. “Palimpsest: Improving Assisted Curation of Loco-Specific Literature.” Digital Scholarship in the Humanities 32, no. suppl_1 (November 7, 2016): i4–i16. https://doi.org/10.1093/llc/fqw050.

Algee-Hewitt, Mark, Sarah Allison, Marissa Gemma, Ryan Heuser, Franco Moretti, and Hannah Walser. “Canon/Archive. Large-Scale Dynamics in the Literary Field.” Pamphlets of the Stanford Literary Lab Pamphlet 11 (January 2016): 1–13. https://litlab.stanford.edu/LiteraryLabPamphlet11.pdf.

Duhaime, Douglas Ernest. “Textual Reuse in the Eighteenth Century: Mining Eliza Haywood's Quotations.” DHQ: Digital Humanities Quarterly 10, no. 1 (2016). http://www.digitalhumanities.org/dhq/vol/10/1/000229/000229.html.

Hinze, Annika, David Bainbridge, Sally Jo Cunningham, and J. Stephen Downie. “Low-Cost Semantic Enhancement to Digital Library Metadata and Indexing.” Proceedings of the 16th ACM/IEEE-CS on Joint Conference on Digital Libraries - JCDL 16, June 19, 2016, 93–102. https://doi.org/10.1145/2910896.2910910.

Jett, Jacob, Terhi Nurmikko-Fuller, Timothy W. Cole, Kevin R. Page, and J. Stephen Downie. “Enhancing Scholarly Use of Digital Libraries: A Comparative Survey and Review of Bibliographic Metadata Ontologies.” Proceedings of the 16th ACM/IEEE-CS on Joint Conference on Digital Libraries - JCDL 16, June 19, 2016, 35–44. https://doi.org/10.1145/2910896.2910903.

Piper, Andrew. “Fictionality.” Journal of Cultural Analytics, December 20, 2016. https://doi.org/10.22148/16.011.

Underwood, Ted. “The Life Cycles of Genres.” Journal of Cultural Analytics, May 2016. https://doi.org/10.22148/16.005.

Underwood, Ted, and Jordan Sellers. “The Longue Durée of Literary Prestige.” Modern Language Quarterly 77, no. 3 (September 2016): 321–44. https://doi.org/10.1215/00267929-3570634.


Hinze, Annika, Craig Taube-Schock, David Bainbridge, Rangi Matamua, and J. Stephen Downie. “Improving Access to Large-Scale Digital Libraries Through Semantic-Enhanced Search and Disambiguation.” Proceedings of the 15th ACM/IEEE-CS on Joint Conference on Digital Libraries - JCDL 15, June 21, 2015, 147–56. https://doi.org/10.1145/2756406.2756920.

Hinze, Annika, Craig Taube-Schock, David Bainbridge, Sally Jo Cunningham, and J. Stephen Downie. “Introducing Capisco: a Semantically-Enhanced Search and Discovery System for Large-Scale Text Corpora.” ACM SIGWEB Newsletter, 2015, 1–14. https://doi.org/10.1145/2833219.2833223.

Joque, Justin. “Text Analysis and Visualization in the Literature Classroom.” Lecture presented at the HTRC UnCamp, March 30, 2015.

Leetaru, Kalev. “History As Big Data: 500 Years Of Book Images And Mapping Millions Of Books.” Forbes. Forbes Media, September 16, 2015. https://www.forbes.com/sites/kalevleetaru/2015/09/16/history-as-big-data-500-years-of-book-images-and-mapping-millions-of-books/#70da78456aba.

Nurmikko-Fuller, Terhi, Kevin R. Page, Pip Willcox, Jacob Jett, Chris Maden, Timothy Cole, Colleen Fallaw, Megan Senseney, and J. Stephen Downie. “Building Complex Research Collections in Digital Libraries: A Survey of Ontology Implications.” Proceedings of the 15th ACM/IEEE-CS on Joint Conference on Digital Libraries - JCDL 15, June 21, 2015, 169–72. https://doi.org/10.1145/2756406.2756944.

Underwood, Ted. “The Literary Uses of High-Dimensional Space.” Big Data & Society 2, no. 2 (December 1, 2015): 1–6. https://doi.org/10.1177/2053951715602494.


Bamman, David, Ted Underwood, and Noah A. Smith. “A Bayesian Mixed Effects Model of Literary Character.” Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics 1 (June 2014): 370–79. https://doi.org/10.3115/v1/p14-1035.


Burton, Orville Vernon. “The South as ‘Other,’ the Southerner as ‘Stranger.’” The Journal of Southern History 79, no. 1 (2013): 7–50. http://www.jstor.org/stable/23795402.

Conway, Paul. “Preserving Imperfection: Assessing the Incidence of Digital Imaging Error in HathiTrust.” Preservation, Digital Technology & Culture 42, no. 1 (2013): 17–30. https://doi.org/10.1515/pdtc-2013-0003.

Underwood, Ted, Michael L. Black, Loretta Auvil, and Boris Capitanu. “Mapping Mutable Genres in Structurally Complex Volumes.” 2013 IEEE International Conference on Big Data, October 2013. https://doi.org/10.1109/bigdata.2013.6691676.

Underwood, Ted. “We Don’t Already Understand the Broad Outlines of Literary History.” The Stone and the Shell (blog), February 8, 2013. https://tedunderwood.com/2013/02/08/we-dont-already-know-the-broad-outlines-of-literary-history/.

Willis, Craig, and Miles Efron. “Finding Information in Books: Characteristics of Full-Text Searches in a Collection of 10 Million Books.” Proceedings of the American Society for Information Science and Technology, November 1, 2013, 48:1–48:10. https://doi.org/10.1002/meet.14505001085.


Hagedorn, Kat, Michael Kargela, Youn Noh, and David Newman. “A New Way to Find: Testing the Use of Clustering Topics in Digital Libraries.” D-Lib Magazine 17, no. 9/10 (September/October 2011). https://doi.org/10.1045/september2011-hagedorn.


Hagedorn, Kat, David Newman, and Youn Noh. “How Topic Modeling Is Useful in Digital Libraries.” Presentation, 2010. https://pdfs.semanticscholar.org/a01e/4772dbd8fbab62e43987c49f1510393ec10c.pdf.

Newman, David, Youn Noh, Edmund Talley, Sarvnaz Karimi, and Timothy Baldwin. “Evaluating Topic Models for Digital Libraries.” Proceedings of the 10th Annual Joint Conference on Digital Libraries - JCDL 10, June 21, 2010, 215–24. https://doi.org/10.1145/1816123.1816156.


Brown, Travis. “Princeton Prosody.” Maryland Institute for Technology in the Humanities, 2013. https://mith.umd.edu/research/princeton-prosody/.

Mendenhall, Ruby, and Mark Van Moer. “Text Analytics Visualization.” National Center for Supercomputing Applications (NCSA) at the University of Illinois. University of Illinois, n.d. http://www.ncsa.illinois.edu/enabling/vis/vis_group/text_analytics.

Mendenhall, Ruby, Ralph Roskies, Michael Levine, and Nicholas Nystrom. “Rescued History.” National Science Foundation, February 25, 2016. https://www.nsf.gov/discoveries/disc_summ.jsp?cntn_id=137797.

Morrissey, Robert, and Min Chen. “Commonplace Cultures.” Commonplace Cultures, n.d. https://commonplacecultures.org/.

Poehler, Eric. “Pompeii Bibliography and Mapping Project.” The UMass Digital Humanities Initiative. University of Massachusetts Amherst, November 29, 2017. https://digitalhumanities.umass.edu/pbmp/?author=1.