Text Mining

Working with HathiTrust

As HathiTrust partners, University of California affiliates may use the HathiTrust corpus of digitized books for computational research and teaching. The full corpus can be searched using the HathiTrust Digital Library (HTDL). Many of the books are in the public domain, and their full text is readily available. For books still in copyright, HTDL makes available only the book's descriptive metadata (though there is a way to work with in-copyright materials, described below). To log in to HTDL, use your CRUZ ID and gold password.

HathiTrust Research Center (HTRC) Analytics supports researchers doing computational analysis of the corpus. HTRC requires a separate, free account. At a basic level, you can create a workset of books in HTDL and import it into HTRC to run basic algorithms; at the advanced level, you can work with HTRC to gain access to the entire HathiTrust corpus, including materials still in copyright, for use in non-consumptive research* activities.

Scroll down to read more about the following options:

  1. Web-based Algorithms
  2. Datasets for Non-Consumptive Research*
  3. Data Capsules for Non-Consumptive Research*

*From the 2010 Authors Guild vs Google amended settlement agreement: "'Non-Consumptive Research' means research in which computational analysis is performed on one or more Books, but not research in which a researcher reads or displays substantial portions of a Book to understand the intellectual content presented within the Book." Non-consumptive analytics includes image analysis, text extraction, textual analysis and information extraction, linguistic analysis, automated translation, and indexing and search. Read more in HathiTrust's Non-Consumptive Use Research Policy.

1. Web-based Algorithms (Public Domain Books)

At a basic level, you can run scripts on small worksets of books you have gathered from the HathiTrust Digital Library. These scripts are prepackaged ("canned") algorithms for quick analysis:

  • Open HTDL and HTRC and log in to both.
  • In HTDL, build a collection from the public domain volumes in the HathiTrust Digital Library, then upload it into HTRC as a workset.
  • In HTRC, use the web-based algorithms. When you execute an algorithm, you will be prompted to select a workset (your own or a publicly available one).

Note: this approach does not include in-copyright works.

2. Research Datasets

HTRC releases research datasets to facilitate text analysis using the HathiTrust Digital Library. Copyright-protected texts are not available for download from HathiTrust, but research can still be performed through non-consumptive analysis of features extracted from the full text. For example, you can download n-grams from over 13 million volumes in the HTDL and analyze them in the computing environment of your choice.

Extracted features include volume-level metadata, page-level metadata, part-of-speech-tagged tokens, and token counts.
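
To give a sense of what working with extracted features looks like, here is a minimal Python sketch that sums page-level, part-of-speech-tagged token counts into volume-level counts. The record below is a hypothetical, simplified stand-in: the field names ("metadata", "pages", "tokenPosCount") and the sample values are assumptions for illustration only, so check the dataset documentation for the actual schema before adapting it.

```python
from collections import Counter

# Hypothetical, simplified record for a single volume. Field names and
# values are illustrative assumptions, not the published dataset schema.
volume = {
    "metadata": {"title": "Example Volume", "pubDate": "1895"},
    "pages": [
        {"seq": "00000001",
         "tokenPosCount": {"whale": {"NN": 3}, "the": {"DT": 7}}},
        {"seq": "00000002",
         "tokenPosCount": {"whale": {"NN": 1, "VB": 1}, "sea": {"NN": 2}}},
    ],
}


def volume_token_counts(vol):
    """Sum per-page, per-tag counts into one count per token for the volume."""
    totals = Counter()
    for page in vol["pages"]:
        for token, pos_counts in page["tokenPosCount"].items():
            totals[token.lower()] += sum(pos_counts.values())
    return totals


def counts_by_pos(vol, pos_tag):
    """Count only the occurrences of each token tagged with pos_tag."""
    totals = Counter()
    for page in vol["pages"]:
        for token, pos_counts in page["tokenPosCount"].items():
            if pos_tag in pos_counts:
                totals[token.lower()] += pos_counts[pos_tag]
    return totals


counts = volume_token_counts(volume)
noun_counts = counts_by_pos(volume, "NN")
print(counts.most_common(3))
print(noun_counts.most_common(3))
```

Because the sketch works only with counts, and the page text itself never appears in the record, this style of analysis stays within non-consumptive use.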

3. Data Capsules

The HTRC Data Capsule gives a researcher a secure, virtual computer for non-consumptive analytical access to the full OCR text of works in the HathiTrust Digital Library. Data capsules are restricted, particularly in how and when the products created by analysis tools can leave the capsule: data products must undergo a results review before release. To get started with the Data Capsule, check out this tutorial: