Text Mining

Working with HathiTrust

As HathiTrust partners, University of California affiliates may use the HathiTrust corpus of digitized books for research and educational computational investigation. The full corpus can be searched using the HathiTrust Digital Library (HTDL). Many of the books are in the public domain, and their full text is readily available. For books still in copyright, HTDL makes available only the book's descriptive metadata (though there is a way to work with materials in copyright, described below). Log in to HTDL with your CruzID and Gold password.

The HathiTrust Research Center (HTRC) Analytics supports researchers in performing computational analysis of the corpus. HTRC requires you to create a separate, free account. At a basic level, you can create a workset of books in HTDL and import it into HTRC to run basic algorithms; at the advanced level, you can work with HTRC to gain access to the entire HathiTrust corpus, including materials still in copyright, for use in nonconsumptive research* activities.

Scroll down to read more about the following options:

Web-based Algorithms
Datasets for NonConsumptive Research*
Data Capsules for NonConsumptive Research

*From the 2010 Authors Guild vs Google amended settlement agreement: "'Non-Consumptive Research' means research in which computational analysis is performed on one or more Books, but not research in which a researcher reads or displays substantial portions of a Book to understand the intellectual content presented within the Book." Non-consumptive analytics includes image analysis, text extraction, textual analysis and information extraction, linguistic analysis, automated translation, and indexing and search. Read more in HathiTrust's Non-Consumptive Use Research Policy.

Getting Started Guide
HTRC's documentation and FAQ to get you started.

1. Web-based Algorithms (Public Domain Books)

At a basic level, you can run scripts on small worksets of books you have gathered from the HathiTrust Digital Library. These are essentially canned algorithms for quick analysis:

Open HTDL and HTRC and log in to both.
In HTDL, build a collection from the public domain volumes in the HathiTrust Digital Library, then upload it into HTRC as a workset.
In HTRC, use the web-based algorithms. When you execute an algorithm, you will be prompted to select a workset (your own or a publicly available one).

Note: this approach does not include in-copyright works.

HathiTrust Digital Library
This is a digital preservation repository and access platform. It provides long-term preservation and access services for public domain and in-copyright content from a variety of sources, including Google, the Internet Archive, Microsoft, and in-house partner institution initiatives, including UCSC.
HathiTrust Research Center (HTRC) Analytics
Supports large-scale computational analysis of the works in the HathiTrust Digital Library to facilitate non-profit and educational research. Sign up for a free account.

2. Research Datasets

HTRC releases research datasets to facilitate text analysis using the HathiTrust Digital Library. While copyright-protected texts are not available for download from HathiTrust, research can still be performed through non-consumptive analysis of features extracted from the full text. For example, you can download n-grams from over 13 million volumes in the HTDL and analyze them in the computing environment of your choice.

Extracted features include volume-level metadata, page-level metadata, part-of-speech-tagged tokens, and token counts.
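To give a sense of what working with token counts can look like, here is a minimal Python sketch. It assumes a simplified, hand-made record loosely modeled on a page-level extracted features file; the field names (`features`, `pages`, `body`, `tokenPosCount`), the volume ID, and the helper `volume_token_counts` are illustrative assumptions, so check the HTRC documentation for the actual schema before adapting it.

```python
from collections import Counter

# Hypothetical sample shaped like one volume's extracted features.
# The structure and field names are assumptions for illustration only.
sample_volume = {
    "id": "example.0001",  # placeholder, not a real HathiTrust volume ID
    "metadata": {"title": "Example Volume"},
    "features": {
        "pages": [
            {
                "seq": "00000001",
                "body": {"tokenPosCount": {"whale": {"NN": 2}, "the": {"DT": 3}}},
            },
            {
                "seq": "00000002",
                "body": {
                    "tokenPosCount": {
                        "whale": {"NN": 1},
                        "sea": {"NN": 4},
                        "sail": {"VB": 1, "NN": 1},
                    }
                },
            },
        ]
    },
}


def volume_token_counts(volume):
    """Sum page-level token counts into one lowercased Counter for the volume."""
    totals = Counter()
    for page in volume["features"]["pages"]:
        for token, pos_counts in page["body"]["tokenPosCount"].items():
            # Each token maps part-of-speech tags to counts; add them all up.
            totals[token.lower()] += sum(pos_counts.values())
    return totals


counts = volume_token_counts(sample_volume)
print(counts.most_common(3))
```

Because only counts are used, never the running text, this kind of analysis stays within the non-consumptive model described above.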

HTRC Derived Datasets
Information about Extracted Features, including use cases.

3. Data Capsules

The HTRC Data Capsule gives a researcher a secure, virtual computer for non-consumptive analytical access to the full OCR text of works in the HathiTrust Digital Library. Data Capsules are restricted environments, particularly in how and when the products created by analysis tools may leave the capsule. Data products leaving a Data Capsule must undergo results review prior to release. To get started with the Data Capsule, check out this tutorial:

Data Capsule Tutorial
Hands-on instructions that introduce the HTRC Data Capsule tool.

1156 High Street, Santa Cruz, CA 95064

Creative Commons Attribution 3.0 License except where otherwise noted.

Land Acknowledgement

The land on which we gather is the unceded territory of the Awaswas-speaking Uypi Tribe. The Amah Mutsun Tribal Band, comprised of the descendants of indigenous people taken to missions Santa Cruz and San Juan Bautista during Spanish colonization of the Central Coast, is today working hard to restore traditional stewardship practices on these lands and heal from historical trauma.

The land acknowledgement used at UC Santa Cruz was developed in partnership with the Amah Mutsun Tribal Band Chairman and the Amah Mutsun Relearning Program at the UCSC Arboretum.

University Library