The Institutional Data Initiative's pipeline for analyzing, refining, and publishing the Institutional Books 1.0 collection.
- Getting started
- Available utilities
- Custom Exclusion List
- CLI: Common options
- CLI: setup
- CLI: analyze
- CLI: process
- CLI: export
- CLI: publish
- About IDI
- Cite
This pipeline was built and optimized to process a Google Books collection retrieved using GRIN Transfer and stored on cloud storage.
See GRIN Transfer documentation for details.
# Clone project
git clone https://github.com/instdin/institutional-books-1-pipeline.git
# Install dependencies
# NOTE: Will attempt to install system-level dependencies on MacOS and Debian-based systems.
bash install.sh
# Edit environment variables
# We recommend increasing `CACHE_MAX_SIZE_IN_GB` significantly if possible (e.g: 1000 Gb)
nano .env # (or any text editor)
# Run commands
uv run pipeline.py command options
uv run pipeline.py --verbose command options # Run command and include debug logs

- setup: Pipeline setup and corpus I/O (for example: downloading and indexing a local copy of the collection).
- analyze: Analysis of the data present in the collection. Results are stored in the database.
- process: Processing and/or augmentation of data from the collection.
- export: Export of samples and stats.
- publish: Prepares the dataset for publication.
The following code excerpt presents some of the utilities this codebase makes available to work with the collection.
This codebase uses Peewee as an ORM to manage a SQLite database.
from dotenv import load_dotenv
load_dotenv()
import utils
from models import BookIO
from models.book_io import BookTarballData
# `BookIO` is a Peewee model for the "book_io" table.
# See Peewee's documentation for more info on how to work with models:
# https://docs.peewee-orm.com/en/latest/
# Retrieving an individual volume record by barcode
book = BookIO.get(barcode="ABCDEF")
# Google-provided OCR text by page (pulled from remote storage or disk cache)
text: list[str] = book.text_by_page
# Metadata from books_latest.csv (pulled from remote storage or disk cache)
metadata: dict = book.metadata
# Scans, OCR data, text exports and metadata and checksum extracted from barcode.tar.gz (pulled on the fly and cached)
parsed_tarball: BookTarballData = book.parsed_tarball
# Iterating over the collection
for book in BookIO.select().iterator():
    print(book)
# Quick access to the Peewee db connector itself
db = utils.get_db()

All models cross-reference BookIO via a book foreign key.
By default, this pipeline uses HathiTrust's rights determination records to help determine the current rights determination status of each volume in the current collection.
If your collection is not present on HathiTrust, or if you would like to use your own rights determination data, it is possible to provide a custom exclusion list instead.
In .env, set PD_FILTERING_MECHANISM to LIST:
PD_FILTERING_MECHANISM="LIST"
Then create a file named pd-exclusion-list.txt in your data/ folder, with one barcode per line:
ABCD1234
EFGH5678
IJKL9123
The barcodes listed in pd-exclusion-list.txt will be considered non public domain / not permissively licensed.
With that configuration, every feature requiring a rights determination check will use the list provided via pd-exclusion-list.txt to determine whether a given volume should be included or not.
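As an illustration of the mechanism described above, here is a minimal sketch of how such a list could be read and applied. The helper names `load_exclusion_list` and `is_included` are hypothetical and are not taken from this codebase; the pipeline's actual implementation may differ.

```python
import tempfile
from pathlib import Path


def load_exclusion_list(path: Path) -> set[str]:
    """Read one barcode per line, ignoring blank lines and surrounding whitespace."""
    with open(path, encoding="utf-8") as file:
        return {line.strip() for line in file if line.strip()}


def is_included(barcode: str, excluded: set[str]) -> bool:
    """A volume is kept only if its barcode is absent from the exclusion list."""
    return barcode not in excluded


# Demo: a temporary file stands in for data/pd-exclusion-list.txt
with tempfile.TemporaryDirectory() as tmp:
    list_path = Path(tmp, "pd-exclusion-list.txt")
    list_path.write_text("ABCD1234\nEFGH5678\n\nIJKL9123\n", encoding="utf-8")
    excluded = load_exclusion_list(list_path)

print(is_included("ABCD1234", excluded))  # False: listed, so excluded
print(is_included("ZZZZ0000", excluded))  # True: not listed, so included
```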
Every CLI command listed in this README has a --help flag that lists its options.
Here are common options:
| Option name | Description |
|---|---|
| --overwrite | Delete existing entries/files if they already exist |
| --offset and --limit | Allows for running an operation on a subset of BookIO entries. Entries are ordered by barcode. |
| --max-workers | For commands that spin up sub processes, allows for determining how many workers should be created. Generally defaults to the number of available CPU threads. |
| --db-write-batch-size | Allows for determining how many entries should be processed before writing to the database. Matters in a very limited number of contexts. |
⚠️ setup build must be run at least once.
Initializes the pipeline:
- Sets up the local database
- Pulls information about the collection from cloud storage (output of GRIN Transfer)
- Indexes individual volumes as BookIO records
- Caches text from individual volumes on disk
Notes:
- Can be run every time remote storage is updated. Updates existing records.
- Update runs do not delete volumes that may have disappeared from books_latest.csv (unlikely)
uv run pipeline.py setup build
uv run pipeline.py setup build --skip-caching
uv run pipeline.py setup --cache_limit=100000 # Only caches the text from the first 100,000 volumes

Reports on the pipeline's status (database and cache size, etc.).
uv run pipeline.py setup status

Clears local data. Asks for confirmation before deleting each top-level folder/item.
uv run pipeline.py setup clear

Collects genre/form classification data for each book from the collection's metadata.
Notes:
- Extracted from MARC Genres (via book.metadata).
- Skips entries that were already analyzed, unless instructed otherwise.
uv run pipeline.py analyze extract-genre-classification-from-metadata

Collects rights determination data from the HathiTrust API for this collection.
Notes:
- --max-workers defaults to 4.
- Skips entries that were already analyzed, unless instructed otherwise.
uv run pipeline.py analyze extract-hathitrust-rights-determination

Collects book-level language data for each book from the collection's metadata.
Notes:
- Extracted from MARC Language (via book.metadata).
- Original data is in ISO 639-2B format. This command stores it both in that format and as ISO 639-3.
- Skips entries that were already analyzed, unless instructed otherwise.
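The conversion mentioned above can be sketched as a lookup table. ISO 639-2 defines a small set of languages whose bibliographic (B) code differs from the terminologic code, and the terminologic code is the one ISO 639-3 uses for those languages. The function name `iso639_2b_to_3` is hypothetical, and passing every other code through unchanged is a simplifying assumption (it ignores collective and special codes); the pipeline may rely on a dedicated library instead.

```python
# ISO 639-2 bibliographic codes that differ from their ISO 639-3 equivalent
ISO_639_2B_TO_3 = {
    "alb": "sqi", "arm": "hye", "baq": "eus", "bur": "mya", "chi": "zho",
    "cze": "ces", "dut": "nld", "fre": "fra", "geo": "kat", "ger": "deu",
    "gre": "ell", "ice": "isl", "mac": "mkd", "mao": "mri", "may": "msa",
    "per": "fas", "rum": "ron", "slo": "slk", "tib": "bod", "wel": "cym",
}


def iso639_2b_to_3(code: str) -> str:
    """Map an ISO 639-2B code to ISO 639-3, passing other codes through unchanged."""
    normalized = code.strip().lower()
    return ISO_639_2B_TO_3.get(normalized, normalized)


print(iso639_2b_to_3("ger"))  # deu
print(iso639_2b_to_3("eng"))  # eng
```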
uv run pipeline.py analyze extract-main-language-from-metadata

Collects Google-provided OCR quality metrics for each book, as expressed in the collection's metadata.
Notes:
- Extracted from GRIN OCR Analysis Score (via book.metadata).
- Skips entries that were already analyzed, unless instructed otherwise.
uv run pipeline.py analyze extract-ocr-quality-from-metadata

Extracts the page count of each book, both:
- as expressed in the collection's metadata (Page Count via book.metadata)
- from the total of available pages in the OCR'd text
Notes:
- Skips entries that were already analyzed, unless instructed otherwise
uv run pipeline.py analyze extract-page-count

Collects topic/subject classification data for each book from the collection's metadata.
Notes:
- Extracted from MARC Subjects (via book.metadata).
- Skips entries that were already analyzed, unless instructed otherwise.
uv run pipeline.py analyze extract-topic-classification-from-metadata

Collects topic classification items that can be used to train a text classification model. The goal of that model is to assign a top-level category from the Library of Congress Classification Outline to a given book based on its metadata.
Isolates entries where:
- TopicClassification.from_metadata only contains 1 term (no comma).
- Said term can be matched with one of the top-level items from the Library of Congress Classification Outline (see LOC_CO_TO_GXML_TOPICS).
Notes:
- Replaces existing training set if already present.
- Training dataset is split between "train" (most entries), "test" (validation, 5000 entries), "benchmark" (1000 entries).
- See export misc topic-classification-training-dataset to export the results of this command.
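The selection and split described above could be sketched as follows. This is an illustration only: `select_candidates`, `split_dataset`, and the `known_topics` mapping are hypothetical stand-ins (the actual shape of LOC_CO_TO_GXML_TOPICS is not documented here), and the shuffling strategy is an assumption.

```python
import random


def select_candidates(records, known_topics):
    """Keep records whose metadata holds a single term that maps to a top-level topic."""
    selected = []
    for barcode, from_metadata in records:
        term = from_metadata.strip()
        if "," in term or term not in known_topics:
            continue  # More than one term, or no match in the outline
        selected.append((barcode, known_topics[term]))
    return selected


def split_dataset(items, test_size=5000, benchmark_size=1000, seed=0):
    """Shuffle, then set aside fixed-size benchmark and test sets; the rest is train."""
    items = list(items)
    random.Random(seed).shuffle(items)
    return {
        "benchmark": items[:benchmark_size],
        "test": items[benchmark_size:benchmark_size + test_size],
        "train": items[benchmark_size + test_size:],
    }


# Demo with made-up records and a made-up two-entry mapping
known_topics = {"Law": "LAW", "Medicine": "MEDICINE"}
records = [("A", "Law"), ("B", "Law, Medicine"), ("C", "Unmatched"), ("D", "Medicine")]
print(select_candidates(records, known_topics))  # [('A', 'LAW'), ('D', 'MEDICINE')]

splits = split_dataset(range(100), test_size=20, benchmark_size=10)
print({name: len(rows) for name, rows in splits.items()})
```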
uv run pipeline.py analyze extract-topic-classification-training-dataset

Collects, for each entry, the likely year of publication based on existing metadata. This is meant to be used for statistical analysis purposes only.
Notes:
- Extracted from either MARC Date 1 or MARC Date 2 (via book.metadata).
- Entries where MARC Date Type is either Continuing resource or No attempt to code will be skipped.
- Incomplete years will be ignored (e.g. 19uu, 1uuu, 9999 ...).
- Skips entries that were already analyzed, unless instructed otherwise.
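The rules above can be sketched as a small function. The name `likely_year` is hypothetical, and the choice to try MARC Date 1 before MARC Date 2 is an assumption made for this example; the pipeline's actual precedence is not stated here.

```python
import re

SKIPPED_DATE_TYPES = {"Continuing resource", "No attempt to code"}


def likely_year(date_1, date_2, date_type):
    """Return the first complete four-digit year found, or None if there is none."""
    if date_type in SKIPPED_DATE_TYPES:
        return None
    for raw in (date_1, date_2):
        value = (raw or "").strip()
        # Placeholders such as "19uu" or "1uuu" fail the four-digit check
        if re.fullmatch(r"[0-9]{4}", value) and value != "9999":
            return int(value)
    return None


print(likely_year("1875", "", "Single known date/probable date"))  # 1875
print(likely_year("19uu", "1923", "Questionable date"))  # 1923
print(likely_year("1uuu", "9999", "Questionable date"))  # None
```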
uv run pipeline.py analyze extract-year-of-publication-from-metadata

Runs text-level language detection on the OCR'd text of each book, split into chunks.
For each book:
- Collects the distribution and proportion of all identified languages in language_detection.
- Keeps track of token counts identified per language, at book level (o200k_base tokens).
- Keeps track of the "main" detected language in main_language (for comparison with metadata info).
Notes:
- Uses pyfranc.
- By default, texts are split and analyzed in blocks of up to 768 characters.
- Skips entries that were already analyzed, unless instructed otherwise.
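To illustrate the block-based approach described above, here is a minimal sketch. It does not call pyfranc: `toy_detect` is a made-up stand-in for a real detector, and weighting each block equally is an assumption (the pipeline also tracks token counts per language, which this sketch leaves out).

```python
from collections import Counter


def split_into_blocks(text: str, size: int = 768) -> list[str]:
    """Split a text into consecutive blocks of up to `size` characters."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def language_distribution(text, detect, size=768):
    """Proportion of blocks assigned to each language, most prevalent first."""
    counts = Counter(detect(block) for block in split_into_blocks(text, size))
    total = sum(counts.values())
    return {language: count / total for language, count in counts.most_common()}


def toy_detect(block: str) -> str:
    """Made-up detector for this demo only."""
    return "eng" if "the " in block else "fra"


english_block = "the cat " * 96  # Exactly 768 characters
french_block = "le chat " * 96  # Exactly 768 characters
distribution = language_distribution(english_block * 3 + french_block, toy_detect)
main_language = next(iter(distribution))

print(distribution)  # {'eng': 0.75, 'fra': 0.25}
print(main_language)  # eng
```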
uv run pipeline.py analyze run-language-detection

Runs pleias/OCRoscope on the OCR'd text of each book in order to collect a secondary OCR quality metric.
Notes:
- Skips entries that were already analyzed, unless instructed otherwise
uv run pipeline.py analyze run-ocr-quality-detection

Generates a simhash for every OCR'd text in the collection in order to coarsely identify collection-level near duplicates.
Notes:
- Skips entries that were already analyzed, unless instructed otherwise.
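For readers unfamiliar with the technique, here is a minimal sketch of a textbook 64-bit simhash over whitespace-separated tokens, compared with Hamming distance. This is not the pipeline's implementation: the tokenization, the use of MD5 as the per-token hash, and the absence of token weighting are all assumptions made for this example.

```python
import hashlib


def simhash(text: str, bits: int = 64) -> int:
    """Textbook simhash: each token votes +1/-1 on every bit of the fingerprint."""
    weights = [0] * bits
    for token in text.lower().split():
        digest = hashlib.md5(token.encode("utf-8")).digest()
        token_hash = int.from_bytes(digest[: bits // 8], "big")
        for i in range(bits):
            weights[i] += 1 if (token_hash >> i) & 1 else -1
    # A bit is set when the majority of tokens had it set
    return sum(1 << i for i in range(bits) if weights[i] > 0)


def hamming_distance(a: int, b: int) -> int:
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


original = "it was the best of times it was the worst of times " * 20
near_copy = original.replace("worst", "w0rst", 1)  # Simulates an OCR variation
unrelated = "call me ishmael some years ago never mind how long precisely " * 20

print(hamming_distance(simhash(original), simhash(near_copy)))
print(hamming_distance(simhash(original), simhash(unrelated)))
```

Near duplicates are expected to yield a small distance relative to unrelated texts; the threshold to use is a tuning decision that this sketch does not address.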
uv run pipeline.py analyze run-simhash

Runs simple text analysis methods on the OCR'd text of each entry in the collection.
Collects metrics such as:
- character/word/bigram/trigram/sentence counts.
- token-type ratios.
- tokenizability (how "well" a given text tokenizes using o200k_base).
Notes:
- Skips entries that were already analyzed, unless instructed otherwise
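As an illustration of the count and ratio metrics listed above, here is a minimal sketch of word-level and bigram-level type-token ratios, expressed on a 0 to 100 scale. Splitting on whitespace is an assumption made for brevity, and the helper names are hypothetical; the tokenizability metric is not covered here.

```python
def ngrams(tokens: list[str], n: int) -> list[tuple[str, ...]]:
    """All consecutive n-token windows."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def type_token_ratio(items: list) -> float:
    """Unique items over total items, as a percentage (0.0 when empty)."""
    return 100.0 * len(set(items)) / len(items) if items else 0.0


words = "the cat sat on the mat the cat".split()
bigrams = ngrams(words, 2)

print(len(words), len(set(words)))  # 8 5
print(type_token_ratio(words))  # 62.5
print(len(bigrams), len(set(bigrams)))  # 7 6
```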
uv run pipeline.py analyze run-text-analysis

Tokenizes the OCR'd text of each entry and saves the resulting token counts in the database.
Uses the tokenizer of the target LLM specified via --target-llm.
Notes:
- --target-llm can identify both OpenAI and HuggingFace-hosted models. Prefix with openai/ for OpenAI models.
- Skips texts that were already analyzed with this specific tokenizer, unless instructed otherwise.
- A valid HuggingFace token might be needed to access some of the target tokenizers.
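The prefix convention above can be illustrated with a small parsing sketch. The function name `parse_target_llm` and the "openai" / "huggingface" labels are hypothetical; how the pipeline actually dispatches to a tokenizer is not shown here.

```python
OPENAI_PREFIX = "openai/"


def parse_target_llm(target_llm: str) -> tuple[str, str]:
    """Split a --target-llm value into (provider, model name)."""
    if target_llm.startswith(OPENAI_PREFIX):
        return "openai", target_llm[len(OPENAI_PREFIX):]
    # Anything else is treated as a HuggingFace-hosted model identifier
    return "huggingface", target_llm


print(parse_target_llm("openai/gpt-4o"))  # ('openai', 'gpt-4o')
print(parse_target_llm("microsoft/phi-4"))  # ('huggingface', 'microsoft/phi-4')
```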
uv run pipeline.py analyze run-token-count --target-llm="openai/gpt-4o"
uv run pipeline.py analyze run-token-count --target-llm="mistralai/Mixtral-8x22B-Instruct-v0.1"
uv run pipeline.py analyze run-token-count --target-llm="microsoft/phi-4"

Runs a topic classification model on the collection.
Notes:
- The model was trained on the data filtered by extract-topic-classification-training-dataset
- This command updates TopicClassification records
- Uses instdin/institutional-books-topic-classifier-bert by default
Benchmark mode:
- Runs topic classification model on 1000 records set aside for benchmarking purposes.
- Results of the benchmark will be saved as:
/data/output/export/topic-classification-benchmark-{model-name}-{datetime}.csv
uv run pipeline.py analyze run-topic-classification --benchmark-mode # 1000 benchmark entries
uv run pipeline.py analyze run-topic-classification # Actual classification run
uv run pipeline.py analyze run-topic-classification --device # Allows for specifying on which torch device the model should run

This series of commands allows for re-processing the collection's original text export. The goal is to attempt to make it more usable and readable, for humans and machines alike.
This process is divided into three steps (see our technical report for details).
This command uses a text-generation model to label line-level OCR chunks. This data can then be used to train a (coarse) classification model, assigning a type to every chunk.
Notes:
- Pulls --n-samples random pages from books in --languages.
- Uses Ollama as an inference backend and phi4:14b-q8_0: make sure both are available.
- Training set is stored as OCRPostprocessingTrainingDataset records.
- 10% of the pages are set aside to build a test set.
uv run pipeline.py process ocr-postprocessing step01-generate-training-dataset
uv run pipeline.py process ocr-postprocessing step01-generate-training-dataset --n-samples=1000

This command distills a Sentence Transformer model into a static embedding model via Model2Vec. This model is then fine-tuned into a classifier, using the data generated in step 01.
The resulting model allows for detecting the "type" (OCRPostprocessingTrainingDataset.TARGET_TYPE) of each line of OCR'd text.
uv run pipeline.py process ocr-postprocessing step02-train-and-evaluate-model
uv run pipeline.py process ocr-postprocessing step02-train-and-evaluate-model --source-model-name="sentence-transformers/LaBSE"
⚠️ Prototype
This command:
- Uses one of the models trained with step 02 to infer the type of each line in the original OCR export.
- Uses the detected type and heuristics to assemble the lines into more readable text.
- Outputs a single JSON file per book, handled via BookIO.postprocessed_ocr.
Notes:
- Whenever possible, running heads and page numbers are skipped.
- Whenever possible, chunks detected as noise will be skipped (e.g. if they're only 1 character long).
- Only tested on the following languages: eng, deu, fra, ita, spa.
- This implementation is an early prototype and is therefore more effective than efficient.
uv run pipeline.py process ocr-postprocessing step03-process --classifier-name="labse-ocr-postprocessing-2025-05-02-20-06"

Generates a single CSV with statistics from the entire pipeline. Can be used as a "bird's eye view" of the current state of the experiments and overall dataset.
Saved as:
/data/output/export/overview-{datetime}.csv
uv run pipeline.py export stats overview

Exports a CSV sheet to manually evaluate the accuracy of our collection-level items deduplication method.
Randomly picks --n-samples samples.
Saved as:
/data/output/export/deduplication-eval-sheet-{n-samples}-{datetime}.csv
uv run pipeline.py export misc deduplication-eval-sheet --n-samples=1000

Simplified CSV export of the source metadata extracted from Google Books.
Saved as:
/data/output/export/simplified-source-metadata-{pd}-{datetime}.csv
uv run pipeline.py export misc simplified-source-metadata
uv run pipeline.py export misc simplified-source-metadata --include-non-pd

Exports the topic classification training dataset prepared via analyze extract-topic-classification-training-dataset as a series of CSVs.
Current setup: text classification fine-tuning https://huggingface.co/docs/autotrain/en/text_classification
Saved as:
/data/output/export/topic-classification-training-dataset-{set}-{datetime}.csv
uv run pipeline.py export misc topic-classification-training-dataset

Compiles the finalized dataset so it can be published on HuggingFace 🤗.
Notes:
- Output saved locally, in the project's data folder.
- Asks for confirmation before proceeding.
- --include-text allows for switching between the two versions of the dataset.
- Dataset target name is adjusted automatically.
uv run pipeline.py publish hf generate
uv run pipeline.py publish hf generate --include-text # Full dataset text_by_page_xyz fields
uv run pipeline.py publish hf generate --include-non-pd # Includes volumes that were not flagged as public domain or permissively licensed

Uploads the dataset to HuggingFace 🤗. Creates Parquet chunks of specific length and uploads them to the hub.
Notes:
- dataset.push_to_hub() cannot easily be used with this dataset (sharding issues).
- Asks for confirmation before proceeding.
- --include-text allows for switching between the two versions of the dataset.
- Dataset target name is adjusted automatically.
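The chunking step mentioned above can be sketched with a plain generator that groups rows into fixed-length chunks. The name `chunked` is hypothetical, and writing each chunk to Parquet and uploading it is deliberately left out; see the pipeline's own push command for the actual logic.

```python
from typing import Iterable, Iterator


def chunked(rows: Iterable, chunk_size: int) -> Iterator[list]:
    """Yield consecutive lists of up to `chunk_size` rows."""
    chunk = []
    for row in rows:
        chunk.append(row)
        if len(chunk) == chunk_size:
            yield chunk
            chunk = []
    if chunk:  # Last, possibly shorter, chunk
        yield chunk


for index, chunk in enumerate(chunked(range(7), 3)):
    # Each chunk could then be serialized and uploaded as its own file
    print(index, chunk)
```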
uv run pipeline.py publish hf push
uv run pipeline.py publish hf push --include-text # Full dataset text_by_page_xyz fields

Basic integrity check for the datasets that were pushed to Hugging Face 🤗. Compares each remote row with its local counterpart.
Notes:
- --include-text allows for switching between the two versions of the dataset.
- --use-local-copy allows for using the local copy generated with publish hf generate.
- Dataset target name is adjusted automatically.
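A row-by-row comparison of this kind could look like the sketch below, which matches rows on barcode_src (described in the dataset schema as the primary key). The function name `find_mismatches` and the use of plain dicts are assumptions made for this example; the actual command may compare rows differently.

```python
def find_mismatches(local_rows, remote_rows, key="barcode_src"):
    """Return the keys of rows that differ, or that exist on only one side."""
    local_by_key = {row[key]: row for row in local_rows}
    seen = set()
    mismatches = []
    for remote_row in remote_rows:
        identifier = remote_row[key]
        seen.add(identifier)
        if local_by_key.get(identifier) != remote_row:
            mismatches.append(identifier)
    # Local rows that never showed up remotely
    mismatches.extend(k for k in local_by_key if k not in seen)
    return mismatches


local = [
    {"barcode_src": "A", "title_src": "x"},
    {"barcode_src": "B", "title_src": "y"},
    {"barcode_src": "C", "title_src": "z"},
]
remote = [
    {"barcode_src": "A", "title_src": "x"},
    {"barcode_src": "B", "title_src": "changed"},
]
print(find_mismatches(local, remote))  # ['B', 'C']
```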
uv run pipeline.py publish hf check-integrity
uv run pipeline.py publish hf check-integrity --include-text # Full dataset text_by_page_xyz fields

| Suffix | Description |
|---|---|
| _src | "From source". This field's data comes from information we gathered from the collection itself. |
| _gen | "Generated". This field's data was generated as part of our analysis / post-processing. |
| _ext | "External". This field's data was pulled from an external source via a records matching mechanism. |
| Field name | Type | Description | Section in technical report |
|---|---|---|---|
| barcode_src | String | The volume's barcode. Serves as a primary key/identifier. | 3 |
| title_src | String | Merge of all the title-related bibliographic metadata available for this volume. | 3 |
| author_src | String | Merge of all the author name-related bibliographic metadata available for this volume. | 3 |
| date1_src | String | First available date for that volume. Described in date_types_src. May contain placeholder characters. See MARC 21 specification for details. | 4.3 |
| date2_src | String | Second available date for that volume. | 4.3 |
| date_types_src | String | Describes the nature of date1_src and date2_src. See MARC 21 specification for details. | 4.3 |
| page_count_src | Int | Page count for that volume. | 4.2 |
| token_count_o200k_base_gen | Int | Total tokens for that volume's OCR-extracted text, as measured with o200k_base. | 4.2 |
| language_src | String | ISO 639-3 code for the main language of this book, as expressed in the collection's bibliographic metadata. Converted from original ISO 639-2B for convenience. | 4.4 |
| language_gen | String | ISO 639-3 code for the main language of this book, as detected by our text-level language analysis of the OCR-extracted text. | 4.4 |
| language_distribution_gen | Dict | Distribution of the languages detected by our text-level language analysis. Only languages for which more than 1000 o200k_base tokens were detected in total were kept. | 4.4 |
| topic_or_subject_src | String | Topic or subject information, as expressed in the collection's bibliographic metadata. Only available for (approximately) half of the collection. | 4.5 |
| topic_or_subject_gen | String | High-level "topic" assigned to this volume by our topic classification model. Inferred from existing metadata. One of the Library of Congress Classification Outline first-level items. | 4.5 |
| topic_or_subject_score_gen | Float | Confidence score returned by our topic classification model for this specific prediction. | 4.5 |
| genre_or_form_src | String | Genre or form information, as expressed in the collection's bibliographic metadata. Only available for (approximately) 10% of the collection. | 4.5 |
| general_note_src | String | Additional notes about this specific volume in the collection's bibliographic metadata. | 3 |
| ocr_score_src | Int (0-100) | Primary OCR quality score, as expressed in the collection's metadata. | 4.7 |
| ocr_score_gen | Int (0-100) | Secondary OCR quality score, generated by using pleias/OCRoscope on the collection's OCR-extracted text. | 4.7 |
| likely_duplicates_barcodes_gen | List | List of barcodes for which the OCR-extracted text is highly similar to this volume's. | 4.6 |
| text_analysis_gen | Dict | High-level text analysis of the OCR-extracted text, both original and post-processed. | 4.8 |
| identifiers_src | Dict | List of bibliographic identifiers, as expressed in the collection's metadata. | 3 |
| hathitrust_data_ext | Dict | Rights determination data pulled from the HathiTrust API for this volume. | 5 |
| text_by_page_src | List[String] | Original OCR-extracted text for this volume. | 4.2 |
| text_by_page_gen | List[String] | Post-processed OCR-extracted text for this volume. Available for books in the following languages: eng, deu, fra, ita, spa (~850K books). | 4.9 |
| Field name | Type | Description |
|---|---|---|
| languages | List[String] | List of ISO 639-3 codes. Sorted by prevalence. |
| proportion | List[Float] | List of percentages. Sorted by prevalence. |
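Since the two lists are parallel and sorted the same way, they can be paired up when reading a record. The sketch below uses a made-up record for illustration; the helper name `distribution_as_pairs` is hypothetical and the sample values are not taken from the dataset.

```python
def distribution_as_pairs(record: dict) -> list[tuple[str, float]]:
    """Pair each language code with its proportion, keeping prevalence order."""
    distribution = record.get("language_distribution_gen") or {}
    languages = distribution.get("languages", [])
    proportions = distribution.get("proportion", [])
    return list(zip(languages, proportions))


# Made-up record shaped like the table above
record = {
    "language_distribution_gen": {
        "languages": ["eng", "lat", "fra"],
        "proportion": [90.5, 7.0, 2.5],
    }
}

print(distribution_as_pairs(record))  # [('eng', 90.5), ('lat', 7.0), ('fra', 2.5)]
```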
| Field name | Type | Description | Section in technical report |
|---|---|---|---|
| text_by_page_src | Dict | Text analysis data for the original OCR-extracted text. | 4.8 |
| text_by_page_gen | Dict | Text analysis data for the post-processed OCR-extracted text. | 4.9 |
Both dicts are shaped as follows when available:
| Field name | Type | Description |
|---|---|---|
| tokenizability_score | Float (0.0-100.0) | Measure of how close to 1.25 o200k_base tokens per word this text is. |
| char_count | Int | Total characters. |
| word_count | Int | Total detected words (language-aware tokenization). |
| word_count_unique | Int | Total unique detected words. |
| word_type_token_ratio | Float (0.0-100.0) | Lexical diversity at word level. May help identify the underlying document type. |
| bigram_count | Int | Total bigrams. |
| bigram_count_unique | Int | Total unique bigrams. |
| bigram_type_token_ratio | Float (0.0-100.0) | Lexical diversity at bigram level. May help identify the underlying document type. |
| trigram_count | Int | Total trigrams. |
| trigram_count_unique | Int | Total unique trigrams. |
| trigram_type_token_ratio | Float (0.0-100.0) | Lexical diversity at trigram level. May help identify the underlying document type. |
| sentence_count | Int | Total detected sentences. |
| sentence_count_unique | Int | Total unique detected sentences. |
| Field name | Type | Description |
|---|---|---|
| lccn | List[String] | List of Library of Congress Control Numbers, if available. |
| isbn | List[String] | List of International Standard Book Numbers, if available. |
| ocolc | List[String] | List of OCLC Control Numbers, if available. |
| Field name | Type | Description |
|---|---|---|
| url | String | Permalink to that volume on HathiTrust. |
| rights_code | String | HathiTrust's rights determination code. |
| reason_code | String | HathiTrust's rights determination reason code. |
| last_check | String | Date at which that information was pulled from the HathiTrust API. |
While this pipeline's publish feature is optimized for the HuggingFace Hub, it is possible to use it to upload to other platforms.
To do so, start by:
- Populating HF_DATASET_NAME_METADATA and HF_DATASET_NAME_FULL with a dataset name of your choice
- Running publish hf generate
This process will generate and store an Arrow version of the dataset on disk. It then becomes possible to access and process it as desired (conversion to parquet or JSONL, chunking, export to cloud storage, etc ...).
# This can be a file at the root of the project, or a new command.
import os
from pathlib import Path
from dotenv import load_dotenv
load_dotenv()
from datasets import load_from_disk
import utils
from const import HF_DATASET_DIR_PATH
dataset_name = os.getenv("HF_DATASET_NAME_FULL") # Or HF_DATASET_NAME_METADATA
dataset_path = Path(HF_DATASET_DIR_PATH, dataset_name)
dataset = load_from_disk(dataset_path)
for record in dataset:
    # Then: Chunking, conversion, upload ...
    # See example: commands/publish.hf/push.py
    # HuggingFace Dataset docs: https://huggingface.co/docs/datasets/en/index
    pass # Placeholder: replace with your own processing

The Institutional Data Initiative at Harvard Law School Library works with knowledge institutions, from libraries and museums to cultural groups and government agencies, to refine and publish their collections as data. Reach out to collaborate on your collections.
@misc{cargnelutti2025institutionalbooks10242b,
title={Institutional Books 1.0: A 242B token dataset from Harvard Library's collections, refined for accuracy and usability},
author={Matteo Cargnelutti and Catherine Brobston and John Hess and Jack Cushman and Kristi Mukk and Aristana Scourtas and Kyle Courtney and Greg Leppert and Amanda Watson and Martha Whitehead and Jonathan Zittrain},
year={2025},
eprint={2506.08300},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2506.08300},
}