From 47cea374876cd4f100a2a015dbddcf9fe44052ca Mon Sep 17 00:00:00 2001 From: "James R. Barlow" Date: Sat, 20 Dec 2025 15:42:44 -0800 Subject: [PATCH] docs: add CLAUDE.md for Claude Code guidance MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Provides architecture overview, common commands, and testing info for AI-assisted development. 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude Opus 4.5 --- CLAUDE.md | 87 +++++++++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 87 insertions(+) create mode 100644 CLAUDE.md diff --git a/CLAUDE.md b/CLAUDE.md new file mode 100644 index 00000000..fdfcad3f --- /dev/null +++ b/CLAUDE.md @@ -0,0 +1,87 @@ +# CLAUDE.md + +This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository. + +## Project Overview + +OCRmyPDF adds an OCR text layer to scanned PDF files, making them searchable. It uses Tesseract OCR and Ghostscript as external dependencies. + +## Common Commands + +```bash +# Run all tests (uses pytest-xdist for parallel execution) +pytest + +# Run a single test file +pytest tests/test_main.py + +# Run a specific test +pytest tests/test_main.py::test_function_name + +# Run tests with coverage +pytest --cov=src/ocrmypdf --cov-report=html + +# Run slow tests (disabled by default) +pytest --runslow + +# Lint and format +ruff check src/ +ruff format src/ + +# Type checking +mypy src/ocrmypdf +``` + +## Architecture + +### Entry Points +- **CLI**: `src/ocrmypdf/__main__.py` → `src/ocrmypdf/cli.py` parses arguments +- **Python API**: `src/ocrmypdf/api.py` provides `ocr()` function for programmatic use + +### Core Pipeline +The OCR pipeline is in `src/ocrmypdf/_pipeline.py` and `src/ocrmypdf/_pipelines/`. Processing flow: +1. Input validation and triage (PDF vs image) +2. PDF info extraction (`src/ocrmypdf/pdfinfo/`) +3. Page-by-page OCR processing (parallelized) +4. PDF/A generation and optimization + +### Options Model +`src/ocrmypdf/_options.py` contains `OCROptions`, a Pydantic model that validates all CLI and API options. Options validation happens in `src/ocrmypdf/_validation.py` with cross-cutting validation in `src/ocrmypdf/_validation_coordinator.py`. + +### Plugin System +OCRmyPDF uses `pluggy` for extensibility. Key files: +- `src/ocrmypdf/pluginspec.py`: Defines all hook specifications +- `src/ocrmypdf/builtin_plugins/`: Default implementations + - `tesseract_ocr.py`: Tesseract OCR engine + - `ghostscript.py`: PDF rasterization and PDF/A generation + - `optimize.py`: PDF optimization + +Plugins can replace the OCR engine, add CLI arguments, or modify image processing. + +### External Tool Wrappers +`src/ocrmypdf/_exec/` contains wrappers for external tools: +- `ghostscript.py`: PDF rasterization, PDF/A conversion +- `tesseract.py`: OCR engine interface +- `unpaper.py`: Image preprocessing (deskew, clean) +- `jbig2enc.py`, `pngquant.py`: Image optimization + +### Job Context +- `PdfContext`: Document-level context passed through pipeline +- `PageContext`: Per-page context for parallel processing + +## Testing + +Tests are in `tests/` with fixtures defined in `tests/conftest.py`. Key fixtures: +- `resources`: Path to test PDF/image files in `tests/resources/` +- `outpdf`: Temporary output PDF path +- `check_ocrmypdf()`: Run OCR and assert valid output +- `run_ocrmypdf_api()`: Run via API, returns ExitCode +- `run_ocrmypdf()`: Run as subprocess + +## External Dependencies + +Requires system packages: Tesseract OCR, Ghostscript. Optional: unpaper, jbig2enc, pngquant. + +## License + +MPL-2.0 for core code. Tests and docs use CC-BY-SA-4.0.