Allow passing an OcrOptions object directly to ocr() as the first
positional argument, providing a cleaner API for programmatic use.
The old-style API with individual parameters remains fully supported.
Introduce a new --mode (-m) argument that consolidates the three
mutually exclusive OCR processing options into a single enum:
- default: Exit with an error if a page already contains text (standard behavior)
- force: Rasterize all content and run OCR (replaces --force-ocr)
- skip: Skip pages with existing text (replaces --skip-text)
- redo: Re-OCR pages, stripping old text layer (replaces --redo-ocr)
The legacy flags --force-ocr, --skip-text, and --redo-ocr remain as
silent aliases for backward compatibility. Both CLI and API usage
continue to work unchanged.
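A minimal sketch of how the legacy flags could map onto a single mode value. The names `OcrMode`, `LEGACY_FLAGS`, `build_parser`, and `resolve_mode` are hypothetical and are not taken from the OCRmyPDF source; only the flag names and the four mode values come from the description above.

```python
import argparse
from enum import Enum


class OcrMode(str, Enum):  # hypothetical name, for illustration only
    DEFAULT = "default"
    FORCE = "force"
    SKIP = "skip"
    REDO = "redo"


# Legacy flag destination -> equivalent mode (assumed mapping, per the list above)
LEGACY_FLAGS = {
    "force_ocr": OcrMode.FORCE,
    "skip_text": OcrMode.SKIP,
    "redo_ocr": OcrMode.REDO,
}


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="ocr-sketch")
    parser.add_argument(
        "-m", "--mode", choices=[m.value for m in OcrMode], default=None
    )
    # Legacy flags stay accepted but are hidden from --help ("silent aliases").
    for dest in LEGACY_FLAGS:
        flag = "--" + dest.replace("_", "-")
        parser.add_argument(flag, action="store_true", help=argparse.SUPPRESS)
    return parser


def resolve_mode(args: argparse.Namespace) -> OcrMode:
    """Collapse --mode and the legacy flags into one OcrMode."""
    requested = {mode for dest, mode in LEGACY_FLAGS.items() if getattr(args, dest)}
    if args.mode is not None:
        requested.add(OcrMode(args.mode))
    if len(requested) > 1:
        names = ", ".join(sorted(m.value for m in requested))
        raise ValueError(f"conflicting OCR modes: {names}")
    return requested.pop() if requested else OcrMode.DEFAULT
```

With this sketch, `--skip-text` and `-m skip` resolve to the same value, and combining two different modes raises an error.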
Replace inheritance from pluggy.PluginManager with a composition pattern,
providing a type-safe interface for all 16 hooks defined in pluginspec.py.
The underlying pluggy manager is now accessible via the .pluggy property
for advanced use cases like set_blocked().
This change enables IDE autocomplete and type checking for all hook calls
while maintaining full backward compatibility with the plugin system.
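The composition approach can be illustrated roughly as follows. This is a sketch under stated assumptions: `StandInManager` is a hypothetical stand-in so the example runs without pluggy installed, and `TypedHooks` and the `describe_page` hook are invented for illustration; they are not the project's actual class or hook names.

```python
from types import SimpleNamespace


class StandInManager:
    """Hypothetical stand-in for a pluggy-style manager (not pluggy itself)."""

    def __init__(self) -> None:
        self.blocked: set[str] = set()
        # A pluggy-style "hook" namespace with one made-up hook.
        self.hook = SimpleNamespace(
            describe_page=lambda page_number: [f"page {page_number}"]
        )

    def set_blocked(self, name: str) -> None:
        self.blocked.add(name)


class TypedHooks:
    """Wraps a manager instead of inheriting from it (composition)."""

    def __init__(self, manager: StandInManager) -> None:
        self._manager = manager

    @property
    def pluggy(self) -> StandInManager:
        # Escape hatch for advanced calls such as set_blocked().
        return self._manager

    def describe_page(self, page_number: int) -> list[str]:
        # An explicit signature gives IDEs and type checkers something to verify.
        return self._manager.hook.describe_page(page_number=page_number)


hooks = TypedHooks(StandInManager())
hooks.pluggy.set_blocked("some_plugin")
```

Because the wrapper does not inherit from the manager, manager methods such as `set_blocked()` are reachable only through `.pluggy`, which keeps the typed surface limited to the declared hooks.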
Lossy JBIG2 has been removed due to well-documented risks of character
substitution errors (e.g., 6/8 confusion). The --jbig2-lossy and
--jbig2-page-group-size arguments are now deprecated and ignored with
a warning.
Changes:
- Remove jbig2_lossy and jbig2_page_group_size from OCROptions
- Simplify optimize.py to use single-image JBIG2 encoding only
  (no symbol dictionaries/JBIG2Globals)
- Remove convert_group() from jbig2enc.py
- Deprecate CLI args with warnings for backward compatibility
- Update documentation to explain lossless-only JBIG2
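One way to keep retired arguments accepted while ignoring them is a custom argparse action that only emits a warning. This is a sketch, not the project's implementation: the `WarnAndIgnore` class and `build_parser` function are hypothetical, and the real code may log the warning differently.

```python
import argparse
import warnings


class WarnAndIgnore(argparse.Action):
    """Accept a deprecated option, warn, and store nothing."""

    def __call__(self, parser, namespace, values, option_string=None):
        warnings.warn(
            f"{option_string} is deprecated and ignored", UserWarning, stacklevel=2
        )


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="ocr-sketch")
    # default=argparse.SUPPRESS keeps the ignored options out of the namespace.
    parser.add_argument(
        "--jbig2-lossy", action=WarnAndIgnore, nargs=0, default=argparse.SUPPRESS
    )
    parser.add_argument(
        "--jbig2-page-group-size",
        action=WarnAndIgnore,
        type=int,
        default=argparse.SUPPRESS,
    )
    return parser
```

Existing command lines that pass either option keep parsing, but the parsed namespace carries no trace of them.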
Add user control over which rasterizer is used for PDF page rendering:
- 'auto' (default): prefers pypdfium when available, falls back to Ghostscript
- 'pypdfium': force pypdfium2 (errors if not installed)
- 'ghostscript': force traditional Ghostscript rasterizer
Changes:
- Add rasterizer field with validation to OCROptions model
- Add --rasterizer CLI argument in the Advanced options group
- Update rasterize_pdf_page hookspec to pass options to plugins
- Update pypdfium plugin with check_options hook for availability check
- Update both plugins to respect the rasterizer option
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
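The three rasterizer choices described above can be sketched as a small selection function. The function names here are hypothetical, and the availability check via `importlib.util.find_spec` is an assumption about how one might detect pypdfium2, not a description of the plugin's actual `check_options` hook.

```python
import importlib.util


def pypdfium_installed() -> bool:
    # Assumed detection strategy: look for the pypdfium2 package.
    return importlib.util.find_spec("pypdfium2") is not None


def choose_rasterizer(requested: str, pypdfium_available: bool) -> str:
    """Return the rasterizer to use for a given --rasterizer value."""
    if requested == "auto":
        # Prefer pypdfium when available, otherwise fall back to Ghostscript.
        return "pypdfium" if pypdfium_available else "ghostscript"
    if requested == "pypdfium":
        if not pypdfium_available:
            raise RuntimeError(
                "rasterizer 'pypdfium' was requested but pypdfium2 is not installed"
            )
        return "pypdfium"
    if requested == "ghostscript":
        return "ghostscript"
    raise ValueError(f"unknown rasterizer: {requested!r}")
```

A caller would typically invoke it as `choose_rasterizer(options.rasterizer, pypdfium_installed())`, where `options.rasterizer` is the new field.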
Completes Phase 5 of the CLI refactoring plan by enabling nested
plugin option access (e.g., options.tesseract.timeout) alongside
the legacy flat access (options.tesseract_timeout).
Changes:
- Add module-level plugin option model registry in _options.py
- Add __getattr__ to OCROptions for dynamic namespace access
- Register plugin models in setup_plugin_infrastructure()
- Add test for nested plugin option access
Plugin option instances are lazily created from flat field values
and cached for subsequent access.
🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
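The lazy, cached namespace access described above might look roughly like this. It is a simplified sketch: `SketchOptions`, `TesseractSketch`, and `register_plugin_model` are hypothetical names, plain dataclasses stand in for the real option models, and the real `OCROptions.__getattr__` is likely more involved.

```python
from dataclasses import dataclass
from typing import Any

# Module-level registry: namespace name -> plugin option model class.
_plugin_models: dict[str, type] = {}


def register_plugin_model(namespace: str, model: type) -> None:
    _plugin_models[namespace] = model


@dataclass
class TesseractSketch:  # hypothetical plugin option model
    timeout: float = 0.0


class SketchOptions:
    def __init__(self, **flat: Any) -> None:
        self._flat = dict(flat)  # legacy flat values, e.g. tesseract_timeout
        self._cache: dict[str, Any] = {}

    def __getattr__(self, name: str) -> Any:
        # Only invoked when normal attribute lookup fails.
        if name.startswith("_"):
            raise AttributeError(name)
        flat = self.__dict__["_flat"]
        if name in flat:
            return flat[name]  # legacy flat access
        model = _plugin_models.get(name)
        if model is None:
            raise AttributeError(name)
        cache = self.__dict__["_cache"]
        if name not in cache:
            # Build the nested instance lazily from the matching flat fields.
            prefix = name + "_"
            kwargs = {
                key[len(prefix):]: value
                for key, value in flat.items()
                if key.startswith(prefix)
            }
            cache[name] = model(**kwargs)
        return cache[name]


register_plugin_model("tesseract", TesseractSketch)
opts = SketchOptions(tesseract_timeout=30.0)
```

Both `opts.tesseract_timeout` and `opts.tesseract.timeout` then read the same value, and repeated access to `opts.tesseract` returns the cached instance.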
Fix output folder handling in the hOCR pipeline by making it a proper field in OCROptions and updating the API functions accordingly.
Co-authored-by: aider (openrouter/anthropic/claude-sonnet-4) <aider@aider.chat>
This commit updates `_pdf_to_hocr` and `_hocr_to_ocr_pdf` to use direct OCROptions construction, eliminating the last vestiges of CLI-parser dependency in the experimental APIs.
Key changes:
- Removed `parser = get_parser()` calls
- Added plugin validation similar to main `ocr()` function
- Simplified plugin manager hook calls
- Added None value filtering to use OCROptions defaults
- Maintained error handling and extra_attrs logic
The refactoring makes these experimental APIs truly API-first and simplifies the code by removing unnecessary CLI-related complexity.
Co-authored-by: aider (openrouter/anthropic/claude-sonnet-4) <aider@aider.chat>
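The "None value filtering" step mentioned above can be sketched as follows. `SketchOptions` and `build_options` are hypothetical stand-ins (a plain dataclass instead of the real `OCROptions`), and the field names and defaults are illustrative only.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class SketchOptions:  # hypothetical stand-in for OCROptions
    input_file: str
    output_folder: str
    jobs: int = 1
    language: str = "eng"


def build_options(**kwargs: Any) -> SketchOptions:
    # Drop arguments the caller left as None so the model's defaults apply.
    supplied = {key: value for key, value in kwargs.items() if value is not None}
    return SketchOptions(**supplied)
```

Passing `jobs=None` then behaves the same as omitting `jobs`, which lets an API function forward its optional parameters without overriding defaults.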
This commit introduces a new `OCROptions` class in `_options.py` that provides:
- Proper typing for OCRmyPDF options
- Pydantic validation
- Backward compatibility with `argparse.Namespace`
- Gradual migration support for the options system
Key changes:
- Added comprehensive option fields with type hints
- Implemented custom attribute access methods
- Created conversion methods between Namespace and OCROptions
- Updated type hints in multiple files to support both types
- Maintained existing validation logic
The new model allows for a step-by-step refactoring of the options handling throughout the project.
Co-authored-by: aider (openrouter/anthropic/claude-sonnet-4) <aider@aider.chat>
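A rough sketch of converting between `argparse.Namespace` and a typed options object. To stay self-contained it uses a standard-library dataclass with a `__post_init__` check in place of Pydantic validation; `SketchOptions`, its fields, and the method names `from_namespace` and `to_namespace` are assumptions for illustration, not the actual `OCROptions` API.

```python
from argparse import Namespace
from dataclasses import asdict, dataclass, fields


@dataclass
class SketchOptions:  # hypothetical stand-in for the Pydantic model
    input_file: str
    jobs: int = 1
    deskew: bool = False

    def __post_init__(self) -> None:
        # Stand-in for model validation.
        if self.jobs < 1:
            raise ValueError("jobs must be at least 1")

    @classmethod
    def from_namespace(cls, ns: Namespace) -> "SketchOptions":
        # Keep only the attributes the model knows about.
        known = {f.name for f in fields(cls)}
        return cls(**{k: v for k, v in vars(ns).items() if k in known})

    def to_namespace(self) -> Namespace:
        return Namespace(**asdict(self))
```

Code that still expects a `Namespace` can call `to_namespace()`, while newer code works with the typed object, which is the kind of gradual migration the commit describes.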