Introduce --ocr-engine option to select between OCR engines:
- 'auto' (default): Uses Tesseract
- 'tesseract': Explicit Tesseract selection
- 'none': Skip OCR entirely (for PDF processing only)
Key changes:
- Extend OcrEngine ABC with generate_ocr() and supports_generate_ocr()
for direct OcrElement tree output (bypasses hOCR)
- Add get_ocr_engine(options) hook parameter for engine selection
- Implement NullOcrEngine for --ocr-engine none
- Export OcrElement, OcrClass, BoundingBox from ocrmypdf package
- Add ocr_tree support to grafting pipeline
This prepares the foundation for pluggable OCR engines while maintaining
full backward compatibility with existing Tesseract-based workflows.
Lossy JBIG2 has been removed due to well-documented risks of character
substitution errors (e.g., 6/8 confusion). The --jbig2-lossy and
--jbig2-page-group-size arguments are now deprecated and ignored with
a warning.
Changes:
- Remove jbig2_lossy and jbig2_page_group_size from OCROptions
- Simplify optimize.py to use single-image JBIG2 encoding only
(no symbol dictionaries/JBIG2Globals)
- Remove convert_group() from jbig2enc.py
- Deprecate CLI args with warnings for backward compatibility
- Update documentation to explain lossless-only JBIG2
Completes Phase 5 of the CLI refactoring plan by enabling nested
plugin option access (e.g., options.tesseract.timeout) alongside
the legacy flat access (options.tesseract_timeout).
Changes:
- Add module-level plugin option model registry in _options.py
- Add __getattr__ to OCROptions for dynamic namespace access
- Register plugin models in setup_plugin_infrastructure()
- Add test for nested plugin option access
Plugin option instances are lazily created from flat field values
and cached for subsequent access.
🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>