2 Commits

Author SHA1 Message Date
James R. Barlow
740f67091c Rename OCROptions to OcrOptions for consistency
Technically OCROptions is more Pythonic but we have several pre-existing classes named OcrWhatever. Go with the local flow.
2026-01-12 23:37:54 -08:00
James R. Barlow
0c3745a1a4 Add OCR engine selection framework and null OCR engine
Introduce --ocr-engine option to select between OCR engines:
- 'auto' (default): Uses Tesseract
- 'tesseract': Explicit Tesseract selection
- 'none': Skip OCR entirely (for PDF processing only)

Key changes:
- Extend OcrEngine ABC with generate_ocr() and supports_generate_ocr()
  for direct OcrElement tree output (bypasses hOCR)
- Add get_ocr_engine(options) hook parameter for engine selection
- Implement NullOcrEngine for --ocr-engine none
- Export OcrElement, OcrClass, BoundingBox from ocrmypdf package
- Add ocr_tree support to grafting pipeline

This prepares the foundation for pluggable OCR engines while maintaining
full backward compatibility with existing Tesseract-based workflows.
2026-01-12 10:11:14 -08:00