Commit Graph

19 Commits

Author SHA1 Message Date
James R. Barlow
740f67091c Rename OCROptions to OcrOptions for consistency
Technically OCROptions is more Pythonic but we have several pre-existing classes named OcrWhatever. Go with the local flow.
2026-01-12 23:37:54 -08:00
James R. Barlow
0c3745a1a4 Add OCR engine selection framework and null OCR engine
Introduce --ocr-engine option to select between OCR engines:
- 'auto' (default): Uses Tesseract
- 'tesseract': Explicit Tesseract selection
- 'none': Skip OCR entirely (for PDF processing only)

Key changes:
- Extend OcrEngine ABC with generate_ocr() and supports_generate_ocr()
  for direct OcrElement tree output (bypasses hOCR)
- Add get_ocr_engine(options) hook parameter for engine selection
- Implement NullOcrEngine for --ocr-engine none
- Export OcrElement, OcrClass, BoundingBox from ocrmypdf package
- Add ocr_tree support to grafting pipeline

This prepares the foundation for pluggable OCR engines while maintaining
full backward compatibility with existing Tesseract-based workflows.
2026-01-12 10:11:14 -08:00
James R. Barlow
16c2604a07 Remove lossy JBIG2 support, retain lossless JBIG2 only
Lossy JBIG2 has been removed due to well-documented risks of character
substitution errors (e.g., 6/8 confusion). The --jbig2-lossy and
--jbig2-page-group-size arguments are now deprecated and ignored with
a warning.

Changes:
- Remove jbig2_lossy and jbig2_page_group_size from OCROptions
- Simplify optimize.py to use single-image JBIG2 encoding only
  (no symbol dictionaries/JBIG2Globals)
- Remove convert_group() from jbig2enc.py
- Deprecate CLI args with warnings for backward compatibility
- Update documentation to explain lossless-only JBIG2
2025-12-23 02:45:07 -08:00
James R. Barlow
0ad7f5fc13 feat: add dynamic nested access to plugin options
Completes Phase 5 of the CLI refactoring plan by enabling nested
plugin option access (e.g., options.tesseract.timeout) alongside
the legacy flat access (options.tesseract_timeout).

Changes:
- Add module-level plugin option model registry in _options.py
- Add __getattr__ to OCROptions for dynamic namespace access
- Register plugin models in setup_plugin_infrastructure()
- Add test for nested plugin option access

Plugin option instances are lazily created from flat field values
and cached for subsequent access.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-21 12:21:48 -08:00
James R. Barlow
dbd3c93757 Fix issue with unpickling HOCRResult
Fixes [Bug]: HOCRResult.from_json() not unpickling correctly #1427
2024-11-10 02:05:57 -08:00
James R. Barlow
7a8cc21e31 Add support for sidecar output to io.BytesIO
Closes #1252
2024-04-07 01:38:55 -07:00
James R. Barlow
71166f7be8 Make hocr API experimental for now
This commit can be reverted when we are ready to release a new version.
2023-10-30 00:07:10 -07:00
James R. Barlow
fbf0674189 hocr_to_ocr_pdf: handle missing hocr json file 2023-10-24 00:54:31 -07:00
James R. Barlow
23951c9e38 Working HOCR folder to PDF converter 2023-10-24 00:54:30 -07:00
James R. Barlow
68bb38d0ad pdf_to_hocr: improve plugin handling 2023-10-24 00:52:31 -07:00
James R. Barlow
0443e87345 Introduce pdf_to_hocr API 2023-10-24 00:52:31 -07:00
James R. Barlow
de2bb5ce8c Remove tqdm dependency and TqdmConsole
Might be too aggressive? No deprecation warning....
2023-09-21 00:05:39 -07:00
James R. Barlow
80ed2117cc Change to SPDX license tracking 2022-07-28 01:10:07 -07:00
James R. Barlow
dc6f1a266a Modernize type annotations 2022-07-23 00:39:24 -07:00
James R. Barlow
aa0ec40102 Change license of all GPLv3 files to MPL-2.0
https://github.com/jbarlow83/OCRmyPDF/issues/600
2020-08-05 00:44:42 -07:00
James R. Barlow
f4cb424451 Support input/output streams at API level 2020-06-22 02:02:18 -07:00
James R. Barlow
58abb5785c pytest picky about list vs tuple 2020-04-15 03:16:51 -07:00
James R. Barlow
4a640b8dcd Fix language argument not working as list
Fixes #523
2020-04-14 23:18:52 -07:00
James R. Barlow
c36e9950ae tests: test TqdmConsole 2019-12-30 17:51:09 -08:00