Introduce a new --mode (-m) argument that consolidates the three
mutually exclusive OCR processing options into a single enum:
- default: Error if a page already contains text (standard behavior)
- force: Rasterize all content and run OCR (replaces --force-ocr)
- skip: Skip pages with existing text (replaces --skip-text)
- redo: Re-OCR pages, stripping old text layer (replaces --redo-ocr)
The legacy flags --force-ocr, --skip-text, and --redo-ocr remain as
silent aliases for backward compatibility. Both CLI and API usage
continue to work unchanged.
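
The alias behavior described above can be sketched with argparse as
follows. This is a minimal illustration, not the project's actual
parser: the names ProcessingMode, build_parser, and parse_mode are
hypothetical, and only the flag names and mode values come from the
text above.

```python
import argparse
from enum import Enum


class ProcessingMode(str, Enum):
    """Hypothetical enum name; the four values mirror the list above."""

    default = "default"
    force = "force"
    skip = "skip"
    redo = "redo"


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="ocr-mode-sketch")
    parser.add_argument(
        "-m",
        "--mode",
        choices=[m.value for m in ProcessingMode],
        default=ProcessingMode.default.value,
    )
    # Each legacy flag stores a constant into the same destination as
    # --mode, so it behaves as an alias. SUPPRESS hides it from --help.
    for flag, mode in (
        ("--force-ocr", ProcessingMode.force),
        ("--skip-text", ProcessingMode.skip),
        ("--redo-ocr", ProcessingMode.redo),
    ):
        parser.add_argument(
            flag,
            dest="mode",
            action="store_const",
            const=mode.value,
            help=argparse.SUPPRESS,
        )
    return parser


def parse_mode(argv: list[str]) -> ProcessingMode:
    """Parse argv and return the selected mode as an enum member."""
    return ProcessingMode(build_parser().parse_args(argv).mode)
```

In this sketch the last flag on the command line wins; a real
implementation may prefer to reject conflicting combinations.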
Replace pm.hook.method() calls with pm.method() calls to match the
refactored plugin manager that now uses composition over inheritance.
The hook attribute is no longer directly exposed; instead, type-safe
methods are provided directly on the plugin manager class.
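
As a rough sketch of the composition approach, assuming a wrapper
class that keeps the hook relay private: the class names below are
hypothetical stand-ins, and get_ocr_engine is used only as an
illustrative hook name.

```python
class _HookRelay:
    """Stand-in for the underlying hook caller (hypothetical)."""

    def get_ocr_engine(self) -> str:
        return "engine-from-plugin"


class _LegacyManager:
    """Old style: callers reach through the public .hook attribute."""

    def __init__(self) -> None:
        self.hook = _HookRelay()


class PluginManagerSketch:
    """New style: wraps a manager instead of inheriting from it."""

    def __init__(self) -> None:
        # Composition: the wrapped manager is private, so .hook is
        # not part of this class's public surface.
        self._inner = _LegacyManager()

    def get_ocr_engine(self) -> str:
        # Typed method that forwards to the wrapped hook.
        return self._inner.hook.get_ocr_engine()


# Before: pm.hook.get_ocr_engine()
# After:  pm.get_ocr_engine()
pm = PluginManagerSketch()
engine = pm.get_ocr_engine()
```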
Lossy JBIG2 has been removed due to well-documented risks of character
substitution errors (e.g., 6/8 confusion). The --jbig2-lossy and
--jbig2-page-group-size arguments are now deprecated and ignored with
a warning.
Changes:
- Remove jbig2_lossy and jbig2_page_group_size from OCROptions
- Simplify optimize.py to use single-image JBIG2 encoding only
  (no symbol dictionaries/JBIG2Globals)
- Remove convert_group() from jbig2enc.py
- Deprecate CLI args with warnings for backward compatibility
- Update documentation to explain lossless-only JBIG2
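
One way to accept a deprecated argument, warn, and then ignore it is
a custom argparse action. This is a sketch under assumptions: the
class and function names are hypothetical, and the use of the
warnings module (rather than a logger) is a guess at how the warning
is delivered.

```python
import argparse
import warnings


class WarnAndIgnore(argparse.Action):
    """Hypothetical action: accept the option, warn, store nothing."""

    def __call__(self, parser, namespace, values, option_string=None):
        warnings.warn(
            f"{option_string} is deprecated and ignored; "
            "JBIG2 encoding is lossless only",
            DeprecationWarning,
        )


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="jbig2-sketch")
    # dest=SUPPRESS keeps the deprecated names out of the namespace.
    parser.add_argument(
        "--jbig2-lossy",
        action=WarnAndIgnore,
        nargs=0,
        dest=argparse.SUPPRESS,
    )
    parser.add_argument(
        "--jbig2-page-group-size",
        action=WarnAndIgnore,
        type=int,
        metavar="N",
        dest=argparse.SUPPRESS,
    )
    return parser
```

Old command lines keep parsing, but the values never reach the
options object, which matches removing the two fields from
OCROptions.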
Migrate all code from the flat accessor pattern
(options.tesseract_timeout) to the plugin namespace pattern
(options.tesseract.timeout).
Key changes:
- Fix _get_plugin_options to raise AttributeError for unregistered
  namespaces instead of silently returning None
- Add _convert_value helper to convert PathLike to str for plugin
  model field compatibility
- Filter out _plugin_cache_* entries from JSON serialization to fix
  worker process serialization (test_simulate_oom_killer)
- Update tesseract_ocr.py, ghostscript.py, _validation_coordinator.py,
  and _pipelines/ocr.py to use options.tesseract.* and
  options.ghostscript.* accessors
- Update tests to use setup_plugin_infrastructure() for plugin
  model registration
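
The namespace lookup, PathLike conversion, and cache filtering listed
above might fit together roughly like this. It is a simplified
sketch: OptionsSketch, TesseractOptions, _REGISTRY, and to_json are
hypothetical, and the real options class is not reproduced here.
Only the helper names _get_plugin_options and _convert_value and the
_plugin_cache_ prefix come from the list.

```python
import json
import os
from dataclasses import dataclass
from pathlib import Path
from typing import Any


@dataclass
class TesseractOptions:
    """Hypothetical plugin model with two illustrative fields."""

    timeout: float = 180.0
    config: str = ""


# Hypothetical registry of plugin namespaces.
_REGISTRY = {"tesseract": TesseractOptions}


def _convert_value(value: Any) -> Any:
    """Convert PathLike objects to str; pass other values through."""
    return os.fspath(value) if isinstance(value, os.PathLike) else value


class OptionsSketch:
    def __init__(self, **flat: Any) -> None:
        self.__dict__.update(flat)

    def __getattr__(self, name: str) -> Any:
        # Reached only when normal attribute lookup fails.
        if name.startswith("_"):
            raise AttributeError(name)
        return self._get_plugin_options(name)

    def _get_plugin_options(self, namespace: str) -> Any:
        cache_key = f"_plugin_cache_{namespace}"
        if cache_key in self.__dict__:
            return self.__dict__[cache_key]
        model = _REGISTRY.get(namespace)
        if model is None:
            # Fail loudly instead of silently returning None.
            raise AttributeError(f"no plugin namespace {namespace!r}")
        prefix = namespace + "_"
        fields = {
            key[len(prefix):]: _convert_value(value)
            for key, value in self.__dict__.items()
            if key.startswith(prefix)
        }
        self.__dict__[cache_key] = model(**fields)
        return self.__dict__[cache_key]

    def to_json(self) -> str:
        # Cached model instances are not JSON-serializable; drop them.
        data = {
            key: _convert_value(value)
            for key, value in self.__dict__.items()
            if not key.startswith("_plugin_cache_")
        }
        return json.dumps(data, sort_keys=True)


opts = OptionsSketch(tesseract_timeout=30.0, tesseract_config=Path("cfg.txt"))
```

Raising AttributeError also keeps hasattr() and getattr() with a
default working as callers expect for unknown namespaces.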
The existing logic called an OCR plugin's get_languages function before
allowing the plugin to check whether its dependencies were available. This
raised an exception when Tesseract was not installed, when we were supposed
to issue an error message advising the user to install Tesseract.
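
The ordering fix can be illustrated with a small stand-in engine.
Everything here except the get_languages name is hypothetical
(FakeEngine, check_dependencies, MissingDependencyError, validate);
the point is only that the dependency check runs first.

```python
class MissingDependencyError(Exception):
    """Hypothetical error carrying a user-facing install hint."""


class FakeEngine:
    """Stand-in OCR plugin; 'installed' simulates Tesseract presence."""

    def __init__(self, installed: bool) -> None:
        self.installed = installed

    def check_dependencies(self) -> None:
        if not self.installed:
            raise MissingDependencyError(
                "Tesseract is not installed; please install it"
            )

    def get_languages(self) -> set[str]:
        if not self.installed:
            # What the old ordering ran into: a low-level failure
            # with no guidance for the user.
            raise FileNotFoundError("tesseract")
        return {"eng"}


def validate(engine: FakeEngine) -> set[str]:
    # Check dependencies first so a missing binary produces the
    # friendly message rather than an unexpected exception.
    engine.check_dependencies()
    return engine.get_languages()


def try_validate(engine: FakeEngine) -> str:
    """Return 'ok' or the name of the exception that was raised."""
    try:
        validate(engine)
    except Exception as exc:
        return type(exc).__name__
    return "ok"
```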