Commit Graph

79 Commits

Author SHA1 Message Date
James R. Barlow
740f67091c Rename OCROptions to OcrOptions for consistency
Technically OCROptions is more Pythonic but we have several pre-existing classes named OcrWhatever. Go with the local flow.
2026-01-12 23:37:54 -08:00
James R. Barlow
c69f293322 Add --mode/-m CLI argument with ProcessingMode enum
Introduce a new --mode (-m) argument that consolidates the three
mutually exclusive OCR processing options into a single enum:
- default: Error if text is found (standard behavior)
- force: Rasterize all content and run OCR (replaces --force-ocr)
- skip: Skip pages with existing text (replaces --skip-text)
- redo: Re-OCR pages, stripping old text layer (replaces --redo-ocr)

The legacy flags --force-ocr, --skip-text, and --redo-ocr remain as
silent aliases for backward compatibility. Both CLI and API usage
continue to work unchanged.
2026-01-12 15:23:08 -08:00
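One way the consolidation described in c69f293322 could be wired up with `argparse` is sketched below. This is an illustration under assumptions, not the project's actual parser: the program name `ocr-sketch` and the helper `build_parser` are hypothetical, and only the flag names and the four mode values are taken from the commit message.

```python
import argparse
from enum import Enum


class ProcessingMode(str, Enum):
    DEFAULT = "default"  # error if text is found
    FORCE = "force"      # rasterize all content and run OCR
    SKIP = "skip"        # skip pages with existing text
    REDO = "redo"        # re-OCR pages, stripping the old text layer


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser: one --mode option plus hidden legacy aliases."""
    parser = argparse.ArgumentParser(prog="ocr-sketch")
    parser.add_argument(
        "-m",
        "--mode",
        type=ProcessingMode,            # "force" -> ProcessingMode.FORCE
        choices=list(ProcessingMode),
        default=ProcessingMode.DEFAULT,
    )
    # Each legacy flag writes the equivalent enum member into the same dest.
    legacy = [
        ("--force-ocr", ProcessingMode.FORCE),
        ("--skip-text", ProcessingMode.SKIP),
        ("--redo-ocr", ProcessingMode.REDO),
    ]
    for flag, mode in legacy:
        parser.add_argument(
            flag,
            dest="mode",
            action="store_const",
            const=mode,
            help=argparse.SUPPRESS,     # keep the alias out of --help
        )
    return parser
```

In this sketch the last flag on the command line wins if a legacy flag and `--mode` are combined; a real implementation would probably reject conflicting combinations explicitly.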
James R. Barlow
bb5238e524 Update tests to use new OcrmypdfPluginManager interface
Replace pm.hook.method() calls with pm.method() calls to match the
refactored plugin manager that now uses composition over inheritance.
The hook attribute is no longer directly exposed; instead, type-safe
methods are provided directly on the plugin manager class.
2026-01-08 13:09:19 -08:00
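The shift from `pm.hook.method()` to `pm.method()` in bb5238e524 can be pictured roughly as follows. This is a minimal sketch of composition over inheritance; `_HookRelay`, `SketchPluginManager`, and the body of `check_options` are stand-ins invented for illustration and are not the real `OcrmypdfPluginManager`.

```python
class _HookRelay:
    """Stand-in for an underlying hook dispatcher (hypothetical)."""

    def check_options(self, options: dict) -> list[str]:
        return [f"checked {key}" for key in sorted(options)]


class SketchPluginManager:
    """Owns a dispatcher instead of inheriting from one."""

    def __init__(self) -> None:
        # Held privately; callers are not expected to reach through it.
        self._hooks = _HookRelay()

    def check_options(self, options: dict) -> list[str]:
        # Typed pass-through: the signature is visible to type checkers,
        # unlike a dynamically resolved pm.hook.check_options(...) call.
        return self._hooks.check_options(options)
```

A test that previously called `pm.hook.check_options(options=...)` would, under this shape, call `pm.check_options(...)` directly.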
James R. Barlow
16c2604a07 Remove lossy JBIG2 support, retain lossless JBIG2 only
Lossy JBIG2 has been removed due to well-documented risks of character
substitution errors (e.g., 6/8 confusion). The --jbig2-lossy and
--jbig2-page-group-size arguments are now deprecated and ignored with
a warning.

Changes:
- Remove jbig2_lossy and jbig2_page_group_size from OCROptions
- Simplify optimize.py to use single-image JBIG2 encoding only
  (no symbol dictionaries/JBIG2Globals)
- Remove convert_group() from jbig2enc.py
- Deprecate CLI args with warnings for backward compatibility
- Update documentation to explain lossless-only JBIG2
2025-12-23 02:45:07 -08:00
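A deprecated argument that is accepted but ignored with a warning, as 16c2604a07 describes for `--jbig2-lossy` and `--jbig2-page-group-size`, might be sketched like this. The action class `DeprecatedIgnored` and the helper `build_jbig2_parser` are hypothetical names, and the use of `FutureWarning` is an assumption; the commit does not say how the warning is emitted.

```python
import argparse
import warnings


class DeprecatedIgnored(argparse.Action):
    """Accept the option, warn, and store nothing (hypothetical helper)."""

    def __call__(self, parser, namespace, values, option_string=None):
        warnings.warn(
            f"{option_string} is deprecated and ignored",
            FutureWarning,
            stacklevel=2,
        )


def build_jbig2_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="ocr-sketch")
    # default=SUPPRESS keeps the dest out of the parsed namespace entirely.
    parser.add_argument(
        "--jbig2-lossy",
        action=DeprecatedIgnored,
        nargs=0,
        default=argparse.SUPPRESS,
    )
    parser.add_argument(
        "--jbig2-page-group-size",
        action=DeprecatedIgnored,
        type=int,
        default=argparse.SUPPRESS,
    )
    return parser
```

Old command lines keep parsing, but nothing downstream can accidentally read the removed settings because they never reach the namespace.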
James R. Barlow
9ebba91466 Use plugin namespace access pattern throughout codebase
Migrate all code from flat accessor pattern (options.tesseract_timeout)
to the plugin namespace pattern (options.tesseract.timeout).

Key changes:
- Fix _get_plugin_options to raise AttributeError for unregistered
  namespaces instead of silently returning None
- Add _convert_value helper to convert PathLike to str for plugin
  model field compatibility
- Filter out _plugin_cache_* entries from JSON serialization to fix
  worker process serialization (test_simulate_oom_killer)
- Update tesseract_ocr.py, ghostscript.py, _validation_coordinator.py,
  and _pipelines/ocr.py to use options.tesseract.* and
  options.ghostscript.* accessors
- Update tests to use setup_plugin_infrastructure() for plugin
  model registration
2025-12-23 02:02:21 -08:00
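The namespace access pattern from 9ebba91466, including the change to raise `AttributeError` for unregistered namespaces rather than return `None`, could look something like the sketch below. `OptionsSketch`, `TesseractSketch`, and `register` are hypothetical names; the real options class and its `_get_plugin_options` helper are not reproduced here.

```python
from dataclasses import dataclass


@dataclass
class TesseractSketch:
    """Hypothetical per-plugin options model."""

    timeout: float


class OptionsSketch:
    """Hypothetical options object exposing plugin namespaces as attributes."""

    def __init__(self) -> None:
        self._namespaces: dict[str, object] = {}

    def register(self, name: str, model: object) -> None:
        self._namespaces[name] = model

    def __getattr__(self, name: str) -> object:
        # __getattr__ runs only after normal attribute lookup has failed.
        # Reading via __dict__ avoids recursion if _namespaces is not set yet.
        namespaces = self.__dict__.get("_namespaces", {})
        try:
            return namespaces[name]
        except KeyError:
            # Fail loudly instead of silently returning None.
            raise AttributeError(
                f"no plugin namespace registered as {name!r}"
            ) from None


opts = OptionsSketch()
opts.register("tesseract", TesseractSketch(timeout=120.0))
print(opts.tesseract.timeout)  # namespaced access, not opts.tesseract_timeout
```

Raising `AttributeError` also makes `hasattr()` and `getattr(obj, name, default)` behave as expected for unregistered namespaces.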
James R. Barlow
b7737446e4 cli: push up imports 2025-12-21 12:21:48 -08:00
James R. Barlow
1f493ba789 refactor: post-AI code cleanup 2025-12-21 12:21:47 -08:00
James R. Barlow
e1216eddb0 test: modify test_two_languages to use list of languages
Co-authored-by: aider (openrouter/anthropic/claude-sonnet-4) <aider@aider.chat>
2025-12-21 12:21:47 -08:00
James R. Barlow
66a3e8508e feat: add comprehensive validators to OCROptions
Co-authored-by: aider (openrouter/anthropic/claude-sonnet-4) <aider@aider.chat>
2025-12-13 11:42:23 -08:00
James R. Barlow
7bb3a97208 refactor: update test_validation.py to use Pydantic model validation
Co-authored-by: aider (openrouter/anthropic/claude-sonnet-4) <aider@aider.chat>
2025-12-13 11:42:23 -08:00
James R. Barlow
d556014185 Remove language warning 2025-12-13 11:41:58 -08:00
James R. Barlow
c591f9601a Remove Latin hOCR test 2023-11-19 23:51:27 -08:00
James R. Barlow
95b14ee282 Refactor lossless reconstruction setter into separate function
Still messy but good enough as a start.
2023-10-24 00:52:31 -07:00
James R. Barlow
ea36aedb5f Overhaul version checkers to prefer Version to str 2023-09-25 00:59:44 -07:00
James R. Barlow
5124daa79f Fix test failures from preceding 2023-06-19 23:25:31 -07:00
James R. Barlow
b7eb93eb79 Adopt ruff and fix prelim lints 2023-04-14 00:19:17 -07:00
James R. Barlow
46d0978a09 Update version scripts to support Ghostscript 10.0 2022-10-03 21:59:31 -07:00
James R. Barlow
c2ccc7f29d Fix test failure due to new logging from pikepdf 2022-09-21 01:00:08 -07:00
James R. Barlow
acc70036cc Set minimum Tesseract to 4.1.1 2022-08-02 15:20:29 -07:00
James R. Barlow
67773da309 Drop support for Ghostscript <9.50 2022-08-02 15:01:10 -07:00
James R. Barlow
5fe3102e4e tests: new test to confirm correct printing of tesseract install advice 2022-08-01 12:31:37 -07:00
James R. Barlow
5b57520c98 tests: simplify some validation tests 2022-08-01 12:31:05 -07:00
James R. Barlow
30e4198f3a tests: fix test_validation when chi_sim not installed 2022-08-01 02:47:39 -07:00
James R. Barlow
ba372e5841 Reorganize validation to fix exception when Tesseract not installed
The existing logic would call an OCR plugin's get_languages function before
allowing the plugin to check if its dependencies were available. This caused
an exception if Tesseract was not installed, when we were supposed to issue
an error message advising the user to install Tesseract.
2022-08-01 02:04:09 -07:00
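The ordering problem fixed in ba372e5841 can be illustrated with a small sketch: dependency checks run first, so a missing OCR engine produces install advice rather than an unrelated exception from the language query. All names here (`EngineSketch`, `MissingDependencyError`, `validate`) are hypothetical and only loosely mirror the plugin's `get_languages` function mentioned in the commit.

```python
class MissingDependencyError(Exception):
    """Raised with install advice when the OCR engine is absent."""


class EngineSketch:
    """Hypothetical OCR plugin with a dependency check and a language query."""

    def __init__(self, installed: bool) -> None:
        self.installed = installed

    def check_dependencies(self) -> None:
        if not self.installed:
            raise MissingDependencyError("OCR engine not found; please install it")

    def get_languages(self) -> set[str]:
        if not self.installed:
            # What the user used to see: an unhelpful low-level failure.
            raise FileNotFoundError("engine binary missing")
        return {"eng"}


def validate(engine: EngineSketch, requested: set[str]) -> None:
    # Dependencies first, so the friendly error wins over the raw exception.
    engine.check_dependencies()
    missing = requested - engine.get_languages()
    if missing:
        raise ValueError(f"languages not installed: {sorted(missing)}")
```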
James R. Barlow
80ed2117cc Change to SPDX license tracking 2022-07-28 01:10:07 -07:00
James R. Barlow
dc6f1a266a Modernize type annotations 2022-07-23 00:39:24 -07:00
James R. Barlow
17a5b8b43c Refactor reporting of optimization failures 2022-06-13 01:30:15 -07:00
James R. Barlow
61069660a2 Move optimization options to plugin 2022-06-12 02:42:16 -07:00
James R. Barlow
f91faf9795 Add new argument --tesseract-thresholding to control tesseract thresholding where available
Also add missing test for --tesseract-oem
2021-12-06 15:38:14 -08:00
James R. Barlow
6c34d59836 tesseract: yet another version variant 2021-11-04 00:17:18 -07:00
James R. Barlow
790d3022f6 Implement --output-type=none to skip producing the PDF and use only the sidecar
Closes #787
2021-09-26 01:07:34 -07:00
James R. Barlow
5f01c5e330 Fix another species of Tesseract version number breaking regex
Fixes #795
2021-06-16 00:09:03 -07:00
James R. Barlow
7b1e5b4f41 Fix "invalid version number" for untagged tesseract versions
Fixes #770
2021-04-26 01:18:07 -07:00
James R. Barlow
336d274a54 Drop remnants of support for Tesseract without has_textonly_pdf
Also improve Tesseract version checking so it can compare all of their
weird conventions.
2021-04-07 23:05:21 -07:00
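Several commits in this history deal with version strings that do not compare cleanly. A simplified sketch of the general idea follows: pull out the numeric release and compare it as a tuple of integers rather than as text. The regular expression, the function name `release_tuple`, and the sample strings are assumptions for illustration; the project's real checker handles more cases and is not shown here.

```python
import re

# Optional leading "v", then dot-separated integers; any suffix is ignored.
_RELEASE = re.compile(r"v?(\d+(?:\.\d+)*)")


def release_tuple(raw: str) -> tuple[int, ...]:
    """Return the numeric release part of a version string as integers."""
    match = _RELEASE.match(raw.strip())
    if match is None:
        raise ValueError(f"unrecognized version string: {raw!r}")
    return tuple(int(part) for part in match.group(1).split("."))


# Plain string comparison orders these the wrong way round...
print("10.0" < "9.50")
# ...while integer tuples order them correctly.
print(release_tuple("10.0") > release_tuple("9.50"))
```

This sketch deliberately drops pre-release suffixes, so it treats an alpha build and the final release of the same number as equal; a dedicated version type would distinguish them.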
James R. Barlow
173a80864d Delinting 2021-04-07 02:09:45 -07:00
James R. Barlow
aa115a8be3 Remove pytest_helpers_namespace 2021-04-07 01:56:51 -07:00
James R. Barlow
a4e1f8e1f3 Merge branch 'feature/lambda' 2021-04-01 16:36:22 -07:00
James R. Barlow
079c162a96 Ensure sidecar is not input or output file 2021-03-05 00:29:42 -08:00
Dima Kuznetsov
5e2206bae7 Allow --sidecar along with --pages (#735) 2021-02-19 16:55:35 -08:00
James R. Barlow
16bda74974 Refactor - decouple progressbar from executor 2021-01-30 20:42:00 -08:00
James R. Barlow
d274d88929 Refactor to eliminate global state in _concurrent 2021-01-30 17:36:30 -08:00
James R. Barlow
7bccb8c748 tests: fix concurrency 2021-01-24 23:46:33 -08:00
James R. Barlow
babc76fa74 tests: assert that most patched functions are called
We were not actually checking whether the functions we patched were called
when expected.
2020-12-28 23:58:33 -08:00
James R. Barlow
3707af3b74 Change pdf.root to pdf.Root 2020-11-03 01:30:31 -08:00
James R. Barlow
bfe4a5b329 Tidy a log message 2020-09-25 00:17:57 -07:00
James R. Barlow
e6a7b58863 Merge branch 'de-gpl' 2020-08-12 12:20:38 -07:00
James R. Barlow
bed74501fc Fix test breakage in validation
Broken in commit 4cc0dc
2020-08-05 01:35:26 -07:00
James R. Barlow
aa0ec40102 Change license of all GPLv3 files to MPL-2.0
https://github.com/jbarlow83/OCRmyPDF/issues/600
2020-08-05 00:44:42 -07:00
James R. Barlow
892db88f0e test_two_languages: use narrower test 2020-06-12 14:33:02 -07:00
James R. Barlow
eeb44f78cc Fix tests that failed on other platforms from previous fix 2020-06-12 12:59:46 -07:00