Commit Graph

124 Commits

Author SHA1 Message Date
James R. Barlow
ef88ba3f95 Add OcrOptions as first-class argument to ocr() function
Allow passing an OcrOptions object directly to ocr() as the first
positional argument, providing a cleaner API for programmatic use.
The old-style API with individual parameters remains fully supported.
2026-01-20 10:20:52 -08:00
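The dual calling convention described in ef88ba3f95 can be sketched as follows. This is a minimal illustration and not OCRmyPDF's actual signature: `Options`, `process`, and the `language` field are hypothetical names chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class Options:
    """Hypothetical stand-in for an options object such as OcrOptions."""

    input_file: str
    output_file: str
    language: str = "eng"


def process(input_file_or_options, output_file=None, **kwargs):
    """Accept either an Options object or old-style individual parameters.

    Returns the normalized Options so both call styles converge on one path.
    """
    if isinstance(input_file_or_options, Options):
        # New style: everything is already in the options object.
        if output_file is not None or kwargs:
            raise TypeError("pass an Options object or individual parameters, not both")
        return input_file_or_options
    # Old style: build the options object from the individual parameters.
    if output_file is None:
        raise TypeError("output_file is required")
    return Options(input_file_or_options, output_file, **kwargs)
```

Normalizing both call styles into one object at the entry point means the rest of the pipeline only ever deals with a single options type.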
James R. Barlow
740f67091c Rename OCROptions to OcrOptions for consistency
Technically OCROptions is more Pythonic but we have several pre-existing classes named OcrWhatever. Go with the local flow.
2026-01-12 23:37:54 -08:00
James R. Barlow
c69f293322 Add --mode/-m CLI argument with ProcessingMode enum
Introduce a new --mode (-m) argument that consolidates the three
mutually exclusive OCR processing options, plus the default behavior,
into a single enum:
- default: Error if text is found (standard behavior)
- force: Rasterize all content and run OCR (replaces --force-ocr)
- skip: Skip pages with existing text (replaces --skip-text)
- redo: Re-OCR pages, stripping old text layer (replaces --redo-ocr)

The legacy flags --force-ocr, --skip-text, and --redo-ocr remain as
silent aliases for backward compatibility. Both CLI and API usage
continue to work unchanged.
2026-01-12 15:23:08 -08:00
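One way the consolidation in c69f293322 could look with argparse is sketched below. It assumes the legacy flags simply write to the same destination as --mode; the parser name and the decision to let the last flag win are assumptions of this sketch, not details taken from the commit.

```python
import argparse
from enum import Enum


class ProcessingMode(Enum):
    DEFAULT = "default"
    FORCE = "force"
    SKIP = "skip"
    REDO = "redo"


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="mode-sketch")
    # argparse calls ProcessingMode("force") etc. to convert the string value.
    parser.add_argument(
        "-m", "--mode", type=ProcessingMode, default=ProcessingMode.DEFAULT
    )
    # Legacy flags are hidden from --help and store into the same destination.
    legacy = {
        "--force-ocr": ProcessingMode.FORCE,
        "--skip-text": ProcessingMode.SKIP,
        "--redo-ocr": ProcessingMode.REDO,
    }
    for flag, mode in legacy.items():
        parser.add_argument(
            flag,
            dest="mode",
            action="store_const",
            const=mode,
            help=argparse.SUPPRESS,
        )
    return parser
```

In this sketch, passing two conflicting flags is not rejected; whichever appears last on the command line takes effect. A real implementation would likely validate that case explicitly.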
James R. Barlow
c9ea07e954 Reduce chattiness of fonttools 2026-01-12 10:16:58 -08:00
James R. Barlow
f5617ce44e Refactor OcrmypdfPluginManager to use composition over inheritance
Replace inheritance from pluggy.PluginManager with composition pattern,
providing a type-safe interface for all 16 hooks defined in pluginspec.py.
The underlying pluggy manager is now accessible via the .pluggy property
for advanced use cases like set_blocked().

This change enables IDE autocomplete and type checking for all hook calls
while maintaining full backward compatibility with the plugin system.
2026-01-07 17:23:13 -08:00
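The composition pattern described in f5617ce44e can be illustrated without pluggy itself. In the sketch below, `DynamicManager` is a hypothetical stand-in for a dynamically dispatched manager such as pluggy.PluginManager, and `get_page_count` is an invented hook name; only the shape of the wrapper is the point.

```python
from __future__ import annotations

from typing import Any, Callable


class DynamicManager:
    """Hypothetical stand-in for a dynamic manager such as pluggy.PluginManager."""

    def __init__(self) -> None:
        self._impls: dict[str, list[Callable[..., Any]]] = {}

    def register(self, hook_name: str, impl: Callable[..., Any]) -> None:
        self._impls.setdefault(hook_name, []).append(impl)

    def call(self, hook_name: str, **kwargs: Any) -> list[Any]:
        return [impl(**kwargs) for impl in self._impls.get(hook_name, [])]


class TypedManager:
    """Wraps the dynamic manager instead of inheriting from it."""

    def __init__(self, inner: DynamicManager | None = None) -> None:
        self._inner = inner if inner is not None else DynamicManager()

    @property
    def inner(self) -> DynamicManager:
        """Escape hatch to the wrapped manager for advanced use."""
        return self._inner

    def get_page_count(self, path: str) -> list[int]:
        """One explicit, typed method per hook (hook name is hypothetical)."""
        return self._inner.call("get_page_count", path=path)
```

Because each hook is an ordinary method with annotations, editors and type checkers can see its parameters, while the property keeps the full dynamic interface reachable.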
James R. Barlow
e9bfce34f1 Fix ruff linting issues
- Use X | Y syntax in isinstance calls (UP038)
- Remove trailing whitespace from blank lines (W293)
2025-12-23 03:07:48 -08:00
James R. Barlow
16c2604a07 Remove lossy JBIG2 support, retain lossless JBIG2 only
Lossy JBIG2 has been removed due to well-documented risks of character
substitution errors (e.g., 6/8 confusion). The --jbig2-lossy and
--jbig2-page-group-size arguments are now deprecated and ignored with
a warning.

Changes:
- Remove jbig2_lossy and jbig2_page_group_size from OCROptions
- Simplify optimize.py to use single-image JBIG2 encoding only
  (no symbol dictionaries/JBIG2Globals)
- Remove convert_group() from jbig2enc.py
- Deprecate CLI args with warnings for backward compatibility
- Update documentation to explain lossless-only JBIG2
2025-12-23 02:45:07 -08:00
James R. Barlow
ed813cec67 feat: add --rasterizer CLI option to select PDF rasterization backend
Add user control over which rasterizer is used for PDF page rendering:
- 'auto' (default): prefers pypdfium when available, falls back to Ghostscript
- 'pypdfium': force pypdfium2 (errors if not installed)
- 'ghostscript': force traditional Ghostscript rasterizer

Changes:
- Add rasterizer field with validation to OCROptions model
- Add --rasterizer CLI argument in the Advanced options group
- Update rasterize_pdf_page hookspec to pass options to plugins
- Update pypdfium plugin with check_options hook for availability check
- Update both plugins to respect the rasterizer option

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-21 12:29:17 -08:00
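The selection rules listed in ed813cec67 amount to a small decision function. The sketch below is an assumption about how such a resolver could be written, not the plugin code itself; `resolve_rasterizer` and `pypdfium_installed` are hypothetical names, and availability is checked only by looking for an importable `pypdfium2` module.

```python
from __future__ import annotations

import importlib.util

VALID_RASTERIZERS = ("auto", "pypdfium", "ghostscript")


def pypdfium_installed() -> bool:
    """Report whether a pypdfium2 module can be found, without importing it."""
    return importlib.util.find_spec("pypdfium2") is not None


def resolve_rasterizer(requested: str, have_pypdfium: bool | None = None) -> str:
    """Map the user's choice to a concrete backend name."""
    if requested not in VALID_RASTERIZERS:
        raise ValueError(f"unknown rasterizer: {requested!r}")
    if have_pypdfium is None:
        have_pypdfium = pypdfium_installed()
    if requested == "auto":
        # Prefer pypdfium when available, otherwise fall back to Ghostscript.
        return "pypdfium" if have_pypdfium else "ghostscript"
    if requested == "pypdfium" and not have_pypdfium:
        raise RuntimeError("pypdfium was requested but pypdfium2 is not installed")
    return requested
```

Passing `have_pypdfium` explicitly keeps the function testable on machines with or without the optional dependency.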
James R. Barlow
0ad7f5fc13 feat: add dynamic nested access to plugin options
Completes Phase 5 of the CLI refactoring plan by enabling nested
plugin option access (e.g., options.tesseract.timeout) alongside
the legacy flat access (options.tesseract_timeout).

Changes:
- Add module-level plugin option model registry in _options.py
- Add __getattr__ to OCROptions for dynamic namespace access
- Register plugin models in setup_plugin_infrastructure()
- Add test for nested plugin option access

Plugin option instances are lazily created from flat field values
and cached for subsequent access.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-21 12:21:48 -08:00
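The mechanism described in 0ad7f5fc13, a module-level registry plus `__getattr__` with lazy caching, can be sketched as below. The real implementation uses the project's own option models; here `Options`, `TesseractOptions`, `PLUGIN_MODELS`, and the `oem` field are hypothetical, and a plain dataclass stands in for the plugin model.

```python
from dataclasses import dataclass, fields


@dataclass
class TesseractOptions:
    """Hypothetical plugin option model."""

    timeout: float = 180.0
    oem: int = 3


# Module-level registry: namespace name -> plugin option model class.
PLUGIN_MODELS: dict[str, type] = {"tesseract": TesseractOptions}


class Options:
    def __init__(self, **flat):
        self._flat = dict(flat)  # legacy flat values, e.g. tesseract_timeout
        self._cache = {}         # namespace name -> model instance

    def __getattr__(self, name):
        # Called only when normal lookup fails. Private names are rejected
        # first so a missing _flat or _cache can never cause recursion.
        if name.startswith("_"):
            raise AttributeError(name)
        if name in self._flat:
            return self._flat[name]  # flat access: options.tesseract_timeout
        if name in PLUGIN_MODELS:
            if name not in self._cache:
                model = PLUGIN_MODELS[name]
                prefix = name + "_"
                # Collect flat values that correspond to fields of the model.
                values = {
                    f.name: self._flat[prefix + f.name]
                    for f in fields(model)
                    if prefix + f.name in self._flat
                }
                self._cache[name] = model(**values)
            return self._cache[name]  # nested access: options.tesseract.timeout
        raise AttributeError(name)
```

The nested instance is built on first access and reused afterwards, so repeated lookups return the same object.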
James R. Barlow
28d6ea0f10 feat: Add CLI generation methods to plugin option models
Co-authored-by: aider (openrouter/anthropic/claude-sonnet-4) <aider@aider.chat>
2025-12-21 12:21:48 -08:00
James R. Barlow
6913ec7cb8 feat: add PluginOptionRegistry for dynamic option models
Co-authored-by: aider (openrouter/anthropic/claude-sonnet-4) <aider@aider.chat>
2025-12-21 12:21:48 -08:00
James R. Barlow
62ad37b276 refactor: centralize plugin manager setup
Co-authored-by: aider (openrouter/anthropic/claude-sonnet-4) <aider@aider.chat>
2025-12-21 12:21:48 -08:00
James R. Barlow
f04b5504e8 test: fix hOCR pipeline output folder handling
Fix how the output folder is handled in the hOCR pipeline by making it a proper field in OCROptions and updating the API functions accordingly.

Co-authored-by: aider (openrouter/anthropic/claude-sonnet-4) <aider@aider.chat>
2025-12-21 12:21:47 -08:00
James R. Barlow
f0c292f4e1 refactor: Remove CLI-parser dependencies in experimental API functions
This commit updates `_pdf_to_hocr` and `_hocr_to_ocr_pdf` to use direct OCROptions construction, eliminating the last vestiges of CLI-parser dependency in the experimental APIs.

Key changes:
- Removed `parser = get_parser()` calls
- Added plugin validation similar to main `ocr()` function
- Simplified plugin manager hook calls
- Added None value filtering to use OCROptions defaults
- Maintained error handling and extra_attrs logic

The refactoring makes these experimental APIs truly API-first and simplifies the code by removing unnecessary CLI-related complexity.

Co-authored-by: aider (openrouter/anthropic/claude-sonnet-4) <aider@aider.chat>
2025-12-21 12:21:47 -08:00
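The "None value filtering" mentioned in f0c292f4e1 is a common pattern when API keyword arguments default to None but the options model has its own defaults. A minimal sketch, with a hypothetical `Options` dataclass and `build_options` helper:

```python
from dataclasses import dataclass


@dataclass
class Options:
    """Hypothetical options model with its own defaults."""

    language: str = "eng"
    jobs: int = 1


def build_options(**kwargs) -> Options:
    """Drop arguments left as None so the model's defaults apply instead."""
    supplied = {key: value for key, value in kwargs.items() if value is not None}
    return Options(**supplied)
```

Without the filter, `language=None` would overwrite the default `"eng"` rather than fall back to it.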
James R. Barlow
1f493ba789 refactor: post-AI code cleanup 2025-12-21 12:21:47 -08:00
James R. Barlow
a869a4ac42 fix: filter out None values from OCROptions kwargs
Co-authored-by: aider (openrouter/anthropic/claude-sonnet-4) <aider@aider.chat>
2025-12-21 12:21:47 -08:00
James R. Barlow
48a2fdb0f2 refactor: replace _kwargs_to_cmdline with direct OCROptions construction in experimental API functions
Co-authored-by: aider (openrouter/anthropic/claude-sonnet-4) <aider@aider.chat>
2025-12-21 12:21:47 -08:00
James R. Barlow
91a2d39845 refactor: replace command line synthesis with direct OCROptions construction
Co-authored-by: aider (openrouter/anthropic/claude-sonnet-4) <aider@aider.chat>
2025-12-21 12:21:47 -08:00
James R. Barlow
cb22a35834 refactor: convert hOCR API entry points to use OCROptions
Co-authored-by: aider (openrouter/anthropic/claude-sonnet-4) <aider@aider.chat>
2025-12-21 12:21:47 -08:00
James R. Barlow
1579337ebe refactor: Create OCROptions model with Namespace compatibility
This commit introduces a new `OCROptions` class in `_options.py` that provides:
- Proper typing for OCRmyPDF options
- Pydantic validation
- Backward compatibility with `argparse.Namespace`
- Gradual migration support for the options system

Key changes:
- Added comprehensive option fields with type hints
- Implemented custom attribute access methods
- Created conversion methods between Namespace and OCROptions
- Updated type hints in multiple files to support both types
- Maintained existing validation logic

The new model allows for a step-by-step refactoring of the options handling throughout the project.

Co-authored-by: aider (openrouter/anthropic/claude-sonnet-4) <aider@aider.chat>
2025-12-13 11:40:57 -08:00
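The Namespace compatibility described in 1579337ebe can be pictured as a pair of conversion methods. The commit uses Pydantic for validation; to stay dependency-free this sketch substitutes a dataclass with a `__post_init__` check, and the field names and method names are hypothetical.

```python
import argparse
from dataclasses import asdict, dataclass, fields


@dataclass
class Options:
    """Hypothetical typed options model (dataclass stand-in for Pydantic)."""

    input_file: str
    output_file: str
    jobs: int = 1

    def __post_init__(self) -> None:
        # Stand-in for model validation.
        if self.jobs < 1:
            raise ValueError("jobs must be at least 1")

    @classmethod
    def from_namespace(cls, ns: argparse.Namespace) -> "Options":
        """Build the model from a parsed Namespace, ignoring unknown attributes."""
        known = {f.name for f in fields(cls)}
        return cls(**{k: v for k, v in vars(ns).items() if k in known})

    def to_namespace(self) -> argparse.Namespace:
        """Convert back for code that still expects an argparse.Namespace."""
        return argparse.Namespace(**asdict(self))
```

Having both directions available lets call sites migrate one at a time: new code takes the typed model, while untouched code keeps receiving a Namespace.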
James R. Barlow
a385cd967d docs: Improve ocrmypdf.api 2025-11-10 15:58:47 -08:00
James R. Barlow
ee47e986f3 docs: Improve module-level docstring for OCRmyPDF Python API
Co-authored-by: aider (anthropic/claude-sonnet-4-20250514) <aider@aider.chat>
2025-11-10 10:33:26 -08:00
Alina Bürge
a9a8b39dba Fix the use of the plugin_manager argument (#1555) 2025-08-18 13:00:39 -07:00
James R. Barlow
0f82d7223e Fix some typing issues 2024-10-27 12:31:16 -07:00
James R. Barlow
7a8cc21e31 Add support for sidecar output to io.BytesIO
Closes #1252
2024-04-07 01:38:55 -07:00
James R. Barlow
3a3635f7f9 Python 3.10 cleanup, manual fixes 2024-02-14 12:48:17 -08:00
James R. Barlow
71166f7be8 Make hocr API experimental for now
This commit can be reverted when we are ready to release a new version.
2023-10-30 00:07:10 -07:00
James R. Barlow
db3df13e95 Remove ocrmypdf._sync 2023-10-24 00:54:31 -07:00
James R. Barlow
4dbc5e1dba Fix some typing issues 2023-10-24 00:54:31 -07:00
James R. Barlow
f238e721ed Improve documentation of new public hOCR APIs 2023-10-24 00:54:31 -07:00
James R. Barlow
23951c9e38 Working HOCR folder to PDF converter 2023-10-24 00:54:30 -07:00
James R. Barlow
e8ae370ceb Eliminate api= kwarg and implicit creation of pluginmanager 2023-10-24 00:54:30 -07:00
James R. Barlow
68bb38d0ad pdf_to_hocr: improve plugin handling 2023-10-24 00:52:31 -07:00
James R. Barlow
0443e87345 Introduce pdf_to_hocr API 2023-10-24 00:52:31 -07:00
James R. Barlow
b3de5833d3 Refactor conversion of ocrmypdf.ocr() arguments to cmdline 2023-10-24 00:52:31 -07:00
James R. Barlow
539f0ee0ce Document some missing CLI options to API 2023-10-03 23:54:24 -07:00
James R. Barlow
f4c211fa2d Use Python 3.9-style type hinting for tuple[] and AbstractSet -> Set 2023-10-01 00:09:05 -07:00
James R. Barlow
113a6b45bd ruff autofixes (mostly typing.* -> collections.abc.*) 2023-10-01 00:02:53 -07:00
James R. Barlow
85e31d0a19 Tidy imports and line length 2023-09-25 01:04:01 -07:00
James R. Barlow
47b0f28564 Minor documentation and typing fixes 2023-09-25 00:19:48 -07:00
James R. Barlow
de2bb5ce8c Remove tqdm dependency and TqdmConsole
Might be too aggressive? No deprecation warning....
2023-09-21 00:05:39 -07:00
James R. Barlow
19045c4f21 Replace coloredlogs and tqdm with rich 2023-08-11 01:47:42 -07:00
James R. Barlow
e8ed510543 Add stop on soft render errors and option to override 2023-06-01 23:49:34 -07:00
James R. Barlow
a99e40fa84 Update notes about ocrmypdf.ocr usage 2023-04-14 11:54:58 -07:00
James R. Barlow
9ce692a6f1 ruff: further fixes 2023-04-14 02:39:36 -07:00
James R. Barlow
33b70be7d5 ruff: more fixes, mainly missing docstrings 2023-04-14 02:16:38 -07:00
James R. Barlow
4924b11b6b Additional ruff fixes 2023-04-14 01:25:16 -07:00
James R. Barlow
9b8d14d16e Accept most of ruff's delinting 2023-04-14 00:45:34 -07:00
James R. Barlow
4d2f499f97 Remove optional status of coloredlogs
Everything optional is a possible complication.
Better to remove the option.
2022-08-04 04:15:56 -07:00
James R. Barlow
53db866ef9 Remove deprecated exception PdfMergeFailedError 2022-08-04 03:54:55 -07:00