OCRmyPDF

mirror of https://github.com/ocrmypdf/OCRmyPDF.git synced 2026-02-08 13:22:34 -05:00

Author	SHA1	Message	Date
James R. Barlow	ef88ba3f95	Add OcrOptions as first-class argument to ocr() function Allow passing an OcrOptions object directly to ocr() as the first positional argument, providing a cleaner API for programmatic use. The old-style API with individual parameters remains fully supported.	2026-01-20 10:20:52 -08:00
James R. Barlow	740f67091c	Rename OCROptions to OcrOptions for consistency Technically OCROptions is more Pythonic but we have several pre-existing classes named OcrWhatever. Go with the local flow.	2026-01-12 23:37:54 -08:00
James R. Barlow	c69f293322	Add --mode/-m CLI argument with ProcessingMode enum Introduce a new --mode (-m) argument that consolidates the three mutually exclusive OCR processing options into a single enum: - default: Error if text is found (standard behavior) - force: Rasterize all content and run OCR (replaces --force-ocr) - skip: Skip pages with existing text (replaces --skip-text) - redo: Re-OCR pages, stripping old text layer (replaces --redo-ocr) The legacy flags --force-ocr, --skip-text, and --redo-ocr remain as silent aliases for backward compatibility. Both CLI and API usage continue to work unchanged.	2026-01-12 15:23:08 -08:00
James R. Barlow	c9ea07e954	Reduce chattiness of fonttools	2026-01-12 10:16:58 -08:00
James R. Barlow	f5617ce44e	Refactor OcrmypdfPluginManager to use composition over inheritance Replace inheritance from pluggy.PluginManager with composition pattern, providing a type-safe interface for all 16 hooks defined in pluginspec.py. The underlying pluggy manager is now accessible via the .pluggy property for advanced use cases like set_blocked(). This change enables IDE autocomplete and type checking for all hook calls while maintaining full backward compatibility with the plugin system.	2026-01-07 17:23:13 -08:00
James R. Barlow	e9bfce34f1	Fix ruff linting issues - Use X | Y syntax in isinstance calls (UP038) - Remove trailing whitespace from blank lines (W293)	2025-12-23 03:07:48 -08:00
James R. Barlow	16c2604a07	Remove lossy JBIG2 support, retain lossless JBIG2 only Lossy JBIG2 has been removed due to well-documented risks of character substitution errors (e.g., 6/8 confusion). The --jbig2-lossy and --jbig2-page-group-size arguments are now deprecated and ignored with a warning. Changes: - Remove jbig2_lossy and jbig2_page_group_size from OCROptions - Simplify optimize.py to use single-image JBIG2 encoding only (no symbol dictionaries/JBIG2Globals) - Remove convert_group() from jbig2enc.py - Deprecate CLI args with warnings for backward compatibility - Update documentation to explain lossless-only JBIG2	2025-12-23 02:45:07 -08:00
James R. Barlow	ed813cec67	feat: add --rasterizer CLI option to select PDF rasterization backend Add user control over which rasterizer is used for PDF page rendering: - 'auto' (default): prefers pypdfium when available, falls back to Ghostscript - 'pypdfium': force pypdfium2 (errors if not installed) - 'ghostscript': force traditional Ghostscript rasterizer Changes: - Add rasterizer field with validation to OCROptions model - Add --rasterizer CLI argument in the Advanced options group - Update rasterize_pdf_page hookspec to pass options to plugins - Update pypdfium plugin with check_options hook for availability check - Update both plugins to respect the rasterizer option 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2025-12-21 12:29:17 -08:00
James R. Barlow	0ad7f5fc13	feat: add dynamic nested access to plugin options Completes Phase 5 of the CLI refactoring plan by enabling nested plugin option access (e.g., options.tesseract.timeout) alongside the legacy flat access (options.tesseract_timeout). Changes: - Add module-level plugin option model registry in _options.py - Add __getattr__ to OCROptions for dynamic namespace access - Register plugin models in setup_plugin_infrastructure() - Add test for nested plugin option access Plugin option instances are lazily created from flat field values and cached for subsequent access. 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2025-12-21 12:21:48 -08:00
James R. Barlow	28d6ea0f10	feat: Add CLI generation methods to plugin option models Co-authored-by: aider (openrouter/anthropic/claude-sonnet-4) <aider@aider.chat>	2025-12-21 12:21:48 -08:00
James R. Barlow	6913ec7cb8	feat: add PluginOptionRegistry for dynamic option models Co-authored-by: aider (openrouter/anthropic/claude-sonnet-4) <aider@aider.chat>	2025-12-21 12:21:48 -08:00
James R. Barlow	62ad37b276	refactor: centralize plugin manager setup Co-authored-by: aider (openrouter/anthropic/claude-sonnet-4) <aider@aider.chat>	2025-12-21 12:21:48 -08:00
James R. Barlow	f04b5504e8	test: fix hOCR pipeline output folder handling The commit message captures the essence of the changes: we fixed how the output folder is handled in the hOCR pipeline by making it a proper field in OCROptions and updating the API functions accordingly. Would you like me to generate a full commit message or is this sufficient? Co-authored-by: aider (openrouter/anthropic/claude-sonnet-4) <aider@aider.chat>	2025-12-21 12:21:47 -08:00
James R. Barlow	f0c292f4e1	refactor: Remove CLI-parser dependencies in experimental API functions This commit updates `_pdf_to_hocr` and `_hocr_to_ocr_pdf` to use direct OCROptions construction, eliminating the last vestiges of CLI-parser dependency in the experimental APIs. Key changes: - Removed `parser = get_parser()` calls - Added plugin validation similar to main `ocr()` function - Simplified plugin manager hook calls - Added None value filtering to use OCROptions defaults - Maintained error handling and extra_attrs logic The refactoring makes these experimental APIs truly API-first and simplifies the code by removing unnecessary CLI-related complexity. Co-authored-by: aider (openrouter/anthropic/claude-sonnet-4) <aider@aider.chat>	2025-12-21 12:21:47 -08:00
James R. Barlow	1f493ba789	refactor: post-AI code cleanup	2025-12-21 12:21:47 -08:00
James R. Barlow	a869a4ac42	fix: filter out None values from OCROptions kwargs Co-authored-by: aider (openrouter/anthropic/claude-sonnet-4) <aider@aider.chat>	2025-12-21 12:21:47 -08:00
James R. Barlow	48a2fdb0f2	refactor: replace `_kwargs_to_cmdline` with direct OCROptions construction in experimental API functions Co-authored-by: aider (openrouter/anthropic/claude-sonnet-4) <aider@aider.chat>	2025-12-21 12:21:47 -08:00
James R. Barlow	91a2d39845	refactor: replace command line synthesis with direct OCROptions construction Co-authored-by: aider (openrouter/anthropic/claude-sonnet-4) <aider@aider.chat>	2025-12-21 12:21:47 -08:00
James R. Barlow	cb22a35834	refactor: convert hOCR API entry points to use OCROptions Co-authored-by: aider (openrouter/anthropic/claude-sonnet-4) <aider@aider.chat>	2025-12-21 12:21:47 -08:00
James R. Barlow	1579337ebe	refactor: Create OCROptions model with Namespace compatibility This commit introduces a new `OCROptions` class in `_options.py` that provides: - Proper typing for OCRmyPDF options - Pydantic validation - Backward compatibility with `argparse.Namespace` - Gradual migration support for the options system Key changes: - Added comprehensive option fields with type hints - Implemented custom attribute access methods - Created conversion methods between Namespace and OCROptions - Updated type hints in multiple files to support both types - Maintained existing validation logic The new model allows for a step-by-step refactoring of the options handling throughout the project. Co-authored-by: aider (openrouter/anthropic/claude-sonnet-4) <aider@aider.chat>	2025-12-13 11:40:57 -08:00
James R. Barlow	a385cd967d	docs: Improve ocrmypdf.api	2025-11-10 15:58:47 -08:00
James R. Barlow	ee47e986f3	docs: Improve module-level docstring for OCRmyPDF Python API Co-authored-by: aider (anthropic/claude-sonnet-4-20250514) <aider@aider.chat>	2025-11-10 10:33:26 -08:00
Alina Bürge	a9a8b39dba	Fix the use of the plugin_manager argument (#1555)	2025-08-18 13:00:39 -07:00
James R. Barlow	0f82d7223e	Fix some typing issues	2024-10-27 12:31:16 -07:00
James R. Barlow	7a8cc21e31	Add support for sidecar output to io.BytesIO Closes #1252	2024-04-07 01:38:55 -07:00
James R. Barlow	3a3635f7f9	Python 3.10 cleanup, manual fixes	2024-02-14 12:48:17 -08:00
James R. Barlow	71166f7be8	Make hocr API experimental for now This commit can be reverted when we are ready to release a new version.	2023-10-30 00:07:10 -07:00
James R. Barlow	db3df13e95	Remove ocrmypdf._sync	2023-10-24 00:54:31 -07:00
James R. Barlow	4dbc5e1dba	Fix some typing issues	2023-10-24 00:54:31 -07:00
James R. Barlow	f238e721ed	Improve documentation of new public hOCR APIs	2023-10-24 00:54:31 -07:00
James R. Barlow	23951c9e38	Working HOCR folder to PDF converter	2023-10-24 00:54:30 -07:00
James R. Barlow	e8ae370ceb	Eliminate api= kwarg and implicit creation of pluginmanager	2023-10-24 00:54:30 -07:00
James R. Barlow	68bb38d0ad	pdf_to_hocr: improve plugin handling	2023-10-24 00:52:31 -07:00
James R. Barlow	0443e87345	Introduce pdf_to_hocr API	2023-10-24 00:52:31 -07:00
James R. Barlow	b3de5833d3	Refactor conversion of ocrmypdf.ocr() arguments to cmdline	2023-10-24 00:52:31 -07:00
James R. Barlow	539f0ee0ce	Document some missing CLI options to API	2023-10-03 23:54:24 -07:00
James R. Barlow	f4c211fa2d	Use Python 3.9-style type hinting for tuple[] and AbstractSet -> Set	2023-10-01 00:09:05 -07:00
James R. Barlow	113a6b45bd	ruff autofixes (mostly typing.* -> collections.abc.*)	2023-10-01 00:02:53 -07:00
James R. Barlow	85e31d0a19	Tidy imports and line length	2023-09-25 01:04:01 -07:00
James R. Barlow	47b0f28564	Minor documentation and typing fixes	2023-09-25 00:19:48 -07:00
James R. Barlow	de2bb5ce8c	Remove tqdm dependency and TqdmConsole Might be too aggressive? No deprecation warning....	2023-09-21 00:05:39 -07:00
James R. Barlow	19045c4f21	Replace coloredlogs and tqdm with rich	2023-08-11 01:47:42 -07:00
James R. Barlow	e8ed510543	Add stop on soft render errors and option to override	2023-06-01 23:49:34 -07:00
James R. Barlow	a99e40fa84	Update notes about ocrmypdf.ocr usage	2023-04-14 11:54:58 -07:00
James R. Barlow	9ce692a6f1	ruff: further fixes	2023-04-14 02:39:36 -07:00
James R. Barlow	33b70be7d5	ruff: more fixes, mainly missing docstrings	2023-04-14 02:16:38 -07:00
James R. Barlow	4924b11b6b	Additional ruff fixes	2023-04-14 01:25:16 -07:00
James R. Barlow	9b8d14d16e	Accept most of ruff's delinting	2023-04-14 00:45:34 -07:00
James R. Barlow	4d2f499f97	Remove optional status of coloredlogs Everything optional is a possible complication. Better to remove the option.	2022-08-04 04:15:56 -07:00
James R. Barlow	53db866ef9	Remove deprecated exception PdfMergeFailedError	2022-08-04 03:54:55 -07:00


124 Commits