OCRmyPDF

mirror of https://github.com/ocrmypdf/OCRmyPDF.git synced 2026-02-07 21:03:59 -05:00

Author	SHA1	Message	Date
James R. Barlow	37e7131a01	Drop support for Python 3.10, require Python 3.11+ Python 3.11 is now the minimum supported version. This aligns with the codebase's use of StrEnum (introduced in 3.11) and removes compatibility shims that were only needed for older versions.	2026-01-20 11:54:55 -08:00
James R. Barlow	bc745d4d81	Replace magic Ghostscript raster device strings with StrEnum	2026-01-20 10:44:25 -08:00
James R. Barlow	c818ad5e75	Drop deprecated NeverRaise exception	2026-01-20 10:43:21 -08:00
James R. Barlow	ef88ba3f95	Add OcrOptions as first-class argument to ocr() function Allow passing an OcrOptions object directly to ocr() as the first positional argument, providing a cleaner API for programmatic use. The old-style API with individual parameters remains fully supported.	2026-01-20 10:20:52 -08:00
James R. Barlow	3f328785f0	Fix pypdfium rasterizer to match Ghostscript dimensions The pypdfium rasterizer was producing output images that differed by 1 pixel compared to Ghostscript due to floating-point precision issues in dimension calculations. Root cause: - pypdfium used harmonic mean of x/y DPI to calculate a single scale factor, losing the distinction between x and y DPI - No DPI rounding like Ghostscript's 6-decimal precision - Compound rounding errors when converting points to pixels Solution: 1. Round DPI to 6 decimals to match Ghostscript's precision 2. Calculate expected output dimensions using separate x/y DPI values 3. Handle dimension swapping for 90°/270° rotations 4. Resize output image if off by 1-2 pixels (graceful correction) This ensures pixel-perfect matching with Ghostscript while being minimally invasive and only resizing when necessary. Changes: - Modified _render_page_to_bitmap() to calculate expected dimensions - Modified _process_image_for_output() to correct small discrepancies - Updated rasterize_pdf_page() to pass dimensions through pipeline - Parametrized rotation tests to run with both rasterizers All 45 rotation tests now pass with both pypdfium and ghostscript. Fixes test_rotated_skew_timeout with pypdfium rasterizer.	2026-01-14 14:37:24 -08:00
James R. Barlow	5acf21651f	ruff lint and format	2026-01-13 01:50:57 -08:00
James R. Barlow	7bfe3ecd5b	Fix double-compression of already-deflated JPEGs Images with [FlateDecode, DCTDecode] filter chain were incorrectly being marked for additional FlateDecode compression, resulting in double-compressed data and invalid output PDFs. Add _already_flate_encoded() helper to check if an image already has FlateDecode in its filter chain, and skip such images in _find_deflatable_jpeg().	2026-01-13 01:41:59 -08:00
James R. Barlow	740f67091c	Rename OCROptions to OcrOptions for consistency Technically OCROptions is more Pythonic but we have several pre-existing classes named OcrWhatever. Go with the local flow.	2026-01-12 23:37:54 -08:00
James R. Barlow	c69f293322	Add --mode/-m CLI argument with ProcessingMode enum Introduce a new --mode (-m) argument that consolidates the three mutually exclusive OCR processing options into a single enum: - default: Error if text is found (standard behavior) - force: Rasterize all content and run OCR (replaces --force-ocr) - skip: Skip pages with existing text (replaces --skip-text) - redo: Re-OCR pages, stripping old text layer (replaces --redo-ocr) The legacy flags --force-ocr, --skip-text, and --redo-ocr remain as silent aliases for backward compatibility. Both CLI and API usage continue to work unchanged.	2026-01-12 15:23:08 -08:00
James R. Barlow	c9ea07e954	Reduce chattiness of fonttools	2026-01-12 10:16:58 -08:00
James R. Barlow	0c3745a1a4	Add OCR engine selection framework and null OCR engine Introduce --ocr-engine option to select between OCR engines: - 'auto' (default): Uses Tesseract - 'tesseract': Explicit Tesseract selection - 'none': Skip OCR entirely (for PDF processing only) Key changes: - Extend OcrEngine ABC with generate_ocr() and supports_generate_ocr() for direct OcrElement tree output (bypasses hOCR) - Add get_ocr_engine(options) hook parameter for engine selection - Implement NullOcrEngine for --ocr-engine none - Export OcrElement, OcrClass, BoundingBox from ocrmypdf package - Add ocr_tree support to grafting pipeline This prepares the foundation for pluggable OCR engines while maintaining full backward compatibility with existing Tesseract-based workflows.	2026-01-12 10:11:14 -08:00
James R. Barlow	315d0df0e9	Fix incorrect rotation direction in pypdfium rasterizer pypdfium2 expects clockwise rotation values, but OCRmyPDF tracks rotation in counter-clockwise. Negate the rotation value to fix. Also refactor nested try/finally blocks to use contextlib.closing() for cleaner resource management.	2026-01-10 16:29:49 -08:00
James R. Barlow	0c4ee5af4e	Add 'auto' output type for best-effort PDF/A without Ghostscript - Add new '--output-type auto' option (now the default) that produces best-effort PDF/A without requiring Ghostscript - When verapdf is available, use speculative PDF/A conversion - Without verapdf, pass through as PDF/A if safe (input claims PDF/A or --force-ocr was used), otherwise output as regular PDF - Make Ghostscript check conditional - only required for pdfa* output types - Update soft error tests to explicitly use --output-type pdfa since they exercise Ghostscript failure modes - Fix Tesseract OSD error handling to check both stdout and stderr for known non-fatal messages like "Too few characters"	2026-01-09 00:56:00 -08:00
James R. Barlow	bdc50e9470	Add explicit word spacing for pdfminer.six compatibility Insert space characters between words in the fpdf2 renderer so PDF readers like pdfminer.six can properly segment words during text extraction. Some PDF readers rely on explicit space characters rather than inferring word boundaries from positioning. - Use itertools.pairwise to iterate consecutive word pairs - Render space immediately after each word (content stream order matters) - Skip space insertion between CJK words (no spaces in CJK text) - Use 5% line height threshold to filter OCR noise - Support RTL text direction	2026-01-08 16:32:14 -08:00
James R. Barlow	4cb488d0fc	Skip speculative PDF/A when --pdfa-image-compression is set When the user explicitly sets --pdfa-image-compression to something other than 'auto', skip the speculative PDF/A conversion and use Ghostscript instead. The speculative conversion (using pikepdf + verapdf) doesn't apply image compression settings, so Ghostscript is required to honor the user's compression preference.	2026-01-08 15:12:35 -08:00
James R. Barlow	900a60fd10	Add verapdf integration for speculative PDF/A conversion Introduce a fast path for PDF/A conversion that uses pikepdf to add PDF/A structures directly (sRGB ICC profile and XMP metadata), then validates with verapdf. If validation passes, skip Ghostscript entirely. If validation fails or verapdf is unavailable, fall back to the existing Ghostscript conversion path. New files: - src/ocrmypdf/_exec/verapdf.py: CLI wrapper for verapdf validator - tests/test_verapdf.py: Test suite for new functionality Modified: - pdfa.py: Add speculative_pdfa_conversion() and helpers - _pipeline.py: Add try_speculative_pdfa() function - _pipelines/_common.py: Integrate speculative path into postprocess()	2026-01-08 10:58:01 -08:00
James R. Barlow	f5617ce44e	Refactor OcrmypdfPluginManager to use composition over inheritance Replace inheritance from pluggy.PluginManager with composition pattern, providing a type-safe interface for all 16 hooks defined in pluginspec.py. The underlying pluggy manager is now accessible via the .pluggy property for advanced use cases like set_blocked(). This change enables IDE autocomplete and type checking for all hook calls while maintaining full backward compatibility with the plugin system.	2026-01-07 17:23:13 -08:00
James R. Barlow	0e946a7498	Clarify message about number of workers	2026-01-07 16:41:18 -08:00
James R. Barlow	b2b6a7c4b1	Pass OMP_THREAD_LIMIT to Tesseract subprocesses instead of modifying parent env Instead of setting OMP_THREAD_LIMIT in the parent process's environment, calculate the thread limit in the validate hook and pass it through to Tesseract subprocess calls via the env parameter. This avoids polluting the parent process's environment while still controlling Tesseract's thread usage.	2026-01-06 18:43:29 -08:00
James R. Barlow	7a4b98974c	Integrate fpdf2 renderer and remove legacy hOCR renderer - Update pipeline to use fpdf2 renderer as default - Remove legacy hocrtransform PDF renderer (_font.py, _hocr.py, pdf_renderer.py) - Update CLI and options for fpdf2 renderer - Add fpdf2 dependency to pyproject.toml - Update graft module for fpdf2 multi-page rendering	2026-01-06 13:45:44 -08:00
James R. Barlow	d72a494979	Add fpdf2-based PDF text layer renderer Implement new PDF renderer using fpdf2 library that provides: - Multilingual text support via font module - Proper baseline and rotation handling - Multi-page rendering with efficient font embedding - Invisible but selectable text layer	2026-01-06 13:45:14 -08:00
James R. Barlow	64726f97b3	Add font infrastructure and glyphless font - Add font module with FontManager, FontProvider, MultiFontManager, and SystemFontProvider for multilingual font support - Add NotoSans-Regular.ttf for Latin text rendering - Replace pdf.ttf with Occulta.ttf glyphless font - Add script to generate new Occulta glyphless font - System font discovery for CJK, Arabic, Devanagari scripts	2026-01-06 13:44:54 -08:00
James R. Barlow	83a43408c2	Refactor tesseract thresholding to use enum type Replace integer-based thresholding parameter with ThresholdingMethod enum for improved type safety. The CLI still accepts the same string values (auto, otsu, adaptive-otsu, sauvola) but internally uses a strongly-typed enum. This makes the code more maintainable and catches type errors at development time.	2025-12-27 13:32:56 -08:00
James R. Barlow	2cb0973540	Improve Ghostscript API/CLI definitions	2025-12-27 01:40:12 -08:00
James R. Barlow	0d6e0c4560	Merge branch 'main' into dev	2025-12-24 00:44:18 -08:00
James R. Barlow	195344d307	Reinstate "Work around Ghostscript 10.6.0 JPEG encoding issue by forcing optimization." This reverts commit `fc30cb8903`. It turns out that both fixes were necessary.	2025-12-23 15:41:34 -08:00
James R. Barlow	fc30cb8903	Revert "Work around Ghostscript 10.6.0 JPEG encoding issue by forcing optimization." This reverts commit `f4c6c8121b`. The issue is now resolved by correcting the encoding issue directly.	2025-12-23 15:03:51 -08:00
James R. Barlow	e613db6a82	Fix Ghostscript 10.6 JPEG corruption by repairing truncated images Ghostscript 10.6 has a bug that truncates JPEG data by 1-15 bytes. This adds detection and repair by comparing output images to input images and restoring the original bytes when truncation is detected. - Add warning when GS 10.6+ is used with PDF/A output - Add _repair_gs106_jpeg_corruption() to fix damaged JPEGs after Ghostscript processing - Add unit tests for the repair function	2025-12-23 14:56:24 -08:00
James R. Barlow	4c1ef0b471	Also process art and bleed boxes	2025-12-23 11:20:41 -08:00
James R. Barlow	eace567f7b	Test and fix page box issues	2025-12-23 11:19:51 -08:00
James R. Barlow	e9bfce34f1	Fix ruff linting issues - Use X \| Y syntax in isinstance calls (UP038) - Remove trailing whitespace from blank lines (W293)	2025-12-23 03:07:48 -08:00
James R. Barlow	16c2604a07	Remove lossy JBIG2 support, retain lossless JBIG2 only Lossy JBIG2 has been removed due to well-documented risks of character substitution errors (e.g., 6/8 confusion). The --jbig2-lossy and --jbig2-page-group-size arguments are now deprecated and ignored with a warning. Changes: - Remove jbig2_lossy and jbig2_page_group_size from OCROptions - Simplify optimize.py to use single-image JBIG2 encoding only (no symbol dictionaries/JBIG2Globals) - Remove convert_group() from jbig2enc.py - Deprecate CLI args with warnings for backward compatibility - Update documentation to explain lossless-only JBIG2	2025-12-23 02:45:07 -08:00
James R. Barlow	9ebba91466	Use plugin namespace access pattern throughout codebase Migrate all code from flat accessor pattern (options.tesseract_timeout) to the plugin namespace pattern (options.tesseract.timeout). Key changes: - Fix _get_plugin_options to raise AttributeError for unregistered namespaces instead of silently returning None - Add _convert_value helper to convert PathLike to str for plugin model field compatibility - Filter out _plugin_cache_* entries from JSON serialization to fix worker process serialization (test_simulate_oom_killer) - Update tesseract_ocr.py, ghostscript.py, _validation_coordinator.py, and _pipelines/ocr.py to use options.tesseract.* and options.ghostscript.* accessors - Update tests to use setup_plugin_infrastructure() for plugin model registration	2025-12-23 02:02:21 -08:00
James R. Barlow	aec995aced	Require plugin model registration for namespace access in OCROptions - Update __getattr__ docstring to clarify that plugin models must be registered for namespace access (e.g., options.tesseract.timeout) - Update test_json_serialization.py to properly register TesseractOptions before accessing plugin namespaces - Worker processes now register plugin models for multiprocessing tests - Exclude plugin cache keys from extra_attrs comparison in tests	2025-12-22 15:09:55 -08:00
James R. Barlow	be425e7405	Refactor pdfinfo: split info.py into focused modules Split the 1288-line info.py into smaller, single-responsibility modules: - _types.py: Enums, type aliases, lookup dictionaries - _contentstream.py: PDF content stream parsing, DPI calculation - _image.py: ImageInfo class and image finding functions - _worker.py: Concurrency/worker process handling - info.py: PageInfo, PdfInfo classes (reduced to ~530 lines) Public API unchanged - all existing imports continue to work.	2025-12-22 01:27:23 -08:00
James R. Barlow	b4f9673364	Add unit tests for HocrParser, PdfTextRenderer, and OcrElement Comprehensive test coverage for the new hocrtransform components: - test_ocr_element.py: Tests for BoundingBox, Baseline, FontInfo, OcrElement dataclass methods (iter_by_class, find_by_class, get_text_recursive, words/lines/paragraphs properties) - test_hocr_parser.py: Tests for parsing hOCR files including page/paragraph/line/word extraction, RTL text, rotated text, different line types (header, caption), font info, and edge cases - test_pdf_renderer.py: Tests for PDF rendering including text extraction verification, page sizing, multi-line content, text direction, baseline handling, textangle rotation, word breaks, debug options, and image overlay Also fixes x_font regex pattern to not capture trailing semicolons. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2025-12-21 17:05:49 -08:00
James R. Barlow	9ea804aff5	Refactor hocrtransform: separate parsing from rendering Split the hOCR transformation code into three distinct layers: 1. ocr_element.py - Generic OcrElement dataclass that represents OCR output structure from any source (hOCR, ALTO, custom engines). Includes helper classes: BoundingBox, Baseline, FontInfo. 2. hocr_parser.py - HocrParser class that parses hOCR XML files into OcrElement trees, extracting bbox, baseline, textangle, confidence, font info, direction, and language. 3. pdf_renderer.py - PdfTextRenderer class that renders OcrElement trees to PDF text layers, handling text positioning, baseline rotation, LTR/RTL, and word break injection. The existing HocrTransform class is preserved for backward compatibility, now delegating to the new components internally. This separation enables: - Support for non-hOCR OCR output formats - Independent improvements to text rendering - Reuse of OcrElement for other purposes 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2025-12-21 16:17:22 -08:00
James R. Barlow	57e2600566	Also process art and bleed boxes	2025-12-21 14:03:28 -08:00
James R. Barlow	41758766a1	Test and fix page box issues	2025-12-21 14:03:28 -08:00
James R. Barlow	3e46b039ed	feat: add use_cropbox parameter to align rasterizer APIs Added use_cropbox parameter to rasterize_pdf_page hook to allow choosing between MediaBox and CropBox rendering: - Default is use_cropbox=False (MediaBox) for consistency with Ghostscript's existing behavior - Ghostscript: passes -dUseCropBox when use_cropbox=True - pypdfium: calculates crop values to expand from CropBox to MediaBox when use_cropbox=False This aligns both rasterizers to produce the same output dimensions by default, making the rasterizer choice transparent for page geometry. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2025-12-21 12:29:17 -08:00
James R. Barlow	ae783b4ae6	fix: add thread safety lock to pypdfium plugin pypdfium2/PDFium is not thread-safe - concurrent calls from different threads can crash or corrupt the process. Added a module-level lock to serialize all pdfium operations. PIL image processing and file I/O are done outside the lock since they are thread-safe, minimizing lock contention. For maximum parallelism, users can use process-based parallelism (use_threads=False) where each process has its own pdfium instance. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2025-12-21 12:29:17 -08:00
James R. Barlow	b9f488d65c	test: add comprehensive tests for --rasterizer option Add test_rasterizer.py with tests covering: - Basic rasterizer option validation ('auto', 'ghostscript', 'pypdfium') - Rasterizer + --rotate-pages interaction - PDFs with nonstandard MediaBox/TrimBox/CropBox - Direct hook tests verifying plugins respect the option Also fix pluggy parameter passing: make 'options' a required parameter (no default) in the hookspec so pluggy forwards it to implementations. Update test plugins and test_rotation.py to pass the new parameter. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2025-12-21 12:29:17 -08:00
James R. Barlow	ed813cec67	feat: add --rasterizer CLI option to select PDF rasterization backend Add user control over which rasterizer is used for PDF page rendering: - 'auto' (default): prefers pypdfium when available, falls back to Ghostscript - 'pypdfium': force pypdfium2 (errors if not installed) - 'ghostscript': force traditional Ghostscript rasterizer Changes: - Add rasterizer field with validation to OCROptions model - Add --rasterizer CLI argument in the Advanced options group - Update rasterize_pdf_page hookspec to pass options to plugins - Update pypdfium plugin with check_options hook for availability check - Update both plugins to respect the rasterizer option 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2025-12-21 12:29:17 -08:00
James R. Barlow	938ce8e285	fix: make pypdfium plugin optional with Ghostscript fallback - Remove check_options hook from pypdfium that raised error when pypdfium2 wasn't installed - Return None from pypdfium's rasterize_pdf_page when pypdfium2 is unavailable, allowing the hook to fall through - Restore Ghostscript's rasterize_pdf_page hook as fallback This allows OCRmyPDF to work without pypdfium2 installed, using Ghostscript for rasterization as before. 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2025-12-21 12:29:17 -08:00
James R. Barlow	cf3fb6e89b	Fix raster_device settings for pypdfium rasterizer	2025-12-21 12:29:17 -08:00
James R. Barlow	3482ea5fe5	refactor: Modularize `rasterize_pdf_page` into separate PDF, page, and image processing functions Co-authored-by: aider (anthropic/claude-sonnet-4-20250514) <aider@aider.chat>	2025-12-21 12:29:17 -08:00
James R. Barlow	e85c5bbb4d	refactor: Simplify error message and code formatting in pypdfium plugin	2025-12-21 12:29:17 -08:00
James R. Barlow	740b0bddc6	feat: add pypdfium2 rasterization plugin for OCRmyPDF Co-authored-by: aider (anthropic/claude-sonnet-4-20250514) <aider@aider.chat>	2025-12-21 12:29:17 -08:00
James R. Barlow	a4ee513cd4	refactor: clean up deprecated code and update plugin docs - Remove outdated Phase comments from _options.py and cli.py - Remove unused methods from PluginOptionRegistry: - get_extended_options_model() - replaced by __getattr__ in OCROptions - map_legacy_options() - unused - validate_plugin_options() - unused - Update plugin documentation to document register_options hook - Add documentation for nested plugin option access pattern 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2025-12-21 12:21:48 -08:00
James R. Barlow	0ad7f5fc13	feat: add dynamic nested access to plugin options Completes Phase 5 of the CLI refactoring plan by enabling nested plugin option access (e.g., options.tesseract.timeout) alongside the legacy flat access (options.tesseract_timeout). Changes: - Add module-level plugin option model registry in _options.py - Add __getattr__ to OCROptions for dynamic namespace access - Register plugin models in setup_plugin_infrastructure() - Add test for nested plugin option access Plugin option instances are lazily created from flat field values and cached for subsequent access. 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2025-12-21 12:21:48 -08:00
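
Commit 3f328785f0 in the list above describes computing the expected raster size from separate x/y DPI values, rounding DPI to 6 decimals, swapping dimensions for 90°/270° rotations, and correcting output that is off by 1-2 pixels. The following is a minimal sketch of that idea, not the project's actual code: the function names `expected_pixel_size` and `needs_correction`, the use of `round()` for the point-to-pixel conversion, and the point at which rotation is applied are assumptions made for illustration.

```python
POINTS_PER_INCH = 72.0


def expected_pixel_size(width_pt, height_pt, xdpi, ydpi, rotation=0):
    """Return (width_px, height_px) expected for a page measured in points.

    Hypothetical helper: keeps x and y DPI separate instead of collapsing
    them into a single scale factor.
    """
    # Round DPI to 6 decimals, as the commit message says Ghostscript does.
    xdpi = round(xdpi, 6)
    ydpi = round(ydpi, 6)

    # Convert each axis independently (rounding mode is an assumption).
    width_px = round(width_pt * xdpi / POINTS_PER_INCH)
    height_px = round(height_pt * ydpi / POINTS_PER_INCH)

    # A quarter-turn rotation swaps the output width and height.
    if rotation % 360 in (90, 270):
        width_px, height_px = height_px, width_px
    return width_px, height_px


def needs_correction(actual, expected, tolerance=2):
    """True if the sizes differ, but by no more than `tolerance` pixels per axis."""
    dw = abs(actual[0] - expected[0])
    dh = abs(actual[1] - expected[1])
    return (dw, dh) != (0, 0) and dw <= tolerance and dh <= tolerance


if __name__ == "__main__":
    # US Letter (612 x 792 pt) rendered at 300 DPI on both axes.
    size = expected_pixel_size(612, 792, 300, 300)
    print(size, needs_correction((2551, 3300), size))
```

Limiting the correction to a small tolerance mirrors the commit's stated intent of resizing only when necessary; a larger mismatch would more likely indicate a real geometry bug than a rounding difference.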

1540 Commits