OCRmyPDF

mirror of https://github.com/ocrmypdf/OCRmyPDF.git synced 2026-02-07 21:03:59 -05:00

Author	SHA1	Message	Date
James R. Barlow	db9f94de14	Ensure Noto font is installed where needed	2026-01-20 19:50:47 -08:00
James R. Barlow	37e7131a01	Drop support for Python 3.10, require Python 3.11+ Python 3.11 is now the minimum supported version. This aligns with the codebase's use of StrEnum (introduced in 3.11) and removes compatibility shims that were only needed for older versions.	2026-01-20 11:54:55 -08:00
James R. Barlow	bc745d4d81	Replace magic Ghostscript raster device strings with StrEnum	2026-01-20 10:44:25 -08:00
James R. Barlow	c818ad5e75	Drop deprecated NeverRaise exception	2026-01-20 10:43:21 -08:00
James R. Barlow	4b16228a4a	docs: minor adjustments	2026-01-20 10:41:55 -08:00
James R. Barlow	d40fca2590	Add verapdf to build for macOS	2026-01-20 10:41:43 -08:00
James R. Barlow	99f8106936	Update API documentation for OcrOptions-first calling convention Document the new v17 API style where OcrOptions can be passed directly to ocr(). Mark the positional argument style as legacy API for <v17 compatibility. Update examples to use modern syntax.	2026-01-20 10:30:33 -08:00
James R. Barlow	ef88ba3f95	Add OcrOptions as first-class argument to ocr() function Allow passing an OcrOptions object directly to ocr() as the first positional argument, providing a cleaner API for programmatic use. The old-style API with individual parameters remains fully supported.	2026-01-20 10:20:52 -08:00
James R. Barlow	2f4280b66c	Comprehensive documentation update in preparation for v17	2026-01-16 01:38:47 -08:00
James R. Barlow	6cf9d1c6ee	Update release notes	2026-01-15 23:29:29 -08:00
James R. Barlow	6a7164a76c	Update release notes with branch changes	2026-01-15 23:25:51 -08:00
James R. Barlow	3f328785f0	Fix pypdfium rasterizer to match Ghostscript dimensions The pypdfium rasterizer was producing output images that differed by 1 pixel compared to Ghostscript due to floating-point precision issues in dimension calculations. Root cause: - pypdfium used harmonic mean of x/y DPI to calculate a single scale factor, losing the distinction between x and y DPI - No DPI rounding like Ghostscript's 6-decimal precision - Compound rounding errors when converting points to pixels Solution: 1. Round DPI to 6 decimals to match Ghostscript's precision 2. Calculate expected output dimensions using separate x/y DPI values 3. Handle dimension swapping for 90°/270° rotations 4. Resize output image if off by 1-2 pixels (graceful correction) This ensures pixel-perfect matching with Ghostscript while being minimally invasive and only resizing when necessary. Changes: - Modified _render_page_to_bitmap() to calculate expected dimensions - Modified _process_image_for_output() to correct small discrepancies - Updated rasterize_pdf_page() to pass dimensions through pipeline - Parametrized rotation tests to run with both rasterizers All 45 rotation tests now pass with both pypdfium and ghostscript. Fixes test_rotated_skew_timeout with pypdfium rasterizer.	2026-01-14 14:37:24 -08:00
James R. Barlow	5acf21651f	ruff lint and format	2026-01-13 01:50:57 -08:00
James R. Barlow	7bfe3ecd5b	Fix double-compression of already-deflated JPEGs Images with [FlateDecode, DCTDecode] filter chain were incorrectly being marked for additional FlateDecode compression, resulting in double-compressed data and invalid output PDFs. Add _already_flate_encoded() helper to check if an image already has FlateDecode in its filter chain, and skip such images in _find_deflatable_jpeg().	2026-01-13 01:41:59 -08:00
James R. Barlow	5371cc5e39	Update test to match new error message	2026-01-13 01:33:10 -08:00
James R. Barlow	4c7086c609	Replace typer with cyclopts CLI library in misc scripts Migrate watcher.py and pdf_text_diff.py from typer to cyclopts for CLI argument parsing. Update pyproject.toml to reflect the dependency change in the watcher optional feature.	2026-01-13 00:43:14 -08:00
James R. Barlow	bf76c8270c	Rationalize optional dependencies vs dependency groups Establish clear separation between user-facing optional dependencies and developer-only dependency groups: Optional Dependencies (user features): - watcher: File watching service for batch processing - webservice: Streamlit-based web UI - Installable via: uv sync --extra <name> or pip install ocrmypdf[name] Dependency Groups (developer tools): - test: Testing infrastructure (merged from test + extended_test) - docs: Documentation building tools - streamlit-dev: Enhanced Streamlit development tools - dev: General development tools (mypy, ipykernel) - Installable via: uv sync --group <name> (uv only, NOT pip) Breaking changes for developers: - pip install -e .[test] no longer works → use uv sync --group test - pip install -e .[docs] no longer works → use uv sync --group docs - pip install -e .[extended_test] removed → merged into test group No breaking changes for end users: - pip install ocrmypdf[watcher] still works - pip install ocrmypdf[webservice] still works Updated: - CI/CD workflows to use uv sync --group test - Docker images to exclude test dependencies - Documentation to recommend uv with pip as fallback - pyproject.toml with clear comments explaining both systems	2026-01-13 00:34:55 -08:00
James R. Barlow	740f67091c	Rename OCROptions to OcrOptions for consistency Technically OCROptions is more Pythonic but we have several pre-existing classes named OcrWhatever. Go with the local flow.	2026-01-12 23:37:54 -08:00
James R. Barlow	36dea181e6	Update cookbook: Replace --tesseract-timeout 0 with --ocr-engine none Update documentation examples to use the new --ocr-engine none option instead of the deprecated --tesseract-timeout 0 idiom for disabling OCR.	2026-01-12 23:28:14 -08:00
James R. Barlow	c69f293322	Add --mode/-m CLI argument with ProcessingMode enum Introduce a new --mode (-m) argument that consolidates the three mutually exclusive OCR processing options into a single enum: - default: Error if text is found (standard behavior) - force: Rasterize all content and run OCR (replaces --force-ocr) - skip: Skip pages with existing text (replaces --skip-text) - redo: Re-OCR pages, stripping old text layer (replaces --redo-ocr) The legacy flags --force-ocr, --skip-text, and --redo-ocr remain as silent aliases for backward compatibility. Both CLI and API usage continue to work unchanged.	2026-01-12 15:23:08 -08:00
James R. Barlow	e9fe061c30	Format fix	2026-01-12 10:25:24 -08:00
James R. Barlow	c9ea07e954	Reduce chattiness of fonttools	2026-01-12 10:16:58 -08:00
James R. Barlow	0c3745a1a4	Add OCR engine selection framework and null OCR engine Introduce --ocr-engine option to select between OCR engines: - 'auto' (default): Uses Tesseract - 'tesseract': Explicit Tesseract selection - 'none': Skip OCR entirely (for PDF processing only) Key changes: - Extend OcrEngine ABC with generate_ocr() and supports_generate_ocr() for direct OcrElement tree output (bypasses hOCR) - Add get_ocr_engine(options) hook parameter for engine selection - Implement NullOcrEngine for --ocr-engine none - Export OcrElement, OcrClass, BoundingBox from ocrmypdf package - Add ocr_tree support to grafting pipeline This prepares the foundation for pluggable OCR engines while maintaining full backward compatibility with existing Tesseract-based workflows.	2026-01-12 10:11:14 -08:00
James R. Barlow	664c3e2a8e	Update test cache for slow rotation tests	2026-01-10 16:30:25 -08:00
James R. Barlow	315d0df0e9	Fix incorrect rotation direction in pypdfium rasterizer pypdfium2 expects clockwise rotation values, but OCRmyPDF tracks rotation counter-clockwise. Negate the rotation value to fix. Also refactor nested try/finally blocks to use contextlib.closing() for cleaner resource management.	2026-01-10 16:29:49 -08:00
James R. Barlow	3c94ada857	Fix tesseract_cache plugin to properly handle cache misses - Check all required output files exist before declaring cache hit, not just stderr.bin - Add 'hocr' to list of cached output file types - Fix timeout=0.0 causing immediate timeout on cache miss by treating it as "no timeout"	2026-01-09 02:10:29 -08:00
James R. Barlow	fcbdbac602	Update test_page_boxes MediaBox expectations for speculative PDF/A When speculative PDF/A succeeds (verapdf available), Ghostscript is bypassed and MediaBox is preserved rather than normalized to origin.	2026-01-09 01:25:31 -08:00
James R. Barlow	122450c19e	Fix Ghostscript tests after default output type changed to 'auto' - Add --output-type pdfa to tests that exercise Ghostscript-specific behavior (test_gs_render_failure, test_ghostscript_pdfa_failure, test_ghostscript_mandatory_color_conversion) - Add Gs106WarningFilter to suppress expected Ghostscript 10.6.x JPEG encoding warning in test logs	2026-01-09 01:02:25 -08:00
James R. Barlow	0c4ee5af4e	Add 'auto' output type for best-effort PDF/A without Ghostscript - Add new '--output-type auto' option (now the default) that produces best-effort PDF/A without requiring Ghostscript - When verapdf is available, use speculative PDF/A conversion - Without verapdf, pass through as PDF/A if safe (input claims PDF/A or --force-ocr was used), otherwise output as regular PDF - Make Ghostscript check conditional - only required for pdfa* output types - Update soft error tests to explicitly use --output-type pdfa since they exercise Ghostscript failure modes - Fix Tesseract OSD error handling to check both stdout and stderr for known non-fatal messages like "Too few characters"	2026-01-09 00:56:00 -08:00
James R. Barlow	bdc50e9470	Add explicit word spacing for pdfminer.six compatibility Insert space characters between words in the fpdf2 renderer so PDF readers like pdfminer.six can properly segment words during text extraction. Some PDF readers rely on explicit space characters rather than inferring word boundaries from positioning. - Use itertools.pairwise to iterate consecutive word pairs - Render space immediately after each word (content stream order matters) - Skip space insertion between CJK words (no spaces in CJK text) - Use 5% line height threshold to filter OCR noise - Support RTL text direction	2026-01-08 16:32:14 -08:00
James R. Barlow	4cb488d0fc	Skip speculative PDF/A when --pdfa-image-compression is set When the user explicitly sets --pdfa-image-compression to something other than 'auto', skip the speculative PDF/A conversion and use Ghostscript instead. The speculative conversion (using pikepdf + verapdf) doesn't apply image compression settings, so Ghostscript is required to honor the user's compression preference.	2026-01-08 15:12:35 -08:00
James R. Barlow	bb5238e524	Update tests to use new OcrmypdfPluginManager interface Replace pm.hook.method() calls with pm.method() calls to match the refactored plugin manager that now uses composition over inheritance. The hook attribute is no longer directly exposed; instead, type-safe methods are provided directly on the plugin manager class.	2026-01-08 13:09:19 -08:00
James R. Barlow	900a60fd10	Add verapdf integration for speculative PDF/A conversion Introduce a fast path for PDF/A conversion that uses pikepdf to add PDF/A structures directly (sRGB ICC profile and XMP metadata), then validates with verapdf. If validation passes, skip Ghostscript entirely. If validation fails or verapdf is unavailable, fall back to the existing Ghostscript conversion path. New files: - src/ocrmypdf/_exec/verapdf.py: CLI wrapper for verapdf validator - tests/test_verapdf.py: Test suite for new functionality Modified: - pdfa.py: Add speculative_pdfa_conversion() and helpers - _pipeline.py: Add try_speculative_pdfa() function - _pipelines/_common.py: Integrate speculative path into postprocess()	2026-01-08 10:58:01 -08:00
James R. Barlow	f5617ce44e	Refactor OcrmypdfPluginManager to use composition over inheritance Replace inheritance from pluggy.PluginManager with composition pattern, providing a type-safe interface for all 16 hooks defined in pluginspec.py. The underlying pluggy manager is now accessible via the .pluggy property for advanced use cases like set_blocked(). This change enables IDE autocomplete and type checking for all hook calls while maintaining full backward compatibility with the plugin system.	2026-01-07 17:23:13 -08:00
James R. Barlow	0e946a7498	Clarify message about number of workers	2026-01-07 16:41:18 -08:00
James R. Barlow	b2b6a7c4b1	Pass OMP_THREAD_LIMIT to Tesseract subprocesses instead of modifying parent env Instead of setting OMP_THREAD_LIMIT in the parent process's environment, calculate the thread limit in the validate hook and pass it through to Tesseract subprocess calls via the env parameter. This avoids polluting the parent process's environment while still controlling Tesseract's thread usage.	2026-01-06 18:43:29 -08:00
James R. Barlow	75c664793e	Don't share claude	2026-01-06 15:42:51 -08:00
James R. Barlow	bbd263ff48	Add tests for fpdf2 renderer and font infrastructure - Add hOCR test fixtures for Latin, Arabic, CJK, Devanagari scripts - Add tests for fpdf2 renderer, multi-font manager, system font provider - Add multilingual rendering tests - Update existing tests to use fpdf2 renderer	2026-01-06 13:46:11 -08:00
James R. Barlow	7a4b98974c	Integrate fpdf2 renderer and remove legacy hOCR renderer - Update pipeline to use fpdf2 renderer as default - Remove legacy hocrtransform PDF renderer (_font.py, _hocr.py, pdf_renderer.py) - Update CLI and options for fpdf2 renderer - Add fpdf2 dependency to pyproject.toml - Update graft module for fpdf2 multi-page rendering	2026-01-06 13:45:44 -08:00
James R. Barlow	d72a494979	Add fpdf2-based PDF text layer renderer Implement new PDF renderer using fpdf2 library that provides: - Multilingual text support via font module - Proper baseline and rotation handling - Multi-page rendering with efficient font embedding - Invisible but selectable text layer	2026-01-06 13:45:14 -08:00
James R. Barlow	64726f97b3	Add font infrastructure and glyphless font - Add font module with FontManager, FontProvider, MultiFontManager, and SystemFontProvider for multilingual font support - Add NotoSans-Regular.ttf for Latin text rendering - Replace pdf.ttf with Occulta.ttf glyphless font - Add script to generate new Occulta glyphless font - System font discovery for CJK, Arabic, Devanagari scripts	2026-01-06 13:44:54 -08:00
James R. Barlow	83a43408c2	Refactor tesseract thresholding to use enum type Replace integer-based thresholding parameter with ThresholdingMethod enum for improved type safety. The CLI still accepts the same string values (auto, otsu, adaptive-otsu, sauvola) but internally uses a strongly-typed enum. This makes the code more maintainable and catches type errors at development time.	2025-12-27 13:32:56 -08:00
James R. Barlow	2cb0973540	Improve Ghostscript API/CLI definitions	2025-12-27 01:40:12 -08:00
James R. Barlow	0d6e0c4560	Merge branch 'main' into dev	2025-12-24 00:44:18 -08:00
James R. Barlow	94d7735862	docs: missing issue ref	2025-12-24 00:14:24 -08:00
James R. Barlow	c540967429	docs: Update release notes v16.13.0	2025-12-23 15:44:44 -08:00
James R. Barlow	195344d307	Reinstate "Work around Ghostscript 10.6.0 JPEG encoding issue by forcing optimization." This reverts commit `fc30cb8903`. It turns out that both fixes were necessary.	2025-12-23 15:41:34 -08:00
James R. Barlow	de63d6eac9	Merge remote-tracking branches 'origin/dependabot/github_actions/actions/download-artifact-7', 'origin/dependabot/github_actions/actions/upload-artifact-6', 'origin/dependabot/github_actions/sigstore/gh-action-sigstore-python-3.2.0' and 'origin/dependabot/github_actions/actions/checkout-6'	2025-12-23 15:06:50 -08:00
James R. Barlow	6ada11ddae	docs: Update release notes	2025-12-23 15:05:49 -08:00
James R. Barlow	fc30cb8903	Revert "Work around Ghostscript 10.6.0 JPEG encoding issue by forcing optimization." This reverts commit `f4c6c8121b`. The issue is now resolved by correcting the encoding issue directly.	2025-12-23 15:03:51 -08:00
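
Note on commit b2b6a7c4b1 (OMP_THREAD_LIMIT): the general technique it describes, setting an environment variable only for a child process rather than for the parent, can be sketched as below. This is a minimal illustration and not OCRmyPDF's actual code. The helper name `run_with_thread_limit` is hypothetical, and a Python one-liner stands in for the Tesseract subprocess so the example runs without Tesseract installed.

```python
import os
import subprocess
import sys


def run_with_thread_limit(args, thread_limit):
    """Run a child process with OMP_THREAD_LIMIT set for the child only.

    Hypothetical helper for illustration; not part of OCRmyPDF.
    """
    # Copy the parent's environment so the child still sees PATH and friends,
    # then add the limit to the copy. os.environ itself is never modified.
    child_env = os.environ.copy()
    child_env["OMP_THREAD_LIMIT"] = str(thread_limit)
    return subprocess.run(
        args, env=child_env, capture_output=True, text=True, check=True
    )


# Remember what the parent had before the call, to show it is left untouched.
parent_before = os.environ.get("OMP_THREAD_LIMIT")

# Stand-in child: a Python one-liner that prints the variable it received.
child = [sys.executable, "-c", "import os; print(os.environ['OMP_THREAD_LIMIT'])"]
result = run_with_thread_limit(child, 2)
child_saw = result.stdout.strip()
parent_after = os.environ.get("OMP_THREAD_LIMIT")
```

Because the limit lives in a copy of the environment, other code running in the parent process (or other subprocesses it launches) is unaffected by the setting.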

4223 Commits