Commit Graph

4223 Commits

Author SHA1 Message Date
James R. Barlow
db9f94de14 Ensure Noto font is installed where needed 2026-01-20 19:50:47 -08:00
James R. Barlow
37e7131a01 Drop support for Python 3.10, require Python 3.11+
Python 3.11 is now the minimum supported version. This aligns with
the codebase's use of StrEnum (introduced in 3.11) and removes
compatibility shims that were only needed for older versions.
2026-01-20 11:54:55 -08:00
James R. Barlow
bc745d4d81 Replace magic Ghostscript raster device strings with StrEnum 2026-01-20 10:44:25 -08:00
James R. Barlow
c818ad5e75 Drop deprecated NeverRaise exception 2026-01-20 10:43:21 -08:00
James R. Barlow
4b16228a4a docs: minor adjustments 2026-01-20 10:41:55 -08:00
James R. Barlow
d40fca2590 Add verapdf to build for macOS 2026-01-20 10:41:43 -08:00
James R. Barlow
99f8106936 Update API documentation for OcrOptions-first calling convention
Document the new v17 API style where OcrOptions can be passed directly
to ocr(). Mark the positional argument style as legacy API for <v17
compatibility. Update examples to use modern syntax.
2026-01-20 10:30:33 -08:00
James R. Barlow
ef88ba3f95 Add OcrOptions as first-class argument to ocr() function
Allow passing an OcrOptions object directly to ocr() as the first
positional argument, providing a cleaner API for programmatic use.
The old-style API with individual parameters remains fully supported.
2026-01-20 10:20:52 -08:00
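A minimal sketch of the dual calling convention described in ef88ba3f95. `ToyOptions` and `toy_ocr` are hypothetical stand-ins, not OCRmyPDF's actual `OcrOptions` class or `ocr()` signature. The sketch only illustrates how one entry point might accept either an options object or the older individual parameters.

```python
from dataclasses import dataclass


@dataclass
class ToyOptions:
    """Hypothetical options object; the field names are assumptions."""

    input_file: str
    output_file: str
    language: str = "eng"


def toy_ocr(input_file_or_options, output_file=None, **kwargs):
    """Normalize both calling styles into a single ToyOptions instance."""
    if isinstance(input_file_or_options, ToyOptions):
        # New style: the options object is the first positional argument.
        if output_file is not None or kwargs:
            raise TypeError("pass either an options object or individual arguments")
        return input_file_or_options
    # Old style: individual parameters are collected into an options object.
    if output_file is None:
        raise TypeError("output_file is required in the individual-argument style")
    return ToyOptions(input_file_or_options, output_file, **kwargs)
```

Normalizing early means the rest of a pipeline only has to deal with one representation, whichever style the caller used.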
James R. Barlow
2f4280b66c Comprehensive documentation update in preparation for v17 2026-01-16 01:38:47 -08:00
James R. Barlow
6cf9d1c6ee Update release notes 2026-01-15 23:29:29 -08:00
James R. Barlow
6a7164a76c Update release notes with branch changes 2026-01-15 23:25:51 -08:00
James R. Barlow
3f328785f0 Fix pypdfium rasterizer to match Ghostscript dimensions
The pypdfium rasterizer was producing output images that differed by 1
pixel compared to Ghostscript due to floating-point precision issues in
dimension calculations.

Root cause:
- pypdfium used harmonic mean of x/y DPI to calculate a single scale
  factor, losing the distinction between x and y DPI
- No DPI rounding like Ghostscript's 6-decimal precision
- Compound rounding errors when converting points to pixels

Solution:
1. Round DPI to 6 decimals to match Ghostscript's precision
2. Calculate expected output dimensions using separate x/y DPI values
3. Handle dimension swapping for 90°/270° rotations
4. Resize output image if off by 1-2 pixels (graceful correction)

This ensures pixel-perfect matching with Ghostscript while being
minimally invasive and only resizing when necessary.

Changes:
- Modified _render_page_to_bitmap() to calculate expected dimensions
- Modified _process_image_for_output() to correct small discrepancies
- Updated rasterize_pdf_page() to pass dimensions through pipeline
- Parametrized rotation tests to run with both rasterizers

All 45 rotation tests now pass with both pypdfium and ghostscript.

Fixes test_rotated_skew_timeout with pypdfium rasterizer.
2026-01-14 14:37:24 -08:00
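The dimension calculation listed under "Solution" in 3f328785f0 can be sketched roughly as follows. The function names are hypothetical, and rounding to the nearest pixel with `round()` is an assumption: the commit message does not say whether the real code rounds, floors, or ceils.

```python
def expected_pixel_size(width_pt, height_pt, xdpi, ydpi, rotation=0):
    """Return (width_px, height_px) for a page measured in PDF points."""
    # Step 1: round DPI to 6 decimals, the precision the commit attributes
    # to Ghostscript.
    xdpi = round(xdpi, 6)
    ydpi = round(ydpi, 6)
    # Step 2: keep x and y DPI separate (72 points per inch) instead of
    # collapsing them into a single averaged scale factor.
    width_px = round(width_pt * xdpi / 72.0)  # assumption: nearest pixel
    height_px = round(height_pt * ydpi / 72.0)
    # Step 3: quarter-turn rotations swap the output axes.
    if rotation % 180 == 90:
        width_px, height_px = height_px, width_px
    return width_px, height_px


def needs_small_resize(actual, expected, tolerance=2):
    """Step 4: True when the render misses the target by 1..tolerance pixels."""
    dw = abs(actual[0] - expected[0])
    dh = abs(actual[1] - expected[1])
    return 0 < max(dw, dh) <= tolerance
```

For a US Letter page (612 x 792 points) at 300 DPI this gives 2550 x 3300 pixels, or 3300 x 2550 after a 90° rotation.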
James R. Barlow
5acf21651f ruff lint and format 2026-01-13 01:50:57 -08:00
James R. Barlow
7bfe3ecd5b Fix double-compression of already-deflated JPEGs
Images with [FlateDecode, DCTDecode] filter chain were incorrectly
being marked for additional FlateDecode compression, resulting in
double-compressed data and invalid output PDFs.

Add _already_flate_encoded() helper to check if an image already has
FlateDecode in its filter chain, and skip such images in
_find_deflatable_jpeg().
2026-01-13 01:41:59 -08:00
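A rough illustration of the filter-chain check described in 7bfe3ecd5b. It works on plain strings and lists rather than real PDF objects, and the function names here are hypothetical simplifications, not the project's `_already_flate_encoded()` or `_find_deflatable_jpeg()`.

```python
def _as_filter_list(filters):
    """A PDF /Filter entry may be absent, a single name, or an array of names."""
    if filters is None:
        return []
    if isinstance(filters, str):
        return [filters]
    return list(filters)


def already_flate_encoded(filters):
    """True if FlateDecode already appears anywhere in the filter chain."""
    return "/FlateDecode" in _as_filter_list(filters)


def is_deflatable_jpeg(filters):
    """A JPEG (DCTDecode) image that has not been deflated yet."""
    names = _as_filter_list(filters)
    return "/DCTDecode" in names and not already_flate_encoded(names)
```

With this check, an image whose chain is `["/FlateDecode", "/DCTDecode"]` is skipped, which is the case the commit says was previously compressed twice.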
James R. Barlow
5371cc5e39 Update test to match new error message 2026-01-13 01:33:10 -08:00
James R. Barlow
4c7086c609 Replace typer with cyclopts CLI library in misc scripts
Migrate watcher.py and pdf_text_diff.py from typer to cyclopts for
CLI argument parsing. Update pyproject.toml to reflect the dependency
change in the watcher optional feature.
2026-01-13 00:43:14 -08:00
James R. Barlow
bf76c8270c Rationalize optional dependencies vs dependency groups
Establish clear separation between user-facing optional dependencies
and developer-only dependency groups:

**Optional Dependencies (user features):**
- watcher: File watching service for batch processing
- webservice: Streamlit-based web UI
- Installable via: uv sync --extra <name> or pip install ocrmypdf[<name>]

**Dependency Groups (developer tools):**
- test: Testing infrastructure (merged from test + extended_test)
- docs: Documentation building tools
- streamlit-dev: Enhanced Streamlit development tools
- dev: General development tools (mypy, ipykernel)
- Installable via: uv sync --group <name> (uv only, NOT pip)

Breaking changes for developers:
- pip install -e .[test] no longer works → use uv sync --group test
- pip install -e .[docs] no longer works → use uv sync --group docs
- pip install -e .[extended_test] removed → merged into test group

No breaking changes for end users:
- pip install ocrmypdf[watcher] still works
- pip install ocrmypdf[webservice] still works

Updated:
- CI/CD workflows to use uv sync --group test
- Docker images to exclude test dependencies
- Documentation to recommend uv with pip as fallback
- pyproject.toml with clear comments explaining both systems
2026-01-13 00:34:55 -08:00
James R. Barlow
740f67091c Rename OCROptions to OcrOptions for consistency
Technically OCROptions is more Pythonic but we have several pre-existing classes named OcrWhatever. Go with the local flow.
2026-01-12 23:37:54 -08:00
James R. Barlow
36dea181e6 Update cookbook: Replace --tesseract-timeout 0 with --ocr-engine none
Update documentation examples to use the new --ocr-engine none option
instead of the deprecated --tesseract-timeout 0 idiom for disabling OCR.
2026-01-12 23:28:14 -08:00
James R. Barlow
c69f293322 Add --mode/-m CLI argument with ProcessingMode enum
Introduce a new --mode (-m) argument that consolidates the three
mutually exclusive OCR processing options, plus the default behavior,
into a single enum:
- default: Error if text is found (standard behavior)
- force: Rasterize all content and run OCR (replaces --force-ocr)
- skip: Skip pages with existing text (replaces --skip-text)
- redo: Re-OCR pages, stripping old text layer (replaces --redo-ocr)

The legacy flags --force-ocr, --skip-text, and --redo-ocr remain as
silent aliases for backward compatibility. Both CLI and API usage
continue to work unchanged.
2026-01-12 15:23:08 -08:00
James R. Barlow
e9fe061c30 Format fix 2026-01-12 10:25:24 -08:00
James R. Barlow
c9ea07e954 Reduce chattiness of fonttools 2026-01-12 10:16:58 -08:00
James R. Barlow
0c3745a1a4 Add OCR engine selection framework and null OCR engine
Introduce --ocr-engine option to select between OCR engines:
- 'auto' (default): Uses Tesseract
- 'tesseract': Explicit Tesseract selection
- 'none': Skip OCR entirely (for PDF processing only)

Key changes:
- Extend OcrEngine ABC with generate_ocr() and supports_generate_ocr()
  for direct OcrElement tree output (bypasses hOCR)
- Add get_ocr_engine(options) hook parameter for engine selection
- Implement NullOcrEngine for --ocr-engine none
- Export OcrElement, OcrClass, BoundingBox from ocrmypdf package
- Add ocr_tree support to grafting pipeline

This prepares the foundation for pluggable OCR engines while maintaining
full backward compatibility with existing Tesseract-based workflows.
2026-01-12 10:11:14 -08:00
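The engine selection described in 0c3745a1a4 could look something like this toy registry. Every class and function name below is hypothetical, and the single `recognize` method is a deliberately simplified stand-in for the real `OcrEngine` interface.

```python
from abc import ABC, abstractmethod


class ToyOcrEngine(ABC):
    """Simplified stand-in for an OCR engine interface."""

    @abstractmethod
    def recognize(self, image_path):
        """Return a list of recognized words for the image."""


class ToyTesseractEngine(ToyOcrEngine):
    def recognize(self, image_path):
        # Placeholder: a real engine would invoke Tesseract here.
        raise NotImplementedError("stand-in for a Tesseract call")


class ToyNullEngine(ToyOcrEngine):
    def recognize(self, image_path):
        # 'none': skip OCR entirely and produce no text.
        return []


def select_engine(name="auto"):
    """Map the --ocr-engine value onto an engine instance."""
    registry = {
        "auto": ToyTesseractEngine,  # the commit says 'auto' uses Tesseract
        "tesseract": ToyTesseractEngine,
        "none": ToyNullEngine,
    }
    try:
        return registry[name]()
    except KeyError:
        raise ValueError(f"unknown OCR engine: {name!r}") from None
```

A null engine lets the rest of the pipeline run unchanged while contributing no text, which avoids special-casing "no OCR" throughout the code.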
James R. Barlow
664c3e2a8e Update test cache for slow rotation tests 2026-01-10 16:30:25 -08:00
James R. Barlow
315d0df0e9 Fix incorrect rotation direction in pypdfium rasterizer
pypdfium2 expects clockwise rotation values, but OCRmyPDF tracks
rotation counter-clockwise. Negate the rotation value to fix.

Also refactor nested try/finally blocks to use contextlib.closing()
for cleaner resource management.
2026-01-10 16:29:49 -08:00
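The sign flip in 315d0df0e9 amounts to the conversion below. The function name is hypothetical, and normalizing the result into the 0..359 range is an assumption added for illustration; the commit only says the value is negated.

```python
def ccw_to_cw(degrees_ccw):
    """Convert a counter-clockwise rotation into the equivalent clockwise one."""
    # Negate, then wrap into 0..359 (assumption: the caller wants a
    # non-negative angle).
    return (-degrees_ccw) % 360
```

Only the 90° and 270° cases change; 0° and 180° map to themselves, which may be why this kind of bug only shows up on pages rotated by a quarter turn.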
James R. Barlow
3c94ada857 Fix tesseract_cache plugin to properly handle cache misses
- Check all required output files exist before declaring cache hit,
  not just stderr.bin
- Add 'hocr' to list of cached output file types
- Fix timeout=0.0 causing immediate timeout on cache miss by treating
  it as "no timeout"
2026-01-09 02:10:29 -08:00
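A sketch of the two fixes in 3c94ada857, using hypothetical helper names. It assumes the cache is a directory of output files and that a timeout of `None` means "wait indefinitely", as it does for Python's `subprocess` functions.

```python
from pathlib import Path


def is_cache_hit(cache_dir, required_names):
    """A hit requires every expected output file, not just one marker file."""
    cache_dir = Path(cache_dir)
    return all((cache_dir / name).is_file() for name in required_names)


def effective_timeout(timeout):
    """Treat 0 (or 0.0) as 'no timeout' rather than 'time out immediately'."""
    return None if not timeout else timeout
```

Checking every required file guards against a partially written cache entry being mistaken for a complete one.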
James R. Barlow
fcbdbac602 Update test_page_boxes MediaBox expectations for speculative PDF/A
When speculative PDF/A succeeds (verapdf available), Ghostscript is
bypassed and MediaBox is preserved rather than normalized to origin.
2026-01-09 01:25:31 -08:00
James R. Barlow
122450c19e Fix Ghostscript tests after default output type changed to 'auto'
- Add --output-type pdfa to tests that exercise Ghostscript-specific
  behavior (test_gs_render_failure, test_ghostscript_pdfa_failure,
  test_ghostscript_mandatory_color_conversion)
- Add Gs106WarningFilter to suppress expected Ghostscript 10.6.x JPEG
  encoding warning in test logs
2026-01-09 01:02:25 -08:00
James R. Barlow
0c4ee5af4e Add 'auto' output type for best-effort PDF/A without Ghostscript
- Add new '--output-type auto' option (now the default) that produces
  best-effort PDF/A without requiring Ghostscript
- When verapdf is available, use speculative PDF/A conversion
- Without verapdf, pass through as PDF/A if safe (input claims PDF/A
  or --force-ocr was used), otherwise output as regular PDF
- Make Ghostscript check conditional - only required for pdfa* output types
- Update soft error tests to explicitly use --output-type pdfa since they
  exercise Ghostscript failure modes
- Fix Tesseract OSD error handling to check both stdout and stderr for
  known non-fatal messages like "Too few characters"
2026-01-09 00:56:00 -08:00
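The decision rules for `--output-type auto` listed in 0c4ee5af4e can be summarized as a small function. The function name and the returned labels are hypothetical; only the branching follows the commit message.

```python
def choose_auto_strategy(verapdf_available, input_claims_pdfa, force_ocr):
    """Pick how 'auto' produces its output, following the commit's bullets."""
    if verapdf_available:
        # verapdf can validate the result, so try speculative PDF/A.
        return "speculative-pdfa"
    if input_claims_pdfa or force_ocr:
        # No validator, but passing through as PDF/A is considered safe.
        return "pdfa-passthrough"
    # Otherwise fall back to a regular PDF.
    return "pdf"
```

None of the three outcomes is described as requiring Ghostscript, which is consistent with the commit making the Ghostscript check conditional on the `pdfa*` output types.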
James R. Barlow
bdc50e9470 Add explicit word spacing for pdfminer.six compatibility
Insert space characters between words in the fpdf2 renderer so PDF
readers like pdfminer.six can properly segment words during text
extraction. Some PDF readers rely on explicit space characters rather
than inferring word boundaries from positioning.

- Use itertools.pairwise to iterate consecutive word pairs
- Render space immediately after each word (content stream order matters)
- Skip space insertion between CJK words (no spaces in CJK text)
- Use 5% line height threshold to filter OCR noise
- Support RTL text direction
2026-01-08 16:32:14 -08:00
James R. Barlow
4cb488d0fc Skip speculative PDF/A when --pdfa-image-compression is set
When the user explicitly sets --pdfa-image-compression to something
other than 'auto', skip the speculative PDF/A conversion and use
Ghostscript instead. The speculative conversion (using pikepdf +
verapdf) doesn't apply image compression settings, so Ghostscript
is required to honor the user's compression preference.
2026-01-08 15:12:35 -08:00
James R. Barlow
bb5238e524 Update tests to use new OcrmypdfPluginManager interface
Replace pm.hook.method() calls with pm.method() calls to match the
refactored plugin manager that now uses composition over inheritance.
The hook attribute is no longer directly exposed; instead, type-safe
methods are provided directly on the plugin manager class.
2026-01-08 13:09:19 -08:00
James R. Barlow
900a60fd10 Add verapdf integration for speculative PDF/A conversion
Introduce a fast path for PDF/A conversion that uses pikepdf to add
PDF/A structures directly (sRGB ICC profile and XMP metadata), then
validates with verapdf. If validation passes, skip Ghostscript entirely.
If validation fails or verapdf is unavailable, fall back to the existing
Ghostscript conversion path.

New files:
- src/ocrmypdf/_exec/verapdf.py: CLI wrapper for verapdf validator
- tests/test_verapdf.py: Test suite for new functionality

Modified:
- pdfa.py: Add speculative_pdfa_conversion() and helpers
- _pipeline.py: Add try_speculative_pdfa() function
- _pipelines/_common.py: Integrate speculative path into postprocess()
2026-01-08 10:58:01 -08:00
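The fast path with fallback from 900a60fd10 follows a try-validate-fallback shape, sketched here with injected callables. All names are hypothetical; the real functions (`speculative_pdfa_conversion()`, `try_speculative_pdfa()`) are not reproduced, and passing `validate=None` is just this sketch's way of modelling "verapdf is unavailable".

```python
def convert_to_pdfa(source, *, add_pdfa_structures, validate, ghostscript_convert):
    """Try the cheap conversion first; use Ghostscript only if it fails.

    Returns a (result, path_taken) tuple.
    """
    if validate is not None:
        # Fast path: add PDF/A structures directly, then check the result.
        candidate = add_pdfa_structures(source)
        if validate(candidate):
            return candidate, "speculative"
    # Validation failed or no validator: fall back to the existing path.
    return ghostscript_convert(source), "ghostscript"
```

Injecting the steps as callables keeps the control flow testable without verapdf or Ghostscript installed.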
James R. Barlow
f5617ce44e Refactor OcrmypdfPluginManager to use composition over inheritance
Replace inheritance from pluggy.PluginManager with composition pattern,
providing a type-safe interface for all 16 hooks defined in pluginspec.py.
The underlying pluggy manager is now accessible via the .pluggy property
for advanced use cases like set_blocked().

This change enables IDE autocomplete and type checking for all hook calls
while maintaining full backward compatibility with the plugin system.
2026-01-07 17:23:13 -08:00
James R. Barlow
0e946a7498 Clarify message about number of workers 2026-01-07 16:41:18 -08:00
James R. Barlow
b2b6a7c4b1 Pass OMP_THREAD_LIMIT to Tesseract subprocesses instead of modifying parent env
Instead of setting OMP_THREAD_LIMIT in the parent process's environment,
calculate the thread limit in the validate hook and pass it through to
Tesseract subprocess calls via the env parameter. This avoids polluting
the parent process's environment while still controlling Tesseract's
thread usage.
2026-01-06 18:43:29 -08:00
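The approach in b2b6a7c4b1 relies on the `env` parameter of `subprocess.run`. A minimal sketch, with a hypothetical helper name, shows a child process receiving `OMP_THREAD_LIMIT` while the parent's environment is left alone.

```python
import os
import subprocess
import sys


def run_with_thread_limit(args, thread_limit):
    """Run a command with OMP_THREAD_LIMIT set only for the child process."""
    env = os.environ.copy()  # copy, so the parent's os.environ is untouched
    env["OMP_THREAD_LIMIT"] = str(thread_limit)
    return subprocess.run(
        args, env=env, capture_output=True, text=True, check=True
    )


if __name__ == "__main__":
    # Demonstration with a Python child instead of Tesseract.
    child = [sys.executable, "-c", "import os; print(os.environ['OMP_THREAD_LIMIT'])"]
    print(run_with_thread_limit(child, 2).stdout.strip())
```

Because the variable lives only in the copied mapping, other libraries running in the parent process do not see the limit.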
James R. Barlow
75c664793e Don't share claude 2026-01-06 15:42:51 -08:00
James R. Barlow
bbd263ff48 Add tests for fpdf2 renderer and font infrastructure
- Add hOCR test fixtures for Latin, Arabic, CJK, Devanagari scripts
- Add tests for fpdf2 renderer, multi-font manager, system font provider
- Add multilingual rendering tests
- Update existing tests to use fpdf2 renderer
2026-01-06 13:46:11 -08:00
James R. Barlow
7a4b98974c Integrate fpdf2 renderer and remove legacy hOCR renderer
- Update pipeline to use fpdf2 renderer as default
- Remove legacy hocrtransform PDF renderer (_font.py, _hocr.py,
  pdf_renderer.py)
- Update CLI and options for fpdf2 renderer
- Add fpdf2 dependency to pyproject.toml
- Update graft module for fpdf2 multi-page rendering
2026-01-06 13:45:44 -08:00
James R. Barlow
d72a494979 Add fpdf2-based PDF text layer renderer
Implement new PDF renderer using fpdf2 library that provides:
- Multilingual text support via font module
- Proper baseline and rotation handling
- Multi-page rendering with efficient font embedding
- Invisible but selectable text layer
2026-01-06 13:45:14 -08:00
James R. Barlow
64726f97b3 Add font infrastructure and glyphless font
- Add font module with FontManager, FontProvider, MultiFontManager,
  and SystemFontProvider for multilingual font support
- Add NotoSans-Regular.ttf for Latin text rendering
- Replace pdf.ttf with Occulta.ttf glyphless font
- Add script to generate new Occulta glyphless font
- System font discovery for CJK, Arabic, Devanagari scripts
2026-01-06 13:44:54 -08:00
James R. Barlow
83a43408c2 Refactor tesseract thresholding to use enum type
Replace integer-based thresholding parameter with ThresholdingMethod
enum for improved type safety. The CLI still accepts the same string
values (auto, otsu, adaptive-otsu, sauvola) but internally uses a
strongly-typed enum. This makes the code more maintainable and catches
type errors at development time.
2025-12-27 13:32:56 -08:00
James R. Barlow
2cb0973540 Improve Ghostscript API/CLI definitions 2025-12-27 01:40:12 -08:00
James R. Barlow
0d6e0c4560 Merge branch 'main' into dev 2025-12-24 00:44:18 -08:00
James R. Barlow
94d7735862 docs: missing issue ref 2025-12-24 00:14:24 -08:00
James R. Barlow
c540967429 docs: Update release notes v16.13.0 2025-12-23 15:44:44 -08:00
James R. Barlow
195344d307 Reinstate "Work around Ghostscript 10.6.0 JPEG encoding issue by forcing optimization."
This reverts commit fc30cb8903.
It turns out that both fixes were necessary.
2025-12-23 15:41:34 -08:00
James R. Barlow
de63d6eac9 Merge remote-tracking branches 'origin/dependabot/github_actions/actions/download-artifact-7', 'origin/dependabot/github_actions/actions/upload-artifact-6', 'origin/dependabot/github_actions/sigstore/gh-action-sigstore-python-3.2.0' and 'origin/dependabot/github_actions/actions/checkout-6' 2025-12-23 15:06:50 -08:00
James R. Barlow
6ada11ddae docs: Update release notes 2025-12-23 15:05:49 -08:00
James R. Barlow
fc30cb8903 Revert "Work around Ghostscript 10.6.0 JPEG encoding issue by forcing optimization."
This reverts commit f4c6c8121b.

The issue is now resolved by correcting the encoding issue directly.
2025-12-23 15:03:51 -08:00