4196 Commits

Author SHA1 Message Date
James R. Barlow
122450c19e Fix Ghostscript tests after default output type changed to 'auto'
- Add --output-type pdfa to tests that exercise Ghostscript-specific
  behavior (test_gs_render_failure, test_ghostscript_pdfa_failure,
  test_ghostscript_mandatory_color_conversion)
- Add Gs106WarningFilter to suppress expected Ghostscript 10.6.x JPEG
  encoding warning in test logs
2026-01-09 01:02:25 -08:00
James R. Barlow
0c4ee5af4e Add 'auto' output type for best-effort PDF/A without Ghostscript
- Add new '--output-type auto' option (now the default) that produces
  best-effort PDF/A without requiring Ghostscript
- When verapdf is available, use speculative PDF/A conversion
- Without verapdf, pass through as PDF/A if safe (input claims PDF/A
  or --force-ocr was used), otherwise output as regular PDF
- Make Ghostscript check conditional - only required for pdfa* output types
- Update soft error tests to explicitly use --output-type pdfa since they
  exercise Ghostscript failure modes
- Fix Tesseract OSD error handling to check both stdout and stderr for
  known non-fatal messages like "Too few characters"
2026-01-09 00:56:00 -08:00
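The decision rules listed in 0c4ee5af4e can be sketched as a small pure function. This is an illustration only, not the project's code: the names `Outcome` and `choose_auto_output` and the three boolean parameters are hypothetical.

```python
from enum import Enum


class Outcome(Enum):
    """Hypothetical labels for the three results the commit message describes."""

    SPECULATIVE_PDFA = "speculative-pdfa"  # attempt conversion, validate with verapdf
    PASSTHROUGH_PDFA = "passthrough-pdfa"  # emit as PDF/A without a validator
    PLAIN_PDF = "pdf"                      # emit a regular PDF


def choose_auto_output(
    verapdf_available: bool, input_claims_pdfa: bool, force_ocr: bool
) -> Outcome:
    """Pick an outcome for '--output-type auto' following the commit's bullets."""
    if verapdf_available:
        return Outcome.SPECULATIVE_PDFA
    # Without a validator, pass through as PDF/A only when that is considered safe.
    if input_claims_pdfa or force_ocr:
        return Outcome.PASSTHROUGH_PDFA
    return Outcome.PLAIN_PDF
```

Keeping the decision free of I/O makes each branch easy to test in isolation.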
James R. Barlow
bdc50e9470 Add explicit word spacing for pdfminer.six compatibility
Insert space characters between words in the fpdf2 renderer so PDF
readers like pdfminer.six can properly segment words during text
extraction. Some PDF readers rely on explicit space characters rather
than inferring word boundaries from positioning.

- Use itertools.pairwise to iterate consecutive word pairs
- Render space immediately after each word (content stream order matters)
- Skip space insertion between CJK words (no spaces in CJK text)
- Use 5% line height threshold to filter OCR noise
- Support RTL text direction
2026-01-08 16:32:14 -08:00
James R. Barlow
4cb488d0fc Skip speculative PDF/A when --pdfa-image-compression is set
When the user explicitly sets --pdfa-image-compression to something
other than 'auto', skip the speculative PDF/A conversion and use
Ghostscript instead. The speculative conversion (using pikepdf +
verapdf) doesn't apply image compression settings, so Ghostscript
is required to honor the user's compression preference.
2026-01-08 15:12:35 -08:00
James R. Barlow
bb5238e524 Update tests to use new OcrmypdfPluginManager interface
Replace pm.hook.method() calls with pm.method() calls to match the
refactored plugin manager that now uses composition over inheritance.
The hook attribute is no longer directly exposed; instead, type-safe
methods are provided directly on the plugin manager class.
2026-01-08 13:09:19 -08:00
James R. Barlow
900a60fd10 Add verapdf integration for speculative PDF/A conversion
Introduce a fast path for PDF/A conversion that uses pikepdf to add
PDF/A structures directly (sRGB ICC profile and XMP metadata), then
validates with verapdf. If validation passes, skip Ghostscript entirely.
If validation fails or verapdf is unavailable, fall back to the existing
Ghostscript conversion path.

New files:
- src/ocrmypdf/_exec/verapdf.py: CLI wrapper for verapdf validator
- tests/test_verapdf.py: Test suite for new functionality

Modified:
- pdfa.py: Add speculative_pdfa_conversion() and helpers
- _pipeline.py: Add try_speculative_pdfa() function
- _pipelines/_common.py: Integrate speculative path into postprocess()
2026-01-08 10:58:01 -08:00
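The control flow of 900a60fd10 (try a fast path, validate, fall back) can be sketched as follows. The function `convert_to_pdfa` and its callable parameters are hypothetical stand-ins; the real pikepdf, verapdf, and Ghostscript steps are not shown.

```python
from __future__ import annotations

from collections.abc import Callable
from pathlib import Path


def convert_to_pdfa(
    input_pdf: Path,
    output_pdf: Path,
    fast_convert: Callable[[Path, Path], None],
    validate: Callable[[Path], bool] | None,
    slow_convert: Callable[[Path, Path], None],
) -> str:
    """Return which path produced the output: 'speculative' or 'fallback'.

    `validate` is None when no validator is available, in which case the
    fast path is skipped entirely.
    """
    if validate is not None:
        fast_convert(input_pdf, output_pdf)
        if validate(output_pdf):
            return "speculative"
    # Validator missing or validation failed: use the existing slow path.
    slow_convert(input_pdf, output_pdf)
    return "fallback"
```

Passing the steps in as callables keeps the sketch runnable without any external tools installed.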
James R. Barlow
f5617ce44e Refactor OcrmypdfPluginManager to use composition over inheritance
Replace inheritance from pluggy.PluginManager with composition pattern,
providing a type-safe interface for all 16 hooks defined in pluginspec.py.
The underlying pluggy manager is now accessible via the .pluggy property
for advanced use cases like set_blocked().

This change enables IDE autocomplete and type checking for all hook calls
while maintaining full backward compatibility with the plugin system.
2026-01-07 17:23:13 -08:00
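The composition pattern named in f5617ce44e can be illustrated without pluggy. Everything here is hypothetical: `DynamicHooks` is a stand-in registry, not pluggy's API, and `page_label` is an invented hook name.

```python
from collections.abc import Callable


class DynamicHooks:
    """Stand-in for a dynamic hook registry (hypothetical, not pluggy)."""

    def __init__(self) -> None:
        self._impls: dict[str, list[Callable[..., object]]] = {}

    def register(self, hook_name: str, impl: Callable[..., object]) -> None:
        self._impls.setdefault(hook_name, []).append(impl)

    def call(self, hook_name: str, **kwargs: object) -> list[object]:
        return [impl(**kwargs) for impl in self._impls.get(hook_name, [])]


class TypedPluginManager:
    """Wraps a DynamicHooks instance instead of inheriting from it."""

    def __init__(self) -> None:
        self._hooks = DynamicHooks()

    @property
    def hooks(self) -> DynamicHooks:
        """Escape hatch to the wrapped registry for advanced use."""
        return self._hooks

    def page_label(self, page_number: int) -> list[object]:
        """Explicit, typed method for one hook, so IDEs can complete it."""
        return self._hooks.call("page_label", page_number=page_number)
```

Callers use `manager.page_label(3)` rather than reaching through a dynamic attribute, which is what makes static type checking possible.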
James R. Barlow
0e946a7498 Clarify message about number of workers 2026-01-07 16:41:18 -08:00
James R. Barlow
b2b6a7c4b1 Pass OMP_THREAD_LIMIT to Tesseract subprocesses instead of modifying parent env
Instead of setting OMP_THREAD_LIMIT in the parent process's environment,
calculate the thread limit in the validate hook and pass it through to
Tesseract subprocess calls via the env parameter. This avoids polluting
the parent process's environment while still controlling Tesseract's
thread usage.
2026-01-06 18:43:29 -08:00
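The approach in b2b6a7c4b1 relies on the `env` parameter of `subprocess.run`. A minimal sketch, assuming a hypothetical helper named `run_with_thread_limit`; it is not the project's code, and any command can stand in for Tesseract.

```python
import os
import subprocess
import sys


def run_with_thread_limit(
    args: list[str], thread_limit: int
) -> subprocess.CompletedProcess:
    """Run a child process with OMP_THREAD_LIMIT set only for that child."""
    child_env = os.environ.copy()  # a copy, so os.environ itself is never modified
    child_env["OMP_THREAD_LIMIT"] = str(thread_limit)
    return subprocess.run(
        args, env=child_env, capture_output=True, text=True, check=True
    )


if __name__ == "__main__":
    # A child Python interpreter stands in for Tesseract here.
    result = run_with_thread_limit(
        [sys.executable, "-c", "import os; print(os.environ['OMP_THREAD_LIMIT'])"],
        thread_limit=3,
    )
    print(result.stdout.strip())
```

Because the variable lives only in the copied mapping, other code running in the parent process sees its environment unchanged.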
James R. Barlow
75c664793e Don't share claude 2026-01-06 15:42:51 -08:00
James R. Barlow
bbd263ff48 Add tests for fpdf2 renderer and font infrastructure
- Add hOCR test fixtures for Latin, Arabic, CJK, Devanagari scripts
- Add tests for fpdf2 renderer, multi-font manager, system font provider
- Add multilingual rendering tests
- Update existing tests to use fpdf2 renderer
2026-01-06 13:46:11 -08:00
James R. Barlow
7a4b98974c Integrate fpdf2 renderer and remove legacy hOCR renderer
- Update pipeline to use fpdf2 renderer as default
- Remove legacy hocrtransform PDF renderer (_font.py, _hocr.py,
  pdf_renderer.py)
- Update CLI and options for fpdf2 renderer
- Add fpdf2 dependency to pyproject.toml
- Update graft module for fpdf2 multi-page rendering
2026-01-06 13:45:44 -08:00
James R. Barlow
d72a494979 Add fpdf2-based PDF text layer renderer
Implement new PDF renderer using fpdf2 library that provides:
- Multilingual text support via font module
- Proper baseline and rotation handling
- Multi-page rendering with efficient font embedding
- Invisible but selectable text layer
2026-01-06 13:45:14 -08:00
James R. Barlow
64726f97b3 Add font infrastructure and glyphless font
- Add font module with FontManager, FontProvider, MultiFontManager,
  and SystemFontProvider for multilingual font support
- Add NotoSans-Regular.ttf for Latin text rendering
- Replace pdf.ttf with Occulta.ttf glyphless font
- Add script to generate new Occulta glyphless font
- System font discovery for CJK, Arabic, Devanagari scripts
2026-01-06 13:44:54 -08:00
James R. Barlow
83a43408c2 Refactor tesseract thresholding to use enum type
Replace integer-based thresholding parameter with ThresholdingMethod
enum for improved type safety. The CLI still accepts the same string
values (auto, otsu, adaptive-otsu, sauvola) but internally uses a
strongly-typed enum. This makes the code more maintainable and catches
type errors at development time.
2025-12-27 13:32:56 -08:00
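The change in 83a43408c2 can be pictured as an enum that maps the four CLI strings to typed members. The member names, the use of the CLI strings as enum values, and the helper `parse_thresholding` are assumptions for illustration; only the four strings come from the commit message.

```python
from enum import Enum


class ThresholdingMethod(Enum):
    """Sketch of a thresholding enum; member names are assumed."""

    AUTO = "auto"
    OTSU = "otsu"
    ADAPTIVE_OTSU = "adaptive-otsu"
    SAUVOLA = "sauvola"


def parse_thresholding(value: str) -> ThresholdingMethod:
    """Convert a CLI string into the enum, with a readable error on bad input."""
    try:
        return ThresholdingMethod(value)
    except ValueError:
        allowed = ", ".join(method.value for method in ThresholdingMethod)
        raise ValueError(
            f"invalid thresholding method {value!r}; expected one of: {allowed}"
        ) from None
```

Once parsed, internal code compares against `ThresholdingMethod.SAUVOLA` and similar members instead of bare integers or strings.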
James R. Barlow
2cb0973540 Improve Ghostscript API/CLI definitions 2025-12-27 01:40:12 -08:00
James R. Barlow
0d6e0c4560 Merge branch 'main' into dev 2025-12-24 00:44:18 -08:00
James R. Barlow
94d7735862 docs: missing issue ref 2025-12-24 00:14:24 -08:00
James R. Barlow
c540967429 docs: Update release notes v16.13.0 2025-12-23 15:44:44 -08:00
James R. Barlow
195344d307 Reinstate "Work around Ghostscript 10.6.0 JPEG encoding issue by forcing optimization."
This reverts commit fc30cb8903.
It turns out that both fixes were necessary.
2025-12-23 15:41:34 -08:00
James R. Barlow
de63d6eac9 Merge remote-tracking branches 'origin/dependabot/github_actions/actions/download-artifact-7', 'origin/dependabot/github_actions/actions/upload-artifact-6', 'origin/dependabot/github_actions/sigstore/gh-action-sigstore-python-3.2.0' and 'origin/dependabot/github_actions/actions/checkout-6' 2025-12-23 15:06:50 -08:00
James R. Barlow
6ada11ddae docs: Update release notes 2025-12-23 15:05:49 -08:00
James R. Barlow
fc30cb8903 Revert "Work around Ghostscript 10.6.0 JPEG encoding issue by forcing optimization."
This reverts commit f4c6c8121b.

The issue is now resolved by correcting the encoding issue directly.
2025-12-23 15:03:51 -08:00
James R. Barlow
01a3706281 docs: Add release notes for v16.13.0 2025-12-23 15:01:22 -08:00
James R. Barlow
e613db6a82 Fix Ghostscript 10.6 JPEG corruption by repairing truncated images
Ghostscript 10.6 has a bug that truncates JPEG data by 1-15 bytes.
This adds detection and repair by comparing output images to input
images and restoring the original bytes when truncation is detected.

- Add warning when GS 10.6+ is used with PDF/A output
- Add _repair_gs106_jpeg_corruption() to fix damaged JPEGs after
  Ghostscript processing
- Add unit tests for the repair function
2025-12-23 14:56:24 -08:00
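The detect-and-restore idea in e613db6a82 can be sketched on raw bytes. This is not the project's `_repair_gs106_jpeg_corruption()`; the helper `repair_truncated_jpeg` is hypothetical and assumes the damage is a pure truncation at the end of otherwise identical data.

```python
MAX_TRUNCATION = 15  # the commit message reports 1-15 missing bytes


def repair_truncated_jpeg(original: bytes, processed: bytes) -> bytes:
    """Return the original bytes when `processed` looks like a truncated copy.

    Assumption for this sketch: a damaged image is byte-identical to the
    original except that a short tail is missing.
    """
    missing = len(original) - len(processed)
    if 1 <= missing <= MAX_TRUNCATION and original.startswith(processed):
        return original
    # Identical, re-encoded, or otherwise different data is left alone.
    return processed
```

The prefix check keeps the repair conservative: an image that was legitimately re-encoded will not match and is passed through unchanged.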
James R. Barlow
742a4bac17 Make rotation test more robust 2025-12-23 11:20:57 -08:00
James R. Barlow
4c1ef0b471 Also process art and bleed boxes 2025-12-23 11:20:41 -08:00
James R. Barlow
eace567f7b Test and fix page box issues 2025-12-23 11:19:51 -08:00
James R. Barlow
e9bfce34f1 Fix ruff linting issues
- Use X | Y syntax in isinstance calls (UP038)
- Remove trailing whitespace from blank lines (W293)
2025-12-23 03:07:48 -08:00
James R. Barlow
16c2604a07 Remove lossy JBIG2 support, retain lossless JBIG2 only
Lossy JBIG2 has been removed due to well-documented risks of character
substitution errors (e.g., 6/8 confusion). The --jbig2-lossy and
--jbig2-page-group-size arguments are now deprecated and ignored with
a warning.

Changes:
- Remove jbig2_lossy and jbig2_page_group_size from OCROptions
- Simplify optimize.py to use single-image JBIG2 encoding only
  (no symbol dictionaries/JBIG2Globals)
- Remove convert_group() from jbig2enc.py
- Deprecate CLI args with warnings for backward compatibility
- Update documentation to explain lossless-only JBIG2
2025-12-23 02:45:07 -08:00
James R. Barlow
9ebba91466 Use plugin namespace access pattern throughout codebase
Migrate all code from flat accessor pattern (options.tesseract_timeout)
to the plugin namespace pattern (options.tesseract.timeout).

Key changes:
- Fix _get_plugin_options to raise AttributeError for unregistered
  namespaces instead of silently returning None
- Add _convert_value helper to convert PathLike to str for plugin
  model field compatibility
- Filter out _plugin_cache_* entries from JSON serialization to fix
  worker process serialization (test_simulate_oom_killer)
- Update tesseract_ocr.py, ghostscript.py, _validation_coordinator.py,
  and _pipelines/ocr.py to use options.tesseract.* and
  options.ghostscript.* accessors
- Update tests to use setup_plugin_infrastructure() for plugin
  model registration
2025-12-23 02:02:21 -08:00
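The namespace access pattern from 9ebba91466, including the AttributeError for unregistered namespaces, can be sketched with `__getattr__`. The classes `Options` and `TesseractOptions` and the `register` method are simplified, hypothetical stand-ins for the real option models.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class TesseractOptions:
    """Hypothetical plugin option model with a single field."""

    timeout: float


class Options:
    """Sketch of an options object exposing plugin namespaces as attributes."""

    def __init__(self) -> None:
        self._namespaces: dict[str, Any] = {}

    def register(self, namespace: str, model: Any) -> None:
        self._namespaces[namespace] = model

    def __getattr__(self, name: str) -> Any:
        # Called only when normal attribute lookup fails. Reading through
        # __dict__ avoids infinite recursion if _namespaces is not yet set.
        namespaces = self.__dict__.get("_namespaces", {})
        if name in namespaces:
            return namespaces[name]
        # Fail loudly instead of silently returning None.
        raise AttributeError(f"plugin namespace {name!r} is not registered")
```

With a model registered under `"tesseract"`, `options.tesseract.timeout` resolves normally, while `options.ghostscript` raises until a model is registered for it.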
James R. Barlow
aec995aced Require plugin model registration for namespace access in OCROptions
- Update __getattr__ docstring to clarify that plugin models must be
  registered for namespace access (e.g., options.tesseract.timeout)
- Update test_json_serialization.py to properly register TesseractOptions
  before accessing plugin namespaces
- Worker processes now register plugin models for multiprocessing tests
- Exclude plugin cache keys from extra_attrs comparison in tests
2025-12-22 15:09:55 -08:00
James R. Barlow
be425e7405 Refactor pdfinfo: split info.py into focused modules
Split the 1288-line info.py into smaller, single-responsibility modules:
- _types.py: Enums, type aliases, lookup dictionaries
- _contentstream.py: PDF content stream parsing, DPI calculation
- _image.py: ImageInfo class and image finding functions
- _worker.py: Concurrency/worker process handling
- info.py: PageInfo, PdfInfo classes (reduced to ~530 lines)

Public API unchanged - all existing imports continue to work.
2025-12-22 01:27:23 -08:00
James R. Barlow
b4f9673364 Add unit tests for HocrParser, PdfTextRenderer, and OcrElement
Comprehensive test coverage for the new hocrtransform components:

- test_ocr_element.py: Tests for BoundingBox, Baseline, FontInfo,
  OcrElement dataclass methods (iter_by_class, find_by_class,
  get_text_recursive, words/lines/paragraphs properties)

- test_hocr_parser.py: Tests for parsing hOCR files including
  page/paragraph/line/word extraction, RTL text, rotated text,
  different line types (header, caption), font info, and edge cases

- test_pdf_renderer.py: Tests for PDF rendering including text
  extraction verification, page sizing, multi-line content,
  text direction, baseline handling, textangle rotation, word breaks,
  debug options, and image overlay

Also fixes x_font regex pattern to not capture trailing semicolons.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-21 17:05:49 -08:00
James R. Barlow
9ea804aff5 Refactor hocrtransform: separate parsing from rendering
Split the hOCR transformation code into three distinct layers:

1. ocr_element.py - Generic OcrElement dataclass that represents OCR
   output structure from any source (hOCR, ALTO, custom engines).
   Includes helper classes: BoundingBox, Baseline, FontInfo.

2. hocr_parser.py - HocrParser class that parses hOCR XML files into
   OcrElement trees, extracting bbox, baseline, textangle, confidence,
   font info, direction, and language.

3. pdf_renderer.py - PdfTextRenderer class that renders OcrElement
   trees to PDF text layers, handling text positioning, baseline
   rotation, LTR/RTL, and word break injection.

The existing HocrTransform class is preserved for backward compatibility,
now delegating to the new components internally.

This separation enables:
- Support for non-hOCR OCR output formats
- Independent improvements to text rendering
- Reuse of OcrElement for other purposes

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-21 16:17:22 -08:00
James R. Barlow
e162361d28 Make rotation test more robust 2025-12-21 14:42:14 -08:00
James R. Barlow
22d00837e3 WIP box tests 2025-12-21 14:03:28 -08:00
James R. Barlow
0faba42d36 test: Don't save local files 2025-12-21 14:03:28 -08:00
James R. Barlow
57e2600566 Also process art and bleed boxes 2025-12-21 14:03:28 -08:00
James R. Barlow
41758766a1 Test and fix page box issues 2025-12-21 14:03:28 -08:00
James R. Barlow
3e46b039ed feat: add use_cropbox parameter to align rasterizer APIs
Added use_cropbox parameter to rasterize_pdf_page hook to allow
choosing between MediaBox and CropBox rendering:

- Default is use_cropbox=False (MediaBox) for consistency with
  Ghostscript's existing behavior
- Ghostscript: passes -dUseCropBox when use_cropbox=True
- pypdfium: calculates crop values to expand from CropBox to MediaBox
  when use_cropbox=False

This aligns both rasterizers to produce the same output dimensions
by default, making the rasterizer choice transparent for page
geometry.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-21 12:29:17 -08:00
James R. Barlow
ae783b4ae6 fix: add thread safety lock to pypdfium plugin
pypdfium2/PDFium is not thread-safe - concurrent calls from different
threads can crash or corrupt the process. Added a module-level lock to
serialize all pdfium operations.

PIL image processing and file I/O are done outside the lock since they
are thread-safe, minimizing lock contention.

For maximum parallelism, users can use process-based parallelism
(use_threads=False) where each process has its own pdfium instance.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-21 12:29:17 -08:00
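The locking strategy in ae783b4ae6 can be sketched with a module-level `threading.Lock`. The function `render_and_postprocess` and its callable parameters are hypothetical; the real plugin calls pypdfium2 and PIL, which are replaced by placeholders here.

```python
import threading
from collections.abc import Callable

# Module-level lock: every thread in the process shares this one object.
_render_lock = threading.Lock()


def render_and_postprocess(
    render: Callable[[], bytes], postprocess: Callable[[bytes], bytes]
) -> bytes:
    """Serialize the non-thread-safe step and run the rest outside the lock."""
    with _render_lock:
        raw = render()       # stands in for the library that is not thread-safe
    return postprocess(raw)  # thread-safe work stays unlocked to limit contention
```

Only the `render` call is serialized, so threads can still overlap on post-processing and file I/O.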
James R. Barlow
b9f488d65c test: add comprehensive tests for --rasterizer option
Add test_rasterizer.py with tests covering:
- Basic rasterizer option validation ('auto', 'ghostscript', 'pypdfium')
- Rasterizer + --rotate-pages interaction
- PDFs with nonstandard MediaBox/TrimBox/CropBox
- Direct hook tests verifying plugins respect the option

Also fix pluggy parameter passing: make 'options' a required parameter
(no default) in the hookspec so pluggy forwards it to implementations.
Update test plugins and test_rotation.py to pass the new parameter.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-21 12:29:17 -08:00
James R. Barlow
ed813cec67 feat: add --rasterizer CLI option to select PDF rasterization backend
Add user control over which rasterizer is used for PDF page rendering:
- 'auto' (default): prefers pypdfium when available, falls back to Ghostscript
- 'pypdfium': force pypdfium2 (errors if not installed)
- 'ghostscript': force traditional Ghostscript rasterizer

Changes:
- Add rasterizer field with validation to OCROptions model
- Add --rasterizer CLI argument in the Advanced options group
- Update rasterize_pdf_page hookspec to pass options to plugins
- Update pypdfium plugin with check_options hook for availability check
- Update both plugins to respect the rasterizer option

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-21 12:29:17 -08:00
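The three `--rasterizer` values described in ed813cec67 can be sketched as a selection function. The name `choose_rasterizer`, the exception types, and the availability check via `importlib.util.find_spec` are assumptions for illustration, not the plugin code.

```python
import importlib.util


def choose_rasterizer(option: str, pypdfium_available: bool) -> str:
    """Resolve a --rasterizer value to the backend that would be used."""
    if option == "auto":
        # Prefer pypdfium when present, otherwise fall back to Ghostscript.
        return "pypdfium" if pypdfium_available else "ghostscript"
    if option == "pypdfium":
        if not pypdfium_available:
            raise RuntimeError(
                "--rasterizer pypdfium was requested but pypdfium2 is not installed"
            )
        return "pypdfium"
    if option == "ghostscript":
        return "ghostscript"
    raise ValueError(f"unknown rasterizer: {option!r}")


if __name__ == "__main__":
    # One possible availability check: is the pypdfium2 module importable?
    available = importlib.util.find_spec("pypdfium2") is not None
    print(choose_rasterizer("auto", available))
```

Separating the availability probe from the decision keeps the selection logic testable on machines with or without pypdfium2.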
James R. Barlow
938ce8e285 fix: make pypdfium plugin optional with Ghostscript fallback
- Remove check_options hook from pypdfium that raised error when
  pypdfium2 wasn't installed
- Return None from pypdfium's rasterize_pdf_page when pypdfium2 is
  unavailable, allowing the hook to fall through
- Restore Ghostscript's rasterize_pdf_page hook as fallback

This allows OCRmyPDF to work without pypdfium2 installed, using
Ghostscript for rasterization as before.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-21 12:29:17 -08:00
James R. Barlow
cf3fb6e89b Fix raster_device settings for pypdfium rasterizer 2025-12-21 12:29:17 -08:00
James R. Barlow
3482ea5fe5 refactor: Modularize rasterize_pdf_page into separate PDF, page, and image processing functions
Co-authored-by: aider (anthropic/claude-sonnet-4-20250514) <aider@aider.chat>
2025-12-21 12:29:17 -08:00
James R. Barlow
e85c5bbb4d refactor: Simplify error message and code formatting in pypdfium plugin 2025-12-21 12:29:17 -08:00
James R. Barlow
740b0bddc6 feat: add pypdfium2 rasterization plugin for OCRmyPDF
Co-authored-by: aider (anthropic/claude-sonnet-4-20250514) <aider@aider.chat>
2025-12-21 12:29:17 -08:00
James R. Barlow
a4ee513cd4 refactor: clean up deprecated code and update plugin docs
- Remove outdated Phase comments from _options.py and cli.py
- Remove unused methods from PluginOptionRegistry:
  - get_extended_options_model() - replaced by __getattr__ in OCROptions
  - map_legacy_options() - unused
  - validate_plugin_options() - unused
- Update plugin documentation to document register_options hook
- Add documentation for nested plugin option access pattern

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-21 12:21:48 -08:00