874 Commits

Author SHA1 Message Date
James R. Barlow
bc745d4d81 Replace magic Ghostscript raster device strings with StrEnum 2026-01-20 10:44:25 -08:00
James R. Barlow
3f328785f0 Fix pypdfium rasterizer to match Ghostscript dimensions
The pypdfium rasterizer was producing output images that differed by 1
pixel compared to Ghostscript due to floating-point precision issues in
dimension calculations.

Root cause:
- pypdfium used harmonic mean of x/y DPI to calculate a single scale
  factor, losing the distinction between x and y DPI
- No DPI rounding like Ghostscript's 6-decimal precision
- Compound rounding errors when converting points to pixels

Solution:
1. Round DPI to 6 decimals to match Ghostscript's precision
2. Calculate expected output dimensions using separate x/y DPI values
3. Handle dimension swapping for 90°/270° rotations
4. Resize output image if off by 1-2 pixels (graceful correction)

This ensures pixel-perfect matching with Ghostscript while being
minimally invasive and only resizing when necessary.

Changes:
- Modified _render_page_to_bitmap() to calculate expected dimensions
- Modified _process_image_for_output() to correct small discrepancies
- Updated rasterize_pdf_page() to pass dimensions through pipeline
- Parametrized rotation tests to run with both rasterizers

All 45 rotation tests now pass with both pypdfium and ghostscript.

Fixes test_rotated_skew_timeout with pypdfium rasterizer.
2026-01-14 14:37:24 -08:00
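
Note on 3f328785f0: the dimension calculation described in the solution steps above could look roughly like the sketch below. The names `expected_pixel_size` and `needs_resize`, the use of `round()` for the point-to-pixel conversion, and the choice to swap axes after scaling are assumptions for illustration, not the code from the commit.

```python
def expected_pixel_size(width_pt, height_pt, xdpi, ydpi, rotation=0):
    """Return (width_px, height_px) for a page measured in PDF points."""
    # Round DPI to 6 decimals, the precision the commit attributes to Ghostscript.
    xdpi = round(xdpi, 6)
    ydpi = round(ydpi, 6)
    # Keep x and y DPI separate instead of merging them into one scale factor.
    width_px = round(width_pt * xdpi / 72.0)   # 72 points per inch
    height_px = round(height_pt * ydpi / 72.0)
    # A 90 or 270 degree rotation swaps the output axes.
    if rotation % 180 == 90:
        width_px, height_px = height_px, width_px
    return width_px, height_px


def needs_resize(actual, expected, tolerance=2):
    """True when the image differs from the expected size by 1..tolerance px."""
    dw = abs(actual[0] - expected[0])
    dh = abs(actual[1] - expected[1])
    return 0 < max(dw, dh) <= tolerance
```

Under these assumptions a US Letter page (612 x 792 pt) at 300 DPI maps to 2550 x 3300 pixels, and a result of 2551 x 3300 would be flagged for a corrective resize.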
James R. Barlow
5acf21651f ruff lint and format 2026-01-13 01:50:57 -08:00
James R. Barlow
5371cc5e39 Update test to match new error message 2026-01-13 01:33:10 -08:00
James R. Barlow
740f67091c Rename OCROptions to OcrOptions for consistency
Technically OCROptions is more Pythonic but we have several pre-existing classes named OcrWhatever. Go with the local flow.
2026-01-12 23:37:54 -08:00
James R. Barlow
c69f293322 Add --mode/-m CLI argument with ProcessingMode enum
Introduce a new --mode (-m) argument that consolidates the three
mutually exclusive OCR processing options into a single enum:
- default: Error if text is found (standard behavior)
- force: Rasterize all content and run OCR (replaces --force-ocr)
- skip: Skip pages with existing text (replaces --skip-text)
- redo: Re-OCR pages, stripping old text layer (replaces --redo-ocr)

The legacy flags --force-ocr, --skip-text, and --redo-ocr remain as
silent aliases for backward compatibility. Both CLI and API usage
continue to work unchanged.
2026-01-12 15:23:08 -08:00
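
Note on c69f293322: one way to fold legacy flags into a single enum-valued option with `argparse` is sketched below. The enum base class, the `build_parser` helper, and the use of a shared `dest` are assumptions made for illustration; the commit message does not show how the aliases are wired.

```python
import argparse
from enum import Enum


class ProcessingMode(str, Enum):
    DEFAULT = "default"  # error if text is found
    FORCE = "force"      # rasterize everything and run OCR
    SKIP = "skip"        # skip pages that already have text
    REDO = "redo"        # re-OCR, stripping the old text layer


def build_parser():
    parser = argparse.ArgumentParser()
    # The modes are mutually exclusive, so the aliases share one group.
    group = parser.add_mutually_exclusive_group()
    group.add_argument(
        "-m", "--mode", dest="mode",
        type=ProcessingMode, default=ProcessingMode.DEFAULT,
    )
    # Legacy flags write to the same destination as --mode.
    group.add_argument("--force-ocr", dest="mode",
                       action="store_const", const=ProcessingMode.FORCE)
    group.add_argument("--skip-text", dest="mode",
                       action="store_const", const=ProcessingMode.SKIP)
    group.add_argument("--redo-ocr", dest="mode",
                       action="store_const", const=ProcessingMode.REDO)
    return parser
```

With this layout, `build_parser().parse_args(["--force-ocr"]).mode` and `build_parser().parse_args(["-m", "force"]).mode` both yield `ProcessingMode.FORCE`, so downstream code only has to inspect one attribute.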
James R. Barlow
e9fe061c30 Format fix 2026-01-12 10:25:24 -08:00
James R. Barlow
0c3745a1a4 Add OCR engine selection framework and null OCR engine
Introduce --ocr-engine option to select between OCR engines:
- 'auto' (default): Uses Tesseract
- 'tesseract': Explicit Tesseract selection
- 'none': Skip OCR entirely (for PDF processing only)

Key changes:
- Extend OcrEngine ABC with generate_ocr() and supports_generate_ocr()
  for direct OcrElement tree output (bypasses hOCR)
- Add get_ocr_engine(options) hook parameter for engine selection
- Implement NullOcrEngine for --ocr-engine none
- Export OcrElement, OcrClass, BoundingBox from ocrmypdf package
- Add ocr_tree support to grafting pipeline

This prepares the foundation for pluggable OCR engines while maintaining
full backward compatibility with existing Tesseract-based workflows.
2026-01-12 10:11:14 -08:00
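
Note on 0c3745a1a4: the selection rule in the commit message ('auto' resolves to Tesseract, 'none' performs no OCR) can be illustrated with a small sketch. `EngineSketch`, `NullEngineSketch`, `recognize`, and `select_engine` are hypothetical names and do not reflect the real `OcrEngine` interface or hook signature.

```python
from abc import ABC, abstractmethod


class EngineSketch(ABC):
    """Hypothetical stand-in for an OCR engine interface."""

    @abstractmethod
    def recognize(self, image_path: str) -> list[str]:
        """Return the recognized words for one page image."""


class NullEngineSketch(EngineSketch):
    """Performs no OCR: every page yields no words."""

    def recognize(self, image_path: str) -> list[str]:
        return []


def select_engine(name, engines):
    """Pick an engine from a name-to-engine mapping."""
    # 'auto' resolves to Tesseract, per the commit message.
    resolved = "tesseract" if name == "auto" else name
    try:
        return engines[resolved]
    except KeyError:
        raise ValueError(f"unknown OCR engine: {name!r}") from None
```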
James R. Barlow
664c3e2a8e Update test cache for slow rotation tests 2026-01-10 16:30:25 -08:00
James R. Barlow
3c94ada857 Fix tesseract_cache plugin to properly handle cache misses
- Check all required output files exist before declaring cache hit,
  not just stderr.bin
- Add 'hocr' to list of cached output file types
- Fix timeout=0.0 causing immediate timeout on cache miss by treating
  it as "no timeout"
2026-01-09 02:10:29 -08:00
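
Note on 3c94ada857: the two fixes described above can be sketched as follows. The helper names and the shape of the cache directory are assumptions; only the rules themselves (all required files must exist, and a timeout of 0.0 means no timeout) come from the commit message.

```python
from pathlib import Path


def is_cache_hit(cache_dir: Path, required_names) -> bool:
    """A hit requires every expected output file, not just one marker file."""
    return all((cache_dir / name).is_file() for name in required_names)


def effective_timeout(timeout):
    """Map 0.0 (and None) to None, which subprocess.run treats as no limit."""
    return None if not timeout else timeout
```

Passing the result of `effective_timeout()` as the `timeout` argument of `subprocess.run` avoids the immediate expiry that a literal `0.0` would cause.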
James R. Barlow
fcbdbac602 Update test_page_boxes MediaBox expectations for speculative PDF/A
When speculative PDF/A succeeds (verapdf available), Ghostscript is
bypassed and MediaBox is preserved rather than normalized to origin.
2026-01-09 01:25:31 -08:00
James R. Barlow
122450c19e Fix Ghostscript tests after default output type changed to 'auto'
- Add --output-type pdfa to tests that exercise Ghostscript-specific
  behavior (test_gs_render_failure, test_ghostscript_pdfa_failure,
  test_ghostscript_mandatory_color_conversion)
- Add Gs106WarningFilter to suppress expected Ghostscript 10.6.x JPEG
  encoding warning in test logs
2026-01-09 01:02:25 -08:00
James R. Barlow
0c4ee5af4e Add 'auto' output type for best-effort PDF/A without Ghostscript
- Add new '--output-type auto' option (now the default) that produces
  best-effort PDF/A without requiring Ghostscript
- When verapdf is available, use speculative PDF/A conversion
- Without verapdf, pass through as PDF/A if safe (input claims PDF/A
  or --force-ocr was used), otherwise output as regular PDF
- Make Ghostscript check conditional - only required for pdfa* output types
- Update soft error tests to explicitly use --output-type pdfa since they
  exercise Ghostscript failure modes
- Fix Tesseract OSD error handling to check both stdout and stderr for
  known non-fatal messages like "Too few characters"
2026-01-09 00:56:00 -08:00
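
Note on 0c4ee5af4e: the decision rules listed for '--output-type auto' amount to a short cascade. The sketch below restates them as a pure function; the function name and the returned labels are hypothetical, and the real pipeline may interleave these checks differently.

```python
def choose_auto_strategy(verapdf_available, input_claims_pdfa, force_ocr):
    """Return a label for how 'auto' output would be produced."""
    if verapdf_available:
        # Validate a speculative PDF/A conversion with verapdf.
        return "speculative-pdfa"
    if input_claims_pdfa or force_ocr:
        # Considered safe to pass through as PDF/A without validation.
        return "passthrough-pdfa"
    # Otherwise fall back to a regular PDF.
    return "pdf"
```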
James R. Barlow
bdc50e9470 Add explicit word spacing for pdfminer.six compatibility
Insert space characters between words in the fpdf2 renderer so PDF
readers like pdfminer.six can properly segment words during text
extraction. Some PDF readers rely on explicit space characters rather
than inferring word boundaries from positioning.

- Use itertools.pairwise to iterate consecutive word pairs
- Render space immediately after each word (content stream order matters)
- Skip space insertion between CJK words (no spaces in CJK text)
- Use 5% line height threshold to filter OCR noise
- Support RTL text direction
2026-01-08 16:32:14 -08:00
James R. Barlow
bb5238e524 Update tests to use new OcrmypdfPluginManager interface
Replace pm.hook.method() calls with pm.method() calls to match the
refactored plugin manager that now uses composition over inheritance.
The hook attribute is no longer directly exposed; instead, type-safe
methods are provided directly on the plugin manager class.
2026-01-08 13:09:19 -08:00
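
Note on bb5238e524: the composition pattern described above, where the manager wraps a hook dispatcher and exposes typed methods instead of a public `hook` attribute, can be illustrated without any plugin library. Both classes and the `validate` method below are hypothetical stand-ins, not the real `OcrmypdfPluginManager`.

```python
class HookRelaySketch:
    """Stand-in for the object that dispatches hook calls to plugins."""

    def __init__(self):
        self.calls = []

    def validate(self, **kwargs):
        self.calls.append(("validate", kwargs))
        return [True]


class PluginManagerSketch:
    """Holds the relay privately and forwards through explicit methods."""

    def __init__(self, relay):
        self._relay = relay  # composition: not exposed as `.hook`

    def validate(self, pdfinfo: object, options: object) -> list:
        # Callers write pm.validate(...) rather than pm.hook.validate(...).
        return self._relay.validate(pdfinfo=pdfinfo, options=options)
```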
James R. Barlow
900a60fd10 Add verapdf integration for speculative PDF/A conversion
Introduce a fast path for PDF/A conversion that uses pikepdf to add
PDF/A structures directly (sRGB ICC profile and XMP metadata), then
validates with verapdf. If validation passes, skip Ghostscript entirely.
If validation fails or verapdf is unavailable, fall back to the existing
Ghostscript conversion path.

New files:
- src/ocrmypdf/_exec/verapdf.py: CLI wrapper for verapdf validator
- tests/test_verapdf.py: Test suite for new functionality

Modified:
- pdfa.py: Add speculative_pdfa_conversion() and helpers
- _pipeline.py: Add try_speculative_pdfa() function
- _pipelines/_common.py: Integrate speculative path into postprocess()
2026-01-08 10:58:01 -08:00
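
Note on 900a60fd10: the fast path with fallback described above has the following general shape. The three callables are placeholders supplied by the caller, and treating `FileNotFoundError` as the signal that the validator is unavailable is an assumption; the actual functions are `speculative_pdfa_conversion()` and `try_speculative_pdfa()`, whose signatures are not shown in the commit message.

```python
def convert_to_pdfa(pdf, add_pdfa_structures, validate, ghostscript_convert):
    """Try the cheap conversion first; fall back if it cannot be validated."""
    candidate = add_pdfa_structures(pdf)
    try:
        ok = validate(candidate)
    except FileNotFoundError:
        # Assumed signal that the validator executable is not installed.
        ok = False
    if ok:
        return candidate  # the slow converter is skipped entirely
    return ghostscript_convert(pdf)
```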
James R. Barlow
bbd263ff48 Add tests for fpdf2 renderer and font infrastructure
- Add hOCR test fixtures for Latin, Arabic, CJK, Devanagari scripts
- Add tests for fpdf2 renderer, multi-font manager, system font provider
- Add multilingual rendering tests
- Update existing tests to use fpdf2 renderer
2026-01-06 13:46:11 -08:00
James R. Barlow
0d6e0c4560 Merge branch 'main' into dev 2025-12-24 00:44:18 -08:00
James R. Barlow
e613db6a82 Fix Ghostscript 10.6 JPEG corruption by repairing truncated images
Ghostscript 10.6 has a bug that truncates JPEG data by 1-15 bytes.
This adds detection and repair by comparing output images to input
images and restoring the original bytes when truncation is detected.

- Add warning when GS 10.6+ is used with PDF/A output
- Add _repair_gs106_jpeg_corruption() to fix damaged JPEGs after
  Ghostscript processing
- Add unit tests for the repair function
2025-12-23 14:56:24 -08:00
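
Note on e613db6a82: the detect-and-restore idea can be sketched on raw byte strings. This assumes the missing bytes are cut from the end of the stream and that the rest of the data is otherwise unchanged; `repair_truncated` is a hypothetical helper, not `_repair_gs106_jpeg_corruption()` itself.

```python
def repair_truncated(original: bytes, output: bytes, max_missing: int = 15) -> bytes:
    """Return the original bytes if output is a slightly truncated copy."""
    missing = len(original) - len(output)
    # Only repair when a few trailing bytes are gone and the rest matches.
    if 1 <= missing <= max_missing and original.startswith(output):
        return original
    return output
```

Data that was legitimately re-encoded does not share a prefix with the input, so it is left alone.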
James R. Barlow
742a4bac17 Make rotation test more robust 2025-12-23 11:20:57 -08:00
James R. Barlow
4c1ef0b471 Also process art and bleed boxes 2025-12-23 11:20:41 -08:00
James R. Barlow
eace567f7b Test and fix page box issues 2025-12-23 11:19:51 -08:00
James R. Barlow
16c2604a07 Remove lossy JBIG2 support, retain lossless JBIG2 only
Lossy JBIG2 has been removed due to well-documented risks of character
substitution errors (e.g., 6/8 confusion). The --jbig2-lossy and
--jbig2-page-group-size arguments are now deprecated and ignored with
a warning.

Changes:
- Remove jbig2_lossy and jbig2_page_group_size from OCROptions
- Simplify optimize.py to use single-image JBIG2 encoding only
  (no symbol dictionaries/JBIG2Globals)
- Remove convert_group() from jbig2enc.py
- Deprecate CLI args with warnings for backward compatibility
- Update documentation to explain lossless-only JBIG2
2025-12-23 02:45:07 -08:00
James R. Barlow
9ebba91466 Use plugin namespace access pattern throughout codebase
Migrate all code from flat accessor pattern (options.tesseract_timeout)
to the plugin namespace pattern (options.tesseract.timeout).

Key changes:
- Fix _get_plugin_options to raise AttributeError for unregistered
  namespaces instead of silently returning None
- Add _convert_value helper to convert PathLike to str for plugin
  model field compatibility
- Filter out _plugin_cache_* entries from JSON serialization to fix
  worker process serialization (test_simulate_oom_killer)
- Update tesseract_ocr.py, ghostscript.py, _validation_coordinator.py,
  and _pipelines/ocr.py to use options.tesseract.* and
  options.ghostscript.* accessors
- Update tests to use setup_plugin_infrastructure() for plugin
  model registration
2025-12-23 02:02:21 -08:00
James R. Barlow
aec995aced Require plugin model registration for namespace access in OCROptions
- Update __getattr__ docstring to clarify that plugin models must be
  registered for namespace access (e.g., options.tesseract.timeout)
- Update test_json_serialization.py to properly register TesseractOptions
  before accessing plugin namespaces
- Worker processes now register plugin models for multiprocessing tests
- Exclude plugin cache keys from extra_attrs comparison in tests
2025-12-22 15:09:55 -08:00
James R. Barlow
be425e7405 Refactor pdfinfo: split info.py into focused modules
Split the 1288-line info.py into smaller, single-responsibility modules:
- _types.py: Enums, type aliases, lookup dictionaries
- _contentstream.py: PDF content stream parsing, DPI calculation
- _image.py: ImageInfo class and image finding functions
- _worker.py: Concurrency/worker process handling
- info.py: PageInfo, PdfInfo classes (reduced to ~530 lines)

Public API unchanged - all existing imports continue to work.
2025-12-22 01:27:23 -08:00
James R. Barlow
b4f9673364 Add unit tests for HocrParser, PdfTextRenderer, and OcrElement
Comprehensive test coverage for the new hocrtransform components:

- test_ocr_element.py: Tests for BoundingBox, Baseline, FontInfo,
  OcrElement dataclass methods (iter_by_class, find_by_class,
  get_text_recursive, words/lines/paragraphs properties)

- test_hocr_parser.py: Tests for parsing hOCR files including
  page/paragraph/line/word extraction, RTL text, rotated text,
  different line types (header, caption), font info, and edge cases

- test_pdf_renderer.py: Tests for PDF rendering including text
  extraction verification, page sizing, multi-line content,
  text direction, baseline handling, textangle rotation, word breaks,
  debug options, and image overlay

Also fixes x_font regex pattern to not capture trailing semicolons.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-21 17:05:49 -08:00
James R. Barlow
e162361d28 Make rotation test more robust 2025-12-21 14:42:14 -08:00
James R. Barlow
22d00837e3 WIP box tests 2025-12-21 14:03:28 -08:00
James R. Barlow
0faba42d36 test: Don't save local files 2025-12-21 14:03:28 -08:00
James R. Barlow
41758766a1 Test and fix page box issues 2025-12-21 14:03:28 -08:00
James R. Barlow
3e46b039ed feat: add use_cropbox parameter to align rasterizer APIs
Added use_cropbox parameter to rasterize_pdf_page hook to allow
choosing between MediaBox and CropBox rendering:

- Default is use_cropbox=False (MediaBox) for consistency with
  Ghostscript's existing behavior
- Ghostscript: passes -dUseCropBox when use_cropbox=True
- pypdfium: calculates crop values to expand from CropBox to MediaBox
  when use_cropbox=False

This aligns both rasterizers to produce the same output dimensions
by default, making the rasterizer choice transparent for page
geometry.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-21 12:29:17 -08:00
James R. Barlow
ae783b4ae6 fix: add thread safety lock to pypdfium plugin
pypdfium2/PDFium is not thread-safe - concurrent calls from different
threads can crash or corrupt the process. Added a module-level lock to
serialize all pdfium operations.

PIL image processing and file I/O are done outside the lock since they
are thread-safe, minimizing lock contention.

For maximum parallelism, users can use process-based parallelism
(use_threads=False) where each process has its own pdfium instance.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-21 12:29:17 -08:00
James R. Barlow
b9f488d65c test: add comprehensive tests for --rasterizer option
Add test_rasterizer.py with tests covering:
- Basic rasterizer option validation ('auto', 'ghostscript', 'pypdfium')
- Rasterizer + --rotate-pages interaction
- PDFs with nonstandard MediaBox/TrimBox/CropBox
- Direct hook tests verifying plugins respect the option

Also fix pluggy parameter passing: make 'options' a required parameter
(no default) in the hookspec so pluggy forwards it to implementations.
Update test plugins and test_rotation.py to pass the new parameter.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-21 12:29:17 -08:00
James R. Barlow
0ad7f5fc13 feat: add dynamic nested access to plugin options
Completes Phase 5 of the CLI refactoring plan by enabling nested
plugin option access (e.g., options.tesseract.timeout) alongside
the legacy flat access (options.tesseract_timeout).

Changes:
- Add module-level plugin option model registry in _options.py
- Add __getattr__ to OCROptions for dynamic namespace access
- Register plugin models in setup_plugin_infrastructure()
- Add test for nested plugin option access

Plugin option instances are lazily created from flat field values
and cached for subsequent access.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-21 12:21:48 -08:00
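
Note on 0ad7f5fc13: nested access backed by flat fields, with lazily created and cached namespace objects, can be sketched as below. `OptionsSketch`, `TesseractOptionsSketch`, and `register_plugin_model` are hypothetical; the real `OCROptions` is a Pydantic model and its registry lives in `_options.py`.

```python
from dataclasses import dataclass

_PLUGIN_MODELS = {}  # module-level registry: namespace -> model class


def register_plugin_model(namespace, model_cls):
    _PLUGIN_MODELS[namespace] = model_cls


@dataclass
class TesseractOptionsSketch:
    timeout: float = 0.0


class OptionsSketch:
    def __init__(self, **flat):
        self._flat = flat   # legacy flat values, e.g. tesseract_timeout
        self._cache = {}    # namespace -> model instance

    def __getattr__(self, name):
        # Invoked only when normal attribute lookup fails.
        if name.startswith("_"):
            raise AttributeError(name)
        if name in self._flat:
            return self._flat[name]
        model_cls = _PLUGIN_MODELS.get(name)
        if model_cls is None:
            raise AttributeError(f"no plugin namespace registered as {name!r}")
        if name not in self._cache:
            prefix = name + "_"
            fields = {k[len(prefix):]: v
                      for k, v in self._flat.items() if k.startswith(prefix)}
            self._cache[name] = model_cls(**fields)  # built on first access
        return self._cache[name]


register_plugin_model("tesseract", TesseractOptionsSketch)
```

With this sketch, `OptionsSketch(tesseract_timeout=30.0).tesseract.timeout` and the flat `tesseract_timeout` attribute both return `30.0`.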
James R. Barlow
28d6ea0f10 feat: Add CLI generation methods to plugin option models
Co-authored-by: aider (openrouter/anthropic/claude-sonnet-4) <aider@aider.chat>
2025-12-21 12:21:48 -08:00
James R. Barlow
62ad37b276 refactor: centralize plugin manager setup
Co-authored-by: aider (openrouter/anthropic/claude-sonnet-4) <aider@aider.chat>
2025-12-21 12:21:48 -08:00
James R. Barlow
b1de6a6ad4 Add more cached tests 2025-12-21 12:21:48 -08:00
James R. Barlow
b7737446e4 cli: push up imports 2025-12-21 12:21:48 -08:00
James R. Barlow
42891346d1 fix: update test files to use new get_options_and_plugins function
Co-authored-by: aider (openrouter/anthropic/claude-sonnet-4) <aider@aider.chat>
2025-12-21 12:21:48 -08:00
James R. Barlow
ade3ecd5a1 fix: add error handling in hOCR pipeline 2025-12-21 12:21:47 -08:00
James R. Barlow
1f493ba789 refactor: post-AI code cleanup 2025-12-21 12:21:47 -08:00
James R. Barlow
60182ac8a8 fix: update JSON serialization tests to match default values
Co-authored-by: aider (openrouter/anthropic/claude-sonnet-4) <aider@aider.chat>
2025-12-21 12:21:47 -08:00
James R. Barlow
d77d63f1dc feat: add JSON serialization tests for OCROptions in multiprocessing
Co-authored-by: aider (openrouter/anthropic/claude-sonnet-4) <aider@aider.chat>
2025-12-21 12:21:47 -08:00
James R. Barlow
e1216eddb0 test: modify test_two_languages to use list of languages
Co-authored-by: aider (openrouter/anthropic/claude-sonnet-4) <aider@aider.chat>
2025-12-21 12:21:47 -08:00
James R. Barlow
f5bfd2fd3e Add compat function for pages_from_ranges
fix: handle Pydantic validation errors with correct exit code

Co-authored-by: aider (openrouter/anthropic/claude-sonnet-4) <aider@aider.chat>
2025-12-21 12:21:46 -08:00
James R. Barlow
66a3e8508e feat: add comprehensive validators to OCROptions
Co-authored-by: aider (openrouter/anthropic/claude-sonnet-4) <aider@aider.chat>
2025-12-13 11:42:23 -08:00
James R. Barlow
7bb3a97208 refactor: update test_validation.py to use Pydantic model validation
Co-authored-by: aider (openrouter/anthropic/claude-sonnet-4) <aider@aider.chat>
2025-12-13 11:42:23 -08:00
James R. Barlow
d556014185 Remove language warning 2025-12-13 11:41:58 -08:00
James R. Barlow
057eaff36d Skip devnull testing on Windows
No longer seems to work - Windows Server 2025 change, perhaps? Doesn't really matter.
2025-11-10 16:57:30 -08:00