The 2024 guard against runaway recursion in _find_image_xrefs_container
only deduplicated image xrefs, but Form XObject xrefs are never added to
include_xrefs/exclude_xrefs, so a self-referential or DAG-shaped Form
graph re-entered every branch until the depth limit fired -- producing
the reported flood of warnings (and minutes-long hangs) on PowerPoint
exports.
Thread a visited_forms set through the recursion so each Form XObject is
descended into at most once per document. With memoization in place the
depth limit is no longer a cycle defense, so demote its log to debug.
Add a regression test that synthesises a circular-Form PDF from the
existing formxobject.pdf fixture (no new binary fixture, no license
issues) and asserts zero "Recursion depth exceeded" warnings.
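A minimal sketch of the visited-set idea described above, not the actual _find_image_xrefs_container code. The function name, the `children` mapping and the integer xrefs are hypothetical stand-ins for the real pikepdf objects.

```python
from __future__ import annotations


def find_image_xrefs(
    node: int,
    children: dict[int, list[int]],
    image_xrefs: set[int],
    visited_forms: set[int] | None = None,
    found: set[int] | None = None,
) -> set[int]:
    """Collect image xrefs reachable from node, entering each Form once."""
    if visited_forms is None:
        visited_forms = set()
    if found is None:
        found = set()
    for child in children.get(node, ()):
        if child in image_xrefs:
            found.add(child)
        elif child not in visited_forms:
            # Mark before descending so cycles and shared (DAG) branches
            # are not re-entered.
            visited_forms.add(child)
            find_image_xrefs(child, children, image_xrefs, visited_forms, found)
    return found


# Forms 10 and 11 reference each other; 100 and 101 are images.
graph = {0: [10, 11], 10: [11, 100], 11: [10, 101]}
result = find_image_xrefs(0, graph, {100, 101})
```

Because the set is shared across the whole walk, the cost is linear in the number of Form XObjects rather than exponential in the depth limit.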
Split ocrmypdf.subprocess/__init__.py into three private submodules by
concern (_run, _version, _check) and reduce __init__ to re-exports.
Introduce ocrmypdf._exec._probe.ToolProbe to centralize the version()/
available() pattern each tool module was reimplementing, so the "is this
tool installed and suitable?" question is cleanly distinct from the
pure, picklable functions that do the work.
Also replace the ghostscript module-import log.addFilter() side effect
with an idempotent _ensure_log_filter_installed() called at the top of
each work function, so the DuplicateFilter is present in subprocess
workers without relying on import-time ordering.
Public API of ocrmypdf.subprocess is unchanged.
fpdf2 >= 2.8.7 emits a custom begincidchar Encoding CMap for CFF-based
CID fonts (e.g. NotoSansCJK). pdfminer.six returns <CMap: None> for such
CMaps, so text extraction yields empty output. Switch to pdftotext
(poppler), which handles the new encoding correctly.
The API previously clobbered PIL.Image.MAX_IMAGE_PIXELS unconditionally
on every call, so host applications (e.g. Paperless-NGX) that configured
the PIL limit before invoking ocrmypdf.ocr() saw their setting silently
overwritten with the 250 MP default. Make max_image_mpixels default to
None and only apply the override when the caller explicitly sets it.
The CLI default of 250 MP is unchanged.
Fixes #1665
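The new default behaviour can be sketched as follows. To keep the example free of dependencies, `Image` is a SimpleNamespace standing in for the PIL.Image module, and apply_pixel_limit is a hypothetical helper name, not the function in the codebase.

```python
from __future__ import annotations

from types import SimpleNamespace

# Stand-in for PIL.Image; a host application has already set its own limit.
Image = SimpleNamespace(MAX_IMAGE_PIXELS=500_000_000)


def apply_pixel_limit(max_image_mpixels: float | None = None) -> None:
    """Override the limit only when the caller asked for one."""
    if max_image_mpixels is None:
        return  # leave the host application's setting untouched
    Image.MAX_IMAGE_PIXELS = int(max_image_mpixels * 1_000_000)


apply_pixel_limit()  # API default: no override
untouched = Image.MAX_IMAGE_PIXELS
apply_pixel_limit(250)  # explicit request, as the CLI default would pass
overridden = Image.MAX_IMAGE_PIXELS
```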
The JPEG truncation bug (1-15 bytes) persists in Ghostscript 10.7.0.
Update the warning message to show the actual GS version instead of
hardcoding "10.6.x", and remove the stale date reference. Also make
the test warning filter match the new message format.
fpdf2's shape_text() produces RTL ligature glyphs (e.g. lam-alef) with
multi-character CMap entries whose character order gets reversed by the
bidi algorithm during text extraction, producing garbled output like
"سالح" instead of "سلاح".
For invisible text (the production OCR overlay path), bypass text shaping
and use encode_text() with pre-reversed strings. encode_text() maps
characters 1:1 in logical order, avoiding the ligature CMap issue. The
pre-reversal compensates for bidi reversal by text extractors. Since the
text is invisible (Tr=3), the lack of joining forms is harmless.
Add RTL text extraction tests that verify glyph stream order, ToUnicode
CMap 1:1 mappings, and correct logical order for Arabic (including
lam-alef ligature) and Hebrew scripts.
The inter-word Tz calculation stretched "word " to span from the current
word to the next, producing extreme horizontal scaling (300-500%) for
words far apart (e.g. in tables). Use per-word Tz instead; Td
positioning already handles inter-word gaps correctly.
Fixes #1635
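A small arithmetic illustration of why the inter-word calculation blew up. Tz is the PDF horizontal scaling percentage, so fitting text of a given natural width into a target width needs roughly 100 * target / natural. The helper name and all widths below are made up for illustration.

```python
def horizontal_scale(target_width: float, natural_width: float) -> float:
    """Tz percentage needed to stretch natural_width to target_width."""
    return 100.0 * target_width / natural_width


# A word whose glyphs are 30 pt wide, sitting in a 30 pt OCR box.
per_word_tz = horizontal_scale(target_width=30.0, natural_width=30.0)

# Old approach: stretch "word " (33 pt with the space) to reach the next
# word, which starts 120 pt away in a neighbouring table column.
inter_word_tz = horizontal_scale(target_width=120.0, natural_width=33.0)
```

With these numbers the per-word value is 100% while the inter-word value exceeds 360%, which matches the range of distortion reported above.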
Catch OSError (parent of both FileNotFoundError and
NotADirectoryError) in verapdf.available() so environments where
executing `verapdf` raises NotADirectoryError gracefully fall back
instead of crashing the pipeline. Fixes #1638.
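A sketch of the broader exception handling, assuming the probe simply tries to run the executable. The function name tool_available and the use of a --version flag are assumptions for this example, not the actual verapdf.available() body.

```python
import subprocess


def tool_available(executable: str) -> bool:
    """Return False instead of raising when the tool cannot be executed."""
    try:
        subprocess.run(
            [executable, "--version"],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        # OSError covers FileNotFoundError, NotADirectoryError and
        # PermissionError, so unusual PATH entries no longer crash the probe.
        return False
    return True


missing = tool_available("/nonexistent-dir-for-example/verapdf")
```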
Fixes #1642. Add an early check in check_requested_output_file() that
raises OutputFileAccessError (exit code 5) if the destination file
already exists and --no-overwrite is set. The option is wired through
CLI, OcrOptions, and the Python API.
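The early check might look roughly like this. The exception class below is a local stand-in carrying the exit code mentioned above, and the function signature is a guess for illustration rather than the real check_requested_output_file().

```python
from pathlib import Path


class OutputFileAccessError(Exception):
    """Local stand-in for the real exception; exit code 5 per the text."""

    exit_code = 5


def check_requested_output(output_file: str, no_overwrite: bool) -> None:
    """Fail before any processing if the destination must not be replaced."""
    if no_overwrite and Path(output_file).exists():
        raise OutputFileAccessError(
            f"Output file {output_file} already exists and overwriting "
            "is disabled"
        )


# "." always exists, so it serves as an existing destination here.
try:
    check_requested_output(".", no_overwrite=True)
    refused = False
except OutputFileAccessError:
    refused = True
```

Checking up front means the user learns about the conflict before OCR has run, not after.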
The jpg_quality and png_quality options default to None in the pydantic
model, but the fallback check only handled == 0. This caused a TypeError
when calling ocrmypdf.ocr() with optimize >= 2 without explicitly
setting quality values. Fixes #1641.
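The shape of the fallback fix, as a sketch: treat both None and 0 as "use the default". The helper name and the default of 75 are hypothetical and not taken from the optimizer code.

```python
from __future__ import annotations


def effective_quality(value: int | None, default: int) -> int:
    """Fall back to default when the option is unset (None) or zero."""
    # The previous check was equivalent to `value == 0`, so None passed
    # through and later failed in numeric comparisons with a TypeError.
    if value is None or value == 0:
        return default
    return value


unset = effective_quality(None, default=75)
zero = effective_quality(0, default=75)
explicit = effective_quality(40, default=75)
```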
Replace hatch-vcs dynamic versioning with a static version in _version.py
and pyproject.toml. Split CI into build.yml (test + stage draft release
on main) and release.yml (publish from draft on tag push). Docker images
are built on main pushes and re-tagged with the release version on tag
push without rebuilding.
The API's 'language' param was silently dropped because OcrOptions uses
'languages' (plural). Map language->languages in create_options() and
_pdf_to_hocr(), coercing bare strings to lists and splitting '+'
separated codes to match CLI behavior.
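The coercion described above could be sketched as a small normalizer. The function name normalize_languages is hypothetical; the real mapping lives in create_options() and _pdf_to_hocr().

```python
from __future__ import annotations

from collections.abc import Iterable


def normalize_languages(language: str | Iterable[str] | None) -> list[str]:
    """Turn 'eng+deu' or ['eng', 'fra+deu'] into a flat list of codes."""
    if language is None:
        return []
    if isinstance(language, str):
        language = [language]  # a bare string is one entry, not characters
    codes: list[str] = []
    for entry in language:
        codes.extend(code for code in entry.split("+") if code)
    return codes


from_string = normalize_languages("eng+deu")
from_list = normalize_languages(["eng", "fra+deu"])
```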