4316 Commits

Author SHA1 Message Date
dependabot[bot]
c355d927ba Bump gitpython from 3.1.46 to 3.1.47
Bumps [gitpython](https://github.com/gitpython-developers/GitPython) from 3.1.46 to 3.1.47.
- [Release notes](https://github.com/gitpython-developers/GitPython/releases)
- [Changelog](https://github.com/gitpython-developers/GitPython/blob/main/CHANGES)
- [Commits](https://github.com/gitpython-developers/GitPython/compare/3.1.46...3.1.47)

---
updated-dependencies:
- dependency-name: gitpython
  dependency-version: 3.1.47
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
2026-04-26 01:22:53 +00:00
James R. Barlow
c993857752 Fix Form XObject cycle detection in image xref scan (#1321)
The 2024 guard against runaway recursion in _find_image_xrefs_container
only deduplicated image xrefs, but Form XObject xrefs are never added to
include_xrefs/exclude_xrefs, so a self-referential or DAG-shaped Form
graph re-entered every branch until the depth limit fired -- producing
the reported flood of warnings (and minutes-long hangs) on PowerPoint
exports.

Thread a visited_forms set through the recursion so each Form XObject is
descended into at most once per document. With memoization in place the
depth limit is no longer a cycle defense, so demote its log to debug.

Add a regression test that synthesises a circular-Form PDF from the
existing formxobject.pdf fixture (no new binary fixture, no license
issues) and asserts zero "Recursion depth exceeded" warnings.
2026-04-25 00:48:25 -07:00
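Note on c993857752: the visited-set idea described in that message can be sketched as follows. This is a minimal illustration, not OCRmyPDF's actual `_find_image_xrefs_container`; the dict-based graph, the `images` set, and the name `collect_images` are hypothetical stand-ins for the PDF object structure.

```python
def collect_images(graph, images, xref, visited_forms=None, found=None):
    """Walk a Form XObject graph, descending into each form at most once.

    graph:  dict mapping a form xref to the xrefs it references (hypothetical).
    images: set of xrefs that are images rather than forms (hypothetical).
    """
    if visited_forms is None:
        visited_forms = set()
    if found is None:
        found = set()
    if xref in visited_forms:
        return found  # already descended: a cycle or a shared DAG node
    visited_forms.add(xref)
    for child in graph.get(xref, ()):
        if child in images:
            found.add(child)
        else:
            collect_images(graph, images, child, visited_forms, found)
    return found


# Form 1 references form 2, which references form 1 again (a cycle).
cyclic = {1: [2, 10], 2: [1, 11]}
result = collect_images(cyclic, images={10, 11}, xref=1)
```

Because each form is added to `visited_forms` before its children are examined, a self-referential graph terminates without relying on a depth limit.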
James R. Barlow
84f5fe9ee0 Separate probing from execution in _exec and subprocess modules
Split ocrmypdf.subprocess/__init__.py into three private submodules by
concern (_run, _version, _check) and reduce __init__ to re-exports.
Introduce ocrmypdf._exec._probe.ToolProbe to centralize the version()/
available() pattern each tool module was reimplementing, so the "is this
tool installed and suitable?" question is cleanly distinct from the
pure, picklable functions that do the work.

Also replace the ghostscript module-import log.addFilter() side effect
with an idempotent _ensure_log_filter_installed() called at the top of
each work function, so the DuplicateFilter is present in subprocess
workers without relying on import-time ordering.

Public API of ocrmypdf.subprocess is unchanged.
2026-04-24 13:33:34 -07:00
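Note on 84f5fe9ee0: an idempotent filter installer of the kind the message describes might look like the sketch below. The `DuplicateFilter` body, the logger name, and the function names are assumptions made for illustration; only the standard `logging` API is used.

```python
import logging


class DuplicateFilter(logging.Filter):
    """Stand-in filter that drops a record identical to the previous one."""

    def __init__(self):
        super().__init__()
        self._last = None

    def filter(self, record):
        message = record.getMessage()
        if message == self._last:
            return False
        self._last = message
        return True


log = logging.getLogger("example.ghostscript")  # hypothetical logger name


def ensure_log_filter_installed():
    """Install DuplicateFilter on `log` once; later calls are no-ops."""
    if not any(isinstance(f, DuplicateFilter) for f in log.filters):
        log.addFilter(DuplicateFilter())


def do_work():
    # Safe to call at the top of every work function, including in workers.
    ensure_log_filter_installed()
    return len(log.filters)


first = do_work()
second = do_work()
```

Calling the installer from each work function means the filter is present whichever process runs the function, without depending on which module was imported first.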
James R. Barlow
3336d67e77 Fix CJK test broken by fpdf2 2.8.7 CFF font encoding change
fpdf2 >= 2.8.7 emits a custom begincidchar Encoding CMap for CFF-based
CID fonts (e.g. NotoSansCJK). pdfminer.six returns <CMap: None> for such
CMaps, so text extraction yields empty output. Switch to pdftotext (poppler)
which handles the new encoding correctly.
v17.4.2
2026-04-19 23:26:30 -07:00
James R. Barlow
73e16e7821 Merge remote-tracking branch 'origin/dependabot/github_actions/codecov/codecov-action-6' 2026-04-19 13:59:42 -07:00
James R. Barlow
6f1d37d78f Merge remote-tracking branch 'origin/dependabot/github_actions/sigstore/gh-action-sigstore-python-3.3.0' 2026-04-19 13:59:30 -07:00
James R. Barlow
2ed82de2e0 Update uv.lock again - pygithub 2026-04-19 13:58:46 -07:00
James R. Barlow
c43903fa14 Bump version: v17.4.2 2026-04-19 13:45:34 -07:00
James R. Barlow
1c89cacfef Respect host-set PIL.Image.MAX_IMAGE_PIXELS in Python API
The API previously clobbered PIL.Image.MAX_IMAGE_PIXELS unconditionally
on every call, so host applications (e.g. Paperless-NGX) that configured
the PIL limit before invoking ocrmypdf.ocr() saw their setting silently
overwritten with the 250 MP default. Make max_image_mpixels default to
None and only apply the override when the caller explicitly sets it.
The CLI default of 250 MP is unchanged.

Fixes #1665
2026-04-19 13:44:57 -07:00
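Note on 1c89cacfef: the "only override when explicitly set" behaviour can be sketched as below. The `Image` object is a stand-in for the `PIL.Image` module so the example runs without Pillow, and `apply_pixel_limit` is a hypothetical helper, not the function OCRmyPDF uses.

```python
from types import SimpleNamespace

# Stand-in for the PIL.Image module; the initial value is arbitrary.
Image = SimpleNamespace(MAX_IMAGE_PIXELS=100_000_000)


def apply_pixel_limit(max_image_mpixels=None):
    """Override the limit only when the caller passes an explicit value."""
    if max_image_mpixels is not None:
        Image.MAX_IMAGE_PIXELS = int(max_image_mpixels * 1_000_000)
    return Image.MAX_IMAGE_PIXELS


Image.MAX_IMAGE_PIXELS = 500_000_000  # host application sets its own limit
kept = apply_pixel_limit()            # no argument: the host setting survives
overridden = apply_pixel_limit(250)   # explicit argument: the limit is replaced
```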
James R. Barlow
75714fe43e Update uv.lock
For Pillow vulnerability. Fixes #1669
2026-04-19 13:06:22 -07:00
dependabot[bot]
e371ce95ca Bump sigstore/gh-action-sigstore-python from 3.2.0 to 3.3.0
Bumps [sigstore/gh-action-sigstore-python](https://github.com/sigstore/gh-action-sigstore-python) from 3.2.0 to 3.3.0.
- [Release notes](https://github.com/sigstore/gh-action-sigstore-python/releases)
- [Changelog](https://github.com/sigstore/gh-action-sigstore-python/blob/main/CHANGELOG.md)
- [Commits](https://github.com/sigstore/gh-action-sigstore-python/compare/v3.2.0...v3.3.0)

---
updated-dependencies:
- dependency-name: sigstore/gh-action-sigstore-python
  dependency-version: 3.3.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
2026-04-06 10:50:26 +00:00
dependabot[bot]
716a2e22c3 Bump codecov/codecov-action from 5 to 6
Bumps [codecov/codecov-action](https://github.com/codecov/codecov-action) from 5 to 6.
- [Release notes](https://github.com/codecov/codecov-action/releases)
- [Changelog](https://github.com/codecov/codecov-action/blob/main/CHANGELOG.md)
- [Commits](https://github.com/codecov/codecov-action/compare/v5...v6)

---
updated-dependencies:
- dependency-name: codecov/codecov-action
  dependency-version: '6'
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
2026-04-06 10:50:22 +00:00
James R. Barlow
10e6019ada Bump version: v17.4.1 v17.4.1 2026-04-06 00:34:08 -07:00
James R. Barlow
89c76b5145 v17.4.1 release notes 2026-04-05 00:23:07 -07:00
James R. Barlow
83c04e6399 Update GS JPEG corruption warning for 10.7.0+
The JPEG truncation bug (1-15 bytes) persists in Ghostscript 10.7.0.
Update the warning message to show the actual GS version instead of
hardcoding "10.6.x", and remove the stale date reference. Also make
the test warning filter match the new message format.
2026-04-04 01:59:57 -07:00
James R. Barlow
7fdeeb3635 Refactor word_render_data tuple into WordRenderData dataclass 2026-04-04 01:43:25 -07:00
James R. Barlow
5be368fe75 Fix RTL text extraction order in fpdf2 renderer (#1655)
fpdf2's shape_text() produces RTL ligature glyphs (e.g. lam-alef) with
multi-character CMap entries whose character order gets reversed by the
bidi algorithm during text extraction, producing garbled output like
"سالح" instead of "سلاح".

For invisible text (the production OCR overlay path), bypass text shaping
and use encode_text() with pre-reversed strings. encode_text() maps
characters 1:1 in logical order, avoiding the ligature CMap issue. The
pre-reversal compensates for bidi reversal by text extractors. Since the
text is invisible (Tr=3), the lack of joining forms is harmless.

Add RTL text extraction tests that verify glyph stream order, ToUnicode
CMap 1:1 mappings, and correct logical order for Arabic (including
lam-alef ligature) and Hebrew scripts.
2026-04-04 01:40:38 -07:00
jbarlow
91c5b1e480 Merge pull request #1613 from bluebox-steven:add-options.work_folder-to-pdfcontext
Set work_folder in PdfContext options initialization
2026-04-03 01:28:18 -07:00
jbarlow
73154b97ba Merge pull request #1643 from ocrmypdf:dependabot/github_actions/actions/upload-artifact-7
Bump actions/upload-artifact from 6 to 7
2026-04-03 01:13:07 -07:00
jbarlow
76a40759ae Merge pull request #1644 from ocrmypdf:dependabot/github_actions/actions/download-artifact-8
Bump actions/download-artifact from 7 to 8
2026-04-03 01:12:43 -07:00
jbarlow
12ce565e98 Merge pull request #1646 from ocrmypdf:dependabot/github_actions/docker/setup-qemu-action-4
Bump docker/setup-qemu-action from 3 to 4
2026-04-03 01:12:02 -07:00
jbarlow
9f46126859 Merge pull request #1647 from ocrmypdf:dependabot/github_actions/docker/login-action-4
Bump docker/login-action from 3 to 4
2026-04-03 01:11:36 -07:00
jbarlow
11849e5a70 Merge pull request #1648 from ocrmypdf:dependabot/github_actions/docker/setup-buildx-action-4
Bump docker/setup-buildx-action from 3 to 4
2026-04-03 01:08:58 -07:00
jbarlow
e30c00cc26 Merge pull request #1649 from ocrmypdf:dependabot/uv/tornado-6.5.5
Bump tornado from 6.5.4 to 6.5.5
2026-04-03 01:07:59 -07:00
dependabot[bot]
001b403657 Bump tornado from 6.5.4 to 6.5.5
Bumps [tornado](https://github.com/tornadoweb/tornado) from 6.5.4 to 6.5.5.
- [Changelog](https://github.com/tornadoweb/tornado/blob/master/docs/releases.rst)
- [Commits](https://github.com/tornadoweb/tornado/compare/v6.5.4...v6.5.5)

---
updated-dependencies:
- dependency-name: tornado
  dependency-version: 6.5.5
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
2026-04-03 08:06:38 +00:00
jbarlow
851c61ee85 Merge pull request #1657 from ocrmypdf:dependabot/uv/cryptography-46.0.6
Bump cryptography from 46.0.5 to 46.0.6
2026-04-03 01:06:25 -07:00
jbarlow
f5ebd23b8f Merge pull request #1653 from ocrmypdf:dependabot/uv/requests-2.33.0
Bump requests from 2.32.5 to 2.33.0
2026-04-03 01:05:58 -07:00
jbarlow
81118c6195 Merge pull request #1658 from ocrmypdf:dependabot/uv/pygments-2.20.0
Bump pygments from 2.19.2 to 2.20.0
2026-04-03 01:05:22 -07:00
dependabot[bot]
834b60a02a Bump pygments from 2.19.2 to 2.20.0
Bumps [pygments](https://github.com/pygments/pygments) from 2.19.2 to 2.20.0.
- [Release notes](https://github.com/pygments/pygments/releases)
- [Changelog](https://github.com/pygments/pygments/blob/master/CHANGES)
- [Commits](https://github.com/pygments/pygments/compare/2.19.2...2.20.0)

---
updated-dependencies:
- dependency-name: pygments
  dependency-version: 2.20.0
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
2026-03-30 20:07:51 +00:00
dependabot[bot]
47e3b5b4d2 Bump cryptography from 46.0.5 to 46.0.6
Bumps [cryptography](https://github.com/pyca/cryptography) from 46.0.5 to 46.0.6.
- [Changelog](https://github.com/pyca/cryptography/blob/main/CHANGELOG.rst)
- [Commits](https://github.com/pyca/cryptography/compare/46.0.5...46.0.6)

---
updated-dependencies:
- dependency-name: cryptography
  dependency-version: 46.0.6
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
2026-03-29 02:02:03 +00:00
dependabot[bot]
d9346cc3d8 Bump requests from 2.32.5 to 2.33.0
Bumps [requests](https://github.com/psf/requests) from 2.32.5 to 2.33.0.
- [Release notes](https://github.com/psf/requests/releases)
- [Changelog](https://github.com/psf/requests/blob/main/HISTORY.md)
- [Commits](https://github.com/psf/requests/compare/v2.32.5...v2.33.0)

---
updated-dependencies:
- dependency-name: requests
  dependency-version: 2.33.0
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
2026-03-26 17:26:22 +00:00
James R. Barlow
4e974ebd46 Bump version: v17.4.0 v17.4.0 2026-03-21 01:43:13 -07:00
James R. Barlow
6f2b8408c1 v17.4.0 release notes 2026-03-21 01:43:03 -07:00
James R. Barlow
1dba941261 Add cyclopts for dev 2026-03-21 01:37:48 -07:00
James R. Barlow
ef76625abb Fix text stretching in fpdf2 renderer for widely-spaced words
The inter-word Tz calculation stretched "word " to span from the current
word to the next, producing extreme horizontal scaling (300-500%) for
words far apart (e.g. in tables). Use per-word Tz instead — Td
positioning already handles inter-word gaps correctly.

Fixes #1635
2026-03-16 16:00:00 -07:00
James R. Barlow
57bb554a70 Fix verapdf NotADirectoryError crash on some platforms
Catch OSError (parent of both FileNotFoundError and
NotADirectoryError) in verapdf.available() so environments where
executing `verapdf` raises NotADirectoryError gracefully fall back
instead of crashing the pipeline. Fixes #1638.
2026-03-10 02:08:59 -07:00
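Note on 57bb554a70: catching `OSError` covers both exception types because they are its subclasses in Python's built-in hierarchy. The sketch below shows the pattern with a hypothetical `tool_available` helper; it is not the actual `verapdf.available()` implementation, and the `--version` argument is an assumption.

```python
import subprocess


def tool_available(program):
    """Return True if `program --version` can be executed successfully."""
    try:
        subprocess.run(
            [program, "--version"],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        # OSError covers FileNotFoundError and NotADirectoryError alike.
        return False
    return True


both_are_oserrors = issubclass(FileNotFoundError, OSError) and issubclass(
    NotADirectoryError, OSError
)
missing = tool_available("definitely-not-a-real-tool-xyz")
```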
James R. Barlow
5b9d6f979e Add --no-overwrite / -n option to prevent overwriting output files
Fixes #1642. Adds an early check in check_requested_output_file() that
raises OutputFileAccessError (exit code 5) if the destination file
already exists and --no-overwrite is set. The option is wired through
CLI, OcrOptions, and the Python API.
2026-03-10 01:58:57 -07:00
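Note on 5b9d6f979e: an early existence check of this kind can be sketched as follows. `OutputExistsError` and `check_output_file` are hypothetical names used for illustration; the real code raises `OutputFileAccessError` from `check_requested_output_file()`, whose signature is not shown in the message.

```python
import tempfile
from pathlib import Path


class OutputExistsError(Exception):
    """Hypothetical stand-in for the error raised when the output exists."""


def check_output_file(output_file, no_overwrite=False):
    """Fail early if the destination exists and overwriting is disallowed."""
    path = Path(output_file)
    if no_overwrite and path.exists():
        raise OutputExistsError(f"{path} already exists")
    return path


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "out.pdf"
    target.write_bytes(b"")  # simulate an existing output file
    try:
        check_output_file(target, no_overwrite=True)
        refused = False
    except OutputExistsError:
        refused = True
```

Performing the check before any processing starts avoids spending time on OCR only to fail when the result is written.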
James R. Barlow
b588e3bfd7 Fix optimize=2/3 crash when using Python API
The jpg_quality and png_quality options default to None in the pydantic
model, but the fallback check only handled == 0. This caused a TypeError
when calling ocrmypdf.ocr() with optimize >= 2 without explicitly
setting quality values. Fixes #1641.
2026-03-10 01:51:07 -07:00
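Note on b588e3bfd7: a fallback that treats both `None` and `0` as "unset" can be sketched as below. The default value and the helper name are hypothetical, chosen only for illustration. One plausible way such a `TypeError` arises is that a later ordering comparison or arithmetic operation receives `None`, which Python 3 rejects.

```python
DEFAULT_JPG_QUALITY = 75  # hypothetical default, for illustration only


def effective_quality(requested, default=DEFAULT_JPG_QUALITY):
    """Treat both None (unset in the API) and 0 as 'use the default'."""
    if requested is None or requested == 0:
        return default
    return requested


try:
    None > 0  # ordering comparisons against None raise in Python 3
    comparison_failed = False
except TypeError:
    comparison_failed = True
```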
dependabot[bot]
a35dd1f9ee Bump docker/setup-buildx-action from 3 to 4
Bumps [docker/setup-buildx-action](https://github.com/docker/setup-buildx-action) from 3 to 4.
- [Release notes](https://github.com/docker/setup-buildx-action/releases)
- [Commits](https://github.com/docker/setup-buildx-action/compare/v3...v4)

---
updated-dependencies:
- dependency-name: docker/setup-buildx-action
  dependency-version: '4'
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
2026-03-09 11:18:20 +00:00
dependabot[bot]
bf46f4fe35 Bump docker/login-action from 3 to 4
Bumps [docker/login-action](https://github.com/docker/login-action) from 3 to 4.
- [Release notes](https://github.com/docker/login-action/releases)
- [Commits](https://github.com/docker/login-action/compare/v3...v4)

---
updated-dependencies:
- dependency-name: docker/login-action
  dependency-version: '4'
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
2026-03-09 11:18:16 +00:00
dependabot[bot]
55b76338a8 Bump docker/setup-qemu-action from 3 to 4
Bumps [docker/setup-qemu-action](https://github.com/docker/setup-qemu-action) from 3 to 4.
- [Release notes](https://github.com/docker/setup-qemu-action/releases)
- [Commits](https://github.com/docker/setup-qemu-action/compare/v3...v4)

---
updated-dependencies:
- dependency-name: docker/setup-qemu-action
  dependency-version: '4'
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
2026-03-09 11:18:10 +00:00
dependabot[bot]
2af7b1c179 Bump actions/download-artifact from 7 to 8
Bumps [actions/download-artifact](https://github.com/actions/download-artifact) from 7 to 8.
- [Release notes](https://github.com/actions/download-artifact/releases)
- [Commits](https://github.com/actions/download-artifact/compare/v7...v8)

---
updated-dependencies:
- dependency-name: actions/download-artifact
  dependency-version: '8'
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
2026-03-02 11:32:46 +00:00
dependabot[bot]
69f4cca9b6 Bump actions/upload-artifact from 6 to 7
Bumps [actions/upload-artifact](https://github.com/actions/upload-artifact) from 6 to 7.
- [Release notes](https://github.com/actions/upload-artifact/releases)
- [Commits](https://github.com/actions/upload-artifact/compare/v6...v7)

---
updated-dependencies:
- dependency-name: actions/upload-artifact
  dependency-version: '7'
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
2026-03-02 11:32:40 +00:00
James R. Barlow
59190ef643 Bump version: v17.3.0 v17.3.0 2026-02-21 00:00:26 -08:00
James R. Barlow
910ccccc7d Fix bump-version 2026-02-21 00:00:14 -08:00
James R. Barlow
0c15ff594c v17.3.0 release notes 2026-02-20 23:52:48 -08:00
James R. Barlow
e19ea653aa Switch to static versioning and two-workflow release model
Replace hatch-vcs dynamic versioning with static version in _version.py
and pyproject.toml. Split CI into build.yml (test + stage draft release
on main) and release.yml (publish from draft on tag push). Docker images
are built on main pushes and re-tagged with the release version on tag
push without rebuilding.
2026-02-20 23:34:03 -08:00
James R. Barlow
a899f0d59a Split release_notes into parts for each major release 2026-02-20 18:19:31 -08:00
James R. Barlow
b4e8e9dac9 Fix Python API ignoring language parameter (fixes #1640)
The API's 'language' param was silently dropped because OcrOptions uses
'languages' (plural). Map language->languages in create_options() and
_pdf_to_hocr(), coercing bare strings to lists and splitting '+'
separated codes to match CLI behavior.
2026-02-20 17:10:57 -08:00
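Note on b4e8e9dac9: the coercion described in that message (bare string to list, then splitting `+`-separated codes) can be sketched as follows. `normalize_languages` is a hypothetical helper, not the code in `create_options()`.

```python
def normalize_languages(language):
    """Coerce a language argument into a flat list of language codes."""
    if language is None:
        return []
    if isinstance(language, str):
        language = [language]  # a bare string becomes a one-item list
    codes = []
    for item in language:
        # "eng+deu" is split into ["eng", "deu"]; empty parts are dropped.
        codes.extend(part for part in item.split("+") if part)
    return codes


single = normalize_languages("eng")
combined = normalize_languages("eng+deu")
mixed = normalize_languages(["eng", "fra+deu"])
```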
James R. Barlow
aca5eb626b Docker: increase alpine version to 3.23 2026-02-20 11:06:33 -08:00