767 Commits

Author SHA1 Message Date
James R. Barlow
1c89cacfef Respect host-set PIL.Image.MAX_IMAGE_PIXELS in Python API
The API previously clobbered PIL.Image.MAX_IMAGE_PIXELS unconditionally
on every call, so host applications (e.g. Paperless-NGX) that configured
the PIL limit before invoking ocrmypdf.ocr() saw their setting silently
overwritten with the 250 MP default. Make max_image_mpixels default to
None and only apply the override when the caller explicitly sets it.
The CLI default of 250 MP is unchanged.

Fixes #1665
2026-04-19 13:44:57 -07:00
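The behaviour described in commit 1c89cacfef above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the actual ocrmypdf code: `Image` is a stand-in object for the `PIL.Image` module so the sketch runs without Pillow, and `apply_pixel_limit` is a hypothetical helper name.

```python
from __future__ import annotations

from types import SimpleNamespace

# Stand-in for the PIL.Image module (assumption: only the one attribute matters here).
Image = SimpleNamespace(MAX_IMAGE_PIXELS=89_478_485)


def apply_pixel_limit(max_image_mpixels: float | None = None) -> None:
    """Hypothetical helper: override the limit only when the caller asks for it."""
    if max_image_mpixels is None:
        return  # leave whatever the host application configured untouched
    Image.MAX_IMAGE_PIXELS = int(max_image_mpixels * 1_000_000)


# A host application configures its own limit first...
Image.MAX_IMAGE_PIXELS = 500_000_000
apply_pixel_limit()  # ...and a call without an explicit value preserves it.
host_limit_preserved = Image.MAX_IMAGE_PIXELS == 500_000_000

apply_pixel_limit(250)  # An explicit value (in megapixels) still overrides.
explicit_limit = Image.MAX_IMAGE_PIXELS
```

Using `None` as the default lets the function tell "not specified" apart from "explicitly set to 250", which a numeric default cannot do.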
James R. Barlow
89c76b5145 v17.4.1 release notes 2026-04-05 00:23:07 -07:00
James R. Barlow
6f2b8408c1 v17.4.0 release notes 2026-03-21 01:43:03 -07:00
James R. Barlow
0c15ff594c v17.3.0 release notes 2026-02-20 23:52:48 -08:00
James R. Barlow
a899f0d59a Split release_notes into parts for each major release 2026-02-20 18:19:31 -08:00
James R. Barlow
c85c8941d3 Fix pdftotext word spacing by emitting single BT block per line
poppler/pdftotext does not carry Tz (horizontal scaling) across
BT/ET boundaries, causing words to appear on separate lines.
Replace per-word BT blocks (via fpdf2's cell/set_stretching API)
with a single BT block per line using raw PDF operators. Each
non-last word gets a trailing space with Tz calculated to span
exactly to the next word's start position.
2026-02-11 00:42:10 -08:00
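The Tz calculation described in commit c85c8941d3 above might look something like the sketch below. It is not the project's renderer: the fixed-width font metric (`em_fraction`), the font resource name `/F1`, and the choice of 100% scaling for the last word are all assumptions made for illustration, and PDF string escaping is omitted (words are assumed to contain no parentheses or backslashes).

```python
def natural_width(text: str, font_size: float, em_fraction: float = 0.6) -> float:
    """Assumed monospace metric: every glyph advances em_fraction * font_size."""
    return len(text) * em_fraction * font_size


def line_operators(words: list[tuple[str, float]], y: float, font_size: float) -> str:
    """Build a single BT ... ET block for one line of (word, x_start) pairs."""
    ops = ["BT", f"/F1 {font_size:g} Tf", f"{words[0][1]:g} {y:g} Td"]
    for i, (word, x) in enumerate(words):
        if i + 1 < len(words):
            # Non-last word: append a space and stretch word + space so the
            # text position lands exactly on the next word's start.
            text = word + " "
            next_x = words[i + 1][1]
            tz = 100 * (next_x - x) / natural_width(text, font_size)
        else:
            # Last word: no trailing space; 100% scaling is an assumption here.
            text = word
            tz = 100.0
        ops.append(f"{tz:.2f} Tz")
        ops.append(f"({text}) Tj")
    ops.append("ET")
    return "\n".join(ops)


example = line_operators([("Hello", 72.0), ("world", 120.0)], y=700.0, font_size=10.0)
```

Because every word is shown inside one BT/ET pair, the Tz value set for one word stays in effect until the next Tz operator, so a text extractor never sees scaling reset at a block boundary.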
James R. Barlow
5c83dab8a7 Fix fpdf text mode in multi-page renderer; add v17.2.0 release notes
The previous fix (e62e73e4) only corrected text_rendering_mode →
text_mode in the single-page Fpdf2PdfRenderer, but the main OCR
pipeline uses Fpdf2MultiPageRenderer which still had the old
attribute name. Since fpdf2 has no text_rendering_mode property,
setting it silently created a no-op attribute while text_mode stayed
at FILL — so 3 Tr (invisible text) was never emitted.

Fixes #1631, #1632
2026-02-10 14:12:49 -08:00
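The failure mode described in commit 5c83dab8a7 above is a general Python pitfall: assigning to a misspelled attribute name succeeds silently. A minimal sketch follows, using a hypothetical `Renderer` class as a stand-in rather than fpdf2 itself; the `StrictRenderer` variant shows one possible guard and is not something the commit claims to have added.

```python
class Renderer:
    """Hypothetical stand-in for a class that exposes a text_mode property."""

    def __init__(self) -> None:
        self._text_mode = "FILL"

    @property
    def text_mode(self) -> str:
        return self._text_mode

    @text_mode.setter
    def text_mode(self, value: str) -> None:
        self._text_mode = value


r = Renderer()
r.text_rendering_mode = "INVISIBLE"  # no such property: Python just adds a new attribute
mode_after_typo = r.text_mode        # the real setting is unchanged

r.text_mode = "INVISIBLE"            # the correct name goes through the property setter
mode_after_fix = r.text_mode


class StrictRenderer(Renderer):
    """Variant that rejects unknown attribute names instead of accepting them."""

    def __setattr__(self, name: str, value: object) -> None:
        if name != "_text_mode" and not hasattr(type(self), name):
            raise AttributeError(f"unknown attribute: {name}")
        super().__setattr__(name, value)


try:
    StrictRenderer().text_rendering_mode = "INVISIBLE"
    typo_rejected = False
except AttributeError:
    typo_rejected = True
```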
James R. Barlow
1684982cde Further adjustments to install docs 2026-02-06 17:17:44 -08:00
James R. Barlow
4d97dfd218 Update installation docs for modern tooling
- Prioritize uv over pip throughout, with uv as the recommended installer
- Update repology badges: Debian 13, Ubuntu 24.04, Fedora 40/41
- Make Python 3.12 the default (3.11 still supported)
- Promote Homebrew as full-featured option for macOS and Linux
- Add dependency summary table aligned with maintainers.md
- Document uharfbuzz and fonts-noto requirements
- Remove outdated warnings and simplify 32-bit section
2026-02-05 15:04:12 -08:00
James R. Barlow
9d8aa5a0c3 v17.1.0 release notes 2026-01-30 16:15:50 -08:00
James R. Barlow
3abe8f71c7 v17.0.1 release notes 2026-01-30 00:15:13 -08:00
James R. Barlow
c5d3ef4b17 Tighten ruff rules and modernize style 2026-01-27 14:04:52 -08:00
James R. Barlow
db9f94de14 Ensure Noto font is installed where needed 2026-01-20 19:50:47 -08:00
James R. Barlow
37e7131a01 Drop support for Python 3.10, require Python 3.11+
Python 3.11 is now the minimum supported version. This aligns with
the codebase's use of StrEnum (introduced in 3.11) and removes
compatibility shims that were only needed for older versions.
2026-01-20 11:54:55 -08:00
James R. Barlow
4b16228a4a docs: minor adjustments 2026-01-20 10:41:55 -08:00
James R. Barlow
99f8106936 Update API documentation for OcrOptions-first calling convention
Document the new v17 API style where OcrOptions can be passed directly
to ocr(). Mark the positional argument style as legacy API for <v17
compatibility. Update examples to use modern syntax.
2026-01-20 10:30:33 -08:00
James R. Barlow
2f4280b66c Comprehensive documentation update in preparation for v17 2026-01-16 01:38:47 -08:00
James R. Barlow
6cf9d1c6ee Update release notes 2026-01-15 23:29:29 -08:00
James R. Barlow
6a7164a76c Update release notes with branch changes 2026-01-15 23:25:51 -08:00
James R. Barlow
bf76c8270c Rationalize optional dependencies vs dependency groups
Establish clear separation between user-facing optional dependencies
and developer-only dependency groups:

**Optional Dependencies (user features):**
- watcher: File watching service for batch processing
- webservice: Streamlit-based web UI
- Installable via: uv sync --extra <name> or pip install ocrmypdf[name]

**Dependency Groups (developer tools):**
- test: Testing infrastructure (merged from test + extended_test)
- docs: Documentation building tools
- streamlit-dev: Enhanced Streamlit development tools
- dev: General development tools (mypy, ipykernel)
- Installable via: uv sync --group <name> (uv only, NOT pip)

Breaking changes for developers:
- pip install -e .[test] no longer works → use uv sync --group test
- pip install -e .[docs] no longer works → use uv sync --group docs
- pip install -e .[extended_test] removed → merged into test group

No breaking changes for end users:
- pip install ocrmypdf[watcher] still works
- pip install ocrmypdf[webservice] still works

Updated:
- CI/CD workflows to use uv sync --group test
- Docker images to exclude test dependencies
- Documentation to recommend uv with pip as fallback
- pyproject.toml with clear comments explaining both systems
2026-01-13 00:34:55 -08:00
James R. Barlow
740f67091c Rename OCROptions to OcrOptions for consistency
Technically OCROptions is more Pythonic but we have several pre-existing classes named OcrWhatever. Go with the local flow.
2026-01-12 23:37:54 -08:00
James R. Barlow
36dea181e6 Update cookbook: Replace --tesseract-timeout 0 with --ocr-engine none
Update documentation examples to use the new --ocr-engine none option
instead of the deprecated --tesseract-timeout 0 idiom for disabling OCR.
2026-01-12 23:28:14 -08:00
James R. Barlow
0d6e0c4560 Merge branch 'main' into dev 2025-12-24 00:44:18 -08:00
James R. Barlow
94d7735862 docs: missing issue ref 2025-12-24 00:14:24 -08:00
James R. Barlow
c540967429 docs: Update release notes 2025-12-23 15:44:44 -08:00
James R. Barlow
6ada11ddae docs: Update release notes 2025-12-23 15:05:49 -08:00
James R. Barlow
01a3706281 docs: Add release notes for v16.13.0 2025-12-23 15:01:22 -08:00
James R. Barlow
16c2604a07 Remove lossy JBIG2 support, retain lossless JBIG2 only
Lossy JBIG2 has been removed due to well-documented risks of character
substitution errors (e.g., 6/8 confusion). The --jbig2-lossy and
--jbig2-page-group-size arguments are now deprecated and ignored with
a warning.

Changes:
- Remove jbig2_lossy and jbig2_page_group_size from OCROptions
- Simplify optimize.py to use single-image JBIG2 encoding only
  (no symbol dictionaries/JBIG2Globals)
- Remove convert_group() from jbig2enc.py
- Deprecate CLI args with warnings for backward compatibility
- Update documentation to explain lossless-only JBIG2
2025-12-23 02:45:07 -08:00
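The "deprecated and ignored with a warning" behaviour described in commit 16c2604a07 above can be approximated with a sketch like the one below. The parser layout, the `parse` function, and the use of `warnings.warn` are assumptions for illustration; the commit does not say how ocrmypdf emits its warning.

```python
import argparse
import warnings


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="example")
    # Deprecated flags stay accepted so existing command lines keep working,
    # but they are hidden from --help output.
    parser.add_argument("--jbig2-lossy", action="store_true", help=argparse.SUPPRESS)
    parser.add_argument("--jbig2-page-group-size", type=int, help=argparse.SUPPRESS)
    return parser


def parse(argv: list[str]) -> argparse.Namespace:
    """Hypothetical entry point: warn about deprecated flags, then ignore them."""
    args = build_parser().parse_args(argv)
    if args.jbig2_lossy:
        warnings.warn("--jbig2-lossy is deprecated and ignored", DeprecationWarning, stacklevel=2)
    if args.jbig2_page_group_size is not None:
        warnings.warn(
            "--jbig2-page-group-size is deprecated and ignored", DeprecationWarning, stacklevel=2
        )
    # Drop the values so nothing downstream can act on them.
    del args.jbig2_lossy, args.jbig2_page_group_size
    return args


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    parsed = parse(["--jbig2-lossy", "--jbig2-page-group-size", "10"])
    warning_count = len(caught)
```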
James R. Barlow
a4ee513cd4 refactor: clean up deprecated code and update plugin docs
- Remove outdated Phase comments from _options.py and cli.py
- Remove unused methods from PluginOptionRegistry:
  - get_extended_options_model() - replaced by __getattr__ in OCROptions
  - map_legacy_options() - unused
  - validate_plugin_options() - unused
- Update plugin documentation to document register_options hook
- Add documentation for nested plugin option access pattern

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-21 12:21:48 -08:00
James R. Barlow
530186b468 docs: update documentation for OCROptions plugin interface migration
Co-authored-by: aider (openrouter/anthropic/claude-sonnet-4) <aider@aider.chat>
2025-12-21 12:21:46 -08:00
rugk
8d715c4157 docs: fix and clarify podman usage instructions (#1601)
* docs: fix and clarify podman usage instructions

* the full reference `jbarlow83/ocrmypdf-alpine` as in the other commands may fix an issue if you do not have `ocrmypdf` already downloaded locally
* also clarified the command at the end for usage when SELinux is enabled

* docs: clarify difference between SELinux and rootless user mapping
2025-12-01 13:07:09 -08:00
James R. Barlow
54ce09496c v16.12.0 release notes 2025-11-11 13:48:06 -08:00
James R. Barlow
f181307e50 v16.11.1 release notes 2025-10-16 10:59:13 +02:00
James R. Barlow
9a2c0cf6ff v16.11.0 release notes 2025-09-12 00:08:11 -07:00
clach04
d07231a7aa Doc typo plugins.md (#1568) 2025-09-08 12:07:51 -07:00
Christoph Dyllick-Brenzinger
74305e8741 Update batch.md (#1552)
Add two missing available parameters for watcher.py (used with docker):
- OCR_LOGLEVEL
- OCR_JSON_SETTINGS
2025-08-05 14:11:55 -07:00
Máté Gyöngyösi
d6b069d3fa Unify --tesseract-timeout flag syntax (#1546)
As pointed out at 
https://github.com/tldr-pages/tldr/pull/17175#discussion_r2192340014.
2025-07-08 11:40:58 -07:00
James R. Barlow
194ca699a8 v16.10.4 release notes 2025-07-07 12:36:15 -07:00
James R. Barlow
7ea940a3a6 v16.10.3 release notes 2025-06-13 00:28:33 -07:00
James R. Barlow
9f6e5a48ad Deny use of pikepdf 9.8.0 due to GlyphlessFont error 2025-05-27 12:16:19 -07:00
James R. Barlow
b166e86216 jbig2 doc: mention pkg-config
Closes #1484
2025-05-26 13:04:05 -07:00
James R. Barlow
7c5bed41f1 v16.10.1 2025-04-21 01:15:29 -07:00
James R. Barlow
3304498bdc Fix some anchors and markdown quirks 2025-04-21 00:50:26 -07:00
James R. Barlow
e4a8f7a354 Remove redundant optimizer content 2025-04-17 15:10:59 -07:00
James R. Barlow
d1a45e4abc Convert remaining rst -> md 2025-04-17 15:03:21 -07:00
James R. Barlow
3b9367fc69 Continuing rst -> md 2025-04-17 02:27:59 -07:00
James R. Barlow
92a78f611e rst -> md migration in progress 2025-04-17 02:10:40 -07:00
Ikko Eltociear Ashimine
0f5ccb71ca docs: update installation.rst
instal -> install
2025-03-09 01:27:25 +09:00
James R. Barlow
7b2dd892e5 v16.10.0 release notes 2025-02-26 15:16:18 -08:00
James R. Barlow
2a55ceadd0 Merge branch 'pr/rugk/1489' 2025-02-26 14:59:06 -08:00