45 Commits

Author SHA1 Message Date
James R. Barlow
5acf21651f ruff lint and format 2026-01-13 01:50:57 -08:00
James R. Barlow
3c94ada857 Fix tesseract_cache plugin to properly handle cache misses
- Check all required output files exist before declaring cache hit,
  not just stderr.bin
- Add 'hocr' to list of cached output file types
- Fix timeout=0.0 causing immediate timeout on cache miss by treating
  it as "no timeout"
2026-01-09 02:10:29 -08:00
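The two fixes described in this commit can be illustrated with a minimal sketch. It is not the plugin's actual code: the file names in `REQUIRED_OUTPUTS` and both function names are hypothetical, chosen only to show the idea of requiring every output file for a cache hit and of treating a zero timeout as "no timeout".

```python
import tempfile
from pathlib import Path
from typing import Iterable, Optional

# Hypothetical file names; the commit message only names 'stderr.bin' and the 'hocr' type.
REQUIRED_OUTPUTS = ("stderr.bin", "output.hocr")


def is_cache_hit(cache_dir: Path, required: Iterable[str] = REQUIRED_OUTPUTS) -> bool:
    """Declare a hit only if every required output file exists, not just one of them."""
    return all((cache_dir / name).is_file() for name in required)


def normalize_timeout(timeout: Optional[float]) -> Optional[float]:
    """Map 0 (or None) to None so it means 'no timeout' rather than 'expire immediately'."""
    return timeout if timeout else None


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        cache = Path(tmp)
        (cache / "stderr.bin").write_bytes(b"")
        print(is_cache_hit(cache))  # only one of the two required files exists
        (cache / "output.hocr").write_text("")
        print(is_cache_hit(cache))  # both required files exist
    print(normalize_timeout(0.0))
```

Passing `None` to the `timeout` argument of `subprocess.run` disables the timeout, which is why the sketch normalizes `0.0` to `None` before any such call.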
James R. Barlow
bbd263ff48 Add tests for fpdf2 renderer and font infrastructure
- Add hOCR test fixtures for Latin, Arabic, CJK, Devanagari scripts
- Add tests for fpdf2 renderer, multi-font manager, system font provider
- Add multilingual rendering tests
- Update existing tests to use fpdf2 renderer
2026-01-06 13:46:11 -08:00
James R. Barlow
3e46b039ed feat: add use_cropbox parameter to align rasterizer APIs
Added use_cropbox parameter to rasterize_pdf_page hook to allow
choosing between MediaBox and CropBox rendering:

- Default is use_cropbox=False (MediaBox) for consistency with
  Ghostscript's existing behavior
- Ghostscript: passes -dUseCropBox when use_cropbox=True
- pypdfium: calculates crop values to expand from CropBox to MediaBox
  when use_cropbox=False

This aligns both rasterizers to produce the same output dimensions
by default, making the rasterizer choice transparent for page
geometry.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-21 12:29:17 -08:00
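As a rough illustration of the behavior described in this commit, the sketch below shows the two halves of the alignment. Both function names are hypothetical and are not taken from the project. `-dUseCropBox` is the Ghostscript switch named in the message. How the computed margins are passed to pypdfium, including any sign convention, is not stated in the message and is left out.

```python
from typing import List, Tuple

# A PDF box as (left, bottom, right, top) in PDF points.
Box = Tuple[float, float, float, float]


def ghostscript_box_args(use_cropbox: bool) -> List[str]:
    """Extra Ghostscript arguments: request the CropBox only when asked to."""
    return ["-dUseCropBox"] if use_cropbox else []


def mediabox_margins(mediabox: Box, cropbox: Box) -> Box:
    """How far the MediaBox extends past the CropBox on each side.

    Returned as (left, bottom, right, top). A renderer that draws the CropBox
    by default would need to grow its canvas by these amounts to cover the
    MediaBox instead (assumption: the CropBox lies inside the MediaBox).
    """
    m_left, m_bottom, m_right, m_top = mediabox
    c_left, c_bottom, c_right, c_top = cropbox
    return (c_left - m_left, c_bottom - m_bottom, m_right - c_right, m_top - c_top)


if __name__ == "__main__":
    letter = (0.0, 0.0, 612.0, 792.0)
    cropped = (36.0, 36.0, 576.0, 756.0)
    print(ghostscript_box_args(use_cropbox=True))
    print(mediabox_margins(letter, cropped))
```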
James R. Barlow
b9f488d65c test: add comprehensive tests for --rasterizer option
Add test_rasterizer.py with tests covering:
- Basic rasterizer option validation ('auto', 'ghostscript', 'pypdfium')
- Rasterizer + --rotate-pages interaction
- PDFs with nonstandard MediaBox/TrimBox/CropBox
- Direct hook tests verifying plugins respect the option

Also fix pluggy parameter passing: make 'options' a required parameter
(no default) in the hookspec so pluggy forwards it to implementations.
Update test plugins and test_rotation.py to pass the new parameter.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-21 12:29:17 -08:00
James R. Barlow
065bddbc6c Reformat with ruff format 2024-04-07 00:25:32 -07:00
James R. Barlow
6a746a1cbb ruff linting/Python 3.10 cleanup 2024-02-14 12:41:51 -08:00
James R. Barlow
f6e90a5934 hOCR renderer is now default 2023-12-02 19:58:00 -08:00
James R. Barlow
fadc0cf69b Replace cryptic test error messages with more informative ones 2023-10-24 00:54:31 -07:00
James R. Barlow
c77ae4b34c Change hookspec to migrate generate_pdf parameters to an options object
Breaking change for PDF rendering plugins (although none are known to exist).
This provides better separation of Ghostscript specific concerns from
the generic plugin interface.
2023-09-20 15:40:18 -07:00
James R. Barlow
e3c813fc67 Added support for changing color conversion strategy 2023-09-20 01:08:15 -07:00
James R. Barlow
5f211ecf6f Add test cases for soft errors 2023-06-02 02:47:41 -07:00
James R. Barlow
e8ed510543 Add stop on soft render errors and option to override 2023-06-01 23:49:34 -07:00
James R. Barlow
9b8d14d16e Accept most of ruff's delinting 2023-04-14 00:45:34 -07:00
James R. Barlow
b7eb93eb79 Adopt ruff and fix prelim lints 2023-04-14 00:19:17 -07:00
James R. Barlow
6dbaebdc0c Merge branch 'master' into feature/drop-3.7 2022-09-15 23:00:27 -07:00
James R. Barlow
2e937dee9f Refactor cache manifest creation 2022-08-19 00:19:38 -07:00
James R. Barlow
acc70036cc Set minimum Tesseract to 4.1.1 2022-08-02 15:20:29 -07:00
James R. Barlow
67773da309 Drop support for Ghostscript <9.50 2022-08-02 15:01:10 -07:00
James R. Barlow
80ed2117cc Change to SPDX license tracking 2022-07-28 01:10:07 -07:00
James R. Barlow
dc6f1a266a Modernize type annotations 2022-07-23 00:39:24 -07:00
James R. Barlow
e6aa3a4299 tests: explain why CacheOcrEngine needs lock 2022-04-05 16:16:51 -07:00
James Barlow
776ada6713 Upgrade pre-commit and associated tools; various lints 2022-04-03 20:53:01 -07:00
James Barlow
dfe31a2f6d Add lock to certain "with patch" cases
The switch to --use-threads seems to have broken tests that assumed they
could monkeypatch things. That's odd, though: while we can have multiple
worker threads, we should never have parallel tests in the same process.
2022-04-03 17:22:04 -07:00
James R. Barlow
9de06f62ee Use Python executors instead of pools
ProcessPool/ThreadPool don't have the ability to notice when a child worker
was terminated. ProcessPoolExecutor and ThreadPoolExecutor do notice and
provide better error messages.

Add tests to check.
2021-12-06 15:38:27 -08:00
James R. Barlow
4c1ff1086c tess cache: don't include full platform - could be sensitive 2021-12-06 15:38:26 -08:00
James R. Barlow
a55ab05d16 Replace leptonica deskew with tesseract find skew and pillow rotate
Also rebuild the cache.
2021-11-12 16:35:08 -08:00
James R. Barlow
064f935699 Fix page rotation regression
Page size fixes in commit b26749 accounted for a "kept" rotation,
but not a corrected rotation.

Fixes #730.
2021-02-15 01:47:09 -08:00
James R. Barlow
babc76fa74 tests: assert that most patched functions are called
We were not actually checking whether the functions we patched were called
when expected.
2020-12-28 23:58:33 -08:00
James R. Barlow
81602cf420 Fix test not patching properly after Ghostscript polling change 2020-12-27 16:01:50 -08:00
James R. Barlow
f11bb53e61 Change prefix of temporary folders
Shouldn't really use a name that suggests a connection to GitHub.
2020-12-07 21:51:46 -08:00
James R. Barlow
ce0e0ecd4d Decouple tqdm from progressbar setup 2020-12-04 13:20:28 -08:00
James R. Barlow
7e1223c12c ghostscript: add output tracing 2020-11-29 14:53:35 -08:00
James R. Barlow
fef14778d5 Fix missing f-string in log message 2020-06-22 01:17:16 -07:00
James R. Barlow
64891c2fc3 Pre-release delinting 2020-06-09 15:27:14 -07:00
James R. Barlow
0f942fb714 Rename ocrmypdf.exec -> ocrmypdf._exec 2020-06-09 14:59:09 -07:00
James R. Barlow
be8ca589d4 Move ocrmypdf.exec.run and friends to ocrmypdf.subprocess 2020-06-09 14:53:10 -07:00
James R. Barlow
2059e916da Convert all ghostscript spoofs to test plugins 2020-06-09 00:00:25 -07:00
James R. Barlow
a9a473f2e5 Convert all tesseract cache usages to plugin 2020-06-05 17:55:18 -07:00
James R. Barlow
6268e2faff Begin replacing tests/spoof/tesseract_cache with plugin 2020-06-05 17:27:10 -07:00
James R. Barlow
ec3f506500 Convert tesseract_badutf8 to plugin 2020-06-05 16:38:19 -07:00
James R. Barlow
1b92f447c3 Convert tesseract_crash to plugin 2020-06-02 02:36:41 -07:00
James R. Barlow
82e7eb91d2 Tidy tesseract_noop 2020-06-02 01:50:02 -07:00
James R. Barlow
4f4ad0fb76 Convert tesseract_big_image_error to plugin 2020-06-02 01:49:47 -07:00
James R. Barlow
2b23f7ec73 tesseract_noop: begin implementing with plugin 2020-06-01 02:45:49 -07:00