Commit Graph

821 Commits

Author SHA1 Message Date
James R. Barlow
4fc0c3a0d5 Add watcher test, such as it is 2025-08-13 01:04:58 -07:00
James R. Barlow
175b743ffe Fix version test 2025-07-03 11:30:05 -07:00
James R. Barlow
45cf92f40b xfail Python logging bug in 3.13.3/4 2025-07-03 09:21:31 -07:00
James R. Barlow
3beabf55e7 Skip optimizing images with pre-blended soft masks
Fixes issue [Bug]: Optimized pdf not rendering with Quartz / Core Graphics #1536
2025-06-12 23:58:43 -07:00
James R. Barlow
6851ea7f11 Remove test since ghostscript error handling changed 2025-04-21 12:23:34 -07:00
James R. Barlow
32322a9fe9 Fix broken test_hocrtransform_matches_sandwich
Expect word similarity rather than exact match. Difference appears to be due to quote styles.

Thanks @QuLogic for reporting.
2025-02-09 13:57:50 -08:00
James R. Barlow
137b054f43 Adjust test again for older Ghostscript 2025-01-27 23:44:37 -08:00
James R. Barlow
65df44f670 Modify tests to deal with variety of Ghostscript versions 2025-01-09 02:14:29 -08:00
James R. Barlow
6edc749023 Fix error handling when PDF contains an invalid image with both ImageMask and ColorSpace set
Fixes #1453
2025-01-07 00:27:07 -08:00
Kara Engelhardt
636623ab49 graft: fix invisible text appearing after strip_invisible_text
strip_invisible_text resets the text render mode on each `BT` (begin text) command. However the text state is not actually reset for each text element, only for each page.

The pdf reference says:

> The text state operators can appear outside text objects, and the values they set
> are retained across text objects in a single content stream. Like other graphics
> state parameters, these parameters are initialized to their default values at the
> beginning of each page.
>
> -- https://opensource.adobe.com/dc-acrobat-sdk-docs/pdfstandards/pdfreference1.7old.pdf#page=397

With the current implementation, a text object is only deleted if it contains a `3 Tr` command (setting the text rendering mode to invisible). However, the rendering mode may be set once and then left unchanged for multiple text objects, or set outside of a text object.
In that case only the first text object (which contains the `3 Tr`-command) is removed. This not only leaves the other text objects in the pdf, but also makes them visible, since the text object that contained the `3 Tr`-command is removed.

This PR updates `strip_invisible_text` to not reset the rendering mode for each object and to keep track of the rendering mode when the graphic state is pushed/popped.
2024-12-11 18:01:12 +01:00
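The behaviour described in the commit above can be illustrated with a minimal sketch. It is not the actual OCRmyPDF code: it assumes a page's content stream has already been parsed into `(operands, operator)` tuples, and it simplifies the fix by dropping individual text-showing operators rather than whole text objects. The name `strip_invisible_text_sketch` and the constants are hypothetical. The point it shows is that the rendering mode starts at its default once per page, persists across `BT`/`ET` pairs, and is saved and restored by `q`/`Q`.

```python
# Hypothetical sketch, not OCRmyPDF's implementation.
TEXT_SHOWING = {"Tj", "TJ", "'", '"'}  # PDF text-showing operators
INVISIBLE = 3                          # text rendering mode set by `3 Tr`


def strip_invisible_text_sketch(instructions):
    """Drop text shown while the rendering mode is invisible.

    `instructions` is a list of (operands, operator) tuples for one page.
    """
    render_mode = 0  # default value, initialized once per page (not per BT)
    saved_modes = []  # rendering modes saved by `q`, restored by `Q`
    kept = []
    for operands, operator in instructions:
        if operator == "q":
            saved_modes.append(render_mode)
        elif operator == "Q":
            if saved_modes:
                render_mode = saved_modes.pop()
        elif operator == "Tr":
            render_mode = int(operands[0])
        elif operator in TEXT_SHOWING and render_mode == INVISIBLE:
            continue  # invisible text: skip it, keep everything else
        kept.append((operands, operator))
    return kept


page = [
    ([], "BT"), ([3], "Tr"), (["hidden 1"], "Tj"), ([], "ET"),
    ([], "BT"), (["hidden 2"], "Tj"), ([], "ET"),  # mode 3 still in effect
    ([], "q"), ([0], "Tr"),
    ([], "BT"), (["visible"], "Tj"), ([], "ET"),
    ([], "Q"),                                     # restores mode 3
    ([], "BT"), (["hidden 3"], "Tj"), ([], "ET"),
]
shown = [ops[0] for ops, op in strip_invisible_text_sketch(page) if op == "Tj"]
print(shown)  # ['visible']
```

Because the `Tr` operators are kept in the output, removing the invisible text does not change the rendering mode seen by later text, which is the failure mode the commit message describes.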
James R. Barlow
9a075039b5 Remove empty test file 2024-12-02 11:23:35 -08:00
James R. Barlow
fe89be5dc0 Fix test broken in commit 85d6fb8c 2024-11-27 15:44:12 -08:00
James R. Barlow
a659f83d67 Remove invalid hyperlink annotations to satisfy Ghostscript 10.x during PDF/A conversion
Closes #1425
2024-11-16 19:02:10 -08:00
James R. Barlow
dbd3c93757 Fix issue with unpickling HOCRResult
Fixes [Bug]: HOCRResult.from_json() not unpickling correctly #1427
2024-11-10 02:05:57 -08:00
James R. Barlow
f77f701a50 Fix quadratic time performance regression on scanning pages 2024-10-27 21:57:53 -07:00
James R. Barlow
6f755321b8 Ignore unpaper warning message when checking version
Fixes #1409
2024-10-27 13:06:30 -07:00
James R. Barlow
18b59c57b4 Refactor our tests that check if we are in a container 2024-10-27 11:55:22 -07:00
Elliott Sales de Andrade
bb4c47e707 Fix broken test_rotate_page_level (#1382)
Before 42ff7fc842, `make_rotate_test`
always used `resources / 'typewriter.png'`, but after the change the
second call accidentally used just `resources`, which is a directory,
and fails to open.
2024-08-21 01:25:07 -07:00
James R. Barlow
59f6bc8306 Move Tesseract-specific language checks to its plugin 2024-06-01 00:15:50 -07:00
James R. Barlow
abf9729c61 Semfree test: accept pdfa conversion failed as a valid return code
Fixes #1316
2024-05-21 01:26:11 -07:00
James R. Barlow
7a8cc21e31 Add support for sidecar output to io.BytesIO
Closes #1252
2024-04-07 01:38:55 -07:00
James R. Barlow
065bddbc6c Reformat with ruff format 2024-04-07 00:25:32 -07:00
James Barlow
855de287b2 Fix test suite failure with Ghostscript >= 10.3
Ghostscript is more picky about a specific case with SMask that cannot be converted to PDF/A

Details here
4dcfae36bb
2024-03-19 17:20:33 -07:00
James R. Barlow
6a746a1cbb ruff linting/Python 3.10 cleanup 2024-02-14 12:41:51 -08:00
James R. Barlow
42ff7fc842 Fix handling of pages that are restored to correct orientation with /Rotate
Appears inversion of CTM was incorrect, introduced in commit 9898904
2024-02-12 01:32:26 -08:00
James R. Barlow
26470fe16a Suppress reportlab deprecation warning 2024-02-12 01:17:08 -08:00
James R. Barlow
74d2a156c4 Update cache 2024-01-07 01:35:05 -08:00
James R. Barlow
14365d10b8 Skip testing oom killer on Python 3.12
Need to investigate further if there's a safe way to do this test.
2024-01-02 16:28:22 -08:00
James R. Barlow
9489c01259 Skip test_encrypted on Py3.12 + macOS 2023-12-08 00:12:24 -08:00
James R. Barlow
a4987733c4 Filter rl_safe_eval deprecation warning
Full message
eportlab/lib/rl_safe_eval.py:11: DeprecationWarning: ast.NameConstant is deprecated and will be removed in Python 3.14; use ast.Constant instead
    haveNameConstant = hasattr(ast,'NameConstant')

Warning triggered by reportlab-4.0.7 and Python 3.12
2023-12-07 23:40:23 -08:00
James R. Barlow
445617a1a5 Rebuild cache for hocr default case 2023-12-03 15:16:18 -08:00
James R. Barlow
f6e90a5934 hOCR renderer is now default 2023-12-02 19:58:00 -08:00
James R. Barlow
11d3e32f1e Fix hocrtransform CLI 2023-12-02 08:08:29 -08:00
James R. Barlow
03669183d7 Rationalize canvas interface 2023-11-20 15:54:13 -08:00
James R. Barlow
db2e5132e6 Remove some obsolete parameters 2023-11-20 00:10:55 -08:00
James R. Barlow
c591f9601a Remove Latin hOCR test 2023-11-19 23:51:27 -08:00
James R. Barlow
27d5229842 Make logger names unique 2023-11-09 23:03:39 -08:00
James R. Barlow
a596ccf844 Raise exception if resulting PDF might appear blank in some PDF viewers
Fixes #1187
2023-11-09 22:33:22 -08:00
James R. Barlow
e7fa97731f ghostscript duplicate filter: filter within a window of previous messages 2023-11-09 22:32:39 -08:00
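One way to filter duplicates "within a window of previous messages", as the commit above puts it, is a `logging.Filter` backed by a bounded deque. This is a hedged sketch only: the class name `WindowedDuplicateFilter` and the `window` parameter are assumptions for illustration, and the project's actual filter may differ.

```python
import logging
from collections import deque


class WindowedDuplicateFilter(logging.Filter):
    """Hypothetical filter: suppress a message seen among the last `window` kept messages."""

    def __init__(self, window=5):
        super().__init__()
        self.recent = deque(maxlen=window)  # oldest entries fall off automatically

    def filter(self, record):
        message = record.getMessage()
        if message in self.recent:
            return False  # duplicate within the window: drop it
        self.recent.append(message)
        return True


dup_filter = WindowedDuplicateFilter(window=2)
results = [
    dup_filter.filter(logging.makeLogRecord({"msg": m}))
    for m in ["a", "b", "a", "c", "a"]
]
```

Compared with checking only the immediately preceding message, a window also catches duplicates that alternate with other messages, while a repeat that arrives after the window has moved on is logged again.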
James R. Barlow
290aa28108 Fix error on attempt to write to debug log after removing debug log handler 2023-11-09 16:02:41 -08:00
James R. Barlow
916106733c Skip semfree unless on Linux 2023-10-30 00:33:21 -07:00
James R. Barlow
71166f7be8 Make hocr API experimental for now
This commit can be reverted when we are ready to release a new version.
2023-10-30 00:07:10 -07:00
James R. Barlow
580252a1a0 Merge branch 'feature/gscan2pdf'
Reconcile release notes and copy_final() with new pipeline.
2023-10-30 00:01:28 -07:00
James R. Barlow
b5e73ac4e4 Drop check for obsolete .dockerinit file 2023-10-24 13:49:46 -07:00
James R. Barlow
db3df13e95 Remove ocrmypdf._sync 2023-10-24 00:54:31 -07:00
James R. Barlow
9ffb45f283 Remove public domain congress.jpg and replace with baiona_color.jpg
For reuse compliance we are phasing out public domain licenses
2023-10-24 00:54:31 -07:00
James R. Barlow
a06ab2a1c5 unpaper: Remove format conversion
Code is no longer reachable since we rasterize a 1/L/RGB image prior to this point.
2023-10-24 00:54:31 -07:00
James R. Barlow
dfa4ebf1a6 Simplify function signature of extract_image_filter 2023-10-24 00:54:31 -07:00
James R. Barlow
58f388c69d optimize: better coverage 2023-10-24 00:54:31 -07:00
James R. Barlow
990b462a94 Fix coverage settings and cover semfree 2023-10-24 00:54:31 -07:00