829 Commits

Author SHA1 Message Date
James R. Barlow
e613db6a82 Fix Ghostscript 10.6 JPEG corruption by repairing truncated images
Ghostscript 10.6 has a bug that truncates JPEG data by 1-15 bytes.
This adds detection and repair by comparing output images to input
images and restoring the original bytes when truncation is detected.

- Add warning when GS 10.6+ is used with PDF/A output
- Add _repair_gs106_jpeg_corruption() to fix damaged JPEGs after
  Ghostscript processing
- Add unit tests for the repair function
2025-12-23 14:56:24 -08:00
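The detect-and-repair idea in the commit above can be sketched as follows. This is a minimal illustration, not the code of `_repair_gs106_jpeg_corruption()`: the function name `repair_truncated_jpeg`, its signature, and the assumption that a damaged image is an exact prefix of the original with only its last 1-15 bytes missing are all hypothetical.

```python
def repair_truncated_jpeg(original: bytes, processed: bytes,
                          max_missing: int = 15) -> bytes:
    """Return the original bytes if `processed` looks like a truncated copy.

    Assumption (hypothetical): truncation only removes bytes from the end,
    so a damaged image is a strict prefix of the original image.
    """
    missing = len(original) - len(processed)
    if 1 <= missing <= max_missing and original.startswith(processed):
        # Truncation detected: restore the untouched input bytes.
        return original
    # Identical, re-encoded, or otherwise different data is left alone.
    return processed


# Usage: a fake JPEG payload that loses its last 5 bytes.
jpeg = b"\xff\xd8" + bytes(range(32)) + b"\xff\xd9"
damaged = jpeg[:-5]
repaired = repair_truncated_jpeg(jpeg, damaged)
```

Comparing against the input rather than re-parsing the JPEG keeps the check cheap, and it avoids touching images that Ghostscript legitimately re-encoded.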
James R. Barlow
742a4bac17 Make rotation test more robust 2025-12-23 11:20:57 -08:00
James R. Barlow
4c1ef0b471 Also process art and bleed boxes 2025-12-23 11:20:41 -08:00
James R. Barlow
eace567f7b Test and fix page box issues 2025-12-23 11:19:51 -08:00
James R. Barlow
057eaff36d Skip devnull testing on Windows
No longer seems to work - Windows Server 2025 change, perhaps? Doesn't really matter.
2025-11-10 16:57:30 -08:00
James R. Barlow
599fb1a1f6 Fix test_semfree (skip Python 3.14)
This feature is now deprecated and won't be fixed for Python 3.14. Instead we just use threads on platforms that don't support semaphores.

Closes #1558
2025-09-14 13:02:33 -07:00
James R. Barlow
414d80fc16 Deprecate semfree and don't auto activate it
Instead the standard executor will fall back to threads.

semfree caused test failures with Py3.14:
https://github.com/ocrmypdf/OCRmyPDF/issues/1558

In retrospect and with emerging Python tech like freethreading, semfree is becoming less necessary. We can use threads for the time being.

A consequence is that performance may be lower on Lambda and Termux when we are using threads and not shelling out work.
2025-09-11 17:13:04 -07:00
James R. Barlow
7e7e2f2e91 Raw value in pdfa XML block uses upper case codes, so account for this 2025-09-08 12:46:26 -07:00
James R. Barlow
4fc0c3a0d5 Add watcher test, such as it is 2025-08-13 01:04:58 -07:00
James R. Barlow
175b743ffe Fix version test 2025-07-03 11:30:05 -07:00
James R. Barlow
45cf92f40b xfail Python logging bug in 3.13.3/4 2025-07-03 09:21:31 -07:00
James R. Barlow
3beabf55e7 Skip optimizing images with pre-blended soft masks
Fixes #1536 ([Bug]: Optimized pdf not rendering with Quartz / Core Graphics)
2025-06-12 23:58:43 -07:00
James R. Barlow
6851ea7f11 Remove test since ghostscript error handling changed 2025-04-21 12:23:34 -07:00
James R. Barlow
32322a9fe9 Fix broken test_hocrtransform_matches_sandwich
Expect word similarity rather than exact match. Difference appears to be due to quote styles.

Thanks @QuLogic for reporting.
2025-02-09 13:57:50 -08:00
James R. Barlow
137b054f43 Adjust test again for older Ghostscript 2025-01-27 23:44:37 -08:00
James R. Barlow
65df44f670 Modify tests to deal with variety of Ghostscript versions 2025-01-09 02:14:29 -08:00
James R. Barlow
6edc749023 Fix error handling when PDF contains an invalid image with both ImageMask and ColorSpace set
Fixes #1453
2025-01-07 00:27:07 -08:00
Kara Engelhardt
636623ab49 graft: fix invisible text appearing after strip_invisible_text
strip_invisible_text resets the text render mode on each `BT` (begin text) command. However, the text state is not actually reset for each text object, only for each page.

The pdf reference says:

> The text state operators can appear outside text objects, and the values they set
> are retained across text objects in a single content stream. Like other graphics
> state parameters, these parameters are initialized to their default values at the
> beginning of each page.
>
> -- https://opensource.adobe.com/dc-acrobat-sdk-docs/pdfstandards/pdfreference1.7old.pdf#page=397

With the current implementation, a text object is only deleted if it contains a `3 Tr` command (setting the text rendering mode to invisible). However, the rendering mode may be set once and then left unchanged for multiple text objects, or set outside of a text object.
In that case only the first text object (which contains the `3 Tr` command) is removed. This not only leaves the other text objects in the PDF, but also makes them visible, since the text object that contained the `3 Tr` command is removed.

This PR updates `strip_invisible_text` to not reset the rendering mode for each object and to keep track of the rendering mode when the graphics state is pushed/popped.
2024-12-11 18:01:12 +01:00
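The state tracking described in the commit above can be sketched on a simplified content stream. This is an illustration under stated assumptions, not the project's `strip_invisible_text`: the stream is modelled as a plain list of `(operands, operator)` tuples, the function name `drop_invisible_text_ops` is hypothetical, and it drops individual text-showing operators rather than whole text objects.

```python
# Operators that paint text according to the current rendering mode.
TEXT_SHOWING = {"Tj", "TJ", "'", '"'}
INVISIBLE = 3  # `3 Tr`: neither fill nor stroke the text


def drop_invisible_text_ops(instructions):
    """Remove text-showing operators that execute while the mode is invisible.

    The rendering mode starts at 0 for the page, persists across BT/ET
    pairs, and is saved/restored by q/Q with the rest of the graphics state.
    """
    mode = 0
    saved_modes = []
    kept = []
    for operands, operator in instructions:
        if operator == "q":
            saved_modes.append(mode)
        elif operator == "Q":
            if saved_modes:
                mode = saved_modes.pop()
        elif operator == "Tr":
            mode = int(operands[0])
        elif operator in TEXT_SHOWING and mode == INVISIBLE:
            continue  # skip invisible text, keep everything else
        kept.append((operands, operator))
    return kept


# Usage: the mode set in the first text object carries into the second.
stream = [
    ([], "BT"), ([3], "Tr"), (["first"], "Tj"), ([], "ET"),
    ([], "BT"), (["second"], "Tj"), ([], "ET"),
]
cleaned = drop_invisible_text_ops(stream)
```

Because the `Tr` operators themselves are kept, removing the invisible text does not change the rendering mode seen by later instructions, which is the failure mode the commit message describes.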
James R. Barlow
9a075039b5 Remove empty test file 2024-12-02 11:23:35 -08:00
James R. Barlow
fe89be5dc0 Fix test broken in commit 85d6fb8c 2024-11-27 15:44:12 -08:00
James R. Barlow
a659f83d67 Remove invalid hyperlink annotations to satisfy Ghostscript 10.x during PDF/A conversion
Closes #1425
2024-11-16 19:02:10 -08:00
James R. Barlow
dbd3c93757 Fix issue with unpickling HOCRResult
Fixes #1427 ([Bug]: HOCRResult.from_json() not unpickling correctly)
2024-11-10 02:05:57 -08:00
James R. Barlow
f77f701a50 Fix quadratic time performance regression on scanning pages 2024-10-27 21:57:53 -07:00
James R. Barlow
6f755321b8 Ignore unpaper warning message when checking version
Fixes #1409
2024-10-27 13:06:30 -07:00
James R. Barlow
18b59c57b4 Refactor our tests that check if we are in a container 2024-10-27 11:55:22 -07:00
Elliott Sales de Andrade
bb4c47e707 Fix broken test_rotate_page_level (#1382)
Before 42ff7fc842, `make_rotate_test`
always used `resources / 'typewriter.png'`, but after the change the
second call accidentally used just `resources`, which is a directory,
and fails to open.
2024-08-21 01:25:07 -07:00
James R. Barlow
59f6bc8306 Move Tesseract-specific language checks to its plugin 2024-06-01 00:15:50 -07:00
James R. Barlow
abf9729c61 Semfree test: accept pdfa conversion failed as a valid return code
Fixes #1316
2024-05-21 01:26:11 -07:00
James R. Barlow
7a8cc21e31 Add support for sidecar output to io.BytesIO
Closes #1252
2024-04-07 01:38:55 -07:00
James R. Barlow
065bddbc6c Reformat with ruff format 2024-04-07 00:25:32 -07:00
James Barlow
855de287b2 Fix test suite failure with Ghostscript >= 10.3
Ghostscript is more picky about a specific case with SMask that cannot be converted to PDF/A

Details here
4dcfae36bb
2024-03-19 17:20:33 -07:00
James R. Barlow
6a746a1cbb ruff linting/Python 3.10 cleanup 2024-02-14 12:41:51 -08:00
James R. Barlow
42ff7fc842 Fix handling of pages that are restored to correct orientation with /Rotate
Appears inversion of CTM was incorrect, introduced in commit 9898904
2024-02-12 01:32:26 -08:00
James R. Barlow
26470fe16a Suppress reportlab deprecation warning 2024-02-12 01:17:08 -08:00
James R. Barlow
74d2a156c4 Update cache 2024-01-07 01:35:05 -08:00
James R. Barlow
14365d10b8 Skip testing oom killer on Python 3.12
Need to investigate further if there's a safe way to do this test.
2024-01-02 16:28:22 -08:00
James R. Barlow
9489c01259 Skip test_encrypted on Py3.12 + macOS 2023-12-08 00:12:24 -08:00
James R. Barlow
a4987733c4 Filter rl_safe_eval deprecation warning
Full message
eportlab/lib/rl_safe_eval.py:11: DeprecationWarning: ast.NameConstant is deprecated and will be removed in Python 3.14; use ast.Constant instead
    haveNameConstant = hasattr(ast,'NameConstant')

Warning triggered by reportlab-4.0.7 and Python 3.12
2023-12-07 23:40:23 -08:00
James R. Barlow
445617a1a5 Rebuild cache for hocr default case 2023-12-03 15:16:18 -08:00
James R. Barlow
f6e90a5934 hOCR renderer is now default 2023-12-02 19:58:00 -08:00
James R. Barlow
11d3e32f1e Fix hocrtransform CLI 2023-12-02 08:08:29 -08:00
James R. Barlow
03669183d7 Rationalize canvas interface 2023-11-20 15:54:13 -08:00
James R. Barlow
db2e5132e6 Remove some obsolete parameters 2023-11-20 00:10:55 -08:00
James R. Barlow
c591f9601a Remove Latin hOCR test 2023-11-19 23:51:27 -08:00
James R. Barlow
27d5229842 Make logger names unique 2023-11-09 23:03:39 -08:00
James R. Barlow
a596ccf844 Raise exception if resulting PDF might appear blank in some PDF viewers
Fixes #1187
2023-11-09 22:33:22 -08:00
James R. Barlow
e7fa97731f ghostscript duplicate filter: filter within a window of previous messages 2023-11-09 22:32:39 -08:00
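One way to filter duplicates within a window of previous messages is a `logging.Filter` backed by a bounded deque. This is a minimal sketch, not the project's filter: the class name `WindowedDuplicateFilter` and the `window` parameter are hypothetical, and it compares fully formatted message strings.

```python
import logging
from collections import deque


class WindowedDuplicateFilter(logging.Filter):
    """Suppress a record if its message appeared among the last `window`
    messages that were allowed through (hypothetical helper)."""

    def __init__(self, window: int = 5):
        super().__init__()
        # deque(maxlen=...) discards the oldest entry automatically.
        self._recent = deque(maxlen=window)

    def filter(self, record: logging.LogRecord) -> bool:
        message = record.getMessage()
        if message in self._recent:
            return False  # seen recently: drop it
        self._recent.append(message)
        return True


# Usage: attach the filter to a logger that relays subprocess output.
log = logging.getLogger("example.ghostscript")
log.addFilter(WindowedDuplicateFilter(window=10))
```

Compared with only checking the immediately preceding message, a window also catches warnings that alternate, such as two messages repeated in turn for every page.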
James R. Barlow
290aa28108 Fix error on attempt to write to debug log after removing debug log handler 2023-11-09 16:02:41 -08:00
James R. Barlow
916106733c Skip semfree unless on Linux 2023-10-30 00:33:21 -07:00
James R. Barlow
71166f7be8 Make hocr API experimental for now
This commit can be reverted when we are ready to release a new version.
2023-10-30 00:07:10 -07:00