686 Commits

Author SHA1 Message Date
James R. Barlow
a5efc4af9b unpaper: replace input pnm with png
Unpaper or its underlying libraries don't seem to accept PNMs with an
odd integer width, although it's not clear whether this is the issue at all.

In any case, keeping the image a PNG works around the issue. unpaper
only accepted PNM input in the past, which is why we send it PNM.
Since it now accepts PNG, we might as well use PNG.

Unpaper can write PNG as output too, but this added a few seconds to
the test suite and was not committed.

Related issues:

https://github.com/ocrmypdf/OCRmyPDF/issues/887

https://github.com/ocrmypdf/OCRmyPDF/issues/665

https://github.com/unpaper/unpaper/issues/82
2022-07-03 15:32:16 -07:00
James R. Barlow
61600111d3 test_pdfinfo: refactor by extracting fixtures 2022-06-18 16:29:57 -07:00
James R. Barlow
17a5b8b43c Refactor reporting of optimization failures 2022-06-13 01:30:15 -07:00
James R. Barlow
13d11e76e5 optimize plugin: solve linearization and "is optimization enabled?" issues 2022-06-13 00:59:41 -07:00
James R. Barlow
61069660a2 Move optimization options to plugin 2022-06-12 02:42:16 -07:00
James R. Barlow
3d4f80639d Remove test that is now always skipped 2022-06-12 00:31:01 -07:00
James R. Barlow
b17fb61389 Configure pylint in pyproject and delint 2022-06-12 00:30:44 -07:00
James R. Barlow
0ac15dd0b2 Suppress libxmp DeprecationWarning during test 2022-06-01 00:46:16 -07:00
James R. Barlow
33cdabaf65 tests: account for test that expected pngquant for windows 2022-05-26 13:52:22 -07:00
James R. Barlow
5d0cc0a092 tests: Extract some test fixtures for better clarity 2022-05-26 00:57:31 -07:00
James R. Barlow
6c427f82ea Add test case for corrupt ICC profiles 2022-05-26 00:41:19 -07:00
James R. Barlow
b00fe3dc5d pytest.skip() - remove kwarg entirely, to avoid breaking older pytest and not getting warns from newer pytest 2022-04-14 20:15:00 -07:00
James R. Barlow
e6aa3a4299 tests: explain why CacheOcrEngine needs lock 2022-04-05 16:16:51 -07:00
James R. Barlow
43302d7e12 Fix pytest.warns() on older pytest
Thanks @QuLogic
2022-04-05 16:02:50 -07:00
James Barlow
776ada6713 Upgrade pre-commit and associated tools; various lints 2022-04-03 20:53:01 -07:00
James Barlow
dfe31a2f6d Add lock to certain "with patch" cases
The switch to --use-threads seems to have broken tests that assumed they could
monkeypatch things. That's odd, though: while we can have multiple
worker threads, we should never have parallel tests in the same process.
2022-04-03 17:22:04 -07:00
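A minimal sketch of the idea in dfe31a2f6d above: serializing "with patch" blocks behind a lock so that threads sharing one process cannot interleave their monkeypatches. This is not the project's actual test code; `_patch_lock`, `cpu_count_while_patched`, and `run_concurrently` are hypothetical names, and `os.cpu_count` is just a convenient target to patch.

```python
import os
import threading
from concurrent.futures import ThreadPoolExecutor
from unittest.mock import patch

# Hypothetical module-level lock; the name is illustrative only.
_patch_lock = threading.Lock()


def cpu_count_while_patched(fake_count):
    """Patch os.cpu_count and read it back while holding the lock."""
    # Holding the lock for the whole patch lifetime means no other thread
    # can install or remove its own patch in the middle of this block.
    with _patch_lock:
        with patch("os.cpu_count", return_value=fake_count):
            return os.cpu_count()


def run_concurrently(values):
    """Call the patched helper from several threads at once."""
    with ThreadPoolExecutor(max_workers=4) as pool:
        return list(pool.map(cpu_count_while_patched, values))
```

Without the lock, two threads could enter and leave their patches in interleaved order, so one thread might observe the other's fake value or leave a mock installed after both exit.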
James Barlow
0c43963d69 Fix pytest deprecation warnings 2022-04-03 13:30:58 -07:00
James Barlow
f29fe7f23e Fix Pillow deprecation warnings 2022-04-03 13:30:50 -07:00
James R. Barlow
13917c051c Disable oom killer test for --use-threads 2022-03-13 01:02:28 -08:00
James R. Barlow
514038d4ec optimize: recognize and produce [/FlateDecode /DCTDecode] images 2022-02-08 00:38:08 -08:00
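For context on 514038d4ec above: in a PDF stream, a filter array is applied in the order listed when decoding, so [/FlateDecode /DCTDecode] means the stored bytes are a JPEG that was additionally Flate-compressed. The sketch below illustrates only that ordering, using the standard library; `peel_flate` is a hypothetical helper, not OCRmyPDF's code, and the "JPEG" here is stand-in bytes rather than a decodable image.

```python
import zlib

JPEG_SOI = b"\xff\xd8"  # start-of-image marker that begins a JPEG stream


def peel_flate(data, filters):
    """Undo a leading /FlateDecode and return (data, remaining_filters)."""
    if filters and filters[0] == "/FlateDecode":
        return zlib.decompress(data), filters[1:]
    return data, filters


# Stand-in for JPEG bytes (not a real image), wrapped in Flate compression.
fake_jpeg = JPEG_SOI + b"\x00" * 64 + b"\xff\xd9"
stored = zlib.compress(fake_jpeg)

# After peeling the outer filter, only /DCTDecode remains to be handled.
data, remaining = peel_flate(stored, ["/FlateDecode", "/DCTDecode"])
```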
James R. Barlow
3b406112d0 ghostscript: improve test coverage of error cases 2022-01-25 23:45:47 -08:00
James R. Barlow
2d0ac4707c Use better img2pdf settings where possible while supporting old versions
Fixes #894
2022-01-14 11:55:54 -08:00
James R. Barlow
ea69e868ed unpaper: issue warning if image too large to clean 2022-01-11 10:44:38 -08:00
James R. Barlow
ee21bf9ef6 Update cache 2021-12-13 20:45:30 -08:00
James R. Barlow
d48254d477 Fix issue with attempting to deskew a blank page on Tesseract 5
Closes #868
2021-12-10 21:48:09 -08:00
James R. Barlow
13af3252ff tests: simplify run_ocrmypdf API 2021-12-06 17:00:25 -08:00
James R. Barlow
6910c48b81 Fix test_outputtype_none on Windows and cleanup docs 2021-12-06 15:38:38 -08:00
James R. Barlow
e642dd4b35 Fix kill signal on Windows 2021-12-06 15:38:32 -08:00
James R. Barlow
9de06f62ee Use Python executors instead of pools
ProcessPool/ThreadPool don't have the ability to notice when a child worker
was terminated. ProcessPoolExecutor and ThreadPoolExecutor do notice and
provide better error messages.

Add tests to check.
2021-12-06 15:38:27 -08:00
James R. Barlow
8fdcb15b4e tests: improve typing and remove some legacy code 2021-12-06 15:38:27 -08:00
James R. Barlow
4c1ff1086c tess cache: don't include full platform - could be sensitive 2021-12-06 15:38:26 -08:00
James R. Barlow
f91faf9795 Add new argument --tesseract-thresholding to control tesseract thresholding where available
Also add missing test for --tesseract-oem
2021-12-06 15:38:14 -08:00
James R. Barlow
c75ff4687a Turning on Ghostscript interpolation changes this test
Seems acceptable. We don't normally use Ghostscript to downsample PDFs
as is happening in this test.
2021-11-15 16:36:24 -08:00
James R. Barlow
acc9d58c39 Skip no language test for Tess 5 2021-11-13 01:37:27 -08:00
James R. Barlow
e3126d2806 Adjust test to support Tesseract 5 working harder to find its files 2021-11-13 01:16:35 -08:00
James R. Barlow
f51164aff8 Upgrade test version of pymupdf 2021-11-13 00:53:41 -08:00
James R. Barlow
6f58a14351 pdfa: remove deprecated pkg_resources based access and tests 2021-11-13 00:52:03 -08:00
James R. Barlow
7ba04267b1 Remove shims to support old versions of pikepdf < 4 2021-11-13 00:43:20 -08:00
James R. Barlow
380b981763 Remove most Python 3.6 special casing 2021-11-13 00:27:48 -08:00
James R. Barlow
5abfb14c2a Remove leptonica and cffi 2021-11-13 00:06:35 -08:00
James R. Barlow
036afc4d88 Update cache, related to previous apparently 2021-11-12 23:57:50 -08:00
James R. Barlow
59642a98b2 Disable --remove-background so we can remove leptonica 2021-11-12 23:56:52 -08:00
James R. Barlow
f8c6be2e26 test_rotation: replace leptonica test with Pillow channel ops
New function is likely not as robust but seems capable of inexact image comparison.
2021-11-12 23:49:38 -08:00
James R. Barlow
30440104ba Remove --threshold argument
Tesseract now includes better thresholding (binarization) in v5. Users that have
thresholding issues should try that first. If we find further problems
this can be brought back as a plugin.
2021-11-12 20:09:55 -08:00
James R. Barlow
b159e02110 Convert deskew to use degrees, since all our other angles are in degrees 2021-11-12 16:40:51 -08:00
James R. Barlow
a55ab05d16 Replace leptonica deskew with tesseract find skew and pillow rotate
Also rebuild the cache.
2021-11-12 16:35:08 -08:00
James R. Barlow
6c34d59836 tesseract: yet another version variant 2021-11-04 00:17:18 -07:00
James R. Barlow
690f88119d Fix test failures on pikepdf 3.2.0 + pybind11 2.8.0
When compiled without pybind11 2.8.0, pikepdf supplies a shim to implement
pikepdf._ObjectMapping.values() which has subtly different semantics
from a true dict-like object; in particular it supports
next(objectmap.values())
where a standard dict requires
next(iter(objectmap.values())).

pybind11 2.8.0 now implements .values() properly, meaning some misuses of
the protocol in ocrmypdf fail.

If pybind11 < 2.8.0, pikepdf will continue to offer its shim.
If pybind11 >= 2.8.0, pikepdf does not add its shim.

Consequently no changes were needed in pikepdf.

Closes #843
2021-10-12 13:38:52 -07:00
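The protocol difference described in 690f88119d above can be reproduced with a plain dict, which behaves the way the commit says a standard dict does. This is an illustrative sketch only; it does not use pikepdf, and `first_value` and `first_value_misuse` are hypothetical names.

```python
def first_value(mapping):
    """Return the first value of a mapping using the iterator protocol."""
    # .values() returns a view (an iterable), not an iterator,
    # so it must be passed through iter() before next().
    return next(iter(mapping.values()))


def first_value_misuse(mapping):
    """Show the misuse: calling next() directly on a values view."""
    try:
        return next(mapping.values())
    except TypeError as exc:
        # A standard dict view is not an iterator, so next() rejects it.
        return f"TypeError: {exc}"
```

Code written as in `first_value` works with either behavior, which is why wrapping the view in iter() is the portable form.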
James R. Barlow
78f391536b Offer hint to user to use --max-image-mpixels after decompression bomb error
Closes #801
2021-10-06 00:19:11 -07:00
James R. Barlow
790d3022f6 Implement --output-type=none to skip producing the PDF and use only the sidecar
Closes #787
2021-09-26 01:07:34 -07:00