Commit Graph

632 Commits

Author SHA1 Message Date
James R. Barlow
aa10a70d70 Rebuild test cache due to hocr output change 2021-08-01 01:00:05 -07:00
James R. Barlow
37923ffe52 Work around Pillow 8.3.1 DPI changes
Pillow decided against round-tripping DPI values.
https://github.com/python-pillow/Pillow/pull/5476

Fixes #802
2021-07-14 02:34:28 -07:00
James R. Barlow
5cba68b93d tests: Don't require symlink permissions on Windows
Some of the tests required symlink permissions, which CI workers have but typical Windows
user accounts do not. Mostly these are just correctness tests.
2021-07-14 00:11:47 -07:00
James R. Barlow
5f01c5e330 Fix another species of Tesseract version number breaking regex
Fixes #795
2021-06-16 00:09:03 -07:00
James R. Barlow
7b1e5b4f41 Fix "invalid version number" for untagged tesseract versions
Fixes #770
2021-04-26 01:18:07 -07:00
James R. Barlow
757b72b0af Revert "Remove apparently unused portion of a test"
This reverts commit d89a633ba7.
2021-04-16 00:21:11 -07:00
James R. Barlow
d673126994 Fix ZeroDivisionError on files containing images drawn at scale 0
Fixes #761
2021-04-15 23:26:14 -07:00
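A minimal sketch of how an image drawn at scale 0 can trigger a ZeroDivisionError, and one way to guard against it. The function name `effective_dpi` and the choice to return `None` are hypothetical illustrations, not the project's actual code.

```python
from typing import Optional


def effective_dpi(pixels: int, drawn_points: float) -> Optional[float]:
    """Resolution of an image that is `pixels` wide and drawn `drawn_points` wide.

    One PDF point is 1/72 inch. An image drawn at scale 0 has no physical
    size, so its resolution is undefined; return None instead of dividing.
    """
    if drawn_points == 0:
        return None
    return pixels / (abs(drawn_points) / 72.0)


print(effective_dpi(600, 144.0))  # 600 pixels drawn across 2 inches
print(effective_dpi(600, 0.0))    # drawn at scale 0, no resolution
```

Callers then need to skip images for which no resolution is available rather than folding them into a page-wide DPI estimate.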
James R. Barlow
d89a633ba7 Remove apparently unused portion of a test 2021-04-15 23:25:18 -07:00
James R. Barlow
9db9a3d6ec helpers: improve test coverage of Resolution 2021-04-07 23:26:37 -07:00
James R. Barlow
336d274a54 Drop remnants of support for Tesseract without has_textonly_pdf
Also improve Tesseract version checking so it can compare all of their
weird conventions.
2021-04-07 23:05:21 -07:00
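As a rough illustration of tolerant version checking, the sketch below keeps only the leading numeric release segment of a version string and compares it as a tuple. The helper `parse_release` and the sample strings are assumptions made for this example; they are not taken from the project or from the linked issues.

```python
import re
from typing import Tuple


def parse_release(version: str) -> Tuple[int, ...]:
    """Extract the leading dotted-number release, ignoring prefixes and suffixes."""
    match = re.match(r"v?(\d+(?:\.\d+)*)", version.strip())
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return tuple(int(part) for part in match.group(1).split("."))


# Illustrative inputs only: a plain release, a prefixed pre-release,
# and a "git describe"-style string for an untagged build.
for sample in ("4.1.1", "v5.0.0-alpha.20210401", "4.1.1-rc2-25-g9707"):
    print(sample, parse_release(sample))
```

Comparing tuples avoids strict version parsers that reject nonstandard suffixes, at the cost of ignoring pre-release ordering.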
James R. Barlow
906d77b389 tests: remove obsolete running_in_travis() 2021-04-07 02:25:10 -07:00
James R. Barlow
9416e850ff Remove another instance of helpers_namespace 2021-04-07 02:23:04 -07:00
James R. Barlow
2a09a668f6 Delinting: unused args 2021-04-07 02:18:08 -07:00
James R. Barlow
e788dde607 tests: eliminate unnecessary mmap 2021-04-07 02:11:31 -07:00
James R. Barlow
173a80864d Delinting 2021-04-07 02:09:45 -07:00
James R. Barlow
aa115a8be3 Remove pytest_helpers_namespace 2021-04-07 01:56:51 -07:00
James R. Barlow
b1306bd7a8 tests: skip test_bash on Windows 2021-04-06 01:15:00 -07:00
James R. Barlow
ec1d585d40 Merge branch 'feature/misc-breaking' 2021-04-01 16:51:04 -07:00
James R. Barlow
a4e1f8e1f3 Merge branch 'feature/lambda' 2021-04-01 16:36:22 -07:00
James R. Barlow
0a42934c08 Exclude Group 3 images from optimization 2021-03-20 23:28:21 -07:00
James R. Barlow
079c162a96 Ensure sidecar is not input or output file 2021-03-05 00:29:42 -08:00
James R. Barlow
4124889f36 Don't generate PDF/A-1b with object streams
Acrobat insists that PDF/A-1b should not have object streams.
Other programs like veraPDF disagree with this restriction, but
we can accommodate Acrobat so we will.

Also add more tests around this.
2021-02-26 00:23:57 -08:00
Dima Kuznetsov
5e2206bae7 Allow --sidecar along with --pages (#735) 2021-02-19 16:55:35 -08:00
James R. Barlow
064f935699 Fix page rotation regression
Page size fixes in commit b26749 accounted for a "kept" rotation,
but not a corrected rotation.

Fixes #730.
2021-02-15 01:47:09 -08:00
James R. Barlow
8770fff968 tests: remove unreliable/incomplete test 2021-02-15 01:05:08 -08:00
James R. Barlow
bccf2f423f Stricter parameter checking for many public functions 2021-01-31 19:27:25 -08:00
James R. Barlow
390fdf8c05 Package OCR in Form XObject
Should improve results in some situations where the initial content
stream is messy or not well-formed.
2021-01-31 19:27:25 -08:00
James R. Barlow
16bda74974 Refactor - decouple progressbar from executor 2021-01-30 20:42:00 -08:00
James R. Barlow
d274d88929 Refactor to eliminate global state in _concurrent 2021-01-30 17:36:30 -08:00
James R. Barlow
7bccb8c748 tests: fix concurrency 2021-01-24 23:46:33 -08:00
James R. Barlow
1a982da442 tests: confirm that we produce pdf when optimization is off 2021-01-24 01:54:25 -08:00
James R. Barlow
ebacff1b39 tests: Fix debug logging test 2021-01-09 16:41:57 -08:00
James R. Barlow
c7c447be66 Add test for configure_debug_logging
Since we can't directly test it
2021-01-09 16:02:12 -08:00
James R. Barlow
91aa175602 Consider text when determining page raster DPI
Previously if we found vectors of any sort on a page, we would bump
the DPI up to 400. We did nothing about pages with text. As a result,
pages with a low image resolution and printable text would have the
text downgraded to image resolution when --force-ocr was used.

We don't try to determine if the text is visible or invisible OCR text, since
that is a slower test. --redo-ocr would improve such cases anyway.
2021-01-09 16:01:49 -08:00
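The rule described in this message can be sketched roughly as follows. The function name, its parameters, and the use of `max` (treating 400 as a floor rather than a fixed value) are assumptions for illustration and may differ from the actual implementation.

```python
def page_raster_dpi(
    image_dpi: float,
    has_vector: bool,
    has_text: bool,
    floor_dpi: float = 400.0,
) -> float:
    """Pick a rasterization DPI for a page.

    Pages with vector drawings or text are rendered at no less than
    `floor_dpi`, so sharp content is not reduced to the image resolution.
    """
    if has_vector or has_text:
        return max(image_dpi, floor_dpi)
    return image_dpi


print(page_raster_dpi(150.0, has_vector=False, has_text=True))
print(page_raster_dpi(150.0, has_vector=False, has_text=False))
```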
James R. Barlow
b267494e4a Create raster PDF pages to match input page size
Previously we produced a raster image, then multiplied image width
by DPI to get the page size. However, if there is rounding, the
page size may not match exactly. In this modified approach we
constrain the page size to match.
2021-01-08 15:10:43 -08:00
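The rounding problem can be seen with a little arithmetic. In this sketch the page width, the DPI, and both helper functions are illustrative assumptions: a page width in points is rasterized to a whole number of pixels, and converting that pixel count back through the DPI no longer gives the original width.

```python
def raster_width_px(page_width_pt: float, dpi: float) -> int:
    """Whole-pixel width of a page rasterized at `dpi` (1 point = 1/72 inch)."""
    return round(page_width_pt / 72.0 * dpi)


def width_from_image_pt(width_px: int, dpi: float) -> float:
    """Page width in points recovered from the pixel width and DPI."""
    return width_px / dpi * 72.0


page_width_pt = 595.276  # roughly A4 width, chosen for illustration
dpi = 300

px = raster_width_px(page_width_pt, dpi)
derived = width_from_image_pt(px, dpi)   # drifts because px was rounded
constrained = page_width_pt              # reuse the input size instead

print(px, derived, constrained)
```

Carrying the input page size through, rather than re-deriving it from the raster, keeps the output page dimensions identical to the input.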
James R. Barlow
f687180ecc tests: tidy pdfinfo 2021-01-08 15:04:52 -08:00
James R. Barlow
2846d46bb8 Remove .coveragerc and fold into setup.cfg 2021-01-06 03:58:18 -08:00
James R. Barlow
0b3a526049 Partial fix crash on 'userunit' None (#700)
Our method of getting data from pdfminer would silently consume a StopIteration
if pdfminer returned no processed pages, leading to an odd error message.

We now handle the error from pdfminer properly and return a more
descriptive error of our own.

It would be possible for ocrmypdf to repair the file before sending it to
pdfminer, but this seems to be rare enough that we won't do that yet.
2021-01-01 01:11:32 -08:00
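One general way to avoid losing a StopIteration is to ask for the first item with an explicit default and raise a descriptive error when nothing comes back. The sketch below shows that pattern only; `first_processed_page` and its error message are hypothetical and do not reproduce the project's code or pdfminer's API.

```python
def first_processed_page(pages):
    """Return the first item from `pages`, or fail with a clear message."""
    sentinel = object()
    page = next(iter(pages), sentinel)  # never raises StopIteration
    if page is sentinel:
        raise ValueError(
            "PDF parser returned no processed pages; the input file may be damaged"
        )
    return page


print(first_processed_page(["page-1", "page-2"]))
```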
James R. Barlow
bd0f005861 tests: tag tests that need pngquant, jbig2enc 2020-12-30 01:58:57 -08:00
James R. Barlow
72fa347c38 tests: skip metadata test for two pikepdf versions that warn incorrectly 2020-12-29 01:47:52 -08:00
James R. Barlow
babc76fa74 tests: assert that most patched functions are called
We were not actually checking if functions we patched were called when
expected.
2020-12-28 23:58:33 -08:00
James R. Barlow
81602cf420 Fix test not patching properly after Ghostscript polling change 2020-12-27 16:01:50 -08:00
James R. Barlow
bb258fc99c pdfinfo: Refactor pageinfo dictionary into a class 2020-12-24 01:47:53 -08:00
James R. Barlow
3675ae918c Fix certain invalid page ranges causing exception
Closes #686
2020-12-22 01:22:14 -08:00
James R. Barlow
f11bb53e61 Change prefix of temporary folders
Shouldn't really use a name that suggests a connection to GitHub.
2020-12-07 21:51:46 -08:00
James R. Barlow
3cba50bfbd windows: look in registry for Tesseract and Ghostscript 2020-12-04 13:21:54 -08:00
James R. Barlow
ce0e0ecd4d Decouple tqdm from progressbar setup 2020-12-04 13:20:28 -08:00
James R. Barlow
7e1223c12c ghostscript: add output tracing 2020-11-29 14:53:35 -08:00
James R. Barlow
895fddd85e Replace most uses of universal_newlines with text
The parameters are equivalent but the latter is better named. Since
Python 3.6 doesn't support text= we use our wrapper to add it in that
place.

This is for subprocess.run.
2020-11-07 00:48:08 -08:00
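A minimal sketch of such a wrapper, assuming a hypothetical function named `run`: it accepts `text=` and forwards it as `universal_newlines=`, which `subprocess.run` understands on Python 3.6 as well as on later versions.

```python
import subprocess
import sys


def run(args, *, text=None, **kwargs):
    """Call subprocess.run, translating text= into universal_newlines=."""
    if text is not None:
        kwargs["universal_newlines"] = text
    return subprocess.run(args, **kwargs)


result = run(
    [sys.executable, "-c", "print('hello')"],
    stdout=subprocess.PIPE,
    text=True,
)
print(repr(result.stdout))  # a str rather than bytes, because text mode is on
```

Call sites can then use the clearer `text=True` spelling everywhere, and the wrapper can be dropped once Python 3.6 support ends.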
James R. Barlow
3707af3b74 Change pdf.root to pdf.Root 2020-11-03 01:30:31 -08:00