3132 Commits

Author SHA1 Message Date
James R. Barlow
190ca81951 v13.1.1 release notes v13.1.1 2021-12-10 21:49:04 -08:00
James R. Barlow
d48254d477 Fix issue with attempting to deskew a blank page on Tesseract 5
Closes #868
2021-12-10 21:48:09 -08:00
James R. Barlow
1ec2ccca14 docs: add warning about multiproc on macOS 2021-12-10 17:42:50 -08:00
James R. Barlow
e78f0cc56f v13.1.0 release notes v13.1.0 2021-12-06 20:00:53 -08:00
James R. Barlow
13af3252ff tests: simplify run_ocrmypdf API 2021-12-06 17:00:25 -08:00
James R. Barlow
0528867e0b config: yaml strings for versions 2021-12-06 16:59:44 -08:00
James R. Barlow
6910c48b81 Fix test_outputtype_none on Windows and cleanup docs 2021-12-06 15:38:38 -08:00
James R. Barlow
69aa3981c4 docs: Remove reference to long removed 'tesseract' renderer 2021-12-06 15:38:38 -08:00
James R. Barlow
9c1e5adfe6 docs: remove Ubuntu 16.04 install instructions
It's EOL.
2021-12-06 15:38:38 -08:00
James R. Barlow
e642dd4b35 Fix kill signal on Windows 2021-12-06 15:38:32 -08:00
James R. Barlow
9de06f62ee Use Python executors instead of pools
ProcessPool/ThreadPool don't have the ability to notice when a child worker
was terminated. ProcessPoolExecutor and ThreadPoolExecutor do notice and
provide better error messages.

Add tests to check.
2021-12-06 15:38:27 -08:00
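The behaviour described in the commit above can be illustrated with a minimal sketch. This is not OCRmyPDF's code: the function name `describe_worker_crash` is hypothetical, and the "fork" start method is assumed (POSIX only) to keep the example self-contained. It shows that `ProcessPoolExecutor` reports an abruptly terminated worker as `BrokenProcessPool` rather than leaving the caller waiting.

```python
import multiprocessing
import os
from concurrent.futures import ProcessPoolExecutor
from concurrent.futures.process import BrokenProcessPool


def describe_worker_crash() -> str:
    """Submit a task that kills its worker and report how the failure surfaces.

    Hypothetical helper for illustration only.
    """
    # Assumption: "fork" is available (Linux/macOS); it is not on Windows.
    context = multiprocessing.get_context("fork")
    with ProcessPoolExecutor(max_workers=1, mp_context=context) as executor:
        # os._exit terminates the worker immediately, without raising an
        # exception, which simulates a worker killed from outside.
        future = executor.submit(os._exit, 1)
        try:
            future.result(timeout=30)
        except BrokenProcessPool as exc:
            # The executor noticed the dead worker and failed the future.
            return type(exc).__name__
    return "no error"
```

Calling `describe_worker_crash()` is expected to return the name of the exception class raised by the executor, which a caller could turn into a clear error message.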
James R. Barlow
1414a8f5dc Tidy pyproject.toml 2021-12-06 15:38:27 -08:00
James R. Barlow
26badf2882 typing: small improvements 2021-12-06 15:38:27 -08:00
James R. Barlow
8f873aaa45 sync: typing improvements 2021-12-06 15:38:27 -08:00
James R. Barlow
8fdcb15b4e tests: improve typing and remove some legacy code 2021-12-06 15:38:27 -08:00
James R. Barlow
0323738ada ocrmypdf.fish: fix indents
[ci skip]
2021-12-06 15:38:27 -08:00
FPille
aae5591f7e Update ocrmypdf.bash completion
Squashed commit of the following:

commit 974de2e8ccad7fd34694f2c3a7a17c64bb52cdab
Merge: a8d7f969 ee04aa72
Author: James R. Barlow <james@purplerock.ca>
Date:   Sat Dec 4 20:22:50 2021 -0800

    Merge branch 'update_bash-completion' of git://github.com/FPille/OCRmyPDF into FPille-update_bash-completion

commit ee04aa7225
Author: FPille <f.pille@gmail.com>
Date:   Thu Oct 14 11:09:23 2021 +0200

    update

commit 76f64537aa
Author: FPille <f.pille@gmail.com>
Date:   Thu Oct 14 11:04:10 2021 +0200

    updated and descriptions for arguments and choices added
    deprecated arguments removed
    bug fix: typo "_init_completion" instead of "_init_completions"

commit de9b93e852
Merge: c23374de 42713b77
Author: Frank <50119297+FPille@users.noreply.github.com>
Date:   Thu Oct 14 08:08:11 2021 +0200

    Merge branch 'jbarlow83:master' into master

commit c23374de81
Merge: 40b2ebcb c409fa58
Author: Frank <50119297+FPille@users.noreply.github.com>
Date:   Wed May 26 20:31:00 2021 +0200

    Merge branch 'jbarlow83:master' into master

commit 40b2ebcb37
Merge: 79c84eef 7e388f59
Author: Frank <50119297+FPille@users.noreply.github.com>
Date:   Sat Jun 1 11:09:07 2019 +0200

    Merge pull request #1 from jbarlow83/master

    update master
2021-12-06 15:38:26 -08:00
James R. Barlow
4c1ff1086c tess cache: don't include full platform - could be sensitive 2021-12-06 15:38:26 -08:00
James R. Barlow
f91faf9795 Add new argument --tesseract-thresholding to control tesseract thresholding where available
Also add missing test for --tesseract-oem
2021-12-06 15:38:14 -08:00
James R. Barlow
793cc33a90 Whitespace 2021-12-04 16:07:34 -08:00
James R. Barlow
fbd72efd45 build: typo v13.0.0 2021-12-04 01:41:31 -08:00
James R. Barlow
1115923995 build: address checksum error from choco 2021-12-04 01:26:38 -08:00
James R. Barlow
8478d67b28 Merge branch 'release/v13' of github.com:jbarlow83/OCRmyPDF into release/v13 2021-11-15 16:38:11 -08:00
James R. Barlow
c75ff4687a Turning on Ghostscript interpolation changes this test
Seems acceptable. We don't normally use Ghostscript to downsample PDFs
as happens in this test.
2021-11-15 16:36:24 -08:00
mara004
312c1e51b5 [ci skip] minor corrections to maintainers.rst (#858) 2021-11-15 15:13:12 -08:00
James R. Barlow
cfe2bb25ba Merge commit 'cd49e70154f82f54bf74fc5bb2586fe7e0358971' into release/v13 2021-11-15 00:33:34 -08:00
Tristan Porteries
cd49e70154 ghostscript: force interpolation when rendering (#855)
Specifying the --oversample option tends to introduce upsampling during
rendering, because the page is rasterized at a higher DPI.

This upsampling improves OCR results, but a well-chosen interpolation
method can improve OCR quality even further.

Ghostscript seems to use nearest-neighbor interpolation by default for PDFs.
This method does not average the newly introduced pixels with the original
pixels, so the result is an almost identical image with more pixels.

Providing -dInterpolateControl=-1 forces interpolation on.

In this commit, the above option is passed to all Ghostscript rendering
calls.

In testing, rendering a page at the same DPI with interpolation
enabled does not introduce significant time overhead.

time (repeat 40 gs -dQUIET -dSAFER -dBATCH -dNOPAUSE -sDEVICE=png16m \
	-dFirstPage=1 -dLastPage=1 -r100.000000x100.000000 \
	-dInterpolateControl=-1 -o /dev/null -dAutoRotatePages=/None -f pzII.pdf)
7,66s user 0,33s system 99% cpu 8,012 total

time (repeat 40 gs -dQUIET -dSAFER -dBATCH -dNOPAUSE -sDEVICE=png16m \
	-dFirstPage=1 -dLastPage=1 -r100.000000x100.000000 \
        -o /dev/null -dAutoRotatePages=/None -f pzII.pdf)
7,42s user 0,39s system 99% cpu 7,808 total

Ghostscript interpolation control reference:
https://www.ghostscript.com/doc/current/Use.htm
2021-11-15 00:32:58 -08:00
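As a rough sketch of how the option above might be added to a rendering call, the following builds the same Ghostscript argument list used in the timing commands of that commit message. The helper name `build_gs_render_args` and its parameters are hypothetical and are not taken from OCRmyPDF; only the flags themselves come from the commit message.

```python
from typing import List


def build_gs_render_args(
    pdf_path: str, dpi: float = 100.0, interpolate: bool = True
) -> List[str]:
    """Build a Ghostscript command line that renders page 1 to PNG.

    Hypothetical helper; the flags mirror the commit message's benchmark.
    """
    args = [
        "gs",
        "-dQUIET",
        "-dSAFER",
        "-dBATCH",
        "-dNOPAUSE",
        "-sDEVICE=png16m",
        "-dFirstPage=1",
        "-dLastPage=1",
        f"-r{dpi:f}x{dpi:f}",  # e.g. -r100.000000x100.000000
    ]
    if interpolate:
        # The option the commit passes to force interpolation on.
        args.append("-dInterpolateControl=-1")
    args += ["-o", "/dev/null", "-dAutoRotatePages=/None", "-f", pdf_path]
    return args
```

The resulting list could be passed to `subprocess.run` on a system where `gs` is installed; writing to `/dev/null` matches the benchmark and discards the image.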
James R. Barlow
7ce1692eef windows: default version to '0' when looking for Ghostscript
To avoid ValueError: max() arg is an empty sequence

As suggested by @meet1919 in #833.
2021-11-14 23:00:08 -08:00
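The error named in the commit above comes from calling `max()` on an empty sequence; the built-in accepts a `default` argument for that case. A minimal sketch follows, in which `version_key` and `newest_version` are hypothetical names and the numeric comparison is an assumption rather than the project's actual logic.

```python
from typing import Iterable, Tuple


def version_key(version: str) -> Tuple[int, ...]:
    """Turn '9.55' into (9, 55) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def newest_version(found_versions: Iterable[str]) -> str:
    """Return the highest version found, or '0' when none were found."""
    # Without default=..., max() raises
    # "ValueError: max() arg is an empty sequence" on an empty input.
    return max(found_versions, key=version_key, default="0")
```

With no installations found, `newest_version([])` falls back to `'0'` instead of raising.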
James R. Barlow
7959f7628d pyproject: tell black to target py37 2021-11-14 15:49:01 -08:00
James R. Barlow
4634b20de5 Raise max-image-mpixels again
PDFs are quite likely to have a lot of pixels, e.g. large high-resolution scans.
250 MP corresponds to a page of A0-sized paper scanned at 400 DPI,
which should be enough in most cases.
2021-11-14 15:47:39 -08:00
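The 250 MP figure in the commit above can be checked with a short calculation. The sketch assumes the ISO 216 dimensions of A0 (841 mm by 1189 mm); the function name `scan_megapixels` is hypothetical.

```python
MM_PER_INCH = 25.4


def scan_megapixels(width_mm: float, height_mm: float, dpi: float) -> float:
    """Return the pixel count, in megapixels, of a page scanned at `dpi`."""
    width_px = round(width_mm / MM_PER_INCH * dpi)
    height_px = round(height_mm / MM_PER_INCH * dpi)
    return width_px * height_px / 1_000_000


# A0 is 841 mm x 1189 mm (assumed from ISO 216).
a0_at_400_dpi = scan_megapixels(841, 1189, 400)
```

The result for A0 at 400 DPI comes out just under 250 MP, consistent with the limit chosen in that commit.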
James R. Barlow
3810e576ff optimize: fix mypy lint 2021-11-13 14:48:00 -08:00
James R. Barlow
01c7895044 pipeline: tidy 2021-11-13 14:47:49 -08:00
James R. Barlow
fdc6aa03fb docs: new maintainer notes 2021-11-13 14:29:30 -08:00
James R. Barlow
25cc17ee03 v13 release notes (2) v13.0.0rc1 2021-11-13 02:02:04 -08:00
James R. Barlow
e8098a1475 Dockerfile: remove requirements/ 2021-11-13 01:57:17 -08:00
James R. Barlow
6b773883dc build: use latest pip and wheel in all cases 2021-11-13 01:57:03 -08:00
James R. Barlow
4ed9622335 v13 release notes 2021-11-13 01:37:38 -08:00
James R. Barlow
acc9d58c39 Skip no language test for Tess 5 2021-11-13 01:37:27 -08:00
James R. Barlow
659e738f92 Remove some 'liblept' references we no longer need 2021-11-13 01:22:09 -08:00
James R. Barlow
7b3d7ca92a ghostscript: choco doesn't put Ghostscript on PATH anymore
It seems that Chocolatey doesn't put gswin[32,64]c on PATH anymore,
so compensate.
2021-11-13 01:18:12 -08:00
James R. Barlow
e3126d2806 Adjust test to support Tesseract 5 working harder to find its files 2021-11-13 01:16:35 -08:00
James R. Barlow
45020a7fcd build: tweak CI 2021-11-13 00:56:49 -08:00
James R. Barlow
f51164aff8 Upgrade test version of pymupdf 2021-11-13 00:53:41 -08:00
James R. Barlow
6f58a14351 pdfa: remove deprecated pkg_resources based access and tests 2021-11-13 00:52:03 -08:00
James R. Barlow
7ba04267b1 Remove shims that supported old versions of pikepdf < 4 2021-11-13 00:43:20 -08:00
James R. Barlow
9749564313 Remove requirements/*.txt - use pip install ocrmypdf[etc] instead 2021-11-13 00:31:42 -08:00
James R. Barlow
698e8791d7 Remove Python 3.6 specific unicode environment checks 2021-11-13 00:28:52 -08:00
James R. Barlow
380b981763 Remove most Python 3.6 special casing 2021-11-13 00:27:48 -08:00
James R. Barlow
5abfb14c2a Remove leptonica and cffi 2021-11-13 00:06:35 -08:00
James R. Barlow
036afc4d88 Update cache, related to previous apparently 2021-11-12 23:57:50 -08:00