Commit Graph

913 Commits

Author SHA1 Message Date
James R. Barlow
f99fd686de hocrtransform: ensure text is rendered in document order
Previously we rendered text objects based on their vertical position, but this confuses
some PDF viewers.

Closes #813.
2021-08-01 00:49:57 -07:00
James R. Barlow
37923ffe52 Work around Pillow 8.3.1 DPI changes
Pillow decided against round-tripping DPI values.
https://github.com/python-pillow/Pillow/pull/5476

Fixes #802
2021-07-14 02:34:28 -07:00
James R. Barlow
773e28478c graft: don't use deprecated pikepdf APIs 2021-07-14 01:44:26 -07:00
James R. Barlow
0b834411fe validation: mention ISO 639-2 to give people a clue about how to find the appropriate code 2021-06-28 15:19:59 -07:00
James R. Barlow
5f01c5e330 Fix another species of Tesseract version number breaking regex
Fixes #795
2021-06-16 00:09:03 -07:00
James R. Barlow
4030258bbc Modernize build system to use setup.cfg
For now, keep but deprecate the requirements/*.txt files.
2021-06-10 00:27:52 -07:00
James R. Barlow
a964080f77 validation: a word 2021-05-27 13:42:05 -07:00
James R. Barlow
b4f2582766 Show ExitCodeException traceback in verbose mode 2021-05-27 01:26:25 -07:00
James R. Barlow
f3715daf15 Move HOCR_OK_LANGS to hocrtransform.py 2021-05-23 01:15:53 -07:00
James R. Barlow
c87221a4e6 validation: add proper list of languages supported by hocr
Based on Latin-1 support in default PDF fonts.
2021-05-23 01:14:15 -07:00
James R. Barlow
09c485bd88 Convert harmless Leptonica exception to warning
Closes #769 (Error when trying --remove-background on pdf)
2021-05-18 23:19:51 -07:00
James R. Barlow
7b1e5b4f41 Fix "invalid version number" for untagged tesseract versions
Fixes #770
2021-04-26 01:18:07 -07:00
James R. Barlow
a613722e96 Auto-register semfree.py if needed 2021-04-21 23:39:38 -07:00
James R. Barlow
ad0126185f Rename awslambda -> semfree.py 2021-04-21 23:29:55 -07:00
James R. Barlow
be45871d10 docker: add special hint for using docker 2021-04-21 23:18:29 -07:00
James R. Barlow
5112e9e857 Redo dpi calc to avoid 'math.ulp' 2021-04-16 00:54:41 -07:00
James R. Barlow
d673126994 Fix ZeroDivisionError on files containing images drawn at scale 0
Fixes #761
2021-04-15 23:26:14 -07:00
James R. Barlow
710d797299 Fix comment typos 2021-04-14 00:36:58 -07:00
James R. Barlow
fc75254c60 Fix awslambda progressbar.update() error
Fixes #759
2021-04-13 14:37:30 -07:00
James R. Barlow
f453e94f14 awslambda: better documentation 2021-04-13 13:16:35 -07:00
James R. Barlow
8f8aaa93ed Refactor removing log handlers 2021-04-13 13:16:21 -07:00
James R. Barlow
a90b9e669f Add reminder to not mess with pool/listener order 2021-04-09 13:22:48 -07:00
James R. Barlow
e4f69cc1d6 Maybe fix a deadlock on attempting sys.stderr.flush()
Closes #758
Closes #733
2021-04-09 01:59:30 -07:00
James R. Barlow
a5852ba199 Remove parent process's log handlers properly 2021-04-08 21:39:18 -07:00
James R. Barlow
051b9da991 Remove undocumented/unused debug environment variables 2021-04-08 21:08:12 -07:00
James R. Barlow
139d9f9841 Shut up pikepdf mmap disabled message 2021-04-08 20:58:46 -07:00
James R. Barlow
9de38afb13 Fix Tesseract version for ubuntu 18.04 2021-04-08 13:07:13 -07:00
James R. Barlow
9db9a3d6ec helpers: improve test coverage of Resolution 2021-04-07 23:26:37 -07:00
James R. Barlow
8423bd549b helpers: don't trap exception on failure to unlink
If we can't unlink a file we expect to unlink, logging and moving on is
probably the wrong action.

Coverage never hits this line.
2021-04-07 23:16:10 -07:00
James R. Barlow
336d274a54 Drop remnants of support for Tesseract without has_textonly_pdf
Also improve Tesseract version checking so it can compare all of their
weird conventions.
2021-04-07 23:05:21 -07:00
James R. Barlow
2a09a668f6 Delinting: unused args 2021-04-07 02:18:08 -07:00
James R. Barlow
173a80864d Delinting 2021-04-07 02:09:45 -07:00
James R. Barlow
aa115a8be3 Remove pytest_helpers_namespace 2021-04-07 01:56:51 -07:00
James R. Barlow
a2033698fa graft: use newer pikepdf style 2021-04-02 00:11:13 -07:00
James R. Barlow
ec1d585d40 Merge branch 'feature/misc-breaking' 2021-04-01 16:51:04 -07:00
James R. Barlow
a4e1f8e1f3 Merge branch 'feature/lambda' 2021-04-01 16:36:22 -07:00
James R. Barlow
2e155c31bf Ensure builtin module registration is deterministic 2021-04-01 16:30:42 -07:00
James R. Barlow
e09ae9c68a Fix test suite failure if filter_pdf_page is missing 2021-04-01 16:25:40 -07:00
James R. Barlow
0a42934c08 Exclude Group 3 images from optimization 2021-03-20 23:28:21 -07:00
James R. Barlow
079c162a96 Ensure sidecar is not input or output file 2021-03-05 00:29:42 -08:00
James R. Barlow
25c8c4656f Fix error message change 2021-03-03 01:02:59 -08:00
James R. Barlow
6e71fe1186 Clarify --unpaper-args errors 2021-03-03 00:44:21 -08:00
James R. Barlow
8ffc99f648 optimize: log errors more loudly 2021-03-03 00:43:40 -08:00
James R. Barlow
4124889f36 Don't generate PDF/A-1b with object streams
Acrobat insists that PDF/A-1b should not have object streams.
Other programs like veraPDF disagree with this restriction, but
we can accommodate Acrobat so we will.

Also add more tests around this.
2021-02-26 00:23:57 -08:00
James R. Barlow
a23c22b0e8 helpers: tidy check_pdf 2021-02-25 22:51:53 -08:00
Dima Kuznetsov
5e2206bae7 Allow --sidecar along --pages (#735) 2021-02-19 16:55:35 -08:00
James R. Barlow
064f935699 Fix page rotation regression
Page size fixes in commit b26749 accounted for a "kept" rotation,
but not a corrected rotation.

Fixes #730.
2021-02-15 01:47:09 -08:00
James R. Barlow
2a52c6dec2 optimize: skip images with unusually small dimensions
They're unlikely to be handled well by our recompressors. It seems
that JBIG2 cannot handle very small widths.

Fixes #732
2021-02-14 01:43:25 -08:00
James R. Barlow
a48ca556c7 Add filter_pdf_page hook 2021-02-14 01:22:33 -08:00
James R. Barlow
9cba738b48 Remove deprecated code 2021-01-31 19:27:59 -08:00