James R. Barlow
f99fd686de
hocrtransform: ensure text is rendered in document order
...
Previously we rendered text objects based on their vertical position, but this confuses
some PDF viewers.
Closes #813.
2021-08-01 00:49:57 -07:00
James R. Barlow
37923ffe52
Work around Pillow 8.3.1 DPI changes
...
Pillow decided against round-tripping DPI values.
https://github.com/python-pillow/Pillow/pull/5476
Fixes #802
2021-07-14 02:34:28 -07:00
James R. Barlow
773e28478c
graft: don't use deprecated pikepdf APIs
2021-07-14 01:44:26 -07:00
James R. Barlow
0b834411fe
validation: mention ISO 639-2 to give people a clue about how to find the appropriate code
2021-06-28 15:19:59 -07:00
James R. Barlow
5f01c5e330
Fix another species of Tesseract version number breaking regex
...
Fixes #795
2021-06-16 00:09:03 -07:00
James R. Barlow
4030258bbc
Modernize build system to use setup.cfg
...
For now, keep but deprecate the requirements/*.txt files.
2021-06-10 00:27:52 -07:00
James R. Barlow
a964080f77
validation: a word
2021-05-27 13:42:05 -07:00
James R. Barlow
b4f2582766
Show ExitCodeException traceback in verbose mode
2021-05-27 01:26:25 -07:00
James R. Barlow
f3715daf15
Move HOCR_OK_LANGS to hocrtransform.py
2021-05-23 01:15:53 -07:00
James R. Barlow
c87221a4e6
validation: add proper list of languages supported by hocr
...
Based on Latin-1 support in default PDF fonts.
2021-05-23 01:14:15 -07:00
James R. Barlow
09c485bd88
Convert harmless Leptonica exception to warning
...
Closes #769 (Error when trying --remove-background on pdf)
2021-05-18 23:19:51 -07:00
James R. Barlow
7b1e5b4f41
Fix "invalid version number" for untagged tesseract versions
...
Fixes #770
2021-04-26 01:18:07 -07:00
James R. Barlow
a613722e96
Auto-register semfree.py if needed
2021-04-21 23:39:38 -07:00
James R. Barlow
ad0126185f
Rename awslambda -> semfree.py
2021-04-21 23:29:55 -07:00
James R. Barlow
be45871d10
docker: add special hint for using docker
2021-04-21 23:18:29 -07:00
James R. Barlow
5112e9e857
Redo dpi calc to avoid 'math.ulp'
2021-04-16 00:54:41 -07:00
James R. Barlow
d673126994
Fix ZeroDivisionError on files containing images drawn at scale 0
...
Fixes #761
2021-04-15 23:26:14 -07:00
James R. Barlow
710d797299
Fix comment typos
2021-04-14 00:36:58 -07:00
James R. Barlow
fc75254c60
Fix awslambda progressbar.update() error
...
Fixes #759
2021-04-13 14:37:30 -07:00
James R. Barlow
f453e94f14
awslambda: better documentation
2021-04-13 13:16:35 -07:00
James R. Barlow
8f8aaa93ed
Refactor removing log handlers
2021-04-13 13:16:21 -07:00
James R. Barlow
a90b9e669f
Add reminder to not mess with pool/listener order
2021-04-09 13:22:48 -07:00
James R. Barlow
e4f69cc1d6
Maybe fix a deadlock on attempting sys.stderr.flush()
...
Closes #758
Closes #733
2021-04-09 01:59:30 -07:00
James R. Barlow
a5852ba199
Remove parent process's log handlers properly
2021-04-08 21:39:18 -07:00
James R. Barlow
051b9da991
Remove undocumented/unused debug environment variables
2021-04-08 21:08:12 -07:00
James R. Barlow
139d9f9841
Shut up pikepdf mmap disabled message
2021-04-08 20:58:46 -07:00
James R. Barlow
9de38afb13
Fix Tesseract version for ubuntu 18.04
2021-04-08 13:07:13 -07:00
James R. Barlow
9db9a3d6ec
helpers: improve test coverage of Resolution
2021-04-07 23:26:37 -07:00
James R. Barlow
8423bd549b
helpers: don't trap exception on failure to unlink
...
If we can't unlink a file we expect to unlink, logging and moving on is
probably the wrong action.
Coverage never hits this line.
2021-04-07 23:16:10 -07:00
James R. Barlow
336d274a54
Drop remnants of support for Tesseract without has_textonly_pdf
...
Also improve Tesseract version checking so it can compare all of their
weird conventions.
2021-04-07 23:05:21 -07:00
James R. Barlow
2a09a668f6
Delinting: unused args
2021-04-07 02:18:08 -07:00
James R. Barlow
173a80864d
Delinting
2021-04-07 02:09:45 -07:00
James R. Barlow
aa115a8be3
Remove pytest_helpers_namespace
2021-04-07 01:56:51 -07:00
James R. Barlow
a2033698fa
graft: use newer pikepdf style
2021-04-02 00:11:13 -07:00
James R. Barlow
ec1d585d40
Merge branch 'feature/misc-breaking'
2021-04-01 16:51:04 -07:00
James R. Barlow
a4e1f8e1f3
Merge branch 'feature/lambda'
2021-04-01 16:36:22 -07:00
James R. Barlow
2e155c31bf
Ensure builtin module registration is deterministic
2021-04-01 16:30:42 -07:00
James R. Barlow
e09ae9c68a
Fix test suite failure if filter_pdf_page is missing
2021-04-01 16:25:40 -07:00
James R. Barlow
0a42934c08
Exclude Group 3 images from optimization
2021-03-20 23:28:21 -07:00
James R. Barlow
079c162a96
Ensure sidecar is not input or output file
2021-03-05 00:29:42 -08:00
James R. Barlow
25c8c4656f
Fix error message change
2021-03-03 01:02:59 -08:00
James R. Barlow
6e71fe1186
Clarify --unpaper-args errors
2021-03-03 00:44:21 -08:00
James R. Barlow
8ffc99f648
optimize: log errors more loudly
2021-03-03 00:43:40 -08:00
James R. Barlow
4124889f36
Don't generate PDF/A-1b with object streams
...
Acrobat insists that PDF/A-1b should not have object streams.
Other programs like veraPDF disagree with this restriction, but
we can accommodate Acrobat so we will.
Also add more tests around this.
2021-02-26 00:23:57 -08:00
James R. Barlow
a23c22b0e8
helpers: tidy check_pdf
2021-02-25 22:51:53 -08:00
Dima Kuznetsov
5e2206bae7
Allow --sidecar along with --pages (#735)
2021-02-19 16:55:35 -08:00
James R. Barlow
064f935699
Fix page rotation regression
...
Page size fixes in commit b26749 accounted for a "kept" rotation,
but not a corrected rotation.
Fixes #730.
2021-02-15 01:47:09 -08:00
James R. Barlow
2a52c6dec2
optimize: skip images with unusually small dimensions
...
They're unlikely to be handled well by our recompressors. It seems
that JBIG2 cannot handle very small widths.
Fixes #732
2021-02-14 01:43:25 -08:00
James R. Barlow
a48ca556c7
Add filter_pdf_page hook
2021-02-14 01:22:33 -08:00
James R. Barlow
9cba738b48
Remove deprecated code
2021-01-31 19:27:59 -08:00