Commit Graph

803 Commits

Author SHA1 Message Date
James R. Barlow
59f6bc8306 Move Tesseract-specific language checks to its plugin 2024-06-01 00:15:50 -07:00
James R. Barlow
abf9729c61 Semfree test: accept pdfa conversion failed as a valid return code
Fixes #1316
2024-05-21 01:26:11 -07:00
James R. Barlow
7a8cc21e31 Add support for sidecar output to io.BytesIO
Closes #1252
2024-04-07 01:38:55 -07:00
James R. Barlow
065bddbc6c Reformat with ruff format 2024-04-07 00:25:32 -07:00
James Barlow
855de287b2 Fix test suite failure with Ghostscript >= 10.3
Ghostscript is more picky about a specific case with SMask that cannot be converted to PDF/A

Details here: 4dcfae36bb
2024-03-19 17:20:33 -07:00
James R. Barlow
6a746a1cbb ruff linting/Python 3.10 cleanup 2024-02-14 12:41:51 -08:00
James R. Barlow
42ff7fc842 Fix handling of pages that are restored to correct orientation with /Rotate
Appears inversion of CTM was incorrect, introduced in commit 9898904
2024-02-12 01:32:26 -08:00
James R. Barlow
26470fe16a Suppress reportlab deprecation warning 2024-02-12 01:17:08 -08:00
James R. Barlow
74d2a156c4 Update cache 2024-01-07 01:35:05 -08:00
James R. Barlow
14365d10b8 Skip testing oom killer on Python 3.12
Need to investigate further if there's a safe way to do this test.
2024-01-02 16:28:22 -08:00
James R. Barlow
9489c01259 Skip test_encrypted on Py3.12 + macOS 2023-12-08 00:12:24 -08:00
James R. Barlow
a4987733c4 Filter rl_safe_eval deprecation warning
Full message
eportlab/lib/rl_safe_eval.py:11: DeprecationWarning: ast.NameConstant is deprecated and will be removed in Python 3.14; use ast.Constant instead
    haveNameConstant = hasattr(ast,'NameConstant')

Warning triggered by reportlab-4.0.7 and Python 3.12
2023-12-07 23:40:23 -08:00
James R. Barlow
445617a1a5 Rebuild cache for hocr default case 2023-12-03 15:16:18 -08:00
James R. Barlow
f6e90a5934 hOCR renderer is now default 2023-12-02 19:58:00 -08:00
James R. Barlow
11d3e32f1e Fix hocrtransform CLI 2023-12-02 08:08:29 -08:00
James R. Barlow
03669183d7 Rationalize canvas interface 2023-11-20 15:54:13 -08:00
James R. Barlow
db2e5132e6 Remove some obsolete parameters 2023-11-20 00:10:55 -08:00
James R. Barlow
c591f9601a Remove Latin hOCR test 2023-11-19 23:51:27 -08:00
James R. Barlow
27d5229842 Make logger names unique 2023-11-09 23:03:39 -08:00
James R. Barlow
a596ccf844 Raise exception if resulting PDF might appear blank in some PDF viewers
Fixes #1187
2023-11-09 22:33:22 -08:00
James R. Barlow
e7fa97731f ghostscript duplicate filter: filter within a window of previous messages 2023-11-09 22:32:39 -08:00
James R. Barlow
290aa28108 Fix error on attempt to write to debug log after removing debug log handler 2023-11-09 16:02:41 -08:00
James R. Barlow
916106733c Skip semfree unless on Linux 2023-10-30 00:33:21 -07:00
James R. Barlow
71166f7be8 Make hocr API experimental for now
This commit can be reverted when we are ready to release a new version.
2023-10-30 00:07:10 -07:00
James R. Barlow
580252a1a0 Merge branch 'feature/gscan2pdf'
Reconcile release notes and copy_final() with new pipeline.
2023-10-30 00:01:28 -07:00
James R. Barlow
b5e73ac4e4 Drop check for obsolete .dockerinit file 2023-10-24 13:49:46 -07:00
James R. Barlow
db3df13e95 Remove ocrmypdf._sync 2023-10-24 00:54:31 -07:00
James R. Barlow
9ffb45f283 Remove public domain congress.jpg and replace with baiona_color.jpg
For reuse compliance we are phasing out public domain licenses
2023-10-24 00:54:31 -07:00
James R. Barlow
a06ab2a1c5 unpaper: Remove format conversion
Code is no longer reachable since we rasterize a 1/L/RGB image prior to this point.
2023-10-24 00:54:31 -07:00
James R. Barlow
dfa4ebf1a6 Simplify function signature of extract_image_filter 2023-10-24 00:54:31 -07:00
James R. Barlow
58f388c69d optimize: better coverage 2023-10-24 00:54:31 -07:00
James R. Barlow
990b462a94 Fix coverage settings and cover semfree 2023-10-24 00:54:31 -07:00
James R. Barlow
b928dc0808 Skip fewer tests 2023-10-24 00:54:31 -07:00
James R. Barlow
8916955f45 Convert many run_ocrmypdf -> run_ocrmypdf_api 2023-10-24 00:54:31 -07:00
James R. Barlow
82bef40aa6 Eliminate more run_ocrmypdf calls 2023-10-24 00:54:31 -07:00
James R. Barlow
1c45f32941 tests: replace many run_ocrmypdf -> run_ocrmypdf_api 2023-10-24 00:54:31 -07:00
James R. Barlow
fadc0cf69b Replace cryptic test error messages with more informative ones 2023-10-24 00:54:31 -07:00
James R. Barlow
eb3a51e33a Prefer pikepdf's newer Page.mediabox accessor over .MediaBox 2023-10-24 00:54:31 -07:00
James R. Barlow
a4059762e6 Fix hocrtransform test to generate blank hocr 2023-10-24 00:54:31 -07:00
James R. Barlow
16eb5627a7 Fix unused imports and other trivia 2023-10-24 00:54:31 -07:00
James R. Barlow
fbf0674189 hocr_to_ocr_pdf: handle missing hocr json file 2023-10-24 00:54:31 -07:00
James R. Barlow
7935914f55 Use empty .hocr file instead of dummy template for symmetry with sandwich 2023-10-24 00:54:31 -07:00
James R. Barlow
23951c9e38 Working HOCR folder to PDF converter 2023-10-24 00:54:30 -07:00
James R. Barlow
e8ae370ceb Eliminate api= kwarg and implicit creation of pluginmanager 2023-10-24 00:54:30 -07:00
James R. Barlow
1a7738a925 Refactor -migrate metadata repair to new module 2023-10-24 00:54:30 -07:00
James R. Barlow
68bb38d0ad pdf_to_hocr: improve plugin handling 2023-10-24 00:52:31 -07:00
James R. Barlow
0443e87345 Introduce pdf_to_hocr API 2023-10-24 00:52:31 -07:00
James R. Barlow
95b14ee282 Refactor lossless reconstruction setter into separate function
Still messy but good enough as a start.
2023-10-24 00:52:31 -07:00
James R. Barlow
93fda0dd00 Detect and warn about Tagged PDFs 2023-10-12 01:03:09 -07:00
James R. Barlow
91a14660b3 Require Pillow >= 10.0.1 and drop shims for older versions 2023-10-04 00:04:28 -07:00