Commit Graph

351 Commits

Author SHA1 Message Date
James R. Barlow
de80fb6bc8 Fix some failing tests after --redo-ocr changes 2018-10-29 11:49:38 -07:00
James R. Barlow
f564aaf485 Remove only_ocr_text 2018-10-28 22:41:18 -07:00
James R. Barlow
58cc70725e Reorganize around getting bboxes for visible/invisible text 2018-10-26 01:07:02 -07:00
James R. Barlow
16af753206 Add functional "redo OCR" feature
Needs argument validation and some other changes. Needs testing
with mixed-content PDFs.

Only really works for pure invisible text at the moment.
2018-10-19 00:02:19 -07:00
James R. Barlow
b18e66e2ca pdfinfo: learn to detect vector graphic objects 2018-10-18 01:21:51 -07:00
James R. Barlow
1495b78330 Remove cruft to support leptonica < 1.72 in test suite 2018-10-11 01:37:32 -07:00
James R. Barlow
5c229d48d5 optimize: Reorganize so JBIG2 can be performed on images reduced to 1bpp
Closes #297
2018-10-04 11:53:11 -07:00
James R. Barlow
5b84549716 Change JBIG2 lossy mode to require --jbig2-lossy 2018-10-04 01:20:49 -07:00
James R. Barlow
a71e4488b3 test: fix pytest warning about direct use of a fixture 2018-10-03 15:04:46 -07:00
James R. Barlow
9fa471e053 Test: send stderr to stderr, why don't we? 2018-10-03 14:23:34 -07:00
James R. Barlow
31ef2fe907 test: this error message changed case in newer Tesseract 2018-10-03 13:58:20 -07:00
James R. Barlow
9a8ec4b210 optimize: only enable lossy JBIG2 for -O3 2018-10-03 00:38:58 -07:00
James R. Barlow
17a3fa671c ghostscript: API docs update 2018-09-14 23:51:52 -07:00
James R. Barlow
686207ab7f Check for and reject Adobe LiveCycle Designer PDFs
These are the ones that display a "Please wait..." message.

Closes #296
2018-09-13 21:50:51 -07:00
James R. Barlow
517b385fe5 Work around loss of Unicode DOCINFO in Ghostscript 9.24+
Ghostscript no longer supports UTF-16-BE-hex strings as a way of
supplying Unicode data in pdfmark so we have lost this functionality too:
http://git.ghostscript.com/?p=ghostpdl.git;a=commit;h=e997c6836d243ab37fe3a5f0d57974af95eb5eac

For users this means setting --title, --author, etc. will not work if gs
9.24 is installed, but if the file has existing metadata it might work.

For now we enforce police-state-strict ASCII, until there's time to
implement proper metadata editing. Relevant tests set to xfail.
2018-09-13 21:33:39 -07:00
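The strict-ASCII rule described in this commit could be enforced roughly as follows. This is a minimal sketch and not the project's actual code: the helper name `require_ascii`, its parameters, and the error text are all hypothetical.

```python
def require_ascii(field: str, value: str) -> str:
    """Return value unchanged if it is pure ASCII, otherwise raise ValueError.

    Hypothetical helper: the name and the message are illustrative only.
    """
    try:
        value.encode("ascii")
    except UnicodeEncodeError as exc:
        # exc.start is the index of the first character that failed to encode.
        raise ValueError(
            f"--{field} must be ASCII only; offending character {value[exc.start]!r}"
        ) from exc
    return value
```

Rejecting the value up front, rather than silently dropping characters, keeps the behavior predictable for whoever supplied --title or --author.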
James R. Barlow
795019b0c1 Work around invalid TOC entries
Kodak Capture Desktop and probably other software creates a
/Outlines entry with /First being set to an invalid indirect reference to
an object that hasn't been created. This is legal in the PDF spec but
problematic for qpdf. The objgen will be (max valid object ID + 1, 0).
Because we create new objects in _weave, some TOC entries will end
up assigned to new objects we create. Typically /ProcSet.

We solve the issue by refactoring page traversal and then doing it
twice, once to resolve all references (eliminating the null
reference problem) and a second pass to make our changes.
2018-09-11 14:44:16 -07:00
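The failure mode and the two-pass fix described in this commit can be illustrated with a toy model. This sketch uses a plain dict of object IDs and does not use qpdf or pikepdf; the function names and the sample objects are assumptions made for illustration.

```python
def resolve_references(objects, refs):
    """Pass 1: map each reference to its target ID, or None if the target is missing."""
    return {name: (target if target in objects else None)
            for name, target in refs.items()}


def add_object(objects, value):
    """Allocate the next free object ID, as a writer would for a new object."""
    new_id = max(objects) + 1
    objects[new_id] = value
    return new_id


# Toy file: objects 1..3 exist, but /First points at object 4 (max valid ID + 1).
objects = {1: "catalog", 2: "outlines", 3: "page"}
refs = {"/First": 4}

# Single pass: creating an object first lets the dangling reference capture it.
naive = dict(objects)
add_object(naive, "/ProcSet")
captured = naive[refs["/First"]]  # the TOC entry now points at /ProcSet

# Two passes: resolve every reference first, then make changes.
safe = dict(objects)
resolved = resolve_references(safe, refs)  # the dangling entry becomes None
add_object(safe, "/ProcSet")
```

In the single-pass case the outline entry ends up pointing at the newly created object; resolving first pins it to null before any new IDs are handed out.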
James R. Barlow
3aac3a98ca tests: Migrate metadata tests to pikepdf
For some reason PyPDF2 has begun to trigger internal errors in
pytest on macOS alone. Not sure why, but nothing is wrong that I can
see. Seemed like an opportune time to switch to pikepdf; found some
new issues in the process anyway.
2018-09-10 16:06:01 -07:00
James R. Barlow
7aa4e60af2 Explain pytest --runslow 2018-08-03 00:57:59 -07:00
James R. Barlow
55eb481f30 Add intensive (optional) rotation test 2018-08-03 00:42:59 -07:00
James R. Barlow
c171cb7286 Merge img2pdf 0.3.0 fix from v6.2.3 2018-08-01 15:17:33 -07:00
James R. Barlow
1d09061130 Revert previous commit and reject input images with alpha channel
Decided on this for simplicity of old release branch.

Modifies baiona.png by stripping alpha, adds baiona_alpha which includes
the alpha.
2018-07-31 23:45:28 -07:00
James R. Barlow
a2203b2447 Discard alpha channel when triaging images 2018-07-25 22:23:41 -04:00
James R. Barlow
e7d21dd826 Skip locale check on Python 3.7 2018-07-12 03:03:34 -07:00
James R. Barlow
ea69883386 Tests: Speed up a slow test (cherry-picked from v7) 2018-07-12 02:47:15 -07:00
James R. Barlow
eb343b1e37 Tests: Add ability to disable use of cache (cherry-picked from v7) 2018-07-12 02:46:53 -07:00
James R. Barlow
1cc9d2d3d1 Fix path error on Py3.5 2018-07-08 01:01:06 -07:00
James R. Barlow
58642aa98b Fix issue #275: doesn't work when installed in non-Unicode path
Closes #275
2018-07-07 01:35:05 -07:00
James R. Barlow
7baaf00a38 Fix wrong return code tested 2018-07-05 13:49:22 -07:00
James R. Barlow
216d60ea2c pdfinfo: improve the regex 2018-07-04 00:59:32 -07:00
James R. Barlow
47885f4230 Remove initial qpdf.repair
Since pikepdf is doing the work the initial repair takes time and gives
little benefit.

It turns out to not be worthwhile to save the results of PdfInfo parsing,
since the time to save this seems to exceed the costs of recalculating
it since the "weave" code. At least for small files.
2018-07-03 16:50:05 -07:00
James R. Barlow
85f96b7fb0 Add test to optimize if jbig2 is present 2018-07-02 23:49:11 -07:00
James R. Barlow
39c44bdd2f Don't use --optimize in test since jbig2enc is not always installed 2018-07-02 23:48:23 -07:00
James R. Barlow
2974929b26 Make jpeg/png quality tunable args 2018-07-02 22:22:59 -07:00
James R. Barlow
7200623007 Fix installation for Python 3.7
Need to use private fork of ruffus for Python 3.7. Backward compatible with Python 3.6 for ruffus 2.6.3.

Disable locale checking for 3.7 since the various fixes in that release should make it unnecessary.
2018-07-02 16:47:14 -07:00
James R. Barlow
d4cbef9457 Update test cache with naming rule change 2018-06-29 12:04:20 -07:00
James R. Barlow
ed8ff79e10 Optimize some of our bigger test files
Only partially optimize multipage.pdf so that it hopefully
improves speed of test suite without being useless as an
optimization test.
2018-06-29 00:35:49 -07:00
James R. Barlow
e725f64b6a Add test case to ensure mono is not inverted 2018-06-29 00:25:11 -07:00
James R. Barlow
9637696a54 Fix test resources naming inconsistency 2018-06-28 23:37:14 -07:00
James R. Barlow
02b3ca6862 Compress test images more heavily 2018-06-28 21:40:12 -07:00
James R. Barlow
bc90f40a8f Replace all Pix.read with Pix.open 2018-06-28 15:13:26 -07:00
James R. Barlow
bf96171b65 Ignore whether or not textonly_pdf was used in cache
The difference doesn't matter in 7.0.0 anymore.
2018-06-23 02:58:26 -07:00
James R. Barlow
b81daf71d1 Regenerate test cache 2018-06-23 02:02:58 -07:00
James R. Barlow
faad1fc58a Reactivate two tests that weren't using their fixtures properly 2018-06-23 01:54:09 -07:00
James R. Barlow
6f48181a56 Disable a pylint 2018-06-23 01:53:04 -07:00
James R. Barlow
807c8b0726 Trailing whitespace 2018-06-23 01:51:19 -07:00
James R. Barlow
b0dbaeafc5 Cleanup unused imports 2018-06-23 01:47:53 -07:00
James R. Barlow
2530d1791b Fix several pylint errors and warnings 2018-06-23 00:54:22 -07:00
James R. Barlow
94150f414a Remove qpdf.merge
We no longer need to merge pages this way. Much of the functionality
was there to implement page splitting without hitting ulimit which
will be fixed in qpdf > 8.0.2. The tests were expensive to run.

Also remove pytest-timeout since it breaks the Linux build.
2018-06-23 00:45:03 -07:00
James R. Barlow
76e7e8dbbb Replace several uses of str(path) with fspath(path)
Helps make it more explicit. Did not do this to tests because use of paths
is more involved there.
2018-06-22 21:00:47 -07:00
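The difference between the two calls can be seen in a short sketch. The file name is hypothetical and `to_fs_string` is an illustrative helper, not something from the project; `os.fspath` itself is the standard library function (Python 3.6+).

```python
import os
from pathlib import PurePosixPath

path = PurePosixPath("input/scan.pdf")  # hypothetical file name

# For a real path object both calls give the same string.
as_str = str(path)
as_fspath = os.fspath(path)


def to_fs_string(obj):
    """Return os.fspath(obj), or a 'rejected: <type>' marker if obj is not path-like."""
    try:
        return os.fspath(obj)
    except TypeError:
        return f"rejected: {type(obj).__name__}"
```

`str()` accepts any object, so `str(None)` quietly yields the string "None", whereas `os.fspath(None)` raises `TypeError`. That is the sense in which `fspath` is more explicit: it only accepts `str`, `bytes`, or objects implementing `__fspath__`.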
James R. Barlow
9e765ddf46 Rename _optimize to optimize.py 2018-06-22 17:51:57 -07:00