Commit Graph

517 Commits

Author SHA1 Message Date
James R. Barlow
5dbc080fa0 Rename PDFContext->PdfContext 2020-05-02 04:32:46 -07:00
James R. Barlow
e02f6c1e97 Support plugin invocation with API 2020-05-02 03:34:31 -07:00
James R. Barlow
016dfd420c Add warning if problematic --tesseract-pagesegmode is selected
Fixes #549
2020-04-30 04:12:11 -07:00
James R. Barlow
8f5c95f0f4 Remove last vestiges of command line usage of qpdf - change to check_pdf 2020-04-26 05:33:26 -07:00
James R. Barlow
991db17fde Remove Ghostscript-based text extraction
While faster than Python based methods, we've outgrown the limited
amount of information Ghostscript provides with this feature, and it
repeats an analysis we have to do anyway to learn what images are
present.
2020-04-26 04:02:07 -07:00
James R. Barlow
7513f5425c Fix some broken tests 2020-04-26 03:49:20 -07:00
James R. Barlow
94c52a6fa3 Refactor 'xyres' into Resolution 2020-04-24 04:12:05 -07:00
James R. Barlow
57771f06a3 Refactor xy-pair for resolution to tuple 2020-04-16 15:38:33 -07:00
James R. Barlow
31b5f63f85 hocrtransform: cleanup/PEP8
Some API breaking changes.
2020-04-15 02:48:56 -07:00
James R. Barlow
957fb1494e pytest picky about list vs tuple 2020-04-15 02:26:20 -07:00
James R. Barlow
2155bcacb4 Loosen test language requirements - eng/deu 2020-04-15 00:30:38 -07:00
James R. Barlow
346da95899 Suppress loglevel since we have color now 2020-04-15 00:09:36 -07:00
James R. Barlow
d146d2b65c The Great Logging Refactor
Remove all instances of logger object being passed as parameters.
This was a holdover from ruffus, and complicated a lot of simple things.
2020-04-14 23:59:33 -07:00
James R. Barlow
4a640b8dcd Fix language argument not working as list
Fixes #523
2020-04-14 23:18:52 -07:00
James R. Barlow
9471bc8921 Fix versions with leading v, e.g. v5.0 2020-04-10 13:42:33 -07:00
James R. Barlow
d13d70fd56 Fix version checker failing for qpdf 10.0.0
Fixes #527
2020-04-10 13:00:19 -07:00
James R. Barlow
23bc3d3a29 tests: workaround for Ghostscript 9.52 txtwrite problem 2020-03-29 22:45:16 -07:00
James R. Barlow
8307832ce9 tests: add force OCR to a file with text that Ghostscript doesn't see
For gs 9.52 support.

Also refactor use of pikepdf.open() to use with blocks.
2020-03-29 22:44:27 -07:00
James R. Barlow
378e4dae3b Expand documentation for subprocess.run() from test 2020-03-04 13:37:44 -08:00
James R. Barlow
b3b61c152c Handle malformed DocumentInfo (#497)
User submitted a PDF in which /Trailer /Info pointed to the XMP metadata
block instead of a DocumentInfo dictionary. Fix and add test.
2020-03-03 03:27:01 -08:00
James R. Barlow
4a27124eab Simplify metadata for invalid xml in output
Removes possibly non-free resource enron1.pdf.
2020-02-12 00:07:18 -08:00
James R. Barlow
ce97af5a79 Add OCR quality measurement API 2020-01-17 03:10:27 -08:00
James R. Barlow
61a2674317 Skip test that needs chmod when on Windows 2020-01-06 02:36:04 -08:00
James R. Barlow
9ad8cbf1f6 Fix assert that depends on POSIX-y file handling 2020-01-06 02:02:05 -08:00
James R. Barlow
9c5f0d0ec6 Eliminate last use of PyPDF2 from test suite 2020-01-04 16:32:01 -08:00
James R. Barlow
32041c43e1 tests: improve tesseract coverage 2020-01-04 02:35:14 -08:00
James R. Barlow
1037d73efb tests: use smaller files for ghostscript 2019-12-31 17:20:28 -08:00
James R. Barlow
aeb7b142a9 tests: skip tests not compatible with coverage
For reasons not entirely clear, stdout will get some data injected when
pytest-cov is running. Our tests that check for clean stdout need to
ignore this.

We check for an environment variable that is defined only when coverage is
running.
2019-12-31 17:10:51 -08:00
James R. Barlow
422ea9777e Remove session scope from fixtures
pytest seems to prepare os.environ in complex ways, so we want to ensure
these fixtures are not reused.
2019-12-31 17:09:23 -08:00
James R. Barlow
2f1c743227 Rewrite main pool loop
pytest-cov documentation recommends using explicit management of
multiprocessing.Pool rather than the context manager. This is supposed to
work better for collecting coverage data, particularly on Windows.
2019-12-31 16:23:41 -08:00
James R. Barlow
96ee21aee9 Try to set up subprocess coverage better 2019-12-31 15:39:45 -08:00
James R. Barlow
4b759af6ff tests: fix problems with ghostscript spoofers 2019-12-31 15:33:03 -08:00
James R. Barlow
25d2b0cda4 test: environment warnings/cleanup 2019-12-30 22:38:50 -08:00
James R. Barlow
c36e9950ae tests: test TqdmConsole 2019-12-30 17:51:09 -08:00
James R. Barlow
0c0d53b10f tests: AcroForm test case did not work correctly; fixed 2019-12-30 17:50:32 -08:00
James R. Barlow
63de7e1677 Improve error message for unreadable input files 2019-12-30 16:14:52 -08:00
James R. Barlow
b0e92760a2 tests: add coverage for helpers 2019-12-30 15:52:10 -08:00
James R. Barlow
c5edff2c2f Sort imports 2019-12-19 15:31:18 -08:00
James R. Barlow
c5571388e2 Improve test coverage of _sync.py 2019-12-10 01:06:27 -08:00
James R. Barlow
607eee198d tests: split out preprocessing tests 2019-12-09 16:18:01 -08:00
James R. Barlow
5e2a7f8a56 tests: speed up several slow tests 2019-12-09 16:17:57 -08:00
James R. Barlow
7be293f628 Address tests that fail on Windows with Python 3.7 or 3.6 2019-12-09 16:17:10 -08:00
James R. Barlow
f6510e2b15 Document function of symlink shim 2019-12-06 15:00:12 -08:00
James R. Barlow
51abd79136 Tesseract no longer posts an error message if config file not found 2019-12-04 21:35:28 -08:00
James R. Barlow
5607429d9a tests: error message from tesseract change 2019-12-04 21:31:01 -08:00
James R. Barlow
9db01c7ff5 Remove test_bad_utf8
Due to difficulties of getting this to work on Python 3.8, Windows, and
high probability that this behavior is now gone from Tesseract 4.0+.

Originally added in 2017.
2019-12-04 21:01:09 -08:00
James R. Barlow
cff37bf681 Make test_german more Windows-friendly 2019-12-04 21:01:09 -08:00
James R. Barlow
66d04dd6e3 Don't expect filenames to be replicated on NT 2019-12-04 21:01:09 -08:00
James R. Barlow
06a1f987d4 Use _OCRMYPDF_TEST_PATH for testing and .py stubs to simulate symlinks 2019-12-04 21:01:06 -08:00
James R. Barlow
e51e21c6b6 ghostscript: Refactor checking for executable name on Windows 2019-12-04 21:01:06 -08:00
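
Two entries above (9471bc8921 and d13d70fd56) concern checking the versions of external tools, including strings with a leading "v" such as v5.0 and a multi-digit major version such as qpdf 10.0.0. The log does not show the actual code or say what the underlying bug was, so the following is only a minimal sketch of one way such a check could be written. The names parse_version and version_at_least are hypothetical and are not taken from the project. It compares versions as tuples of integers, since a plain string comparison would rank "10.0.0" below "8.0.2".

```python
import re
from typing import Tuple


def parse_version(text: str) -> Tuple[int, ...]:
    """Hypothetical helper: turn 'v5.0' or '10.0.0' into a tuple of ints.

    An optional leading 'v' or 'V' is ignored; anything after the dotted
    numeric part (for example a '-beta' suffix) is dropped.
    """
    match = re.match(r"[vV]?(\d+(?:\.\d+)*)", text.strip())
    if match is None:
        raise ValueError(f"unrecognized version string: {text!r}")
    return tuple(int(part) for part in match.group(1).split("."))


def version_at_least(found: str, required: str) -> bool:
    """Hypothetical helper: True if 'found' is the same as or newer than 'required'."""
    a, b = parse_version(found), parse_version(required)
    # Pad the shorter tuple with zeros so that '5.0' and '5.0.0' compare equal.
    width = max(len(a), len(b))
    a_padded = a + (0,) * (width - len(a))
    b_padded = b + (0,) * (width - len(b))
    return a_padded >= b_padded
```

With these assumed helpers, version_at_least("v5.0", "4.1.1") and version_at_least("10.0.0", "8.0.2") both evaluate to True, while the string comparison "10.0.0" >= "8.0.2" is False.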