James R. Barlow
5dbc080fa0
Rename PDFContext->PdfContext
2020-05-02 04:32:46 -07:00
James R. Barlow
e02f6c1e97
Support plugin invocation with API
2020-05-02 03:34:31 -07:00
James R. Barlow
016dfd420c
Add warning if problematic --tesseract-pagesegmode is selected
...
Fixes #549
2020-04-30 04:12:11 -07:00
James R. Barlow
8f5c95f0f4
Remove last vestiges of command line usage of qpdf - change to check_pdf
2020-04-26 05:33:26 -07:00
James R. Barlow
991db17fde
Remove Ghostscript-based text extraction
...
While faster than Python based methods, we've outgrown the limited
amount of information Ghostscript provides with this feature, and it
repeats an analysis we have to do anyway to learn what images are
present.
2020-04-26 04:02:07 -07:00
James R. Barlow
7513f5425c
Fix some broken tests
2020-04-26 03:49:20 -07:00
James R. Barlow
94c52a6fa3
Refactor 'xyres' into Resolution
2020-04-24 04:12:05 -07:00
James R. Barlow
57771f06a3
Refactor xy-pair for resolution to tuple
2020-04-16 15:38:33 -07:00
James R. Barlow
31b5f63f85
hocrtransform: cleanup/PEP8
...
Some API breaking changes.
2020-04-15 02:48:56 -07:00
James R. Barlow
957fb1494e
pytest picky about list vs tuple
2020-04-15 02:26:20 -07:00
James R. Barlow
2155bcacb4
Loosen test language requirements - eng/deu
2020-04-15 00:30:38 -07:00
James R. Barlow
346da95899
Suppress loglevel since we have color now
2020-04-15 00:09:36 -07:00
James R. Barlow
d146d2b65c
The Great Logging Refactor
...
Remove all instances of logger object being passed as parameters.
This was a holdover from ruffus, and complicated a lot of simple things.
2020-04-14 23:59:33 -07:00
James R. Barlow
4a640b8dcd
Fix language argument not working as list
...
Fixes #523
2020-04-14 23:18:52 -07:00
James R. Barlow
9471bc8921
Fix versions with leading v, e.g. v5.0
2020-04-10 13:42:33 -07:00
James R. Barlow
d13d70fd56
Fix version checker failing for qpdf 10.0.0
...
Fixes #527
2020-04-10 13:00:19 -07:00
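A minimal sketch of the kind of version handling the two fixes above deal with. The log does not say why the checker failed for qpdf 10.0.0; comparing versions as strings is only one plausible failure mode and is assumed here for illustration. `parse_version` and `version_at_least` are hypothetical names, not functions from the project.

```python
import re


def parse_version(text):
    """Hypothetical helper: turn 'v5.0' or '10.0.0' into a tuple of ints."""
    # Accept an optional leading "v", then dot-separated integers.
    match = re.match(r"v?(\d+(?:\.\d+)*)", text.strip())
    if match is None:
        raise ValueError(f"unrecognized version string: {text!r}")
    return tuple(int(part) for part in match.group(1).split("."))


def version_at_least(found, required):
    """Compare numerically, so that 10.0.0 is treated as newer than 8.0.2."""
    return parse_version(found) >= parse_version(required)
```

Comparing tuples of integers avoids the lexicographic trap where the string "10.0.0" sorts before "8.0.2". Note that this sketch treats (5, 0) as older than (5, 0, 0); a real checker would probably rely on an established version-parsing library instead.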
James R. Barlow
23bc3d3a29
tests: workaround for Ghostscript 9.52 txtwrite problem
2020-03-29 22:45:16 -07:00
James R. Barlow
8307832ce9
tests: add force OCR to a file with text that Ghostscript doesn't see
...
For gs 9.52 support.
Also refactor use of pikepdf.open() to use with blocks.
2020-03-29 22:44:27 -07:00
James R. Barlow
378e4dae3b
Expand documentation for subprocess.run() from test
2020-03-04 13:37:44 -08:00
James R. Barlow
b3b61c152c
Handle malformed DocumentInfo (#497)
...
User submitted a PDF in which /Trailer /Info pointed to the XMP metadata
block instead of a DocumentInfo dictionary. Fix and add test.
2020-03-03 03:27:01 -08:00
James R. Barlow
4a27124eab
Simplify metadata for invalid xml in output
...
Removes possibly non-free resource enron1.pdf.
2020-02-12 00:07:18 -08:00
James R. Barlow
ce97af5a79
Add OCR quality measurement API
2020-01-17 03:10:27 -08:00
James R. Barlow
61a2674317
Skip test that needs chmod when on Windows
2020-01-06 02:36:04 -08:00
James R. Barlow
9ad8cbf1f6
Fix assert that depends on POSIX-y file handling
2020-01-06 02:02:05 -08:00
James R. Barlow
9c5f0d0ec6
Eliminate last use of PyPDF2 from test suite
2020-01-04 16:32:01 -08:00
James R. Barlow
32041c43e1
tests: improve tesseract coverage
2020-01-04 02:35:14 -08:00
James R. Barlow
1037d73efb
tests: use smaller files for ghostscript
2019-12-31 17:20:28 -08:00
James R. Barlow
aeb7b142a9
tests: skip tests not compatible with coverage
...
For reasons not entirely clear, stdout will get some data injected when
pytest-cov is running. Our tests that check for clean stdout need to
ignore this. We check for an environment variable that is defined only
when coverage is running.
2019-12-31 17:10:51 -08:00
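The environment-variable check described above could look roughly like the sketch below. The commit does not name the variable it tests, so the names in `COVERAGE_ENV_VARS` are assumptions (variables that pytest-cov and coverage.py are known to use for subprocess measurement), and `running_under_coverage` is a hypothetical helper.

```python
import os

# Assumption: these variables are present only when coverage is being
# collected. The variable the project actually checks is not stated.
COVERAGE_ENV_VARS = ("COV_CORE_SOURCE", "COVERAGE_PROCESS_START")


def running_under_coverage(environ=None):
    """Return True if any coverage-related variable is defined."""
    env = os.environ if environ is None else environ
    return any(name in env for name in COVERAGE_ENV_VARS)
```

A test that asserts clean stdout could then be skipped with something like `pytest.mark.skipif(running_under_coverage(), reason="coverage injects data into stdout")`.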
James R. Barlow
422ea9777e
Remove session scope from fixtures
...
pytest seems to prepare os.environ in complex ways, so we want to ensure
these fixtures are not reused.
2019-12-31 17:09:23 -08:00
James R. Barlow
2f1c743227
Rewrite main pool loop
...
pytest-cov documentation recommends using explicit management of
multiprocessing.Pool rather than the context manager. This is supposed
to work better for collecting coverage data, particularly on Windows.
2019-12-31 16:23:41 -08:00
James R. Barlow
96ee21aee9
Try to set up subprocess coverage better
2019-12-31 15:39:45 -08:00
James R. Barlow
4b759af6ff
tests: fix problems with ghostscript spoofers
2019-12-31 15:33:03 -08:00
James R. Barlow
25d2b0cda4
test: environment warnings/cleanup
2019-12-30 22:38:50 -08:00
James R. Barlow
c36e9950ae
tests: test TqdmConsole
2019-12-30 17:51:09 -08:00
James R. Barlow
0c0d53b10f
tests: AcroForm test case did not work correctly; fixed
2019-12-30 17:50:32 -08:00
James R. Barlow
63de7e1677
Improve error message for unreadable input files
2019-12-30 16:14:52 -08:00
James R. Barlow
b0e92760a2
tests: add coverage for helpers
2019-12-30 15:52:10 -08:00
James R. Barlow
c5edff2c2f
Sort imports
2019-12-19 15:31:18 -08:00
James R. Barlow
c5571388e2
Improve test coverage of _sync.py
2019-12-10 01:06:27 -08:00
James R. Barlow
607eee198d
tests: split out preprocessing tests
2019-12-09 16:18:01 -08:00
James R. Barlow
5e2a7f8a56
tests: speed up several slow tests
2019-12-09 16:17:57 -08:00
James R. Barlow
7be293f628
Address tests that fail on Windows with Python 3.7 or 3.6
2019-12-09 16:17:10 -08:00
James R. Barlow
f6510e2b15
Document function of symlink shim
2019-12-06 15:00:12 -08:00
James R. Barlow
51abd79136
Tesseract no longer posts an error message if config file not found
2019-12-04 21:35:28 -08:00
James R. Barlow
5607429d9a
tests: error message from tesseract change
2019-12-04 21:31:01 -08:00
James R. Barlow
9db01c7ff5
Remove test_bad_utf8
...
Removed because of the difficulty of getting this to work on Python 3.8
and on Windows, and the high probability that this behavior is now gone
from Tesseract 4.0+. Originally added in 2017.
2019-12-04 21:01:09 -08:00
James R. Barlow
cff37bf681
Make test_german more Windows-friendly
2019-12-04 21:01:09 -08:00
James R. Barlow
66d04dd6e3
Don't expect filenames to be replicated on NT
2019-12-04 21:01:09 -08:00
James R. Barlow
06a1f987d4
Use _OCRMYPDF_TEST_PATH for testing and .py stubs to simulate symlinks
2019-12-04 21:01:06 -08:00
James R. Barlow
e51e21c6b6
ghostscript: Refactor checking for executable name on Windows
2019-12-04 21:01:06 -08:00