Commit Graph

81 Commits

Author SHA1 Message Date
James R. Barlow
67553fc5c6 Display page numbers in log messages when grafting 2020-09-17 01:20:50 -07:00
James R. Barlow
1f15ecbca5 Add "Postprocessing" message as a hint for long Ghostscript runs 2020-09-08 02:34:10 -07:00
James R. Barlow
aa0ec40102 Change license of all GPLv3 files to MPL-2.0
https://github.com/jbarlow83/OCRmyPDF/issues/600
2020-08-05 00:44:42 -07:00
James R. Barlow
a672422b0b Enable pikepdf mmap in other contexts 2020-07-22 00:20:07 -07:00
James R. Barlow
addc2cbad0 Enable pikepdf mmap and set up signal handlers 2020-07-22 00:19:50 -07:00
James R. Barlow
60be64a5f1 Fix debug.log missing pageno handler 2020-07-04 03:59:38 -07:00
James R. Barlow
f4cb424451 Support input/output streams at API level 2020-06-22 02:02:18 -07:00
James R. Barlow
642998ead6 sync: refactor preprocess image filtering 2020-06-15 15:26:41 -07:00
James R. Barlow
698aab4f75 Add a lot of type annotations 2020-06-15 15:20:50 -07:00
James R. Barlow
34231ac667 sync: refactor intermediate image production 2020-06-15 15:02:28 -07:00
James R. Barlow
ddedf7cd2e For --clean-final, use same image as --clean if possible 2020-06-15 13:48:49 -07:00
James R. Barlow
872bafad4b Reinstate quick test for text/no text
Partial revert of commit 991db17
2020-06-10 12:00:52 -07:00
James R. Barlow
8599400445 Only do page analysis on pages we will do OCR on 2020-06-10 11:33:27 -07:00
James R. Barlow
aa060db5bc Refactor tesseract_env variable into the plugin
Removed all cases except one in api.py, which isn't worth solving because
it should be removed anyway.

This also fixes a logic error in the OMP_THREAD_LIMIT decision: api.py
did not pass kwargs correctly, so they never worked before.
2020-05-26 02:14:06 -07:00
James R. Barlow
a0f9ca3a30 Move Tesseract options validation into plugin 2020-05-25 01:31:46 -07:00
James R. Barlow
9bccff4f88 Move Tesseract specific arguments to plugin 2020-05-16 03:24:31 -07:00
James R. Barlow
8174089c8b Begin transforming Tesseract into pluggable OCR engine 2020-05-14 03:54:21 -07:00
James R. Barlow
e760622a5c graft: refactor 2020-05-07 02:03:42 -07:00
James R. Barlow
85cbf94a6e Convert many uses of str paths to Path 2020-05-06 02:53:47 -07:00
James R. Barlow
6f4286e1b1 New hook: filter_page_image 2020-05-06 02:24:07 -07:00
James R. Barlow
c85278b31d Delinting 2020-05-03 00:53:29 -07:00
James R. Barlow
5dbc080fa0 Rename PDFContext->PdfContext 2020-05-02 04:32:46 -07:00
James R. Barlow
e02f6c1e97 Support plugin invocation with API 2020-05-02 03:34:31 -07:00
James R. Barlow
be107b4fed Set up filter_ocr_image hook 2020-05-01 02:56:41 -07:00
James R. Barlow
8d2535e327 Get pluggy to work with forking workers 2020-05-01 02:39:50 -07:00
James R. Barlow
5eb4fe0052 Refactor plugin setup to get_plugin_manager 2020-05-01 02:18:31 -07:00
James R. Barlow
d8ff4485f8 Move samefile to helpers 2020-05-01 02:18:11 -07:00
James R. Barlow
82bce463ae Start pluggy-based plugin system 2020-05-01 02:15:23 -07:00
James R. Barlow
8f5c95f0f4 Remove last vestiges of command line usage of qpdf - change to check_pdf 2020-04-26 05:33:26 -07:00
James R. Barlow
18c4aa10bf Adjust number of workers for concurrent page scanning 2020-04-26 04:21:15 -07:00
James R. Barlow
991db17fde Remove Ghostscript-based text extraction
While faster than Python based methods, we've outgrown the limited
amount of information Ghostscript provides with this feature, and it
repeats an analysis we have to do anyway to learn what images are
present.
2020-04-26 04:02:07 -07:00
James R. Barlow
8c381a0227 Replace task_initargs with use of partial() 2020-04-26 03:49:20 -07:00
James R. Barlow
af3c3c6466 Further refactoring of concurrency concerns 2020-04-26 03:49:20 -07:00
James R. Barlow
db3e75e33e Refactor multiprocessing pool 2020-04-26 03:49:13 -07:00
James R. Barlow
c2919f2e1c Reinstate logging of page numbers 2020-04-15 00:05:23 -07:00
James R. Barlow
d146d2b65c The Great Logging Refactor
Remove all instances of logger object being passed as parameters.
This was a holdover from ruffus, and complicated a lot of simple things.
2020-04-14 23:59:33 -07:00
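The idea behind this refactor can be sketched as follows. This is a minimal illustration of the module-level logger pattern from the Python standard library, not code taken from the commit; the function name `rasterize_page` and its return value are hypothetical.

```python
import logging

# Module-level logger: looked up once by name, so it does not need to be
# threaded through every function signature as a parameter.
log = logging.getLogger(__name__)


def rasterize_page(page_number: int) -> str:
    """Hypothetical pipeline step; note there is no `log` parameter."""
    log.info("Rasterizing page %d", page_number)
    return f"page-{page_number:04d}.png"
```

Callers configure handlers and levels once at startup, and every module that calls `logging.getLogger(__name__)` inherits that configuration.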
James R. Barlow
2490be8490 Fix debug.log not being deleted on Windows (probably)
Fixes #515
2020-03-29 21:53:56 -07:00
James R. Barlow
e4cc9fcba7 Wrong number of threads to use shown when OMP_THREAD_LIMIT is defined 2020-03-23 01:06:55 -07:00
James R. Barlow
bcf77375c0 Fix grammar in output message 2020-01-28 07:33:28 -08:00
James R. Barlow
ce97af5a79 Add OCR quality measurement API 2020-01-17 03:10:27 -08:00
James R. Barlow
123fde174d Don't use debug.log in pytest
pytest does not reset the state of logging if we install a file handler,
which will cause FileNotFoundError after the temporary folder is removed.

Semi-related:
https://github.com/pytest-dev/pytest/issues/5502
2020-01-06 01:46:19 -08:00
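The failure mode described above comes from a file handler outliving the folder it writes to. The commit avoids the problem by not creating debug.log under pytest; a different, hedged way to illustrate the underlying issue is a context manager that always detaches and closes its handler. The name `temporary_file_log` is hypothetical and not part of OCRmyPDF.

```python
import logging
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def temporary_file_log(logger: logging.Logger, path: Path):
    """Attach a FileHandler for the duration of the block only."""
    handler = logging.FileHandler(path, encoding="utf-8")
    logger.addHandler(handler)
    try:
        yield handler
    finally:
        # Detach and close, so no handler points at a file inside a
        # temporary folder after that folder has been removed.
        logger.removeHandler(handler)
        handler.close()
```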
James R. Barlow
6f5d77d930 Also generate log file in temp folder on verbose mode 2020-01-05 21:33:32 -08:00
James R. Barlow
e2a563cc76 logging: create a debug log when -k parameter is issued 2020-01-01 16:47:15 -08:00
James R. Barlow
2f1c743227 Rewrite main pool loop
pytest-cov documentation recommends using explicit
management of multiprocessing.Pool rather than the context manager.
This is supposed to work better for collecting coverage data, particularly
on Windows.
2019-12-31 16:23:41 -08:00
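A minimal sketch of what explicit pool management looks like, as opposed to `with Pool() as pool:`. Leaving the `with` block calls `terminate()`, whereas `close()` followed by `join()` lets workers exit normally. The helper name `map_with_explicit_pool` and the `pool_factory` parameter are assumptions made for illustration; this is not the loop from the commit.

```python
from multiprocessing import Pool
from multiprocessing.pool import ThreadPool  # same interface, thread-backed


def map_with_explicit_pool(func, items, processes=2, pool_factory=Pool):
    """Run func over items, shutting the pool down explicitly."""
    pool = pool_factory(processes=processes)
    try:
        results = pool.map(func, items)
        pool.close()  # no more work; workers finish and exit normally
    except BaseException:
        pool.terminate()  # stop workers immediately on any failure
        raise
    finally:
        pool.join()  # wait for worker shutdown in both cases
    return results
```

Passing `pool_factory=ThreadPool` exercises the same shutdown sequence without spawning processes, which can be convenient for a quick check.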
James R. Barlow
63de7e1677 Improve error message for unreadable input files 2019-12-30 16:14:52 -08:00
James R. Barlow
c5edff2c2f Sort imports 2019-12-19 15:31:18 -08:00
James R. Barlow
b8b7ecfe7f Fix DecompressionBomb related errors due to Windows process differences 2019-12-04 21:10:27 -08:00
James R. Barlow
b7f63bc93d Make devnull check compatible with Windows 2019-12-04 17:04:08 -08:00
James R. Barlow
4d26867dee Delinting 2019-09-20 17:17:11 -07:00
James R. Barlow
78e8bf9cbf Use at most 3 Tesseract threads
Based on a user suggestion and
tesseract-ocr/tesseract#2611, I reviewed thread limits and found that
a thread limit of 3 is still beneficial, but 4 is not.

> time env OMP_THREAD_LIMIT=2 tesseract omp4.png stdout >/dev/null
Warning: Invalid resolution 0 dpi. Using 70 instead.
Estimating resolution as 143
116.67user 1.67system 1:26.26elapsed 137%CPU (0avgtext+0avgdata 356752maxresident)k
2213inputs+0outputs (18major+131059minor)pagefaults 0swaps
> time env OMP_THREAD_LIMIT=3 tesseract omp4.png stdout >/dev/null
Warning: Invalid resolution 0 dpi. Using 70 instead.
Estimating resolution as 143
136.89user 1.63system 1:19.56elapsed 174%CPU (0avgtext+0avgdata 356784maxresident)k
821inputs+0outputs (0major+131080minor)pagefaults 0swaps
> time env OMP_THREAD_LIMIT=4 tesseract omp4.png stdout >/dev/null
Warning: Invalid resolution 0 dpi. Using 70 instead.
Estimating resolution as 143
161.31user 1.51system 1:18.80elapsed 206%CPU (0avgtext+0avgdata 356632maxresident)k
8477inputs+0outputs (12major+131074minor)pagefaults 0swaps
> time env OMP_THREAD_LIMIT=8 tesseract omp4.png stdout >/dev/null
Warning: Invalid resolution 0 dpi. Using 70 instead.
Estimating resolution as 143
160.30user 1.62system 1:18.01elapsed 207%CPU (0avgtext+0avgdata 356640maxresident)k
821inputs+0outputs (0major+131078minor)pagefaults 0swaps
2019-09-20 17:12:36 -07:00
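The cap described in this commit could be applied roughly as below. This is a sketch under stated assumptions: the function names are hypothetical, dividing CPUs evenly among concurrent jobs is an assumption, and respecting an existing OMP_THREAD_LIMIT is an illustrative choice. Only the limit of 3 comes from the commit message.

```python
import os

# From the commit message: a limit of 3 still helps, 4 does not.
MAX_TESSERACT_THREADS = 3


def tesseract_thread_limit(cpu_count: int, jobs: int) -> int:
    """Threads per Tesseract process, capped at MAX_TESSERACT_THREADS."""
    per_job = max(1, cpu_count // max(1, jobs))  # assumed even split
    return min(MAX_TESSERACT_THREADS, per_job)


def build_tesseract_env(cpu_count: int, jobs: int, base_env=None) -> dict:
    """Copy the environment and set OMP_THREAD_LIMIT if it is not set."""
    env = dict(os.environ if base_env is None else base_env)
    # setdefault keeps a value the user already exported.
    env.setdefault(
        "OMP_THREAD_LIMIT", str(tesseract_thread_limit(cpu_count, jobs))
    )
    return env
```

The resulting dictionary can be passed as `env=` to `subprocess.run` when launching the OCR process.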