James R. Barlow
67553fc5c6
Display page numbers in log messages when grafting
2020-09-17 01:20:50 -07:00
James R. Barlow
1f15ecbca5
Add "Postprocessing" message as a hint for long Ghostscript runs
2020-09-08 02:34:10 -07:00
James R. Barlow
aa0ec40102
Change license of all GPLv3 files to MPL-2.0
...
https://github.com/jbarlow83/OCRmyPDF/issues/600
2020-08-05 00:44:42 -07:00
James R. Barlow
a672422b0b
Enable pikepdf mmap in other contexts
2020-07-22 00:20:07 -07:00
James R. Barlow
addc2cbad0
Enable pikepdf mmap and set up signal handlers
2020-07-22 00:19:50 -07:00
James R. Barlow
60be64a5f1
Fix debug.log missing pageno handler
2020-07-04 03:59:38 -07:00
James R. Barlow
f4cb424451
Support input/output streams at API level
2020-06-22 02:02:18 -07:00
James R. Barlow
642998ead6
sync: refactor preprocess image filtering
2020-06-15 15:26:41 -07:00
James R. Barlow
698aab4f75
Add a lot of type annotations
2020-06-15 15:20:50 -07:00
James R. Barlow
34231ac667
sync: refactor intermediate image production
2020-06-15 15:02:28 -07:00
James R. Barlow
ddedf7cd2e
For --clean-final, use same image as --clean if possible
2020-06-15 13:48:49 -07:00
James R. Barlow
872bafad4b
Reinstate quick test for text/no text
...
Partial revert of commit 991db17
2020-06-10 12:00:52 -07:00
James R. Barlow
8599400445
Only do page analysis on pages we will do OCR on
2020-06-10 11:33:27 -07:00
James R. Barlow
aa060db5bc
Refactor tesseract_env variable into the plugin
...
Removed all cases except one in api.py, which isn't worth solving because
it should be removed anyway.
This also fixes a logic error in the OMP_THREAD_LIMIT decision: api.py
did not pass kwargs correctly, so they never worked before.
2020-05-26 02:14:06 -07:00
James R. Barlow
a0f9ca3a30
Move Tesseract options validation into plugin
2020-05-25 01:31:46 -07:00
James R. Barlow
9bccff4f88
Move Tesseract specific arguments to plugin
2020-05-16 03:24:31 -07:00
James R. Barlow
8174089c8b
Begin transforming Tesseract into pluggable OCR engine
2020-05-14 03:54:21 -07:00
James R. Barlow
e760622a5c
graft: refactor
2020-05-07 02:03:42 -07:00
James R. Barlow
85cbf94a6e
Convert many uses of str paths to Path
2020-05-06 02:53:47 -07:00
James R. Barlow
6f4286e1b1
New hook: filter_page_image
2020-05-06 02:24:07 -07:00
James R. Barlow
c85278b31d
Delinting
2020-05-03 00:53:29 -07:00
James R. Barlow
5dbc080fa0
Rename PDFContext->PdfContext
2020-05-02 04:32:46 -07:00
James R. Barlow
e02f6c1e97
Support plugin invocation with API
2020-05-02 03:34:31 -07:00
James R. Barlow
be107b4fed
Set up filter_ocr_image hook
2020-05-01 02:56:41 -07:00
James R. Barlow
8d2535e327
Get pluggy to work with forking workers
2020-05-01 02:39:50 -07:00
James R. Barlow
5eb4fe0052
Refactor plugin setup to get_plugin_manager
2020-05-01 02:18:31 -07:00
James R. Barlow
d8ff4485f8
Move samefile to helpers
2020-05-01 02:18:11 -07:00
James R. Barlow
82bce463ae
Start pluggy-based plugin system
2020-05-01 02:15:23 -07:00
James R. Barlow
8f5c95f0f4
Remove last vestiges of command line usage of qpdf - change to check_pdf
2020-04-26 05:33:26 -07:00
James R. Barlow
18c4aa10bf
Adjust number of workers for concurrent page scanning
2020-04-26 04:21:15 -07:00
James R. Barlow
991db17fde
Remove Ghostscript-based text extraction
...
While faster than Python-based methods, we've outgrown the limited
amount of information Ghostscript provides with this feature, and it
repeats an analysis we have to do anyway to learn what images are
present.
2020-04-26 04:02:07 -07:00
James R. Barlow
8c381a0227
Replace task_initargs with use of partial()
2020-04-26 03:49:20 -07:00
James R. Barlow
af3c3c6466
Further refactoring of concurrency concerns
2020-04-26 03:49:20 -07:00
James R. Barlow
db3e75e33e
Refactor multiprocessing pool
2020-04-26 03:49:13 -07:00
James R. Barlow
c2919f2e1c
Reinstate logging of page numbers
2020-04-15 00:05:23 -07:00
James R. Barlow
d146d2b65c
The Great Logging Refactor
...
Remove all instances of logger object being passed as parameters.
This was a holdover from ruffus, and complicated a lot of simple things.
2020-04-14 23:59:33 -07:00
James R. Barlow
2490be8490
Fix debug.log not being deleted on Windows (probably)
...
Fixes #515
2020-03-29 21:53:56 -07:00
James R. Barlow
e4cc9fcba7
Wrong number of threads to use shown when OMP_THREAD_LIMIT is defined
2020-03-23 01:06:55 -07:00
James R. Barlow
bcf77375c0
Fix grammar in output message
2020-01-28 07:33:28 -08:00
James R. Barlow
ce97af5a79
Add OCR quality measurement API
2020-01-17 03:10:27 -08:00
James R. Barlow
123fde174d
Don't use debug.log in pytest
...
pytest does not reset the state of logging if we install a file handler,
which will cause FileNotFoundError after the temporary folder is removed.
Semi-related:
https://github.com/pytest-dev/pytest/issues/5502
2020-01-06 01:46:19 -08:00
James R. Barlow
6f5d77d930
Also generate log file in temp folder on verbose mode
2020-01-05 21:33:32 -08:00
James R. Barlow
e2a563cc76
logging: create a debug log when -k parameter is issued
2020-01-01 16:47:15 -08:00
James R. Barlow
2f1c743227
Rewrite main pool loop
...
pytest-cov documentation recommends using explicit
management of multiprocessing.Pool rather than the context manager.
This is supposed to work better for collecting coverage data, particularly
on Windows.
2019-12-31 16:23:41 -08:00
James R. Barlow
63de7e1677
Improve error message for unreadable input files
2019-12-30 16:14:52 -08:00
James R. Barlow
c5edff2c2f
Sort imports
2019-12-19 15:31:18 -08:00
James R. Barlow
b8b7ecfe7f
Fix DecompressionBomb related errors due to Windows process differences
2019-12-04 21:10:27 -08:00
James R. Barlow
b7f63bc93d
Make devnull check compatible with Windows
2019-12-04 17:04:08 -08:00
James R. Barlow
4d26867dee
Delinting
2019-09-20 17:17:11 -07:00
James R. Barlow
78e8bf9cbf
Use at most 3 Tesseract threads
...
Based on a user suggestion and
tesseract-ocr/tesseract#2611, I reviewed thread limits and found that
a thread limit of 3 is still beneficial, but not 4.
> time env OMP_THREAD_LIMIT=2 tesseract omp4.png stdout >/dev/null
Warning: Invalid resolution 0 dpi. Using 70 instead.
Estimating resolution as 143
116.67user 1.67system 1:26.26elapsed 137%CPU (0avgtext+0avgdata 356752maxresident)k
2213inputs+0outputs (18major+131059minor)pagefaults 0swaps
> time env OMP_THREAD_LIMIT=3 tesseract omp4.png stdout >/dev/null
Warning: Invalid resolution 0 dpi. Using 70 instead.
Estimating resolution as 143
136.89user 1.63system 1:19.56elapsed 174%CPU (0avgtext+0avgdata 356784maxresident)k
821inputs+0outputs (0major+131080minor)pagefaults 0swaps
> time env OMP_THREAD_LIMIT=4 tesseract omp4.png stdout >/dev/null
Warning: Invalid resolution 0 dpi. Using 70 instead.
Estimating resolution as 143
161.31user 1.51system 1:18.80elapsed 206%CPU (0avgtext+0avgdata 356632maxresident)k
8477inputs+0outputs (12major+131074minor)pagefaults 0swaps
> time env OMP_THREAD_LIMIT=8 tesseract omp4.png stdout >/dev/null
Warning: Invalid resolution 0 dpi. Using 70 instead.
Estimating resolution as 143
160.30user 1.62system 1:18.01elapsed 207%CPU (0avgtext+0avgdata 356640maxresident)k
821inputs+0outputs (0major+131078minor)pagefaults 0swaps
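A rough way to read the timings above is to compare the elapsed wall-clock time at each step up in OMP_THREAD_LIMIT. The sketch below copies the four elapsed values from the `time` output; the helper names `to_seconds` and `relative_gains` are made up for illustration and are not part of OCRmyPDF. These are single runs, so the percentages are only indicative.

```python
# Elapsed wall-clock times copied from the `time` output above,
# keyed by the OMP_THREAD_LIMIT value used for that run.
elapsed = {2: "1:26.26", 3: "1:19.56", 4: "1:18.80", 8: "1:18.01"}


def to_seconds(stamp: str) -> float:
    """Convert a 'M:SS.ss' elapsed time string to seconds."""
    minutes, seconds = stamp.split(":")
    return int(minutes) * 60 + float(seconds)


def relative_gains(times: dict) -> dict:
    """Percent reduction in elapsed time between consecutive thread limits."""
    limits = sorted(times)
    gains = {}
    for prev, cur in zip(limits, limits[1:]):
        before = to_seconds(times[prev])
        after = to_seconds(times[cur])
        gains[(prev, cur)] = (before - after) / before * 100
    return gains


for (prev, cur), pct in relative_gains(elapsed).items():
    print(f"{prev} -> {cur} threads: {pct:.1f}% less elapsed time")
```

Going from 2 to 3 threads saves several percent of elapsed time, while each later step saves roughly one percent at the cost of noticeably more user CPU time, which is consistent with capping at 3.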
2019-09-20 17:12:36 -07:00