Commit Graph

32 Commits

Author SHA1 Message Date
James R. Barlow
1aa34f5d2e Make some interfaces accepting of both str-paths and Path objects 2017-07-21 13:28:30 -07:00
James R. Barlow
dd38519f07 Merge branch 'feature/user-words' into develop
# Conflicts:
#	ocrmypdf/exec/tesseract.py
2017-07-20 16:25:20 -07:00
James R. Barlow
3232643809 Support “textonly PDF” renderer in Tesseract 3.05.01 2017-06-13 10:18:08 -07:00
James R. Barlow
1d57bcc99e Fix Ghostscript rasterizing of UserUnit pages and related sizing issues 2017-05-29 12:14:10 -07:00
James R. Barlow
facdd13879 Ghostscript: refactor image output resizing 2017-05-29 11:42:27 -07:00
James R. Barlow
6e891f91d3 ghostscript, qpdf: Restore API backward compatibility 2017-05-29 11:13:06 -07:00
James R. Barlow
9b50ede977 Partially solve ghostscript rasterize_pdf producing wrong file size
Kludge. Assumes JPEG for now. Messy.
2017-05-25 01:17:43 -07:00
James R. Barlow
6ff6c8614f --output-type=pdf now outputs /UserUnit PDFs at the correct size
This currently distorts the output size because Tesseract assumes it
 knows the DPI better than we do.

Does not work for Ghostscript, because it emerges that Ghostscript
honors /UserUnit for rasterizing but not in pdfwrite (resolve/wontfix).

https://bugs.ghostscript.com/show_bug.cgi?id=690781

Ghostscript’s output would need to be patched in a PDF/A safe way for
this to work. Temporary route may be to block Ghostscript if
/UserUnit.
2017-05-24 23:26:07 -07:00
James R. Barlow
131a5b741d tesseract.py: update canned HOCR template to tess 3.05 output
Seems better to not claim the existence of several entities that don’t
exist as the older one does
2017-05-14 23:40:09 -07:00
James R. Barlow
65b89687a9 ghostscript: fix missing “import sys”, only applicable for an exception 2017-05-14 23:38:52 -07:00
James R. Barlow
234183ecd2 Fix: Tesseract 3.04 is sensitive to order of configuration commands
“txt hocr” is not acceptable and does not produce expected output .txt
while “hocr txt” works fine, so switch the order everywhere.

Should fix #169
2017-05-14 23:27:46 -07:00
James R. Barlow
8f91acf956 Remove Tesseract 3.02 and 3.03 compatibility shims 2017-05-11 23:50:52 -07:00
James R. Barlow
0dae1602c7 Fix missing import PIPE 2017-05-11 23:07:20 -07:00
James R. Barlow
96045e98f4 Update develop with master changes
We’re well out of the “trivial updates” zone
2017-05-11 22:54:27 -07:00
James R. Barlow
c8a4cbcf17 Fix test suite breakage after sidecar feature added
Forgot to update tesseract spoofers to account for change in tesseract
parameters.  Also the change to outputting multiple files in the collate
steps affected how ruffus passes information into downstream consumers
of those files.
2017-05-11 00:17:24 -07:00
James R. Barlow
183eafa587 Implement sidecar text files (#126) 2017-05-10 15:22:44 -07:00
James R. Barlow
37ebcadfa1 Implement --user-words, --user-patterns 2017-05-09 17:54:56 -07:00
James R. Barlow
01a1c2b576 Implement --pdfa-image-compression to control Ghostscript’s compression
Fixes #163
2017-05-09 16:37:29 -07:00
James R. Barlow
93e802f473 Fix issue #163, color and grayscale images JPEG compressed when not needed 2017-05-06 22:27:25 -07:00
James R. Barlow
b9b12e2879 Ensure that ocrmypdf stops and reports an error if Ghostscript fails
Past behavior was to continue and let ruffus puke eventually
2017-05-01 15:44:21 -07:00
James R. Barlow
1e7fbd4202 Fix issues with --pdf-renderer tess4 page skipping
If tess4 renderer needed to skip OCR on a page it would end up
duplicating the page contents onto the new page, rather than creating
a blank OCR layer and placing it on the output page. This created
duplicated content in output files.
2017-03-29 23:43:26 -07:00
James R. Barlow
059f79242e Phase out subprocess.Popen 2017-03-29 18:15:02 -07:00
James R. Barlow
2954e72652 Some examples of Ghostscript and Tesseract warnings/errors were not tagged properly 2017-03-28 10:59:53 -07:00
James R. Barlow
199de96cff Ghostscript 9.21 seems to have a regression related to Unicode metadata 2017-03-24 15:15:46 -07:00
James R. Barlow
8ddbe81513 Fix issue #147: unpaper loses DPI information, affects --pdf-renderer tess4 2017-03-24 13:23:03 -07:00
James R. Barlow
8c17c9918e Add documentation and test cases for —tesseract-config
This parameter has existed for a long time but never really got any
attention.
2017-01-28 22:06:51 -08:00
James R. Barlow
02fba02d31 Refactor test suite to use fixtures to manage paths 2017-01-26 16:38:59 -08:00
James R. Barlow
99e47c9c04 tesseract: add support for using v4 textonly_pdf feature 2017-01-20 17:06:23 -08:00
James R. Barlow
d4c72b371f Forward --oem argument to tesseract 4 2017-01-18 21:37:50 -08:00
James R. Barlow
c42d9baa26 tesseract: for v4, use --psm while keeping -psm for v3
At the moment v4 accepts both, but who knows if this will get dropped,
so do as documented for each version.
2017-01-18 17:43:47 -08:00
James R. Barlow
6e27ecd2b9 Finalize ‘exec’ migration and make it backward compatible for now 2017-01-18 17:40:50 -08:00
James R. Barlow
b8767e5ba9 Rename exe -> exec, more Unix-y and suggestive 2016-12-10 15:34:00 -08:00
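
As an illustration of the version-dependent flag handling described in commit c42d9baa26 (v4 uses --psm while v3 keeps -psm), here is a minimal sketch. The function name `psm_args` and its parameters are hypothetical and are not taken from the ocrmypdf source; the only behavior assumed is what the commit message states.

```python
from typing import List


def psm_args(tesseract_major_version: int, psm: int) -> List[str]:
    """Build the page segmentation mode arguments for a Tesseract command line.

    Hypothetical helper: follows the rule stated in the commit message,
    i.e. "--psm" for Tesseract 4 and later, "-psm" for Tesseract 3.
    """
    flag = "--psm" if tesseract_major_version >= 4 else "-psm"
    return [flag, str(psm)]


if __name__ == "__main__":
    # Show the arguments chosen for each major version.
    print(psm_args(3, 6))
    print(psm_args(4, 6))
```

Keeping the choice in one small function means the rest of the command-line construction does not need to know which Tesseract version is installed.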