James R. Barlow
1aa34f5d2e
Make some interfaces accepting of both str-paths and Path objects
2017-07-21 13:28:30 -07:00
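Note on the commit above: accepting both str paths and Path objects usually comes down to normalizing the argument once at the interface boundary. The following is a minimal sketch of that idea, not the code from this commit; the helper name `as_path_str` is hypothetical, and it assumes Python 3.6+ for `os.fspath`.

```python
import os
from pathlib import Path


def as_path_str(path):
    """Return a plain str for either a str path or a pathlib.Path.

    Hypothetical helper for illustration only; os.fspath() raises
    TypeError for objects that are not path-like.
    """
    return os.fspath(path)


# Both call styles yield the same string, ready to pass to a subprocess.
print(as_path_str("input.pdf"))
print(as_path_str(Path("input.pdf")))
```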
James R. Barlow
dd38519f07
Merge branch 'feature/user-words' into develop
...
# Conflicts:
# ocrmypdf/exec/tesseract.py
2017-07-20 16:25:20 -07:00
James R. Barlow
3232643809
Support “textonly PDF” renderer in Tesseract 3.05.01
2017-06-13 10:18:08 -07:00
James R. Barlow
1d57bcc99e
Fix Ghostscript rasterizing of UserUnit pages and related sizing issues
2017-05-29 12:14:10 -07:00
James R. Barlow
facdd13879
Ghostscript: refactor image output resizing
2017-05-29 11:42:27 -07:00
James R. Barlow
6e891f91d3
ghostscript, qpdf: Restore API backward compatibility
2017-05-29 11:13:06 -07:00
James R. Barlow
9b50ede977
Partially solve ghostscript rasterize_pdf producing wrong file size
...
Kludge. Assumes JPEG for now. Messy.
2017-05-25 01:17:43 -07:00
James R. Barlow
6ff6c8614f
--output-type=pdf now outputs /UserUnit PDFs at the correct size
...
This currently distorts the output size because Tesseract assumes it
knows the DPI better than we do.
Does not work for Ghostscript, because it emerges that Ghostscript
honors /UserUnit for rasterizing but not in pdfwrite (resolve/wontfix).
https://bugs.ghostscript.com/show_bug.cgi?id=690781
Ghostscript’s output would need to be patched in a PDF/A safe way for
this to work. Temporary route may be to block Ghostscript if
/UserUnit.
2017-05-24 23:26:07 -07:00
James R. Barlow
131a5b741d
tesseract.py: update canned HOCR template to tess 3.05 output
...
Seems better to not claim the existence of several entities that don’t
exist as the older one does
2017-05-14 23:40:09 -07:00
James R. Barlow
65b89687a9
ghostscript: fix missing “import sys”, only applicable for an exception
2017-05-14 23:38:52 -07:00
James R. Barlow
234183ecd2
Fix: Tesseract 3.04 is sensitive to order of configuration commands
...
“txt hocr” is not acceptable and does not produce the expected output .txt,
while “hocr txt” works fine, so switch the order everywhere.
Should fix #169
2017-05-14 23:27:46 -07:00
James R. Barlow
8f91acf956
Remove Tesseract 3.02 and 3.03 compatibility shims
2017-05-11 23:50:52 -07:00
James R. Barlow
0dae1602c7
Fix missing import PIPE
2017-05-11 23:07:20 -07:00
James R. Barlow
96045e98f4
Update develop with master changes
...
We’re well out of the “trivial updates” zone
2017-05-11 22:54:27 -07:00
James R. Barlow
c8a4cbcf17
Fix test suite breakage after sidecar feature added
...
Forgot to update tesseract spoofers to account for change in tesseract
parameters. Also the change to outputting multiple files in the collate
steps affected how ruffus passes information into downstream consumers
of those files.
2017-05-11 00:17:24 -07:00
James R. Barlow
183eafa587
Implement sidecar text files (#126)
2017-05-10 15:22:44 -07:00
James R. Barlow
37ebcadfa1
Implement --user-words, --user-patterns
2017-05-09 17:54:56 -07:00
James R. Barlow
01a1c2b576
Implement --pdfa-image-compression to control Ghostscript’s compression
...
Fixes #163
2017-05-09 16:37:29 -07:00
James R. Barlow
93e802f473
Fix issue #163: color and grayscale images JPEG compressed when not needed
2017-05-06 22:27:25 -07:00
James R. Barlow
b9b12e2879
Ensure that ocrmypdf stops and reports an error if Ghostscript fails
...
Past behavior was to continue and let ruffus puke eventually
2017-05-01 15:44:21 -07:00
James R. Barlow
1e7fbd4202
Fix issues with --pdf-renderer tess4 page skipping
...
If tess4 renderer needed to skip OCR on a page it would end up
duplicating the page contents onto the new page, rather than creating
a blank OCR layer and placing it on the output page. This created
duplicated content in output files.
2017-03-29 23:43:26 -07:00
James R. Barlow
059f79242e
Phase out subprocess.Popen
2017-03-29 18:15:02 -07:00
James R. Barlow
2954e72652
Some examples of Ghostscript and Tesseract warnings/errors were not tagged properly
2017-03-28 10:59:53 -07:00
James R. Barlow
199de96cff
Ghostscript 9.21 seems to have a regression related to Unicode metadata
2017-03-24 15:15:46 -07:00
James R. Barlow
8ddbe81513
Fix issue #147: unpaper loses DPI information, affects --pdf-renderer tess4
2017-03-24 13:23:03 -07:00
James R. Barlow
8c17c9918e
Add documentation and test cases for —tesseract-config
...
This parameter has existed for a long time but never really got any
attention.
2017-01-28 22:06:51 -08:00
James R. Barlow
02fba02d31
Refactor test suite to use fixtures to manage paths
2017-01-26 16:38:59 -08:00
James R. Barlow
99e47c9c04
tesseract: add support for using v4 textonly_pdf feature
2017-01-20 17:06:23 -08:00
James R. Barlow
d4c72b371f
Forward --oem argument to tesseract 4
2017-01-18 21:37:50 -08:00
James R. Barlow
c42d9baa26
tesseract: for v4, use --psm while keeping -psm for v3
...
At the moment v4 accepts both, but who knows if this will get dropped,
so do as documented for each version.
2017-01-18 17:43:47 -08:00
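Note on the commit above: the version-dependent flag can be sketched as below. This is an illustration under assumptions, not the code from this commit; the function name `psm_args` and the tuple form of the version are hypothetical. It only encodes what the message states, namely `--psm` for v4 and `-psm` for v3.

```python
def psm_args(tesseract_version, page_seg_mode):
    """Build the page segmentation arguments for a Tesseract command line.

    tesseract_version is assumed to be a tuple such as (3, 5, 1) or
    (4, 0, 0); how the version is obtained is outside this sketch.
    """
    # Tuple comparison: any version with major number 4 or higher is >= (4,)
    flag = "--psm" if tesseract_version >= (4,) else "-psm"
    return [flag, str(page_seg_mode)]


print(psm_args((3, 5, 1), 3))
print(psm_args((4, 0, 0), 3))
```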
James R. Barlow
6e27ecd2b9
Finalize ‘exec’ migration and make it backward compatible for now
2017-01-18 17:40:50 -08:00
James R. Barlow
b8767e5ba9
Rename exe -> exec, more Unix-y and suggestive
2016-12-10 15:34:00 -08:00