1140 Commits

Author SHA1 Message Date
James R. Barlow
c4831ac00c v5.3 release notes v5.3 2017-07-27 00:11:12 -07:00
James R. Barlow
93a954ef9f Fix missing import for Py3.5 2017-07-26 23:40:01 -07:00
James R. Barlow
f7ce8f44e9 Weaken the --user-words test so it will pass on Travis 2017-07-26 21:03:51 -07:00
James R. Barlow
0b012697e5 Whitelist the Latin-1 languages that work with HOCR
Omitted French because the rare 'oe' and 'ÿ' glyphs are not in Latin-1.
Basically steer people away from HOCR renderer but avoid a potential
disruptive behavior change.
2017-07-26 21:03:18 -07:00
James R. Barlow
58e357c992 Report location of attempted output_file that fails to write 2017-07-22 17:49:56 -07:00
James R. Barlow
71fbad83ad Fix py3.5 test 2017-07-21 17:01:06 -07:00
James R. Barlow
52483072dc Add a differential test that checks tesseract uses supplied word list 2017-07-21 16:40:20 -07:00
James R. Barlow
7f0b8621f3 Tests: accept rich path objects without having to str() everything 2017-07-21 16:39:22 -07:00
James R. Barlow
cd8db60b06 Crash test all renderers, not just two 2017-07-21 14:10:02 -07:00
James R. Barlow
1aa34f5d2e Make some interfaces accepting of both str-paths and Path objects 2017-07-21 13:28:30 -07:00
James R. Barlow
dfa1d88ce9 Fix missing user_words/user_patterns from textonly_pdf case 2017-07-20 17:14:04 -07:00
James R. Barlow
dd38519f07 Merge branch 'feature/user-words' into develop
# Conflicts:
#	ocrmypdf/exec/tesseract.py
2017-07-20 16:25:20 -07:00
James R. Barlow
098f5d4f0b docs: remove deprecated example of pdftotext 2017-07-20 16:20:17 -07:00
James R. Barlow
ffc685d536 docs: envvar markup 2017-07-20 16:19:57 -07:00
James R. Barlow
cd1a99a0de Refactor int(os.path.basename(s)[0:6]) -> page_number(s) 2017-06-26 13:29:40 -07:00
James R. Barlow
48e3b267fc Accept PDFs with whitespace ahead of %PDF marker
Noticed in @aagahi's fork
2017-06-26 13:17:47 -07:00
James R. Barlow
3a7c3417bb Don’t check tags and branch at the same time as Travis doesn’t get this
Travis is weird
v5.2
2017-06-13 13:14:34 -07:00
James R. Barlow
d792ef7222 Give the ‘auto’ renderer setting more test covfefe 2017-06-13 13:13:58 -07:00
James R. Barlow
2c24f67deb Rename “tess4” renderer to “sandwich” and make it default in Tess 3.05.01
Tesseract 3.05.01 backported the textonly_pdf=1 which allows the use
of this superior PDF renderer prior to 4.00 alpha. This means that
the tess4 name is no longer accurate, so call it a sandwich because of
its merge-preserve characteristic. Preserve the tess4 name. Fix the
documentation and tests to reflect this.

Make it the default, because it’s better. It does not have the issues
the “tesseract” renderer does prior to Tess 3.05.00 with rendering
PDFs that Ghostscript corrupts, and it produces better output without
re-rastering.

Deprecate some old stuff to avoid the test suite growing obscenely
large.
2017-06-13 13:09:12 -07:00
James R. Barlow
9e75e28d0c Homebrew needs x11 to compile Pillow 2017-06-13 11:03:26 -07:00
James R. Barlow
3232643809 Support “textonly PDF” renderer in Tesseract 3.05.01 2017-06-13 10:18:08 -07:00
James R. Barlow
f7ee9e90ce Document what is meant by the ocrmypdf “API” 2017-06-13 10:15:11 -07:00
James R. Barlow
47298be132 Remove Python <3.5 test 2017-06-13 10:14:28 -07:00
James R. Barlow
a88fa83515 Travis: fix deploy conditions for homebrew autobrew 2017-05-31 02:29:32 -07:00
James R. Barlow
12bfe20385 v5.1 release notes v5.1 2017-05-29 14:36:50 -07:00
James R. Barlow
3d2f6f0772 Fix tess4 test using old-style pageinfo API 2017-05-29 13:51:21 -07:00
James R. Barlow
1cb607f64b Merge UserUnit 2017-05-29 13:22:55 -07:00
James R. Barlow
d3c54fbbde For --rotate-pages, rasterize preview at half DPI instead of 200 DPI
Ensures that time is not wasted on previews at higher resolution than
the input as was sometimes the case
2017-05-29 13:01:18 -07:00
James R. Barlow
28341b755f Refactor common test fixtures 2017-05-29 12:47:55 -07:00
James R. Barlow
4b5cd420e1 Add new test file 2017-05-29 12:16:08 -07:00
James R. Barlow
1d57bcc99e Fix Ghostscript rasterizing of UserUnit pages and related sizing issues 2017-05-29 12:14:10 -07:00
James R. Barlow
facdd13879 Ghostscript: refactor image output resizing 2017-05-29 11:42:27 -07:00
James R. Barlow
6e891f91d3 ghostscript, qpdf: Restore API backward compatibility 2017-05-29 11:13:06 -07:00
James R. Barlow
9b50ede977 Partially solve ghostscript rasterize_pdf producing wrong file size
Kludge. Assumes JPEG for now. Messy.
2017-05-25 01:17:43 -07:00
James R. Barlow
82cf010333 Error out if trying to produce PDF/A >200” due to Ghostscript limitation 2017-05-25 00:07:29 -07:00
James R. Barlow
6ff6c8614f --output-type=pdf now outputs /UserUnit PDFs at the correct size
This currently distorts the output size because Tesseract assumes it
knows the DPI better than we do.

Does not work for Ghostscript, because it emerges that Ghostscript
honors /UserUnit for rasterizing but not in pdfwrite (resolved/wontfix).

https://bugs.ghostscript.com/show_bug.cgi?id=690781

Ghostscript’s output would need to be patched in a PDF/A safe way for
this to work. Temporary route may be to block Ghostscript if
/UserUnit.
2017-05-24 23:26:07 -07:00
James R. Barlow
eb1cd38f6c Add an open helper that is compatible with pathlib 2017-05-24 16:19:15 -07:00
James R. Barlow
148b632b4f Prove multiprocessing works, although it is still racy in some places 2017-05-23 16:32:13 -07:00
James R. Barlow
591e213713 Add more dependencies for autobrew 2017-05-23 13:52:28 -07:00
James R. Barlow
75f2262659 Ensure JobContext stuff is actually tested for IPC consistency 2017-05-19 17:57:07 -07:00
James R. Barlow
d9005a1074 pdfinfo: replace most remaining dict-style access 2017-05-19 16:17:36 -07:00
James R. Barlow
3e73fa81bf pageinfo: deprecation warning 2017-05-19 16:17:07 -07:00
James R. Barlow
ba6e290231 Restore old pageinfo.py to avoid breaking compatibility 2017-05-19 15:49:23 -07:00
James R. Barlow
08e47117a3 Rename pageinfo to pdfinfo 2017-05-19 15:48:23 -07:00
James R. Barlow
532ef38157 /UserUnit is a scalar, not an array 2017-05-19 14:19:50 -07:00
James R. Barlow
4c09875890 docs: upload unpaper Dropbox link, .rst typo blocking macOS install
[ci skip]
2017-05-19 12:18:09 -07:00
James R. Barlow
0e98139712 Upload to upload.pypi.org/legacy as recommended by PyPA
https://github.com/pypa/warehouse/issues/1996#issuecomment-302784126
2017-05-19 12:06:24 -07:00
James R. Barlow
4c04d802d7 Introduce /UserUnit checking 2017-05-19 12:01:19 -07:00
James R. Barlow
b3dc404571 Update unpaper.deb link (fixes #171)
*Shakes fist at Dropbox*
2017-05-19 11:28:45 -07:00
James R. Barlow
8694f8d2eb Replace magic strings colorspace and encoding with Enums 2017-05-18 22:32:27 -07:00
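
Note on cd1a99a0de: the message names the expression being replaced, int(os.path.basename(s)[0:6]), and the helper that replaces it, page_number(s). The sketch below is one way such a helper might look, not the project's actual code. The str() wrapper (so Path objects are accepted too) and the example file name are assumptions made for illustration.

```python
import os
from pathlib import Path


def page_number(input_file):
    """Return the page number encoded in the first six characters
    of the file's base name, e.g. '000012.page.png' -> 12."""
    # str() is an assumption: it lets the helper accept Path objects
    # as well as plain string paths.
    return int(os.path.basename(str(input_file))[0:6])


# Hypothetical working-file names, used only to exercise the helper.
print(page_number("/tmp/work/000012.page.png"))
print(page_number(Path("/tmp/work/000003.ocr.pdf")))
```

Keeping the slice in one named function means callers no longer repeat the six-character convention at every call site.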