490 Commits

Author SHA1 Message Date
James R. Barlow
79b3472b26 All tests passed, bump version v3.1 2015-12-04 04:31:01 -08:00
James R. Barlow
f1b2f1ae08 Merge branch 'feature/pdfa-2' into develop 2015-12-04 04:04:08 -08:00
James R. Barlow
ee7d97ae8c Trivial 2015-12-04 04:03:38 -08:00
James R. Barlow
7d9f473bb1 Remove eval() call by introspecting ExitCode 2015-12-04 03:34:53 -08:00
James R. Barlow
e77a5e5e75 We don't want threads. Really. Do. Not. Want. 2015-12-04 03:11:38 -08:00
James R. Barlow
6ab19af122 Comments 2015-12-04 03:09:39 -08:00
James R. Barlow
276fe49867 Better error messages for input file not found or invalid
Not as good as finding a general way to deal with ruffus exceptions, but
better than nil.
2015-12-04 03:07:53 -08:00
James R. Barlow
acb31abe86 Fix issue #20 - fails on uppercase .PDF 2015-12-04 02:14:09 -08:00
James R. Barlow
4f964a3c8a Introduce --pdf-renderer auto
Tess 3.03 has various quality problems like wrong DPI that are fixed
in Tess 3.04. Idea here is to introduce an option to let OCRmyPDF
select the rendering backend based on the options and system.

However, we're not ready for tesseract as the main renderer.
Setting pdf-renderer to tesseract does not pass all test cases, mainly
the one where --tesseract-timeout is triggered, and some others.
2015-12-02 23:20:31 -08:00
James R. Barlow
df1fda7438 pageinfo: workaround PyPDF extractText limitations on hidden text
It appears that extractText() does not find all text. At a glance it
may be that Tesseract's PDF renderer generates a font and uses glyphs
that map to different Unicode code points than PyPDF expects, so it
discards the content and finds nothing. As a proxy in lieu of better
PDF parsing, assume that a "GlyphLessFont" means there is text there.

I had previously found it does not work to check for the presence of a
font on page. Some PDF generators create a font resource entry even if
the font is never called for.
2015-12-02 23:16:36 -08:00
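The proxy described in df1fda7438 could be sketched roughly as below. This is not the code from that commit: the function name `page_has_hidden_text` and its plain-Python inputs (a list of font names from the page's resources and the string a text extractor returned) are assumptions made for illustration only.

```python
def page_has_hidden_text(font_names, extracted_text=""):
    """Hypothetical helper: guess whether a page already carries text.

    font_names: font names listed for the page (assumed input shape)
    extracted_text: whatever a text extractor returned for the page
    """
    if extracted_text.strip():
        # The extractor found real text, so no proxy is needed.
        return True
    # Fall back to the proxy from the commit message: a font named
    # "GlyphLessFont" is taken as a sign that a text layer is present.
    return any("GlyphLessFont" in name for name in font_names)
```

Matching on the specific font name, rather than on the presence of any font, follows the note above that some PDF generators create font resource entries that are never used.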
James R. Barlow
d6124c1787 pageinfo: improve robustness of text test for Tesseract produced PDFs 2015-12-02 03:12:52 -08:00
James R. Barlow
80d89b5420 Set /Creator metadata to OCRmyPDF
with reference to Tess version and settings
2015-12-02 02:19:39 -08:00
James R. Barlow
74059eecf1 Choose PDF/A-2b by default instead of A-1b 2015-12-02 01:48:10 -08:00
James R. Barlow
78697341a2 pytest: don't run tests that happened to be part of pyvenv 2015-12-02 01:19:43 -08:00
James R. Barlow
cfb56dd8ff Merge commit 'b1769cbe18e6380ddfe96b3b22e6d02cb603338b' into develop 2015-12-01 00:40:43 -08:00
jbarlow83
b1769cbe18 README: El Capitan supported now, Py3.5 supported 2015-11-26 16:31:33 -08:00
James R. Barlow
955b801e7f Merge branch 'master' into develop 2015-09-14 00:34:21 -07:00
James R. Barlow
3cea3f1afe Try to work around git binary file bug again 2015-09-14 00:34:16 -07:00
James R. Barlow
fd4a227ccb Force this file to stop thinking it was modified 2015-09-13 17:53:01 -07:00
James R. Barlow
19c3097483 Update notes 2015-09-13 17:51:18 -07:00
James R. Barlow
cdd1a6d03c Suppress failing test 2015-09-10 07:01:14 -07:00
James R. Barlow
5fb8411571 Try new PPA for libav 2015-09-10 06:01:59 -07:00
James R. Barlow
334a15b8c7 typo fix 2015-09-10 05:01:44 -07:00
James R. Barlow
6390736577 ffmpeg-dev instead? 2015-09-10 04:27:57 -07:00
James R. Barlow
d55a214516 Autoreconf? 2015-09-10 04:10:12 -07:00
James R. Barlow
0994164b9a travis: apt-get install in wrong place 2015-09-06 01:43:47 -07:00
James R. Barlow
54ee0dd147 travis: fix typo 2015-09-06 01:39:54 -07:00
James R. Barlow
47c7990fb3 travis: build unpaper with cache 2015-09-06 01:38:01 -07:00
James R. Barlow
997e95de4d travis: build unpaper 2015-09-06 01:29:07 -07:00
James R. Barlow
44204be256 Fix order of PPAs 2015-09-06 00:54:50 -07:00
James R. Barlow
9b1d9aa88a travis: improve, add new PPA, etc. 2015-09-06 00:41:23 -07:00
James R. Barlow
b775762f6a travis: doesn't like gcc-4.8, try just gcc 2015-09-06 00:23:05 -07:00
James R. Barlow
df1a28e319 Travis needs sudo mode 2015-09-06 00:21:20 -07:00
James R. Barlow
c300b2802a travis: tabs -> spaces 2015-09-06 00:08:25 -07:00
James R. Barlow
01040ace4c More complete travis.yml 2015-09-06 00:02:58 -07:00
James R. Barlow
8367172e0b Start setting up Travis CI 2015-09-05 23:44:43 -07:00
James R. Barlow
09afd8d25d Move to my repo: github.com/fritz-hh => jbarlow83
I made several efforts to contact fritz but he is no longer
communicating, and to set up Github integrations with Docker and Travis
CI I need admin access. Which I don't have. So I'm moving it to my own
and aiming the old one at me.
v3.0
2015-09-05 01:14:54 -07:00
James R. Barlow
7ed60429b3 Test case: No longer using JHOVE
So JHOVE will not claim this is an invalid PDF and we should see it
reported as valid.
2015-09-05 01:12:33 -07:00
James R. Barlow
281eafada0 bump to v3.0 and move repos 2015-09-05 00:53:14 -07:00
James R. Barlow
c14e10128a Bump version to -rc9 v3.0-rc9 2015-08-29 16:43:22 -07:00
James R. Barlow
3270635192 ghostscript: quiet startup on rasterize 2015-08-28 04:51:36 -07:00
James R. Barlow
3d26257710 Add test cases for additional image formats 2015-08-28 04:51:11 -07:00
James R. Barlow
c4f134d694 Prevent running validation on missing file after an exception is thrown 2015-08-28 04:48:29 -07:00
James R. Barlow
83f9dfbac4 Use png256 raster device when possible
Someone reported a bug where the .png input to unpaper ended up being
type 'P' (palette) for some reason, which was not supported in unpaper.

Not sure how it happened, but it seemed easier to fix by explicitly
supporting it. Here we use png256 if it would capture all colors in the
input file. It's up to tesseract/reportlab to make use of the palette
PNG when rendering.
2015-08-28 04:47:57 -07:00
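The device choice described in 83f9dfbac4 might look something like the sketch below. It is an illustration under assumptions, not the commit's code: the helper name `choose_png_device` and the idea of passing in a precomputed count of distinct colors are hypothetical. `png256` and `png16m` are Ghostscript's 8-bit palette and 24-bit RGB PNG devices.

```python
def choose_png_device(distinct_colors):
    """Hypothetical helper: pick a Ghostscript PNG raster device name."""
    # An 8-bit palette holds at most 256 entries, so png256 only
    # captures every color when the input uses 256 or fewer.
    if distinct_colors <= 256:
        return "png256"
    # Otherwise fall back to the 24-bit RGB device.
    return "png16m"
```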
James R. Barlow
3a445ad5f7 unpaper: support paletted files by conversion instead of bailing 2015-08-28 04:44:26 -07:00
James R. Barlow
c6d106ec33 Throw exception if iccprofiles not found instead of returning None
So far iccprofiles were only missing for a user who had a custom and
possibly broken ghostscript installation.
2015-08-28 03:59:35 -07:00
James R. Barlow
2ce6834be4 Bump to -rc8 v3.0-rc8 2015-08-24 01:25:01 -07:00
James R. Barlow
b376672dbc Bug fix: exception thrown if input PDF was missing DocumentInfo block 2015-08-24 01:23:30 -07:00
James R. Barlow
d07db8547f Merge branch 'master' of https://github.com/fritz-hh/OCRmyPDF v3.0-rc7 2015-08-23 12:30:46 -07:00
James R. Barlow
aab08bfcc7 Fix requirements.txt problem 2015-08-23 12:30:40 -07:00