454 Commits

Author SHA1 Message Date
James R. Barlow
09afd8d25d Move to my repo: github.com/fritz-hh => jbarlow83
I made several efforts to contact fritz but he is no longer
communicating, and to set up Github integrations with Docker and Travis
CI I need admin access. Which I don't have. So I'm moving it to my own
and aiming the old one at me.
v3.0
2015-09-05 01:14:54 -07:00
James R. Barlow
7ed60429b3 Test case: No longer using JHOVE
So JHOVE will not claim this is an invalid PDF and we should see it
reported as valid.
2015-09-05 01:12:33 -07:00
James R. Barlow
281eafada0 bump to v3.0 and move repos 2015-09-05 00:53:14 -07:00
James R. Barlow
c14e10128a Bump version to -rc9 v3.0-rc9 2015-08-29 16:43:22 -07:00
James R. Barlow
3270635192 ghostscript: quiet startup on rasterize 2015-08-28 04:51:36 -07:00
James R. Barlow
3d26257710 Add test cases for additional image formats 2015-08-28 04:51:11 -07:00
James R. Barlow
c4f134d694 Prevent running validation on missing file after an exception is thrown 2015-08-28 04:48:29 -07:00
James R. Barlow
83f9dfbac4 Use png256 raster device when possible
Someone reported a bug where the .png input to unpaper ended up being
type 'P' (palette) for some reason, which was not supported in unpaper.

Not sure how it happened, but it seemed easier to fix by explicitly
supporting it. Here we use png256 if it would capture all colors in the
input file. It's up to tesseract/reportlab to make use of the palette
PNG when rendering.
2015-08-28 04:47:57 -07:00
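The rule in 83f9dfbac4 ("use png256 if it would capture all colors") can be sketched roughly as below. This is an illustration only, not the project's code: choose_png_device and its pixel-list input are hypothetical, and the commit does not say how colors are actually counted. png256 and png16m are real Ghostscript output devices.

```python
def choose_png_device(pixels):
    """Return a Ghostscript PNG device name for the given pixels.

    `pixels` is an iterable of (r, g, b) tuples; this input format is an
    assumption made for the sketch.
    """
    colors = set()
    for rgb in pixels:
        colors.add(rgb)
        if len(colors) > 256:
            # A 256-entry palette cannot hold every color: use 24-bit RGB.
            return "png16m"
    # Every distinct color fits into an 8-bit palette.
    return "png256"
```

Stopping as soon as the 257th distinct color appears keeps the scan cheap for full-color pages.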
James R. Barlow
3a445ad5f7 unpaper: support paletted files by conversion instead of bailing 2015-08-28 04:44:26 -07:00
James R. Barlow
c6d106ec33 Throw exception if iccprofiles not found instead of returning None
So far iccprofiles were only missing for a user who had a custom and
possibly broken ghostscript installation.
2015-08-28 03:59:35 -07:00
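The change in c6d106ec33 trades a None return for an exception, so a broken installation fails at the lookup rather than later. A minimal sketch of that pattern follows; find_iccprofiles and the idea of passing candidate directories are hypothetical and not taken from the project.

```python
import os


def find_iccprofiles(candidate_dirs):
    """Return the first existing directory from candidate_dirs.

    Raises FileNotFoundError instead of returning None, so callers
    cannot silently continue with a missing path.
    """
    for directory in candidate_dirs:
        if os.path.isdir(directory):
            return directory
    raise FileNotFoundError(
        "iccprofiles not found in: " + ", ".join(candidate_dirs)
    )
```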
James R. Barlow
2ce6834be4 Bump to -rc8 v3.0-rc8 2015-08-24 01:25:01 -07:00
James R. Barlow
b376672dbc Bug fix: exception thrown if input PDF was missing DocumentInfo block 2015-08-24 01:23:30 -07:00
James R. Barlow
d07db8547f Merge branch 'master' of https://github.com/fritz-hh/OCRmyPDF v3.0-rc7 2015-08-23 12:30:46 -07:00
James R. Barlow
aab08bfcc7 Fix requirements.txt problem 2015-08-23 12:30:40 -07:00
jbarlow83
e0a25494ee Explain the need for multi core, etc 2015-08-22 13:34:42 -07:00
James R. Barlow
fd876d5e4e Merge branch 'develop' v3.0-rc6 2015-08-22 01:51:44 -07:00
James R. Barlow
ee7f008ff5 Require unpaper 6.1; no messing around with broken versions 2015-08-22 01:51:08 -07:00
jbarlow83
d9161a6ddb Update README: docker run instructions 2015-08-22 01:50:13 -07:00
jbarlow83
f8d66768e3 Update README with docker install instructions 2015-08-22 01:33:12 -07:00
James R. Barlow
4f3673d14d Update notes for -rc6 2015-08-22 00:40:07 -07:00
James R. Barlow
1712fdb74a Merge branch 'feature/docker-debian' 2015-08-22 00:32:27 -07:00
James R. Barlow
3a5ffc79e0 Stock debian unpaper is no good; replace with 6.1 built from source
debian and ubuntu both install unpaper 0.4.2 or so. No .deb packages
available at higher version numbers although ArchLinux had something.
Considered making a separate image to handle building and installing but
decided that was a premature optimization at this point, so just build
the unpaper that works. All tests pass.
2015-08-22 00:30:39 -07:00
James R. Barlow
859b063444 Fixup other docker test suite errors
Outstanding failures:
test_pageinfo::test_jpeg
tests involving unpaper due to version <6.1 failures
2015-08-20 02:37:03 -07:00
James R. Barlow
bd61e7c644 dockerignore *.pyc
https://github.com/docker/docker/issues/13113
Docker kinda sucks. No recursive exclusion.
2015-08-20 02:27:07 -07:00
James R. Barlow
c9abf282b5 Set docker locale to utf-8
Shocked, shocked, that there's a Linux distribution out there that isn't
doing the right thing and setting up utf-8 by default. (Many tests failed)
2015-08-20 01:44:30 -07:00
James R. Barlow
9dad40b5a3 Major overhaul of the Dockerfile
Switched from Ubuntu to debian:stretch because stretch has more recent
versions of our binary packages and starts smaller.  In particular,
stretch has both pillow==2.9.0 and reportlab==3.2.0 available as system
packages which saves the considerable hassle of installing a toolchain.

Instead, a pyvenv is set up with access to system's site-packages (note:
needs two steps), making the binary-dependent packages available.  Then
the remaining packages are installed into the pyvenv with --no-cache-dir
to avoid saving files. And there we are.

Image is still very large (>500 MB), but programs like reportlab require
font rendering capabilities so they pull in large portions of the Linux
graphics stack. Not much will shrink that.
2015-08-20 01:25:31 -07:00
James R. Barlow
8e2d690cb0 Rework Dockerfile, setup.py to work with wheels for better cache use 2015-08-19 13:43:32 -07:00
James R. Barlow
c132e091e1 Dockerfile: use local copy of application 2015-08-19 13:10:58 -07:00
James R. Barlow
630e6cbf1e pip chokes on Unicode filenames? 2015-08-18 23:56:30 -07:00
James R. Barlow
83ff5760a8 Dockerfile comment cleanup 2015-08-18 23:41:41 -07:00
James R. Barlow
fed0ee638e Fix ruffus writing to RO directory in container 2015-08-18 23:30:06 -07:00
James R. Barlow
cc161780df Replace fileinput with regular open-replace
fileinput is supposed to save time in these cases but it's not capable
of doing both in-place rewrites and working with a non-ascii encoding.
This was not noticed until characters outside of ASCII were picked up
by tesseract and saved in a HOCR file. Rework some surrounding code as
well and add multilingual test cases.
2015-08-18 23:27:50 -07:00
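The "regular open-replace" approach named in cc161780df can be sketched as: read the whole file with an explicit encoding, substitute, and write it back. The function name and the sample strings are hypothetical; only the technique comes from the commit message.

```python
def replace_in_file(path, old, new, encoding="utf-8"):
    """Rewrite `path`, replacing every occurrence of `old` with `new`."""
    # Read the whole file with an explicit encoding so that non-ASCII
    # text (for example OCR output) is decoded correctly.
    with open(path, "r", encoding=encoding) as f:
        text = f.read()
    # Write the result back using the same encoding.
    with open(path, "w", encoding=encoding) as f:
        f.write(text.replace(old, new))
```

This holds the whole file in memory, which is a reasonable trade for files the size of a single page of OCR output.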
James R. Barlow
898b2b000a Works 2015-08-18 05:38:05 -07:00
James R. Barlow
b3ee743ed7 WIP on docker 2015-08-18 04:46:25 -07:00
James R. Barlow
ef17b669fe README needs ghostscript 2015-08-18 03:27:39 -07:00
James R. Barlow
2dff3e07ce Drop libxml2 dependency
It seems that Python's internal XML parser is good enough to do the job.
2015-08-17 15:26:07 -07:00
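As a rough illustration of 2dff3e07ce, the standard library can read hOCR-style XHTML without libxml2. The sketch assumes the "internal XML parser" is xml.etree.ElementTree; the sample markup and the function names are invented for the example.

```python
import xml.etree.ElementTree as ET

XHTML = "{http://www.w3.org/1999/xhtml}"

# Invented hOCR-like fragment, for illustration only.
HOCR_SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml">
 <body>
  <div class="ocr_page" title="bbox 0 0 2550 3300">
   <span class="ocrx_word" title="bbox 100 200 300 260">Hello</span>
   <span class="ocrx_word" title="bbox 320 200 520 260; x_wconf 90">wörld</span>
  </div>
 </body>
</html>"""


def parse_bbox(title):
    """Extract the four bbox integers from an hOCR title attribute."""
    for prop in title.split(";"):
        parts = prop.split()
        if parts and parts[0] == "bbox":
            return tuple(int(v) for v in parts[1:5])
    return None


def words_with_boxes(hocr_text):
    """Return (text, bbox) pairs for each ocrx_word span."""
    root = ET.fromstring(hocr_text)
    return [
        (span.text, parse_bbox(span.get("title", "")))
        for span in root.iter(XHTML + "span")
        if span.get("class") == "ocrx_word"
    ]
```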
James R. Barlow
53c88093ad Bump to -rc5 v3.0-rc5 2015-08-16 02:19:04 -07:00
James R. Barlow
0ec13d3a17 Fix test cases: minor issues
- os.environ directly modified when whole suite run, breaking subsequent
  tests
- no longer trusting JHOVE for PDF/A validation
2015-08-16 01:57:35 -07:00
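One way to avoid the os.environ leak mentioned in 0ec13d3a17 is to scope the change so it is undone automatically. The sketch below uses unittest.mock.patch.dict from the standard library; it is not necessarily how the project fixed it, and run_with_env and the variable name are hypothetical.

```python
import os
from unittest import mock


def run_with_env(overrides, func):
    """Call func() with os.environ temporarily updated by `overrides`."""
    # patch.dict restores the original mapping on exit, even if func raises.
    with mock.patch.dict(os.environ, overrides):
        return func()


# Usage: the variable is visible inside the call only.
value = run_with_env(
    {"OCRMYPDF_DEMO_VAR": "1"},
    lambda: os.environ["OCRMYPDF_DEMO_VAR"],
)
```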
jbarlow83
0d5104049a Update README with better install instructions 2015-08-16 01:28:28 -07:00
James R. Barlow
ce8fa69785 Update readme 2015-08-16 00:59:57 -07:00
James R. Barlow
30072e0c70 Pillow sucks
Far from being fluffy or friendly, Pillow silently allows installation
of itself without support for major image types.  Reportlab calls for
pillow 2.4.0.  On Ubuntu 14.04 LTS this will trigger an upgrade of
pillow that will be built without JPEG or ZLIB so it is effectively
neutered, and unfortunately Pillow will not detect this situation at
install time and guide users to a resolution.  Instead, you see nasty
stack traces.

So add a run-time check to ensure that Pillow is sane and capable of JPEG
and PNG support since both may be used internally.
2015-08-16 00:54:03 -07:00
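A run-time check of the kind 30072e0c70 describes might look like the following sketch. It tries to encode a tiny image in each format and reports which ones fail; the function name and the exact strategy are assumptions, not the project's implementation.

```python
import io


def missing_pillow_codecs():
    """Return the image formats Pillow cannot write (empty list if all work)."""
    try:
        from PIL import Image
    except ImportError:
        return ["Pillow"]  # Pillow is not installed at all
    missing = []
    for fmt in ("JPEG", "PNG"):
        try:
            # Encoding a small in-memory image exercises the codec.
            Image.new("RGB", (8, 8)).save(io.BytesIO(), format=fmt)
        except (OSError, KeyError):
            # Pillow signals an unavailable or unregistered encoder here.
            missing.append(fmt)
    return missing
```

A caller could abort early with a readable message when the returned list is not empty, instead of surfacing a stack trace later.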
James R. Barlow
eb04a890b2 Relax Pillow requirement for Ubuntu 14.04 LTS 2015-08-15 15:55:56 -07:00
James R. Barlow
0c53adb04f setup: rollback lxml version to 3.3.3 - that's the latest in Ubuntu 14.04 2015-08-15 15:25:58 -07:00
James R. Barlow
ee5a43fd47 setup: suppress jhove errors 2015-08-15 15:25:30 -07:00
James R. Barlow
c43d6c2cbe Merge branch 'develop' of https://github.com/fritz-hh/OCRmyPDF into develop
Conflicts:
	setup.py
2015-08-15 15:18:41 -07:00
James R. Barlow
87aeeacb04 Fix erroneous instruction to "apt-get install tesseract"
Should be tesseract-ocr
2015-08-15 15:17:38 -07:00
James R. Barlow
6b26e9cad6 Fix erroneous instruction to "apt-get install tesseract"
Should be tesseract-ocr
2015-08-15 15:12:05 -07:00
James R. Barlow
85af0f0d03 Add test case for blank PDF page 2015-08-14 00:46:50 -07:00
James R. Barlow
f6f4705ea3 Remove Java from setup.py 2015-08-14 00:44:56 -07:00
James R. Barlow
a4702bff22 Possible fix for issue #111 2015-08-13 23:10:22 -07:00