I made several attempts to contact Fritz, but he is no longer
responding, and setting up GitHub integrations with Docker and Travis
CI requires admin access, which I don't have. So I'm moving the
repository to my own account and pointing the old one at it.
Someone reported a bug where the .png input to unpaper ended up being
mode 'P' (palette) for some reason, which was not supported in unpaper.
Not sure how it happened, but it seemed easier to fix by supporting
palette images explicitly. Here we use png256 if it would capture all
colors in the input file. It's up to tesseract/reportlab to make use of
the palette PNG when rendering.
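A minimal sketch of the "use png256 if it captures all colors" decision,
for illustration only. The function name choose_png_device and the
png16m fallback are assumptions, not taken from the project; the pixels
are passed in as plain tuples so the sketch does not depend on any
imaging library.

```python
def choose_png_device(pixels, max_palette=256):
    """Return 'png256' if the distinct colors fit in a palette.

    `pixels` is any iterable of hashable color values, e.g. (r, g, b)
    tuples. The 'png16m' fallback is an assumption for this sketch.
    """
    seen = set()
    for px in pixels:
        seen.add(px)
        # Stop early: one color too many already rules out a palette.
        if len(seen) > max_palette:
            return "png16m"
    return "png256"
```

Stopping as soon as the count exceeds the palette size avoids scanning
the rest of a large full-color page.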
Debian and Ubuntu both install unpaper 0.4.2 or so. No .deb packages
are available at higher version numbers, although Arch Linux had
something. Considered making a separate image to handle building and
installing, but decided that was a premature optimization at this
point, so just build the unpaper that works. All tests pass.
Switched from Ubuntu to debian:stretch because stretch has more recent
versions of our binary packages and starts smaller. In particular,
stretch has both pillow==2.9.0 and reportlab==3.2.0 available as system
packages, which saves the considerable hassle of installing a toolchain.
Instead, a pyvenv is set up with access to the system's site-packages
(note: needs two steps), making the binary-dependent packages available.
Then the remaining packages are installed into the pyvenv with
--no-cache-dir to avoid keeping pip's download cache in the image. And
there we are.
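As a rough illustration of the system site-packages part only, the same
kind of environment can be created from Python with the standard venv
module, which is what the pyvenv script wraps. The helper name
create_env is hypothetical, and this sketch does not reproduce the
two-step workaround or the pip install mentioned above.

```python
import venv
from pathlib import Path


def create_env(env_dir):
    """Create a virtual environment that can see system site-packages.

    Returns the path to the generated pyvenv.cfg. pip is deliberately
    left out (with_pip=False) to keep the sketch minimal.
    """
    builder = venv.EnvBuilder(system_site_packages=True, with_pip=False)
    builder.create(str(env_dir))
    return Path(env_dir) / "pyvenv.cfg"
```

The resulting pyvenv.cfg records the setting as
"include-system-site-packages = true", which is what lets packages
installed by the distribution be imported from inside the environment.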
Image is still very large (>500 MB), but programs like reportlab require
font rendering capabilities so they pull in large portions of the Linux
graphics stack. Not much will shrink that.
fileinput is supposed to save time in these cases, but it's not capable
of both rewriting in place and working with a non-ASCII encoding.
This was not noticed until characters outside of ASCII were picked up
by tesseract and saved in an hOCR file. Rework some surrounding code as
well and add multilingual test cases.
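One way to do an in-place rewrite with an explicit encoding, without
fileinput, is to write to a temporary file and swap it in. This is a
sketch under assumptions, not the code from this commit: the name
rewrite_in_place, the transform callback, and the UTF-8 default are all
hypothetical.

```python
import os
import tempfile


def rewrite_in_place(path, transform, encoding="utf-8"):
    """Rewrite a text file line by line using an explicit encoding.

    `transform` takes one line (str) and returns the replacement line.
    newline="" keeps the original line endings untouched.
    """
    dir_name = os.path.dirname(os.path.abspath(path))
    # Temp file in the same directory so os.replace stays on one volume.
    fd, tmp_path = tempfile.mkstemp(dir=dir_name, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding=encoding, newline="") as dst, \
                open(path, "r", encoding=encoding, newline="") as src:
            for line in src:
                dst.write(transform(line))
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the temp file if anything failed before the swap.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

For example, rewrite_in_place("page.hocr", lambda s: s.replace("a", "b"))
would edit a hypothetical page.hocr while leaving accented characters
intact.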
Far from being fluffy or friendly, Pillow silently allows installation
of itself without support for major image types. Reportlab calls for
Pillow 2.4.0. On Ubuntu 14.04 LTS this will trigger an upgrade of
Pillow that will be built without JPEG or ZLIB support, so it is
effectively neutered, and unfortunately Pillow will not detect this
situation at install time and guide users to a resolution. Instead, you
see nasty stack traces.
So add a run-time check to ensure that Pillow is sane and capable of JPEG
and PNG support, since both may be used internally.
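A run-time check along these lines could round-trip a tiny image through
each format in memory. This is a hedged sketch, not the check added in
this commit: the name missing_pillow_formats is hypothetical, and
treating any exception as "format unavailable" is a simplifying
assumption.

```python
from io import BytesIO


def missing_pillow_formats(formats=("JPEG", "PNG")):
    """Return the formats Pillow cannot write and read back here."""
    try:
        from PIL import Image
    except ImportError:
        # No Pillow at all: every requested format is unavailable.
        return list(formats)

    missing = []
    for fmt in formats:
        try:
            buf = BytesIO()
            Image.new("RGB", (8, 8), "white").save(buf, format=fmt)
            buf.seek(0)
            Image.open(buf).load()  # force the decoder to run
        except Exception:
            # Assumption: any failure means the codec is not usable.
            missing.append(fmt)
    return missing
```

A caller could fail early with a readable message, for example by
exiting if missing_pillow_formats() returns a non-empty list, rather
than letting a stack trace surface mid-run.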