As of 3.1.1, our minimum version, these codecs are now required by
default for a successful installation, effectively solving the problem
of Pillow installed without libjpeg/libpng.
-update requirements.txt and dev_requirements.txt to more recent versions
-update setup.py for Ubuntu 14.04 rather than 12.04 backports
-require at least Pillow 3.1.1 now (since this makes jpeg/png mandatory)
Never really investigated the reason why ruffus returns a mutex to go
along with its logger. It seems that the mutex is only needed if one
wanted to make multiple successive calls to a log function and have
them appear atomically. It is not needed to protect the logger
proxy because accessing the proxy triggers IPC in the child process
that handles the multiprocessing.Manager() object.
The logging wrapper only logs one line at a time, so the mutex does not
actually protect the logging sequence. Cut it.
Also, manager.Lock() returns a proxy for a threading.Lock object, so its
purpose is really to let processes share a thread-level lock. It would be
more appropriate to use a semaphore-based multiprocessing.Lock.
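
To illustrate when a lock matters for logging, here is a minimal sketch (not the project's code). The helper log_block, the ListHandler class and the logger name are hypothetical. The lock only keeps a group of successive log calls together; a single call would not need it.

```python
import logging
import multiprocessing


def log_block(logger, lock, lines):
    """Emit several log lines as one uninterrupted group.

    A single logger call needs no lock; the lock only keeps other
    writers from interleaving between the successive calls.
    """
    with lock:
        for line in lines:
            logger.info(line)


class ListHandler(logging.Handler):
    """Collect log messages in memory so their order can be inspected."""

    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        self.messages.append(record.getMessage())


# Semaphore-based lock that can be shared with child processes.
lock = multiprocessing.Lock()
handler = ListHandler()
logger = logging.getLogger("sketch")  # hypothetical logger name
logger.setLevel(logging.INFO)
logger.addHandler(handler)

log_block(logger, lock, ["page 1: start", "page 1: done"])
```

With one line per call, as in the logging wrapper described above, the `with lock:` block adds nothing, which is the argument for removing it.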
I experimented with the idea of using asyncio-based processing but
realized that this does not solve the import-time binding problem
that is the real issue. Therefore the simpler refactoring is to convert
to ruffus-oo syntax and get things working again.
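
The import-time binding problem can be shown in plain Python. This is only an illustration of the general issue, not ruffus's API: the decorator task, REGISTRY, DEFAULT_OPTIONS and this toy build_pipeline are all made up for the sketch.

```python
# Decorator style: the pipeline is wired up as a side effect of import,
# so it can only see whatever configuration exists at import time.
REGISTRY = []
DEFAULT_OPTIONS = {"dpi": 300}  # hypothetical option set


def task(func):
    """Register func together with a snapshot of the options known right now."""
    REGISTRY.append((func.__name__, dict(DEFAULT_OPTIONS)))
    return func


@task
def rasterize():
    pass


# Builder style: nothing is bound until the caller supplies options.
def build_pipeline(options):
    """Return a toy pipeline description bound to the given options."""
    return [("rasterize", dict(options))]


pipeline = build_pipeline({"dpi": 600})
```

The decorated task is stuck with the options that existed when the module was imported, while the builder can be called once the real options are known.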
build_pipeline() is really ugly at the moment. The old syntax had its
advantages.
This test reproduces the complete pipeline graph but does not work
otherwise.
It looks like GS 9.19 can incorrectly set overprinting for the text layer
even though this makes no sense in PDF/A, or at least someone produced
PDFs that have this after a Tesseract PDF -> GS PDF/A conversion. GS 9.20
complains about this. Instead of aborting, elide the feature.
See
http://git.ghostscript.com/?p=ghostpdl.git;a=commitdiff;h=094d5a1880f1cb9ed320ca9353eb69436e09b594
and
issue #107.
It looks like it is better to elide features and warn about elision rather
than abort with an error.
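
A rough sketch of the elide-and-warn approach, under the assumption that the offending settings can be modelled as keys in a dictionary. The commit does not say how the elision is actually done; elide_overprint is a hypothetical helper. OP, op and OPM are the overprint entries of a PDF extended graphics state dictionary.

```python
import logging

log = logging.getLogger(__name__)

# Overprint-related keys of a PDF extended graphics state dictionary.
OVERPRINT_KEYS = ("OP", "op", "OPM")


def elide_overprint(gstate):
    """Return a copy of gstate without overprint keys, warning if any were dropped."""
    dropped = [key for key in OVERPRINT_KEYS if key in gstate]
    if dropped:
        log.warning("eliding unsupported overprint settings: %s", ", ".join(dropped))
    return {k: v for k, v in gstate.items() if k not in OVERPRINT_KEYS}


# Hypothetical input: one overprint key among otherwise harmless entries.
cleaned = elide_overprint({"Type": "ExtGState", "OP": True, "LW": 1})
```

The caller still gets a usable result, and the warning records what was removed.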
Turns out this occurred in any case where pdf-renderer hocr was used
and a tesseract timeout or error occurred. We created a replacement
page based on the unrotated page dimensions instead of the input image’s
dimensions.
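
For reference, the arithmetic for sizing a replacement page from the input image looks roughly like this. page_size_points is a hypothetical helper, and the 300 DPI landscape scan is just an example.

```python
def page_size_points(width_px, height_px, dpi):
    """Convert image pixel dimensions to PDF points (1 point = 1/72 inch)."""
    return (width_px * 72.0 / dpi, height_px * 72.0 / dpi)


# A US Letter scan at 300 DPI, stored in landscape orientation.
replacement_size = page_size_points(3300, 2550, 300)
```

Here the image gives a 792 x 612 point page, whereas a portrait page of 612 x 792 points would not match it.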
Mathjax isn't actually needed for OCRmyPDF's docs, but enabling this
extension causes the browser to download a copy of mathjax.js from
cdn.mathjax.org anyway.
I have to disable this for the offline docs bundled with Debian, but
since you're not using mathjax, it would be nice to have the diff merged
upstream.
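
Assuming the docs are built with Sphinx, the change amounts to leaving sphinx.ext.mathjax out of the extensions list in conf.py. The fragment below is a sketch, and the other entry in the list is a placeholder rather than the project's actual configuration.

```python
# Sphinx conf.py (fragment)
extensions = [
    "sphinx.ext.autodoc",  # hypothetical placeholder entry
    # "sphinx.ext.mathjax",  # disabled: the docs contain no math, and this
    #                        # extension makes browsers fetch MathJax from a CDN
]
```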
For sanity's sake, deal with tesseract streams in binary without
transcoding (via universal_newlines, etc.). The only differences are
printing messages regarding spoofing.
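
A minimal sketch of keeping subprocess streams in binary. run_capture_bytes is a hypothetical helper, and a child Python process stands in for tesseract so that the example runs anywhere.

```python
import subprocess
import sys


def run_capture_bytes(args):
    """Run a command and return its stdout as undecoded bytes."""
    # No universal_newlines argument: the pipes stay in binary mode.
    proc = subprocess.run(args, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    return proc.stdout


# Stand-in for tesseract: a child process that emits a byte sequence
# that is not valid UTF-8.
child = [sys.executable, "-c", "import sys; sys.stdout.buffer.write(b'ok\\xff')"]
raw = run_capture_bytes(child)

# Decode explicitly, and only at the point where text is needed.
text = raw.decode("utf-8", errors="replace")
```

Because nothing is transcoded on the way in, unexpected bytes in the output cannot raise a decoding error inside the subprocess call.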
Also hash the source file so that changes to the cache mechanism
invalidate the old cache automatically. That is probably too aggressive,
but it is simple and safer than the previous approach.
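
One way to sketch this idea, assuming the cache is keyed by a hash. cache_key and its parameters are hypothetical names, not the project's code.

```python
import hashlib


def cache_key(source_path, args):
    """Derive a cache key from the caching script's own bytes plus the call arguments."""
    digest = hashlib.sha256()
    with open(source_path, "rb") as f:
        digest.update(f.read())
    for arg in args:
        digest.update(b"\0")  # separator so ["ab"] and ["a", "b"] differ
        digest.update(arg.encode("utf-8"))
    return digest.hexdigest()
```

Called as cache_key(__file__, args), any edit to the script changes every key, so stale entries are simply never found again. That is the "too aggressive but simple" trade-off described above.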