Commit Graph

896 Commits

Author SHA1 Message Date
James R. Barlow
ff16a00a3d Remove test for Pillow JPEG and PNG
As of 3.1.1, our minimum version, these codecs are now required by
default for a successful installation, effectively solving the problem
of Pillow installed without libjpeg/libpng.
2016-12-03 14:25:46 -08:00
James R. Barlow
8982b3e1e2 Update requirements
-update requirements.txt and dev_requirements.txt to more recent versions
-setup.py updated to Ubuntu 14.04 rather than 12.04 backports
-request at least Pillow 3.1.1 now (since this makes jpeg/png mandatory)
2016-12-03 14:14:07 -08:00
James R. Barlow
be0fa35d14 Merge branch 'master' into feature/ooruffus 2016-12-03 14:02:43 -08:00
James R. Barlow
9f51ed9d01 Finalize v4.3.3 release notes v4.3.3 2016-12-03 00:39:24 -08:00
James R. Barlow
731e6792c7 Add test cases for Ghostscript PDF/A warnings 2016-12-03 00:32:09 -08:00
James R. Barlow
c35ec0b4aa ghostscript: more effort at error logging 2016-12-03 00:22:03 -08:00
James R. Barlow
03aaf575dc v4.3.3 release notes, fix more gs 9.20 issues 2016-12-02 16:26:34 -08:00
James R. Barlow
9a060579ba Move work_folder into multiprocessing manager 2016-12-02 01:39:17 -08:00
James R. Barlow
d40a5c4f7a Remove all remaining traces of ‘options’ global state from task runners 2016-12-02 01:31:57 -08:00
James R. Barlow
21f7dc3377 Distribute ‘options’ to worker processes via the multiprocessing manager 2016-12-02 01:06:11 -08:00
James R. Barlow
43c13a1ed9 Replace pdfinfo, pdfinfo_lock with multiprocessing manager
Using a context manager to guard the pdfinfo list makes the lock
unnecessary. (Although it was probably unnecessary in the first place
anyway.)
2016-12-01 23:36:30 -08:00
James R. Barlow
6bc3f189e1 Remove “WrappedLogger” - does not do anything useful
Never really investigated the reason why ruffus returns a mutex to go
along with its logger. It seems that the mutex is only needed if one
wanted to make multiple successive calls to a log function and have
them appear atomically. It is not needed to protect the logger
proxy because accessing the proxy triggers IPC in the child process
that handles the multiprocessing.Manager() object.

The logging wrapper only logs one line at a time, so the mutex does not
actually protect logging sequence. Cut it.

Also manager.Lock() returns a threading.Lock object so the purpose of it
is actually to help processes share a thread-level lock. It would be
more appropriate to use a semaphore based multiprocessing.Lock.
2016-12-01 15:27:07 -08:00
James R. Barlow
2c5437135c Remove temporary re_symlink logging shim 2016-12-01 00:31:42 -08:00
James R. Barlow
444da02523 Fix mistake made in converting pipeline; incredibly, all tests pass now 2016-12-01 00:30:19 -08:00
James R. Barlow
00e8af2381 Reactivate the pipeline; surprisingly works in quick test 2016-12-01 00:03:03 -08:00
James R. Barlow
401b21864f Convert to object oriented ruffus syntax (does not run)
I experimented with the idea of using asyncio-based processing but
realized that that does not solve the import time binding problem
that is the real issue. Therefore the simpler refactoring is to convert
to ruffus-oo syntax and get things working again.

build_pipeline() is really ugly at the moment. The old syntax had its
advantages.

This test reproduces the complete pipeline graph but does not work
otherwise.
2016-11-30 23:58:26 -08:00
James R. Barlow
de939951d4 Record version in debug log 2016-11-29 15:30:50 -08:00
James R. Barlow
7725d16a26 Fix exception on inline stencil masks with no /CS attribute 2016-11-24 22:37:00 -08:00
James R. Barlow
8a74408d83 Add security suggestions 2016-11-21 20:58:31 -08:00
James R. Barlow
3d0dc95a06 Moved venvs 2016-11-21 20:40:22 -08:00
James R. Barlow
04a57a3cc2 OS X -> macOS 2016-11-21 20:40:06 -08:00
James R. Barlow
d0c22ce01d v4.3.2 release notes v4.3.2 2016-11-10 23:16:08 -08:00
James R. Barlow
23c95e9660 ghostscript: elide overprinting to fix PDF/A errors in GS 9.20
It looks like GS 9.19 can incorrectly set overprinting for the text layer
even though this makes no sense in PDF/A, or at least someone produced
PDFs that have this after a Tesseract PDF -> GS PDF/A conversion. GS 9.20
complains about this. Instead of aborting, elide the feature.

See
http://git.ghostscript.com/?p=ghostpdl.git;a=commitdiff;h=094d5a1880f1cb9ed320ca9353eb69436e09b594
and
issue #107.

It looks like it is better to elide features and warn about elision rather
than abort with an error.
2016-11-10 14:48:02 -08:00
James R. Barlow
eecab9b95d pdfa: fix KeyError on pdfa_dict if document has some xmp metadata but
not exactly what we’re looking for
2016-11-09 05:41:12 -08:00
James R. Barlow
8abc2f113c Merge branch 'develop' v4.3.1 2016-11-07 14:36:50 -08:00
James R. Barlow
949d2ff1c2 v4.3.1 release notes 2016-11-07 14:36:08 -08:00
James R. Barlow
1c8b763d53 test_pageinfo: Remove bits per component test
The behavior of this test will ultimately depend on what version of
img2pdf is installed, since after my patch it will be able to produce
1bpp images.
2016-11-07 14:35:54 -08:00
James R. Barlow
bb91393b85 Fix “deskew-rotate” bug.
Turns out this occurred in any case where pdf-renderer hocr was used
and a tesseract timeout or error occurred. We created a replacement
page based on the unrotated page dimensions instead of the input image’s
dimensions.
2016-11-07 14:17:31 -08:00
James R. Barlow
cc9c0d819e Add test case for documents that get rotated incorrectly after deskew 2016-11-07 14:15:03 -08:00
James R. Barlow
a72b8caf47 Update documentation on other languages, multilingual documents 2016-11-07 14:14:06 -08:00
James R. Barlow
fdd9b8b8ce Optimize some of the test resources to reduce file sizes
Mostly by reducing RGB -> monochrome and applying JBIG2 compression
2016-11-07 14:01:23 -08:00
James R. Barlow
c096b4ca8c Make debug dump of pageinfo at the end of processing readable 2016-11-04 02:23:02 -07:00
James R. Barlow
427add3008 Add @posttask debug hooks 2016-11-03 18:15:21 -07:00
James R. Barlow
c45871700d Fix bug: LeptonicaErrorTrap() leaks file handles 2016-11-03 15:51:27 -07:00
Sean Whitton
6821e8eeb2 disable mathjax sphinx extension (#103)
Mathjax isn't actually needed for OCRmyPDF's docs, but enabling this
extension causes the browser to download a copy of mathjax.js from
cdn.mathjax.org anyway.

I have to disable this for the offline docs bundled with Debian, but
since you're not using mathjax, it would be nice to have the diff merged
upstream.
2016-11-01 21:56:57 -07:00
James R. Barlow
a4f07756a5 tesseract caching: don't transcode tesseract's output, hash source file
For sanity's sake, deal with tesseract streams in binary without
transcoding (via universal_newlines, etc.). The only differences are
printing messages regarding spoofing.

Also hash the source file so that changes to the cache mechanism
invalidate old cache automatically. That is probably too aggressive,
but simple and safer than the previous approach.
2016-10-28 16:44:12 -07:00
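The cache-invalidation idea described in commit a4f07756a5 can be sketched roughly as follows. This is only an illustration, not the project's code: the helper name `cache_key`, the use of SHA-256, and the throwaway file standing in for the caching module's source are all assumptions made for the example.

```python
import hashlib
import os
import tempfile


def cache_key(source_path, args):
    """Hypothetical helper: digest of the caching code's own source plus the arguments."""
    h = hashlib.sha256()
    with open(source_path, 'rb') as f:
        h.update(f.read())  # any edit to the source file changes the key
    for arg in args:
        h.update(b'\0')  # separator so ['ab', 'c'] differs from ['a', 'bc']
        h.update(arg.encode('utf-8'))
    return h.hexdigest()


# Demonstration with a temporary file standing in for the cache module's source.
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, 'cache_module.py')
    with open(src, 'w') as f:
        f.write('VERSION = 1\n')
    key_before = cache_key(src, ['tesseract', 'page.png'])
    key_same = cache_key(src, ['tesseract', 'page.png'])
    with open(src, 'w') as f:
        f.write('VERSION = 2\n')  # simulate a change to the cache mechanism
    key_after = cache_key(src, ['tesseract', 'page.png'])
```

Because the source bytes feed the digest, every edit to the file invalidates all earlier entries, including edits that do not affect caching. That is the "too aggressive, but simple" trade-off the commit message mentions.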
James R. Barlow
f24fb0e0c5 Obligatory MANIFEST.in repair v4.3 2016-10-28 01:28:46 -07:00
James R. Barlow
73b88a0a6f More work on documentation 2016-10-28 01:22:40 -07:00
James R. Barlow
c42f39e2d4 Update README to point to ReadTheDocs 2016-10-28 00:33:17 -07:00
James R. Barlow
5e5fe3175f docs: OS X -> macOS branding change 2016-10-28 00:32:57 -07:00
James R. Barlow
cab65d1f11 pageinfo: add a python3.4 implementation of isclose() 2016-10-28 00:31:04 -07:00
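For context, `math.isclose()` was added in Python 3.5, so Python 3.4 needs a fallback. A minimal sketch of such a fallback, following the comparison defined in PEP 485, might look like the code below. It is an assumption about the general shape, not the code from this commit, and it omits the argument validation that the standard library version performs.

```python
import math


def isclose(a, b, rel_tol=1e-9, abs_tol=0.0):
    """Approximate equality test in the style of PEP 485 (illustrative fallback)."""
    if a == b:
        return True  # covers exact matches, including equal infinities
    if math.isinf(a) or math.isinf(b):
        return False  # an infinity is only close to itself
    diff = abs(a - b)
    # Close if within the relative tolerance of the larger magnitude,
    # or within the absolute tolerance (useful for comparisons near zero).
    return diff <= max(rel_tol * max(abs(a), abs(b)), abs_tol)
```

With the default `abs_tol=0.0`, no nonzero value is considered close to zero, so callers comparing against zero would need to pass an explicit `abs_tol`.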
James R. Barlow
245f05d5f4 docs: allow python setup.py install --force to bypass checks
ReadTheDocs needs this.
2016-10-28 00:07:26 -07:00
James R. Barlow
dda751f9e3 Merge branch 'feature/docs' into develop
# Conflicts:
#	ocrmypdf/__main__.py
2016-10-27 23:50:08 -07:00
James R. Barlow
3d37ae988a Update release notes for 4.3 2016-10-27 23:48:12 -07:00
James R. Barlow
717acd9855 Prevent dumping binary PDFs to stdout 2016-10-27 16:20:53 -07:00
James R. Barlow
2e4431cc63 Allow piping output to stdout 2016-10-27 16:14:42 -07:00
James R. Barlow
f7387b0859 test_stdin: simplify this test
No need to involve 'cat', just hook the file up to stdin.
2016-10-27 16:01:07 -07:00
James R. Barlow
a09f6b8977 Test cases: check that stdout is clear of output
To ensure piping to stdout is possible.
2016-10-27 15:58:24 -07:00
James R. Barlow
d63449c214 main: don't print output file location to stdout, use stderr 2016-10-27 15:57:33 -07:00
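The reasoning behind commits d63449c214 and a09f6b8977 is that stdout must carry nothing but the output file when it is used as a pipe, so status messages belong on stderr. A small sketch of that pattern, and of a check that stdout stays clear, is shown below. The function name `report_output_location` and the message text are hypothetical and not taken from the project.

```python
import contextlib
import io
import sys


def report_output_location(path):
    """Hypothetical helper: status messages go to stderr, never stdout."""
    print("Output written to {}".format(path), file=sys.stderr)


# Capture both streams to confirm that stdout receives nothing.
out, err = io.StringIO(), io.StringIO()
with contextlib.redirect_stdout(out), contextlib.redirect_stderr(err):
    report_output_location("output.pdf")
```

After this runs, `out.getvalue()` is empty and the message appears only in `err.getvalue()`.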
James R. Barlow
a86805f0d9 Remove possibly non-free page from "multipage.pdf" 2016-10-27 15:56:43 -07:00