Commit Graph

571 Commits

Author SHA1 Message Date
James R. Barlow
58f4582517 More Dockerfile repair
I'm not fully happy with this arrangement, as it effectively downloads
OCRmyPDF twice, not to mention the lengthy setup time overall.

Will need to try separate build/run images in the future, but for now just
get it working again.
2016-02-06 23:13:16 -08:00
James R. Barlow
2d15c09cca Merge branch 'develop' 2016-02-06 18:18:49 -08:00
James R. Barlow
04cb8865b0 Fetch application from PyPI instead of local
setuptools_scm barfs because it can't find the version, because Docker Hub
retrieves the application from GitHub in a way that omits the necessary
details.

I suppose there is a certain logic to Docker only using the tagged
released versions from PyPI, so go with it.  The other attractive option
is to nix setuptools_scm.
2016-02-06 18:18:30 -08:00
James R. Barlow
6fe32bbaf7 v3.2.1 v3.2.1 2016-02-05 16:10:18 -08:00
James R. Barlow
4abb20390d Bump Dockerfile versions 2016-02-05 16:08:26 -08:00
James R. Barlow
daa3916430 Fix img2pdf 0.2 usage
All tests pass when forced to rely on img2pdf, so seems okay
2016-02-05 15:13:26 -08:00
James R. Barlow
e9b87cefcc Try img2pdf 0.2 2016-02-05 14:38:37 -08:00
James R. Barlow
60593b5ad3 Tighten up package requirements to deal with incompatible img2pdf 0.2 release 2016-02-05 14:37:05 -08:00
James R. Barlow
f708b11ea4 Fix Python 2.7 warning 2016-02-05 02:34:49 -08:00
James R. Barlow
7982f58b2e Try tweaking Dockerfile for automated build again v3.2.post2 2016-02-05 01:38:59 -08:00
James R. Barlow
e805c1908a Minor fix for Dockerfile polyglot v3.2.post1 2016-02-05 00:52:27 -08:00
James R. Barlow
cb3ba8e973 Merge branch 'release/v3.2' into develop 2016-02-05 00:10:41 -08:00
James R. Barlow
344fc40cbc Merge branch 'release/v3.2' v3.2 2016-02-05 00:10:41 -08:00
James R. Barlow
7e5c37137b Merge branch 'develop' into release/v3.2 2016-02-04 23:42:06 -08:00
James R. Barlow
1aae11714b Update release notes for v3.2 2016-02-04 23:41:33 -08:00
James R. Barlow
d82f14a7aa Update .gitignore 2016-02-04 18:51:41 -08:00
James R. Barlow
4b65e0b093 Set JPEG output quality to 95 for better transcoding 2016-02-04 18:49:09 -08:00
James R. Barlow
43b0faa830 Bug in tesseract_noop spoof: produced wrong page sizes
Now checks input image to ensure the implied page size of its .hocr file
matches the rest of the PDF.
2016-02-04 18:48:22 -08:00
James R. Barlow
8674c9fb20 Merge commit 'ccfbb54e8c26784e438ba2fcac2179f21e7d857b' into release/v3.2 2016-02-04 17:39:36 -08:00
jbarlow83
ccfbb54e8c Update release notes for v3.2
Fix the notes
2016-02-04 17:37:30 -08:00
James R. Barlow
9893ebf889 Suppress tesseract argument printout 2016-02-04 17:26:36 -08:00
James R. Barlow
303eb3e93a Merge commit 'ca546d70e5bff9e9b115371f7813f3c326822bd8' into release/v3.2 2016-02-04 17:25:56 -08:00
jbarlow83
ca546d70e5 Merge pull request #45 from spwhitton/hocrtransform-shebang-fix
fix shebang in hocrtransform.py
2016-02-04 17:21:33 -08:00
Sean Whitton
6a5ea2d64a fix shebang in hocrtransform.py 2016-02-03 17:48:35 -07:00
James R. Barlow
bacbcba58a Merge branch 'release/v3.2-rc1' v3.2rc1 2016-01-19 16:58:37 -08:00
James R. Barlow
52e8aa434f Update release notes for v3.2-rc1 2016-01-19 16:49:49 -08:00
James R. Barlow
37c508f3f8 Better versioning: no silly version files, but wrong ver in development
Small price to pay.
2016-01-19 16:07:52 -08:00
James R. Barlow
26e36422cc More fiddling with version 2016-01-19 15:07:21 -08:00
James R. Barlow
f82cb002bc Try automatic versioning with setuptools_scm 2016-01-19 13:27:18 -08:00
James R. Barlow
c1eb047a4b Fix name of pdfa_def.ps
Used to include a copy of the parent dir's name.
2016-01-19 13:11:03 -08:00
James R. Barlow
626ca18f5c Remove stale comment 2016-01-19 13:02:35 -08:00
James R. Barlow
9058dedfbe New tests for ccitt, jbig2 encodings 2016-01-19 13:01:56 -08:00
James R. Barlow
a0952bfca3 Optimize: use img2pdf stream instead of repeated copies 2016-01-18 20:24:46 -08:00
James R. Barlow
354e61946e Use os.makedirs for test output directories
Broke Travis
2016-01-16 02:47:56 -08:00
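A minimal sketch of the idea behind this commit, assuming the test output directories are nested and may be missing on a fresh CI checkout. The helper name `ensure_output_dir` is hypothetical and is not taken from the project; `exist_ok` requires Python 3.2 or later.

```python
import os
import tempfile


def ensure_output_dir(base, *parts):
    """Create base/parts... including any missing parents, and return the path."""
    path = os.path.join(base, *parts)
    # os.makedirs creates intermediate directories; os.mkdir would fail
    # if a parent is missing. exist_ok=True makes reruns harmless.
    os.makedirs(path, exist_ok=True)
    return path


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as scratch:
        print(ensure_output_dir(scratch, "output", "resources"))
```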
James R. Barlow
fd6d1d748a Merge branch 'feature/pypdf-page-merge' into develop 2016-01-16 02:33:23 -08:00
James R. Barlow
360acd1e2c Adjust test_oversample test case
Add -f to force generation of the background image at the desired
oversample resolution.  Our new behavior is to only send the oversampled
image to Tesseract while leaving the main page intact unless asked to
deskew, clean, etc.
2016-01-15 15:55:23 -08:00
James R. Barlow
fc0479f110 Fix all but test_oversample[hocr] 2016-01-15 15:46:47 -08:00
James R. Barlow
62728205b6 Implement image+text merging in other cases
5 failed, 28 passed

failures:
test_oversample[hocr], test_skip_ocr, test_skip_big, test_maximum_options[hocr],
test_blank_input_pdf
2016-01-15 15:38:08 -08:00
James R. Barlow
dc0fb25e64 Render hocr page: no longer needs an image as input 2016-01-15 15:16:47 -08:00
James R. Barlow
f3e04cce56 Update pipeline.svg 2016-01-15 14:56:16 -08:00
James R. Barlow
7067110308 Add safety check to prevent merge from running when not sensible 2016-01-15 14:54:45 -08:00
James R. Barlow
599d889703 Implement "perfect reconstruction" - transfer page and watermark OCR layer
Works, does not account for changes to clean/deskew, etc.
Surprisingly, it works. PyPDF2 fixes since last attempt?
2016-01-15 14:39:12 -08:00
James R. Barlow
2fa8366632 Merge branch 'feature/test-pageinfo-cleanup' into develop 2016-01-15 14:18:01 -08:00
James R. Barlow
c368c51bad New hocrtransform test 2016-01-15 14:14:08 -08:00
James R. Barlow
7c558b3713 Move pageinfo test into tests folder 2016-01-11 17:40:44 -08:00
James R. Barlow
8d323ae510 Merge branch 'feature/pagesegmode' into develop 2016-01-11 17:23:00 -08:00
James R. Barlow
3b53e9adac Use tesseract cache for -psm 2016-01-11 17:22:50 -08:00
James R. Barlow
074c1d71b4 Activate --tesseract-pagesegmode 2016-01-11 17:19:32 -08:00
James R. Barlow
1fca9a004d Adjust command line parameters
Was splitting each argument to --tesseract-config into a list of
single-character strings
2016-01-11 16:57:19 -08:00
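The log does not show what caused the splitting, so the following is only an illustration of the general failure mode: a Python `str` is an iterable of characters, so any code that treats an option value as a list yields one-character strings. The sketch uses plain `argparse` with `action="append"` as one way to keep each value whole. The function names are hypothetical, and this is not presented as the project's actual fix.

```python
import argparse


def parse_configs(argv):
    """Collect every --tesseract-config occurrence as one whole string."""
    parser = argparse.ArgumentParser(prog="example")
    # action="append" adds each occurrence to a list without splitting it.
    parser.add_argument("--tesseract-config", action="append", default=[])
    return parser.parse_args(argv).tesseract_config


def split_into_characters(value):
    """Show the failure mode: iterating a str yields single characters."""
    return list(value)


if __name__ == "__main__":
    print(parse_configs(["--tesseract-config", "hocr"]))
    print(split_into_characters("hocr"))
```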
James R. Barlow
b485a1ef78 Override ruffus' handling of --jobs
Ruffus treats an omitted parameter as -j1. For our purposes it makes more
sense for omitting the parameter to mean "use all CPUs". As such we
must be able to distinguish -j1 from the parameter -j being omitted.

Telling ruffus to ignore the argument actually just makes it not
auto-generate the argument. We can add an argument back with the same name.
2016-01-09 19:07:48 -08:00
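The distinction described in this commit can be sketched with plain `argparse`, leaving ruffus out entirely: a default of `None` marks "option omitted", which is then resolved to the CPU count, while an explicit `-j 1` stays 1. The names `build_parser` and `resolve_jobs` are hypothetical, and the project's real parser setup may differ.

```python
import argparse
import os


def build_parser():
    """Stand-in parser; in the real project ruffus supplies most options."""
    parser = argparse.ArgumentParser(prog="example")
    # default=None lets us tell "-j omitted" apart from an explicit "-j 1".
    parser.add_argument("-j", "--jobs", type=int, default=None,
                        help="number of concurrent jobs (default: all CPUs)")
    return parser


def resolve_jobs(args):
    """Map an omitted --jobs to the CPU count; keep explicit values as given."""
    if args.jobs is None:
        # os.cpu_count() can return None on unusual platforms; fall back to 1.
        return os.cpu_count() or 1
    return args.jobs


if __name__ == "__main__":
    parser = build_parser()
    print(resolve_jobs(parser.parse_args(["-j", "1"])))
    print(resolve_jobs(parser.parse_args([])))
```

Using a sentinel default rather than a numeric one is what makes the two cases distinguishable; a default of 1 would make an omitted option look identical to `-j 1`.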