James R. Barlow
58f4582517
More Dockerfile repair
...
I'm not fully happy with this arrangement, as it effectively downloads
OCRmyPDF twice, not to mention the lengthy setup time overall.
Will need to try separate build/run images in the future, but for now just
get it working again.
2016-02-06 23:13:16 -08:00
James R. Barlow
2d15c09cca
Merge branch 'develop'
2016-02-06 18:18:49 -08:00
James R. Barlow
04cb8865b0
Fetch application from PyPI instead of local
...
setuptools_scm barfs because it can't find the version: Docker Hub
retrieves the application from GitHub in a way that omits the necessary
details.
I suppose there is a certain logic to Docker only using the tagged
released versions from PyPI, so go with it. The other attractive option
is to nix setuptools_scm.
2016-02-06 18:18:30 -08:00
James R. Barlow
6fe32bbaf7
v3.2.1
v3.2.1
2016-02-05 16:10:18 -08:00
James R. Barlow
4abb20390d
Bump Dockerfile versions
2016-02-05 16:08:26 -08:00
James R. Barlow
daa3916430
Fix img2pdf 0.2 usage
...
All tests pass when forced to rely on img2pdf, so seems okay
2016-02-05 15:13:26 -08:00
James R. Barlow
e9b87cefcc
Try img2pdf 0.2
2016-02-05 14:38:37 -08:00
James R. Barlow
60593b5ad3
Tighten up package requirements to deal with incompatible img2pdf 0.2 release
2016-02-05 14:37:05 -08:00
James R. Barlow
f708b11ea4
Fix Python 2.7 warning
2016-02-05 02:34:49 -08:00
James R. Barlow
7982f58b2e
Try tweaking Dockerfile for automated build again
v3.2.post2
2016-02-05 01:38:59 -08:00
James R. Barlow
e805c1908a
Minor fix for Dockerfile polyglot
v3.2.post1
2016-02-05 00:52:27 -08:00
James R. Barlow
cb3ba8e973
Merge branch 'release/v3.2' into develop
2016-02-05 00:10:41 -08:00
James R. Barlow
344fc40cbc
Merge branch 'release/v3.2'
v3.2
2016-02-05 00:10:41 -08:00
James R. Barlow
7e5c37137b
Merge branch 'develop' into release/v3.2
2016-02-04 23:42:06 -08:00
James R. Barlow
1aae11714b
Update release notes for v3.2
2016-02-04 23:41:33 -08:00
James R. Barlow
d82f14a7aa
Update .gitignore
2016-02-04 18:51:41 -08:00
James R. Barlow
4b65e0b093
Set JPEG output quality to 95 for better transcoding
2016-02-04 18:49:09 -08:00
James R. Barlow
43b0faa830
Bug in tesseract_noop spoof: produced wrong page sizes
...
Now checks input image to ensure the implied page size of its .hocr file
matches the rest of the PDF.
2016-02-04 18:48:22 -08:00
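The page-size check described in the commit above relies on the usual relation between pixel dimensions, resolution, and PDF points (72 points per inch). The following is a minimal sketch of that relation only. The function names and the tolerance are hypothetical and are not taken from the project's code.

```python
def implied_page_size_points(width_px, height_px, dpi):
    """Return the (width, height) in PDF points implied by an image.

    Assumption: the image has a single, uniform resolution in dots per inch.
    """
    if dpi <= 0:
        raise ValueError("dpi must be positive")
    points_per_inch = 72.0
    return (width_px * points_per_inch / dpi,
            height_px * points_per_inch / dpi)


def sizes_match(size_a, size_b, tolerance=1.0):
    """Compare two (width, height) sizes in points within a tolerance.

    The tolerance of one point is an illustrative choice, not a project value.
    """
    return (abs(size_a[0] - size_b[0]) <= tolerance
            and abs(size_a[1] - size_b[1]) <= tolerance)


# A 2550 x 3300 pixel scan at 300 DPI corresponds to US Letter (612 x 792 pt).
letter = implied_page_size_points(2550, 3300, 300)
```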
James R. Barlow
8674c9fb20
Merge commit 'ccfbb54e8c26784e438ba2fcac2179f21e7d857b' into release/v3.2
2016-02-04 17:39:36 -08:00
jbarlow83
ccfbb54e8c
Update release notes for v3.2
...
Fix the notes
2016-02-04 17:37:30 -08:00
James R. Barlow
9893ebf889
Suppress tesseract argument printout
2016-02-04 17:26:36 -08:00
James R. Barlow
303eb3e93a
Merge commit 'ca546d70e5bff9e9b115371f7813f3c326822bd8' into release/v3.2
2016-02-04 17:25:56 -08:00
jbarlow83
ca546d70e5
Merge pull request #45 from spwhitton/hocrtransform-shebang-fix
...
fix shebang in hocrtransform.py
2016-02-04 17:21:33 -08:00
Sean Whitton
6a5ea2d64a
fix shebang in hocrtransform.py
2016-02-03 17:48:35 -07:00
James R. Barlow
bacbcba58a
Merge branch 'release/v3.2-rc1'
v3.2rc1
2016-01-19 16:58:37 -08:00
James R. Barlow
52e8aa434f
Update release notes for v3.2-rc1
2016-01-19 16:49:49 -08:00
James R. Barlow
37c508f3f8
Better versioning: no silly version files, but wrong ver in development
...
Small price to pay.
2016-01-19 16:07:52 -08:00
James R. Barlow
26e36422cc
More fiddling with version
2016-01-19 15:07:21 -08:00
James R. Barlow
f82cb002bc
Try automatic versioning with setuptools_scm
2016-01-19 13:27:18 -08:00
James R. Barlow
c1eb047a4b
Fix name of pdfa_def.ps
...
Used to include a copy of the parent dir's name.
2016-01-19 13:11:03 -08:00
James R. Barlow
626ca18f5c
Remove stale comment
2016-01-19 13:02:35 -08:00
James R. Barlow
9058dedfbe
New tests for ccitt, jbig2 encodings
2016-01-19 13:01:56 -08:00
James R. Barlow
a0952bfca3
Optimize: use img2pdf stream instead of repeated copies
2016-01-18 20:24:46 -08:00
James R. Barlow
354e61946e
Use os.makedirs for test output directories
...
Broke Travis
2016-01-16 02:47:56 -08:00
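For context on the commit above, here is a minimal sketch of creating nested test output directories with `os.makedirs`. The helper name and directory layout are hypothetical. `exist_ok=True` assumes Python 3.2 or later, and the sketch does not show the project's actual test setup.

```python
import os
import tempfile


def ensure_output_dir(base, *parts):
    """Create base/parts... including intermediate directories.

    exist_ok=True makes repeated calls harmless, which matters when
    several tests share the same output directory.
    """
    path = os.path.join(base, *parts)
    os.makedirs(path, exist_ok=True)
    return path


# Example: a nested directory under a throwaway temporary location.
base_dir = tempfile.mkdtemp()
out_dir = ensure_output_dir(base_dir, "output", "test_case")
```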
James R. Barlow
fd6d1d748a
Merge branch 'feature/pypdf-page-merge' into develop
2016-01-16 02:33:23 -08:00
James R. Barlow
360acd1e2c
Adjust test_oversample test case
...
Add -f to force generation of the background image at the desired
oversample resolution. Our new behavior is to only send the oversampled
image to Tesseract while leaving the main page intact unless asked to
deskew, clean, etc.
2016-01-15 15:55:23 -08:00
James R. Barlow
fc0479f110
Fix all but test_oversample[hocr]
2016-01-15 15:46:47 -08:00
James R. Barlow
62728205b6
Implement image+text merging in other cases
...
5 failed, 28 passed
failures:
test_oversample[hocr], test_skip_ocr, test_skip_big, test_maximum_options[hocr],
test_blank_input_pdf
2016-01-15 15:38:08 -08:00
James R. Barlow
dc0fb25e64
Render hocr page: no longer needs an image as input
2016-01-15 15:16:47 -08:00
James R. Barlow
f3e04cce56
Update pipeline.svg
2016-01-15 14:56:16 -08:00
James R. Barlow
7067110308
Add safety check to prevent merge from running when not sensible
2016-01-15 14:54:45 -08:00
James R. Barlow
599d889703
Implement "perfect reconstruction" - transfer page and watermark OCR layer
...
Works, does not account for changes to clean/deskew, etc.
Surprisingly, it works. PyPDF2 fixes since last attempt?
2016-01-15 14:39:12 -08:00
James R. Barlow
2fa8366632
Merge branch 'feature/test-pageinfo-cleanup' into develop
2016-01-15 14:18:01 -08:00
James R. Barlow
c368c51bad
New hocrtransform test
2016-01-15 14:14:08 -08:00
James R. Barlow
7c558b3713
Move pageinfo test into tests folder
2016-01-11 17:40:44 -08:00
James R. Barlow
8d323ae510
Merge branch 'feature/pagesegmode' into develop
2016-01-11 17:23:00 -08:00
James R. Barlow
3b53e9adac
Use tesseract cache for -psm
2016-01-11 17:22:50 -08:00
James R. Barlow
074c1d71b4
Activate --tesseract-pagesegmode
2016-01-11 17:19:32 -08:00
James R. Barlow
1fca9a004d
Adjust command line parameters
...
Was splitting each argument to --tesseract-config into a list of
single-character strings.
2016-01-11 16:57:19 -08:00
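One common way this class of bug arises in Python is treating a string as a sequence, for example by calling `list.extend` with a string instead of `list.append`. The sketch below illustrates that general mechanism with plain `argparse`. Only the option name `--tesseract-config` comes from the commit. The actual cause in the project is not stated, so the helper names and logic are assumptions.

```python
import argparse


def parse_configs(argv):
    """Collect repeated --tesseract-config values into a list of strings."""
    parser = argparse.ArgumentParser(prog="example")
    parser.add_argument("--tesseract-config", action="append", default=None)
    args = parser.parse_args(argv)
    return args.tesseract_config or []


def collect_buggy(configs):
    """Hypothetical buggy version: extend() iterates each string by character."""
    result = []
    for cfg in configs:
        result.extend(cfg)
    return result


def collect_fixed(configs):
    """Hypothetical fix: append() keeps each argument as one string."""
    result = []
    for cfg in configs:
        result.append(cfg)
    return result


configs = parse_configs(["--tesseract-config", "hocr",
                         "--tesseract-config", "pdf"])
```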
James R. Barlow
b485a1ef78
Override ruffus' handling of --jobs
...
Ruffus treats an omitted parameter as -j1. For our purposes it makes more
sense for omitting the parameter to mean "use all CPUs". As such we
must be able to distinguish -j1 from the parameter -j being omitted.
Telling ruffus to ignore the argument actually just makes it not
auto-generate the argument. We can add an argument back with the same name.
2016-01-09 19:07:48 -08:00
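The distinction drawn in the commit above, between an explicit -j1 and an omitted -j, can be sketched with plain `argparse` by using `None` as the default and resolving it afterwards. This is an illustration under stated assumptions. It does not use ruffus, and the function names are hypothetical.

```python
import argparse
import os


def build_parser():
    """Parser with a --jobs option whose default of None means 'not given'."""
    parser = argparse.ArgumentParser(prog="example")
    parser.add_argument("-j", "--jobs", type=int, default=None, metavar="N",
                        help="number of parallel jobs (default: all CPUs)")
    return parser


def resolve_jobs(argv):
    """Return the job count, falling back to the CPU count when -j is omitted."""
    args = build_parser().parse_args(argv)
    if args.jobs is None:
        # os.cpu_count() may return None on some platforms, so fall back to 1.
        return os.cpu_count() or 1
    return args.jobs
```

With this arrangement, `resolve_jobs(["-j1"])` yields a single job, while `resolve_jobs([])` uses every CPU the platform reports.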