James R. Barlow
9ec4aa039d
Add tesseract caching to speed up tests
2015-12-17 12:52:12 -08:00
James R. Barlow
ecebe2f24b
Let some tests use the spoofed tesseract
Where getting OCR doesn't matter
2015-12-17 11:56:09 -08:00
James R. Barlow
7313a77c2a
Implement pdf renderer side of tess spoof
2015-12-17 11:41:54 -08:00
James R. Barlow
45113676a3
Add Tesseract spoofing
2015-12-17 11:36:47 -08:00
James R. Barlow
102bd07019
Check for encrypted PDF and complain appropriately
2015-12-17 10:37:54 -08:00
James R. Barlow
9622e31da9
Use envvars in a new test case
And get rid of the messy binary replacement spoofing
2015-12-17 09:29:01 -08:00
James R. Barlow
1731ce2a44
Environment variables can now override default programs
2015-12-17 09:05:10 -08:00
James R. Barlow
276f421c44
Did a quick test of Ghostscript vs QPDF at PDF page splitting
qpdf won so hard it wasn't funny, even though it must be called once
per page to do the job. Perhaps Ghostscript interprets it as a call to
render the page?
time bash qpdf-test.fish ../tests/resources/multipage.pdf
0.07 real 0.02 user 0.03 sys
time gs -sDEVICE=pdfwrite -dSAFER -o '%06d.pdf' ../tests/resources/multipage.pdf
5.12 real 5.06 user 0.04 sys
2015-12-17 08:49:08 -08:00
James R. Barlow
133357779a
All subprocess invocations refactored out of main.py
2015-12-17 08:31:18 -08:00
James R. Barlow
5d8167b232
Move PDF validation check to qpdf.py
2015-12-17 08:28:00 -08:00
James R. Barlow
e76ae8c46c
Move more qpdf calls into qpdf.py
2015-12-17 08:24:48 -08:00
James R. Barlow
53a7c0e668
Refactor qpdf subprocess calls into module
2015-12-17 08:19:53 -08:00
James R. Barlow
4ca243e490
Merge commit '9f374461559460527e47237323e511123f31b6b0' into feature/envvars
2015-12-17 07:27:26 -08:00
jbarlow83
9f37446155
Merge pull request #34 from shemgp/master
Don't exit when qpdf repairs the file successfully but displays warning
2015-12-16 20:46:47 -08:00
Shem Pasamba
d7c7559b05
Use boolean instead of integers
2015-12-17 11:23:27 +08:00
Shem Pasamba
b2b66d1344
Don't exit when qpdf repair was successful
2015-12-17 11:20:20 +08:00
James R. Barlow
5d111a3c04
Refactor tesseract --pdfrenderer calls to tesseract.py
2015-12-16 17:48:26 -08:00
James R. Barlow
10416f847f
Migrate tesseract-hocr code to tesseract module, because modularity
2015-12-16 17:36:11 -08:00
James R. Barlow
79b3472b26
All tests passed, bump version
v3.1
2015-12-04 04:31:01 -08:00
James R. Barlow
f1b2f1ae08
Merge branch 'feature/pdfa-2' into develop
2015-12-04 04:04:08 -08:00
James R. Barlow
ee7d97ae8c
Trivial
2015-12-04 04:03:38 -08:00
James R. Barlow
7d9f473bb1
Remove eval() call by introspecting ExitCode
2015-12-04 03:34:53 -08:00
James R. Barlow
e77a5e5e75
We don't want threads. Really. Do. Not. Want.
2015-12-04 03:11:38 -08:00
James R. Barlow
6ab19af122
Comments
2015-12-04 03:09:39 -08:00
James R. Barlow
276fe49867
Better error messages for input file not found or invalid
Not as good as finding a general way to deal with ruffus exceptions, but
better than nil.
2015-12-04 03:07:53 -08:00
James R. Barlow
acb31abe86
Fix issue #20 - fails on uppercase .PDF
2015-12-04 02:14:09 -08:00
James R. Barlow
4f964a3c8a
Introduce --pdf-renderer auto
Tess 3.03 has various quality problems, like wrong DPI, that are fixed
in Tess 3.04. Idea here is to introduce an option to let OCRmyPDF
select the rendering backend based on the options and system.
However, we're not ready for tesseract as the main renderer.
Setting pdf-renderer to tesseract does not pass all test cases, mainly
the one where --tesseract-timeout is triggered, and some others.
2015-12-02 23:20:31 -08:00
James R. Barlow
df1fda7438
pageinfo: workaround PyPDF extractText limitations on hidden text
It appears that extractText() does not find all text. At a glance it
may be that Tesseract's PDF renderer generates a font and uses glyphs
that map to different Unicode code points than PyPDF expects, so it
discards the content and finds nothing. As a proxy in lieu of better
PDF parsing, assume that a "GlyphLessFont" means there is text there.
I had previously found it does not work to check for the presence of a
font on the page. Some PDF generators create a font resource entry even if
the font is never called for.
2015-12-02 23:16:36 -08:00
James R. Barlow
d6124c1787
pageinfo: improve robustness of text test for Tesseract-produced PDFs
2015-12-02 03:12:52 -08:00
James R. Barlow
80d89b5420
Set /Creator metadata to OCRmyPDF
with reference to Tess version and settings
2015-12-02 02:19:39 -08:00
James R. Barlow
74059eecf1
Choose PDF/A-2b by default instead of A-1b
2015-12-02 01:48:10 -08:00
James R. Barlow
78697341a2
pytest: don't run tests that happened to be part of pyvenv
2015-12-02 01:19:43 -08:00
James R. Barlow
cfb56dd8ff
Merge commit 'b1769cbe18e6380ddfe96b3b22e6d02cb603338b' into develop
2015-12-01 00:40:43 -08:00
jbarlow83
b1769cbe18
README: El Capitan supported now, Py3.5 supported
2015-11-26 16:31:33 -08:00
James R. Barlow
955b801e7f
Merge branch 'master' into develop
2015-09-14 00:34:21 -07:00
James R. Barlow
3cea3f1afe
Try to work around git binary file bug again
2015-09-14 00:34:16 -07:00
James R. Barlow
fd4a227ccb
Force this file to stop thinking it was modified
2015-09-13 17:53:01 -07:00
James R. Barlow
19c3097483
Update notes
2015-09-13 17:51:18 -07:00
James R. Barlow
cdd1a6d03c
Suppress failing test
2015-09-10 07:01:14 -07:00
James R. Barlow
5fb8411571
Try new PPA for libav
2015-09-10 06:01:59 -07:00
James R. Barlow
334a15b8c7
typo fix
2015-09-10 05:01:44 -07:00
James R. Barlow
6390736577
ffmpeg-dev instead?
2015-09-10 04:27:57 -07:00
James R. Barlow
d55a214516
Autoreconf?
2015-09-10 04:10:12 -07:00
James R. Barlow
0994164b9a
travis: apt-get install in wrong place
2015-09-06 01:43:47 -07:00
James R. Barlow
54ee0dd147
travis: fix typo
2015-09-06 01:39:54 -07:00
James R. Barlow
47c7990fb3
travis: build unpaper with cache
2015-09-06 01:38:01 -07:00
James R. Barlow
997e95de4d
travis: build unpaper
2015-09-06 01:29:07 -07:00
James R. Barlow
44204be256
Fix order of PPAs
2015-09-06 00:54:50 -07:00
James R. Barlow
9b1d9aa88a
travis: improve, add new PPA, etc.
2015-09-06 00:41:23 -07:00
James R. Barlow
b775762f6a
travis: doesn't like gcc-4.8, try just gcc
2015-09-06 00:23:05 -07:00