Commit Graph

508 Commits

Author SHA1 Message Date
James R. Barlow
9ec4aa039d Add tesseract caching to speed up tests 2015-12-17 12:52:12 -08:00
James R. Barlow
ecebe2f24b Let some tests use the spoofed tesseract
Where getting OCR doesn't matter
2015-12-17 11:56:09 -08:00
James R. Barlow
7313a77c2a Implement pdf renderer side of tess spoof 2015-12-17 11:41:54 -08:00
James R. Barlow
45113676a3 Add Tesseract spoofing 2015-12-17 11:36:47 -08:00
James R. Barlow
102bd07019 Check for encrypted PDF and complain appropriately 2015-12-17 10:37:54 -08:00
James R. Barlow
9622e31da9 Use envvars in a new test case
And get rid of the messy binary replacement spoofing
2015-12-17 09:29:01 -08:00
James R. Barlow
1731ce2a44 Environment variables can now override default programs 2015-12-17 09:05:10 -08:00
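
A minimal sketch of how an environment variable override for an external program could look. The `OCRMYPDF_<NAME>` naming scheme and the `get_program` helper are assumptions made for illustration; the commit message does not say which variable names are used.

```python
import os


def get_program(name, env=None):
    """Return the executable to invoke for `name`.

    Hypothetical convention: an environment variable named
    OCRMYPDF_<NAME> overrides the default, which is the bare program
    name resolved through PATH.
    """
    env = os.environ if env is None else env
    return env.get('OCRMYPDF_' + name.upper(), name)
```

A test case can then pass a custom mapping as `env` to point at a stand-in executable, without replacing any binaries on disk.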
James R. Barlow
276f421c44 Did a quick test of Ghostscript vs QPDF at PDF page splitting
qpdf won so hard it wasn't funny, even though it must be called once
per page to do the job. Perhaps Ghostscript interprets it as a call to
render the page?

time bash qpdf-test.fish ../tests/resources/multipage.pdf
        0.07 real         0.02 user         0.03 sys

time gs -sDEVICE=pdfwrite -dSAFER -o '%06d.pdf' ../tests/resources/multipage.pdf
        5.12 real         5.06 user         0.04 sys
2015-12-17 08:49:08 -08:00
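
The once-per-page qpdf approach described above can be sketched as follows. The contents of qpdf-test.fish are not shown in the log, so this is an assumption about its shape: it only builds the argument lists, using qpdf's `--pages` option, and leaves running them to the caller. The function name and the output pattern are hypothetical.

```python
def qpdf_split_commands(input_pdf, num_pages, pattern='{:06d}.pdf'):
    """Build one qpdf command per page of `input_pdf`.

    Each command copies a single page into its own output file:
        qpdf INPUT --pages INPUT N -- OUTPUT
    Pages are numbered from 1, as qpdf expects.
    """
    return [
        ['qpdf', input_pdf, '--pages', input_pdf, str(n), '--',
         pattern.format(n)]
        for n in range(1, num_pages + 1)
    ]
```

Each list can be passed to `subprocess.check_call`. The page count has to be known beforehand, for example from `qpdf --show-npages`.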
James R. Barlow
133357779a All subprocess invocations refactored out of main.py 2015-12-17 08:31:18 -08:00
James R. Barlow
5d8167b232 Move PDF validation check to qpdf.py 2015-12-17 08:28:00 -08:00
James R. Barlow
e76ae8c46c Move more qpdf calls into qpdf.py 2015-12-17 08:24:48 -08:00
James R. Barlow
53a7c0e668 Refactor qpdf subprocess calls into module 2015-12-17 08:19:53 -08:00
James R. Barlow
4ca243e490 Merge commit '9f374461559460527e47237323e511123f31b6b0' into feature/envvars 2015-12-17 07:27:26 -08:00
jbarlow83
9f37446155 Merge pull request #34 from shemgp/master
Don't exit when qpdf repairs the file successfully but displays warning
2015-12-16 20:46:47 -08:00
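
One way to tell a successful repair with warnings apart from a real failure is qpdf's exit status: 0 means success, 2 means errors, and 3 means warnings without errors. Whether the pull request relies on exit codes is an assumption; the helper below is a hypothetical sketch, not the merged code.

```python
QPDF_OK = 0        # no errors, no warnings
QPDF_ERROR = 2     # errors were found
QPDF_WARNING = 3   # warnings only; the output file was still written


def qpdf_repair_succeeded(returncode):
    """Return True when qpdf wrote usable output, with or without warnings."""
    return returncode in (QPDF_OK, QPDF_WARNING)
```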
Shem Pasamba
d7c7559b05 Use boolean instead of integers 2015-12-17 11:23:27 +08:00
Shem Pasamba
b2b66d1344 Don't exit when qpdf repair was successful 2015-12-17 11:20:20 +08:00
James R. Barlow
5d111a3c04 Refactor tesseract --pdfrenderer calls to tesseract.py 2015-12-16 17:48:26 -08:00
James R. Barlow
10416f847f Migrate tesseract-hocr code to tesseract module, because modularity 2015-12-16 17:36:11 -08:00
James R. Barlow
79b3472b26 All tests passed, bump version v3.1 2015-12-04 04:31:01 -08:00
James R. Barlow
f1b2f1ae08 Merge branch 'feature/pdfa-2' into develop 2015-12-04 04:04:08 -08:00
James R. Barlow
ee7d97ae8c Trivial 2015-12-04 04:03:38 -08:00
James R. Barlow
7d9f473bb1 Remove eval() call by introspecting ExitCode 2015-12-04 03:34:53 -08:00
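
A sketch of the general technique of looking up a name on a known type instead of calling eval(). The real `ExitCode` definition is not shown in the log, so the members, their values, and the `exit_code_from_name` helper are hypothetical.

```python
from enum import IntEnum


class ExitCode(IntEnum):
    # Hypothetical members and values, for illustration only.
    ok = 0
    bad_args = 1
    input_file = 2
    other_error = 15


def exit_code_from_name(name, default=ExitCode.other_error):
    """Map a string to an ExitCode member without evaluating it as code."""
    try:
        return ExitCode[name]
    except KeyError:
        # Unknown or hostile input falls back to a generic error code.
        return default
```

Unlike eval(), the lookup can only ever return a member of `ExitCode`, so arbitrary strings cannot trigger code execution.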
James R. Barlow
e77a5e5e75 We don't want threads. Really. Do. Not. Want. 2015-12-04 03:11:38 -08:00
James R. Barlow
6ab19af122 Comments 2015-12-04 03:09:39 -08:00
James R. Barlow
276fe49867 Better error messages for input file not found or invalid
Not as good as finding a general way to deal with ruffus exceptions, but
better than nil.
2015-12-04 03:07:53 -08:00
James R. Barlow
acb31abe86 Fix issue #20 - fails on uppercase .PDF 2015-12-04 02:14:09 -08:00
James R. Barlow
4f964a3c8a Introduce --pdf-renderer auto
Tess 3.03 has various quality problems like wrong DPI that are fixed
in Tess 3.04. Idea here is to introduce an option to let OCRmyPDF
select the rendering backend based on the options and system.

However, we're not ready for tesseract as the main renderer.
Setting pdf-renderer to tesseract does not pass all test cases, mainly
the one where --tesseract-timeout is triggered, and some others.
2015-12-02 23:20:31 -08:00
James R. Barlow
df1fda7438 pageinfo: workaround PyPDF extractText limitations on hidden text
It appears that extractText() does not find all text. At a glance it
may be that Tesseract's PDF renderer generates a font and uses glyphs
that map to different Unicode code points than PyPDF expects, so it
discards the content and finds nothing. As a proxy in lieu of better
PDF parsing, assume that a "GlyphLessFont" means there is text there.

I had previously found it does not work to check for the presence of a
font on page. Some PDF generators create a font resource entry even if
the font is never called for.
2015-12-02 23:16:36 -08:00
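
The proxy described above might look roughly like this. The function name and its inputs are hypothetical: it assumes the caller has already collected the page's font names and whatever text extractText() returned, and it does not touch the PyPDF API itself.

```python
def page_probably_has_text(font_names, extracted_text=''):
    """Guess whether a PDF page already carries a text layer.

    Extracted text is trusted when present. Otherwise a font whose name
    contains "GlyphLessFont" (the font Tesseract's PDF renderer emits)
    is treated as evidence of hidden text. Other fonts are ignored,
    because a font resource entry may exist without ever being used.
    """
    if extracted_text.strip():
        return True
    return any('GlyphLessFont' in name for name in font_names)
```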
James R. Barlow
d6124c1787 pageinfo: improve robustness of text test for Tesseract produced PDFs 2015-12-02 03:12:52 -08:00
James R. Barlow
80d89b5420 Set /Creator metadata to OCRmyPDF
with reference to Tess version and settings
2015-12-02 02:19:39 -08:00
James R. Barlow
74059eecf1 Choose PDF/A-2b by default instead of A-1b 2015-12-02 01:48:10 -08:00
James R. Barlow
78697341a2 pytest: don't run tests that happened to be part of pyvenv 2015-12-02 01:19:43 -08:00
James R. Barlow
cfb56dd8ff Merge commit 'b1769cbe18e6380ddfe96b3b22e6d02cb603338b' into develop 2015-12-01 00:40:43 -08:00
jbarlow83
b1769cbe18 README: El Capitan supported now, Py3.5 supported 2015-11-26 16:31:33 -08:00
James R. Barlow
955b801e7f Merge branch 'master' into develop 2015-09-14 00:34:21 -07:00
James R. Barlow
3cea3f1afe Try to work around git binary file bug again 2015-09-14 00:34:16 -07:00
James R. Barlow
fd4a227ccb Force this file to stop thinking it was modified 2015-09-13 17:53:01 -07:00
James R. Barlow
19c3097483 Update notes 2015-09-13 17:51:18 -07:00
James R. Barlow
cdd1a6d03c Suppress failing test 2015-09-10 07:01:14 -07:00
James R. Barlow
5fb8411571 Try new PPA for libav 2015-09-10 06:01:59 -07:00
James R. Barlow
334a15b8c7 typo fix 2015-09-10 05:01:44 -07:00
James R. Barlow
6390736577 ffmpeg-dev instead? 2015-09-10 04:27:57 -07:00
James R. Barlow
d55a214516 Autoreconf? 2015-09-10 04:10:12 -07:00
James R. Barlow
0994164b9a travis: apt-get install in wrong place 2015-09-06 01:43:47 -07:00
James R. Barlow
54ee0dd147 travis: fix typo 2015-09-06 01:39:54 -07:00
James R. Barlow
47c7990fb3 travis: build unpaper with cache 2015-09-06 01:38:01 -07:00
James R. Barlow
997e95de4d travis: build unpaper 2015-09-06 01:29:07 -07:00
James R. Barlow
44204be256 Fix order of PPAs 2015-09-06 00:54:50 -07:00
James R. Barlow
9b1d9aa88a travis: improve, add new PPA, etc. 2015-09-06 00:41:23 -07:00
James R. Barlow
b775762f6a travis: doesn't like gcc-4.8, try just gcc 2015-09-06 00:23:05 -07:00