James R. Barlow
05aa43c856
Require pdfminer
2018-10-29 12:45:15 -07:00
James R. Barlow
de80fb6bc8
Fix some failing tests after --redo-ocr changes
2018-10-29 11:49:38 -07:00
James R. Barlow
8e396f4be2
Document --redo-ocr more accurately
2018-10-29 02:03:58 -07:00
James R. Barlow
efec6da377
Fix error on serializing bad character markers
...
(Since they held a reference to their font, which in turn had an
open file handle.)
2018-10-29 02:02:00 -07:00
James R. Barlow
00ef53195e
Fix corrupt Unicode mapping detection's false positives
2018-10-29 01:30:19 -07:00
James R. Barlow
f564aaf485
Remove only_ocr_text
2018-10-28 22:41:18 -07:00
James R. Barlow
5ac2d31d0d
Redo OCR can now handle visible and invisible text, so adjust accordingly
...
Still can't filter out corrupt text
2018-10-28 14:06:25 -07:00
James R. Barlow
fda890ab47
pdfinfo: further layout improvements
...
Rather than grouping visible/invisible in a custom analysis step,
use pdfminer's analysis and iterate.
Make iteration predicate and return more generic.
2018-10-28 14:05:50 -07:00
James R. Barlow
e6d64be890
pdfinfo: formatting
2018-10-27 23:22:44 -07:00
James R. Barlow
0e4d978d20
pdfinfo: all -> not any
2018-10-27 23:22:28 -07:00
James R. Barlow
b12c2cfedf
Fix handling of Type3 fonts with no ToUnicode mapping
2018-10-27 01:24:48 -07:00
James R. Barlow
58cc70725e
Reorganize around getting bboxes for visible/invisible text
2018-10-26 01:07:02 -07:00
James R. Barlow
339afb02aa
--redo-ocr now works in the presence of printable text
2018-10-25 16:53:47 -07:00
James R. Barlow
7ba0ff5c36
Fix strip invisible text bug: missing BT operator
2018-10-25 16:52:23 -07:00
James R. Barlow
ff41fbf673
Add pdfminer based layout analysis
2018-10-25 12:42:35 -07:00
James R. Barlow
2435cd23ce
Move pdfinfo into a package
2018-10-25 00:37:38 -07:00
James R. Barlow
a063cff720
Rename/expose strip_invisible_text
2018-10-24 21:53:24 -07:00
James R. Barlow
0d396e1ac0
option check: Remove always-True condition
...
Both renderers are now lossless reconstruction-capable. (Have
been since 7.0)
2018-10-22 22:13:59 -07:00
James R. Barlow
d11c428407
Redo OCR: disallow in cases that will damage the output PDF
2018-10-20 01:14:33 -07:00
James R. Barlow
6182b1f53e
Merge branch 'feature/remove-vectors' into feature/redo-ocr
2018-10-20 01:13:24 -07:00
James R. Barlow
16af753206
Add functional "redo OCR" feature
...
Needs argument validation and some other changes. Needs testing
with mixed-content PDFs.
Only really works for pure invisible text at the moment.
2018-10-19 00:02:19 -07:00
James R. Barlow
fa48205bb8
Add feature to remove vector graphics objects
2018-10-18 21:46:08 -07:00
James R. Barlow
f7dbf94071
pipeline: if vector graphic objects exist, ensure the DPI is reasonable
2018-10-18 01:23:31 -07:00
James R. Barlow
b18e66e2ca
pdfinfo: learn to detect vector graphic objects
2018-10-18 01:21:51 -07:00
James R. Barlow
7a5504dfa5
pdfinfo: fix terminology (operands, command) -> (operands, operator)
2018-10-18 01:18:30 -07:00
James R. Barlow
d1cad7bc68
Merge branch 'master' of github.com:jbarlow83/OCRmyPDF
2018-10-16 01:28:17 -07:00
Elliott Sales de Andrade
c58d5c097c
Add Fedora install instructions. (#304)
...
* Add Fedora install instructions.
* Fix path to fedora_rawhide badge
2018-10-14 13:28:50 -07:00
James R. Barlow
46157ca94e
docs: some redundancies
2018-10-12 21:29:27 -07:00
jbarlow83
dd99511bcc
Fix broken badges in README
2018-10-12 21:16:08 -07:00
M.Yasoob Ullah Khalid ☺
5bc2efd3c7
Removed extra word from docs (#303)
2018-10-12 21:02:16 -07:00
James R. Barlow
1b18dbecf5
Fix filename test.txt
v7.2.1
2018-10-11 16:03:25 -07:00
James R. Barlow
9f82c0eb6e
v7.2.1 release notes
2018-10-11 15:55:01 -07:00
James R. Barlow
68bac1b177
Fix compatibility with pikepdf 0.3.5 API change
2018-10-11 15:51:34 -07:00
James R. Barlow
1495b78330
Remove cruft to support leptonica < 1.72 in test suite
2018-10-11 01:37:32 -07:00
James R. Barlow
6f777d2848
Include Debian copyright file
2018-10-10 23:55:48 -07:00
James R. Barlow
5650eba848
Cleanup MANIFEST.in, reorg requirements/*.txt, fix non-Unicode readme
2018-10-10 23:53:08 -07:00
James R. Barlow
5bc5dc93f3
v7.2.0 release notes update
v7.2.0
2018-10-05 01:27:00 -07:00
James R. Barlow
c1e18bb825
optimize: Exclude soft masks (SMasks) from optimization
...
Soft masks are only allowed to be of colorspace DeviceGray so we
shouldn't use pngquant on them. For now, avoid this exceptional
case by excluding soft masks from optimization.
2018-10-05 01:23:26 -07:00
James R. Barlow
58282ea0fb
optimize: more refactoring
...
Now properly generalized/specialized where it should be
2018-10-04 13:44:51 -07:00
James R. Barlow
891da7834c
optimize: refactor image extraction
2018-10-04 12:34:22 -07:00
James R. Barlow
5c229d48d5
optimize: Reorganize so JBIG2 can be performed on images reduced to 1bpp
...
Closes #297
2018-10-04 11:53:11 -07:00
James R. Barlow
53f660cf35
Travis: use newer macos image
2018-10-04 08:59:40 -07:00
James R. Barlow
7b66ca68f2
...and document lossy JBIG2
2018-10-04 01:31:53 -07:00
James R. Barlow
ba71c3ffbd
requirements: request pikepdf 0.3.4
2018-10-04 01:22:03 -07:00
James R. Barlow
6707ad427a
v7.2.0 release notes
2018-10-04 01:21:17 -07:00
James R. Barlow
5b84549716
Change JBIG2 lossy mode to require --jbig2-lossy
2018-10-04 01:20:49 -07:00
James R. Barlow
c74f2ee6e8
Refactor the detailed error messages
2018-10-04 00:10:59 -07:00
James R. Barlow
b32dd9f9d3
Fix lossless JBIG2 when there are multiple JBIG2 images on a single page
2018-10-03 17:40:26 -07:00
James R. Barlow
fb8b161f6c
Fix suppression of tesseract config error messages
2018-10-03 17:39:50 -07:00
James R. Barlow
baddd6d233
Remove libtiff from Brewfile
...
For some reason, brew complains about it now.
2018-10-03 16:17:59 -07:00