James R. Barlow
74d2a156c4
Update cache
2024-01-07 01:35:05 -08:00
James R. Barlow
445617a1a5
Rebuild cache for hocr default case
2023-12-03 15:16:18 -08:00
James R. Barlow
68bb38d0ad
pdf_to_hocr: improve plugin handling
2023-10-24 00:52:31 -07:00
James R. Barlow
146da79c00
Regenerate test cache
2023-09-21 00:24:55 -07:00
James R. Barlow
a5efc4af9b
unpaper: replace input pnm with png
...
Unpaper or its underlying libraries don't seem to accept PNMs with an
odd integer width, although it's not clear whether this is the issue at all.
In any case, keeping the image a PNG works around the issue. unpaper
only accepted PNM input in the past, which is why we send it PNM.
Since it now accepts PNG, we might as well use PNG.
Unpaper can write PNG as output too, but this added a few seconds to
the test suite and was not committed.
Related issues:
https://github.com/ocrmypdf/OCRmyPDF/issues/887
https://github.com/ocrmypdf/OCRmyPDF/issues/665
https://github.com/unpaper/unpaper/issues/82
2022-07-03 15:32:16 -07:00
James R. Barlow
ee21bf9ef6
Update cache
2021-12-13 20:45:30 -08:00
James R. Barlow
4c1ff1086c
tess cache: don't include full platform - could be sensitive
2021-12-06 15:38:26 -08:00
James R. Barlow
f91faf9795
Add new argument --tesseract-thresholding to control tesseract thresholding where available
...
Also add missing test for --tesseract-oem
2021-12-06 15:38:14 -08:00
James R. Barlow
036afc4d88
Update cache, apparently related to the previous commit
2021-11-12 23:57:50 -08:00
James R. Barlow
a55ab05d16
Replace leptonica deskew with tesseract find skew and pillow rotate
...
Also rebuild the cache.
2021-11-12 16:35:08 -08:00
James R. Barlow
aa10a70d70
Rebuild test cache due to hocr output change
2021-08-01 01:00:05 -07:00
James R. Barlow
390fdf8c05
Package OCR in Form XObject
...
Should improve results in some situations where the initial content
stream is messy or not well-formed.
2021-01-31 19:27:25 -08:00
James R. Barlow
06ab114aa8
Update test cache
2020-06-22 16:31:34 -07:00
James R. Barlow
991db17fde
Remove Ghostscript-based text extraction
...
While faster than Python-based methods, we've outgrown the limited
amount of information Ghostscript provides with this feature, and it
repeats an analysis we have to do anyway to learn what images are
present.
2020-04-26 04:02:07 -07:00
James R. Barlow
5e2a7f8a56
tests: speed up several slow tests
2019-12-09 16:17:57 -08:00
James R. Barlow
5f00e4f9d8
Sort imports
2019-07-27 04:51:52 -07:00
James R. Barlow
eb5200d26a
Change most tests to use ocrmypdf API instead of subprocess
...
The main benefit is that code coverage measurement can actually follow the calls.
Also removes most ugly os.environ hacks.
2019-06-03 01:45:27 -07:00
James R. Barlow
4340ad9f12
Update test cache
2019-05-17 01:45:06 -07:00
James R. Barlow
58e6663806
Update test cache for french->german change
2019-03-03 03:23:59 -08:00
James R. Barlow
80bd7de580
Generate test cache
2018-12-30 01:02:37 -08:00
James R. Barlow
d4cbef9457
Update test cache with naming rule change
2018-06-29 12:04:20 -07:00
James R. Barlow
b81daf71d1
Regenerate test cache
2018-06-23 02:02:58 -07:00
James R. Barlow
3254315127
Update test cache
2018-05-11 12:19:50 -07:00
James R. Barlow
ba0535e3fb
Update test cache to account for unpaper --layout none change
2018-04-12 00:48:21 -07:00
James R. Barlow
49fa7f6b5c
tesseract_cache: don't reveal host system file paths in manifest file
2018-04-12 00:47:28 -07:00
James R. Barlow
ca51514046
Add test cache
2018-03-24 23:50:41 -07:00