Commit Graph

26 Commits

Author SHA1 Message Date
James R. Barlow
74d2a156c4 Update cache 2024-01-07 01:35:05 -08:00
James R. Barlow
445617a1a5 Rebuild cache for hocr default case 2023-12-03 15:16:18 -08:00
James R. Barlow
68bb38d0ad pdf_to_hocr: improve plugin handling 2023-10-24 00:52:31 -07:00
James R. Barlow
146da79c00 Regenerate test cache 2023-09-21 00:24:55 -07:00
James R. Barlow
a5efc4af9b unpaper: replace input pnm with png
Unpaper or its underlying libraries don't seem to accept PNMs with an
odd integer width, although it's not clear if this is the issue at all.

In any case, keeping the image a PNG works around the issue. unpaper
only accepted PNM input in the past, which is why we send it PNM.
Since it now accepts PNG, we might as well use PNG.

Unpaper can write PNG as output too, but this added a few seconds to
the test suite, so it was not committed.

Related issues:

https://github.com/ocrmypdf/OCRmyPDF/issues/887

https://github.com/ocrmypdf/OCRmyPDF/issues/665

https://github.com/unpaper/unpaper/issues/82
2022-07-03 15:32:16 -07:00
James R. Barlow
ee21bf9ef6 Update cache 2021-12-13 20:45:30 -08:00
James R. Barlow
4c1ff1086c tess cache: don't include full platform - could be sensitive 2021-12-06 15:38:26 -08:00
James R. Barlow
f91faf9795 Add new argument --tesseract-thresholding to control tesseract thresholding where available
Also add missing test for --tesseract-oem
2021-12-06 15:38:14 -08:00
James R. Barlow
036afc4d88 Update cache, apparently related to previous commit 2021-11-12 23:57:50 -08:00
James R. Barlow
a55ab05d16 Replace leptonica deskew with tesseract find skew and pillow rotate
Also rebuild the cache.
2021-11-12 16:35:08 -08:00
James R. Barlow
aa10a70d70 Rebuild test cache due to hocr output change 2021-08-01 01:00:05 -07:00
James R. Barlow
390fdf8c05 Package OCR in Form XObject
Should improve results in some situations where the initial content
stream is messy or not well-formed.
2021-01-31 19:27:25 -08:00
James R. Barlow
06ab114aa8 Update test cache 2020-06-22 16:31:34 -07:00
James R. Barlow
991db17fde Remove Ghostscript-based text extraction
While faster than Python based methods, we've outgrown the limited
amount of information Ghostscript provides with this feature, and it
repeats an analysis we have to do anyway to learn what images are
present.
2020-04-26 04:02:07 -07:00
James R. Barlow
5e2a7f8a56 tests: speed up several slow tests 2019-12-09 16:17:57 -08:00
James R. Barlow
5f00e4f9d8 Sort imports 2019-07-27 04:51:52 -07:00
James R. Barlow
eb5200d26a Change most tests to use ocrmypdf API instead of subprocess
The main benefit of this is code coverage gains can actually follow it.
Also removes most ugly os.environ hacks.
2019-06-03 01:45:27 -07:00
James R. Barlow
4340ad9f12 Update test cache 2019-05-17 01:45:06 -07:00
James R. Barlow
58e6663806 Update test cache for french->german change 2019-03-03 03:23:59 -08:00
James R. Barlow
80bd7de580 Generate test cache 2018-12-30 01:02:37 -08:00
James R. Barlow
d4cbef9457 Update test cache with naming rule change 2018-06-29 12:04:20 -07:00
James R. Barlow
b81daf71d1 Regenerate test cache 2018-06-23 02:02:58 -07:00
James R. Barlow
3254315127 Update test cache 2018-05-11 12:19:50 -07:00
James R. Barlow
ba0535e3fb Update test cache to account for unpaper --layout none change 2018-04-12 00:48:21 -07:00
James R. Barlow
49fa7f6b5c tesseract_cache: don't reveal host system file paths in manifest file 2018-04-12 00:47:28 -07:00
James R. Barlow
ca51514046 Add test cache 2018-03-24 23:50:41 -07:00