Documentation for OCR quality features

James R. Barlow
2018-11-10 15:48:41 -08:00
parent 701ef1df3f
commit 700abbb8a5
3 changed files with 10 additions and 3 deletions

.gitignore

@@ -32,6 +32,7 @@ htmlcov/
*.profile
/*.pdf
/*.qdf
/*.png
/scratch.py
IDEAS
log/


@@ -29,7 +29,13 @@ v7.3.0
- OCRmyPDF now warns when a PDF contains Adobe AcroForms, since such files probably do not need OCR. It can still work with these files.
-- Added a new feature ``--mask-barcodes`` to detect and suppress barcodes in files. We have observed that barcodes can interfere with OCR.
+- Added three new **experimental** features. The name, syntax and behavior of these arguments are subject to change. They may also be incompatible with some other features.
+- ``--remove-vectors`` which strips out vector graphics. This can improve OCR quality since OCR will not search artwork for readable text; however, it currently removes "text as curves" as well.
+- ``--mask-barcodes`` to detect and suppress barcodes in files. We have observed that barcodes can interfere with OCR.
+- ``--threshold`` which uses a more sophisticated thresholding algorithm than is currently in use in Tesseract OCR. This works around a `known issue in Tesseract <https://github.com/tesseract-ocr/tesseract/issues/1990>`_ with text on bright backgrounds.
- Fixed an issue where an error message was not reported when the installed Ghostscript was very old.
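
As a usage illustration for the experimental flags listed above, here is a minimal sketch that assembles an ``ocrmypdf`` command line in Python. The helper name ``build_ocrmypdf_command`` and the file names are hypothetical and not part of OCRmyPDF. The flag names come from these release notes, and since the notes warn that the flags may change or conflict with other features, treat this as an assumption-laden example and not a recommended invocation.

```python
import shlex


def build_ocrmypdf_command(input_pdf, output_pdf, *, remove_vectors=False,
                           mask_barcodes=False, threshold=False):
    """Hypothetical helper: build an argv list for the ocrmypdf CLI."""
    cmd = ["ocrmypdf"]
    # Each experimental feature is a plain on/off switch.
    if remove_vectors:
        cmd.append("--remove-vectors")
    if mask_barcodes:
        cmd.append("--mask-barcodes")
    if threshold:
        cmd.append("--threshold")
    # Positional arguments: input file first, then output file.
    cmd += [input_pdf, output_pdf]
    return cmd


command = build_ocrmypdf_command(
    "input.pdf", "output.pdf", mask_barcodes=True, threshold=True
)
# Show the command as it would be typed in a shell.
print(shlex.join(command))
```

The resulting list could be passed to ``subprocess.run`` on a system where OCRmyPDF v7.3.0 or later is installed.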


@@ -250,12 +250,12 @@ preprocessing.add_argument(
"will not be included in OCR. This can eliminate false characters.")
preprocessing.add_argument(
'--mask-barcodes', action='store_true',
-help="Mask out any barcodes that appear in the PDF so they are not "
+help="EXPERIMENTAL. Mask out any barcodes that appear in the PDF so they are not "
"considered during OCR. Barcodes can introduce false characters into "
"OCR.")
preprocessing.add_argument(
'--threshold', action='store_true',
-help="Threshold image to 1bpp before sending it to Tesseract for OCR. Can "
+help="EXPERIMENTAL. Threshold image to 1bpp before sending it to Tesseract for OCR. Can "
"improve OCR quality compared to Tesseract's thresholder.")
ocrsettings = parser.add_argument_group(
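
To show how the two ``store_true`` flags touched by this hunk behave once parsed, here is a small standalone sketch. It is not OCRmyPDF's actual parser: the program name ``demo`` and the group title are assumptions made for illustration, and only the two option definitions mirror the diff.

```python
import argparse

# Stand-in parser; "demo" and the group title are hypothetical.
parser = argparse.ArgumentParser(prog="demo")
preprocessing = parser.add_argument_group("Preprocessing options")

preprocessing.add_argument(
    '--mask-barcodes', action='store_true',
    help="EXPERIMENTAL. Mask out any barcodes that appear in the PDF so "
         "they are not considered during OCR. Barcodes can introduce "
         "false characters into OCR.")
preprocessing.add_argument(
    '--threshold', action='store_true',
    help="EXPERIMENTAL. Threshold image to 1bpp before sending it to "
         "Tesseract for OCR. Can improve OCR quality compared to "
         "Tesseract's thresholder.")

# store_true options default to False and become True when present.
args = parser.parse_args(["--threshold"])
print(args.mask_barcodes, args.threshold)

# The EXPERIMENTAL prefix is visible to users through --help.
help_text = parser.format_help()
```

Because the change only edits help strings, parsing behavior is the same before and after the commit; the only user-visible difference is the ``EXPERIMENTAL.`` prefix in the ``--help`` output.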