Return a distinct error code if PDF/A fails

2026-05-19 03:58:06 -04:00 · 2018-07-03 16:59:03 -07:00
parent 47885f4230
commit e44001641c
4 changed files with 25 additions and 20 deletions
--- a/docs/advanced.rst
+++ b/docs/advanced.rst
@@ -9,7 +9,7 @@ OCRmyPDF provides many features to control the behavior of the OCR engine, Tesse
 When OCR is skipped
 """""""""""""""""""

-If a page in a PDF seems to have text, by default OCRmyPDF will exit without modifying the PDF. This is to ensure that PDFs that were previously OCRed or were "born digital" rather than scanned are not processed. 
+If a page in a PDF seems to have text, by default OCRmyPDF will exit without modifying the PDF. This is to ensure that PDFs that were previously OCRed or were "born digital" rather than scanned are not processed.

 If ``--skip-text`` is issued, then no OCR will be performed on pages that already have text. The page will be copied to the output. This may be useful for documents that contain both "born digital" and scanned content, or to use OCRmyPDF to normalize and convert to PDF/A regardless of their contents.

@@ -19,7 +19,7 @@ If ``--force-ocr`` is issued, then all pages will be rasterized to images, disca
 Time and image size limits
 """"""""""""""""""""""""""

-By default, OCRmyPDF permits tesseract to run for three minutes (180 seconds) per page. This is usually more than enough time to find all text on a reasonably sized page with modern hardware. 
+By default, OCRmyPDF permits tesseract to run for three minutes (180 seconds) per page. This is usually more than enough time to find all text on a reasonably sized page with modern hardware.

 If a page is skipped, it will be inserted without OCR. If preprocessing was requested, the preprocessed image layer will be inserted.

@@ -33,7 +33,7 @@ If you want to adjust the amount of time spent on OCR, change ``--tesseract-time
 Overriding default tesseract
 """"""""""""""""""""""""""""

-OCRmyPDF checks the system ``PATH`` for the ``tesseract`` binary.  
+OCRmyPDF checks the system ``PATH`` for the ``tesseract`` binary.

 Some relevant environment variables that influence Tesseract's behavior include:

@@ -140,8 +140,8 @@ Return code policy
 OCRmyPDF writes all messages to ``stderr``.  ``stdout`` is reserved for piping
 output files.  ``stdin`` is reserved for piping input files.

-The return codes generated by the OCRmyPDF are considered part of the stable 
-user interface.
+The return codes generated by OCRmyPDF are considered part of the stable
+user interface.  They may be imported from ``ocrmypdf.exceptions``.

 .. list-table:: Return codes
    :widths: 5 35 60
@@ -151,39 +151,41 @@ user interface.
        - Name
        - Interpretation
    *	- 0
-        - ``ocrmypdf.exceptions.ExitCode.ok``
+        - ``ExitCode.ok``
        - Everything worked as expected.
    *	- 1
-        - ``ocrmypdf.exceptions.ExitCode.bad_args``
+        - ``ExitCode.bad_args``
        - Invalid arguments, exited with an error.
    *	- 2
-        - ``ocrmypdf.exceptions.ExitCode.input_file``
+        - ``ExitCode.input_file``
        - The input file does not seem to be a valid PDF.
    *	- 3
-        - ``ocrmypdf.exceptions.missing_dependency``
+        - ``ExitCode.missing_dependency``
        - An external program required by OCRmyPDF is missing.
    *	- 4
-        - ``ocrmypdf.exceptions.invalid_output_pdf``
-        - An output file was created, but it does not seem to be a valid PDF or
-          PDF/A. The file will be available.
+        - ``ExitCode.invalid_output_pdf``
+        - An output file was created, but it does not seem to be a valid PDF. The file will be available.
    *	- 5
-        - ``ocrmypdf.exceptions.file_access_error``
+        - ``ExitCode.file_access_error``
        - The user running OCRmyPDF does not have sufficient permissions to read the input file and write the output file.
    *	- 6
-        - ``ocrmypdf.exceptions.already_done_ocr``
+        - ``ExitCode.already_done_ocr``
        - The file already appears to contain text so it may not need OCR. See output message.
    *	- 7
-        - ``ocrmypdf.exceptions.child_process_error``
+        - ``ExitCode.child_process_error``
        - An error occurred in an external program (child process) and OCRmyPDF cannot continue.
    *	- 8
-        - ``ocrmypdf.exceptions.encrypted_pdf``
+        - ``ExitCode.encrypted_pdf``
        - The input PDF is encrypted. OCRmyPDF does not read encrypted PDFs. Use another program such as ``qpdf`` to remove encryption.
    *	- 9
-        - ``ocrmypdf.exceptions.invalid_config``
+        - ``ExitCode.invalid_config``
        - A custom configuration file was forwarded to Tesseract using ``--tesseract-config``, and Tesseract rejected this file.
+    *   - 10
+        - ``ExitCode.pdfa_conversion_failed``
+        - A valid PDF was created, but PDF/A conversion failed. The file will be available.
    *	- 15
-        - ``ocrmypdf.exceptions.other_error``
+        - ``ExitCode.other_error``
        - Some other error occurred.
    *	- 130
-        - ``ocrmypdf.exceptions.ctrl_c``
+        - ``ExitCode.ctrl_c``
        - The program was interrupted by pressing Ctrl+C.
--- a/docs/release_notes.rst
+++ b/docs/release_notes.rst
@@ -49,6 +49,8 @@ v7

 -   ``--pdf-renderer auto`` option and the diagnostics used to select a PDF renderer now work better with old versions, but may make different decisions than past versions.

+-   If everything succeeds but PDF/A conversion fails, a distinct return code is now returned (``ExitCode.pdfa_conversion_failed (10)``) where this situation previously returned ``ExitCode.invalid_output_pdf (4)``. The latter is now returned only if there is some indication that the output file is invalid.
+
 -   Notes for downstream packagers

    +   There is also a new dependency on ``python-xmp-toolkit`` and ``libexempi3``
--- a/src/ocrmypdf/main.py
+++ b/src/ocrmypdf/main.py
@@ -936,7 +936,7 @@ def run_pipeline():
            else:
                msg = 'Output file is okay but is not PDF/A (seems to be {})'
                _log.warning(msg.format(pdfa_info['conformance']))
-                return ExitCode.invalid_output_pdf
+                return ExitCode.pdfa_conversion_failed
        if not qpdf.check(options.output_file, _log):
            _log.warning('Output file: The generated PDF is INVALID')
            return ExitCode.invalid_output_pdf
--- a/src/ocrmypdf/exceptions.py
+++ b/src/ocrmypdf/exceptions.py
@@ -29,6 +29,7 @@ class ExitCode(IntEnum):
    child_process_error = 7
    encrypted_pdf = 8
    invalid_config = 9
+    pdfa_conversion_failed = 10
    other_error = 15
    ctrl_c = 130
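
Note for callers: the sketch below shows one way a wrapper script might tell the new return code 10 apart from return code 4 after this change. It is an illustration, not part of the commit. The `ExitCode` class here is a local mirror of three values from the diff, so the snippet runs without OCRmyPDF installed; real code would import `ExitCode` from `ocrmypdf.exceptions` instead. The helper names `output_is_valid_pdf` and `run_ocr` are hypothetical.

```python
import subprocess
from enum import IntEnum


class ExitCode(IntEnum):
    """Local mirror of values shown in this diff (assumption for the sketch).

    Real code would use: from ocrmypdf.exceptions import ExitCode
    """
    ok = 0
    invalid_output_pdf = 4
    pdfa_conversion_failed = 10


def output_is_valid_pdf(returncode: int) -> bool:
    """Return True if the return code indicates a valid output PDF.

    Code 10 means a valid PDF was written but it is not PDF/A;
    code 4 means the output file appears to be invalid.
    """
    return returncode in (ExitCode.ok, ExitCode.pdfa_conversion_failed)


def run_ocr(input_pdf: str, output_pdf: str) -> bool:
    """Hypothetical wrapper: run the ocrmypdf CLI and report on the output."""
    proc = subprocess.run(["ocrmypdf", input_pdf, output_pdf])
    if proc.returncode == ExitCode.pdfa_conversion_failed:
        print("Output is a valid PDF, but PDF/A conversion failed.")
    return output_is_valid_pdf(proc.returncode)
```

A caller that strictly requires PDF/A would treat code 10 as a failure, while one that only needs a searchable PDF could accept it.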