Merge branch 'docs'

2026-05-19 03:58:06 -04:00 · 2023-06-20 01:08:09 -07:00
parent e44a57aec0 d0d49ce989
commit 1ba2bce486
7 changed files with 159 additions and 113 deletions
--- a/docs/advanced.rst
+++ b/docs/advanced.rst
@@ -310,16 +310,31 @@ stable user interface. They may be imported from
        - The program was interrupted by pressing Ctrl+C.


+.. _tmpdir:
+
+Changing temporary storage location
+===================================
+
+OCRmyPDF generates many temporary files during processing.
+
+To change where temporary files are stored, change the ``TMPDIR``
+environment variable for ocrmypdf's environment. (Python's
+``tempfile.gettempdir()`` returns the root directory in which temporary
+files will be stored.) For example, one could redirect ``TMPDIR`` to a
+large RAM disk to avoid wear on HDD/SSD and potentially improve
+performance.
+
+On Windows, the ``TEMP`` environment variable is used instead.
+
 Debugging the intermediate files
 ================================

 OCRmyPDF normally saves its intermediate results to a temporary folder
 and deletes this folder when it exits, whether it succeeded or failed.

-If the ``-k`` argument is issued on the command line, OCRmyPDF will keep
-the temporary folder and print the location, whether it succeeded or
-failed (provided the Python interpreter did not crash). An example
-message is:
+If the ``--keep-temporary-files`` (``-k``) argument is issued on the
+command line, OCRmyPDF will keep the temporary folder and print the location,
+whether it succeeded or failed. An example message is:

 .. code-block:: none

--- a/docs/batch.rst
+++ b/docs/batch.rst
@@ -151,14 +151,14 @@ The watcher service is included in the OCRmyPDF Docker image. To run it:
 .. code-block:: bash

    docker run \
-        -v <path to files to convert>:/input \
-        -v <path to store results>:/output \
-        -v <path to store processed originals>:/archive \
-        -e OCR_OUTPUT_DIRECTORY_YEAR_MONTH=1 \
-        -e OCR_ON_SUCCESS_ARCHIVE=1 \
-        -e OCR_DESKEW=1 \
-        -e PYTHONUNBUFFERED=1 \
-        -it --entrypoint python3 \
+        --volume <path to files to convert>:/input \
+        --volume <path to store results>:/output \
+        --volume <path to store processed originals>:/archive \
+        --env OCR_OUTPUT_DIRECTORY_YEAR_MONTH=1 \
+        --env OCR_ON_SUCCESS_ARCHIVE=1 \
+        --env OCR_DESKEW=1 \
+        --env PYTHONUNBUFFERED=1 \
+        --interactive --tty --entrypoint python3 \
        jbarlow83/ocrmypdf \
        watcher.py

@@ -170,13 +170,13 @@ original to ``/archive``. The parameters to this image are:
    :header: "Parameter", "Description"
    :widths: 50, 50

-    "``-v <path to files to convert>:/input``", "Files placed in this location will be OCRed"
-    "``-v <path to store results>:/output``", "This is where OCRed files will be stored"
-    "``-v <path to store processed originals>:/archive``", "Archive processed originals here"
-    "``-e OCR_OUTPUT_DIRECTORY_YEAR_MONTH=1``", "Define environment variable ``OCR_OUTPUT_DIRECTORY_YEAR_MONTH=1`` to place files in the output in ``{output}/{year}/{month}/{filename}``"
-    "``-e OCR_ON_SUCCESS_ARCHIVE=1``", "Define environment variable ``OCR_ON_SUCCESS_ARCHIVE`` to move processed originals"
-    "``-e OCR_DESKEW=1``", "Define environment variable ``OCR_DESKEW``  to apply deskew to crooked input PDFs"
-    "``-e PYTHONBUFFERED=1``", "This will force ``STDOUT`` to be unbuffered and allow you to see messages in docker logs"
+    "``--volume <path to files to convert>:/input``", "Files placed in this location will be OCRed"
+    "``--volume <path to store results>:/output``", "This is where OCRed files will be stored"
+    "``--volume <path to store processed originals>:/archive``", "Archive processed originals here"
+    "``--env OCR_OUTPUT_DIRECTORY_YEAR_MONTH=1``", "Define environment variable ``OCR_OUTPUT_DIRECTORY_YEAR_MONTH=1`` to place files in the output in ``{output}/{year}/{month}/{filename}``"
+    "``--env OCR_ON_SUCCESS_ARCHIVE=1``", "Define environment variable ``OCR_ON_SUCCESS_ARCHIVE`` to move processed originals"
+    "``--env OCR_DESKEW=1``", "Define environment variable ``OCR_DESKEW`` to apply deskew to crooked input PDFs"
+    "``--env PYTHONUNBUFFERED=1``", "This will force ``STDOUT`` to be unbuffered and allow you to see messages in docker logs"

 This service relies on polling to check for changes to the filesystem. It
 may not be suitable for some environments, such as filesystems shared on a
--- a/docs/cloud.rst
+++ b/docs/cloud.rst
@@ -0,0 +1,87 @@
+.. _ocr-service:
+
+==================
+Online deployments
+==================
+
+OCRmyPDF is designed to be used as a command line tool, but it can be
+used in a web service. This document describes some considerations for
+doing so.
+
+A basic web service implementation is provided in the source code
+repository, as ``misc/webservice.py``. It is only demonstration quality
+and is not intended for production use.
+
+OCRmyPDF is not designed for use as a public web service where a
+malicious user could upload a chosen PDF. In particular, it is not
+necessarily secure against PDF malware or PDFs that cause denial of
+service. For further discussion of security, see :ref:`security`.
+
+OCRmyPDF relies on Ghostscript, and therefore, if deployed
+online one should be prepared to comply with Ghostscript's Affero GPL
+license, and any other licenses.
+
+Setting aside these concerns, a side effect of OCRmyPDF is that it may
+incidentally sanitize PDFs containing certain types of malware. It
+repairs the PDF with pikepdf/libqpdf, which could correct malformed PDF
+structures that are part of an attack. When PDF/A output is selected
+(the default), the input PDF is partially reconstructed by Ghostscript.
+When ``--force-ocr`` is used, all pages are rasterized and reconverted
+to PDF, which could remove malware in embedded images.
+
+Limiting CPU usage
+------------------
+
+OCRmyPDF will attempt to use all available CPUs and storage, so
+executing ``nice ocrmypdf`` or limiting the number of jobs with the
+``--jobs`` argument may ensure the server remains responsive. Another option
+would be to run OCRmyPDF jobs inside a Docker container, a virtual machine,
+or a cloud instance, which can impose its own limits on CPU usage and be
+terminated "from orbit" if it fails to complete.
+
+Temporary storage requirements
+------------------------------
+
+OCRmyPDF will use a large amount of temporary storage for its work,
+proportional to the total number of pixels needed to rasterize the PDF.
+The raster image of an 8.5×11" color page at 300 DPI takes 25 MB
+uncompressed; OCRmyPDF saves its intermediates as PNG, but that still
+means it requires about 9 MB per intermediate based on average
+compression ratios. Multiple intermediates per page are also required,
+depending on the command line given. A rule of thumb would be to allow
+100 MB of temporary storage per page in a file – meaning that small
+cloud servers or small VM partitions should be provisioned with plenty
+of extra space if, say, a 500-page file might be sent.
+
+To change the temporary directory, see :ref:`tmpdir`.
+
+On Amazon Web Services or other cloud vendors, consider setting your
+temporary directory to `ephemeral
+storage <https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/InstanceStorage.html>`__.
+
+Timeouts
+--------
+
+To prevent excessively long OCR jobs, consider setting
+``--tesseract-timeout`` and/or ``--skip-big`` arguments. ``--skip-big``
+is particularly helpful if your PDFs include documents such as reports
+on standard page sizes with large images attached - often large images
+are not worth OCR'ing anyway.
+
+Document management systems
+---------------------------
+
+If you are looking for a full document management system, consider
+`paperless-ngx <https://github.com/paperless-ngx/paperless-ngx>`__,
+which is a web application that uses OCRmyPDF to automatically OCR and
+archive documents.
+
+Commercial OCR alternatives
+---------------------------
+
+The author also provides professional services that include OCR and
+building databases around PDFs, and is happy to provide consultation.
+
+Abbyy Cloud OCR is a viable commercial alternative with a web services
+API. Amazon Textract, Google Cloud Vision, and Microsoft Azure
+Computer Vision provide advanced OCR but have less PDF rendering capability.
--- a/docs/index.rst
+++ b/docs/index.rst
@@ -20,7 +20,6 @@ image processing and OCR to existing PDFs.
   introduction
   release_notes
   installation
-   optimizer
   languages
   jbig2

@@ -29,9 +28,11 @@ image processing and OCR to existing PDFs.
   :maxdepth: 2

   cookbook
+   optimizer
   docker
   advanced
   batch
+   cloud
   performance
   pdfsecurity
   errors
--- a/docs/languages.rst
+++ b/docs/languages.rst
@@ -70,13 +70,13 @@ This enables these languages for all packages (e.g. including aspell).

   # Display a list of all Tesseract language packs
   equery uses app-text/tessdata_fast
-   
+
   # Add English and German language support for Tesseract only
   echo 'app-text/tessdata_fast l10n_de l10n_en' >> /etc/portage/package.use
-   
+
   # Add global English and German language support (the `l10n_` from equery has to be omitted)
   echo L10N="de en" >> /etc/portage/make.conf
-   
+
   # update system to reflect changed USE flags
   emerge --update --deep --newuse @world

@@ -101,7 +101,7 @@ derived Docker image as
 Windows users
 =============

-The Tesseract installer provided by Chocolatey currently includes only English language. 
-To install other languages, download the respective language pack (``.traineddata`` file) 
-from https://github.com/tesseract-ocr/tessdata/ and place it in 
+The Tesseract installer provided by Chocolatey currently includes only the English language.
+To install other languages, download the respective language pack (``.traineddata`` file)
+from https://github.com/tesseract-ocr/tessdata/ and place it in
 ``C:\\Program Files\\Tesseract-OCR\\tessdata`` (or wherever Tesseract OCR is installed).
--- a/docs/optimizer.rst
+++ b/docs/optimizer.rst
@@ -13,14 +13,33 @@ tuned. Optimization occurs after OCR, and only if OCR succeeded.  It does not
 perform other possible optimizations such as deduplicating resources,
 consolidating fonts, simplifying vector drawings, or anything of that nature.

-Optimization ranges from ``-O0`` through ``-O3``, where ``0`` disables
-optimization and ``3`` implements all options. ``1``, the default, performs only
-safe and lossless optimizations. (This is similar to GCC's optimization
-parameter.) The exact type of optimizations performed will vary over time.
+.. list-table:: Optimization levels
+   :widths: 33 6 60
+   :header-rows: 1

-PDF optimization requires third-party, optional tools for certain optimizations.
-If these are not installed or cannot be found by OCRmyPDF, optimization will not
-be as good.
+   * - Optimization level
+     - Shorthand
+     - Description
+   * - ``--optimize 0``
+     - ``-O0``
+     - Disable most optimizations.
+   * - ``--optimize 1`` (default)
+     - ``-O1``
+     - Safe and lossless optimizations.
+   * - ``--optimize 2``
+     - ``-O2``
+     - Safe and lossy optimizations.
+   * - ``--optimize 3``
+     - ``-O3``
+     - Aggressive lossy optimizations.
+
+The exact type of optimizations performed will vary over time, and depend on
+the availability of third-party tools.
+
+Despite optimizations, OCRmyPDF might still increase the overall file size,
+since it must embed information about the recognized text, and depending on the
+settings chosen, may not be able to represent the output file as compactly as
+the input file.

 Optimizations that always occurs
 ================================
@@ -37,12 +56,14 @@ Fast web view
 OCRmyPDF automatically optimizes PDFs for "fast web view" in Adobe Acrobat's
 parlance, or equivalently, linearizes PDFs so that the resources they reference
 are presented in the order a viewer needs them for sequential display. This
-reduces the latency of viewing a PDF both online and from local storage. This
-actually slightly increases the file size.
+reduces the latency of viewing a PDF both online and from local storage, in
+exchange for a slight increase in file size.

 To disable this optimization and all others, use ``ocrmypdf --optimize 0 ...``
 or the shorthand ``-O0``.

+Adobe Acrobat might not report the file as being "fast web view".
+
 Lossless optimizations
 ======================

--- a/docs/pdfsecurity.rst
+++ b/docs/pdfsecurity.rst
@@ -54,84 +54,6 @@ into the existing PDF or it may essentially reconstruct ("re-fry") a
 visually identical PDF that may be quite different at the binary level.
 That said, OCRmyPDF is not a tool designed for sanitizing PDFs.

-.. _ocr-service:
-
-Using OCRmyPDF online or as a service
-=====================================
-
-OCRmyPDF is not designed for use as a public web service where a
-malicious user could upload a chosen PDF. In particular, it is not
-necessarily secure against PDF malware or PDFs that cause denial of
-service. OCRmyPDF relies on Ghostscript, and therefore, if deployed
-online one should be prepared to comply with Ghostscript's Affero GPL
-license, and any other licenses.
-
-Setting aside these concerns, a side effect of OCRmyPDF is that it may
-incidentally sanitize PDFs containing certain types of malware. It
-repairs the PDF with pikepdf/libqpdf, which could correct malformed PDF
-structures that are part of an attack. When PDF/A output is selected
-(the default), the input PDF is partially reconstructed by Ghostscript.
-When ``--force-ocr`` is used, all pages are rasterized and reconverted
-to PDF, which could remove malware in embedded images.
-
-OCRmyPDF should be relatively safe to use in a trusted intranet, with
-some considerations:
-
-Limiting CPU usage
------------------
-
-OCRmyPDF will attempt to use all available CPUs and storage, so
-executing ``nice ocrmypdf`` or limiting the number of jobs with the
-``-j`` argument may ensure the server remains available. Another option
-would be to run OCRmyPDF jobs inside a Docker container, a virtual machine,
-or a cloud instance, which can impose its own limits on CPU usage and be
-terminated "from orbit" if it fails to complete.
-
-Temporary storage requirements
------------------------------
-
-OCRmyPDF will use a large amount of temporary storage for its work,
-proportional to the total number of pixels needed to rasterize the PDF.
-The raster image of a 8.5×11" color page at 300 DPI takes 25 MB
-uncompressed; OCRmyPDF saves its intermediates as PNG, but that still
-means it requires about 9 MB per intermediate based on average
-compression ratios. Multiple intermediates per page are also required,
-depending on the command line given. A rule of thumb would be to allow
-100 MB of temporary storage per page in a file – meaning that a small
-cloud servers or small VM partitions should be provisioned with plenty
-of extra space, if say, a 500 page file might be sent.
-
-To check temporary storage usage on actual files, run
-``ocrmypdf -k ...`` which will preserve and print the path to temporary
-storage when the job is done.
-
-To change where temporary files are stored, change the ``TMPDIR``
-environment variable for ocrmypdf's environment. (Python's
-``tempfile.gettempdir()`` returns the root directory in which temporary
-files will be stored.) For example, one could redirect ``TMPDIR`` to a
-large RAM disk to avoid wear on HDD/SSD and potentially improve
-performance. On Amazon Web Services, ``TMPDIR`` can be set to `empheral
-storage <https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/InstanceStorage.html>`__.
-
-Timeouts
--------
-
-To prevent excessively long OCR jobs consider setting
-``--tesseract-timeout`` and/or ``--skip-big`` arguments. ``--skip-big``
-is particularly helpful if your PDFs include documents such as reports
-on standard page sizes with large images attached - often large images
-are not worth OCR'ing anyway.
-
-Commercial alternatives
-----------------------
-
-The author also provides professional services that include OCR and
-building databases around PDFs, and is happy to provide consultation.
-
-Abbyy Cloud OCR is viable commercial alternative with a web services
-API. Amazon Textract, Google Cloud Vision, and Microsoft Azure
-Computer Vision provide advanced OCR but have less PDF rendering capability.
-
 Password protection, digital signatures and certification
 =========================================================