mirror of
https://github.com/ocrmypdf/OCRmyPDF.git
synced 2026-05-19 03:58:06 -04:00
Merge branch 'docs'
@@ -310,16 +310,31 @@ stable user interface. They may be imported from
- The program was interrupted by pressing Ctrl+C.


.. _tmpdir:

Changing temporary storage location
===================================

OCRmyPDF generates many temporary files during processing.

To change where temporary files are stored, change the ``TMPDIR``
environment variable for ocrmypdf's environment. (Python's
``tempfile.gettempdir()`` returns the root directory in which temporary
files will be stored.) For example, one could redirect ``TMPDIR`` to a
large RAM disk to avoid wear on HDD/SSD and potentially improve
performance.

On Windows, the ``TEMP`` environment variable is used instead.
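
As a rough illustration of the paragraph above, the following sketch checks
which temporary root a child process would pick up when ``TMPDIR`` is
overridden. The helper names ``env_with_tmpdir`` and ``child_temp_root`` are
hypothetical and not part of OCRmyPDF; a child Python interpreter stands in
for ``ocrmypdf`` so the example runs without it installed.

```python
import os
import subprocess
import sys
import tempfile


def env_with_tmpdir(tmpdir: str) -> dict:
    """Return a copy of the current environment with TMPDIR set to tmpdir."""
    env = dict(os.environ)
    env["TMPDIR"] = tmpdir
    return env


def child_temp_root(tmpdir: str) -> str:
    """Ask a child Python process which temporary root it would use."""
    result = subprocess.run(
        [sys.executable, "-c", "import tempfile; print(tempfile.gettempdir())"],
        env=env_with_tmpdir(tmpdir),  # hypothetical helper defined above
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


if __name__ == "__main__":
    # A throwaway directory stands in for a RAM disk mount point.
    with tempfile.TemporaryDirectory() as scratch:
        print(child_temp_root(scratch))
```

The same ``env`` dictionary could be passed to ``subprocess.run`` when
launching ``ocrmypdf`` itself, assuming the chosen directory exists and is
writable.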

Debugging the intermediate files
================================

OCRmyPDF normally saves its intermediate results to a temporary folder
and deletes this folder when it exits, whether it succeeded or failed.

If the ``-k`` argument is issued on the command line, OCRmyPDF will keep
the temporary folder and print the location, whether it succeeded or
failed (provided the Python interpreter did not crash). An example
message is:
If the ``--keep-temporary-files`` (``-k``) argument is issued on the
command line, OCRmyPDF will keep the temporary folder and print the location,
whether it succeeded or failed. An example message is:

.. code-block:: none

@@ -151,14 +151,14 @@ The watcher service is included in the OCRmyPDF Docker image. To run it:
.. code-block:: bash

    docker run \
        -v <path to files to convert>:/input \
        -v <path to store results>:/output \
        -v <path to store processed originals>:/archive \
        -e OCR_OUTPUT_DIRECTORY_YEAR_MONTH=1 \
        -e OCR_ON_SUCCESS_ARCHIVE=1 \
        -e OCR_DESKEW=1 \
        -e PYTHONUNBUFFERED=1 \
        -it --entrypoint python3 \
        --volume <path to files to convert>:/input \
        --volume <path to store results>:/output \
        --volume <path to store processed originals>:/archive \
        --env OCR_OUTPUT_DIRECTORY_YEAR_MONTH=1 \
        --env OCR_ON_SUCCESS_ARCHIVE=1 \
        --env OCR_DESKEW=1 \
        --env PYTHONUNBUFFERED=1 \
        --interactive --tty --entrypoint python3 \
        jbarlow83/ocrmypdf \
        watcher.py

@@ -170,13 +170,13 @@ original to ``/archive``. The parameters to this image are:
    :header: "Parameter", "Description"
    :widths: 50, 50

    "``-v <path to files to convert>:/input``", "Files placed in this location will be OCRed"
    "``-v <path to store results>:/output``", "This is where OCRed files will be stored"
    "``-v <path to store processed originals>:/archive``", "Archive processed originals here"
    "``-e OCR_OUTPUT_DIRECTORY_YEAR_MONTH=1``", "Define environment variable ``OCR_OUTPUT_DIRECTORY_YEAR_MONTH=1`` to place files in the output in ``{output}/{year}/{month}/{filename}``"
    "``-e OCR_ON_SUCCESS_ARCHIVE=1``", "Define environment variable ``OCR_ON_SUCCESS_ARCHIVE`` to move processed originals"
    "``-e OCR_DESKEW=1``", "Define environment variable ``OCR_DESKEW`` to apply deskew to crooked input PDFs"
    "``-e PYTHONUNBUFFERED=1``", "This will force ``STDOUT`` to be unbuffered and allow you to see messages in docker logs"
    "``--volume <path to files to convert>:/input``", "Files placed in this location will be OCRed"
    "``--volume <path to store results>:/output``", "This is where OCRed files will be stored"
    "``--volume <path to store processed originals>:/archive``", "Archive processed originals here"
    "``--env OCR_OUTPUT_DIRECTORY_YEAR_MONTH=1``", "Define environment variable ``OCR_OUTPUT_DIRECTORY_YEAR_MONTH=1`` to place files in the output in ``{output}/{year}/{month}/{filename}``"
    "``--env OCR_ON_SUCCESS_ARCHIVE=1``", "Define environment variable ``OCR_ON_SUCCESS_ARCHIVE`` to move processed originals"
    "``--env OCR_DESKEW=1``", "Define environment variable ``OCR_DESKEW`` to apply deskew to crooked input PDFs"
    "``--env PYTHONUNBUFFERED=1``", "This will force ``STDOUT`` to be unbuffered and allow you to see messages in docker logs"

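To make the ``{output}/{year}/{month}/{filename}`` layout from the table
concrete, here is a minimal sketch of how such a path could be assembled. The
function ``dated_output_path`` is hypothetical, and the four-digit year and
zero-padded month are assumptions; the actual formatting used by
``watcher.py`` may differ.

```python
from datetime import date
from pathlib import Path


def dated_output_path(output_dir: str, filename: str, when: date) -> Path:
    """Build {output}/{year}/{month}/{filename} for the given date.

    Assumption: four-digit year and zero-padded month.
    """
    return Path(output_dir) / f"{when:%Y}" / f"{when:%m}" / filename


if __name__ == "__main__":
    print(dated_output_path("/output", "scan.pdf", date.today()))
```
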
This service relies on polling to check for changes to the filesystem. It
may not be suitable for some environments, such as filesystems shared on a

87
docs/cloud.rst
Normal file
@@ -0,0 +1,87 @@
.. _ocr-service:

==================
Online deployments
==================

OCRmyPDF is designed to be used as a command line tool, but it can also be
used in a web service. This document describes some considerations for
doing so.

A basic web service implementation is provided in the source code
repository, as ``misc/webservice.py``. It is only demonstration quality
and is not intended for production use.

OCRmyPDF is not designed for use as a public web service where a
malicious user could upload a chosen PDF. In particular, it is not
necessarily secure against PDF malware or PDFs that cause denial of
service. For further discussion of security, see :ref:`security`.

OCRmyPDF relies on Ghostscript, and therefore, if deployed
online, one should be prepared to comply with Ghostscript's Affero GPL
license, and any other licenses.

Setting aside these concerns, a side effect of OCRmyPDF is that it may
incidentally sanitize PDFs containing certain types of malware. It
repairs the PDF with pikepdf/libqpdf, which could correct malformed PDF
structures that are part of an attack. When PDF/A output is selected
(the default), the input PDF is partially reconstructed by Ghostscript.
When ``--force-ocr`` is used, all pages are rasterized and reconverted
to PDF, which could remove malware in embedded images.

Limiting CPU usage
------------------

OCRmyPDF will attempt to use all available CPUs and storage, so
executing ``nice ocrmypdf`` or limiting the number of jobs with the
``--jobs`` argument may ensure the server remains responsive. Another option
would be to run OCRmyPDF jobs inside a Docker container, a virtual machine,
or a cloud instance, which can impose its own limits on CPU usage and be
terminated "from orbit" if it fails to complete.
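
One way a service might apply these limits is sketched below. The helpers
``build_ocr_command`` and ``run_with_deadline`` are hypothetical names, the
job count and deadline are illustrative values, and ``nice`` is assumed to be
available (POSIX systems only). The wall-clock deadline is a lightweight
stand-in for terminating a container or VM.

```python
import subprocess


def build_ocr_command(input_pdf, output_pdf, jobs=2, use_nice=True):
    """Build an argv list that caps parallelism and lowers scheduling priority."""
    argv = ["ocrmypdf", "--jobs", str(jobs), input_pdf, output_pdf]
    if use_nice:
        argv = ["nice"] + argv  # assumption: a POSIX `nice` binary is on PATH
    return argv


def run_with_deadline(argv, deadline_seconds):
    """Run argv; return its exit code, or None if the deadline was exceeded.

    subprocess.run() kills the child process when the timeout expires.
    """
    try:
        return subprocess.run(argv, timeout=deadline_seconds).returncode
    except subprocess.TimeoutExpired:
        return None


if __name__ == "__main__":
    # Print the command rather than run it, so the sketch works without ocrmypdf.
    print(build_ocr_command("input.pdf", "output.pdf", jobs=2))
```
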

Temporary storage requirements
------------------------------

OCRmyPDF will use a large amount of temporary storage for its work,
proportional to the total number of pixels needed to rasterize the PDF.
The raster image of an 8.5×11" color page at 300 DPI takes 25 MB
uncompressed; OCRmyPDF saves its intermediates as PNG, but that still
means it requires about 9 MB per intermediate based on average
compression ratios. Multiple intermediates per page are also required,
depending on the command line given. A rule of thumb would be to allow
100 MB of temporary storage per page in a file, meaning that small
cloud servers or small VM partitions should be provisioned with plenty
of extra space if, say, a 500 page file might be sent.
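
The arithmetic behind these figures can be checked with a short sketch. The
function names are hypothetical; three bytes per pixel (8-bit RGB) is an
assumption, and the 100 MB per page default is simply the rule of thumb stated
above.

```python
def uncompressed_page_bytes(width_in, height_in, dpi, bytes_per_pixel=3):
    """Size of an uncompressed raster page, assuming 8-bit RGB by default."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px * bytes_per_pixel


def temp_storage_estimate_mb(pages, mb_per_page=100):
    """Rule-of-thumb temporary storage for a file with the given page count."""
    return pages * mb_per_page


if __name__ == "__main__":
    letter = uncompressed_page_bytes(8.5, 11, 300)
    print(f"{letter / 1e6:.1f} MB per uncompressed page")
    print(f"{temp_storage_estimate_mb(500) / 1000:.0f} GB for a 500 page file")
```

A letter-size page works out to 2550 × 3300 pixels, or roughly 25 MB at three
bytes per pixel, and the rule of thumb gives about 50 GB of temporary storage
for a 500 page file.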

To change the temporary directory, see :ref:`tmpdir`.

On Amazon Web Services or other cloud vendors, consider setting your
temporary directory to `ephemeral
storage <https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/InstanceStorage.html>`__.

Timeouts
--------

To prevent excessively long OCR jobs, consider setting the
``--tesseract-timeout`` and/or ``--skip-big`` arguments. ``--skip-big``
is particularly helpful if your PDFs include documents such as reports
on standard page sizes with large images attached; often large images
are not worth OCR'ing anyway.
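
As a sketch of how a service could assemble these arguments, the hypothetical
helper below returns the extra command line options to append to an
``ocrmypdf`` invocation. It assumes ``--tesseract-timeout`` is given in
seconds and ``--skip-big`` in megapixels; the values in the usage line are
illustrative, not recommendations.

```python
def timeout_options(tesseract_timeout_s=None, skip_big_megapixels=None):
    """Return extra ocrmypdf arguments that bound per-page OCR work."""
    options = []
    if tesseract_timeout_s is not None:
        # Assumption: the value is interpreted as seconds per page.
        options += ["--tesseract-timeout", str(tesseract_timeout_s)]
    if skip_big_megapixels is not None:
        # Assumption: pages larger than this many megapixels skip OCR.
        options += ["--skip-big", str(skip_big_megapixels)]
    return options


if __name__ == "__main__":
    argv = ["ocrmypdf"] + timeout_options(180, 50) + ["input.pdf", "output.pdf"]
    print(" ".join(argv))
```
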

Document management systems
---------------------------

If you are looking for a full document management system, consider
`paperless-ngx <https://github.com/paperless-ngx/paperless-ngx>`__,
which is a web application that uses OCRmyPDF to automatically OCR and
archive documents.

Commercial OCR alternatives
---------------------------

The author also provides professional services that include OCR and
building databases around PDFs, and is happy to provide consultation.

Abbyy Cloud OCR is a viable commercial alternative with a web services
API. Amazon Textract, Google Cloud Vision, and Microsoft Azure
Computer Vision provide advanced OCR but have less PDF rendering capability.
@@ -20,7 +20,6 @@ image processing and OCR to existing PDFs.
   introduction
   release_notes
   installation
   optimizer
   languages
   jbig2

@@ -29,9 +28,11 @@ image processing and OCR to existing PDFs.
   :maxdepth: 2

   cookbook
   optimizer
   docker
   advanced
   batch
   cloud
   performance
   pdfsecurity
   errors

@@ -70,13 +70,13 @@ This enables these languages for all packages (e.g. including aspell).

    # Display a list of all Tesseract language packs
    equery uses app-text/tessdata_fast

    # Add English and German language support for Tesseract only
    echo 'app-text/tessdata_fast l10n_de l10n_en' >> /etc/portage/package.use

    # Add global English and German language support (the `l10n_` from equery has to be omitted)
    echo L10N="de en" >> /etc/portage/make.conf

    # update system to reflect changed USE flags
    emerge --update --deep --newuse @world

@@ -101,7 +101,7 @@ derived Docker image as
Windows users
=============

The Tesseract installer provided by Chocolatey currently includes only the English language.
To install other languages, download the respective language pack (``.traineddata`` file)
from https://github.com/tesseract-ocr/tessdata/ and place it in
``C:\Program Files\Tesseract-OCR\tessdata`` (or wherever Tesseract OCR is installed).

@@ -13,14 +13,33 @@ tuned. Optimization occurs after OCR, and only if OCR succeeded. It does not
perform other possible optimizations such as deduplicating resources,
consolidating fonts, simplifying vector drawings, or anything of that nature.

Optimization ranges from ``-O0`` through ``-O3``, where ``0`` disables
optimization and ``3`` implements all options. ``1``, the default, performs only
safe and lossless optimizations. (This is similar to GCC's optimization
parameter.) The exact type of optimizations performed will vary over time.
.. list-table:: Title
    :widths: 33 6 60
    :header-rows: 1

PDF optimization requires third-party, optional tools for certain optimizations.
If these are not installed or cannot be found by OCRmyPDF, optimization will not
be as good.
    * - Optimization level
      - Shorthand
      - Description
    * - ``--optimize 0``
      - ``-O0``
      - Disable most optimizations.
    * - ``--optimize 1`` (default)
      - ``-O1``
      - Safe and lossless optimizations.
    * - ``--optimize 2``
      - ``-O2``
      - Safe and lossy optimizations.
    * - ``--optimize 3``
      - ``-O3``
      - Aggressive lossy optimizations.

The exact type of optimizations performed will vary over time, and depend on
the availability of third-party tools.

Despite optimizations, OCRmyPDF might still increase the overall file size,
since it must embed information about the recognized text, and depending on the
settings chosen, may not be able to represent the output file as compactly as
the input file.

Optimizations that always occur
================================

@@ -37,12 +56,14 @@ Fast web view
OCRmyPDF automatically optimizes PDFs for "fast web view" in Adobe Acrobat's
parlance, or equivalently, linearizes PDFs so that the resources they reference
are presented in the order a viewer needs them for sequential display. This
reduces the latency of viewing a PDF both online and from local storage. This
actually slightly increases the file size.
reduces the latency of viewing a PDF both online and from local storage, in
exchange for a slight increase in file size.

To disable this optimization and all others, use ``ocrmypdf --optimize 0 ...``
or the shorthand ``-O0``.

Adobe Acrobat might not report the file as being "fast web view".

Lossless optimizations
======================

@@ -54,84 +54,6 @@ into the existing PDF or it may essentially reconstruct ("re-fry") a
visually identical PDF that may be quite different at the binary level.
That said, OCRmyPDF is not a tool designed for sanitizing PDFs.

.. _ocr-service:

Using OCRmyPDF online or as a service
=====================================

OCRmyPDF is not designed for use as a public web service where a
malicious user could upload a chosen PDF. In particular, it is not
necessarily secure against PDF malware or PDFs that cause denial of
service. OCRmyPDF relies on Ghostscript, and therefore, if deployed
online one should be prepared to comply with Ghostscript's Affero GPL
license, and any other licenses.

Setting aside these concerns, a side effect of OCRmyPDF is that it may
incidentally sanitize PDFs containing certain types of malware. It
repairs the PDF with pikepdf/libqpdf, which could correct malformed PDF
structures that are part of an attack. When PDF/A output is selected
(the default), the input PDF is partially reconstructed by Ghostscript.
When ``--force-ocr`` is used, all pages are rasterized and reconverted
to PDF, which could remove malware in embedded images.

OCRmyPDF should be relatively safe to use in a trusted intranet, with
some considerations:

Limiting CPU usage
------------------

OCRmyPDF will attempt to use all available CPUs and storage, so
executing ``nice ocrmypdf`` or limiting the number of jobs with the
``-j`` argument may ensure the server remains available. Another option
would be to run OCRmyPDF jobs inside a Docker container, a virtual machine,
or a cloud instance, which can impose its own limits on CPU usage and be
terminated "from orbit" if it fails to complete.

Temporary storage requirements
------------------------------

OCRmyPDF will use a large amount of temporary storage for its work,
proportional to the total number of pixels needed to rasterize the PDF.
The raster image of an 8.5×11" color page at 300 DPI takes 25 MB
uncompressed; OCRmyPDF saves its intermediates as PNG, but that still
means it requires about 9 MB per intermediate based on average
compression ratios. Multiple intermediates per page are also required,
depending on the command line given. A rule of thumb would be to allow
100 MB of temporary storage per page in a file, meaning that small
cloud servers or small VM partitions should be provisioned with plenty
of extra space if, say, a 500 page file might be sent.

To check temporary storage usage on actual files, run
``ocrmypdf -k ...`` which will preserve and print the path to temporary
storage when the job is done.

To change where temporary files are stored, change the ``TMPDIR``
environment variable for ocrmypdf's environment. (Python's
``tempfile.gettempdir()`` returns the root directory in which temporary
files will be stored.) For example, one could redirect ``TMPDIR`` to a
large RAM disk to avoid wear on HDD/SSD and potentially improve
performance. On Amazon Web Services, ``TMPDIR`` can be set to `ephemeral
storage <https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/InstanceStorage.html>`__.

Timeouts
--------

To prevent excessively long OCR jobs consider setting
``--tesseract-timeout`` and/or ``--skip-big`` arguments. ``--skip-big``
is particularly helpful if your PDFs include documents such as reports
on standard page sizes with large images attached - often large images
are not worth OCR'ing anyway.

Commercial alternatives
-----------------------

The author also provides professional services that include OCR and
building databases around PDFs, and is happy to provide consultation.

Abbyy Cloud OCR is a viable commercial alternative with a web services
API. Amazon Textract, Google Cloud Vision, and Microsoft Azure
Computer Vision provide advanced OCR but have less PDF rendering capability.

Password protection, digital signatures and certification
=========================================================
