mirror of
https://github.com/ocrmypdf/OCRmyPDF.git
synced 2026-02-22 01:24:11 -05:00
92 lines
3.8 KiB
ReStructuredText
92 lines
3.8 KiB
ReStructuredText
.. SPDX-FileCopyrightText: 2023 James R. Barlow
|
||
..
|
||
.. SPDX-License-Identifier: CC-BY-SA-4.0
|
||
|
||
|
||
.. _ocr-service:
|
||
|
||
==================
|
||
Online deployments
|
||
==================
|
||
|
||
OCRmyPDF is designed to be used as a command line tool, but it can be
|
||
used in a web service. This document describes some considerations for
|
||
doing so.
|
||
|
||
A basic web service implementation is provided in the source code
|
||
repository, as ``misc/webservice.py``. It is only demonstration quality
|
||
and is not intended for production use.
|
||
|
||
OCRmyPDF is not designed for use as a public web service where a
|
||
malicious user could upload a chosen PDF. In particular, it is not
|
||
necessarily secure against PDF malware or PDFs that cause denial of
|
||
service. For further discussino of security, see :ref:`security`.
|
||
|
||
OCRmyPDF relies on Ghostscript, and therefore, if deployed
|
||
online one should be prepared to comply with Ghostscript's Affero GPL
|
||
license, and any other licenses.
|
||
|
||
Setting aside these concerns, a side effect of OCRmyPDF is that it may
|
||
incidentally sanitize PDFs containing certain types of malware. It
|
||
repairs the PDF with pikepdf/libqpdf, which could correct malformed PDF
|
||
structures that are part of an attack. When PDF/A output is selected
|
||
(the default), the input PDF is partially reconstructed by Ghostscript.
|
||
When ``--force-ocr`` is used, all pages are rasterized and reconverted
|
||
to PDF, which could remove malware in embedded images.
|
||
|
||
Limiting CPU usage
|
||
------------------
|
||
|
||
OCRmyPDF will attempt to use all available CPUs and storage, so
|
||
executing ``nice ocrmypdf`` or limiting the number of jobs with the
|
||
``--jobs`` argument may ensure the server remains responsive. Another option
|
||
would be to run OCRmyPDF jobs inside a Docker container, a virtual machine,
|
||
or a cloud instance, which can impose its own limits on CPU usage and be
|
||
terminated "from orbit" if it fails to complete.
|
||
|
||
Temporary storage requirements
|
||
------------------------------
|
||
|
||
OCRmyPDF will use a large amount of temporary storage for its work,
|
||
proportional to the total number of pixels needed to rasterize the PDF.
|
||
The raster image of a 8.5×11" color page at 300 DPI takes 25 MB
|
||
uncompressed; OCRmyPDF saves its intermediates as PNG, but that still
|
||
means it requires about 9 MB per intermediate based on average
|
||
compression ratios. Multiple intermediates per page are also required,
|
||
depending on the command line given. A rule of thumb would be to allow
|
||
100 MB of temporary storage per page in a file – meaning that a small
|
||
cloud servers or small VM partitions should be provisioned with plenty
|
||
of extra space, if say, a 500 page file might be sent.
|
||
|
||
To change the temporary directory, see :ref:`tmpdir`.
|
||
|
||
On Amazon Web Services or other cloud vendors, consider setting your
|
||
temporary directory to `empheral
|
||
storage <https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/InstanceStorage.html>`__.
|
||
|
||
Timeouts
|
||
--------
|
||
|
||
To prevent excessively long OCR jobs consider setting
|
||
``--tesseract-timeout`` and/or ``--skip-big`` arguments. ``--skip-big``
|
||
is particularly helpful if your PDFs include documents such as reports
|
||
on standard page sizes with large images attached - often large images
|
||
are not worth OCR'ing anyway.
|
||
|
||
Document management systems
|
||
---------------------------
|
||
|
||
If you are looking for a full document management system, consider
|
||
`paperless-ngx <https://github.com/paperless-ngx/paperless-ngx>`__,
|
||
which is a web application that uses OCRmyPDF to automatically OCR and
|
||
archive documents.
|
||
|
||
Commercial OCR alternatives
|
||
---------------------------
|
||
|
||
The author also provides professional services that include OCR and
|
||
building databases around PDFs, and is happy to provide consultation.
|
||
|
||
Abbyy Cloud OCR is viable commercial alternative with a web services
|
||
API. Amazon Textract, Google Cloud Vision, and Microsoft Azure
|
||
Computer Vision provide advanced OCR but have less PDF rendering capability. |