mirror of
https://github.com/ocrmypdf/OCRmyPDF.git
synced 2026-01-24 22:12:46 -05:00
85 lines
3.6 KiB
Markdown
85 lines
3.6 KiB
Markdown
% SPDX-FileCopyrightText: 2025 James R. Barlow
|
||
% SPDX-License-Identifier: CC-BY-SA-4.0
|
||
|
||
(ocr-service)=
|
||
|
||
# Online deployments
|
||
|
||
OCRmyPDF is designed to be used as a command line tool, but it can be
|
||
used in a web service. This document describes some considerations for
|
||
doing so.
|
||
|
||
A basic web service implementation is provided in the source code
|
||
repository, as `misc/webservice.py`. It is only demonstration quality
|
||
and is not intended for production use.
|
||
|
||
OCRmyPDF is not designed for use as a public web service where a
|
||
malicious user could upload a chosen PDF. In particular, it is not
|
||
necessarily secure against PDF malware or PDFs that cause denial of
|
||
service. For further discussino of security, see
|
||
[security](security).
|
||
|
||
OCRmyPDF relies on Ghostscript, and therefore, if deployed online one
|
||
should be prepared to comply with Ghostscript\'s Affero GPL license, and
|
||
any other licenses.
|
||
|
||
Setting aside these concerns, a side effect of OCRmyPDF is that it may
|
||
incidentally sanitize PDFs containing certain types of malware. It
|
||
repairs the PDF with pikepdf/libqpdf, which could correct malformed PDF
|
||
structures that are part of an attack. When PDF/A output is selected
|
||
(the default), the input PDF is partially reconstructed by Ghostscript.
|
||
When `--force-ocr` is used, all pages are rasterized and reconverted to
|
||
PDF, which could remove malware in embedded images.
|
||
|
||
## Limiting CPU usage
|
||
|
||
OCRmyPDF will attempt to use all available CPUs and storage, so
|
||
executing `nice ocrmypdf` or limiting the number of jobs with the
|
||
`--jobs` argument may ensure the server remains responsive. Another
|
||
option would be to run OCRmyPDF jobs inside a Docker container, a
|
||
virtual machine, or a cloud instance, which can impose its own limits on
|
||
CPU usage and be terminated \"from orbit\" if it fails to complete.
|
||
|
||
## Temporary storage requirements
|
||
|
||
OCRmyPDF will use a large amount of temporary storage for its work,
|
||
proportional to the total number of pixels needed to rasterize the PDF.
|
||
The raster image of a 8.5×11\" color page at 300 DPI takes 25 MB
|
||
uncompressed; OCRmyPDF saves its intermediates as PNG, but that still
|
||
means it requires about 9 MB per intermediate based on average
|
||
compression ratios. Multiple intermediates per page are also required,
|
||
depending on the command line given. A rule of thumb would be to allow
|
||
100 MB of temporary storage per page in a file -- meaning that a small
|
||
cloud servers or small VM partitions should be provisioned with plenty
|
||
of extra space, if say, a 500 page file might be sent.
|
||
|
||
To change the temporary directory, see [tmpdir](#tmpdir).
|
||
|
||
On Amazon Web Services or other cloud vendors, consider setting your
|
||
temporary directory to [empheral
|
||
storage](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/InstanceStorage.html).
|
||
|
||
## Timeouts
|
||
|
||
To prevent excessively long OCR jobs consider setting
|
||
`--tesseract-timeout` and/or `--skip-big` arguments. `--skip-big` is
|
||
particularly helpful if your PDFs include documents such as reports on
|
||
standard page sizes with large images attached - often large images are
|
||
not worth OCR\'ing anyway.
|
||
|
||
## Document management systems
|
||
|
||
If you are looking for a full document management system, consider
|
||
[paperless-ngx](https://github.com/paperless-ngx/paperless-ngx), which
|
||
is a web application that uses OCRmyPDF to automatically OCR and archive
|
||
documents.
|
||
|
||
## Commercial OCR alternatives
|
||
|
||
The author also provides professional services that include OCR and
|
||
building databases around PDFs, and is happy to provide consultation.
|
||
|
||
Abbyy Cloud OCR is viable commercial alternative with a web services
|
||
API. Amazon Textract, Google Cloud Vision, and Microsoft Azure Computer
|
||
Vision provide advanced OCR but have less PDF rendering capability.
|