Convert remaining rst -> md

James R. Barlow
2025-04-17 15:03:21 -07:00
parent 3b9367fc69
commit d1a45e4abc
18 changed files with 4538 additions and 4783 deletions

docs/advanced.md Normal file

@@ -0,0 +1,461 @@
% SPDX-FileCopyrightText: 2022 James R. Barlow
% SPDX-License-Identifier: CC-BY-SA-4.0
# Advanced features
## Control of unpaper
OCRmyPDF uses `unpaper` to provide the implementation of the
`--clean` and `--clean-final` arguments.
[unpaper](https://github.com/Flameeyes/unpaper/blob/main/doc/basic-concepts.md)
provides a variety of image processing filters to improve images.
By default, OCRmyPDF uses only `unpaper` arguments that were found to
be safe to use on almost all files without having to inspect every page
of the file afterwards. This is particularly true when only `--clean`
is used, since that instructs OCRmyPDF to only clean the image before
OCR and not the final image.
However, if you wish to use the more aggressive options in `unpaper`,
you may use `--unpaper-args '...'` to override OCRmyPDF's defaults
and forward other arguments to `unpaper`. This option forwards
arguments to `unpaper` without any knowledge of what that program
considers to be valid arguments. The string of arguments must be quoted
as shown in the examples below. No filename arguments may be included.
OCRmyPDF assumes it can append the input and output filenames of
intermediate images to the `--unpaper-args` string.
In this example, we tell `unpaper` to expect two pages of text on a
sheet (image), such as occurs when two facing pages of a book are
scanned. `unpaper` uses this information to deskew each independently
and clean up the margins of both.
```bash
ocrmypdf --clean --clean-final --unpaper-args '--layout double' input.pdf output.pdf
ocrmypdf --clean --clean-final --unpaper-args '--layout double --no-noisefilter' input.pdf output.pdf
```
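When OCRmyPDF is launched from a script rather than an interactive shell, the same rule applies: the whole `unpaper` argument string must reach OCRmyPDF as a single argument. The following Python sketch only builds such a command line and prints it; the helper name `build_clean_command` is made up for illustration and is not part of OCRmyPDF.

```python
import shlex


def build_clean_command(input_pdf, output_pdf, unpaper_args):
    """Return an argv list in which unpaper_args stays a single element."""
    return [
        "ocrmypdf",
        "--clean",
        "--clean-final",
        "--unpaper-args",
        unpaper_args,  # one argv element; no filenames inside it
        input_pdf,
        output_pdf,
    ]


command = build_clean_command("input.pdf", "output.pdf", "--layout double")
# shlex.join shows how the list would be typed in a POSIX shell
print(shlex.join(command))
```

A list built this way can be handed to `subprocess.run(command, check=True)`, which avoids a second layer of shell quoting.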
:::{warning}
Some `unpaper` features will reposition text within the image.
`--clean-final` is recommended to avoid this issue.
:::
:::{warning}
Some `unpaper` features cause multiple input or output files to be
consumed or produced. OCRmyPDF requires `unpaper` to consume one
file and produce one file; errors will result if this assumption is not
met.
:::
:::{note}
`unpaper` uses uncompressed PBM/PGM/PPM files for its intermediate
files. For large images or documents, it can take a lot of temporary
disk space.
:::
## Control of OCR options
OCRmyPDF provides many features to control the behavior of the OCR
engine, Tesseract.
### When OCR is skipped
If a page in a PDF seems to have text, by default OCRmyPDF will exit
without modifying the PDF. This is to ensure that PDFs that were
previously OCRed or were "born digital" rather than scanned are not
processed.
If `--skip-text` is issued, then no image processing or OCR will be
performed on pages that already have text. The page will be copied to
the output. This may be useful for documents that contain both "born
digital" and scanned content, or to use OCRmyPDF to normalize and
convert to PDF/A regardless of their contents.
If `--redo-ocr` is issued, then a detailed text analysis is performed.
Text is categorized as either visible or invisible. Invisible text (OCR)
is stripped out. Then an image of each page is created with visible text
masked out. The page image is sent for OCR, and any additional text is
inserted as OCR. If a file contains a mix of text and bitmap images that
contain text, OCRmyPDF will locate the additional text in images without
disrupting the existing text. Some PDF OCR solutions render text as
technically printable or visible in some way, perhaps by drawing it and
then painting over it. OCRmyPDF cannot distinguish this type of OCR
text from real text, so it will not be "redone".
If `--force-ocr` is issued, then all pages will be rasterized to
images, discarding any hidden OCR text, rasterizing any printable
text, and flattening form fields or interactive objects into their visual
representation. This is useful for redoing OCR, for fixing OCR text
with a damaged character map (text is selectable but not searchable),
and destroying redacted information.
### Time and image size limits
By default, OCRmyPDF permits tesseract to run for three minutes (180
seconds) per page. This is usually more than enough time to find all
text on a reasonably sized page with modern hardware.
If a page is skipped, it will be inserted without OCR. If preprocessing
was requested, the preprocessed image layer will be inserted.
If you want to adjust the amount of time spent on OCR, change
`--tesseract-timeout`. You can also automatically skip images that
exceed a certain number of megapixels with `--skip-big`. (A 300 DPI,
8.5×11" page image is 8.4 megapixels.)
```bash
# Allow 300 seconds for OCR; skip any page larger than 50 megapixels
ocrmypdf --tesseract-timeout 300 --skip-big 50 bigfile.pdf output.pdf
```
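The megapixel figure quoted above is plain arithmetic: page dimensions in inches times the resolution give the pixel dimensions, and their product divided by one million gives megapixels. A small Python sketch (the function name `megapixels` is only illustrative) can help when choosing a value for `--skip-big`:

```python
def megapixels(width_in, height_in, dpi):
    """Size in megapixels of a page scanned at the given resolution."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px / 1_000_000


letter_300 = megapixels(8.5, 11, 300)  # 2550 x 3300 pixels
print(f"{letter_300:.1f} megapixels")
```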
### OCR for huge images
Tesseract has internal limits on the size
of images it will process. By default,
`--tesseract-downsample-large-images` is enabled, and OCRmyPDF will
downsample images to fit Tesseract limits. (The limits are usually encountered
only for scanned images of oversized media, such as large maps or blueprints exceeding
110 cm or 43 inches in either dimension, and at high DPI.) This feature can be disabled
using `--no-tesseract-downsample-large-images`.
`--tesseract-downsample-above Npixels` adjusts the threshold at which images
will be downsampled. By default, only images that exceed any of Tesseract's
internal limits are downsampled (32767 pixels on either dimension).
You will also need to set `--tesseract-timeout` high enough to allow
for processing.
Only the image sent for OCR is downsampled. The original image is
preserved.
```bash
# Allow 600 seconds for OCR on huge images
ocrmypdf --tesseract-timeout 600 \
--tesseract-downsample-large-images \
bigfile.pdf output.pdf
# Downsample images above 5000 pixels on the longest dimension to
# 5000 pixels
ocrmypdf --tesseract-timeout 120 \
--tesseract-downsample-large-images \
--tesseract-downsample-above 5000 \
bigfile.pdf output_downsampled_ocr.pdf
```
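To get a feel for what a given threshold does to an image, the proportional scaling can be sketched in Python. This is an illustration of the arithmetic for the per-dimension limit only, under the assumption that the aspect ratio is preserved; it is not OCRmyPDF's actual resizing code, and `fit_within` is a hypothetical name.

```python
TESSERACT_MAX_DIMENSION = 32767  # per-dimension limit quoted above


def fit_within(width, height, max_dimension=TESSERACT_MAX_DIMENSION):
    """Scale (width, height) down proportionally so the longest side fits."""
    longest = max(width, height)
    if longest <= max_dimension:
        return width, height  # already small enough; leave untouched
    # Integer arithmetic avoids floating-point rounding surprises
    new_width = max(1, width * max_dimension // longest)
    new_height = max(1, height * max_dimension // longest)
    return new_width, new_height


print(fit_within(10000, 5000, max_dimension=5000))
print(fit_within(2550, 3300))
```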
### Overriding default tesseract
OCRmyPDF checks the system `PATH` for the `tesseract` binary.
Some relevant environment variables that influence Tesseract's behavior
include:
```{eval-rst}
.. envvar:: TESSDATA_PREFIX

    Overrides the path to Tesseract's data files. This can allow
    simultaneous installation of the "best" and "fast" training data
    sets. OCRmyPDF does not manage this environment variable.
```
```{eval-rst}
.. envvar:: OMP_THREAD_LIMIT

    Controls the number of threads Tesseract will use. OCRmyPDF will
    manage this environment variable if it is not already set.
```
For example, if you have a development build of Tesseract and don't wish to
use the system installation, you can launch OCRmyPDF as follows:
```bash
env \
PATH=/home/user/src/tesseract/api:$PATH \
TESSDATA_PREFIX=/home/user/src/tesseract \
ocrmypdf input.pdf output.pdf
```
In this example `TESSDATA_PREFIX` is required to redirect Tesseract to
an alternate folder for its "tessdata" files.
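The same override can be prepared from Python when OCRmyPDF is started as a child process. The sketch below only constructs the environment dictionary; the function name `tesseract_dev_env` is hypothetical, and the paths are the ones from the example above.

```python
import os


def tesseract_dev_env(tesseract_src, base_env=None):
    """Copy an environment and point it at a development build of Tesseract."""
    env = dict(os.environ if base_env is None else base_env)
    dev_bin = os.path.join(tesseract_src, "api")
    # Prepend so the development binary is found before the system one
    env["PATH"] = dev_bin + os.pathsep + env.get("PATH", "")
    env["TESSDATA_PREFIX"] = tesseract_src
    return env


env = tesseract_dev_env("/home/user/src/tesseract", base_env={"PATH": "/usr/bin"})
print(env["PATH"])
print(env["TESSDATA_PREFIX"])
```

The resulting dictionary could then be passed as `env=env` to `subprocess.run(["ocrmypdf", "input.pdf", "output.pdf"], env=env, check=True)`.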
### Overriding other support programs
In addition to tesseract, OCRmyPDF uses the following external binaries:
- `gs` (Ghostscript)
- `unpaper`
- `pngquant`
- `jbig2`
In each case OCRmyPDF will search the `PATH` environment variable to
locate the binaries. By modifying the `PATH` environment variable, you
can override the binaries that OCRmyPDF uses.
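Before changing `PATH`, it can be useful to see which binaries a plain `PATH` search currently resolves to. The Python sketch below uses the standard library's `shutil.which`; it mirrors an ordinary `PATH` lookup and is not a reproduction of OCRmyPDF's own lookup logic. The name `locate_support_programs` is made up for this example.

```python
import shutil

SUPPORT_PROGRAMS = ["tesseract", "gs", "unpaper", "pngquant", "jbig2"]


def locate_support_programs(names=SUPPORT_PROGRAMS):
    """Map each program name to the path found on PATH, or None if absent."""
    return {name: shutil.which(name) for name in names}


for name, path in locate_support_programs().items():
    print(f"{name}: {path or 'not found on PATH'}")
```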
### Changing Tesseract configuration variables
You can override Tesseract's default [control
parameters](https://tesseract-ocr.github.io/tessdoc/tess3/ControlParams.html)
with a configuration file.
As an example, this configuration will disable Tesseract's dictionary
for the current language. Normally the dictionary is helpful for
interpolating words that are unclear, but it may interfere with OCR if
the document does not contain many words (for example, a list of part
numbers).
Create a file named "no-dict.cfg" with these contents:
```
load_system_dawg 0
language_model_penalty_non_dict_word 0
language_model_penalty_non_freq_dict_word 0
```
then run ocrmypdf as follows (along with any other desired arguments):
```bash
ocrmypdf --tesseract-config no-dict.cfg input.pdf output.pdf
```
:::{warning}
Some combinations of control parameters will break Tesseract or break
assumptions that OCRmyPDF makes about Tesseract's output.
:::
### Changing page segmentation mode
The directive `--tesseract-pagesegmode Nmode` forwards the desired page segmentation
mode to Tesseract OCR. The default is 3.
Page segmentation can improve OCR results when you know that a PDF ought to be
analyzed a particular way, such as PDFs whose pages contain only a single line of
text. For the vast majority of users, changing the page segmentation mode will only
make things worse.
As of June 2024, the Tesseract page segmentation modes are:
| ID | Description |
| --- | --------------------------------------------------------------------------------------------- |
| 0 | Orientation and script detection (OSD) only. |
| 1 | Automatic page segmentation with OSD. |
| 2 | Automatic page segmentation, but no OSD, or OCR. (not implemented) |
| 3 | Fully automatic page segmentation, but no OSD. (Default) |
| 4 | Assume a single column of text of variable sizes. |
| 5 | Assume a single uniform block of vertically aligned text. |
| 6 | Assume a single uniform block of text. |
| 7 | Treat the image as a single text line. |
| 8 | Treat the image as a single word. |
| 9 | Treat the image as a single word in a circle. |
| 10 | Treat the image as a single character. |
| 11 | Sparse text. Find as much text as possible in no particular order. |
| 12 | Sparse text with OSD. |
| 13 | Raw line. Treat the image as a single text line, bypassing hacks that are Tesseract-specific. |
Modes 0, 1, 2, and 12 (all of those that enable orientation and script detection)
are not compatible with OCRmyPDF, which performs OSD in a separate step from OCR.
Their use may interfere with `--rotate-pages` and other features.
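A wrapper script that accepts a page segmentation mode from its own users might reject the incompatible modes up front. The following Python sketch assumes the mode numbers from the table above; `check_pagesegmode` is a hypothetical helper, not an OCRmyPDF function.

```python
# Modes the text above lists as incompatible with OCRmyPDF
OSD_MODES = frozenset({0, 1, 2, 12})
VALID_MODES = range(0, 14)  # IDs 0 through 13 from the table


def check_pagesegmode(mode):
    """Return the command line arguments for a mode, or raise ValueError."""
    if mode not in VALID_MODES:
        raise ValueError(f"unknown page segmentation mode: {mode}")
    if mode in OSD_MODES:
        raise ValueError(f"mode {mode} is listed as incompatible with OCRmyPDF")
    return ["--tesseract-pagesegmode", str(mode)]


print(check_pagesegmode(7))
```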
It is currently not possible to use advanced Tesseract OCR features, such as creating
OCR information, when using Tesseract through OCRmyPDF.
## Changing the PDF renderer
rasterizing
: Converting a PDF to an image for display.
rendering
: Creating a new PDF from other data (such as an existing PDF).
OCRmyPDF has these PDF renderers: `sandwich` and `hocr`. The
renderer may be selected using `--pdf-renderer`. The default is
`auto` which lets OCRmyPDF select the renderer to use. Currently,
`auto` always selects `hocr`.
### The `hocr` renderer
:::{versionchanged} 16.0.0
:::
In both renderers, a text-only layer is rendered and sandwiched (overlaid)
onto either the original PDF page or a newly rasterized version of the
original PDF page (when `--force-ocr` is used). In this way, loss
of PDF information is generally avoided. (You may need to disable PDF/A
conversion and optimization to eliminate all lossy transformations.)
The current approach used by the new hOCR renderer is a re-implementation
of Tesseract's PDF renderer, using the same Glyphless font and general
ideas, but fixing many technical issues that impeded it. The new `hocr` renderer
provides better text placement accuracy, avoids issues with word
segmentation, and provides better positioning of skewed text.
Using the experimental API, it is also possible to edit the OCR output
from Tesseract, using any tool that is capable of editing hOCR files.
Older versions of this renderer did not support non-Latin languages, but
it is now universal.
### The `sandwich` renderer
The `sandwich` renderer uses Tesseract's text-only PDF feature,
which produces a PDF page that lays out the OCR in invisible text.
Currently some problematic PDF viewers like Mozilla PDF.js and macOS
Preview have problems with segmenting its text output, and
mightrunseveralwordstogether. It also does not implement right to left
fonts (Arabic, Hebrew, Persian). The output of this renderer cannot
be edited. The sandwich renderer is retained for testing.
When image preprocessing features like `--deskew` are used, the
original PDF will be rendered as a full page and the OCR layer will be
placed on top.
## Rendering and rasterizing options
:::{versionadded} 14.3.0
:::
The `--continue-on-soft-render-error` option allows OCRmyPDF to
proceed if a page cannot be rasterized/rendered. This is useful if you are
trying to get the best possible OCR from a PDF that is not well-formed,
and you are willing to accept some pages that may not visually match the
input, and that may not OCR well.
## Color conversion strategy
:::{versionadded} 15.0.0
:::
OCRmyPDF uses Ghostscript to convert PDF to PDF/A. In some cases, this
conversion requires color conversion. The default strategy is to convert
using the `LeaveColorUnchanged` strategy, which preserves the original
color space wherever possible (some rare color spaces might still be
converted).
Usually document scanners produce PDFs in the sRGB color space, and do
not need to be converted, so the default strategy is appropriate.
Suppose that you have a document that was prepared for professional
printing in a Separation or CMYK color space, and text was converted to
curves. In this case, you may want to use a different color conversion
strategy. The `--color-conversion-strategy` option allows you to select a
different strategy, such as `RGB`.
## Return code policy
OCRmyPDF writes all messages to `stderr`. `stdout` is reserved for
piping output files. `stdin` is reserved for piping input files.
The return codes generated by OCRmyPDF are considered part of the
stable user interface. They may be imported from
`ocrmypdf.exceptions`.
```{eval-rst}
.. list-table:: Return codes
    :widths: 5 35 60
    :header-rows: 1

    * - Code
      - Name
      - Interpretation
    * - 0
      - ``ExitCode.ok``
      - Everything worked as expected.
    * - 1
      - ``ExitCode.bad_args``
      - Invalid arguments, exited with an error.
    * - 2
      - ``ExitCode.input_file``
      - The input file does not seem to be a valid PDF.
    * - 3
      - ``ExitCode.missing_dependency``
      - An external program required by OCRmyPDF is missing.
    * - 4
      - ``ExitCode.invalid_output_pdf``
      - An output file was created, but it does not seem to be a valid PDF. The file will be available.
    * - 5
      - ``ExitCode.file_access_error``
      - The user running OCRmyPDF does not have sufficient permissions to read the input file and write the output file.
    * - 6
      - ``ExitCode.already_done_ocr``
      - The file already appears to contain text so it may not need OCR. See output message.
    * - 7
      - ``ExitCode.child_process_error``
      - An error occurred in an external program (child process) and OCRmyPDF cannot continue.
    * - 8
      - ``ExitCode.encrypted_pdf``
      - The input PDF is encrypted. OCRmyPDF does not read encrypted PDFs. Use another program such as ``qpdf`` to remove encryption.
    * - 9
      - ``ExitCode.invalid_config``
      - A custom configuration file was forwarded to Tesseract using ``--tesseract-config``, and Tesseract rejected this file.
    * - 10
      - ``ExitCode.pdfa_conversion_failed``
      - A valid PDF was created, but PDF/A conversion failed. The file will be available.
    * - 15
      - ``ExitCode.other_error``
      - Some other error occurred.
    * - 130
      - ``ExitCode.ctrl_c``
      - The program was interrupted by pressing Ctrl+C.
```
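A calling script can branch on these codes. To stay self-contained, the Python sketch below defines a local `ExitCode` enumeration copied from the table; a real script would import it from `ocrmypdf.exceptions` instead. The helper `output_file_available` is hypothetical and encodes only what the table says about which codes leave an output file behind.

```python
from enum import IntEnum


class ExitCode(IntEnum):
    """Local copy of the table above, for illustration only."""
    ok = 0
    bad_args = 1
    input_file = 2
    missing_dependency = 3
    invalid_output_pdf = 4
    file_access_error = 5
    already_done_ocr = 6
    child_process_error = 7
    encrypted_pdf = 8
    invalid_config = 9
    pdfa_conversion_failed = 10
    other_error = 15
    ctrl_c = 130


# Codes for which the table states or implies that an output file exists
OUTPUT_AVAILABLE = {
    ExitCode.ok,
    ExitCode.invalid_output_pdf,
    ExitCode.pdfa_conversion_failed,
}


def output_file_available(returncode):
    """True if the table indicates an output file was written."""
    try:
        code = ExitCode(returncode)
    except ValueError:
        return False  # a code that is not in the table
    return code in OUTPUT_AVAILABLE


print(ExitCode(6).name)
print(output_file_available(10))
```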
(tmpdir)=
## Changing temporary storage location
OCRmyPDF generates many temporary files during processing.
To change where temporary files are stored, change the `TMPDIR`
environment variable for ocrmypdf's environment. (Python's
`tempfile.gettempdir()` returns the root directory in which temporary
files will be stored.) For example, one could redirect `TMPDIR` to a
large RAM disk to avoid wear on HDD/SSD and potentially improve
performance.
On Windows, the `TEMP` environment variable is used instead.
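The effect of the environment variable on `tempfile.gettempdir()` can be checked directly. The Python sketch below temporarily sets `TMPDIR`, clears the value that the `tempfile` module caches, and reports the directory Python would then use. The function name `tempdir_for` is made up, and the scratch directory merely stands in for a RAM disk mount point.

```python
import os
import tempfile


def tempdir_for(tmpdir_value):
    """Return what tempfile.gettempdir() reports when TMPDIR is tmpdir_value."""
    saved_env = os.environ.get("TMPDIR")
    saved_cache = tempfile.tempdir
    os.environ["TMPDIR"] = tmpdir_value
    tempfile.tempdir = None  # clear the cached value so TMPDIR is read again
    try:
        return tempfile.gettempdir()
    finally:
        # Restore the previous state so the rest of the process is unaffected
        tempfile.tempdir = saved_cache
        if saved_env is None:
            del os.environ["TMPDIR"]
        else:
            os.environ["TMPDIR"] = saved_env


scratch = tempfile.mkdtemp()  # stands in for a RAM disk mount point
print(tempdir_for(scratch))
```

Note that the directory must exist and be writable; otherwise Python falls back to other candidate locations.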
## Debugging the intermediate files
OCRmyPDF normally saves its intermediate results to a temporary folder
and deletes this folder when it exits, whether it succeeded or failed.
If the `--keep-temporary-files` (`-k`) argument is issued on the
command line, OCRmyPDF will keep the temporary folder and print the location,
whether it succeeded or failed. An example message is:
```none
Temporary working files retained at:
/tmp/ocrmypdf.io.u20wpz07
```
When OCRmyPDF is launched as a snap, this corresponds to the snap filesystem, for instance:
> /tmp/snap-private-tmp/snap.ocrmypdf/tmp/ocrmypdf.io.u20wpz07
The organization of this folder is an implementation detail and subject
to change between releases. However, the general organization is that
working files on a per page basis have the page number as a prefix
(starting with page 1), an infix indicates the processing stage, and a
suffix indicates the file type. Some important files include:
- `_rasterize.png` - what the input page looks like
- `_ocr.png` - the file that is sent to Tesseract for OCR; depending
on arguments this may differ from the presentation image
- `_pp_deskew.png` - the image, after deskewing
- `_pp_clean.png` - the image, after cleaning with unpaper
- `_ocr_hocr.pdf` - the OCR file; appears as a blank page with invisible
text embedded
- `_ocr_hocr.txt` - the OCR text (not necessarily all text on the page,
if the page is mixed format)
- `fix_docinfo.pdf` - a temporary file created to fix the PDF DocumentInfo
data structure
- `graft_layers.pdf` - the rendered PDF with OCR layers grafted on
- `pdfa.pdf` - `graft_layers.pdf` after conversion to PDF/A
- `pdfa.ps` - a PostScript file used by Ghostscript for PDF/A conversion
- `optimize.pdf` - the PDF generated before optimization
- `optimize.out.pdf` - the PDF generated by optimization
- `origin` - the input file
- `origin.pdf` - the input file or the input image converted to PDF
- `images/*` - images extracted during the optimization process; here
the prefix indicates a PDF object ID not a page number
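When inspecting a retained folder, it can help to split the per-page file names into their parts. The Python sketch below assumes names of the form `<digits>_<stage>.<type>`, for example `000001_pp_deskew.png`; the exact width of the page number prefix is an assumption, and, as noted above, the layout is an implementation detail that may change. `split_working_name` is a hypothetical helper.

```python
def split_working_name(filename):
    """Split a per-page working file name into (page, stage, file_type).

    Assumes the page number is the run of digits before the first underscore.
    """
    stem, _, file_type = filename.rpartition(".")
    prefix, _, stage = stem.partition("_")
    if not prefix.isdigit():
        raise ValueError(f"not a per-page working file: {filename}")
    return int(prefix), stage, file_type


print(split_working_name("000001_pp_deskew.png"))
```

Document-level files such as `origin.pdf` have no page prefix and raise `ValueError` in this sketch.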


@@ -1,486 +0,0 @@
.. SPDX-FileCopyrightText: 2022 James R. Barlow
.. SPDX-License-Identifier: CC-BY-SA-4.0
=================
Advanced features
=================
Control of unpaper
==================
OCRmyPDF uses ``unpaper`` to provide the implementation of the
``--clean`` and ``--clean-final`` arguments.
`unpaper <https://github.com/Flameeyes/unpaper/blob/main/doc/basic-concepts.md>`__
provides a variety of image processing filters to improve images.
By default, OCRmyPDF uses only ``unpaper`` arguments that were found to
be safe to use on almost all files without having to inspect every page
of the file afterwards. This is particularly true when only ``--clean``
is used, since that instructs OCRmyPDF to only clean the image before
OCR and not the final image.
However, if you wish to use the more aggressive options in ``unpaper``,
you may use ``--unpaper-args '...'`` to override OCRmyPDF's defaults
and forward other arguments to ``unpaper``. This option forwards
arguments to ``unpaper`` without any knowledge of what that program
considers to be valid arguments. The string of arguments must be quoted
as shown in the examples below. No filename arguments may be included.
OCRmyPDF assumes it can append the input and output filenames of
intermediate images to the ``--unpaper-args`` string.
In this example, we tell ``unpaper`` to expect two pages of text on a
sheet (image), such as occurs when two facing pages of a book are
scanned. ``unpaper`` uses this information to deskew each independently
and clean up the margins of both.
.. code-block:: bash
ocrmypdf --clean --clean-final --unpaper-args '--layout double' input.pdf output.pdf
ocrmypdf --clean --clean-final --unpaper-args '--layout double --no-noisefilter' input.pdf output.pdf
.. warning::
Some ``unpaper`` features will reposition text within the image.
``--clean-final`` is recommended to avoid this issue.
.. warning::
Some ``unpaper`` features cause multiple input or output files to be
consumed or produced. OCRmyPDF requires ``unpaper`` to consume one
file and produce one file; errors will result if this assumption is not
met.
.. note::
``unpaper`` uses uncompressed PBM/PGM/PPM files for its intermediate
files. For large images or documents, it can take a lot of temporary
disk space.
Control of OCR options
======================
OCRmyPDF provides many features to control the behavior of the OCR
engine, Tesseract.
When OCR is skipped
-------------------
If a page in a PDF seems to have text, by default OCRmyPDF will exit
without modifying the PDF. This is to ensure that PDFs that were
previously OCRed or were "born digital" rather than scanned are not
processed.
If ``--skip-text`` is issued, then no image processing or OCR will be
performed on pages that already have text. The page will be copied to
the output. This may be useful for documents that contain both "born
digital" and scanned content, or to use OCRmyPDF to normalize and
convert to PDF/A regardless of their contents.
If ``--redo-ocr`` is issued, then a detailed text analysis is performed.
Text is categorized as either visible or invisible. Invisible text (OCR)
is stripped out. Then an image of each page is created with visible text
masked out. The page image is sent for OCR, and any additional text is
inserted as OCR. If a file contains a mix of text and bitmap images that
contain text, OCRmyPDF will locate the additional text in images without
disrupting the existing text. Some PDF OCR solutions render text as
technically printable or visible in some way, perhaps by drawing it and
then painting over it. OCRmyPDF cannot distinguish this type of OCR
text from real text, so it will not be "redone".
If ``--force-ocr`` is issued, then all pages will be rasterized to
images, discarding any hidden OCR text, rasterizing any printable
text, and flattening form fields or interactive objects into their visual
representation. This is useful for redoing OCR, for fixing OCR text
with a damaged character map (text is selectable but not searchable),
and destroying redacted information.
Time and image size limits
--------------------------
By default, OCRmyPDF permits tesseract to run for three minutes (180
seconds) per page. This is usually more than enough time to find all
text on a reasonably sized page with modern hardware.
If a page is skipped, it will be inserted without OCR. If preprocessing
was requested, the preprocessed image layer will be inserted.
If you want to adjust the amount of time spent on OCR, change
``--tesseract-timeout``. You can also automatically skip images that
exceed a certain number of megapixels with ``--skip-big``. (A 300 DPI,
8.5×11" page image is 8.4 megapixels.)
.. code-block:: bash
# Allow 300 seconds for OCR; skip any page larger than 50 megapixels
ocrmypdf --tesseract-timeout 300 --skip-big 50 bigfile.pdf output.pdf
OCR for huge images
-------------------
Tesseract has internal limits on the size
of images it will process. By default,
``--tesseract-downsample-large-images`` is enabled, and OCRmyPDF will
downsample images to fit Tesseract limits. (The limits are usually encountered
only for scanned images of oversized media, such as large maps or blueprints exceeding
110 cm or 43 inches in either dimension, and at high DPI.) This feature can be disabled
using ``--no-tesseract-downsample-large-images``.
``--tesseract-downsample-above Npixels`` adjusts the threshold at which images
will be downsampled. By default, only images that exceed any of Tesseract's
internal limits are downsampled (32767 pixels on either dimension).
You will also need to set ``--tesseract-timeout`` high enough to allow
for processing.
Only the image sent for OCR is downsampled. The original image is
preserved.
.. code-block:: bash
# Allow 600 seconds for OCR on huge images
ocrmypdf --tesseract-timeout 600 \
--tesseract-downsample-large-images \
bigfile.pdf output.pdf
# Downsample images above 5000 pixels on the longest dimension to
# 5000 pixels
ocrmypdf --tesseract-timeout 120 \
--tesseract-downsample-large-images \
--tesseract-downsample-above 5000 \
bigfile.pdf output_downsampled_ocr.pdf
Overriding default tesseract
----------------------------
OCRmyPDF checks the system ``PATH`` for the ``tesseract`` binary.
Some relevant environment variables that influence Tesseract's behavior
include:
.. envvar:: TESSDATA_PREFIX
Overrides the path to Tesseract's data files. This can allow
simultaneous installation of the "best" and "fast" training data
sets. OCRmyPDF does not manage this environment variable.
.. envvar:: OMP_THREAD_LIMIT
Controls the number of threads Tesseract will use. OCRmyPDF will
manage this environment variable if it is not already set.
For example, if you have a development build of Tesseract and don't wish to
use the system installation, you can launch OCRmyPDF as follows:
.. code-block:: bash
env \
PATH=/home/user/src/tesseract/api:$PATH \
TESSDATA_PREFIX=/home/user/src/tesseract \
ocrmypdf input.pdf output.pdf
In this example ``TESSDATA_PREFIX`` is required to redirect Tesseract to
an alternate folder for its "tessdata" files.
Overriding other support programs
---------------------------------
In addition to tesseract, OCRmyPDF uses the following external binaries:
- ``gs`` (Ghostscript)
- ``unpaper``
- ``pngquant``
- ``jbig2``
In each case OCRmyPDF will search the ``PATH`` environment variable to
locate the binaries. By modifying the ``PATH`` environment variable, you
can override the binaries that OCRmyPDF uses.
Changing Tesseract configuration variables
------------------------------------------
You can override Tesseract's default `control
parameters <https://tesseract-ocr.github.io/tessdoc/tess3/ControlParams.html>`__
with a configuration file.
As an example, this configuration will disable Tesseract's dictionary
for the current language. Normally the dictionary is helpful for
interpolating words that are unclear, but it may interfere with OCR if
the document does not contain many words (for example, a list of part
numbers).
Create a file named "no-dict.cfg" with these contents:
::
load_system_dawg 0
language_model_penalty_non_dict_word 0
language_model_penalty_non_freq_dict_word 0
then run ocrmypdf as follows (along with any other desired arguments):
.. code-block:: bash
ocrmypdf --tesseract-config no-dict.cfg input.pdf output.pdf
.. warning::
Some combinations of control parameters will break Tesseract or break
assumptions that OCRmyPDF makes about Tesseract's output.
Changing page segmentation mode
-------------------------------
The directive ``--tesseract-pagesegmode Nmode`` forwards the desired page segmentation
mode to Tesseract OCR. The default is 3.
Page segmentation can improve OCR results when you know that a PDF ought to be
analyzed a particular way, such as PDFs whose pages contain only a single line of
text. For the vast majority of users, changing the page segmentation mode will only
make things worse.
As of June 2024, the Tesseract page segmentation modes are:
+-----+----------------------------------------------------------------------------------+
| ID | Description |
+=====+==================================================================================+
| 0 | Orientation and script detection (OSD) only. |
+-----+----------------------------------------------------------------------------------+
| 1 | Automatic page segmentation with OSD. |
+-----+----------------------------------------------------------------------------------+
| 2 | Automatic page segmentation, but no OSD, or OCR. (not implemented) |
+-----+----------------------------------------------------------------------------------+
| 3 | Fully automatic page segmentation, but no OSD. (Default) |
+-----+----------------------------------------------------------------------------------+
| 4 | Assume a single column of text of variable sizes. |
+-----+----------------------------------------------------------------------------------+
| 5 | Assume a single uniform block of vertically aligned text. |
+-----+----------------------------------------------------------------------------------+
| 6 | Assume a single uniform block of text. |
+-----+----------------------------------------------------------------------------------+
| 7 | Treat the image as a single text line. |
+-----+----------------------------------------------------------------------------------+
| 8 | Treat the image as a single word. |
+-----+----------------------------------------------------------------------------------+
| 9 | Treat the image as a single word in a circle. |
+-----+----------------------------------------------------------------------------------+
| 10 | Treat the image as a single character. |
+-----+----------------------------------------------------------------------------------+
| 11 | Sparse text. Find as much text as possible in no particular order. |
+-----+----------------------------------------------------------------------------------+
| 12 | Sparse text with OSD. |
+-----+----------------------------------------------------------------------------------+
| 13 | Raw line. Treat the image as a single text line, bypassing hacks that are |
| | Tesseract-specific. |
+-----+----------------------------------------------------------------------------------+
Modes 0, 1, 2, and 12 (all of those that enable orientation and script detection)
are not compatible with OCRmyPDF, which performs OSD in a separate step from OCR.
Their use may interfere with ``--rotate-pages`` and other features.
It is currently not possible to use advanced Tesseract OCR features, such as creating
OCR information, when using Tesseract through OCRmyPDF.
Changing the PDF renderer
=========================
rasterizing
Converting a PDF to an image for display.
rendering
Creating a new PDF from other data (such as an existing PDF).
OCRmyPDF has these PDF renderers: ``sandwich`` and ``hocr``. The
renderer may be selected using ``--pdf-renderer``. The default is
``auto`` which lets OCRmyPDF select the renderer to use. Currently,
``auto`` always selects ``hocr``.
The ``hocr`` renderer
---------------------
.. versionchanged:: 16.0.0
In both renderers, a text-only layer is rendered and sandwiched (overlaid)
onto either the original PDF page or a newly rasterized version of the
original PDF page (when ``--force-ocr`` is used). In this way, loss
of PDF information is generally avoided. (You may need to disable PDF/A
conversion and optimization to eliminate all lossy transformations.)
The current approach used by the new hOCR renderer is a re-implementation
of Tesseract's PDF renderer, using the same Glyphless font and general
ideas, but fixing many technical issues that impeded it. The new ``hocr`` renderer
provides better text placement accuracy, avoids issues with word
segmentation, and provides better positioning of skewed text.
Using the experimental API, it is also possible to edit the OCR output
from Tesseract, using any tool that is capable of editing hOCR files.
Older versions of this renderer did not support non-Latin languages, but
it is now universal.
The ``sandwich`` renderer
-------------------------
The ``sandwich`` renderer uses Tesseract's text-only PDF feature,
which produces a PDF page that lays out the OCR in invisible text.
Currently some problematic PDF viewers like Mozilla PDF.js and macOS
Preview have problems with segmenting its text output, and
mightrunseveralwordstogether. It also does not implement right to left
fonts (Arabic, Hebrew, Persian). The output of this renderer cannot
be edited. The sandwich renderer is retained for testing.
When image preprocessing features like ``--deskew`` are used, the
original PDF will be rendered as a full page and the OCR layer will be
placed on top.
Rendering and rasterizing options
=================================
.. versionadded:: 14.3.0
The ``--continue-on-soft-render-error`` option allows OCRmyPDF to
proceed if a page cannot be rasterized/rendered. This is useful if you are
trying to get the best possible OCR from a PDF that is not well-formed,
and you are willing to accept some pages that may not visually match the
input, and that may not OCR well.
Color conversion strategy
=========================
.. versionadded:: 15.0.0
OCRmyPDF uses Ghostscript to convert PDF to PDF/A. In some cases, this
conversion requires color conversion. The default strategy is to convert
using the ``LeaveColorUnchanged`` strategy, which preserves the original
color space wherever possible (some rare color spaces might still be
converted).
Usually document scanners produce PDFs in the sRGB color space, and do
not need to be converted, so the default strategy is appropriate.
Suppose that you have a document that was prepared for professional
printing in a Separation or CMYK color space, and text was converted to
curves. In this case, you may want to use a different color conversion
strategy. The ``--color-conversion-strategy`` option allows you to select a
different strategy, such as ``RGB``.
Return code policy
==================
OCRmyPDF writes all messages to ``stderr``. ``stdout`` is reserved for
piping output files. ``stdin`` is reserved for piping input files.
The return codes generated by OCRmyPDF are considered part of the
stable user interface. They may be imported from
``ocrmypdf.exceptions``.
.. list-table:: Return codes
:widths: 5 35 60
:header-rows: 1
* - Code
- Name
- Interpretation
* - 0
- ``ExitCode.ok``
- Everything worked as expected.
* - 1
- ``ExitCode.bad_args``
- Invalid arguments, exited with an error.
* - 2
- ``ExitCode.input_file``
- The input file does not seem to be a valid PDF.
* - 3
- ``ExitCode.missing_dependency``
- An external program required by OCRmyPDF is missing.
* - 4
- ``ExitCode.invalid_output_pdf``
- An output file was created, but it does not seem to be a valid PDF. The file will be available.
* - 5
- ``ExitCode.file_access_error``
- The user running OCRmyPDF does not have sufficient permissions to read the input file and write the output file.
* - 6
- ``ExitCode.already_done_ocr``
- The file already appears to contain text so it may not need OCR. See output message.
* - 7
- ``ExitCode.child_process_error``
- An error occurred in an external program (child process) and OCRmyPDF cannot continue.
* - 8
- ``ExitCode.encrypted_pdf``
- The input PDF is encrypted. OCRmyPDF does not read encrypted PDFs. Use another program such as ``qpdf`` to remove encryption.
* - 9
- ``ExitCode.invalid_config``
- A custom configuration file was forwarded to Tesseract using ``--tesseract-config``, and Tesseract rejected this file.
* - 10
- ``ExitCode.pdfa_conversion_failed``
- A valid PDF was created, but PDF/A conversion failed. The file will be available.
* - 15
- ``ExitCode.other_error``
- Some other error occurred.
* - 130
- ``ExitCode.ctrl_c``
- The program was interrupted by pressing Ctrl+C.
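As a rough illustration of how a calling script might interpret these codes, the sketch below mirrors the table in a local ``IntEnum``. The names ``ReturnCode``, ``OUTPUT_AVAILABLE`` and ``classify`` are hypothetical and exist only for this example; a real application would more likely import ``ExitCode`` from ``ocrmypdf.exceptions`` as noted above.

```python
from enum import IntEnum


class ReturnCode(IntEnum):
    """Local mirror of the return code table (hypothetical helper, not the library's enum)."""

    ok = 0
    bad_args = 1
    input_file = 2
    missing_dependency = 3
    invalid_output_pdf = 4
    file_access_error = 5
    already_done_ocr = 6
    child_process_error = 7
    encrypted_pdf = 8
    invalid_config = 9
    pdfa_conversion_failed = 10
    other_error = 15
    ctrl_c = 130


# Codes for which the table says an output file is still available.
OUTPUT_AVAILABLE = {
    ReturnCode.ok,
    ReturnCode.invalid_output_pdf,
    ReturnCode.pdfa_conversion_failed,
}


def classify(returncode: int) -> tuple[str, bool]:
    """Return (name, output_file_available) for a process return code."""
    try:
        code = ReturnCode(returncode)
    except ValueError:
        # Not a code listed in the table
        return ("unknown", False)
    return (code.name, code in OUTPUT_AVAILABLE)
```

A caller could pass the ``returncode`` attribute of a completed ``subprocess.run`` call to ``classify`` and decide whether to keep or discard the output file.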
.. _tmpdir:
Changing temporary storage location
===================================
OCRmyPDF generates many temporary files during processing.
To change where temporary files are stored, change the ``TMPDIR``
environment variable for ocrmypdf's environment. (Python's
``tempfile.gettempdir()`` returns the root directory in which temporary
files will be stored.) For example, one could redirect ``TMPDIR`` to a
large RAM disk to avoid wear on HDD/SSD and potentially improve
performance.
On Windows, the ``TEMP`` environment variable is used instead.
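The following sketch shows, under the assumption that the target directory exists and is writable, how Python's ``tempfile`` module resolves the temporary root when ``TMPDIR`` is set. The helper name ``tempdir_with_override`` is hypothetical. Note that ``tempfile`` caches its result, so the cache is cleared before the lookup and restored afterwards.

```python
import os
import tempfile


def tempdir_with_override(path):
    """Return the temp root Python would pick when TMPDIR points at `path`."""
    saved_env = os.environ.get("TMPDIR")
    saved_cache = tempfile.tempdir
    os.environ["TMPDIR"] = path
    tempfile.tempdir = None  # drop the cached value so TMPDIR is re-read
    try:
        return tempfile.gettempdir()
    finally:
        # Restore the previous state so the caller's environment is untouched
        if saved_env is None:
            del os.environ["TMPDIR"]
        else:
            os.environ["TMPDIR"] = saved_env
        tempfile.tempdir = saved_cache
```

In practice the variable is usually set in the shell or service configuration before OCRmyPDF starts, rather than from inside Python.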
Debugging the intermediate files
================================
OCRmyPDF normally saves its intermediate results to a temporary folder
and deletes this folder when it exits, whether it succeeded or failed.
If the ``--keep-temporary-files`` (``-k``) argument is issued on the
command line, OCRmyPDF will keep the temporary folder and print the location,
whether it succeeded or failed. An example message is:
.. code-block:: none
Temporary working files retained at:
/tmp/ocrmypdf.io.u20wpz07
When OCRmyPDF is launched as a snap, this corresponds to the snap filesystem, for instance:
/tmp/snap-private-tmp/snap.ocrmypdf/tmp/ocrmypdf.io.u20wpz07
The organization of this folder is an implementation detail and subject
to change between releases. However the general organization is that
working files on a per page basis have the page number as a prefix
(starting with page 1), an infix indicates the processing stage, and a
suffix indicates the file type. Some important files include:
- ``_rasterize.png`` - what the input page looks like
- ``_ocr.png`` - the file that is sent to Tesseract for OCR; depending
on arguments this may differ from the presentation image
- ``_pp_deskew.png`` - the image, after deskewing
- ``_pp_clean.png`` - the image, after cleaning with unpaper
- ``_ocr_hocr.pdf`` - the OCR file; appears as a blank page with invisible
text embedded
- ``_ocr_hocr.txt`` - the OCR text (not necessarily all text on the page,
if the page is mixed format)
- ``fix_docinfo.pdf`` - a temporary file created to fix the PDF DocumentInfo
data structure
- ``graft_layers.pdf`` - the rendered PDF with OCR layers grafted on
- ``pdfa.pdf`` - ``graft_layers.pdf`` after conversion to PDF/A
- ``pdfa.ps`` - a PostScript file used by Ghostscript for PDF/A conversion
- ``optimize.pdf`` - the PDF generated before optimization
- ``optimize.out.pdf`` - the PDF generated by optimization
- ``origin`` - the input file
- ``origin.pdf`` - the input file or the input image converted to PDF
- ``images/*`` - images extracted during the optimization process; here
the prefix indicates a PDF object ID not a page number
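To make the prefix/infix/suffix convention concrete, here is a minimal sketch that splits a per-page working file name into its parts. The zero-padded example names and the helper ``parse_working_file`` are assumptions for illustration only; since the folder layout is an implementation detail, the pattern may not match every release.

```python
import re
from typing import NamedTuple, Optional


class WorkingFile(NamedTuple):
    page: int       # page number prefix (pages start at 1)
    stage: str      # infix naming the processing stage
    filetype: str   # suffix naming the file type


# digits, underscore, stage name, dot, extension
_PER_PAGE = re.compile(r"^(\d+)_(.+)\.([A-Za-z0-9]+)$")


def parse_working_file(name: str) -> Optional[WorkingFile]:
    """Split a per-page file name; return None for whole-file names."""
    match = _PER_PAGE.match(name)
    if match is None:
        return None
    return WorkingFile(int(match.group(1)), match.group(2), match.group(3))
```

Whole-file names such as ``origin.pdf`` or ``optimize.out.pdf`` carry no page prefix, so the helper returns ``None`` for them.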


@@ -1,10 +1,7 @@
.. SPDX-FileCopyrightText: 2022 James R. Barlow
..
.. SPDX-License-Identifier: CC-BY-SA-4.0
% SPDX-FileCopyrightText: 2022 James R. Barlow
% SPDX-License-Identifier: CC-BY-SA-4.0
======================
Using the OCRmyPDF API
======================
# Using the OCRmyPDF API
OCRmyPDF originated as a command line program and continues to have this
legacy, but parts of it can be imported and used in other Python
@@ -13,100 +10,95 @@ applications.
Some applications may want to consider running ocrmypdf from a
subprocess call anyway, as this provides isolation of its activities.
Example
=======
## Example
OCRmyPDF provides one high-level function to run its main engine from an
application. The parameters are symmetric to the command line arguments
and largely have the same functions.
.. code-block:: python
```python
import ocrmypdf
import ocrmypdf
if __name__ == '__main__': # To ensure correct behavior on Windows and macOS
ocrmypdf.ocr('input.pdf', 'output.pdf', deskew=True)
if __name__ == '__main__': # To ensure correct behavior on Windows and macOS
ocrmypdf.ocr('input.pdf', 'output.pdf', deskew=True)
```
With some exceptions, all of the command line arguments are available
and may be passed as equivalent keywords.
A few differences are that ``verbose`` and ``quiet`` are not available.
A few differences are that `verbose` and `quiet` are not available.
Instead, output should be managed by configuring logging.
Parent process requirements
---------------------------
### Parent process requirements
The :func:`ocrmypdf.ocr` function runs OCRmyPDF similar to command line
The {func}`ocrmypdf.ocr` function runs OCRmyPDF similar to command line
execution. To do this, it will:
- create worker processes or threads
- manage the signal flags of its worker processes
- execute other subprocesses (forking and executing other programs)
The Python process that calls :func:`ocrmypdf.ocr()` must be sufficiently
The Python process that calls {func}`ocrmypdf.ocr()` must be sufficiently
privileged to perform these actions.
There currently is no option to manage how jobs are scheduled other
than the argument ``jobs=`` which will limit the number of worker
than the argument `jobs=` which will limit the number of worker
processes.
Creating a child process to call :func:`ocrmypdf.ocr()` is suggested. That
Creating a child process to call {func}`ocrmypdf.ocr()` is suggested. That
way your application will survive and remain interactive even if
OCRmyPDF fails for any reason. For example:
.. code-block:: python
```python
from multiprocessing import Process
from multiprocessing import Process
def ocrmypdf_process():
ocrmypdf.ocr('input.pdf', 'output.pdf')
def ocrmypdf_process():
ocrmypdf.ocr('input.pdf', 'output.pdf')
def call_ocrmypdf_from_my_app():
p = Process(target=ocrmypdf_process)
p.start()
p.join()
```
def call_ocrmypdf_from_my_app():
p = Process(target=ocrmypdf_process)
p.start()
p.join()
Programs that call :func:`ocrmypdf.ocr()` should also install a SIGBUS signal
Programs that call {func}`ocrmypdf.ocr()` should also install a SIGBUS signal
handler (except on Windows), to raise an exception if access to a memory
mapped file fails. OCRmyPDF may use memory mapping.
:func:`ocrmypdf.ocr()` will take a threading lock to prevent multiple runs of itself
{func}`ocrmypdf.ocr()` will take a threading lock to prevent multiple runs of itself
in the same Python interpreter process. This is not thread-safe, because of how
OCRmyPDF's plugins and Python's library import system work. If you need to parallelize
OCRmyPDF, use processes.
.. warning::
:::{warning}
On Windows and macOS, the script that calls {func}`ocrmypdf.ocr()` must be
protected by an "ifmain" guard (`if __name__ == '__main__'`). If you do
not take at least one of these steps, process semantics will prevent
OCRmyPDF from working correctly.
:::
On Windows and macOS, the script that calls :func:`ocrmypdf.ocr()` must be
protected by an "ifmain" guard (``if __name__ == '__main__'``). If you do
not take at least one of these steps, process semantics will prevent
OCRmyPDF from working correctly.
### Logging
Logging
-------
OCRmyPDF will log under loggers named ``ocrmypdf``. In addition, it
imports ``pdfminer`` and ``PIL``, both of which post log messages under
OCRmyPDF will log under loggers named `ocrmypdf`. In addition, it
imports `pdfminer` and `PIL`, both of which post log messages under
those logging namespaces.
You can configure the logging as desired for your application or call
:func:`ocrmypdf.configure_logging` to configure logging the same way
OCRmyPDF itself does. The command line parameters such as ``--quiet``
and ``--verbose`` have no equivalents in the API; you must use the
{func}`ocrmypdf.configure_logging` to configure logging the same way
OCRmyPDF itself does. The command line parameters such as `--quiet`
and `--verbose` have no equivalents in the API; you must use the
provided configuration function or do configuration in a way that suits
your use case.
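For applications that manage logging themselves, one possible arrangement is sketched below with the standard `logging` module. The function name `configure_library_logging` and the chosen levels are assumptions for illustration; only the logger names `ocrmypdf`, `pdfminer` and `PIL` come from the text above.

```python
import logging


def configure_library_logging(
    ocr_level: int = logging.INFO,
    dependency_level: int = logging.WARNING,
) -> logging.Logger:
    """Set levels on the loggers named above and return the ocrmypdf logger."""
    # Quieten the imported libraries, which log under their own namespaces
    for name in ("pdfminer", "PIL"):
        logging.getLogger(name).setLevel(dependency_level)
    log = logging.getLogger("ocrmypdf")
    log.setLevel(ocr_level)
    return log
```

Handlers and formatters are deliberately left to the application, so this composes with whatever logging setup is already in place.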
Progress monitoring
-------------------
### Progress monitoring
OCRmyPDF uses the ``rich`` package to implement its progress bars.
:func:`ocrmypdf.configure_logging` will set up logging output to
``sys.stderr`` in a way that is compatible with the display of the
progress bar. Use ``ocrmypdf.ocr(...progress_bar=False)`` to disable
OCRmyPDF uses the `rich` package to implement its progress bars.
{func}`ocrmypdf.configure_logging` will set up logging output to
`sys.stderr` in a way that is compatible with the display of the
progress bar. Use `ocrmypdf.ocr(...progress_bar=False)` to disable
the progress bar.
Standard output
---------------
### Standard output
OCRmyPDF is strict about not writing to standard output so that
users can safely use it in a pipeline and produce a valid output
@@ -116,12 +108,11 @@ behavior and support piping to a file. Another benefit of running
OCRmyPDF in a child process, as recommended above, is that it will
not interfere with the parent process's standard output.
Exceptions
----------
### Exceptions
OCRmyPDF may throw standard Python exceptions, ``ocrmypdf.exceptions.*``
OCRmyPDF may throw standard Python exceptions, `ocrmypdf.exceptions.*`
exceptions, some exceptions related to multiprocessing, and
:exc:`KeyboardInterrupt`. The parent process should provide an exception
{exc}`KeyboardInterrupt`. The parent process should provide an exception
handler. OCRmyPDF will clean up its temporary files and worker processes
automatically when an exception occurs.


@@ -1,56 +1,60 @@
.. SPDX-FileCopyrightText: 2022 James R. Barlow
..
.. SPDX-License-Identifier: CC-BY-SA-4.0
% SPDX-FileCopyrightText: 2022 James R. Barlow
% SPDX-License-Identifier: CC-BY-SA-4.0
=============
API reference
=============
# API reference
This page summarizes the rest of the public API. Generally speaking this
should be mainly of interest to plugin developers.
ocrmypdf.api
============
## ocrmypdf.api
```{eval-rst}
.. automodule:: ocrmypdf.api
:members:
```
ocrmypdf.exceptions
===================
## ocrmypdf.exceptions
```{eval-rst}
.. automodule:: ocrmypdf.exceptions
:members:
:undoc-members:
```
ocrmypdf.helpers
================
## ocrmypdf.helpers
```{eval-rst}
.. automodule:: ocrmypdf.helpers
:members:
:noindex: deprecated
.. autodecorator:: deprecated
```
ocrmypdf.hocrtransform
======================
## ocrmypdf.hocrtransform
```{eval-rst}
.. automodule:: ocrmypdf.hocrtransform
:members:
```
ocrmypdf.pdfa
=============
## ocrmypdf.pdfa
```{eval-rst}
.. automodule:: ocrmypdf.pdfa
:members:
```
ocrmypdf.quality
================
## ocrmypdf.quality
```{eval-rst}
.. automodule:: ocrmypdf.quality
:members:
```
ocrmypdf.subprocess
===================
## ocrmypdf.subprocess
```{eval-rst}
.. automodule:: ocrmypdf.subprocess
:members:
```


@@ -45,7 +45,7 @@ extensions = [
'sphinx_issues',
]
myst_enable_extensions = ['colon_fence', 'attrs_block', 'attrs_inline']
myst_enable_extensions = ['colon_fence', 'attrs_block', 'attrs_inline', 'substitution']
# Extension settings
intersphinx_mapping = {'python': ('https://docs.python.org/3', None)}


@@ -1,45 +1,43 @@
% SPDX-FileCopyrightText: 2025 James R. Barlow
% SPDX-License-Identifier: CC-BY-SA-4.0
Cookbook
========
# Cookbook
Basic examples
--------------
## Basic examples
### Help!
ocrmypdf has built-in help.
:::{code} bash
```bash
ocrmypdf --help
:::
```
### Add an OCR layer and convert to PDF/A
:::{code} bash
```bash
ocrmypdf input.pdf output.pdf
:::
```
### Add an OCR layer and output a standard PDF
:::{code} bash
```bash
ocrmypdf --output-type pdf input.pdf output.pdf
:::
```
### Create a PDF/A with all color and grayscale images converted to JPEG
:::{code} bash
```bash
ocrmypdf --output-type pdfa --pdfa-image-compression jpeg input.pdf output.pdf
:::
```
### Modify a file in place
The file will only be overwritten if OCRmyPDF is successful.
:::{code} bash
```bash
ocrmypdf myfile.pdf myfile.pdf
:::
```
### Correct page rotation
@@ -47,9 +45,9 @@ OCR will attempt to automatic correct the rotation of each page. This
can help fix a scanning job that contains a mix of landscape and
portrait pages.
:::{code} bash
```bash
ocrmypdf --rotate-pages myfile.pdf myfile.pdf
:::
```
You can increase (decrease) the parameter `--rotate-pages-threshold` to
make page rotation more (less) aggressive. The threshold number is the
@@ -70,10 +68,10 @@ angle is wrong.
OCRmyPDF assumes the document is in English unless told otherwise. OCR
quality may be poor if the wrong language is used.
:::{code} bash
```bash
ocrmypdf -l fra LeParisien.pdf LeParisien.pdf
ocrmypdf -l eng+fra Bilingual-English-French.pdf Bilingual-English-French.pdf
:::
```
Language packs must be installed for all languages specified. See
`Installing additional language packs <lang-packs>`{.interpreted-text
@@ -87,9 +85,9 @@ language when it is unknown.
This produces a file named \"output.pdf\" and a companion text file
named \"output.txt\".
:::{code} bash
```bash
ocrmypdf --sidecar output.txt input.pdf output.pdf
:::
```
:::{note}
The sidecar file contains the **OCR text** found by OCRmyPDF. If the
@@ -114,14 +112,14 @@ use a program like Poppler\'s `pdftotext` or `pdfgrep`.
If you are starting with images, you can just use Tesseract directly to
convert images to PDFs:
:::{code} bash
```bash
tesseract my-image.jpg output-prefix pdf
:::
```
:::{code} bash
```bash
# When there are multiple images
tesseract text-file-containing-list-of-image-filenames.txt output-prefix pdf
:::
```
Tesseract\'s PDF output is quite good -- OCRmyPDF uses it internally, in
some cases. However, OCRmyPDF has many features not available in
@@ -134,9 +132,9 @@ You can also use a program like
images to PDFs, and then pipe the results to run ocrmypdf. The `-` tells
ocrmypdf to read standard input.
:::{code} bash
```bash
img2pdf my-images*.jpg | ocrmypdf - myfile.pdf
:::
```
`img2pdf` is recommended because it does an excellent job at generating
PDFs without transcoding images.
@@ -148,9 +146,9 @@ own. If the resolution (dots per inch, DPI) of an image is not set or is
incorrect, it can be overridden with `--image-dpi`. (As 1 inch is 2.54
cm, 1 dpi = 0.39 dpcm).
:::{code} bash
```bash
ocrmypdf --image-dpi 300 image.png myfile.pdf
:::
```
If you have multiple images, you must use `img2pdf` to convert the
images to PDF.
@@ -161,8 +159,9 @@ We caution against using ImageMagick or Ghostscript to convert images to
PDF, since they may transcode images or produce downsampled images,
sometimes without warning.
Image processing
----------------
(image-processing)=
## Image processing
OCRmyPDF performs some image processing on each page of a PDF, if
desired. The same processing is applied to each page. It is suggested
@@ -200,18 +199,18 @@ should be visually reviewed after using these options.
Deskew:
:::{code} bash
```bash
ocrmypdf --deskew input.pdf output.pdf
:::
```
Image processing commands can be combined. The order in which options
are given does not matter. OCRmyPDF always applies the steps of the
image processing pipeline in the same order (rotate, remove background,
deskew, clean).
:::{code} bash
```bash
ocrmypdf --deskew --clean --rotate-pages input.pdf output.pdf
:::
```
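The fixed ordering described above can be pictured with a small sketch. This is not OCRmyPDF's implementation; `PIPELINE_ORDER` and `effective_order` are hypothetical names, and the step labels are simply those listed in the text.

```python
# Order stated in the text: rotate, remove background, deskew, clean
PIPELINE_ORDER = ("rotate", "remove background", "deskew", "clean")


def effective_order(requested):
    """Return the requested steps in pipeline order, ignoring argument order."""
    wanted = set(requested)
    return [step for step in PIPELINE_ORDER if step in wanted]
```

For instance, requesting deskew, clean and rotate in that order still yields rotate first, then deskew, then clean.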
Don\'t actually OCR my PDF
--------------------------
@@ -221,12 +220,11 @@ processing without performing OCR (by causing OCR to time out). This
works if all you want is to apply image processing or PDF/A
conversion.
:::{code} bash
```bash
ocrmypdf --tesseract-timeout=0 --remove-background input.pdf output.pdf
:::
```
::: {.versionchanged}
v14.1.0
:::{versionchanged} v14.1.0
Prior to this version, `--tesseract-timeout 0` would prevent other uses
of Tesseract, such as deskewing, from working. This is no longer the
@@ -239,9 +237,9 @@ non-OCR operations, if needed.
This is getting ridiculous, but OCRmyPDF can completely strip all textual
information from a PDF and reconstruct it as a \"bag of images\" PDF.
:::{code} bash
```bash
ocrmypdf --tesseract-timeout 0 --force-ocr input.pdf output.pdf
:::
```
Why would you want to do this? Perhaps you have a PDF where OCR fails to
produce useful results, and just want to get rid of all OCR information.
@@ -251,18 +249,18 @@ This command also removes OCR generated by third party tools.
You can also optimize all images without performing any OCR:
:::{code} bash
```bash
ocrmypdf --tesseract-timeout=0 --optimize 3 --skip-text input.pdf output.pdf
:::
```
### Process only certain pages
You can ask OCRmyPDF to only apply [image processing](#image-processing)
and OCR to certain pages.
:::{code} bash
```bash
ocrmypdf --pages 2,3,13-17 input.pdf output.pdf
:::
```
Hyphens denote a range of pages and commas separate page numbers. If you
prefer to use spaces, quote all of the page numbers:
@@ -281,9 +279,9 @@ those options. Both of these steps are \"whole file\" operations. In
this example, we want to OCR only the title and otherwise change the PDF
as little as possible:
:::{code} bash
```bash
ocrmypdf --pages 1 --output-type pdf --optimize 0 input.pdf output.pdf
:::
```
Redo existing OCR
-----------------
@@ -297,9 +295,9 @@ This may be helpful for users who want to take advantage of accuracy
improvements in Tesseract for files they previously OCRed with an
earlier version of Tesseract and OCRmyPDF.
:::{code} bash
```bash
ocrmypdf --redo-ocr input.pdf output.pdf
:::
```
This method will replace OCR without rasterizing, reducing quality or
removing vector content. If a file contains a mix of pure digital text
@@ -351,18 +349,18 @@ header-rows: 1
* - Level
- Comments
* - ``--optimize=0``
* - <nobr>``--optimize=0``</nobr>
- Disables optimization.
* - ``--optimize 1``
* - <nobr>``--optimize 1``</nobr>
- Enables lossless optimizations, such as transcoding images to more
efficient formats. Also compress other uncompressed objects in the
PDF and enables the more efficient "object streams" within the PDF.
(If ``--jbig2-lossy`` is issued, then lossy JBIG2 optimization is used.
The decision to use lossy JBIG2 is separate from standard optimization
settings.)
* - ``--optimize 2``
* - <nobr>``--optimize 2``</nobr>
- All of the above, and enables lossy optimizations and color quantization.
* - ``--optimize 3``
* - <nobr>``--optimize 3``</nobr>
- All of the above, and enables more aggressive optimizations and targets lower image quality.
:::
@@ -376,9 +374,9 @@ inefficient compression modes to more modern versions. A program like
`qpdf` can be used to change encodings, e.g. to inspect the internals
for a PDF.
:::{code} bash
```bash
ocrmypdf --optimize 3 in.pdf out.pdf # Make it small
:::
```
Some users may consider enabling lossy JBIG2. See:
`jbig2-lossy`{.interpreted-text role="ref"}.

57
docs/index.md Normal file

@@ -0,0 +1,57 @@
% SPDX-FileCopyrightText: 2022 James R. Barlow
% SPDX-License-Identifier: CC-BY-SA-4.0
# OCRmyPDF documentation
:::{figure} images/logo.svg
:::
OCRmyPDF adds an optical character recognition (OCR) text layer to scanned PDF
files, allowing them to be searched.
PDF is the best format for storing and exchanging scanned documents.
Unfortunately, PDFs can be difficult to modify. OCRmyPDF makes it easy to apply
image processing and OCR (recognized, searchable text) to existing PDFs.
```{toctree}
:maxdepth: 1
introduction
release_notes
installation
languages
jbig2
```
```{toctree}
:caption: Usage
:maxdepth: 2
cookbook
optimizer
docker
advanced
batch
cloud
performance
pdfsecurity
errors
```
```{toctree}
:caption: Developers
:maxdepth: 2
api
plugins
apiref
design_notes
contributing
maintainers
```
# Indices and tables
- {ref}`genindex`
- {ref}`modindex`
- {ref}`search`


@@ -1,56 +0,0 @@
.. SPDX-FileCopyrightText: 2022 James R. Barlow
..
.. SPDX-License-Identifier: CC-BY-SA-4.0
OCRmyPDF documentation
======================
.. figure:: images/logo.svg
OCRmyPDF adds an optical character recognition (OCR) text layer to scanned PDF
files, allowing them to be searched.
PDF is the best format for storing and exchanging scanned documents.
Unfortunately, PDFs can be difficult to modify. OCRmyPDF makes it easy to apply
image processing and OCR (recognized, searchable text) to existing PDFs.
.. toctree::
:maxdepth: 1
introduction
release_notes
installation
languages
jbig2
.. toctree::
:caption: Usage
:maxdepth: 2
cookbook
optimizer
docker
advanced
batch
cloud
performance
pdfsecurity
errors
.. toctree::
:caption: Developers
:maxdepth: 2
api
plugins
apiref
design_notes
contributing
maintainers
Indices and tables
==================
* :ref:`genindex`
* :ref:`modindex`
* :ref:`search`

730
docs/installation.md Normal file

@@ -0,0 +1,730 @@
---
myst:
substitutions:
deb_11: |-
:::{image} https://repology.org/badge/version-for-repo/debian_11/ocrmypdf.svg
:alt: Debian 11
:::
deb_12: |-
:::{image} https://repology.org/badge/version-for-repo/debian_12/ocrmypdf.svg
:alt: Debian 12
:::
deb_unstable: |-
:::{image} https://repology.org/badge/version-for-repo/debian_unstable/ocrmypdf.svg
:alt: Debian unstable
:::
fedora_38: |-
:::{image} https://repology.org/badge/version-for-repo/fedora_38/ocrmypdf.svg
:alt: Fedora 38
:::
fedora_39: |-
:::{image} https://repology.org/badge/version-for-repo/fedora_39/ocrmypdf.svg
:alt: Fedora 39
:::
fedora_rawhide: |-
:::{image} https://repology.org/badge/version-for-repo/fedora_rawhide/ocrmypdf.svg
:alt: Fedora Rawhide
:::
latest: |-
:::{image} https://img.shields.io/pypi/v/ocrmypdf.svg
:alt: OCRmyPDF latest released version on PyPI
:::
ubu_2004: |-
:::{image} https://repology.org/badge/version-for-repo/ubuntu_20_04/ocrmypdf.svg
:alt: Ubuntu 20.04 LTS
:::
ubu_2204: |-
:::{image} https://repology.org/badge/version-for-repo/ubuntu_22_04/ocrmypdf.svg
:alt: Ubuntu 22.04 LTS
:::
---
% SPDX-FileCopyrightText: 2022 James R. Barlow
% SPDX-License-Identifier: CC-BY-SA-4.0
# Installing OCRmyPDF
(latest)=
The easiest way to install OCRmyPDF is to follow the steps for your operating
system/platform. This version may be out of date, however.
These platforms have one-liner installs:
:::{list-table}
:header-rows: 0
* - Debian, Ubuntu
- ``apt install ocrmypdf``
* - Windows Subsystem for Linux
- ``apt install ocrmypdf``
* - Fedora
- ``dnf install ocrmypdf tesseract-osd``
* - macOS (Homebrew)
- ``brew install ocrmypdf``
* - macOS (MacPorts)
- ``port install ocrmypdf``
* - LinuxBrew
- ``brew install ocrmypdf``
* - FreeBSD
- ``pkg install textproc/py-ocrmypdf``
* - Snap (snapcraft packaging)
- ``snap install ocrmypdf``
:::
More detailed procedures are outlined below. If you want to do a manual
install, or install a more recent version than your platform provides, read on.
:::{contents} Platform-specific steps
:depth: 2
:local: true
:::
## Installing on Linux
### Debian and Ubuntu 20.04 or newer
:::{list-table}
:header-rows: 1
* - OCRmyPDF versions in Debian & Ubuntu
* - {{ latest }}
* - {{ deb_11 }} {{ deb_12 }} {{ deb_unstable }}
* - {{ ubu_2004 }} {{ ubu_2204 }}
:::
Users of Debian or Ubuntu may simply
```bash
apt install ocrmypdf
```
As indicated in the table above, Debian and Ubuntu releases may lag
behind the latest version. If the version available for your platform is
out of date, you could opt to install the latest version from source.
See [Installing HEAD revision from
sources](#installing-head-revision-from-sources).
For full details on version availability for your platform, check the
[Debian Package Tracker](https://tracker.debian.org/pkg/ocrmypdf) or
[Ubuntu launchpad.net](https://launchpad.net/ocrmypdf).
:::{note}
OCRmyPDF for Debian and Ubuntu currently omits the JBIG2 encoder.
OCRmyPDF works fine without it but will produce larger output files.
If you build jbig2enc from source, ocrmypdf will
automatically detect it (specifically the `jbig2` binary) on the
`PATH`. To add JBIG2 encoding, see {ref}`jbig2`.
:::
### Fedora
:::{list-table}
:header-rows: 1
* - OCRmyPDF version
* - {{latest}}
* - {{fedora_38}} {{fedora_39}} {{fedora_rawhide}}
:::
Users of Fedora may simply
```bash
dnf install ocrmypdf tesseract-osd
```
For full details on version availability, check the [Fedora Package
Tracker](https://packages.fedoraproject.org/pkgs/ocrmypdf/ocrmypdf/).
If the version available for your platform is out of date, you could opt
to install the latest version from source. See [Installing HEAD revision
from sources](#installing-head-revision-from-sources).
:::{note}
OCRmyPDF for Fedora currently omits the JBIG2 encoder due to patent
issues. OCRmyPDF works fine without it but will produce larger output
files. If you build jbig2enc from source, ocrmypdf 7.0.0 and later
will automatically detect it on the `PATH`. To add JBIG2 encoding,
see {ref}`Installing the JBIG2 encoder <jbig2>`.
:::
(ubuntu-lts-latest)=
### RHEL 9
Prepare the environment by getting Python 3.11:
```bash
dnf install python3.11 python3.11-pip
```
Then, follow [Requirements for pip and HEAD install](#requirements-for-pip-and-head-install) to install dependencies:
```bash
dnf install ghostscript tesseract
```
and build ocrmypdf in a virtual environment:
```bash
python3.11 -m venv .venv
```
To add JBIG2 encoding, see {ref}`Installing the JBIG2 encoder <jbig2>`.
Note that Fedora packages for language data haven't been branched for RHEL/EPEL, but you can get traineddata files directly from [tesseract](https://github.com/tesseract-ocr/tessdata/) and place them in `/usr/share/tesseract/tessdata`.
### Installing the latest version on Ubuntu 22.04 LTS
Ubuntu 22.04 includes ocrmypdf 13.4.0 - you can install that with
`apt install ocrmypdf`. To install a more recent version for the current
user, follow these steps:
```bash
sudo apt-get update
sudo apt-get -y install ocrmypdf python3-pip
pip install --user --upgrade ocrmypdf
```
If you get the message `WARNING: The script ocrmypdf is installed in
'/home/$USER/.local/bin' which is not on PATH.`, you may need to re-login
or open a new shell, or manually adjust your PATH.
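If it is unclear whether the directory from that warning is on your `PATH`, a quick check along the following lines may help. The helper `dir_on_path` is a hypothetical name for this sketch; it only compares normalized path strings and does not resolve symlinks.

```python
import os


def dir_on_path(directory, path_value=None):
    """Return True if `directory` appears as an entry in PATH."""
    if path_value is None:
        path_value = os.environ.get("PATH", "")
    target = os.path.normpath(os.path.expanduser(directory))
    entries = (
        os.path.normpath(os.path.expanduser(entry))
        for entry in path_value.split(os.pathsep)
        if entry  # skip empty entries
    )
    return target in entries
```

Calling `dir_on_path("~/.local/bin")` in a fresh shell indicates whether a re-login or a manual `PATH` change is still needed.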
To add JBIG2 encoding, see {ref}`jbig2`.
### Ubuntu 20.04 LTS
Ubuntu 20.04 includes ocrmypdf 9.6.0 - you can install that with `apt`. The
most convenient way to install recent OCRmyPDF on older Ubuntu is to use
Homebrew on Linux (Linuxbrew).
```bash
brew install ocrmypdf
```
### Arch Linux (AUR)
:::{image} https://repology.org/badge/version-for-repo/aur/ocrmypdf.svg
:alt: ArchLinux
:target: https://repology.org/metapackage/ocrmypdf
:::
There is an [Arch User Repository (AUR) package for OCRmyPDF](https://aur.archlinux.org/packages/ocrmypdf/).
Installing AUR packages as root is not allowed, so you must first [set up a
non-root user](https://wiki.archlinux.org/index.php/Users_and_groups#User_management) and
[configure sudo](https://wiki.archlinux.org/index.php/Sudo#Configuration).
The standard Docker image, `archlinux/base:latest`, does **not** have a
non-root user configured, so users of that image must follow these guides. If
you are using a VM image, such as [the official Vagrant image](https://app.vagrantup.com/archlinux/boxes/archlinux), this work may already
be completed for you.
Next you should install the [base-devel package group](https://archlinux.org/packages/core/any/base-devel/). This includes the
standard tooling needed to build packages, such as a compiler and binary tools.
```bash
sudo pacman -S --needed base-devel
```
Now you are ready to install the OCRmyPDF package.
```bash
curl -O https://aur.archlinux.org/cgit/aur.git/snapshot/ocrmypdf.tar.gz
tar xvzf ocrmypdf.tar.gz
cd ocrmypdf
makepkg -sri
```
At this point you will have a working install of OCRmyPDF, but the Tesseract
install won't include any OCR language data. You can install [the
tesseract-data package group](https://www.archlinux.org/groups/any/tesseract-data/) to add all supported
languages, or use that package listing to identify the appropriate package for
your desired language.
```bash
sudo pacman -S tesseract-data-eng
```
As an alternative to this manual procedure, consider using an [AUR helper](https://wiki.archlinux.org/index.php/AUR_helpers). Such a tool will
automatically fetch, build and install the AUR package, resolve dependencies
(including dependencies on AUR packages), and ease the upgrade procedure.
If you have any difficulties with installation, check the repository package
page.
:::{note}
The OCRmyPDF AUR package currently omits the JBIG2 encoder. OCRmyPDF works
fine without it but will produce larger output files. The encoder is
available from [the jbig2enc-git AUR package](https://aur.archlinux.org/packages/jbig2enc-git/) and may be installed
using the same series of steps as for the installation of the OCRmyPDF AUR
package. Alternatively, it may be built manually from source following the
instructions in {ref}`Installing the JBIG2 encoder <jbig2>`. If JBIG2 is
installed, OCRmyPDF 7.0.0 and later will automatically detect it.
:::
### Alpine Linux
:::{image} https://repology.org/badge/version-for-repo/alpine_edge/ocrmypdf.svg
:alt: Alpine Linux
:target: https://repology.org/metapackage/ocrmypdf
:::
To install OCRmyPDF for Alpine Linux:
```bash
apk add ocrmypdf
```
### Gentoo Linux
:::{image} https://repology.org/badge/version-for-repo/gentoo_ovl_guru/ocrmypdf.svg
:alt: Gentoo Linux
:target: https://repology.org/metapackage/ocrmypdf
:::
To install OCRmyPDF on Gentoo Linux, use the following commands:
```bash
eselect repository enable guru
emaint sync --repo guru
emerge --ask app-text/OCRmyPDF
```
### Other Linux packages
See the
[Repology](https://repology.org/metapackage/ocrmypdf/versions) page.
In general, first install the OCRmyPDF package for your system, then
optionally use the procedure [Installing with Python
pip](#installing-with-python-pip) to install a more recent version.
## Installing on macOS
### Homebrew
:::{image} https://img.shields.io/homebrew/v/ocrmypdf.svg
:alt: homebrew
:target: https://formulae.brew.sh/formula/ocrmypdf
:::
OCRmyPDF is now a standard [Homebrew](https://brew.sh) formula. To
install on macOS:
```bash
brew install ocrmypdf
```
This will include only the English language pack. If you need other
languages you can optionally install them all:
```bash
brew install tesseract-lang # Optional: Install all language packs
```
### MacPorts
:::{image} https://img.shields.io/badge/dynamic/json?url=https%3A%2F%2Fports.macports.org%2Fapi%2Fv1%2Fports%2Focrmypdf%2F%3Fformat%3Djson&query=version&label=MacPorts
:alt: Macports Version Information
:target: https://ports.macports.org/port/ocrmypdf
:::
OCRmyPDF is included in MacPorts:
```bash
sudo port install ocrmypdf
```
Note that while this will install tesseract you will need to install
the appropriate tesseract [language ports](https://ports.macports.org/search/?selected_facets=categories_exact%3Atextproc&installed_file=&q=tesseract&name=on).
### Manual installation on macOS
These instructions probably work on all macOS versions supported by Homebrew, and are
for installing a more current version of OCRmyPDF than is available from
Homebrew. Note that the Homebrew versions usually track the release versions
fairly closely.
If it's not already present, [install Homebrew](http://brew.sh/).
Update Homebrew:
```bash
brew update
```
Install or upgrade the required Homebrew packages, if any are missing.
To do this, use `brew edit ocrmypdf` to obtain a recent list of Homebrew
dependencies. You could also check the `.workflows/build.yml`.
This will include the English, French, German and Spanish language
packs. If you need other languages you can optionally install them all:
(macos-all-languages)=
```bash
brew install tesseract-lang # Option 2: for all language packs
```
Update the Homebrew pip:
```bash
pip install --upgrade pip
```
You can then install OCRmyPDF from PyPI for the current user:
```bash
pip install --user ocrmypdf
```
The command line program should now be available:
```bash
ocrmypdf --help
```
## Installing on Windows
### Native Windows
% If you have a Windows that is not the Home edition, you can use Windows Sandbox to test on a blank Windows instance.
% https://learn.microsoft.com/en-us/windows/security/application-security/application-isolation/windows-sandbox/
:::{note}
Administrator privileges will be required for some of these steps.
:::
You must install the following for Windows:
- Python 64-bit
- Tesseract 64-bit
- Ghostscript 64-bit
Using the [winget](https://docs.microsoft.com/en-us/windows/package-manager/winget/)
package manager:
- `winget install -e --id Python.Python.3.11`
- `winget install -e --id UB-Mannheim.TesseractOCR`
You will need to install Ghostscript manually, [since it does not support automated
installs anymore](https://artifex.com/news/ghostscript-10.01.0-disabling-silent-install-option).
- [Ghostscript download page](https://ghostscript.com/releases/gsdnld.html).
(Or alternately, using the [Chocolatey](https://chocolatey.org/) package manager, install
the following when running in an Administrator command prompt):
- `choco install python3`
- `choco install --pre tesseract`
- `choco install pngquant` (optional)
Either set of commands will install the required software. At the moment there is no
single command to install all of the Windows dependencies.
You may then use `pip` to install ocrmypdf. (This can be performed by a user or
Administrator.):
- `python3 -m pip install ocrmypdf`
% The Windows Python versions do not place any python or python3 executable in the path.
% They add the py launcher to the path:
% https://docs.python.org/3/using/windows.html#python-launcher-for-windows
If you installed Python using WinGet, then use the following command instead:
- `py -m pip install ocrmypdf`
and use the following to start OCRmyPDF:
- `py -m ocrmypdf`
If you intend to use more Python software on your Windows machine, consider the use of
[pipx](https://pipx.pypa.io/stable/) or a similar tool to create isolated Python
environments for each Python software that you want to use.
OCRmyPDF will check the Windows Registry and standard locations in your Program Files
for third party software it needs (specifically, Tesseract and Ghostscript). To
override the versions OCRmyPDF selects, you can modify the `PATH` environment
variable. [Follow these directions](https://www.computerhope.com/issues/ch000549.htm#dospath)
to change the PATH.
:::{warning}
As of early 2021, users have reported problems with the Microsoft Store version of
Python and OCRmyPDF. These issues affect many other third party Python packages.
Please download Python from Python.org or a package manager instead of the
Microsoft Store version.
:::
:::{warning}
32-bit Windows is not supported.
:::
### Windows Subsystem for Linux
1. Install Ubuntu 22.04 for Windows Subsystem for Linux, if not already installed.
2. Follow the procedure to install {ref}`OCRmyPDF on Ubuntu 22.04 <ubuntu-lts-latest>`.
3. Open the Windows command prompt and create a symlink:
```powershell
wsl sudo ln -s /home/$USER/.local/bin/ocrmypdf /usr/local/bin/ocrmypdf
```
Then confirm that the expected version from PyPI ({{ latest }}) is installed:
```powershell
wsl ocrmypdf --version
```
You can then run OCRmyPDF in the Windows command prompt or Powershell, prefixing
`wsl`, and call it from Windows programs or batch files.
### Cygwin64
First install the following prerequisite Cygwin packages using `setup-x86_64.exe`:
```
python310 (or later)
python3?-devel
python3?-pip
python3?-lxml
python3?-imaging
(where 3? means match the version of python3 you installed)
gcc-g++
ghostscript
libexempi3
libexempi-devel
libffi6
libffi-devel
pngquant
qpdf
libqpdf-devel
tesseract-ocr
tesseract-ocr-devel
```
Then open a Cygwin terminal (i.e. `mintty`) and run the following commands. Note
that if you are using the version of `pip` that was installed with the Cygwin
Python package, the command name will be `pip3`. If you have since updated
`pip` (with, for instance, `pip3 install --upgrade pip`), then the command is
likely just `pip` instead of `pip3`:
```bash
pip3 install wheel
pip3 install ocrmypdf
```
The optional dependency "unpaper" is currently not available under Cygwin.
Without it, certain options such as `--clean` will produce an error message.
However, the OCR-to-text-layer functionality is available.
### Docker
You can also [Install the Docker image](docker) on Windows. Ensure that
your command prompt can run the docker "hello world" container.
## Installing on FreeBSD
:::{image} https://repology.org/badge/version-for-repo/freebsd/ocrmypdf.svg
:alt: FreeBSD
:target: https://repology.org/project/ocrmypdf/versions
:::
```bash
pkg install textproc/py-ocrmypdf
```
To install a more recent version, you could attempt to first install the system
version with `pkg`, then use `pip install --user ocrmypdf`.
## Installing the Docker image
For some users, installing the Docker image will be easier than
installing all of OCRmyPDF's dependencies.
See [Installing the Docker image](docker) for more information.
(installing-with-python-pip)=
## Installing with Python pip
OCRmyPDF is delivered by PyPI because it is a convenient way to install
the latest version. However, PyPI and `pip` cannot address the fact
that `ocrmypdf` depends on certain non-Python system libraries and
programs being installed.
For best results, first install [your platform's
version](https://repology.org/metapackage/ocrmypdf/versions) of
`ocrmypdf`, using the instructions elsewhere in this document. Then
you can use `pip` to get the latest version if your platform version
is out of date. Chances are that this will satisfy most dependencies.
Use `ocrmypdf --version` to confirm what version was installed.
Then you can install the latest OCRmyPDF from the Python wheels. First
try:
```bash
pip install --user ocrmypdf
```
(If the message appears `Requirement already satisfied: ocrmypdf in...`,
you will need to use `pip install --user --upgrade ocrmypdf`.)
You should then be able to run `ocrmypdf --version` and see that the
latest version was located.
## Installing with pipx
Some users may prefer pipx. As with the method above, you will need to
satisfy all non-Python dependencies. Then if pipx is installed, you
can use
```bash
pipx run ocrmypdf
```
(If OCRmyPDF is not already available, `pipx run` will download it into a
temporary environment first.)
(requirements-for-pip-and-head-install)=
### Requirements for pip and HEAD install
OCRmyPDF currently requires these external programs and libraries to be
installed. They must be provided by the operating system package
manager; `pip` cannot provide them.
The following versions are required:
- Python 3.10 or newer
- Ghostscript 9.54 or newer
- Tesseract 4.1.1 or newer
- jbig2enc 0.29 or newer
- pngquant 2.5 or newer
- unpaper 6.1
We recommend 64-bit versions of all software. (32-bit versions are not
supported, although on Linux, they may still work.)
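To check an installed version against the minimums listed above, a small shell helper can be enough. This is only a sketch: `version_ge` is a hypothetical helper name, and it assumes a `sort` that supports version sort (`-V`) and quiet check (`-C`), as GNU coreutils does. The version string `5.3.0` below is an illustrative value, not the output of a real query.

```bash
# version_ge HAVE WANT: succeed if version HAVE is at least version WANT.
# Assumes sort(1) supports -V (version sort) and -C (check order quietly).
version_ge() {
    # If "WANT, HAVE" is already in ascending version order, HAVE >= WANT.
    printf '%s\n%s\n' "$2" "$1" | sort -V -C
}

# Example: compare an illustrative Tesseract version with the 4.1.1 minimum.
if version_ge "5.3.0" "4.1.1"; then
    echo "Tesseract is new enough"
else
    echo "Tesseract is too old"
fi
```

The same helper can be reused for the other minimums, for example `version_ge "$installed_gs" 9.54`, where `installed_gs` is whatever version string you obtained from your package manager.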
jbig2enc, pngquant, and unpaper are optional. If they are missing, certain
features are disabled. OCRmyPDF will discover them as soon as they are
available.
**jbig2enc**, if present, will be used to optimize the encoding of
monochrome images. This can significantly reduce the file size of the
output file. It is not required.
[jbig2enc](https://github.com/agl/jbig2enc) is not generally
available for Ubuntu or Debian due to lingering concerns about patent
issues, but can easily be built from source. To add JBIG2 encoding, see
{ref}`jbig2`.
**pngquant**, if present, is optionally used to optimize the encoding of
PNG-style images in PDFs (actually, any that are losslessly
encoded) by lossily quantizing to a smaller color palette. It is only
activated when the `--optimize` argument is `2` or `3`.
**unpaper**, if present, enables the `--clean` and `--clean-final`
command line options.
These are in addition to the Python packaging dependencies, meaning that
unfortunately, the `pip install` command cannot satisfy all of them.
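Before running `pip install`, it may help to see which of the external programs are already on your `PATH`. The following is a minimal sketch; `tool_status` is a hypothetical helper, and the binary names are assumptions (`gs` for Ghostscript and `jbig2` for jbig2enc), so adjust them if your platform names them differently.

```bash
# tool_status NAME: report whether a program called NAME is on the PATH.
tool_status() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1: found"
    else
        echo "$1: missing"
    fi
}

# Required programs (binary names are assumptions, see above).
for tool in gs tesseract; do
    tool_status "$tool"
done

# Optional programs; OCRmyPDF disables the related features if they are absent.
for tool in jbig2 pngquant unpaper; do
    tool_status "$tool"
done
```

This only confirms that a program exists, not that its version meets the minimums listed in this section.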
(installing-head-revision-from-sources)=
## Installing HEAD revision from sources
If you have `git` and Python 3.10 or newer installed, you can install
from source. When the `pip` installer runs, it will alert you if
dependencies are missing.
If you prefer to build everything from source, you will need to [build
pikepdf from
source](https://pikepdf.readthedocs.io/en/latest/installation.html#building-from-source).
First ensure you can build and install pikepdf.
To install the HEAD revision from sources in the current Python 3
environment:
```bash
pip install git+https://github.com/ocrmypdf/OCRmyPDF.git
```
Or, to install in editable mode
allowing customization of OCRmyPDF, use the `-e` flag:
```bash
pip install -e "git+https://github.com/ocrmypdf/OCRmyPDF.git#egg=ocrmypdf"
```
You may find it easiest to install in a virtual environment, rather than
system-wide:
```bash
git clone -b main https://github.com/ocrmypdf/OCRmyPDF.git
python3 -m venv .venv
source .venv/bin/activate
cd OCRmyPDF
pip install .
```
However, `ocrmypdf` will only be accessible on the system PATH when
you activate the virtual environment.
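If `ocrmypdf` is unexpectedly "not found", a quick check is whether a virtual environment is active in the current shell. The sketch below relies on the `VIRTUAL_ENV` variable that the standard `activate` script exports; `venv_active` is a hypothetical helper name.

```bash
# venv_active: succeed if a Python virtual environment is active in this shell.
# Relies on VIRTUAL_ENV, which the venv "activate" script exports.
venv_active() {
    [ -n "${VIRTUAL_ENV:-}" ]
}

if venv_active; then
    echo "virtual environment active: $VIRTUAL_ENV"
else
    echo "no virtual environment active; run: source .venv/bin/activate"
fi
```

Alternatively, the program can usually be called by its path inside the environment (for the layout above, `.venv/bin/ocrmypdf`) without activating it.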
To run the program:
```bash
ocrmypdf --help
```
If any dependencies are not yet installed, the program will notify you about
them. It requires specific versions of its
dependencies; versions older than the ones mentioned in the release notes
are unlikely to be compatible with OCRmyPDF.
### For development
To install all of the development and test requirements:
```bash
git clone -b main https://github.com/ocrmypdf/OCRmyPDF.git
python -m venv .venv
source .venv/bin/activate
cd OCRmyPDF
pip install -e .[test]
```
To add JBIG2 encoding, see {ref}`jbig2`.
## Shell completions
Completions for `bash` and `fish` are available in the project's
`misc/completion` folder. The `bash` completions are likely `zsh`
compatible but this has not been confirmed. Package maintainers, please
install these at the appropriate locations for your system.
To manually install the `bash` completion, copy
`misc/completion/ocrmypdf.bash` to `/etc/bash_completion.d/ocrmypdf`
(rename the file).
To manually install the `fish` completion, copy
`misc/completion/ocrmypdf.fish` to
`~/.config/fish/completions/ocrmypdf.fish`.
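The manual `fish` step can be sketched as a small function. `install_fish_completion` is a hypothetical helper, not part of OCRmyPDF; it assumes you have a local clone of the repository and takes an optional destination directory, defaulting to the location given above.

```bash
# install_fish_completion REPO_DIR [DEST_DIR]
# Copy the fish completion from a local OCRmyPDF clone into DEST_DIR
# (default: ~/.config/fish/completions), creating the directory if needed.
install_fish_completion() {
    local repo_dir=$1
    local dest_dir=${2:-"$HOME/.config/fish/completions"}
    mkdir -p "$dest_dir" &&
        cp "$repo_dir/misc/completion/ocrmypdf.fish" "$dest_dir/ocrmypdf.fish"
}

# Usage, from a directory that contains a clone named OCRmyPDF:
#   install_fish_completion OCRmyPDF
#
# The bash completion goes to a system directory, so it typically needs root:
#   sudo cp OCRmyPDF/misc/completion/ocrmypdf.bash /etc/bash_completion.d/ocrmypdf
```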
## Note on 32-bit support
Many Python libraries no longer provide 32-bit binary wheels for Linux. This
includes many of the libraries that OCRmyPDF depends on, such as
Pillow. The easiest way to express this to end users is to say we don't
support 32-bit Linux.
However, if your Linux distribution still supports 32-bit binaries, you
can still install and use OCRmyPDF. A warning message will appear.
In practice, OCRmyPDF may need more than 32-bit memory space to run when
large documents are processed, so there are practical limitations to what
users can accomplish with it. Still, for the common use case of a 32-bit
ARM NAS or Raspberry Pi processing small documents, it should work.
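If you are unsure whether your system is 32-bit or 64-bit, the word size of the userland can usually be queried with `getconf`. This is a sketch that assumes `getconf LONG_BIT` is available, as it is on common Linux distributions; `userland_bits` is a hypothetical helper name.

```bash
# userland_bits: print the word size (in bits) of the current userland.
# Assumes getconf supports LONG_BIT.
userland_bits() {
    getconf LONG_BIT
}

case "$(userland_bits)" in
    64) echo "64-bit userland: supported" ;;
    32) echo "32-bit userland: unsupported, but may work for small documents" ;;
    *)  echo "could not determine word size" ;;
esac
```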

View File

@@ -1,740 +0,0 @@
.. SPDX-FileCopyrightText: 2022 James R. Barlow
..
.. SPDX-License-Identifier: CC-BY-SA-4.0
===================
Installing OCRmyPDF
===================
.. |latest| image:: https://img.shields.io/pypi/v/ocrmypdf.svg
:alt: OCRmyPDF latest released version on PyPI
|latest|
The easiest way to install OCRmyPDF is to follow the steps for your operating
system/platform. This version may be out of date, however.
These platforms have one-liner installs:
+-------------------------------+-----------------------------------------+
| Debian, Ubuntu | ``apt install ocrmypdf`` |
+-------------------------------+-----------------------------------------+
| Windows Subsystem for Linux | ``apt install ocrmypdf`` |
+-------------------------------+-----------------------------------------+
| Fedora | ``dnf install ocrmypdf tesseract-osd`` |
+-------------------------------+-----------------------------------------+
| macOS (Homebrew) | ``brew install ocrmypdf`` |
+-------------------------------+-----------------------------------------+
| macOS (MacPorts) | ``port install ocrmypdf`` |
+-------------------------------+-----------------------------------------+
| LinuxBrew | ``brew install ocrmypdf`` |
+-------------------------------+-----------------------------------------+
| FreeBSD | ``pkg install textproc/py-ocrmypdf`` |
+-------------------------------+-----------------------------------------+
| Snap (snapcraft packaging) | ``snap install ocrmypdf`` |
+-------------------------------+-----------------------------------------+
More detailed procedures are outlined below. If you want to do a manual
install, or install a more recent version than your platform provides, read on.
.. contents:: Platform-specific steps
:depth: 2
:local:
Installing on Linux
===================
Debian and Ubuntu 20.04 or newer
--------------------------------
.. |deb-11| image:: https://repology.org/badge/version-for-repo/debian_11/ocrmypdf.svg
:alt: Debian 11
.. |deb-12| image:: https://repology.org/badge/version-for-repo/debian_12/ocrmypdf.svg
:alt: Debian 12
.. |deb-unstable| image:: https://repology.org/badge/version-for-repo/debian_unstable/ocrmypdf.svg
:alt: Debian unstable
.. |ubu-2004| image:: https://repology.org/badge/version-for-repo/ubuntu_20_04/ocrmypdf.svg
:alt: Ubuntu 20.04 LTS
.. |ubu-2204| image:: https://repology.org/badge/version-for-repo/ubuntu_22_04/ocrmypdf.svg
:alt: Ubuntu 22.04 LTS
+-----------------------------------------------+
| **OCRmyPDF versions in Debian & Ubuntu** |
+-----------------------------------------------+
| |latest| |
+-----------------------------------------------+
| |deb-11| |deb-12| |deb-unstable| |
+-----------------------------------------------+
| |ubu-2004| |ubu-2204| |
+-----------------------------------------------+
Users of Debian or Ubuntu may simply
.. code-block:: bash
apt install ocrmypdf
As indicated in the table above, Debian and Ubuntu releases may lag
behind the latest version. If the version available for your platform is
out of date, you could opt to install the latest version from source.
See `Installing HEAD revision from
sources <#installing-head-revision-from-sources>`__.
For full details on version availability for your platform, check the
`Debian Package Tracker <https://tracker.debian.org/pkg/ocrmypdf>`__ or
`Ubuntu launchpad.net <https://launchpad.net/ocrmypdf>`__.
.. note::
OCRmyPDF for Debian and Ubuntu currently omits the JBIG2 encoder.
OCRmyPDF works fine without it but will produce larger output files.
If you build jbig2enc from source, ocrmypdf will
automatically detect it (specifically the ``jbig2`` binary) on the
``PATH``. To add JBIG2 encoding, see :ref:`jbig2`.
Fedora
------
.. |fedora-38| image:: https://repology.org/badge/version-for-repo/fedora_38/ocrmypdf.svg
:alt: Fedora 38
.. |fedora-39| image:: https://repology.org/badge/version-for-repo/fedora_39/ocrmypdf.svg
:alt: Fedora 39
.. |fedora-rawhide| image:: https://repology.org/badge/version-for-repo/fedora_rawhide/ocrmypdf.svg
:alt: Fedora Rawhide
+-----------------------------------------------+
| **OCRmyPDF version** |
+-----------------------------------------------+
| |latest| |
+-----------------------------------------------+
| |fedora-38| |fedora-39| |fedora-rawhide| |
+-----------------------------------------------+
Users of Fedora may simply
.. code-block:: bash
dnf install ocrmypdf tesseract-osd
For full details on version availability, check the `Fedora Package
Tracker <https://packages.fedoraproject.org/pkgs/ocrmypdf/ocrmypdf/>`__.
If the version available for your platform is out of date, you could opt
to install the latest version from source. See `Installing HEAD revision
from sources <#installing-head-revision-from-sources>`__.
.. note::
OCRmyPDF for Fedora currently omits the JBIG2 encoder due to patent
issues. OCRmyPDF works fine without it but will produce larger output
files. If you build jbig2enc from source, ocrmypdf 7.0.0 and later
will automatically detect it on the ``PATH``. To add JBIG2 encoding,
see :ref:`Installing the JBIG2 encoder <jbig2>`.
.. _ubuntu-lts-latest:
RHEL 9
------
Prepare the environment by getting Python 3.11:
.. code-block:: bash
dnf install python3.11 python3.11-pip
Then, follow `Requirements for pip and HEAD install <#requirements-for-pip-and-head-install>`__ to install dependencies:
.. code-block:: bash
dnf install ghostscript tesseract
and build ocrmypdf in virtual environment:
.. code-block:: bash
python3.11 -m venv .venv
To add JBIG2 encoding, see :ref:`Installing the JBIG2 encoder <jbig2>`.
Note Fedora packages for language data haven't been branched for RHEL/EPEL, but you can get traineddata files directly from `tesseract
<https://github.com/tesseract-ocr/tessdata/>`__ and place them in ``/usr/share/tesseract/tessdata``.
Installing the latest version on Ubuntu 22.04 LTS
-------------------------------------------------
Ubuntu 22.04 includes ocrmypdf 13.4.0 - you can install that with
``apt install ocrmypdf``. To install a more recent version for the current
user, follow these steps:
.. code-block:: bash
sudo apt-get update
sudo apt-get -y install ocrmypdf python3-pip
pip install --user --upgrade ocrmypdf
If you get the message ``WARNING: The script ocrmypdf is installed in
'/home/$USER/.local/bin' which is not on PATH.``, you may need to re-login
or open a new shell, or manually adjust your PATH.
To add JBIG2 encoding, see :ref:`jbig2`.
Ubuntu 20.04 LTS
----------------
Ubuntu 20.04 includes ocrmypdf 9.6.0 - you can install that with ``apt``. The
most convenient way to install recent OCRmyPDF on older Ubuntu is to use
Homebrew on Linux (Linuxbrew).
.. code-block:: bash
brew install ocrmypdf
Arch Linux (AUR)
----------------
.. image:: https://repology.org/badge/version-for-repo/aur/ocrmypdf.svg
:alt: ArchLinux
:target: https://repology.org/metapackage/ocrmypdf
There is an `Arch User Repository (AUR) package for OCRmyPDF
<https://aur.archlinux.org/packages/ocrmypdf/>`__.
Installing AUR packages as root is not allowed, so you must first `setup a
non-root user
<https://wiki.archlinux.org/index.php/Users_and_groups#User_management>`__ and
`configure sudo <https://wiki.archlinux.org/index.php/Sudo#Configuration>`__.
The standard Docker image, ``archlinux/base:latest``, does **not** have a
non-root user configured, so users of that image must follow these guides. If
you are using a VM image, such as `the official Vagrant image
<https://app.vagrantup.com/archlinux/boxes/archlinux>`__, this work may already
be completed for you.
Next you should install the `base-devel package group
<https://archlinux.org/packages/core/any/base-devel/>`__. This includes the
standard tooling needed to build packages, such as a compiler and binary tools.
.. code-block:: bash
sudo pacman -S --needed base-devel
Now you are ready to install the OCRmyPDF package.
.. code-block:: bash
curl -O https://aur.archlinux.org/cgit/aur.git/snapshot/ocrmypdf.tar.gz
tar xvzf ocrmypdf.tar.gz
cd ocrmypdf
makepkg -sri
At this point you will have a working install of OCRmyPDF, but the Tesseract
install won't include any OCR language data. You can install `the
tesseract-data package group
<https://www.archlinux.org/groups/any/tesseract-data/>`__ to add all supported
languages, or use that package listing to identify the appropriate package for
your desired language.
.. code-block:: bash
sudo pacman -S tesseract-data-eng
As an alternative to this manual procedure, consider using an `AUR helper
<https://wiki.archlinux.org/index.php/AUR_helpers>`__. Such a tool will
automatically fetch, build and install the AUR package, resolve dependencies
(including dependencies on AUR packages), and ease the upgrade procedure.
If you have any difficulties with installation, check the repository package
page.
.. note::
The OCRmyPDF AUR package currently omits the JBIG2 encoder. OCRmyPDF works
fine without it but will produce larger output files. The encoder is
available from `the jbig2enc-git AUR package
<https://aur.archlinux.org/packages/jbig2enc-git/>`__ and may be installed
using the same series of steps as for the installation of the OCRmyPDF AUR
package. Alternatively, it may be built manually from source following the
instructions in :ref:`Installing the JBIG2 encoder <jbig2>`. If JBIG2 is
installed, OCRmyPDF 7.0.0 and later will automatically detect it.
Alpine Linux
------------
.. image:: https://repology.org/badge/version-for-repo/alpine_edge/ocrmypdf.svg
:alt: Alpine Linux
:target: https://repology.org/metapackage/ocrmypdf
To install OCRmyPDF for Alpine Linux:
.. code-block:: bash
apk add ocrmypdf
Gentoo Linux
------------
.. image:: https://repology.org/badge/version-for-repo/gentoo_ovl_guru/ocrmypdf.svg
:alt: Gentoo Linux
:target: https://repology.org/metapackage/ocrmypdf
To install OCRmyPDF on Gentoo Linux, use the following commands:
.. code-block:: bash
eselect repository enable guru
emaint sync --repo guru
emerge --ask app-text/OCRmyPDF
Other Linux packages
--------------------
See the
`Repology <https://repology.org/metapackage/ocrmypdf/versions>`__ page.
In general, first install the OCRmyPDF package for your system, then
optionally use the procedure `Installing with Python
pip <#installing-with-python-pip>`__ to install a more recent version.
Installing on macOS
===================
Homebrew
--------
.. image:: https://img.shields.io/homebrew/v/ocrmypdf.svg
:alt: homebrew
:target: https://formulae.brew.sh/formula/ocrmypdf
OCRmyPDF is now a standard `Homebrew <https://brew.sh>`__ formula. To
install on macOS:
.. code-block:: bash
brew install ocrmypdf
This will include only the English language pack. If you need other
languages you can optionally install them all:
.. code-block:: bash
brew install tesseract-lang # Optional: Install all language packs
MacPorts
--------
.. image:: https://img.shields.io/badge/dynamic/json?url=https%3A%2F%2Fports.macports.org%2Fapi%2Fv1%2Fports%2Focrmypdf%2F%3Fformat%3Djson&query=version&label=MacPorts
:alt: Macports Version Information
:target: https://ports.macports.org/port/ocrmypdf
OCRmyPDF is included in MacPorts:
.. code-block:: bash
sudo port install ocrmypdf
Note that while this will install tesseract you will need to install
the appropriate tesseract `language ports <https://ports.macports.org/search/?selected_facets=categories_exact%3Atextproc&installed_file=&q=tesseract&name=on>`__.
Manual installation on macOS
----------------------------
These instructions probably work on all macOS supported by Homebrew, and are
for installing a more current version of OCRmyPDF than is available from
Homebrew. Note that the Homebrew versions usually track the release versions
fairly closely.
If it's not already present, `install Homebrew <http://brew.sh/>`__.
Update Homebrew:
.. code-block:: bash
brew update
Install or upgrade the required Homebrew packages, if any are missing.
To do this, use ``brew edit ocrmypdf`` to obtain a recent list of Homebrew
dependencies. You could also check the ``.workflows/build.yml``.
This will include the English, French, German and Spanish language
packs. If you need other languages you can optionally install them all:
.. _macos-all-languages:
.. code-block:: bash
brew install tesseract-lang # Option 2: for all language packs
Update the homebrew pip:
.. code-block:: bash
pip install --upgrade pip
You can then install OCRmyPDF from PyPI for the current user:
.. code-block:: bash
pip install --user ocrmypdf
The command line program should now be available:
.. code-block:: bash
ocrmypdf --help
Installing on Windows
=====================
Native Windows
--------------
..
If you have a Windows that is not the Home edition, you can use Windows Sandbox to test on a blank Windows instance.
https://learn.microsoft.com/en-us/windows/security/application-security/application-isolation/windows-sandbox/
.. note::
Administrator privileges will be required for some of these steps.
You must install the following for Windows:
* Python 64-bit
* Tesseract 64-bit
* Ghostscript 64-bit
Using the `winget <https://docs.microsoft.com/en-us/windows/package-manager/winget/>`_
package manager:
* ``winget install -e --id Python.Python.3.11``
* ``winget install -e --id UB-Mannheim.TesseractOCR``
You will need to install Ghostscript manually, `since it does not support automated
installs anymore <https://artifex.com/news/ghostscript-10.01.0-disabling-silent-install-option>`_.
* `Ghostscript download page <https://ghostscript.com/releases/gsdnld.html>`_.
(Or alternately, using the `Chocolatey <https://chocolatey.org/>`_ package manager, install
the following when running in an Administrator command prompt):
* ``choco install python3``
* ``choco install --pre tesseract``
* ``choco install pngquant`` (optional)
Either set of commands will install the required software. At the moment there is no
single command to install all of the Windows dependencies.
You may then use ``pip`` to install ocrmypdf. (This can be performed by a user or
Administrator.):
* ``python3 -m pip install ocrmypdf``
..
The Windows Python versions do not place any python or python3 executable in the path.
They add the py launcher to the path:
https://docs.python.org/3/using/windows.html#python-launcher-for-windows
If you installed Python using WinGet, then use the following command instead:
* ``py -m pip install ocrmypdf``
and use the following to start OCRmyPDF:
* ``py -m ocrmypdf``
If you intend to use more Python software on your Windows machine, consider the use of
`pipx <https://pipx.pypa.io/stable/>`_ or a similar tool to create isolated Python
environments for each Python software that you want to use.
OCRmyPDF will check the Windows Registry and standard locations in your Program Files
for third party software it needs (specifically, Tesseract and Ghostscript). To
override the versions OCRmyPDF selects, you can modify the ``PATH`` environment
variable. `Follow these directions <https://www.computerhope.com/issues/ch000549.htm#dospath>`_
to change the PATH.
.. warning::
As of early 2021, users have reported problems with the Microsoft Store version of
Python and OCRmyPDF. These issues affect many other third party Python packages.
Please download Python from Python.org or a package manager instead of the
Microsoft Store version.
.. warning::
32-bit Windows is not supported.
Windows Subsystem for Linux
---------------------------
#. Install Ubuntu 22.04 for Windows Subsystem for Linux, if not already installed.
#. Follow the procedure to install :ref:`OCRmyPDF on Ubuntu 22.04 <ubuntu-lts-latest>`.
#. Open the Windows command prompt and create a symlink:
.. code-block:: powershell
wsl sudo ln -s /home/$USER/.local/bin/ocrmypdf /usr/local/bin/ocrmypdf
Then confirm that the expected version from PyPI (|latest|) is installed:
.. code-block:: powershell
wsl ocrmypdf --version
You can then run OCRmyPDF in the Windows command prompt or Powershell, prefixing
``wsl``, and call it from Windows programs or batch files.
Cygwin64
--------
First install the following prerequisite Cygwin packages using ``setup-x86_64.exe``::
python310 (or later)
python3?-devel
python3?-pip
python3?-lxml
python3?-imaging
(where 3? means match the version of python3 you installed)
gcc-g++
ghostscript
libexempi3
libexempi-devel
libffi6
libffi-devel
pngquant
qpdf
libqpdf-devel
tesseract-ocr
tesseract-ocr-devel
Then open a Cygwin terminal (i.e. ``mintty``) and run the following commands. Note
that if you are using the version of ``pip`` that was installed with the Cygwin
Python package, the command name will be ``pip3``. If you have since updated
``pip`` (with, for instance, ``pip3 install --upgrade pip``), then the command is
likely just ``pip`` instead of ``pip3``:
.. code-block:: bash
pip3 install wheel
pip3 install ocrmypdf
The optional dependency "unpaper" is currently not available under Cygwin.
Without it, certain options such as ``--clean`` will produce an error message.
However, the OCR-to-text-layer functionality is available.
Docker
------
You can also :ref:`Install the Docker <docker>` container on Windows. Ensure that
your command prompt can run the docker "hello world" container.
Installing on FreeBSD
=====================
.. image:: https://repology.org/badge/version-for-repo/freebsd/ocrmypdf.svg
:alt: FreeBSD
:target: https://repology.org/project/ocrmypdf/versions
.. code-block:: bash
pkg install textproc/py-ocrmypdf
To install a more recent version, you could attempt to first install the system
version with ``pkg``, then use ``pip install --user ocrmypdf``.
Installing the Docker image
===========================
For some users, installing the Docker image will be easier than
installing all of OCRmyPDF's dependencies.
See :ref:`docker` for more information.
Installing with Python pip
==========================
OCRmyPDF is delivered by PyPI because it is a convenient way to install
the latest version. However, PyPI and ``pip`` cannot address the fact
that ``ocrmypdf`` depends on certain non-Python system libraries and
programs being installed.
For best results, first install `your platform's
version <https://repology.org/metapackage/ocrmypdf/versions>`__ of
``ocrmypdf``, using the instructions elsewhere in this document. Then
you can use ``pip`` to get the latest version if your platform version
is out of date. Chances are that this will satisfy most dependencies.
Use ``ocrmypdf --version`` to confirm what version was installed.
Then you can install the latest OCRmyPDF from the Python wheels. First
try:
.. code-block:: bash
pip install --user ocrmypdf
(If the message appears ``Requirement already satisfied: ocrmypdf in...``,
you will need to use ``pip install --user --upgrade ocrmypdf``.)
You should then be able to run ``ocrmypdf --version`` and see that the
latest version was located.
Installing with pipx
====================
Some users may prefer pipx. As with the method above, you will need to
satisfy all non-Python dependencies. Then if pipx is installed, you
can use
.. code-block:: bash
pipx run ocrmypdf
(If OCRmyPDF is not already available, ``pipx run`` will download it into a
temporary environment first.)
Requirements for pip and HEAD install
-------------------------------------
OCRmyPDF currently requires these external programs and libraries to be
installed. They must be provided by the operating system package
manager; ``pip`` cannot provide them.
The following versions are required:
- Python 3.10 or newer
- Ghostscript 9.54 or newer
- Tesseract 4.1.1 or newer
- jbig2enc 0.29 or newer
- pngquant 2.5 or newer
- unpaper 6.1
We recommend 64-bit versions of all software. (32-bit versions are not
supported, although on Linux, they may still work.)
jbig2enc, pngquant, and unpaper are optional. If they are missing, certain
features are disabled. OCRmyPDF will discover them as soon as they are
available.
**jbig2enc**, if present, will be used to optimize the encoding of
monochrome images. This can significantly reduce the file size of the
output file. It is not required.
`jbig2enc <https://github.com/agl/jbig2enc>`__ is not generally
available for Ubuntu or Debian due to lingering concerns about patent
issues, but can easily be built from source. To add JBIG2 encoding, see
:ref:`jbig2`.
**pngquant**, if present, is optionally used to optimize the encoding of
PNG-style images in PDFs (actually, any that are losslessly
encoded) by lossily quantizing to a smaller color palette. It is only
activated when the ``--optimize`` argument is ``2`` or ``3``.
**unpaper**, if present, enables the ``--clean`` and ``--clean-final``
command line options.
These are in addition to the Python packaging dependencies, meaning that
unfortunately, the ``pip install`` command cannot satisfy all of them.
Installing HEAD revision from sources
=====================================
If you have ``git`` and Python 3.10 or newer installed, you can install
from source. When the ``pip`` installer runs, it will alert you if
dependencies are missing.
If you prefer to build everything from source, you will need to `build
pikepdf from
source <https://pikepdf.readthedocs.io/en/latest/installation.html#building-from-source>`__.
First ensure you can build and install pikepdf.
To install the HEAD revision from sources in the current Python 3
environment:
.. code-block:: bash
pip install git+https://github.com/ocrmypdf/OCRmyPDF.git
Or, to install in editable mode
allowing customization of OCRmyPDF, use the ``-e`` flag:
.. code-block:: bash
pip install -e git+https://github.com/ocrmypdf/OCRmyPDF.git#egg=ocrmypdf
You may find it easiest to install in a virtual environment, rather than
system-wide:
.. code-block:: bash
git clone -b main https://github.com/ocrmypdf/OCRmyPDF.git
python3 -m venv .venv
source .venv/bin/activate
cd OCRmyPDF
pip install .
However, ``ocrmypdf`` will only be accessible on the system PATH when
you activate the virtual environment.
To run the program:
.. code-block:: bash
ocrmypdf --help
If not yet installed, the script will notify you about dependencies that
need to be installed. The script requires specific versions of the
dependencies. Versions older than those mentioned in the release notes
are unlikely to be compatible with OCRmyPDF.
For development
---------------
To install all of the development and test requirements:
.. code-block:: bash
git clone -b main https://github.com/ocrmypdf/OCRmyPDF.git
python -m venv .venv
source .venv/bin/activate
cd OCRmyPDF
pip install -e .[test]
To add JBIG2 encoding, see :ref:`jbig2`.
Shell completions
=================
Completions for ``bash`` and ``fish`` are available in the project's
``misc/completion`` folder. The ``bash`` completions are likely ``zsh``
compatible but this has not been confirmed. Package maintainers, please
install these at the appropriate locations for your system.
To manually install the ``bash`` completion, copy
``misc/completion/ocrmypdf.bash`` to ``/etc/bash_completion.d/ocrmypdf``
(rename the file).
To manually install the ``fish`` completion, copy
``misc/completion/ocrmypdf.fish`` to
``~/.config/fish/completions/ocrmypdf.fish``.
Note on 32-bit support
======================
Many Python libraries no longer provide 32-bit binary wheels for Linux. This
includes many of the libraries that OCRmyPDF depends on, such as
Pillow. The easiest way to express this to end users is to say we don't
support 32-bit Linux.
However, if your Linux distribution still supports 32-bit binaries, you
can still install and use OCRmyPDF. A warning message will appear.
In practice, OCRmyPDF may need more than 32-bit memory space to run when
large documents are processed, so there are practical limitations to what
users can accomplish with it. Still, for the common use case of a 32-bit
ARM NAS or Raspberry Pi processing small documents, it should work.


@@ -1,10 +1,14 @@
.. SPDX-FileCopyrightText: 2022 James R. Barlow
..
.. SPDX-License-Identifier: CC-BY-SA-4.0
---
substitutions:
image: |-
```{image} images/bitmap_vs_svg.svg
```
---
============
Introduction
============
% SPDX-FileCopyrightText: 2022 James R. Barlow
% SPDX-License-Identifier: CC-BY-SA-4.0
# Introduction
OCRmyPDF is a Python application and library that adds text "layers" to images in
PDFs, making scanned image PDFs searchable. It uses OCR to guess the text
@@ -13,31 +17,30 @@ that enable customization of its processing steps, and it is highly tolerant
of PDFs containing scanned images and "born digital" content that doesn't
require text recognition.
About OCR
=========
## About OCR
`Optical character
recognition <https://en.wikipedia.org/wiki/Optical_character_recognition>`__
[Optical character
recognition](https://en.wikipedia.org/wiki/Optical_character_recognition)
is a technology that converts images of typed or handwritten text, such as
in a scanned document, into computer text that can be selected, searched and copied.
OCRmyPDF uses
`Tesseract <https://github.com/tesseract-ocr/tesseract>`__, a widely
[Tesseract](https://github.com/tesseract-ocr/tesseract), a widely
available open source OCR engine, to perform OCR.
.. _raster-vector:
(raster-vector)=
About PDFs
==========
## About PDFs
PDFs are page description files that attempt to preserve a layout
exactly. They contain `vector
graphics <http://vector-conversions.com/vectorizing/raster_vs_vector.html>`__
exactly. They contain [vector
graphics](http://vector-conversions.com/vectorizing/raster_vs_vector.html)
that can contain raster objects, such as scanned images. Because PDFs can
contain multiple pages (unlike many image formats) and can contain fonts
and text, they are a suitable format for exchanging scanned documents.
|image|
:::{image} images/bitmap_vs_svg.svg
:::
A PDF page may contain multiple images, even if it appears to have only
one image. Some scanners or scanning software may segment pages into
@@ -48,10 +51,9 @@ Rasterizing a PDF is the process of generating corresponding raster images.
OCR engines like Tesseract work with images, not scalable vector graphics
or mixed raster-vector-text graphics such as PDF.
About PDF/A
===========
## About PDF/A
`PDF/A <https://en.wikipedia.org/wiki/PDF/A>`__ is an ISO-standardized
[PDF/A](https://en.wikipedia.org/wiki/PDF/A) is an ISO-standardized
subset of the full PDF specification that is designed for archiving (the
'A' stands for Archive). PDF/A differs from PDF primarily by omitting
features that could complicate future file readability,
@@ -63,8 +65,8 @@ of embedded content, it is likely more secure.
There are various conformance levels and versions, such as "PDF/A-2b".
In general, the preferred format for scanned documents is PDF/A. Some
governments and jurisdictions, US Courts in particular, `mandate the use
of PDF/A <https://pdfblog.com/2012/02/13/what-is-pdfa/>`__ for scanned
governments and jurisdictions, US Courts in particular, [mandate the use
of PDF/A](https://pdfblog.com/2012/02/13/what-is-pdfa/) for scanned
documents.
Since most individuals scanning documents aim for long-term readability,
@@ -78,13 +80,12 @@ files can be digitally signed but may not be encrypted to ensure future
readability. Fortunately, converting from PDF/A to a regular PDF is
straightforward, and any PDF viewer can handle PDF/A files.
What OCRmyPDF does
==================
## What OCRmyPDF does
OCRmyPDF analyzes each page of a PDF to determine the required colorspace
and resolution (DPI) for capturing all the information on that page without
losing content. It uses
`Ghostscript <http://ghostscript.com/>`__ to rasterize each page and subsequently
[Ghostscript](http://ghostscript.com/) to rasterize each page and subsequently
performs OCR on the rasterized image to generate an OCR "layer." This layer
is then integrated back into the original PDF.
@@ -101,10 +102,9 @@ options are utilized, the OCR layer is integrated into the processed image.
By default, OCRmyPDF generates archival PDFs in the PDF/A format, which is
a more rigid subset of PDF features designed for long-term archives. If you
prefer regular PDFs, you can disable this feature using the
``--output-type pdf`` option.
`--output-type pdf` option.
Why you shouldn't do this manually
==================================
## Why you shouldn't do this manually
A PDF is similar to an HTML file, in that it contains document structure
along with images. While some PDFs may solely display a full-page image,
@@ -142,55 +142,53 @@ like pikepdf and QPDF, it can auto-repair damaged PDFs. You don't need to
understand the intricacies of these issues; you should be able to use
OCRmyPDF with any PDF file, and expect reasonable results.
Limitations
===========
## Limitations
OCRmyPDF is subject to limitations imposed by the Tesseract OCR engine.
These limitations are inherent to any software relying on Tesseract:
- The OCR accuracy may not match that of commercial OCR solutions.
- It is incapable of recognizing handwriting.
- It may detect gibberish and report it as OCR output.
- Results may be subpar when a document contains languages not specified
in the ``-l LANG`` argument.
- Tesseract may struggle to analyze the natural reading order of documents.
For instance, it might fail to recognize two columns in a document and
attempt to join text across columns.
- Poor quality scans can result in subpar OCR quality. In other words, the
quality of the OCR output depends on the quality of the input.
- Tesseract does not provide information about the font family to which text
belongs.
- Tesseract does not divide text into paragraphs or headings. It only provides
the text and its bounding box. As such, the generated PDF does not
contain any information about the document's structure.
- The OCR accuracy may not match that of commercial OCR solutions.
- It is incapable of recognizing handwriting.
- It may detect gibberish and report it as OCR output.
- Results may be subpar when a document contains languages not specified
in the `-l LANG` argument.
- Tesseract may struggle to analyze the natural reading order of documents.
For instance, it might fail to recognize two columns in a document and
attempt to join text across columns.
- Poor quality scans can result in subpar OCR quality. In other words, the
quality of the OCR output depends on the quality of the input.
- Tesseract does not provide information about the font family to which text
belongs.
- Tesseract does not divide text into paragraphs or headings. It only provides
the text and its bounding box. As such, the generated PDF does not
contain any information about the document's structure.
Ghostscript also imposes some limitations:
- PDFs containing JPEG 2000-encoded content may be converted to JPEG
encoding, which may introduce compression artifacts, if Ghostscript
PDF/A is enabled.
- Ghostscript may transcode grayscale and color images, potentially
lossily, based on an internal algorithm. This
behavior can be suppressed by setting ``--pdfa-image-compression`` to
``jpeg`` or ``lossless`` to set all images to one type or the other.
Ghostscript lacks an option to maintain the input image's format.
(Modern Ghostscript can copy JPEG images without transcoding them.)
- Ghostscript's PDF/A conversion removes any XMP metadata that is not
one of the standard XMP metadata namespaces for PDFs. In particular,
PRISM Metadata is removed.
- Ghostscript's PDF/A conversion may remove or deactivate
hyperlinks and other active content.
- PDFs containing JPEG 2000-encoded content may be converted to JPEG
encoding, which may introduce compression artifacts, if Ghostscript
PDF/A is enabled.
- Ghostscript may transcode grayscale and color images, potentially
lossily, based on an internal algorithm. This
behavior can be suppressed by setting `--pdfa-image-compression` to
`jpeg` or `lossless` to set all images to one type or the other.
Ghostscript lacks an option to maintain the input image's format.
(Modern Ghostscript can copy JPEG images without transcoding them.)
- Ghostscript's PDF/A conversion removes any XMP metadata that is not
one of the standard XMP metadata namespaces for PDFs. In particular,
PRISM Metadata is removed.
- Ghostscript's PDF/A conversion may remove or deactivate
hyperlinks and other active content.
You can use ``--output-type pdf`` to disable PDF/A conversion and produce
You can use `--output-type pdf` to disable PDF/A conversion and produce
a standard, non-archival PDF.
Regarding OCRmyPDF itself:
- PDFs using transparency are not currently represented in the test
suite
- PDFs using transparency are not currently represented in the test
suite
Similar programs
================
## Similar programs
To the author's knowledge, OCRmyPDF is the most feature-rich and
thoroughly tested command line OCR PDF conversion tool. If it does not
@@ -199,8 +197,7 @@ meet your needs, contributions and suggestions are welcome.
Ghostscript recently added three "pdfocr" output devices. They work by
rasterizing all content and converting all pages to a single colour space.
Web front-ends
==============
## Web front-ends
The Docker image of OCRmyPDF provides a web service front-end
that allows files to be submitted over HTTP, and the results can be downloaded.
@@ -210,16 +207,14 @@ public internet and does not provide any security measures.
In addition, the following third-party integrations are available:
- `Paperless-ngx <https://docs.paperless-ngx.com/>`__ is a free software
document management system that uses OCRmyPDF to perform OCR on
uploaded documents.
- `Nextcloud OCR <https://github.com/janis91/ocr>`__ is a free software
plugin for the Nextcloud private cloud software.
- [Paperless-ngx](https://docs.paperless-ngx.com/) is a free software
document management system that uses OCRmyPDF to perform OCR on
uploaded documents.
- [Nextcloud OCR](https://github.com/janis91/ocr) is a free software
plugin for the Nextcloud private cloud software.
OCRmyPDF is not designed to be secure against malware-bearing PDFs (see
`Using OCRmyPDF online <ocr-service>`__). Users should ensure they
[Using OCRmyPDF online](ocr-service)). Users should ensure they
comply with OCRmyPDF's licenses and the licenses of all dependencies. In
particular, OCRmyPDF requires Ghostscript, which is licensed under
AGPLv3.
.. |image| image:: images/bitmap_vs_svg.svg

129
docs/languages.md Normal file

@@ -0,0 +1,129 @@
% SPDX-FileCopyrightText: 2022 James R. Barlow
% SPDX-License-Identifier: CC-BY-SA-4.0
(lang-packs)=
# Installing additional language packs
OCRmyPDF uses Tesseract for OCR, and relies on its language packs for all languages.
On most platforms, English is installed with Tesseract by default, but not always.
Tesseract supports [most
languages](https://github.com/tesseract-ocr/tesseract/blob/main/doc/tesseract.1.asc#languages).
Languages are identified by standardized three-letter codes (called ISO 639-2 Alpha-3).
Tesseract's documentation also lists the three-letter code for your language.
Some are anglicized, e.g. Spanish is `spa` rather than `esp`, while others
are not, e.g. German is `deu` and French is `fra`.
Language packs (strictly speaking, Tesseract "traineddata" files) generally correspond
to the language in question, but different language packs are used in certain
situations. For German, the "Fraktur" language pack can assist with reading older
materials in the Fraktur typeface family (`deu_frak`). Some communities have changed
their script from Cyrillic to Latin; the Cyrillic version of Uzbek is available
as `uzb_cyrl` and the Latin version is `uzb`.
After you have installed a language pack, you can use it with `ocrmypdf -l <language>`,
for example `ocrmypdf -l spa`. For multilingual documents, you can specify
all languages to be expected, e.g. `ocrmypdf -l eng+fra` for English and French.
English is assumed by default unless other language(s) are specified.
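
Before running OCR on a multilingual document, it can help to confirm that the language packs are actually visible to Tesseract. The following is a minimal sketch, not part of OCRmyPDF itself: it assumes `tesseract` is on the PATH, uses `eng` and `fra` purely as example codes, and only prints the `ocrmypdf` command instead of running it.

```bash
# Languages expected in the document (example values; adjust as needed).
required="eng fra"

# Ask Tesseract which language packs it can see; empty if tesseract is missing.
installed=$(tesseract --list-langs 2>/dev/null || true)

missing=""
for lang in $required; do
    # grep -x matches a whole line, so "eng" does not match a longer name containing it.
    printf '%s\n' "$installed" | grep -qx "$lang" || missing="$missing $lang"
done

# Join the codes with "+" to form the value passed to -l.
lang_arg=$(printf '%s' "$required" | tr ' ' '+')

if [ -z "$missing" ]; then
    echo "all packs found; run: ocrmypdf -l $lang_arg input.pdf output.pdf"
else
    echo "missing language packs:$missing"
fi
```

The file names `input.pdf` and `output.pdf` are placeholders.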
For Linux users, you can often find packages that provide language
packs.
## Platform install steps
### Debian and Ubuntu (apt)
```bash
# Display a list of all Tesseract language packs
apt-cache search tesseract-ocr
# Install Chinese Simplified language pack
apt-get install tesseract-ocr-chi-sim
```
You can then pass the `-l LANG` argument to OCRmyPDF to give a hint as
to what languages it should search for. Multiple languages can be
requested using either `-l eng+fra` (English and French) or
`-l eng -l fra`.
### Fedora
```bash
# Display a list of all Tesseract language packs
dnf search tesseract
# Install Chinese Simplified language pack
dnf install tesseract-langpack-chi_sim
```
You can then pass the `-l LANG` argument to OCRmyPDF to give a hint as
to what languages it should search for. Multiple languages can be
requested using either `-l eng+fra` (English and French) or
`-l eng -l fra`.
### Arch Linux
```bash
# Display a list of all Tesseract language packs
pacman -Ss tesseract-data
# Install German language pack
pacman -S tesseract-data-deu
```
You can then pass the `-l LANG` argument to OCRmyPDF to give a hint as
to what languages it should search for. Multiple languages can be
requested using either `-l eng+fra` (English and French) or
`-l eng -l fra`.
### Gentoo
On Gentoo, the package `app-text/tessdata_fast`, which `app-text/tesseract` depends on, handles Tesseract languages.
It accepts USE flags to select which languages should be installed; these can be set in `/etc/portage/package.use`.
Alternatively, one can globally set the [L10N use extension](https://wiki.gentoo.org/wiki/Localization/Guide#L10N) in `/etc/portage/make.conf`.
This enables these languages for all packages (e.g. including aspell).
```bash
# Display a list of all Tesseract language packs
equery uses app-text/tessdata_fast
# Add English and German language support for Tesseract only
echo 'app-text/tessdata_fast l10n_de l10n_en' >> /etc/portage/package.use
# Add global English and German language support (the `l10n_` from equery has to be omitted)
echo 'L10N="de en"' >> /etc/portage/make.conf
# update system to reflect changed USE flags
emerge --update --deep --newuse @world
```
You can then pass the `-l LANG` argument to OCRmyPDF to give a hint as
to what languages it should search for. Multiple languages can be
requested using either `-l eng+fra` (English and French) or
`-l eng -l fra`.
### macOS
You can install additional language packs by
{ref}`installing Tesseract using Homebrew with all language packs <macos-all-languages>`.
### Docker
Users of the OCRmyPDF Docker image should install language packs into a
derived Docker image as
{ref}`described in that section <docker-lang-packs>`.
### Windows
The Tesseract installer provided by Chocolatey currently includes only the English language pack.
To install other languages, download the respective language pack (`.traineddata` file)
from <https://github.com/tesseract-ocr/tessdata/> and place it in
`C:\Program Files\Tesseract-OCR\tessdata` (or wherever Tesseract OCR is installed).
## Custom language packs
If you have fine-tuned or trained Tesseract and generated custom trained data, you can
copy your `customlang.traineddata` file into your Tesseract "tessdata" folder, and
then use the `-l customlang` argument to tell OCRmyPDF to pass that language on to
Tesseract.
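
As a hedged sketch of the steps above: the file name `customlang.traineddata` comes from the text, while the fallback tessdata path is only an assumption (a typical Debian/Ubuntu location) and will differ on other systems. The sketch assumes that `TESSDATA_PREFIX`, if set, points at the tessdata folder itself.

```bash
# File name from the text above; replace with your own trained data file.
traineddata="customlang.traineddata"

# Assumption: TESSDATA_PREFIX, if set, is the tessdata folder.
# The fallback path is an example only and varies by platform.
tessdata_dir="${TESSDATA_PREFIX:-/usr/share/tesseract-ocr/5/tessdata}"

# The language name passed to -l is the file name without its extension.
lang_name="${traineddata%.traineddata}"

if [ -f "$traineddata" ] && [ -d "$tessdata_dir" ]; then
    cp "$traineddata" "$tessdata_dir/"   # may require elevated privileges
    ocrmypdf -l "$lang_name" input.pdf output.pdf
else
    echo "would copy $traineddata to $tessdata_dir and run: ocrmypdf -l $lang_name input.pdf output.pdf"
fi
```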


@@ -1,141 +0,0 @@
.. SPDX-FileCopyrightText: 2022 James R. Barlow
..
.. SPDX-License-Identifier: CC-BY-SA-4.0
.. _lang-packs:
====================================
Installing additional language packs
====================================
OCRmyPDF uses Tesseract for OCR, and relies on its language packs for all languages.
On most platforms, English is installed with Tesseract by default, but not always.
Tesseract supports `most
languages <https://github.com/tesseract-ocr/tesseract/blob/main/doc/tesseract.1.asc#languages>`__.
Languages are identified by standardized three-letter codes (called ISO 639-2 Alpha-3).
Tesseract's documentation also lists the three-letter code for your language.
Some are anglicized, e.g. Spanish is ``spa`` rather than ``esp``, while others
are not, e.g. German is ``deu`` and French is ``fra``.
Language packs (strictly speaking, Tesseract "traineddata" files) generally correspond
to the language in question, but different language packs are used in certain
situations. For German, the "Fraktur" language pack can assist with reading older
materials in the Fraktur typeface family (``deu_frak``). Some communities have changed
their script from Cyrillic to Latin; the Cyrillic version of Uzbek is available
as ``uzb_cyrl`` and the Latin version is ``uzb``.
After you have installed a language pack, you can use it with ``ocrmypdf -l <language>``,
for example ``ocrmypdf -l spa``. For multilingual documents, you can specify
all languages to be expected, e.g. ``ocrmypdf -l eng+fra`` for English and French.
English is assumed by default unless other language(s) are specified.
For Linux users, you can often find packages that provide language
packs.
Platform install steps
======================
Debian and Ubuntu (apt)
-----------------------
.. code-block:: bash
# Display a list of all Tesseract language packs
apt-cache search tesseract-ocr
# Install Chinese Simplified language pack
apt-get install tesseract-ocr-chi-sim
You can then pass the ``-l LANG`` argument to OCRmyPDF to give a hint as
to what languages it should search for. Multiple languages can be
requested using either ``-l eng+fra`` (English and French) or
``-l eng -l fra``.
Fedora
------
.. code-block:: bash
# Display a list of all Tesseract language packs
dnf search tesseract
# Install Chinese Simplified language pack
dnf install tesseract-langpack-chi_sim
You can then pass the ``-l LANG`` argument to OCRmyPDF to give a hint as
to what languages it should search for. Multiple languages can be
requested using either ``-l eng+fra`` (English and French) or
``-l eng -l fra``.
Arch Linux
----------
.. code-block:: bash
# Display a list of all Tesseract language packs
pacman -Ss tesseract-data
# Install German language pack
pacman -S tesseract-data-deu
You can then pass the ``-l LANG`` argument to OCRmyPDF to give a hint as
to what languages it should search for. Multiple languages can be
requested using either ``-l eng+fra`` (English and French) or
``-l eng -l fra``.
Gentoo
------
On Gentoo, the package ``app-text/tessdata_fast``, which ``app-text/tesseract`` depends on, handles Tesseract languages.
It accepts USE flags to select which languages should be installed; these can be set in ``/etc/portage/package.use``.
Alternatively, one can globally set the `L10N use extension <https://wiki.gentoo.org/wiki/Localization/Guide#L10N>`__ in ``/etc/portage/make.conf``.
This enables these languages for all packages (e.g. including aspell).
.. code-block:: bash
# Display a list of all Tesseract language packs
equery uses app-text/tessdata_fast
# Add English and German language support for Tesseract only
echo 'app-text/tessdata_fast l10n_de l10n_en' >> /etc/portage/package.use
# Add global English and German language support (the `l10n_` from equery has to be omitted)
echo 'L10N="de en"' >> /etc/portage/make.conf
# update system to reflect changed USE flags
emerge --update --deep --newuse @world
You can then pass the ``-l LANG`` argument to OCRmyPDF to give a hint as
to what languages it should search for. Multiple languages can be
requested using either ``-l eng+fra`` (English and French) or
``-l eng -l fra``.
macOS
-----
You can install additional language packs by
:ref:`installing Tesseract using Homebrew with all language packs <macos-all-languages>`.
Docker
------
Users of the OCRmyPDF Docker image should install language packs into a
derived Docker image as
:ref:`described in that section <docker-lang-packs>`.
Windows
-------
The Tesseract installer provided by Chocolatey currently includes only the English language pack.
To install other languages, download the respective language pack (``.traineddata`` file)
from https://github.com/tesseract-ocr/tessdata/ and place it in
``C:\Program Files\Tesseract-OCR\tessdata`` (or wherever Tesseract OCR is installed).
Custom language packs
=====================
If you have fine-tuned or trained Tesseract and generated custom trained data, you can
copy your ``customlang.traineddata`` file into your Tesseract "tessdata" folder, and
then use the ``-l customlang`` argument to tell OCRmyPDF to pass that language on to
Tesseract.

24
docs/performance.md Normal file

@@ -0,0 +1,24 @@
% SPDX-FileCopyrightText: 2022 James R. Barlow
% SPDX-License-Identifier: CC-BY-SA-4.0
# Performance
Some users have noticed that current versions of OCRmyPDF do not run as
quickly as some older versions (specifically 6.x and older). This is
because OCRmyPDF added image optimization as a postprocessing step, and
it is enabled by default.
## Speed
If running OCRmyPDF quickly is your main goal, you can use settings such
as:
- `--optimize 0` to disable file size optimization
- `--output-type pdf` to disable PDF/A generation
- `--fast-web-view 999999` to disable fast web view optimization
- `--skip-big` to skip large images, if some pages have large images
You can also avoid:
- `--force-ocr`
- Image preprocessing


@@ -1,26 +0,0 @@
.. SPDX-FileCopyrightText: 2022 James R. Barlow
..
.. SPDX-License-Identifier: CC-BY-SA-4.0
===========
Performance
===========
Some users have noticed that current versions of OCRmyPDF do not run as quickly
as some older versions (specifically 6.x and older). This is because OCRmyPDF
added image optimization as a postprocessing step, and it is enabled by default.
Speed
=====
If running OCRmyPDF quickly is your main goal, you can use settings such as:
* ``--optimize 0`` to disable file size optimization
* ``--output-type pdf`` to disable PDF/A generation
* ``--fast-web-view 999999`` to disable fast web view optimization
* ``--skip-big`` to skip large images, if some pages have large images
You can also avoid:
* ``--force-ocr``
* Image preprocessing


@@ -1,15 +1,12 @@
.. SPDX-FileCopyrightText: 2022 James R. Barlow
..
.. SPDX-License-Identifier: CC-BY-SA-4.0
% SPDX-FileCopyrightText: 2022 James R. Barlow
% SPDX-License-Identifier: CC-BY-SA-4.0
=======
Plugins
=======
# Plugins
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL
NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and
"OPTIONAL" in this document are to be interpreted as described in
RFC 2119.
> The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL
> NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and
> "OPTIONAL" in this document are to be interpreted as described in
> RFC 2119.
You can use plugins to customize the behavior of OCRmyPDF at certain points of
interest.
@@ -24,75 +21,71 @@ Currently, it is possible to:
- replace Ghostscript with another PDF to image converter (rasterizer) or
PDF/A generator
OCRmyPDF plugins are based on the Python ``pluggy`` package and conform to its
OCRmyPDF plugins are based on the Python `pluggy` package and conform to its
conventions. Note that plugins installed as setuptools entrypoints are
not checked currently, because OCRmyPDF assumes you may not want to enable
plugins for all files.
See [OCRmyPDF-EasyOCR](https://github.com/ocrmypdf/OCRmyPDF-EasyOCR) for an
See [OCRmyPDF-EasyOCR](https://github.com/ocrmypdf/OCRmyPDF-EasyOCR) for an
example of a straightforward, fully working plugin.
Script plugins
==============
## Script plugins
Script plugins may be called from the command line, by specifying the name of a file.
Script plugins may be convenient for informal or "one-off" plugins, when a certain
batch of files needs a special processing step for example.
.. code-block:: bash
```bash
ocrmypdf --plugin ocrmypdf_example_plugin.py input.pdf output.pdf
```
ocrmypdf --plugin ocrmypdf_example_plugin.py input.pdf output.pdf
Multiple plugins may be installed by issuing the `--plugin` argument multiple times.
Multiple plugins may be installed by issuing the ``--plugin`` argument multiple times.
Packaged plugins
================
## Packaged plugins
Packaged plugins may be installed into the same virtual environment as
OCRmyPDF. They may be invoked using Python standard module naming.
If you are intending to distribute a plugin, please package it.
.. code-block:: bash
ocrmypdf --plugin ocrmypdf_fancypants.pockets.contents input.pdf output.pdf
```bash
ocrmypdf --plugin ocrmypdf_fancypants.pockets.contents input.pdf output.pdf
```
OCRmyPDF does not automatically import plugins, because the assumption is that
plugins affect different files differently and you may not want them activated
all the time. The command line or ``ocrmypdf.ocr(plugin='...')`` must call
all the time. The command line or `ocrmypdf.ocr(plugin='...')` must call
for them.
Third parties that wish to distribute packages for ocrmypdf should package them
as packaged plugins, and these modules should begin with the name ``ocrmypdf_``
similar to ``pytest`` packages such as ``pytest-cov`` (the package) and
``pytest_cov`` (the module).
as packaged plugins, and these modules should begin with the name `ocrmypdf_`
similar to `pytest` packages such as `pytest-cov` (the package) and
`pytest_cov` (the module).
.. note::
:::{note}
We recommend plugin authors name their plugins with the prefix
`ocrmypdf-` (for the package name on PyPI) and `ocrmypdf_` (for the
module), just like pytest plugins. At the same time, please make it clear
that your package is not official.
:::
We recommend plugin authors name their plugins with the prefix
``ocrmypdf-`` (for the package name on PyPI) and ``ocrmypdf_`` (for the
module), just like pytest plugins. At the same time, please make it clear
that your package is not official.
Plugins
=======
## Plugins
You can also create a plugin that OCRmyPDF will always automatically load if both are
installed in the same virtual environment, using a project entrypoint.
OCRmyPDF uses the entrypoint namespace "ocrmypdf".
For example, ``pyproject.toml`` would need to contain the following, for a plugin named
``ocrmypdf-exampleplugin``:
For example, `pyproject.toml` would need to contain the following, for a plugin named
`ocrmypdf-exampleplugin`:
.. code-block:: toml
```toml
[project]
name = "ocrmypdf-exampleplugin"
[project]
name = "ocrmypdf-exampleplugin"
[project.entry-points."ocrmypdf"]
exampleplugin = "exampleplugin.pluginmodule"
```
[project.entry-points."ocrmypdf"]
exampleplugin = "exampleplugin.pluginmodule"
Plugin requirements
===================
## Plugin requirements
OCRmyPDF generally uses multiple worker processes. When a new worker is started,
Python will import all plugins again, including all plugins that were imported earlier.
@@ -103,14 +96,14 @@ to obtain a reference to shared state prepared by another hook implementation.
Plugins must expect that other instances of the plugin will be running
simultaneously.
The ``context`` object that is passed to many hooks can be used to share information
The `context` object that is passed to many hooks can be used to share information
about a file being worked on. Plugins must write private, plugin-specific data to
a subfolder named ``{options.work_folder}/ocrmypdf-plugin-name``. Plugins MAY
read and write files in ``options.work_folder``, but should be aware that their
a subfolder named `{options.work_folder}/ocrmypdf-plugin-name`. Plugins MAY
read and write files in `options.work_folder`, but should be aware that their
semantics are subject to change.
OCRmyPDF will delete ``options.work_folder`` when it has finished OCRing
a file, unless invoked with ``--keep-temporary-files``.
OCRmyPDF will delete `options.work_folder` when it has finished OCRing
a file, unless invoked with `--keep-temporary-files`.
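The private subfolder convention can be sketched as a small helper. This is an illustration only: `plugin_private_dir` is a hypothetical name, `ocrmypdf-exampleplugin` stands in for the real plugin name, and the `SimpleNamespace` object is a stand-in for the options object OCRmyPDF passes to hooks. It assumes only that `options.work_folder` is a path.

```python
import tempfile
from pathlib import Path
from types import SimpleNamespace


def plugin_private_dir(options, folder_name: str = "ocrmypdf-exampleplugin") -> Path:
    """Return the plugin's private subfolder of the work folder, creating it if needed.

    Hypothetical helper; `folder_name` is an assumed plugin-specific name.
    """
    private = Path(options.work_folder) / folder_name
    # exist_ok=True tolerates several workers creating the folder concurrently.
    private.mkdir(parents=True, exist_ok=True)
    return private


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # Stand-in for the real options object, for demonstration only.
        fake_options = SimpleNamespace(work_folder=tmp)
        print(plugin_private_dir(fake_options))
```

Because the subfolder lives inside `options.work_folder`, it is removed together with the work folder unless `--keep-temporary-files` is given.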
The documentation for some plugin hooks contains a detailed description of the
execution context in which they will be called.
@@ -119,114 +112,139 @@ Plugins should be prepared to work whether executed in worker threads or worker
processes. Generally, OCRmyPDF uses processes, but has a semi-hidden threaded
argument that simplifies debugging.
Plugin hooks
============
## Plugin hooks
A plugin may provide the following hooks. Hooks must be decorated with
``ocrmypdf.hookimpl``, for example:
`ocrmypdf.hookimpl`, for example:
.. code-block:: python
```python
from ocrmypdf import hookimpl
from ocrmypdf import hookimpl
@hookimpl
def add_options(parser):
    pass
@hookimpl
def add_options(parser):
    pass
```
The following is a complete list of hooks that are available, and when
they are called.
.. _firstresult:
(firstresult)=
**Note on firstresult hooks**
If multiple plugins install implementations for this hook, they will be called in
the reverse of the order in which they are installed (i.e., last plugin wins).
When each hook implementation is called in order, the first implementation that
returns a value other than ``None`` will "win" and prevent execution of all other
returns a value other than `None` will "win" and prevent execution of all other
hooks. As such, you cannot "chain" a series of plugin filters together in this
way. Instead, a single hook implementation should be responsible for any such
chaining operations.
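The firstresult behavior described in this note can be modeled in plain Python. The sketch below is not how OCRmyPDF or its plugin manager is implemented; `call_firstresult` and the three `plugin_*` functions are hypothetical names used only to illustrate the calling order and the "first non-`None` wins" rule.

```python
from typing import Callable, Optional

Impl = Callable[[str], Optional[str]]


def call_firstresult(impls_in_install_order: list[Impl], arg: str) -> Optional[str]:
    """Call implementations last-installed first; return the first non-None result."""
    for impl in reversed(impls_in_install_order):
        result = impl(arg)
        if result is not None:
            # Remaining implementations are never called.
            return result
    return None


def plugin_a(text: str) -> Optional[str]:
    return text.upper()


def plugin_b(text: str) -> Optional[str]:
    return None  # declines, so the next implementation gets a chance


def plugin_c(text: str) -> Optional[str]:
    return text[::-1]


if __name__ == "__main__":
    # plugin_c was installed last, so it is asked first and wins.
    print(call_firstresult([plugin_a, plugin_b, plugin_c], "abc"))
    # plugin_b declines, so plugin_a supplies the result.
    print(call_firstresult([plugin_a, plugin_b], "abc"))
```

In this model the output of `plugin_a` is never passed to `plugin_c`, which is why filters cannot be chained across plugins through a firstresult hook.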
Examples
========
## Examples
* OCRmyPDF's test suite contains several plugins that are used to simulate certain
- OCRmyPDF's test suite contains several plugins that are used to simulate certain
test conditions.
* `ocrmypdf-papermerge <https://github.com/papermerge/OCRmyPDF_papermerge>`_ is
- [ocrmypdf-papermerge](https://github.com/papermerge/OCRmyPDF_papermerge) is
a production plugin that integrates OCRmyPDF and the Papermerge document
management system.
### Suppressing or overriding other plugins
Suppressing or overriding other plugins
---------------------------------------
```{eval-rst}
.. autofunction:: ocrmypdf.pluginspec.initialize
```
Custom command line arguments
-----------------------------
### Custom command line arguments
```{eval-rst}
.. autofunction:: ocrmypdf.pluginspec.add_options
```
```{eval-rst}
.. autofunction:: ocrmypdf.pluginspec.check_options
```
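A plugin that adds its own command line argument might pair the two hooks roughly as follows. This is a sketch under assumptions: the `--example-threshold` flag is hypothetical, `ValueError` is a stand-in for whatever exception type the hook documentation asks for, and the fallback decorator exists only so the example can run where OCRmyPDF is not installed.

```python
import argparse

try:
    from ocrmypdf import hookimpl
except ImportError:
    # Fallback for reading or running this sketch without OCRmyPDF installed.
    def hookimpl(func):
        return func


@hookimpl
def add_options(parser):
    # Hypothetical option, grouped so it is easy to spot in --help output.
    group = parser.add_argument_group("Example plugin")
    group.add_argument(
        "--example-threshold",
        type=int,
        default=50,
        help="Hypothetical threshold between 0 and 100",
    )


@hookimpl
def check_options(options):
    # Reject values the hypothetical plugin cannot work with.
    if not 0 <= options.example_threshold <= 100:
        # Stand-in exception; a real plugin should follow the hook documentation.
        raise ValueError("--example-threshold must be between 0 and 100")


if __name__ == "__main__":
    demo_parser = argparse.ArgumentParser()
    add_options(demo_parser)
    demo_options = demo_parser.parse_args(["--example-threshold", "70"])
    check_options(demo_options)
    print(demo_options.example_threshold)
```

Keeping validation in `check_options` rather than in `add_options` means the argument is checked once all options are known, before any file is processed.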
Execution and progress reporting
--------------------------------
### Execution and progress reporting
```{eval-rst}
.. autoclass:: ocrmypdf.pluginspec.ProgressBar
:members:
:special-members: __init__, __enter__, __exit__
```
```{eval-rst}
.. autoclass:: ocrmypdf.pluginspec.Executor
:members:
:special-members: __call__
```
```{eval-rst}
.. autofunction:: ocrmypdf.pluginspec.get_logging_console
```
```{eval-rst}
.. autofunction:: ocrmypdf.pluginspec.get_executor
```
```{eval-rst}
.. autofunction:: ocrmypdf.pluginspec.get_progressbar_class
```
Applying special behavior before processing
-------------------------------------------
### Applying special behavior before processing
```{eval-rst}
.. autofunction:: ocrmypdf.pluginspec.validate
```
PDF page to image
-----------------
### PDF page to image
```{eval-rst}
.. autofunction:: ocrmypdf.pluginspec.rasterize_pdf_page
```
Modifying intermediate images
-----------------------------
### Modifying intermediate images
```{eval-rst}
.. autofunction:: ocrmypdf.pluginspec.filter_ocr_image
```
```{eval-rst}
.. autofunction:: ocrmypdf.pluginspec.filter_page_image
```
```{eval-rst}
.. autofunction:: ocrmypdf.pluginspec.filter_pdf_page
```
OCR engine
----------
### OCR engine
```{eval-rst}
.. autofunction:: ocrmypdf.pluginspec.get_ocr_engine
```
```{eval-rst}
.. autoclass:: ocrmypdf.pluginspec.OcrEngine
:members:
.. automethod:: __str__
```
```{eval-rst}
.. autoclass:: ocrmypdf.pluginspec.OrientationConfidence
```
PDF/A production
----------------
### PDF/A production
```{eval-rst}
.. autofunction:: ocrmypdf.pluginspec.generate_pdfa
```
PDF optimization
----------------
### PDF optimization
```{eval-rst}
.. autofunction:: ocrmypdf.pluginspec.optimize_pdf
```
.. autofunction:: ocrmypdf.pluginspec.is_optimization_enabled
```{eval-rst}
.. autofunction:: ocrmypdf.pluginspec.is_optimization_enabled
```

2840
docs/release_notes.md Normal file
View File

File diff suppressed because it is too large Load Diff
