mirror of
https://github.com/ocrmypdf/OCRmyPDF.git
synced 2025-12-23 22:28:05 -05:00
Convert remaining rst -> md
461
docs/advanced.md
Normal file
@@ -0,0 +1,461 @@
% SPDX-FileCopyrightText: 2022 James R. Barlow
% SPDX-License-Identifier: CC-BY-SA-4.0

# Advanced features

## Control of unpaper

OCRmyPDF uses `unpaper` to provide the implementation of the
`--clean` and `--clean-final` arguments.
[unpaper](https://github.com/Flameeyes/unpaper/blob/main/doc/basic-concepts.md)
provides a variety of image processing filters to improve images.

By default, OCRmyPDF uses only `unpaper` arguments that were found to
be safe to use on almost all files without having to inspect every page
of the file afterwards. This is particularly true when only `--clean`
is used, since that instructs OCRmyPDF to only clean the image before
OCR and not the final image.

However, if you wish to use the more aggressive options in `unpaper`,
you may use `--unpaper-args '...'` to override OCRmyPDF's defaults
and forward other arguments to unpaper. This option will forward
arguments to `unpaper` without any knowledge of what that program
considers to be valid arguments. The string of arguments must be quoted
as shown in the examples below. No filename arguments may be included.
OCRmyPDF will assume it can append the input and output filenames of
intermediate images to the `--unpaper-args` string.

In this example, we tell `unpaper` to expect two pages of text on a
sheet (image), such as occurs when two facing pages of a book are
scanned. `unpaper` uses this information to deskew each independently
and clean up the margins of both.

```bash
ocrmypdf --clean --clean-final --unpaper-args '--layout double' input.pdf output.pdf
ocrmypdf --clean --clean-final --unpaper-args '--layout double --no-noisefilter' input.pdf output.pdf
```
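
The way a quoted `--unpaper-args` string becomes a command line can be
sketched in Python. This is only an illustration of the behavior described
above, not OCRmyPDF's actual code: the helper name `build_unpaper_command`
and the file names are hypothetical.

```python
import shlex


def build_unpaper_command(unpaper_args: str, input_image: str, output_image: str) -> list[str]:
    """Split the quoted argument string and append the two file names.

    Hypothetical helper for illustration; the real implementation may differ.
    """
    # shlex.split honors shell-style quoting inside the string
    user_args = shlex.split(unpaper_args)
    # The file names always come last, which is why the argument string
    # itself must not contain any file names.
    return ["unpaper", *user_args, input_image, output_image]


command = build_unpaper_command(
    "--layout double --no-noisefilter", "page.pnm", "page_clean.pnm"
)
print(command)
```

Because the arguments are forwarded without validation, a typo in the quoted
string only surfaces when `unpaper` itself rejects it.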

:::{warning}
Some `unpaper` features will reposition text within the image.
`--clean-final` is recommended to avoid this issue.
:::

:::{warning}
Some `unpaper` features cause multiple input or output files to be
consumed or produced. OCRmyPDF requires `unpaper` to consume one
file and produce one file; errors will result if this assumption is not
met.
:::

:::{note}
`unpaper` uses uncompressed PBM/PGM/PPM files for its intermediate
files. For large images or documents, it can take a lot of temporary
disk space.
:::
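
To get a feel for how much temporary space uncompressed images need, the
following sketch estimates the payload size of a single page image. It
assumes binary Netpbm files with 8 bits per channel and ignores the small
file header; the function name is made up for this example.

```python
import math


def uncompressed_bytes(width_px: int, height_px: int, fmt: str) -> int:
    """Approximate payload size of a binary Netpbm image (header ignored)."""
    if fmt == "ppm":  # 8-bit RGB: three bytes per pixel
        return width_px * height_px * 3
    if fmt == "pgm":  # 8-bit grayscale: one byte per pixel
        return width_px * height_px
    if fmt == "pbm":  # bilevel: one bit per pixel, rows padded to whole bytes
        return math.ceil(width_px / 8) * height_px
    raise ValueError(f"unknown format: {fmt}")


# An 8.5 x 11 inch page scanned at 300 DPI is 2550 x 3300 pixels
for fmt in ("pbm", "pgm", "ppm"):
    size_mb = uncompressed_bytes(2550, 3300, fmt) / 1_000_000
    print(f"{fmt}: {size_mb:.1f} MB")
```

Under these assumptions a color page needs about 25 MB per intermediate
file, so a document with hundreds of color pages can briefly occupy several
gigabytes.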

## Control of OCR options

OCRmyPDF provides many features to control the behavior of the OCR
engine, Tesseract.

### When OCR is skipped

If a page in a PDF seems to have text, by default OCRmyPDF will exit
without modifying the PDF. This is to ensure that PDFs that were
previously OCRed or were "born digital" rather than scanned are not
processed.

If `--skip-text` is issued, then no image processing or OCR will be
performed on pages that already have text. The page will be copied to
the output. This may be useful for documents that contain both "born
digital" and scanned content, or to use OCRmyPDF to normalize and
convert to PDF/A regardless of their contents.

If `--redo-ocr` is issued, then a detailed text analysis is performed.
Text is categorized as either visible or invisible. Invisible text (OCR)
is stripped out. Then an image of each page is created with visible text
masked out. The page image is sent for OCR, and any additional text is
inserted as OCR. If a file contains a mix of text and bitmap images that
contain text, OCRmyPDF will locate the additional text in images without
disrupting the existing text. Some PDF OCR solutions render text as
technically printable or visible in some way, perhaps by drawing it and
then painting over it. OCRmyPDF cannot distinguish this type of OCR
text from real text, so it will not be "redone".

If `--force-ocr` is issued, then all pages will be rasterized to
images, discarding any hidden OCR text, rasterizing any printable
text, and flattening form fields or interactive objects into their visual
representation. This is useful for redoing OCR, for fixing OCR text
with a damaged character map (text is selectable but not searchable),
and destroying redacted information.

### Time and image size limits

By default, OCRmyPDF permits tesseract to run for three minutes (180
seconds) per page. This is usually more than enough time to find all
text on a reasonably sized page with modern hardware.

If a page is skipped, it will be inserted without OCR. If preprocessing
was requested, the preprocessed image layer will be inserted.

If you want to adjust the amount of time spent on OCR, change
`--tesseract-timeout`. You can also automatically skip images that
exceed a certain number of megapixels with `--skip-big`. (A 300 DPI,
8.5×11" page image is 8.4 megapixels.)

```bash
# Allow 300 seconds for OCR; skip any page larger than 50 megapixels
ocrmypdf --tesseract-timeout 300 --skip-big 50 bigfile.pdf output.pdf
```
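
The megapixel figure quoted above follows directly from the page size and
resolution. A short sketch of the arithmetic, useful when choosing a value
for `--skip-big` (the function name is only illustrative):

```python
def megapixels(width_in: float, height_in: float, dpi: float) -> float:
    """Size of a page image in megapixels, given its size in inches and DPI."""
    return (width_in * dpi) * (height_in * dpi) / 1_000_000


letter = megapixels(8.5, 11, 300)  # 2550 x 3300 pixels
poster = megapixels(36, 48, 300)   # 10800 x 14400 pixels

print(f"letter page: {letter:.1f} MP, above a 50 MP limit: {letter > 50}")
print(f"36x48 inch sheet: {poster:.1f} MP, above a 50 MP limit: {poster > 50}")
```

With a limit of 50 megapixels, ordinary office pages are far below the
threshold, while large-format scans at the same resolution exceed it.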

### OCR for huge images

Tesseract has internal limits on the size
of images it will process. By default,
`--tesseract-downsample-large-images` is enabled, and OCRmyPDF will
downsample images to fit Tesseract limits. (The limits are usually encountered
only for scanned images of oversized media, such as large maps or blueprints exceeding
110 cm or 43 inches in either dimension, and at high DPI.) This feature can be disabled
using `--no-tesseract-downsample-large-images`.

`--tesseract-downsample-above Npixels` adjusts the threshold at which images
will be downsampled. By default, only images that exceed any of Tesseract's
internal limits are downsampled (32767 pixels on either dimension).

You will also need to set `--tesseract-timeout` high enough to allow
for processing.

Only the image sent for OCR is downsampled. The original image is
preserved.

```bash
# Allow 600 seconds for OCR on huge images
ocrmypdf --tesseract-timeout 600 \
    --tesseract-downsample-large-images \
    bigfile.pdf output.pdf

# Downsample images above 5000 pixels on the longest dimension to
# 5000 pixels
ocrmypdf --tesseract-timeout 120 \
    --tesseract-downsample-large-images \
    --tesseract-downsample-above 5000 \
    bigfile.pdf output_downsampled_ocr.pdf
```
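
The effect of the threshold can be sketched as a uniform scale that brings
the longest side down to the limit. This is an assumption made for
illustration: the text above does not specify how OCRmyPDF rounds or
resamples, and `downsampled_size` is a hypothetical helper.

```python
TESSERACT_MAX_DIM = 32767  # per-dimension limit quoted above


def downsampled_size(width_px: int, height_px: int,
                     max_dim: int = TESSERACT_MAX_DIM) -> tuple[int, int]:
    """Scale (width, height) uniformly so the longest side fits max_dim."""
    longest = max(width_px, height_px)
    if longest <= max_dim:
        return width_px, height_px  # already small enough, leave untouched
    # Integer arithmetic keeps the longest side at exactly max_dim
    new_width = max(1, width_px * max_dim // longest)
    new_height = max(1, height_px * max_dim // longest)
    return new_width, new_height


# A 44 x 30 inch sheet scanned at 800 DPI is 35200 x 24000 pixels
print(downsampled_size(35200, 24000))
print(downsampled_size(35200, 24000, max_dim=5000))
print(downsampled_size(2550, 3300))
```

A lower threshold shortens OCR time at the cost of detail, so small print on
a large sheet may no longer be recognized.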

### Overriding default tesseract

OCRmyPDF checks the system `PATH` for the `tesseract` binary.

Some relevant environment variables that influence Tesseract's behavior
include:

```{eval-rst}
.. envvar:: TESSDATA_PREFIX

    Overrides the path to Tesseract's data files. This can allow
    simultaneous installation of the "best" and "fast" training data
    sets. OCRmyPDF does not manage this environment variable.
```

```{eval-rst}
.. envvar:: OMP_THREAD_LIMIT

    Controls the number of threads Tesseract will use. OCRmyPDF will
    manage this environment variable if it is not already set.
```

For example, if you have a development build of Tesseract and don't wish to
use the system installation, you can launch OCRmyPDF as follows:

```bash
env \
    PATH=/home/user/src/tesseract/api:$PATH \
    TESSDATA_PREFIX=/home/user/src/tesseract \
    ocrmypdf input.pdf output.pdf
```

In this example `TESSDATA_PREFIX` is required to redirect Tesseract to
an alternate folder for its "tessdata" files.

### Overriding other support programs

In addition to tesseract, OCRmyPDF uses the following external binaries:

- `gs` (Ghostscript)
- `unpaper`
- `pngquant`
- `jbig2`

In each case OCRmyPDF will search the `PATH` environment variable to
locate the binaries. By modifying the `PATH` environment variable, you
can override the binaries that OCRmyPDF uses.
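
Before changing `PATH`, it can help to check which executables a plain
`PATH` lookup currently resolves to. The sketch below uses the standard
library's `shutil.which`; it only mirrors the lookup described above and is
not OCRmyPDF's own detection code. Executable names may also differ by
platform, which this example does not account for.

```python
import shutil

SUPPORT_PROGRAMS = ["tesseract", "gs", "unpaper", "pngquant", "jbig2"]


def locate_support_programs(names=SUPPORT_PROGRAMS, path=None):
    """Map each program name to the executable found on PATH, or None.

    `path` may be a PATH-style string to test a modified search path
    without changing the environment.
    """
    return {name: shutil.which(name, path=path) for name in names}


for name, location in locate_support_programs().items():
    print(f"{name}: {location or 'not found on PATH'}")
```

Passing a candidate search path through the `path` argument shows which
binaries would win if that value were used as `PATH`.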

### Changing Tesseract configuration variables

You can override Tesseract's default
[control parameters](https://tesseract-ocr.github.io/tessdoc/tess3/ControlParams.html)
with a configuration file.

As an example, this configuration will disable Tesseract's dictionary
for the current language. Normally the dictionary is helpful for
interpolating words that are unclear, but it may interfere with OCR if
the document does not contain many words (for example, a list of part
numbers).

Create a file named "no-dict.cfg" with these contents:

```
load_system_dawg 0
language_model_penalty_non_dict_word 0
language_model_penalty_non_freq_dict_word 0
```

Then run ocrmypdf as follows (along with any other desired arguments):

```bash
ocrmypdf --tesseract-config no-dict.cfg input.pdf output.pdf
```

:::{warning}
Some combinations of control parameters will break Tesseract or break
assumptions that OCRmyPDF makes about Tesseract's output.
:::

### Changing page segmentation mode

The directive `--tesseract-pagesegmode Nmode` forwards the desired page segmentation
mode to Tesseract OCR. The default is 3.

Page segmentation can improve OCR results when you know that a PDF ought to be
analyzed a particular way, such as PDFs whose pages contain only a single line of
text. For the vast majority of users, changing the page segmentation mode will only
make things worse.

As of June 2024, the Tesseract page segmentation modes are:

| ID  | Description                                                                                   |
| --- | --------------------------------------------------------------------------------------------- |
| 0   | Orientation and script detection (OSD) only.                                                  |
| 1   | Automatic page segmentation with OSD.                                                         |
| 2   | Automatic page segmentation, but no OSD, or OCR. (not implemented)                            |
| 3   | Fully automatic page segmentation, but no OSD. (Default)                                      |
| 4   | Assume a single column of text of variable sizes.                                             |
| 5   | Assume a single uniform block of vertically aligned text.                                     |
| 6   | Assume a single uniform block of text.                                                        |
| 7   | Treat the image as a single text line.                                                        |
| 8   | Treat the image as a single word.                                                             |
| 9   | Treat the image as a single word in a circle.                                                 |
| 10  | Treat the image as a single character.                                                        |
| 11  | Sparse text. Find as much text as possible in no particular order.                            |
| 12  | Sparse text with OSD.                                                                         |
| 13  | Raw line. Treat the image as a single text line, bypassing hacks that are Tesseract-specific. |

Modes 0, 1, 2, and 12 (all of those that enable orientation and script detection)
are not compatible with OCRmyPDF, which performs OSD in a separate step from OCR.
Their use may interfere with `--rotate-pages` and other features.

It is currently not possible to use advanced Tesseract OCR features, such as creating
OCR information, when using Tesseract through OCRmyPDF.
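
A wrapper script that accepts a page segmentation mode from its own users
could screen the value before passing it to `--tesseract-pagesegmode`. The
following is a minimal sketch built only from the table and the
compatibility note above; the function and constant names are hypothetical
and OCRmyPDF itself may behave differently.

```python
VALID_MODES = range(0, 14)          # IDs 0 through 13 from the table above
INCOMPATIBLE_MODES = {0, 1, 2, 12}  # modes listed above as not compatible
DEFAULT_MODE = 3


def check_pagesegmode(mode: int = DEFAULT_MODE) -> str:
    """Classify a page segmentation mode as 'ok' or 'incompatible'."""
    if mode not in VALID_MODES:
        raise ValueError(f"page segmentation mode must be 0-13, got {mode}")
    if mode in INCOMPATIBLE_MODES:
        return "incompatible"
    return "ok"


for mode in (3, 7, 12):
    print(mode, check_pagesegmode(mode))
```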

## Changing the PDF renderer

rasterizing

: Converting a PDF to an image for display.

rendering

: Creating a new PDF from other data (such as an existing PDF).

OCRmyPDF has these PDF renderers: `sandwich` and `hocr`. The
renderer may be selected using `--pdf-renderer`. The default is
`auto` which lets OCRmyPDF select the renderer to use. Currently,
`auto` always selects `hocr`.

### The `hocr` renderer

:::{versionchanged} 16.0.0
:::

In both renderers, a text-only layer is rendered and sandwiched (overlaid)
onto either the original PDF page or a newly rasterized version of the
original PDF page (when `--force-ocr` is used). In this way, loss
of PDF information is generally avoided. (You may need to disable PDF/A
conversion and optimization to eliminate all lossy transformations.)

The current approach used by the new hOCR renderer is a re-implementation
of Tesseract's PDF renderer, using the same Glyphless font and general
ideas, but fixing many technical issues that impeded it. The new `hocr`
renderer provides better text placement accuracy, avoids issues with word
segmentation, and provides better positioning of skewed text.

Using the experimental API, it is also possible to edit the OCR output
from Tesseract, using any tool that is capable of editing hOCR files.

Older versions of this renderer did not support non-Latin languages, but
it is now universal.

### The `sandwich` renderer

The `sandwich` renderer uses Tesseract's text-only PDF feature,
which produces a PDF page that lays out the OCR in invisible text.

Currently some problematic PDF viewers like Mozilla PDF.js and macOS
Preview have problems with segmenting its text output, and
mightrunseveralwordstogether. It also does not implement right-to-left
fonts (Arabic, Hebrew, Persian). The output of this renderer cannot
be edited. The sandwich renderer is retained for testing.

When image preprocessing features like `--deskew` are used, the
original PDF will be rendered as a full page and the OCR layer will be
placed on top.

## Rendering and rasterizing options

:::{versionadded} 14.3.0
:::

The `--continue-on-soft-render-error` option allows OCRmyPDF to
proceed if a page cannot be rasterized/rendered. This is useful if you are
trying to get the best possible OCR from a PDF that is not well-formed,
and you are willing to accept some pages that may not visually match the
input, and that may not OCR well.

## Color conversion strategy

:::{versionadded} 15.0.0
:::

OCRmyPDF uses Ghostscript to convert PDF to PDF/A. In some cases, this
conversion requires color conversion. The default strategy is to convert
using the `LeaveColorUnchanged` strategy, which preserves the original
color space wherever possible (some rare color spaces might still be
converted).

Usually document scanners produce PDFs in the sRGB color space, and do
not need to be converted, so the default strategy is appropriate.

Suppose that you have a document that was prepared for professional
printing in a Separation or CMYK color space, and text was converted to
curves. In this case, you may want to use a different color conversion
strategy. The `--color-conversion-strategy` option allows you to select a
different strategy, such as `RGB`.

## Return code policy

OCRmyPDF writes all messages to `stderr`. `stdout` is reserved for
piping output files. `stdin` is reserved for piping input files.

The return codes generated by OCRmyPDF are considered part of the
stable user interface. They may be imported from
`ocrmypdf.exceptions`.

```{eval-rst}
.. list-table:: Return codes
    :widths: 5 35 60
    :header-rows: 1

    * - Code
      - Name
      - Interpretation
    * - 0
      - ``ExitCode.ok``
      - Everything worked as expected.
    * - 1
      - ``ExitCode.bad_args``
      - Invalid arguments, exited with an error.
    * - 2
      - ``ExitCode.input_file``
      - The input file does not seem to be a valid PDF.
    * - 3
      - ``ExitCode.missing_dependency``
      - An external program required by OCRmyPDF is missing.
    * - 4
      - ``ExitCode.invalid_output_pdf``
      - An output file was created, but it does not seem to be a valid PDF. The file will be available.
    * - 5
      - ``ExitCode.file_access_error``
      - The user running OCRmyPDF does not have sufficient permissions to read the input file and write the output file.
    * - 6
      - ``ExitCode.already_done_ocr``
      - The file already appears to contain text so it may not need OCR. See output message.
    * - 7
      - ``ExitCode.child_process_error``
      - An error occurred in an external program (child process) and OCRmyPDF cannot continue.
    * - 8
      - ``ExitCode.encrypted_pdf``
      - The input PDF is encrypted. OCRmyPDF does not read encrypted PDFs. Use another program such as ``qpdf`` to remove encryption.
    * - 9
      - ``ExitCode.invalid_config``
      - A custom configuration file was forwarded to Tesseract using ``--tesseract-config``, and Tesseract rejected this file.
    * - 10
      - ``ExitCode.pdfa_conversion_failed``
      - A valid PDF was created, but PDF/A conversion failed. The file will be available.
    * - 15
      - ``ExitCode.other_error``
      - Some other error occurred.
    * - 130
      - ``ExitCode.ctrl_c``
      - The program was interrupted by pressing Ctrl+C.

```
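
A script that launches OCRmyPDF as a child process can translate the return
code into a name and decide whether an output file is worth inspecting. To
keep this sketch self-contained, it defines a local `ExitCode` enum copied
from the table above instead of importing the real one from
`ocrmypdf.exceptions`; the two helper functions are hypothetical.

```python
from enum import IntEnum


class ExitCode(IntEnum):
    """Local mirror of the table above, for illustration only."""
    ok = 0
    bad_args = 1
    input_file = 2
    missing_dependency = 3
    invalid_output_pdf = 4
    file_access_error = 5
    already_done_ocr = 6
    child_process_error = 7
    encrypted_pdf = 8
    invalid_config = 9
    pdfa_conversion_failed = 10
    other_error = 15
    ctrl_c = 130


def describe(returncode: int) -> str:
    """Return the symbolic name of a return code, or flag it as unknown."""
    try:
        return ExitCode(returncode).name
    except ValueError:
        return f"unknown ({returncode})"


def output_file_exists(returncode: int) -> bool:
    """True for codes where the table says an output file is available."""
    return returncode in (
        ExitCode.ok,
        ExitCode.invalid_output_pdf,
        ExitCode.pdfa_conversion_failed,
    )


for code in (0, 6, 10, 42):
    print(code, describe(code), output_file_exists(code))
```

In practice the integer would come from the `returncode` attribute of the
object returned by `subprocess.run`.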

(tmpdir)=

## Changing temporary storage location

OCRmyPDF generates many temporary files during processing.

To change where temporary files are stored, change the `TMPDIR`
environment variable for ocrmypdf's environment. (Python's
`tempfile.gettempdir()` returns the root directory in which temporary
files will be stored.) For example, one could redirect `TMPDIR` to a
large RAM disk to avoid wear on HDD/SSD and potentially improve
performance.

On Windows, the `TEMP` environment variable is used instead.
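
The link between `TMPDIR` and `tempfile.gettempdir()` can be observed
directly in Python. The sketch below points `TMPDIR` at a scratch directory
inside a running interpreter; the helper name is hypothetical, and clearing
the cached `tempfile.tempdir` value is needed only because the variable
changes after Python has started.

```python
import os
import tempfile


def tempdir_with_tmpdir(new_root: str) -> str:
    """Return what tempfile.gettempdir() reports when TMPDIR is new_root."""
    old_env = os.environ.get("TMPDIR")
    old_cached = tempfile.tempdir
    try:
        os.environ["TMPDIR"] = new_root
        tempfile.tempdir = None  # drop the cached value so TMPDIR is re-read
        return tempfile.gettempdir()
    finally:
        # Restore the previous state so the rest of the program is unaffected
        if old_env is None:
            os.environ.pop("TMPDIR", None)
        else:
            os.environ["TMPDIR"] = old_env
        tempfile.tempdir = old_cached


with tempfile.TemporaryDirectory() as scratch:
    reported = tempdir_with_tmpdir(scratch)
    print("TMPDIR honored:", reported == scratch)
```

When launching `ocrmypdf` from a shell, setting the variable before the
command starts is sufficient, since the new process reads it on first use.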

## Debugging the intermediate files

OCRmyPDF normally saves its intermediate results to a temporary folder
and deletes this folder when it exits, whether it succeeded or failed.

If the `--keep-temporary-files` (`-k`) argument is issued on the
command line, OCRmyPDF will keep the temporary folder and print the location,
whether it succeeded or failed. An example message is:

```none
Temporary working files retained at:
/tmp/ocrmypdf.io.u20wpz07
```

When OCRmyPDF is launched as a snap, this corresponds to the snap filesystem, for instance:

> /tmp/snap-private-tmp/snap.ocrmypdf/tmp/ocrmypdf.io.u20wpz07

The organization of this folder is an implementation detail and subject
to change between releases. However, the general organization is that
working files on a per page basis have the page number as a prefix
(starting with page 1), an infix indicates the processing stage, and a
suffix indicates the file type. Some important files include:

- `_rasterize.png` - what the input page looks like
- `_ocr.png` - the file that is sent to Tesseract for OCR; depending
  on arguments this may differ from the presentation image
- `_pp_deskew.png` - the image, after deskewing
- `_pp_clean.png` - the image, after cleaning with unpaper
- `_ocr_hocr.pdf` - the OCR file; appears as a blank page with invisible
  text embedded
- `_ocr_hocr.txt` - the OCR text (not necessarily all text on the page,
  if the page is mixed format)
- `fix_docinfo.pdf` - a temporary file created to fix the PDF DocumentInfo
  data structure
- `graft_layers.pdf` - the rendered PDF with OCR layers grafted on
- `pdfa.pdf` - `graft_layers.pdf` after conversion to PDF/A
- `pdfa.ps` - a PostScript file used by Ghostscript for PDF/A conversion
- `optimize.pdf` - the PDF generated before optimization
- `optimize.out.pdf` - the PDF generated by optimization
- `origin` - the input file
- `origin.pdf` - the input file or the input image converted to PDF
- `images/*` - images extracted during the optimization process; here
  the prefix indicates a PDF object ID not a page number
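
When inspecting a retained folder, the per-page files can be grouped by
splitting each name into page number, processing stage, and file type. The
sketch below assumes names of the form `000001_pp_deskew.png`, that is, a
run of digits followed by an underscore. That exact pattern is an assumption
made for this example, since the folder layout is an implementation detail.

```python
import re
from typing import NamedTuple, Optional


class WorkingFile(NamedTuple):
    page: int      # page number, starting with 1
    stage: str     # processing stage, for example "pp_deskew"
    filetype: str  # file extension without the dot


# Assumed pattern: <digits>_<stage>.<extension>
_PAGE_FILE = re.compile(r"^(?P<page>\d+)_(?P<stage>.+)\.(?P<filetype>[^.]+)$")


def parse_working_file(name: str) -> Optional[WorkingFile]:
    """Split a per-page working file name; return None for other files."""
    match = _PAGE_FILE.match(name)
    if match is None:
        return None  # document-level file such as origin.pdf
    return WorkingFile(int(match["page"]), match["stage"], match["filetype"])


for name in ("000001_rasterize.png", "000012_ocr_hocr.pdf", "origin.pdf"):
    print(name, "->", parse_working_file(name))
```

Document-level files such as `graft_layers.pdf` carry no page prefix, so the
function returns `None` for them.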
|
||||
@@ -1,486 +0,0 @@
|
||||
.. SPDX-FileCopyrightText: 2022 James R. Barlow
|
||||
.. SPDX-License-Identifier: CC-BY-SA-4.0
|
||||
|
||||
=================
|
||||
Advanced features
|
||||
=================
|
||||
|
||||
Control of unpaper
|
||||
==================
|
||||
|
||||
OCRmyPDF uses ``unpaper`` to provide the implementation of the
|
||||
``--clean`` and ``--clean-final`` arguments.
|
||||
`unpaper <https://github.com/Flameeyes/unpaper/blob/main/doc/basic-concepts.md>`__
|
||||
provides a variety of image processing filters to improve images.
|
||||
|
||||
By default, OCRmyPDF uses only ``unpaper`` arguments that were found to
|
||||
be safe to use on almost all files without having to inspect every page
|
||||
of the file afterwards. This is particularly true when only ``--clean``
|
||||
is used, since that instructs OCRmyPDF to only clean the image before
|
||||
OCR and not the final image.
|
||||
|
||||
However, if you wish to use the more aggressive options in ``unpaper``,
|
||||
you may use ``--unpaper-args '...'`` to override the OCRmyPDF's defaults
|
||||
and forward other arguments to unpaper. This option will forward
|
||||
arguments to ``unpaper`` without any knowledge of what that program
|
||||
considers to be valid arguments. The string of arguments must be quoted
|
||||
as shown in the examples below. No filename arguments may be included.
|
||||
OCRmyPDF will assume it can append input and output filename of
|
||||
intermediate images to the ``--unpaper-args`` string.
|
||||
|
||||
In this example, we tell ``unpaper`` to expect two pages of text on a
|
||||
sheet (image), such as occurs when two facing pages of a book are
|
||||
scanned. ``unpaper`` uses this information to deskew each independently
|
||||
and clean up the margins of both.
|
||||
|
||||
.. code-block:: bash
|
||||
|
||||
ocrmypdf --clean --clean-final --unpaper-args '--layout double' input.pdf output.pdf
|
||||
ocrmypdf --clean --clean-final --unpaper-args '--layout double --no-noisefilter' input.pdf output.pdf
|
||||
|
||||
.. warning::
|
||||
|
||||
Some ``unpaper`` features will reposition text within the image.
|
||||
``--clean-final`` is recommended to avoid this issue.
|
||||
|
||||
.. warning::
|
||||
|
||||
Some ``unpaper`` features cause multiple input or output files to be
|
||||
consumed or produced. OCRmyPDF requires ``unpaper`` to consume one
|
||||
file and produce one file; errors will result if this assumption is not
|
||||
met.
|
||||
|
||||
.. note::
|
||||
|
||||
``unpaper`` uses uncompressed PBM/PGM/PPM files for its intermediate
|
||||
files. For large images or documents, it can take a lot of temporary
|
||||
disk space.
|
||||
|
||||
Control of OCR options
|
||||
======================
|
||||
|
||||
OCRmyPDF provides many features to control the behavior of the OCR
|
||||
engine, Tesseract.
|
||||
|
||||
When OCR is skipped
|
||||
-------------------
|
||||
|
||||
If a page in a PDF seems to have text, by default OCRmyPDF will exit
|
||||
without modifying the PDF. This is to ensure that PDFs that were
|
||||
previously OCRed or were "born digital" rather than scanned are not
|
||||
processed.
|
||||
|
||||
If ``--skip-text`` is issued, then no image processing or OCR will be
|
||||
performed on pages that already have text. The page will be copied to
|
||||
the output. This may be useful for documents that contain both "born
|
||||
digital" and scanned content, or to use OCRmyPDF to normalize and
|
||||
convert to PDF/A regardless of their contents.
|
||||
|
||||
If ``--redo-ocr`` is issued, then a detailed text analysis is performed.
|
||||
Text is categorized as either visible or invisible. Invisible text (OCR)
|
||||
is stripped out. Then an image of each page is created with visible text
|
||||
masked out. The page image is sent for OCR, and any additional text is
|
||||
inserted as OCR. If a file contains a mix of text and bitmap images that
|
||||
contain text, OCRmyPDF will locate the additional text in images without
|
||||
disrupting the existing text. Some PDF OCR solutions render text as
|
||||
technically printable or visible in some way, perhaps by drawing it and
|
||||
then painting over it. OCRmyPDF cannot distinguish this type of OCR
|
||||
text from real text, so it will not be "redone".
|
||||
|
||||
If ``--force-ocr`` is issued, then all pages will be rasterized to
|
||||
images, discarding any hidden OCR text, rasterizing any printable
|
||||
text, and flattening form fields or interactive objects into their visual
|
||||
representation. This is useful for redoing OCR, for fixing OCR text
|
||||
with a damaged character map (text is selectable but not searchable),
|
||||
and destroying redacted information.
|
||||
|
||||
Time and image size limits
|
||||
--------------------------
|
||||
|
||||
By default, OCRmyPDF permits tesseract to run for three minutes (180
|
||||
seconds) per page. This is usually more than enough time to find all
|
||||
text on a reasonably sized page with modern hardware.
|
||||
|
||||
If a page is skipped, it will be inserted without OCR. If preprocessing
|
||||
was requested, the preprocessed image layer will be inserted.
|
||||
|
||||
If you want to adjust the amount of time spent on OCR, change
|
||||
``--tesseract-timeout``. You can also automatically skip images that
|
||||
exceed a certain number of megapixels with ``--skip-big``. (A 300 DPI,
|
||||
8.5×11" page image is 8.4 megapixels.)
|
||||
|
||||
.. code-block:: bash
|
||||
|
||||
# Allow 300 seconds for OCR; skip any page larger than 50 megapixels
|
||||
ocrmypdf --tesseract-timeout 300 --skip-big 50 bigfile.pdf output.pdf
|
||||
|
||||
OCR for huge images
|
||||
-------------------
|
||||
|
||||
Tesseract has internal limits on the size
|
||||
of images it will process. By default,
|
||||
``--tesseract-downsample-large-images`` is enabled, and OCRmyPDF will
|
||||
downsample images to fit Tesseract limits. (The limits are usually encountered
|
||||
only for scanned images of oversized media, such as large maps or blueprints exceeding
|
||||
110 cm or 43 inches in either dimension, and at high DPI.) This feature can disabled
|
||||
using ``--no-tesseract-downsample-large-images``.
|
||||
|
||||
``--tesseract-downsample-above Npixels`` adjusts the threshold at which images
|
||||
will be downsampled. By default, only images that exceed any of Tesseract's
|
||||
internal limits are downsampled (32767 pixels on either dimension).
|
||||
|
||||
You will also need to set ``--tesseract-timeout`` high enough to allow
|
||||
for processing.
|
||||
|
||||
Only the image sent for OCR is downsampled. The original image is
|
||||
preserved.
|
||||
|
||||
.. code-block:: bash
|
||||
|
||||
# Allow 600 seconds for OCR on huge images
|
||||
ocrmypdf --tesseract-timeout 600 \
|
||||
--tesseract-downsample-large-images \
|
||||
bigfile.pdf output.pdf
|
||||
|
||||
# Downsample images above 5000 pixels on the longest dimension to
|
||||
# 5000 pixels
|
||||
ocrmypdf --tesseract-timeout 120 \
|
||||
--tesseract-downsample-large-images \
|
||||
--tesseract-downsample-above 5000 \
|
||||
bigfile.pdf output_downsampled_ocr.pdf
|
||||
|
||||
|
||||
Overriding default tesseract
|
||||
----------------------------
|
||||
|
||||
OCRmyPDF checks the system ``PATH`` for the ``tesseract`` binary.
|
||||
|
||||
Some relevant environment variables that influence Tesseract's behavior
|
||||
include:
|
||||
|
||||
.. envvar:: TESSDATA_PREFIX
|
||||
|
||||
Overrides the path to Tesseract's data files. This can allow
|
||||
simultaneous installation of the "best" and "fast" training data
|
||||
sets. OCRmyPDF does not manage this environment variable.
|
||||
|
||||
.. envvar:: OMP_THREAD_LIMIT
|
||||
|
||||
Controls the number of threads Tesseract will use. OCRmyPDF will
|
||||
manage this environment variable if it is not already set.
|
||||
|
||||
For example, if you have a development build of Tesseract don't wish to
|
||||
use the system installation, you can launch OCRmyPDF as follows:
|
||||
|
||||
.. code-block:: bash
|
||||
|
||||
env \
|
||||
PATH=/home/user/src/tesseract/api:$PATH \
|
||||
TESSDATA_PREFIX=/home/user/src/tesseract \
|
||||
ocrmypdf input.pdf output.pdf
|
||||
|
||||
In this example ``TESSDATA_PREFIX`` is required to redirect Tesseract to
|
||||
an alternate folder for its "tessdata" files.
|
||||
|
||||
Overriding other support programs
|
||||
---------------------------------
|
||||
|
||||
In addition to tesseract, OCRmyPDF uses the following external binaries:
|
||||
|
||||
- ``gs`` (Ghostscript)
|
||||
- ``unpaper``
|
||||
- ``pngquant``
|
||||
- ``jbig2``
|
||||
|
||||
In each case OCRmyPDF will search the ``PATH`` environment variable to
|
||||
locate the binaries. By modifying the ``PATH`` environment variable, you
|
||||
can override the binaries that OCRmyPDF uses.
|
||||
|
||||
Changing Tesseract configuration variables
|
||||
------------------------------------------
|
||||
|
||||
You can override Tesseract's default `control
|
||||
parameters <https://tesseract-ocr.github.io/tessdoc/tess3/ControlParams.html>`__
|
||||
with a configuration file.
|
||||
|
||||
As an example, this configuration will disable Tesseract's dictionary
|
||||
for current language. Normally the dictionary is helpful for
|
||||
interpolating words that are unclear, but it may interfere with OCR if
|
||||
the document does not contain many words (for example, a list of part
|
||||
numbers).
|
||||
|
||||
Create a file named "no-dict.cfg" with these contents:
|
||||
|
||||
::
|
||||
|
||||
load_system_dawg 0
|
||||
language_model_penalty_non_dict_word 0
|
||||
language_model_penalty_non_freq_dict_word 0
|
||||
|
||||
then run ocrmypdf as follows (along with any other desired arguments):
|
||||
|
||||
.. code-block:: bash
|
||||
|
||||
ocrmypdf --tesseract-config no-dict.cfg input.pdf output.pdf
|
||||
|
||||
.. warning::
|
||||
|
||||
Some combinations of control parameters will break Tesseract or break
|
||||
assumptions that OCRmyPDF makes about Tesseract's output.

Changing page segmentation mode
-------------------------------

The directive ``--tesseract-pagesegmode Nmode`` forwards the desired page segmentation
mode to Tesseract OCR. The default is 3.

Changing the page segmentation mode can improve OCR results when you know that a PDF
ought to be analyzed in a particular way, such as PDFs whose pages contain only a single
line of text. For the vast majority of users, changing the page segmentation mode will only
make things worse.

As of June 2024, the Tesseract page segmentation modes are:

.. list-table::
    :widths: 5 95
    :header-rows: 1

    *   - ID
        - Description
    *   - 0
        - Orientation and script detection (OSD) only.
    *   - 1
        - Automatic page segmentation with OSD.
    *   - 2
        - Automatic page segmentation, but no OSD, or OCR. (not implemented)
    *   - 3
        - Fully automatic page segmentation, but no OSD. (Default)
    *   - 4
        - Assume a single column of text of variable sizes.
    *   - 5
        - Assume a single uniform block of vertically aligned text.
    *   - 6
        - Assume a single uniform block of text.
    *   - 7
        - Treat the image as a single text line.
    *   - 8
        - Treat the image as a single word.
    *   - 9
        - Treat the image as a single word in a circle.
    *   - 10
        - Treat the image as a single character.
    *   - 11
        - Sparse text. Find as much text as possible in no particular order.
    *   - 12
        - Sparse text with OSD.
    *   - 13
        - Raw line. Treat the image as a single text line, bypassing hacks that are
          Tesseract-specific.

Modes 0, 1, 2, and 12 (all of those that enable orientation and script detection)
are not compatible with OCRmyPDF, which performs OSD in a separate step from OCR.
Their use may interfere with ``--rotate-pages`` and other features.

It is currently not possible to use advanced Tesseract OCR features, such as creating
OCR information, when using Tesseract through OCRmyPDF.
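
A script that builds OCRmyPDF command lines could reject the incompatible modes before running anything. The following is a minimal sketch under that assumption. The function ``pagesegmode_args`` is hypothetical and is not part of OCRmyPDF, which may handle these modes differently.

```python
# Modes the text lists as incompatible with OCRmyPDF.
INCOMPATIBLE_MODES = frozenset({0, 1, 2, 12})
# Tesseract page segmentation modes run from 0 through 13.
VALID_MODES = range(0, 14)


def pagesegmode_args(mode: int) -> list[str]:
    """Return the command line arguments for `mode`, or raise ValueError."""
    if mode not in VALID_MODES:
        raise ValueError(f"unknown page segmentation mode: {mode}")
    if mode in INCOMPATIBLE_MODES:
        raise ValueError(f"mode {mode} is listed as incompatible with OCRmyPDF")
    return ["--tesseract-pagesegmode", str(mode)]


if __name__ == "__main__":
    # Mode 7: treat each page image as a single text line.
    print(pagesegmode_args(7))
```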

Changing the PDF renderer
=========================

rasterizing
    Converting a PDF to an image for display.

rendering
    Creating a new PDF from other data (such as an existing PDF).

OCRmyPDF has two PDF renderers: ``sandwich`` and ``hocr``. The
renderer may be selected using ``--pdf-renderer``. The default is
``auto``, which lets OCRmyPDF select the renderer to use. Currently,
``auto`` always selects ``hocr``.

The ``hocr`` renderer
---------------------

.. versionchanged:: 16.0.0

In both renderers, a text-only layer is rendered and sandwiched (overlaid)
onto either the original PDF page or a newly rasterized version of the
original PDF page (when ``--force-ocr`` is used). In this way, loss
of PDF information is generally avoided. (You may need to disable PDF/A
conversion and optimization to eliminate all lossy transformations.)

The current approach used by the new hOCR renderer is a re-implementation
of Tesseract's PDF renderer, using the same Glyphless font and general
ideas, but fixing many technical issues that impeded it. The new hOCR
renderer provides better text placement accuracy, avoids issues with word
segmentation, and provides better positioning of skewed text.

Using the experimental API, it is also possible to edit the OCR output
from Tesseract, using any tool that is capable of editing hOCR files.

Older versions of this renderer did not support non-Latin languages, but
it is now universal.

The ``sandwich`` renderer
-------------------------

The ``sandwich`` renderer uses Tesseract's text-only PDF feature,
which produces a PDF page that lays out the OCR in invisible text.

Some PDF viewers, such as Mozilla PDF.js and macOS Preview, currently
have problems with segmenting its text output, and
mightrunseveralwordstogether. It also does not implement right-to-left
text (Arabic, Hebrew, Persian). The output of this renderer cannot
be edited. The sandwich renderer is retained for testing.

When image preprocessing features like ``--deskew`` are used, the
original PDF will be rendered as a full page and the OCR layer will be
placed on top.

Rendering and rasterizing options
=================================

.. versionadded:: 14.3.0

The ``--continue-on-soft-render-error`` option allows OCRmyPDF to
proceed if a page cannot be rasterized/rendered. This is useful if you are
trying to get the best possible OCR from a PDF that is not well-formed,
and you are willing to accept some pages that may not visually match the
input, and that may not OCR well.

Color conversion strategy
=========================

.. versionadded:: 15.0.0

OCRmyPDF uses Ghostscript to convert PDF to PDF/A. In some cases, this
conversion requires color conversion. The default is the
``LeaveColorUnchanged`` strategy, which preserves the original
color space wherever possible (some rare color spaces might still be
converted).

Document scanners usually produce PDFs in the sRGB color space, which do
not need to be converted, so the default strategy is appropriate.

Suppose that you have a document that was prepared for professional
printing in a Separation or CMYK color space, and text was converted to
curves. In this case, you may want to use a different color conversion
strategy. The ``--color-conversion-strategy`` option allows you to select a
different strategy, such as ``RGB``.

Return code policy
==================

OCRmyPDF writes all messages to ``stderr``. ``stdout`` is reserved for
piping output files. ``stdin`` is reserved for piping input files.

The return codes generated by OCRmyPDF are considered part of the
stable user interface. They may be imported from
``ocrmypdf.exceptions``.

.. list-table:: Return codes
    :widths: 5 35 60
    :header-rows: 1

    *   - Code
        - Name
        - Interpretation
    *   - 0
        - ``ExitCode.ok``
        - Everything worked as expected.
    *   - 1
        - ``ExitCode.bad_args``
        - Invalid arguments, exited with an error.
    *   - 2
        - ``ExitCode.input_file``
        - The input file does not seem to be a valid PDF.
    *   - 3
        - ``ExitCode.missing_dependency``
        - An external program required by OCRmyPDF is missing.
    *   - 4
        - ``ExitCode.invalid_output_pdf``
        - An output file was created, but it does not seem to be a valid PDF. The file will be available.
    *   - 5
        - ``ExitCode.file_access_error``
        - The user running OCRmyPDF does not have sufficient permissions to read the input file and write the output file.
    *   - 6
        - ``ExitCode.already_done_ocr``
        - The file already appears to contain text so it may not need OCR. See output message.
    *   - 7
        - ``ExitCode.child_process_error``
        - An error occurred in an external program (child process) and OCRmyPDF cannot continue.
    *   - 8
        - ``ExitCode.encrypted_pdf``
        - The input PDF is encrypted. OCRmyPDF does not read encrypted PDFs. Use another program such as ``qpdf`` to remove encryption.
    *   - 9
        - ``ExitCode.invalid_config``
        - A custom configuration file was forwarded to Tesseract using ``--tesseract-config``, and Tesseract rejected this file.
    *   - 10
        - ``ExitCode.pdfa_conversion_failed``
        - A valid PDF was created, but PDF/A conversion failed. The file will be available.
    *   - 15
        - ``ExitCode.other_error``
        - Some other error occurred.
    *   - 130
        - ``ExitCode.ctrl_c``
        - The program was interrupted by pressing Ctrl+C.
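
A caller that runs ocrmypdf as a subprocess can map the numeric return code back to a name. The sketch below mirrors the table in a local ``IntEnum`` so that it runs without OCRmyPDF installed. The local enum and the ``describe`` helper are illustrations only. When OCRmyPDF is installed, importing from ``ocrmypdf.exceptions`` is preferable.

```python
from enum import IntEnum


class ExitCode(IntEnum):
    """Local copy of the return code table (illustration only)."""

    ok = 0
    bad_args = 1
    input_file = 2
    missing_dependency = 3
    invalid_output_pdf = 4
    file_access_error = 5
    already_done_ocr = 6
    child_process_error = 7
    encrypted_pdf = 8
    invalid_config = 9
    pdfa_conversion_failed = 10
    other_error = 15
    ctrl_c = 130


# Codes for which the table states that an output file is available.
OUTPUT_AVAILABLE = frozenset(
    {ExitCode.ok, ExitCode.invalid_output_pdf, ExitCode.pdfa_conversion_failed}
)


def describe(returncode: int) -> str:
    """Return a short description of a process return code."""
    try:
        code = ExitCode(returncode)
    except ValueError:
        return f"unknown return code {returncode}"
    suffix = " (output file available)" if code in OUTPUT_AVAILABLE else ""
    return f"{code.name}{suffix}"


if __name__ == "__main__":
    for value in (0, 6, 10, 42):
        print(value, describe(value))
```

In practice the value passed to ``describe`` would come from something like ``subprocess.run([...]).returncode``.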


.. _tmpdir:

Changing temporary storage location
===================================

OCRmyPDF generates many temporary files during processing.

To change where temporary files are stored, change the ``TMPDIR``
environment variable for ocrmypdf's environment. (Python's
``tempfile.gettempdir()`` returns the root directory in which temporary
files will be stored.) For example, one could redirect ``TMPDIR`` to a
large RAM disk to avoid wear on HDD/SSD and potentially improve
performance.

On Windows, the ``TEMP`` environment variable is used instead.
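
To check which directory Python would select for a given ``TMPDIR`` value, the lookup can be reproduced with the standard library. This is a minimal sketch. The helper ``resolve_temp_root`` is hypothetical, and it temporarily clears ``tempfile.tempdir`` because Python caches the first result of ``tempfile.gettempdir()``.

```python
import os
import tempfile


def resolve_temp_root(tmpdir: str) -> str:
    """Return the directory tempfile would use if TMPDIR were set to `tmpdir`."""
    saved_env = os.environ.get("TMPDIR")
    saved_cache = tempfile.tempdir
    try:
        os.environ["TMPDIR"] = tmpdir
        tempfile.tempdir = None  # discard the cached value so TMPDIR is re-read
        return tempfile.gettempdir()
    finally:
        # Restore the previous environment and cache so callers are unaffected.
        if saved_env is None:
            os.environ.pop("TMPDIR", None)
        else:
            os.environ["TMPDIR"] = saved_env
        tempfile.tempdir = saved_cache


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as candidate:
        print(resolve_temp_root(candidate))
```

If the directory named by ``TMPDIR`` does not exist or is not writable, Python falls back to other candidates, so the returned path may differ from the requested one.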

Debugging the intermediate files
================================

OCRmyPDF normally saves its intermediate results to a temporary folder
and deletes this folder when it exits, whether it succeeded or failed.

If the ``--keep-temporary-files`` (``-k``) argument is issued on the
command line, OCRmyPDF will keep the temporary folder and print the location,
whether it succeeded or failed. An example message is:

.. code-block:: none

    Temporary working files retained at:
    /tmp/ocrmypdf.io.u20wpz07

When OCRmyPDF is launched as a snap, this path is inside the snap filesystem, for instance:

    /tmp/snap-private-tmp/snap.ocrmypdf/tmp/ocrmypdf.io.u20wpz07

The organization of this folder is an implementation detail and subject
to change between releases. However, the general organization is that
working files on a per page basis have the page number as a prefix
(starting with page 1), an infix indicates the processing stage, and a
suffix indicates the file type. Some important files include:

- ``_rasterize.png`` - what the input page looks like
- ``_ocr.png`` - the file that is sent to Tesseract for OCR; depending
  on arguments this may differ from the presentation image
- ``_pp_deskew.png`` - the image, after deskewing
- ``_pp_clean.png`` - the image, after cleaning with unpaper
- ``_ocr_hocr.pdf`` - the OCR file; appears as a blank page with invisible
  text embedded
- ``_ocr_hocr.txt`` - the OCR text (not necessarily all text on the page,
  if the page is mixed format)
- ``fix_docinfo.pdf`` - a temporary file created to fix the PDF DocumentInfo
  data structure
- ``graft_layers.pdf`` - the rendered PDF with OCR layers grafted on
- ``pdfa.pdf`` - ``graft_layers.pdf`` after conversion to PDF/A
- ``pdfa.ps`` - a PostScript file used by Ghostscript for PDF/A conversion
- ``optimize.pdf`` - the PDF generated before optimization
- ``optimize.out.pdf`` - the PDF generated by optimization
- ``origin`` - the input file
- ``origin.pdf`` - the input file or the input image converted to PDF
- ``images/*`` - images extracted during the optimization process; here
  the prefix indicates a PDF object ID, not a page number
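
When inspecting a retained folder, the per-page naming scheme can be split into its three parts. The sketch below assumes names of the form ``<digits>_<stage>.<suffix>``, for example ``000001_pp_deskew.png``. The zero-padded prefix and the helper ``parse_page_file`` are assumptions for illustration, and the actual layout is an implementation detail that may change.

```python
import re
from typing import NamedTuple, Optional

# Assumed pattern: page number prefix, stage infix, file type suffix.
_PAGE_FILE = re.compile(r"^(?P<page>\d+)_(?P<stage>.+)\.(?P<suffix>[^.]+)$")


class PageFile(NamedTuple):
    page: int    # page number, starting at 1
    stage: str   # processing stage, e.g. "rasterize" or "pp_clean"
    suffix: str  # file type, e.g. "png"


def parse_page_file(name: str) -> Optional[PageFile]:
    """Split a per-page working file name, or return None for whole-file names."""
    match = _PAGE_FILE.match(name)
    if match is None:
        return None
    return PageFile(int(match["page"]), match["stage"], match["suffix"])


if __name__ == "__main__":
    for name in ("000001_rasterize.png", "000012_ocr_hocr.txt", "origin.pdf"):
        print(name, parse_page_file(name))
```

Whole-file names such as ``origin.pdf`` or ``graft_layers.pdf`` have no numeric prefix, so the helper returns ``None`` for them.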
|
||||
@@ -1,10 +1,7 @@
|
||||
.. SPDX-FileCopyrightText: 2022 James R. Barlow
|
||||
..
|
||||
.. SPDX-License-Identifier: CC-BY-SA-4.0
|
||||
% SPDX-FileCopyrightText: 2022 James R. Barlow
|
||||
% SPDX-License-Identifier: CC-BY-SA-4.0
|
||||
|
||||
======================
|
||||
Using the OCRmyPDF API
|
||||
======================
|
||||
# Using the OCRmyPDF API
|
||||
|
||||
OCRmyPDF originated as a command line program and continues to have this
|
||||
legacy, but parts of it can be imported and used in other Python
|
||||
@@ -13,100 +10,95 @@ applications.
|
||||
Some applications may want to consider running ocrmypdf from a
|
||||
subprocess call anyway, as this provides isolation of its activities.
|
||||
|
||||
Example
|
||||
=======
|
||||
## Example
|
||||
|
||||
OCRmyPDF provides one high-level function to run its main engine from an
|
||||
application. The parameters are symmetric to the command line arguments
|
||||
and largely have the same functions.
|
||||
|
||||
.. code-block:: python
|
||||
```python
|
||||
import ocrmypdf
|
||||
|
||||
import ocrmypdf
|
||||
|
||||
if __name__ == '__main__': # To ensure correct behavior on Windows and macOS
|
||||
ocrmypdf.ocr('input.pdf', 'output.pdf', deskew=True)
|
||||
if __name__ == '__main__': # To ensure correct behavior on Windows and macOS
|
||||
ocrmypdf.ocr('input.pdf', 'output.pdf', deskew=True)
|
||||
```
|
||||
|
||||
With some exceptions, all of the command line arguments are available
|
||||
and may be passed as equivalent keywords.
|
||||
|
||||
A few differences are that ``verbose`` and ``quiet`` are not available.
|
||||
A few differences are that `verbose` and `quiet` are not available.
|
||||
Instead, output should be managed by configuring logging.
|
||||
|
||||
Parent process requirements
|
||||
---------------------------
|
||||
### Parent process requirements
|
||||
|
||||
The :func:`ocrmypdf.ocr` function runs OCRmyPDF similar to command line
|
||||
The {func}`ocrmypdf.ocr` function runs OCRmyPDF similar to command line
|
||||
execution. To do this, it will:
|
||||
|
||||
- create worker processes or threads
|
||||
- manage the signal flags of its worker processes
|
||||
- execute other subprocesses (forking and executing other programs)
|
||||
|
||||
The Python process that calls :func:`ocrmypdf.ocr()` must be sufficiently
|
||||
The Python process that calls {func}`ocrmypdf.ocr()` must be sufficiently
|
||||
privileged to perform these actions.
|
||||
|
||||
There currently is no option to manage how jobs are scheduled other
|
||||
than the argument ``jobs=`` which will limit the number of worker
|
||||
than the argument `jobs=` which will limit the number of worker
|
||||
processes.
|
||||
|
||||
Creating a child process to call :func:`ocrmypdf.ocr()` is suggested. That
|
||||
Creating a child process to call {func}`ocrmypdf.ocr()` is suggested. That
|
||||
way your application will survive and remain interactive even if
|
||||
OCRmyPDF fails for any reason. For example:
|
||||
|
||||
.. code-block:: python
|
||||
```python
|
||||
from multiprocessing import Process
|
||||
|
||||
from multiprocessing import Process
|
||||
def ocrmypdf_process():
|
||||
ocrmypdf.ocr('input.pdf', 'output.pdf')
|
||||
|
||||
def ocrmypdf_process():
|
||||
ocrmypdf.ocr('input.pdf', 'output.pdf')
|
||||
def call_ocrmypdf_from_my_app():
|
||||
p = Process(target=ocrmypdf_process)
|
||||
p.start()
|
||||
p.join()
|
||||
```
|
||||
|
||||
def call_ocrmypdf_from_my_app():
|
||||
p = Process(target=ocrmypdf_process)
|
||||
p.start()
|
||||
p.join()
|
||||
|
||||
Programs that call :func:`ocrmypdf.ocr()` should also install a SIGBUS signal
|
||||
Programs that call {func}`ocrmypdf.ocr()` should also install a SIGBUS signal
|
||||
handler (except on Windows), to raise an exception if access to a memory
|
||||
mapped file fails. OCRmyPDF may use memory mapping.
|
||||
|
||||
:func:`ocrmypdf.ocr()` will take a threading lock to prevent multiple runs of itself
|
||||
{func}`ocrmypdf.ocr()` will take a threading lock to prevent multiple runs of itself
|
||||
in the same Python interpreter process. This is not thread-safe, because of how
|
||||
OCRmyPDF's plugins and Python's library import system work. If you need to parallelize
|
||||
OCRmyPDF, use processes.
|
||||
|
||||
.. warning::
|
||||
:::{warning}
|
||||
On Windows and macOS, the script that calls {func}`ocrmypdf.ocr()` must be
|
||||
protected by an "ifmain" guard (`if __name__ == '__main__'`). If you do
|
||||
not take at least one of these steps, process semantics will prevent
|
||||
OCRmyPDF from working correctly.
|
||||
:::
|
||||
|
||||
On Windows and macOS, the script that calls :func:`ocrmypdf.ocr()` must be
|
||||
protected by an "ifmain" guard (``if __name__ == '__main__'``). If you do
|
||||
not take at least one of these steps, process semantics will prevent
|
||||
OCRmyPDF from working correctly.
|
||||
### Logging
|
||||
|
||||
Logging
|
||||
-------
|
||||
|
||||
OCRmyPDF will log under loggers named ``ocrmypdf``. In addition, it
|
||||
imports ``pdfminer`` and ``PIL``, both of which post log messages under
|
||||
OCRmyPDF will log under loggers named `ocrmypdf`. In addition, it
|
||||
imports `pdfminer` and `PIL`, both of which post log messages under
|
||||
those logging namespaces.
|
||||
|
||||
You can configure the logging as desired for your application or call
|
||||
:func:`ocrmypdf.configure_logging` to configure logging the same way
|
||||
OCRmyPDF itself does. The command line parameters such as ``--quiet``
|
||||
and ``--verbose`` have no equivalents in the API; you must use the
|
||||
{func}`ocrmypdf.configure_logging` to configure logging the same way
|
||||
OCRmyPDF itself does. The command line parameters such as `--quiet`
|
||||
and `--verbose` have no equivalents in the API; you must use the
|
||||
provided configuration function or do configuration in a way that suits
|
||||
your use case.
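
For applications that prefer to configure logging themselves, the standard `logging` module is sufficient. The sketch below assumes the three logger names mentioned above. The function `configure_app_logging`, the chosen levels, and the message format are illustrative, not OCRmyPDF defaults.

```python
import logging
import sys


def configure_app_logging(
    level: int = logging.INFO,
    third_party_level: int = logging.WARNING,
) -> logging.Handler:
    """Send ocrmypdf log records to stderr and quiet the imported libraries."""
    handler = logging.StreamHandler(sys.stderr)
    handler.setFormatter(logging.Formatter("%(name)s %(levelname)s: %(message)s"))

    ocr_log = logging.getLogger("ocrmypdf")
    ocr_log.setLevel(level)
    ocr_log.addHandler(handler)

    # pdfminer and PIL log under their own namespaces.
    for name in ("pdfminer", "PIL"):
        logging.getLogger(name).setLevel(third_party_level)
    return handler


if __name__ == "__main__":
    configure_app_logging(logging.DEBUG)
    logging.getLogger("ocrmypdf").debug("logging configured")
```

Writing to `sys.stderr` keeps standard output free, which matches OCRmyPDF's own policy.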
|
||||
|
||||
Progress monitoring
|
||||
-------------------
|
||||
### Progress monitoring
|
||||
|
||||
OCRmyPDF uses the ``rich`` package to implement its progress bars.
|
||||
:func:`ocrmypdf.configure_logging` will set up logging output to
|
||||
``sys.stderr`` in a way that is compatible with the display of the
|
||||
progress bar. Use ``ocrmypdf.ocr(...progress_bar=False)`` to disable
|
||||
OCRmyPDF uses the `rich` package to implement its progress bars.
|
||||
{func}`ocrmypdf.configure_logging` will set up logging output to
|
||||
`sys.stderr` in a way that is compatible with the display of the
|
||||
progress bar. Use `ocrmypdf.ocr(...progress_bar=False)` to disable
|
||||
the progress bar.
|
||||
|
||||
Standard output
|
||||
---------------
|
||||
### Standard output
|
||||
|
||||
OCRmyPDF is strict about not writing to standard output so that
|
||||
users can safely use it in a pipeline and produce a valid output
|
||||
@@ -116,12 +108,11 @@ behavior and support piping to a file. Another benefit of running
|
||||
OCRmyPDF in a child process, as recommended above, is that it will
|
||||
not interfere with the parent process's standard output.
|
||||
|
||||
Exceptions
|
||||
----------
|
||||
### Exceptions
|
||||
|
||||
OCRmyPDF may throw standard Python exceptions, ``ocrmypdf.exceptions.*``
|
||||
OCRmyPDF may throw standard Python exceptions, `ocrmypdf.exceptions.*`
|
||||
exceptions, some exceptions related to multiprocessing, and
|
||||
:exc:`KeyboardInterrupt`. The parent process should provide an exception
|
||||
{exc}`KeyboardInterrupt`. The parent process should provide an exception
|
||||
handler. OCRmyPDF will clean up its temporary files and worker processes
|
||||
automatically when an exception occurs.
|
||||
|
||||
@@ -1,56 +1,60 @@
|
||||
.. SPDX-FileCopyrightText: 2022 James R. Barlow
|
||||
..
|
||||
.. SPDX-License-Identifier: CC-BY-SA-4.0
|
||||
% SPDX-FileCopyrightText: 2022 James R. Barlow
|
||||
% SPDX-License-Identifier: CC-BY-SA-4.0
|
||||
|
||||
=============
|
||||
API reference
|
||||
=============
|
||||
# API reference
|
||||
|
||||
This page summarizes the rest of the public API. Generally speaking this
|
||||
should be mainly of interest to plugin developers.
|
||||
|
||||
ocrmypdf.api
|
||||
============
|
||||
## ocrmypdf.api
|
||||
|
||||
```{eval-rst}
|
||||
.. automodule:: ocrmypdf.api
|
||||
:members:
|
||||
```
|
||||
|
||||
ocrmypdf.exceptions
|
||||
===================
|
||||
## ocrmypdf.exceptions
|
||||
|
||||
```{eval-rst}
|
||||
.. automodule:: ocrmypdf.exceptions
|
||||
:members:
|
||||
:undoc-members:
|
||||
```
|
||||
|
||||
ocrmypdf.helpers
|
||||
================
|
||||
## ocrmypdf.helpers
|
||||
|
||||
```{eval-rst}
|
||||
.. automodule:: ocrmypdf.helpers
|
||||
:members:
|
||||
:noindex: deprecated
|
||||
|
||||
.. autodecorator:: deprecated
|
||||
```
|
||||
|
||||
ocrmypdf.hocrtransform
|
||||
======================
|
||||
## ocrmypdf.hocrtransform
|
||||
|
||||
```{eval-rst}
|
||||
.. automodule:: ocrmypdf.hocrtransform
|
||||
:members:
|
||||
```
|
||||
|
||||
ocrmypdf.pdfa
|
||||
=============
|
||||
## ocrmypdf.pdfa
|
||||
|
||||
```{eval-rst}
|
||||
.. automodule:: ocrmypdf.pdfa
|
||||
:members:
|
||||
```
|
||||
|
||||
ocrmypdf.quality
|
||||
================
|
||||
## ocrmypdf.quality
|
||||
|
||||
```{eval-rst}
|
||||
.. automodule:: ocrmypdf.quality
|
||||
:members:
|
||||
```
|
||||
|
||||
ocrmypdf.subprocess
|
||||
===================
|
||||
## ocrmypdf.subprocess
|
||||
|
||||
```{eval-rst}
|
||||
.. automodule:: ocrmypdf.subprocess
|
||||
:members:
|
||||
```
|
||||
@@ -45,7 +45,7 @@ extensions = [
|
||||
'sphinx_issues',
|
||||
]
|
||||
|
||||
myst_enable_extensions = ['colon_fence', 'attrs_block', 'attrs_inline']
|
||||
myst_enable_extensions = ['colon_fence', 'attrs_block', 'attrs_inline', 'substitution']
|
||||
|
||||
# Extension settings
|
||||
intersphinx_mapping = {'python': ('https://docs.python.org/3', None)}
|
||||
|
||||
106
docs/cookbook.md
@@ -1,45 +1,43 @@
|
||||
% SPDX-FileCopyrightText: 2025 James R. Barlow
|
||||
% SPDX-License-Identifier: CC-BY-SA-4.0
|
||||
|
||||
Cookbook
|
||||
========
|
||||
# Cookbook
|
||||
|
||||
Basic examples
|
||||
--------------
|
||||
## Basic examples
|
||||
|
||||
### Help!
|
||||
|
||||
ocrmypdf has built-in help.
|
||||
|
||||
:::{code} bash
|
||||
```bash
|
||||
ocrmypdf --help
|
||||
:::
|
||||
```
|
||||
|
||||
### Add an OCR layer and convert to PDF/A
|
||||
|
||||
:::{code} bash
|
||||
```bash
|
||||
ocrmypdf input.pdf output.pdf
|
||||
:::
|
||||
```
|
||||
|
||||
### Add an OCR layer and output a standard PDF
|
||||
|
||||
:::{code} bash
|
||||
```bash
|
||||
ocrmypdf --output-type pdf input.pdf output.pdf
|
||||
:::
|
||||
```
|
||||
|
||||
### Create a PDF/A with all color and grayscale images converted to JPEG
|
||||
|
||||
:::{code} bash
|
||||
```bash
|
||||
ocrmypdf --output-type pdfa --pdfa-image-compression jpeg input.pdf output.pdf
|
||||
:::
|
||||
```
|
||||
|
||||
### Modify a file in place
|
||||
|
||||
The file will only be overwritten if OCRmyPDF is successful.
|
||||
|
||||
:::{code} bash
|
||||
```bash
|
||||
ocrmypdf myfile.pdf myfile.pdf
|
||||
:::
|
||||
```
|
||||
|
||||
### Correct page rotation
|
||||
|
||||
@@ -47,9 +45,9 @@ OCR will attempt to automatic correct the rotation of each page. This
|
||||
can help fix a scanning job that contains a mix of landscape and
|
||||
portrait pages.
|
||||
|
||||
:::{code} bash
|
||||
```bash
|
||||
ocrmypdf --rotate-pages myfile.pdf myfile.pdf
|
||||
:::
|
||||
```
|
||||
|
||||
You can increase (decrease) the parameter `--rotate-pages-threshold` to
|
||||
make page rotation more (less) aggressive. The threshold number is the
|
||||
@@ -70,10 +68,10 @@ angle is wrong.
|
||||
OCRmyPDF assumes the document is in English unless told otherwise. OCR
|
||||
quality may be poor if the wrong language is used.
|
||||
|
||||
:::{code} bash
|
||||
```bash
|
||||
ocrmypdf -l fra LeParisien.pdf LeParisien.pdf
|
||||
ocrmypdf -l eng+fra Bilingual-English-French.pdf Bilingual-English-French.pdf
|
||||
:::
|
||||
```
|
||||
|
||||
Language packs must be installed for all languages specified. See
|
||||
`Installing additional language packs <lang-packs>`{.interpreted-text
|
||||
@@ -87,9 +85,9 @@ language when it is unknown.
|
||||
This produces a file named \"output.pdf\" and a companion text file
|
||||
named \"output.txt\".
|
||||
|
||||
:::{code} bash
|
||||
```bash
|
||||
ocrmypdf --sidecar output.txt input.pdf output.pdf
|
||||
:::
|
||||
```
|
||||
|
||||
:::{note}
|
||||
The sidecar file contains the **OCR text** found by OCRmyPDF. If the
|
||||
@@ -114,14 +112,14 @@ use a program like Poppler\'s `pdftotext` or `pdfgrep`.
|
||||
If you are starting with images, you can just use Tesseract directly to
|
||||
convert images to PDFs:
|
||||
|
||||
:::{code} bash
|
||||
```bash
|
||||
tesseract my-image.jpg output-prefix pdf
|
||||
:::
|
||||
```
|
||||
|
||||
:::{code} bash
|
||||
```bash
|
||||
# When there are multiple images
|
||||
tesseract text-file-containing-list-of-image-filenames.txt output-prefix pdf
|
||||
:::
|
||||
```
|
||||
|
||||
Tesseract\'s PDF output is quite good -- OCRmyPDF uses it internally, in
|
||||
some cases. However, OCRmyPDF has many features not available in
|
||||
@@ -134,9 +132,9 @@ You can also use a program like
|
||||
images to PDFs, and then pipe the results to run ocrmypdf. The `-` tells
|
||||
ocrmypdf to read standard input.
|
||||
|
||||
:::{code} bash
|
||||
```bash
|
||||
img2pdf my-images*.jpg | ocrmypdf - myfile.pdf
|
||||
:::
|
||||
```
|
||||
|
||||
`img2pdf` is recommended because it does an excellent job at generating
|
||||
PDFs without transcoding images.
|
||||
@@ -148,9 +146,9 @@ own. If the resolution (dots per inch, DPI) of an image is not set or is
|
||||
incorrect, it can be overridden with `--image-dpi`. (As 1 inch is 2.54
|
||||
cm, 1 dpi = 0.39 dpcm).
|
||||
|
||||
:::{code} bash
|
||||
```bash
|
||||
ocrmypdf --image-dpi 300 image.png myfile.pdf
|
||||
:::
|
||||
```
|
||||
|
||||
If you have multiple images, you must use `img2pdf` to convert the
|
||||
images to PDF.
|
||||
@@ -161,8 +159,9 @@ We caution against using ImageMagick or Ghostscript to convert images to
|
||||
PDF, since they may transcode images or produce downsampled images,
|
||||
sometimes without warning.
|
||||
|
||||
Image processing
|
||||
----------------
|
||||
(image-processing)=
|
||||
|
||||
## Image processing
|
||||
|
||||
OCRmyPDF performs some image processing on each page of a PDF, if
|
||||
desired. The same processing is applied to each page. It is suggested
|
||||
@@ -200,18 +199,18 @@ should be visually reviewed after using these options.
|
||||
|
||||
Deskew:
|
||||
|
||||
:::{code} bash
|
||||
```bash
|
||||
ocrmypdf --deskew input.pdf output.pdf
|
||||
:::
|
||||
```
|
||||
|
||||
Image processing commands can be combined. The order in which options
|
||||
are given does not matter. OCRmyPDF always applies the steps of the
|
||||
image processing pipeline in the same order (rotate, remove background,
|
||||
deskew, clean).
|
||||
|
||||
:::{code} bash
|
||||
```bash
|
||||
ocrmypdf --deskew --clean --rotate-pages input.pdf output.pdf
|
||||
:::
|
||||
```
|
||||
|
||||
Don\'t actually OCR my PDF
|
||||
--------------------------
|
||||
@@ -221,12 +220,11 @@ processing without performing OCR (by causing OCR to time out). This
|
||||
works if all you want is to apply image processing or PDF/A
|
||||
conversion.
|
||||
|
||||
:::{code} bash
|
||||
```bash
|
||||
ocrmypdf --tesseract-timeout=0 --remove-background input.pdf output.pdf
|
||||
:::
|
||||
```
|
||||
|
||||
::: {.versionchanged}
|
||||
v14.1.0
|
||||
:::{versionchanged} v14.1.0
|
||||
|
||||
Prior to this version, `--tesseract-timeout 0` would prevent other uses
|
||||
of Tesseract, such as deskewing, from working. This is no longer the
|
||||
@@ -239,9 +237,9 @@ non-OCR operations, if needed.
|
||||
This is getting ridiculous, but OCRmyPDF can completely strip all textual
|
||||
information from a PDF and reconstruct it as a \"bag of images\" PDF.
|
||||
|
||||
:::{code} bash
|
||||
```bash
|
||||
ocrmypdf --tesseract-timeout 0 --force-ocr input.pdf output.pdf
|
||||
:::
|
||||
```
|
||||
|
||||
Why would you want to do this? Perhaps you have a PDF where OCR fails to
|
||||
produce useful results, and just want to get rid of all OCR information.
|
||||
@@ -251,18 +249,18 @@ This command also removes OCR generated by third party tools.
|
||||
|
||||
You can also optimize all images without performing any OCR:
|
||||
|
||||
:::{code} bash
|
||||
```bash
|
||||
ocrmypdf --tesseract-timeout=0 --optimize 3 --skip-text input.pdf output.pdf
|
||||
:::
|
||||
```
|
||||
|
||||
### Process only certain pages
|
||||
|
||||
You can ask OCRmyPDF to only apply [image processing](#image-processing)
|
||||
and OCR to certain pages.
|
||||
|
||||
:::{code} bash
|
||||
```bash
|
||||
ocrmypdf --pages 2,3,13-17 input.pdf output.pdf
|
||||
:::
|
||||
```
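
The page selection used in the command above (`2,3,13-17`) combines single pages and ranges. The following sketch shows one way such a specification could be expanded into page numbers. It is not OCRmyPDF's own parser, and the function `parse_pages` is hypothetical.

```python
def parse_pages(spec: str) -> set[int]:
    """Expand a specification such as "2,3,13-17" into a set of page numbers."""
    pages: set[int] = set()
    # Accept spaces as separators as well as commas.
    for part in spec.replace(" ", ",").split(","):
        if not part:
            continue
        if "-" in part:
            start, end = part.split("-", 1)
            first, last = int(start), int(end)
            if first > last:
                raise ValueError(f"descending range: {part}")
            pages.update(range(first, last + 1))
        else:
            pages.add(int(part))
    return pages


if __name__ == "__main__":
    print(sorted(parse_pages("2,3,13-17")))
```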
|
||||
|
||||
Hyphens denote a range of pages and commas separate page numbers. If you
|
||||
prefer to use spaces, quote all of the page numbers:
|
||||
@@ -281,9 +279,9 @@ those options. Both of these steps are \"whole file\" operations. In
|
||||
this example, we want to OCR only the title and otherwise change the PDF
|
||||
as little as possible:
|
||||
|
||||
:::{code} bash
|
||||
```bash
|
||||
ocrmypdf --pages 1 --output-type pdf --optimize 0 input.pdf output.pdf
|
||||
:::
|
||||
```
|
||||
|
||||
Redo existing OCR
|
||||
-----------------
|
||||
@@ -297,9 +295,9 @@ This may be helpful for users who want to take advantage of accuracy
|
||||
improvements in Tesseract for files they previously OCRed with an
|
||||
earlier version of Tesseract and OCRmyPDF.
|
||||
|
||||
:::{code} bash
|
||||
```bash
|
||||
ocrmypdf --redo-ocr input.pdf output.pdf
|
||||
:::
|
||||
```
|
||||
|
||||
This method will replace OCR without rasterizing, reducing quality or
|
||||
removing vector content. If a file contains a mix of pure digital text
|
||||
@@ -351,18 +349,18 @@ header-rows: 1
|
||||
|
||||
* - Level
|
||||
- Comments
|
||||
* - ``--optimize=0``
|
||||
* - <nobr>``--optimize=0``</nobr>
|
||||
- Disables optimization.
|
||||
* - ``--optimize 1``
|
||||
* - <nobr>``--optimize 1``</nobr>
|
||||
- Enables lossless optimizations, such as transcoding images to more
|
||||
efficient formats. Also compress other uncompressed objects in the
|
||||
PDF and enables the more efficient "object streams" within the PDF.
|
||||
(If ``--jbig2-lossy`` is issued, then lossy JBIG2 optimization is used.
|
||||
The decision to use lossy JBIG2 is separate from standard optimization
|
||||
settings.)
|
||||
* - ``--optimize 2``
|
||||
* - <nobr>``--optimize 2``</nobr>
|
||||
- All of the above, and enables lossy optimizations and color quantization.
|
||||
* - ``--optimize 3``
|
||||
* - <nobr>``--optimize 3``</nobr>
|
||||
- All of the above, and enables more aggressive optimizations and targets lower image quality.
|
||||
:::
|
||||
|
||||
@@ -376,9 +374,9 @@ inefficient compression modes to more modern versions. A program like
|
||||
`qpdf` can be used to change encodings, e.g. to inspect the internals
|
||||
for a PDF.
|
||||
|
||||
:::{code} bash
|
||||
```bash
|
||||
ocrmypdf --optimize 3 in.pdf out.pdf # Make it small
|
||||
:::
|
||||
```
|
||||
|
||||
Some users may consider enabling lossy JBIG2. See:
|
||||
`jbig2-lossy`{.interpreted-text role="ref"}.
|
||||
|
||||
57
docs/index.md
Normal file
@@ -0,0 +1,57 @@
|
||||
% SPDX-FileCopyrightText: 2022 James R. Barlow
|
||||
% SPDX-License-Identifier: CC-BY-SA-4.0
|
||||
|
||||
# OCRmyPDF documentation
|
||||
|
||||
:::{figure} images/logo.svg
|
||||
:::
|
||||
|
||||
OCRmyPDF adds an optical character recognition (OCR) text layer to scanned PDF
|
||||
files, allowing them to be searched.
|
||||
|
||||
PDF is the best format for storing and exchanging scanned documents.
|
||||
Unfortunately, PDFs can be difficult to modify. OCRmyPDF makes it easy to apply
|
||||
image processing and OCR (recognized, searchable text) to existing PDFs.
|
||||
|
||||
```{toctree}
|
||||
:maxdepth: 1
|
||||
|
||||
introduction
|
||||
release_notes
|
||||
installation
|
||||
languages
|
||||
jbig2
|
||||
```
|
||||
|
||||
```{toctree}
|
||||
:caption: Usage
|
||||
:maxdepth: 2
|
||||
|
||||
cookbook
|
||||
optimizer
|
||||
docker
|
||||
advanced
|
||||
batch
|
||||
cloud
|
||||
performance
|
||||
pdfsecurity
|
||||
errors
|
||||
```
|
||||
|
||||
```{toctree}
|
||||
:caption: Developers
|
||||
:maxdepth: 2
|
||||
|
||||
api
|
||||
plugins
|
||||
apiref
|
||||
design_notes
|
||||
contributing
|
||||
maintainers
|
||||
```
|
||||
|
||||
# Indices and tables
|
||||
|
||||
- {ref}`genindex`
|
||||
- {ref}`modindex`
|
||||
- {ref}`search`
|
||||
@@ -1,56 +0,0 @@
|
||||
.. SPDX-FileCopyrightText: 2022 James R. Barlow
|
||||
..
|
||||
.. SPDX-License-Identifier: CC-BY-SA-4.0
|
||||
|
||||
OCRmyPDF documentation
|
||||
======================
|
||||
|
||||
.. figure:: images/logo.svg
|
||||
|
||||
OCRmyPDF adds an optical character recognition (OCR) text layer to scanned PDF
|
||||
files, allowing them to be searched.
|
||||
|
||||
PDF is the best format for storing and exchanging scanned documents.
|
||||
Unfortunately, PDFs can be difficult to modify. OCRmyPDF makes it easy to apply
|
||||
image processing and OCR (recognized, searchable text) to existing PDFs.
|
||||
|
||||
.. toctree::
|
||||
:maxdepth: 1
|
||||
|
||||
introduction
|
||||
release_notes
|
||||
installation
|
||||
languages
|
||||
jbig2
|
||||
|
||||
.. toctree::
|
||||
:caption: Usage
|
||||
:maxdepth: 2
|
||||
|
||||
cookbook
|
||||
optimizer
|
||||
docker
|
||||
advanced
|
||||
batch
|
||||
cloud
|
||||
performance
|
||||
pdfsecurity
|
||||
errors
|
||||
|
||||
.. toctree::
|
||||
:caption: Developers
|
||||
:maxdepth: 2
|
||||
|
||||
api
|
||||
plugins
|
||||
apiref
|
||||
design_notes
|
||||
contributing
|
||||
maintainers
|
||||
|
||||
Indices and tables
|
||||
==================
|
||||
|
||||
* :ref:`genindex`
|
||||
* :ref:`modindex`
|
||||
* :ref:`search`
|
||||
730
docs/installation.md
Normal file
@@ -0,0 +1,730 @@
|
||||
---
|
||||
myst:
|
||||
substitutions:
|
||||
deb_11: |-
|
||||
:::{image} https://repology.org/badge/version-for-repo/debian_11/ocrmypdf.svg
|
||||
:alt: Debian 11
|
||||
:::
|
||||
deb_12: |-
|
||||
:::{image} https://repology.org/badge/version-for-repo/debian_12/ocrmypdf.svg
|
||||
:alt: Debian 12
|
||||
:::
|
||||
deb_unstable: |-
|
||||
:::{image} https://repology.org/badge/version-for-repo/debian_unstable/ocrmypdf.svg
|
||||
:alt: Debian unstable
|
||||
:::
|
||||
fedora_38: |-
|
||||
:::{image} https://repology.org/badge/version-for-repo/fedora_38/ocrmypdf.svg
|
||||
:alt: Fedora 38
|
||||
:::
|
||||
fedora_39: |-
|
||||
:::{image} https://repology.org/badge/version-for-repo/fedora_39/ocrmypdf.svg
|
||||
:alt: Fedora 39
|
||||
:::
|
||||
fedora_rawhide: |-
|
||||
:::{image} https://repology.org/badge/version-for-repo/fedora_rawhide/ocrmypdf.svg
|
||||
:alt: Fedora Rawhide
|
||||
:::
|
||||
latest: |-
|
||||
:::{image} https://img.shields.io/pypi/v/ocrmypdf.svg
|
||||
:alt: OCRmyPDF latest released version on PyPI
|
||||
:::
|
||||
ubu_2004: |-
|
||||
:::{image} https://repology.org/badge/version-for-repo/ubuntu_20_04/ocrmypdf.svg
|
||||
:alt: Ubuntu 20.04 LTS
|
||||
:::
|
||||
ubu_2204: |-
|
||||
:::{image} https://repology.org/badge/version-for-repo/ubuntu_22_04/ocrmypdf.svg
|
||||
:alt: Ubuntu 22.04 LTS
|
||||
:::
|
||||
---
|
||||
|
||||
% SPDX-FileCopyrightText: 2022 James R. Barlow
|
||||
% SPDX-License-Identifier: CC-BY-SA-4.0
|
||||
|
||||
# Installing OCRmyPDF
|
||||
|
||||
(latest)=
|
||||
|
||||
The easiest way to install OCRmyPDF is to follow the steps for your operating
|
||||
system/platform. This version may be out of date, however.
|
||||
|
||||
These platforms have one-liner installs:
|
||||
|
||||
:::{list-table}
|
||||
:header-rows: 0
|
||||
|
||||
* - Debian, Ubuntu
|
||||
- ``apt install ocrmypdf``
|
||||
* - Windows Subsystem for Linux
|
||||
- ``apt install ocrmypdf``
|
||||
* - Fedora
|
||||
- ``dnf install ocrmypdf tesseract-osd``
|
||||
* - macOS (Homebrew)
|
||||
- ``brew install ocrmypdf``
|
||||
* - macOS (MacPorts)
|
||||
- ``port install ocrmypdf``
|
||||
* - LinuxBrew
|
||||
- ``brew install ocrmypdf``
|
||||
* - FreeBSD
|
||||
- ``pkg install textproc/py-ocrmypdf``
|
||||
* - Snap (snapcraft packaging)
|
||||
- ``snap install ocrmypdf``
|
||||
:::
|
||||
|
||||
More detailed procedures are outlined below. If you want to do a manual
|
||||
install, or install a more recent version than your platform provides, read on.
|
||||
|
||||
:::{contents} Platform-specific steps
|
||||
:depth: 2
|
||||
:local: true
|
||||
:::
|
||||
|
||||
## Installing on Linux
|
||||
|
||||
### Debian and Ubuntu 20.04 or newer
|
||||
|
||||
:::{list-table}
|
||||
:header-rows: 1
|
||||
|
||||
* - OCRmyPDF versions in Debian & Ubuntu
|
||||
* - {{ latest }}
|
||||
* - {{ deb_11 }} {{ deb_12 }} {{ deb_unstable }}
|
||||
* - {{ ubu_2004 }} {{ ubu_2204 }}
|
||||
:::
|
||||
|
||||
Users of Debian or Ubuntu may simply
|
||||
|
||||
```bash
|
||||
apt install ocrmypdf
|
||||
```
|
||||
|
||||
As indicated in the table above, Debian and Ubuntu releases may lag
|
||||
behind the latest version. If the version available for your platform is
|
||||
out of date, you could opt to install the latest version from source.
|
||||
See [Installing HEAD revision from
|
||||
sources](#installing-head-revision-from-sources).
|
||||
|
||||
For full details on version availability for your platform, check the
|
||||
[Debian Package Tracker](https://tracker.debian.org/pkg/ocrmypdf) or
|
||||
[Ubuntu launchpad.net](https://launchpad.net/ocrmypdf).
|
||||
|
||||
:::{note}
|
||||
The OCRmyPDF packages for Debian and Ubuntu currently omit the JBIG2 encoder.
|
||||
OCRmyPDF works fine without it but will produce larger output files.
|
||||
If you build jbig2enc from source, ocrmypdf will
|
||||
automatically detect it (specifically the `jbig2` binary) on the
|
||||
`PATH`. To add JBIG2 encoding, see {ref}`jbig2`.
|
||||
:::
|
||||
|
||||
### Fedora
|
||||
|
||||
:::{list-table}
|
||||
:header-rows: 1
|
||||
|
||||
* - OCRmyPDF version
|
||||
* - {{latest}}
|
||||
* - {{fedora_38}} {{fedora_39}} {{fedora_rawhide}}
|
||||
:::
|
||||
|
||||
Users of Fedora may simply
|
||||
|
||||
```bash
|
||||
dnf install ocrmypdf tesseract-osd
|
||||
```
|
||||
|
||||
For full details on version availability, check the [Fedora Package
|
||||
Tracker](https://packages.fedoraproject.org/pkgs/ocrmypdf/ocrmypdf/).
|
||||
|
||||
If the version available for your platform is out of date, you could opt
|
||||
to install the latest version from source. See [Installing HEAD revision
|
||||
from sources](#installing-head-revision-from-sources).
|
||||
|
||||
:::{note}
|
||||
OCRmyPDF for Fedora currently omits the JBIG2 encoder due to patent
|
||||
issues. OCRmyPDF works fine without it but will produce larger output
|
||||
files. If you build jbig2enc from source, ocrmypdf 7.0.0 and later
|
||||
will automatically detect it on the `PATH`. To add JBIG2 encoding,
|
||||
see {ref}`Installing the JBIG2 encoder <jbig2>`.
|
||||
:::
|
||||
|
||||
(ubuntu-lts-latest)=
|
||||
|
||||
### RHEL 9
|
||||
|
||||
Prepare the environment by getting Python 3.11:
|
||||
|
||||
```bash
|
||||
dnf install python3.11 python3.11-pip
|
||||
```
|
||||
|
||||
Then, follow [Requirements for pip and HEAD install](#requirements-for-pip-and-head-install) to install dependencies:
|
||||
|
||||
```bash
|
||||
dnf install ghostscript tesseract
|
||||
```
|
||||
|
||||
and build ocrmypdf in a virtual environment:
|
||||
|
||||
```bash
|
||||
python3.11 -m venv .venv
|
||||
```
|
||||
|
||||
To add JBIG2 encoding, see {ref}`Installing the JBIG2 encoder <jbig2>`.
|
||||
|
||||
Note that Fedora packages for language data have not been branched for RHEL/EPEL, but you can get traineddata files directly from [tesseract](https://github.com/tesseract-ocr/tessdata/) and place them in `/usr/share/tesseract/tessdata`.
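
As a minimal sketch of this manual step, the helper below builds the download URL for a single traineddata file. The helper name `tessdata_url`, the language code `deu`, and the `raw/main` URL layout are assumptions made for illustration; check them against the tessdata repository before relying on them.

```bash
# Hypothetical helper: build the download URL for one traineddata file.
# Assumes the tessdata repository serves files from its "main" branch.
tessdata_url() {
    printf 'https://github.com/tesseract-ocr/tessdata/raw/main/%s.traineddata\n' "$1"
}

# Example (not run automatically): fetch German language data.
# sudo curl -L -o /usr/share/tesseract/tessdata/deu.traineddata "$(tessdata_url deu)"
```

The download itself is left commented out because it needs network access and root privileges.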
|
||||
|
||||
### Installing the latest version on Ubuntu 22.04 LTS
|
||||
|
||||
Ubuntu 22.04 includes ocrmypdf 13.4.0 - you can install that with
|
||||
`apt install ocrmypdf`. To install a more recent version for the current
|
||||
user, follow these steps:
|
||||
|
||||
```bash
|
||||
sudo apt-get update
|
||||
sudo apt-get -y install ocrmypdf python3-pip
|
||||
|
||||
pip install --user --upgrade ocrmypdf
|
||||
```
|
||||
|
||||
If you get the message `WARNING: The script ocrmypdf is installed in
|
||||
'/home/$USER/.local/bin' which is not on PATH.`, you may need to re-login
|
||||
or open a new shell, or manually adjust your PATH.
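
If you choose to adjust your PATH manually, one possible approach is sketched below. The function name `add_to_path` is hypothetical, and the snippet only changes the current shell session; to make it permanent you would add it to your shell's startup file.

```bash
# Prepend a directory to PATH only if it is not already listed.
add_to_path() {
    case ":$PATH:" in
        *":$1:"*) ;;              # already present, nothing to do
        *) PATH="$1:$PATH" ;;     # otherwise put it at the front
    esac
    export PATH
}

add_to_path "$HOME/.local/bin"
```

Wrapping PATH in colons before matching avoids false positives on directories that merely share a prefix.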
|
||||
|
||||
To add JBIG2 encoding, see {ref}`jbig2`.
|
||||
|
||||
### Ubuntu 20.04 LTS
|
||||
|
||||
Ubuntu 20.04 includes ocrmypdf 9.6.0 - you can install that with `apt`. The
|
||||
most convenient way to install recent OCRmyPDF on older Ubuntu is to use
|
||||
Homebrew on Linux (Linuxbrew).
|
||||
|
||||
```bash
|
||||
brew install ocrmypdf
|
||||
```
|
||||
|
||||
### Arch Linux (AUR)
|
||||
|
||||
:::{image} https://repology.org/badge/version-for-repo/aur/ocrmypdf.svg
|
||||
:alt: ArchLinux
|
||||
:target: https://repology.org/metapackage/ocrmypdf
|
||||
:::
|
||||
|
||||
There is an [Arch User Repository (AUR) package for OCRmyPDF](https://aur.archlinux.org/packages/ocrmypdf/).
|
||||
|
||||
Installing AUR packages as root is not allowed, so you must first [set up a
|
||||
non-root user](https://wiki.archlinux.org/index.php/Users_and_groups#User_management) and
|
||||
[configure sudo](https://wiki.archlinux.org/index.php/Sudo#Configuration).
|
||||
The standard Docker image, `archlinux/base:latest`, does **not** have a
|
||||
non-root user configured, so users of that image must follow these guides. If
|
||||
you are using a VM image, such as [the official Vagrant image](https://app.vagrantup.com/archlinux/boxes/archlinux), this work may already
|
||||
be completed for you.
|
||||
|
||||
Next you should install the [base-devel package group](https://archlinux.org/packages/core/any/base-devel/). This includes the
|
||||
standard tooling needed to build packages, such as a compiler and binary tools.
|
||||
|
||||
```bash
|
||||
sudo pacman -S --needed base-devel
|
||||
```
|
||||
|
||||
Now you are ready to install the OCRmyPDF package.
|
||||
|
||||
```bash
|
||||
curl -O https://aur.archlinux.org/cgit/aur.git/snapshot/ocrmypdf.tar.gz
|
||||
tar xvzf ocrmypdf.tar.gz
|
||||
cd ocrmypdf
|
||||
makepkg -sri
|
||||
```
|
||||
|
||||
At this point you will have a working install of OCRmyPDF, but the Tesseract
|
||||
install won’t include any OCR language data. You can install [the
|
||||
tesseract-data package group](https://www.archlinux.org/groups/any/tesseract-data/) to add all supported
|
||||
languages, or use that package listing to identify the appropriate package for
|
||||
your desired language.
|
||||
|
||||
```bash
|
||||
sudo pacman -S tesseract-data-eng
|
||||
```
|
||||
|
||||
As an alternative to this manual procedure, consider using an [AUR helper](https://wiki.archlinux.org/index.php/AUR_helpers). Such a tool will
|
||||
automatically fetch, build and install the AUR package, resolve dependencies
|
||||
(including dependencies on AUR packages), and ease the upgrade procedure.
|
||||
|
||||
If you have any difficulties with installation, check the repository package
|
||||
page.
|
||||
|
||||
:::{note}
|
||||
The OCRmyPDF AUR package currently omits the JBIG2 encoder. OCRmyPDF works
|
||||
fine without it but will produce larger output files. The encoder is
|
||||
available from [the jbig2enc-git AUR package](https://aur.archlinux.org/packages/jbig2enc-git/) and may be installed
|
||||
using the same series of steps as for the installation of the OCRmyPDF AUR
|
||||
package. Alternatively, it may be built manually from source following the
|
||||
instructions in {ref}`Installing the JBIG2 encoder <jbig2>`. If JBIG2 is
|
||||
installed, OCRmyPDF 7.0.0 and later will automatically detect it.
|
||||
:::
|
||||
|
||||
### Alpine Linux
|
||||
|
||||
:::{image} https://repology.org/badge/version-for-repo/alpine_edge/ocrmypdf.svg
|
||||
:alt: Alpine Linux
|
||||
:target: https://repology.org/metapackage/ocrmypdf
|
||||
:::
|
||||
|
||||
To install OCRmyPDF for Alpine Linux:
|
||||
|
||||
```bash
|
||||
apk add ocrmypdf
|
||||
```
|
||||
|
||||
### Gentoo Linux
|
||||
|
||||
:::{image} https://repology.org/badge/version-for-repo/gentoo_ovl_guru/ocrmypdf.svg
|
||||
:alt: Gentoo Linux
|
||||
:target: https://repology.org/metapackage/ocrmypdf
|
||||
:::
|
||||
|
||||
To install OCRmyPDF on Gentoo Linux, use the following commands:
|
||||
|
||||
```bash
|
||||
eselect repository enable guru
|
||||
emaint sync --repo guru
|
||||
emerge --ask app-text/OCRmyPDF
|
||||
```
|
||||
|
||||
### Other Linux packages
|
||||
|
||||
See the
|
||||
[Repology](https://repology.org/metapackage/ocrmypdf/versions) page.
|
||||
|
||||
In general, first install the OCRmyPDF package for your system, then
|
||||
optionally use the procedure [Installing with Python
|
||||
pip](#installing-with-python-pip) to install a more recent version.
|
||||
|
||||
## Installing on macOS
|
||||
|
||||
### Homebrew
|
||||
|
||||
:::{image} https://img.shields.io/homebrew/v/ocrmypdf.svg
|
||||
:alt: homebrew
|
||||
:target: https://formulae.brew.sh/formula/ocrmypdf
|
||||
:::
|
||||
|
||||
OCRmyPDF is now a standard [Homebrew](https://brew.sh) formula. To
|
||||
install on macOS:
|
||||
|
||||
```bash
|
||||
brew install ocrmypdf
|
||||
```
|
||||
|
||||
This will include only the English language pack. If you need other
|
||||
languages you can optionally install them all:
|
||||
|
||||
```bash
|
||||
brew install tesseract-lang # Optional: Install all language packs
|
||||
```
|
||||
|
||||
### MacPorts
|
||||
|
||||
:::{image} https://img.shields.io/badge/dynamic/json?url=https%3A%2F%2Fports.macports.org%2Fapi%2Fv1%2Fports%2Focrmypdf%2F%3Fformat%3Djson&query=version&label=MacPorts
|
||||
:alt: Macports Version Information
|
||||
:target: https://ports.macports.org/port/ocrmypdf
|
||||
:::
|
||||
|
||||
OCRmyPDF is included in MacPorts:
|
||||
|
||||
```bash
|
||||
sudo port install ocrmypdf
|
||||
```
|
||||
|
||||
Note that while this will install tesseract you will need to install
|
||||
the appropriate tesseract [language ports](https://ports.macports.org/search/?selected_facets=categories_exact%3Atextproc&installed_file=&q=tesseract&name=on).
|
||||
|
||||
### Manual installation on macOS
|
||||
|
||||
These instructions probably work on all macOS versions supported by Homebrew, and are
|
||||
for installing a more current version of OCRmyPDF than is available from
|
||||
Homebrew. Note that the Homebrew versions usually track the release versions
|
||||
fairly closely.
|
||||
|
||||
If it's not already present, [install Homebrew](http://brew.sh/).
|
||||
|
||||
Update Homebrew:
|
||||
|
||||
```bash
|
||||
brew update
|
||||
```
|
||||
|
||||
Install or upgrade the required Homebrew packages, if any are missing.
|
||||
To do this, use `brew edit ocrmypdf` to obtain a recent list of Homebrew
|
||||
dependencies. You could also check the `.workflows/build.yml`.
|
||||
|
||||
This will include the English, French, German and Spanish language
|
||||
packs. If you need other languages you can optionally install them all:
|
||||
|
||||
(macos-all-languages)=
|
||||
|
||||
> ```bash
|
||||
> brew install tesseract-lang # Option 2: for all language packs
|
||||
> ```
|
||||
|
||||
Update the Homebrew pip:
|
||||
|
||||
```bash
|
||||
pip install --upgrade pip
|
||||
```
|
||||
|
||||
You can then install OCRmyPDF from PyPI for the current user:
|
||||
|
||||
```bash
|
||||
pip install --user ocrmypdf
|
||||
```
|
||||
|
||||
The command line program should now be available:
|
||||
|
||||
```bash
|
||||
ocrmypdf --help
|
||||
```
|
||||
|
||||
## Installing on Windows
|
||||
|
||||
### Native Windows
|
||||
|
||||
% If you have a Windows that is not the Home edition, you can use Windows Sandbox to test on a blank Windows instance.
|
||||
% https://learn.microsoft.com/en-us/windows/security/application-security/application-isolation/windows-sandbox/
|
||||
|
||||
:::{note}
|
||||
Administrator privileges will be required for some of these steps.
|
||||
:::
|
||||
|
||||
You must install the following for Windows:
|
||||
|
||||
- Python 64-bit
|
||||
- Tesseract 64-bit
|
||||
- Ghostscript 64-bit
|
||||
|
||||
Using the [winget](https://docs.microsoft.com/en-us/windows/package-manager/winget/)
|
||||
package manager:
|
||||
|
||||
- `winget install -e --id Python.Python.3.11`
|
||||
- `winget install -e --id UB-Mannheim.TesseractOCR`
|
||||
|
||||
You will need to install Ghostscript manually, [since it does not support automated
|
||||
installs anymore](https://artifex.com/news/ghostscript-10.01.0-disabling-silent-install-option).
|
||||
|
||||
- [Ghostscript download page](https://ghostscript.com/releases/gsdnld.html).
|
||||
|
||||
(Or alternately, using the [Chocolatey](https://chocolatey.org/) package manager, install
|
||||
the following when running in an Administrator command prompt):
|
||||
|
||||
- `choco install python3`
|
||||
- `choco install --pre tesseract`
|
||||
- `choco install pngquant` (optional)
|
||||
|
||||
Either set of commands will install the required software. At the moment there is no
|
||||
single command that installs everything on Windows.
|
||||
|
||||
You may then use `pip` to install ocrmypdf. (This can be performed by a user or
|
||||
Administrator.):
|
||||
|
||||
- `python3 -m pip install ocrmypdf`
|
||||
|
||||
% The Windows Python versions do not place any python or python3 executable in the path.
|
||||
% They add the py launcher to the path:
|
||||
% https://docs.python.org/3/using/windows.html#python-launcher-for-windows
|
||||
|
||||
If you installed Python using WinGet, then use the following command instead:
|
||||
|
||||
- `py -m pip install ocrmypdf`
|
||||
|
||||
and use:
|
||||
|
||||
- `py -m ocrmypdf`
|
||||
|
||||
to start OCRmyPDF.
|
||||
|
||||
If you intend to use more Python software on your Windows machine, consider the use of
|
||||
[pipx](https://pipx.pypa.io/stable/) or a similar tool to create isolated Python
|
||||
environments for each Python software that you want to use.
|
||||
|
||||
OCRmyPDF will check the Windows Registry and standard locations in your Program Files
|
||||
for third party software it needs (specifically, Tesseract and Ghostscript). To
|
||||
override the versions OCRmyPDF selects, you can modify the `PATH` environment
|
||||
variable. [Follow these directions](https://www.computerhope.com/issues/ch000549.htm#dospath)
|
||||
to change the PATH.
|
||||
|
||||
:::{warning}
|
||||
As of early 2021, users have reported problems with the Microsoft Store version of
|
||||
Python and OCRmyPDF. These issues affect many other third party Python packages.
|
||||
Please download Python from Python.org or a package manager instead of the
|
||||
Microsoft Store version.
|
||||
:::
|
||||
|
||||
:::{warning}
|
||||
32-bit Windows is not supported.
|
||||
:::
|
||||
|
||||
### Windows Subsystem for Linux
|
||||
|
||||
1. Install Ubuntu 22.04 for Windows Subsystem for Linux, if not already installed.
|
||||
2. Follow the procedure to install {ref}`OCRmyPDF on Ubuntu 22.04 <ubuntu-lts-latest>`.
|
||||
3. Open the Windows command prompt and create a symlink:
|
||||
|
||||
```powershell
|
||||
wsl sudo ln -s /home/$USER/.local/bin/ocrmypdf /usr/local/bin/ocrmypdf
|
||||
```
|
||||
|
||||
Then confirm that the expected version from PyPI ({{ latest }}) is installed:
|
||||
|
||||
```powershell
|
||||
wsl ocrmypdf --version
|
||||
```
|
||||
|
||||
You can then run OCRmyPDF in the Windows command prompt or Powershell, prefixing
|
||||
`wsl`, and call it from Windows programs or batch files.
|
||||
|
||||
### Cygwin64
|
||||
|
||||
First install the following prerequisite Cygwin packages using `setup-x86_64.exe`:
|
||||
|
||||
```
|
||||
python310 (or later)
|
||||
python3?-devel
|
||||
python3?-pip
|
||||
python3?-lxml
|
||||
python3?-imaging
|
||||
|
||||
(where 3? means match the version of python3 you installed)
|
||||
|
||||
gcc-g++
|
||||
ghostscript
|
||||
libexempi3
|
||||
libexempi-devel
|
||||
libffi6
|
||||
libffi-devel
|
||||
pngquant
|
||||
qpdf
|
||||
libqpdf-devel
|
||||
tesseract-ocr
|
||||
tesseract-ocr-devel
|
||||
```
|
||||
|
||||
Then open a Cygwin terminal (i.e. `mintty`) and run the following commands. Note
|
||||
that if you are using the version of `pip` that was installed with the Cygwin
|
||||
Python package, the command name will be `pip3`. If you have since updated
|
||||
`pip` (with, for instance, `pip3 install --upgrade pip`), then the command is
|
||||
likely just `pip` instead of `pip3`:
|
||||
|
||||
```bash
|
||||
pip3 install wheel
|
||||
pip3 install ocrmypdf
|
||||
```
|
||||
|
||||
The optional dependency "unpaper" is currently not available under Cygwin.
|
||||
Without it, certain options such as `--clean` will produce an error message.
|
||||
However, the OCR-to-text-layer functionality is available.
|
||||
|
||||
### Docker
|
||||
|
||||
You can also [Install the Docker image](docker) on Windows. Ensure that
|
||||
your command prompt can run the docker "hello world" container.
|
||||
|
||||
## Installing on FreeBSD
|
||||
|
||||
:::{image} https://repology.org/badge/version-for-repo/freebsd/ocrmypdf.svg
|
||||
:alt: FreeBSD
|
||||
:target: https://repology.org/project/ocrmypdf/versions
|
||||
:::
|
||||
|
||||
```bash
|
||||
pkg install textproc/py-ocrmypdf
|
||||
```
|
||||
|
||||
To install a more recent version, you could attempt to first install the system
|
||||
version with `pkg`, then use `pip install --user ocrmypdf`.
|
||||
|
||||
## Installing the Docker image
|
||||
|
||||
For some users, installing the Docker image will be easier than
|
||||
installing all of OCRmyPDF's dependencies.
|
||||
|
||||
See [Installing the Docker image](docker) for more information.
|
||||
|
||||
(installing-with-python-pip)=
|
||||
|
||||
## Installing with Python pip
|
||||
|
||||
OCRmyPDF is delivered by PyPI because it is a convenient way to install
|
||||
the latest version. However, PyPI and `pip` cannot address the fact
|
||||
that `ocrmypdf` depends on certain non-Python system libraries and
|
||||
programs being installed.
|
||||
|
||||
For best results, first install [your platform's
|
||||
version](https://repology.org/metapackage/ocrmypdf/versions) of
|
||||
`ocrmypdf`, using the instructions elsewhere in this document. Then
|
||||
you can use `pip` to get the latest version if your platform version
|
||||
is out of date. Chances are that this will satisfy most dependencies.
|
||||
|
||||
Use `ocrmypdf --version` to confirm what version was installed.
|
||||
|
||||
Then you can install the latest OCRmyPDF from the Python wheels. First
|
||||
try:
|
||||
|
||||
```bash
|
||||
pip install --user ocrmypdf
|
||||
```
|
||||
|
||||
(If the message `Requirement already satisfied: ocrmypdf in...` appears,
|
||||
you will need to use `pip install --user --upgrade ocrmypdf`.)
|
||||
|
||||
You should then be able to run `ocrmypdf --version` and see that the
|
||||
latest version was located.
|
||||
|
||||
## Installing with pipx
|
||||
|
||||
Some users may prefer pipx. As with the method above, you will need to
|
||||
satisfy all non-Python dependencies. Then if pipx is installed, you
|
||||
can use
|
||||
|
||||
```bash
|
||||
pipx run ocrmypdf
|
||||
```
|
||||
|
||||
(If OCRmyPDF is not installed, pipx will install it first.)
|
||||
|
||||
(requirements-for-pip-and-head-install)=
|
||||
|
||||
### Requirements for pip and HEAD install
|
||||
|
||||
OCRmyPDF currently requires these external programs and libraries to be
|
||||
installed, and these must be provided by the operating system package
|
||||
manager. `pip` cannot provide them.
|
||||
|
||||
The following versions are required:
|
||||
|
||||
- Python 3.10 or newer
|
||||
- Ghostscript 9.54 or newer
|
||||
- Tesseract 4.1.1 or newer
|
||||
- jbig2enc 0.29 or newer
|
||||
- pngquant 2.5 or newer
|
||||
- unpaper 6.1
|
||||
|
||||
We recommend 64-bit versions of all software. (32-bit versions are not
|
||||
supported, although on Linux, they may still work.)
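
The minimum versions listed above can be checked from a shell. The following is a minimal sketch, not part of OCRmyPDF: the function name `version_ge` is hypothetical, and it assumes a `sort` that supports `-V` (version sort), as GNU coreutils does.

```bash
# Return success if version $1 is greater than or equal to version $2.
# Relies on `sort -V` (version sort), which GNU coreutils provides.
version_ge() {
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# Example: compare the running Python against the documented minimum.
if command -v python3 >/dev/null 2>&1; then
    py_version=$(python3 -c 'import sys; print("%d.%d" % sys.version_info[:2])')
    if version_ge "$py_version" 3.10; then
        echo "Python $py_version is new enough"
    else
        echo "Python $py_version is too old"
    fi
fi
```

A plain string or numeric comparison would rank 3.9 above 3.10, which is why a version-aware sort is used here.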
|
||||
|
||||
jbig2enc, pngquant, and unpaper are optional. If any are missing, certain
|
||||
features are disabled. OCRmyPDF will discover them as soon as they are
|
||||
available.
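
To see which of the optional programs are currently discoverable, a quick check along these lines may help. This is an illustrative sketch: the function name `check_tool` is hypothetical, and it only tests whether an executable is on the `PATH`, not whether its version is recent enough.

```bash
# Report whether an executable is on PATH.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1: found"
    else
        echo "$1: missing"
    fi
}

# jbig2 is the binary that jbig2enc installs.
for tool in jbig2 pngquant unpaper; do
    check_tool "$tool"
done
```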
|
||||
|
||||
**jbig2enc**, if present, will be used to optimize the encoding of
|
||||
monochrome images. This can significantly reduce the file size of the
|
||||
output file. It is not required.
|
||||
[jbig2enc](https://github.com/agl/jbig2enc) is not generally
|
||||
available for Ubuntu or Debian due to lingering concerns about patent
|
||||
issues, but can easily be built from source. To add JBIG2 encoding, see
|
||||
{ref}`jbig2`.
|
||||
|
||||
**pngquant**, if present, is optionally used to optimize the encoding of
|
||||
PNG-style images in PDFs (actually, any that are losslessly
|
||||
encoded) by lossily quantizing to a smaller color palette. It is only
|
||||
activated when the `--optimize` argument is `2` or `3`.
|
||||
|
||||
**unpaper**, if present, enables the `--clean` and `--clean-final`
|
||||
command line options.
|
||||
|
||||
These are in addition to the Python packaging dependencies, meaning that
|
||||
unfortunately, the `pip install` command cannot satisfy all of them.
|
||||
|
||||
(installing-head-revision-from-sources)=
|
||||
|
||||
## Installing HEAD revision from sources
|
||||
|
||||
If you have `git` and Python 3.10 or newer installed, you can install
|
||||
from source. When the `pip` installer runs, it will alert you if
|
||||
dependencies are missing.
|
||||
|
||||
If you prefer to build everything from source, you will need to [build
|
||||
pikepdf from
|
||||
source](https://pikepdf.readthedocs.io/en/latest/installation.html#building-from-source).
|
||||
First ensure you can build and install pikepdf.
|
||||
|
||||
To install the HEAD revision from sources in the current Python 3
|
||||
environment:
|
||||
|
||||
```bash
|
||||
pip install git+https://github.com/ocrmypdf/OCRmyPDF.git
|
||||
```
|
||||
|
||||
Or, to install in editable mode
|
||||
allowing customization of OCRmyPDF, use the `-e` flag:
|
||||
|
||||
```bash
|
||||
pip install -e "git+https://github.com/ocrmypdf/OCRmyPDF.git#egg=ocrmypdf"
|
||||
```
|
||||
|
||||
You may find it easiest to install in a virtual environment, rather than
|
||||
system-wide:
|
||||
|
||||
```bash
|
||||
git clone -b main https://github.com/ocrmypdf/OCRmyPDF.git
|
||||
python3 -m venv .venv
|
||||
source .venv/bin/activate
|
||||
cd OCRmyPDF
|
||||
pip install .
|
||||
```
|
||||
|
||||
However, `ocrmypdf` will only be accessible on the system PATH when
|
||||
you activate the virtual environment.
|
||||
|
||||
To run the program:
|
||||
|
||||
```bash
|
||||
ocrmypdf --help
|
||||
```
|
||||
|
||||
If not yet installed, the script will notify you about dependencies that
|
||||
need to be installed. The script requires specific versions of the
|
||||
dependencies. Versions older than the ones mentioned in the release notes
|
||||
are likely not to be compatible with OCRmyPDF.
|
||||
|
||||
### For development
|
||||
|
||||
To install all of the development and test requirements:
|
||||
|
||||
```bash
|
||||
git clone -b main https://github.com/ocrmypdf/OCRmyPDF.git
|
||||
python -m venv .venv
|
||||
source .venv/bin/activate
|
||||
cd OCRmyPDF
|
||||
pip install -e .[test]
|
||||
```
|
||||
|
||||
To add JBIG2 encoding, see {ref}`jbig2`.
|
||||
|
||||
## Shell completions
|
||||
|
||||
Completions for `bash` and `fish` are available in the project's
|
||||
`misc/completion` folder. The `bash` completions are likely `zsh`
|
||||
compatible but this has not been confirmed. Package maintainers, please
|
||||
install these at the appropriate locations for your system.
|
||||
|
||||
To manually install the `bash` completion, copy
|
||||
`misc/completion/ocrmypdf.bash` to `/etc/bash_completion.d/ocrmypdf`
|
||||
(rename the file).
|
||||
|
||||
To manually install the `fish` completion, copy
|
||||
`misc/completion/ocrmypdf.fish` to
|
||||
`~/.config/fish/completions/ocrmypdf.fish`.
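
The manual steps above can be sketched as shell commands. The helper `install_completion` is a hypothetical convenience, not something shipped with OCRmyPDF, and the example commands assume they are run from the root of an OCRmyPDF source checkout.

```bash
# Copy a completion file to a destination path, creating the directory first.
install_completion() {
    mkdir -p "$(dirname "$2")" && cp "$1" "$2"
}

# Examples, run from the root of an OCRmyPDF checkout (not executed here):
# sudo cp misc/completion/ocrmypdf.bash /etc/bash_completion.d/ocrmypdf
# install_completion misc/completion/ocrmypdf.fish ~/.config/fish/completions/ocrmypdf.fish
```

The `bash` destination is a system directory, so that copy needs root privileges, while the `fish` destination is per-user.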
|
||||
|
||||
## Note on 32-bit support
|
||||
|
||||
Many Python libraries no longer provide 32-bit binary wheels for Linux. This
|
||||
includes many of the libraries that OCRmyPDF depends on, such as
|
||||
Pillow. The easiest way to express this to end users is to say we don't
|
||||
support 32-bit Linux.
|
||||
|
||||
However, if your Linux distribution still supports 32-bit binaries, you
|
||||
can still install and use OCRmyPDF. A warning message will appear.
|
||||
In practice, OCRmyPDF may need more than 32-bit memory space to run when
|
||||
large documents are processed, so there are practical limitations to what
|
||||
users can accomplish with it. Still, for the common use case of a 32-bit
|
||||
ARM NAS or Raspberry Pi processing small documents, it should work.
|
||||
@@ -1,740 +0,0 @@
|
||||
.. SPDX-FileCopyrightText: 2022 James R. Barlow
|
||||
..
|
||||
.. SPDX-License-Identifier: CC-BY-SA-4.0
|
||||
|
||||
===================
|
||||
Installing OCRmyPDF
|
||||
===================
|
||||
|
||||
.. |latest| image:: https://img.shields.io/pypi/v/ocrmypdf.svg
|
||||
:alt: OCRmyPDF latest released version on PyPI
|
||||
|
||||
|latest|
|
||||
|
||||
The easiest way to install OCRmyPDF is to follow the steps for your operating
|
||||
system/platform. This version may be out of date, however.
|
||||
|
||||
These platforms have one-liner installs:
|
||||
|
||||
+-------------------------------+-----------------------------------------+
|
||||
| Debian, Ubuntu | ``apt install ocrmypdf`` |
|
||||
+-------------------------------+-----------------------------------------+
|
||||
| Windows Subsystem for Linux | ``apt install ocrmypdf`` |
|
||||
+-------------------------------+-----------------------------------------+
|
||||
| Fedora | ``dnf install ocrmypdf tesseract-osd`` |
|
||||
+-------------------------------+-----------------------------------------+
|
||||
| macOS (Homebrew) | ``brew install ocrmypdf`` |
|
||||
+-------------------------------+-----------------------------------------+
|
||||
| macOS (MacPorts) | ``port install ocrmypdf`` |
|
||||
+-------------------------------+-----------------------------------------+
|
||||
| LinuxBrew | ``brew install ocrmypdf`` |
|
||||
+-------------------------------+-----------------------------------------+
|
||||
| FreeBSD | ``pkg install textproc/py-ocrmypdf`` |
|
||||
+-------------------------------+-----------------------------------------+
|
||||
| Snap (snapcraft packaging) | ``snap install ocrmypdf`` |
|
||||
+-------------------------------+-----------------------------------------+
|
||||
|
||||
More detailed procedures are outlined below. If you want to do a manual
|
||||
install, or install a more recent version than your platform provides, read on.
|
||||
|
||||
.. contents:: Platform-specific steps
|
||||
:depth: 2
|
||||
:local:
|
||||
|
||||
Installing on Linux
|
||||
===================
|
||||
|
||||
Debian and Ubuntu 20.04 or newer
|
||||
--------------------------------
|
||||
|
||||
.. |deb-11| image:: https://repology.org/badge/version-for-repo/debian_11/ocrmypdf.svg
|
||||
:alt: Debian 11
|
||||
|
||||
.. |deb-12| image:: https://repology.org/badge/version-for-repo/debian_12/ocrmypdf.svg
|
||||
:alt: Debian 12
|
||||
|
||||
.. |deb-unstable| image:: https://repology.org/badge/version-for-repo/debian_unstable/ocrmypdf.svg
|
||||
:alt: Debian unstable
|
||||
|
||||
.. |ubu-2004| image:: https://repology.org/badge/version-for-repo/ubuntu_20_04/ocrmypdf.svg
|
||||
:alt: Ubuntu 20.04 LTS
|
||||
|
||||
.. |ubu-2204| image:: https://repology.org/badge/version-for-repo/ubuntu_22_04/ocrmypdf.svg
|
||||
:alt: Ubuntu 22.04 LTS
|
||||
|
||||
+-----------------------------------------------+
|
||||
| **OCRmyPDF versions in Debian & Ubuntu** |
|
||||
+-----------------------------------------------+
|
||||
| |latest| |
|
||||
+-----------------------------------------------+
|
||||
| |deb-11| |deb-12| |deb-unstable| |
|
||||
+-----------------------------------------------+
|
||||
| |ubu-2004| |ubu-2204| |
|
||||
+-----------------------------------------------+
|
||||
|
||||
Users of Debian or Ubuntu may simply run:
|
||||
|
||||
.. code-block:: bash
|
||||
|
||||
apt install ocrmypdf
|
||||
|
||||
As indicated in the table above, Debian and Ubuntu releases may lag
|
||||
behind the latest version. If the version available for your platform is
|
||||
out of date, you could opt to install the latest version from source.
|
||||
See `Installing HEAD revision from
|
||||
sources <#installing-head-revision-from-sources>`__.
|
||||
|
||||
For full details on version availability for your platform, check the
|
||||
`Debian Package Tracker <https://tracker.debian.org/pkg/ocrmypdf>`__ or
|
||||
`Ubuntu launchpad.net <https://launchpad.net/ocrmypdf>`__.
|
||||
|
||||
.. note::
|
||||
|
||||
OCRmyPDF for Debian and Ubuntu currently omits the JBIG2 encoder.
|
||||
OCRmyPDF works fine without it but will produce larger output files.
|
||||
If you build jbig2enc from source, ocrmypdf will
|
||||
automatically detect it (specifically the ``jbig2`` binary) on the
|
||||
``PATH``. To add JBIG2 encoding, see :ref:`jbig2`.
|
||||
|
||||
Fedora
|
||||
------
|
||||
|
||||
.. |fedora-38| image:: https://repology.org/badge/version-for-repo/fedora_38/ocrmypdf.svg
|
||||
:alt: Fedora 38
|
||||
|
||||
.. |fedora-39| image:: https://repology.org/badge/version-for-repo/fedora_39/ocrmypdf.svg
|
||||
:alt: Fedora 39
|
||||
|
||||
.. |fedora-rawhide| image:: https://repology.org/badge/version-for-repo/fedora_rawhide/ocrmypdf.svg
|
||||
:alt: Fedora Rawhide
|
||||
|
||||
+-----------------------------------------------+
|
||||
| **OCRmyPDF version** |
|
||||
+-----------------------------------------------+
|
||||
| |latest| |
|
||||
+-----------------------------------------------+
|
||||
| |fedora-38| |fedora-39| |fedora-rawhide| |
|
||||
+-----------------------------------------------+
|
||||
|
||||
Users of Fedora may simply run:
|
||||
|
||||
.. code-block:: bash
|
||||
|
||||
dnf install ocrmypdf tesseract-osd
|
||||
|
||||
For full details on version availability, check the `Fedora Package
|
||||
Tracker <https://packages.fedoraproject.org/pkgs/ocrmypdf/ocrmypdf/>`__.
|
||||
|
||||
If the version available for your platform is out of date, you could opt
|
||||
to install the latest version from source. See `Installing HEAD revision
|
||||
from sources <#installing-head-revision-from-sources>`__.
|
||||
|
||||
.. note::
|
||||
|
||||
OCRmyPDF for Fedora currently omits the JBIG2 encoder due to patent
|
||||
issues. OCRmyPDF works fine without it but will produce larger output
|
||||
files. If you build jbig2enc from source, ocrmypdf 7.0.0 and later
|
||||
will automatically detect it on the ``PATH``. To add JBIG2 encoding,
|
||||
see :ref:`Installing the JBIG2 encoder <jbig2>`.
|
||||
|
||||
RHEL 9
------

Prepare the environment by getting Python 3.11:

.. code-block:: bash

    dnf install python3.11 python3.11-pip

Then, follow `Requirements for pip and HEAD install <#requirements-for-pip-and-head-install>`__ to install dependencies:

.. code-block:: bash

    dnf install ghostscript tesseract

and build ocrmypdf in a virtual environment:

.. code-block:: bash

    python3.11 -m venv .venv

To add JBIG2 encoding, see :ref:`Installing the JBIG2 encoder <jbig2>`.

Note that Fedora packages for language data haven't been branched for RHEL/EPEL, but you can get traineddata files directly from `tesseract
<https://github.com/tesseract-ocr/tessdata/>`__ and place them in ``/usr/share/tesseract/tessdata``.

.. _ubuntu-lts-latest:

Installing the latest version on Ubuntu 22.04 LTS
-------------------------------------------------
|
||||
|
||||
Ubuntu 22.04 includes ocrmypdf 13.4.0 - you can install that with
|
||||
``apt install ocrmypdf``. To install a more recent version for the current
|
||||
user, follow these steps:
|
||||
|
||||
.. code-block:: bash
|
||||
|
||||
sudo apt-get update
|
||||
sudo apt-get -y install ocrmypdf python3-pip
|
||||
|
||||
pip install --user --upgrade ocrmypdf
|
||||
|
||||
If you get the message ``WARNING: The script ocrmypdf is installed in
|
||||
'/home/$USER/.local/bin' which is not on PATH.``, you may need to re-login
|
||||
or open a new shell, or manually adjust your PATH.
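
If the warning appears, the per-user script directory can be added to ``PATH`` by hand. The following is a minimal sketch, assuming a POSIX-compatible shell and that pip placed the script in ``~/.local/bin`` as the warning message indicates; the variable name ``user_bin`` is only illustrative.

```bash
# Add the per-user script directory to PATH for this shell session,
# but only if it is not already listed.
user_bin="$HOME/.local/bin"
case ":$PATH:" in
    *":$user_bin:"*) ;;                     # already present, nothing to do
    *) export PATH="$user_bin:$PATH" ;;     # prepend so the pip-installed copy is found first
esac
```

To make the change permanent, the same lines can be placed in a shell startup file such as ``~/.profile``.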
|
||||
|
||||
To add JBIG2 encoding, see :ref:`jbig2`.
|
||||
|
||||
Ubuntu 20.04 LTS
|
||||
----------------
|
||||
|
||||
Ubuntu 20.04 includes ocrmypdf 9.6.0 - you can install that with ``apt``. The
|
||||
most convenient way to install recent OCRmyPDF on older Ubuntu is to use
|
||||
Homebrew on Linux (Linuxbrew).
|
||||
|
||||
.. code-block:: bash
|
||||
|
||||
brew install ocrmypdf
|
||||
|
||||
Arch Linux (AUR)
|
||||
----------------
|
||||
|
||||
.. image:: https://repology.org/badge/version-for-repo/aur/ocrmypdf.svg
|
||||
:alt: ArchLinux
|
||||
:target: https://repology.org/metapackage/ocrmypdf
|
||||
|
||||
There is an `Arch User Repository (AUR) package for OCRmyPDF
|
||||
<https://aur.archlinux.org/packages/ocrmypdf/>`__.
|
||||
|
||||
Installing AUR packages as root is not allowed, so you must first `set up a
|
||||
non-root user
|
||||
<https://wiki.archlinux.org/index.php/Users_and_groups#User_management>`__ and
|
||||
`configure sudo <https://wiki.archlinux.org/index.php/Sudo#Configuration>`__.
|
||||
The standard Docker image, ``archlinux/base:latest``, does **not** have a
|
||||
non-root user configured, so users of that image must follow these guides. If
|
||||
you are using a VM image, such as `the official Vagrant image
|
||||
<https://app.vagrantup.com/archlinux/boxes/archlinux>`__, this work may already
|
||||
be completed for you.
|
||||
|
||||
Next you should install the `base-devel package group
|
||||
<https://archlinux.org/packages/core/any/base-devel/>`__. This includes the
|
||||
standard tooling needed to build packages, such as a compiler and binary tools.
|
||||
|
||||
.. code-block:: bash
|
||||
|
||||
sudo pacman -S --needed base-devel
|
||||
|
||||
Now you are ready to install the OCRmyPDF package.
|
||||
|
||||
.. code-block:: bash
|
||||
|
||||
curl -O https://aur.archlinux.org/cgit/aur.git/snapshot/ocrmypdf.tar.gz
|
||||
tar xvzf ocrmypdf.tar.gz
|
||||
cd ocrmypdf
|
||||
makepkg -sri
|
||||
|
||||
At this point you will have a working install of OCRmyPDF, but the Tesseract
|
||||
install won’t include any OCR language data. You can install `the
|
||||
tesseract-data package group
|
||||
<https://www.archlinux.org/groups/any/tesseract-data/>`__ to add all supported
|
||||
languages, or use that package listing to identify the appropriate package for
|
||||
your desired language.
|
||||
|
||||
.. code-block:: bash
|
||||
|
||||
sudo pacman -S tesseract-data-eng
|
||||
|
||||
As an alternative to this manual procedure, consider using an `AUR helper
|
||||
<https://wiki.archlinux.org/index.php/AUR_helpers>`__. Such a tool will
|
||||
automatically fetch, build and install the AUR package, resolve dependencies
|
||||
(including dependencies on AUR packages), and ease the upgrade procedure.
|
||||
|
||||
If you have any difficulties with installation, check the repository package
|
||||
page.
|
||||
|
||||
.. note::
|
||||
|
||||
The OCRmyPDF AUR package currently omits the JBIG2 encoder. OCRmyPDF works
|
||||
fine without it but will produce larger output files. The encoder is
|
||||
available from `the jbig2enc-git AUR package
|
||||
<https://aur.archlinux.org/packages/jbig2enc-git/>`__ and may be installed
|
||||
using the same series of steps as for installing the OCRmyPDF AUR
|
||||
package. Alternatively, it may be built manually from source following the
|
||||
instructions in :ref:`Installing the JBIG2 encoder <jbig2>`. If JBIG2 is
|
||||
installed, OCRmyPDF 7.0.0 and later will automatically detect it.
|
||||
|
||||
Alpine Linux
|
||||
------------
|
||||
|
||||
.. image:: https://repology.org/badge/version-for-repo/alpine_edge/ocrmypdf.svg
|
||||
:alt: Alpine Linux
|
||||
:target: https://repology.org/metapackage/ocrmypdf
|
||||
|
||||
To install OCRmyPDF for Alpine Linux:
|
||||
|
||||
.. code-block:: bash
|
||||
|
||||
apk add ocrmypdf
|
||||
|
||||
Gentoo Linux
|
||||
------------
|
||||
|
||||
.. image:: https://repology.org/badge/version-for-repo/gentoo_ovl_guru/ocrmypdf.svg
|
||||
:alt: Gentoo Linux
|
||||
:target: https://repology.org/metapackage/ocrmypdf
|
||||
|
||||
To install OCRmyPDF on Gentoo Linux, use the following commands:
|
||||
|
||||
.. code-block:: bash
|
||||
|
||||
eselect repository enable guru
|
||||
emaint sync --repo guru
|
||||
emerge --ask app-text/OCRmyPDF
|
||||
|
||||
Other Linux packages
|
||||
--------------------
|
||||
|
||||
See the
|
||||
`Repology <https://repology.org/metapackage/ocrmypdf/versions>`__ page.
|
||||
|
||||
In general, first install the OCRmyPDF package for your system, then
|
||||
optionally use the procedure `Installing with Python
|
||||
pip <#installing-with-python-pip>`__ to install a more recent version.
|
||||
|
||||
Installing on macOS
|
||||
===================
|
||||
|
||||
Homebrew
|
||||
--------
|
||||
|
||||
.. image:: https://img.shields.io/homebrew/v/ocrmypdf.svg
|
||||
:alt: homebrew
|
||||
:target: https://formulae.brew.sh/formula/ocrmypdf
|
||||
|
||||
OCRmyPDF is now a standard `Homebrew <https://brew.sh>`__ formula. To
|
||||
install on macOS:
|
||||
|
||||
.. code-block:: bash
|
||||
|
||||
brew install ocrmypdf
|
||||
|
||||
This will include only the English language pack. If you need other
|
||||
languages you can optionally install them all:
|
||||
|
||||
.. code-block:: bash
|
||||
|
||||
brew install tesseract-lang # Optional: Install all language packs
|
||||
|
||||
MacPorts
|
||||
--------
|
||||
|
||||
.. image:: https://img.shields.io/badge/dynamic/json?url=https%3A%2F%2Fports.macports.org%2Fapi%2Fv1%2Fports%2Focrmypdf%2F%3Fformat%3Djson&query=version&label=MacPorts
|
||||
:alt: Macports Version Information
|
||||
:target: https://ports.macports.org/port/ocrmypdf
|
||||
|
||||
OCRmyPDF is included in MacPorts:
|
||||
|
||||
.. code-block:: bash
|
||||
|
||||
sudo port install ocrmypdf
|
||||
|
||||
Note that while this will install tesseract, you will also need to install
|
||||
the appropriate tesseract `language ports <https://ports.macports.org/search/?selected_facets=categories_exact%3Atextproc&installed_file=&q=tesseract&name=on>`__.
|
||||
|
||||
Manual installation on macOS
|
||||
----------------------------
|
||||
|
||||
These instructions probably work on all macOS versions supported by Homebrew, and are
|
||||
for installing a more current version of OCRmyPDF than is available from
|
||||
Homebrew. Note that the Homebrew versions usually track the release versions
|
||||
fairly closely.
|
||||
|
||||
If it's not already present, `install Homebrew <http://brew.sh/>`__.
|
||||
|
||||
Update Homebrew:
|
||||
|
||||
.. code-block:: bash
|
||||
|
||||
brew update
|
||||
|
||||
Install or upgrade the required Homebrew packages, if any are missing.
|
||||
To do this, use ``brew edit ocrmypdf`` to obtain a recent list of Homebrew
|
||||
dependencies. You could also check the ``.workflows/build.yml``.
|
||||
|
||||
This will include the English, French, German and Spanish language
|
||||
packs. If you need other languages you can optionally install them all:
|
||||
|
||||
.. _macos-all-languages:
|
||||
|
||||
.. code-block:: bash
|
||||
|
||||
brew install tesseract-lang # Option 2: for all language packs
|
||||
|
||||
Update the homebrew pip:
|
||||
|
||||
.. code-block:: bash
|
||||
|
||||
pip install --upgrade pip
|
||||
|
||||
You can then install OCRmyPDF from PyPI for the current user:
|
||||
|
||||
.. code-block:: bash
|
||||
|
||||
pip install --user ocrmypdf
|
||||
|
||||
The command line program should now be available:
|
||||
|
||||
.. code-block:: bash
|
||||
|
||||
ocrmypdf --help
|
||||
|
||||
Installing on Windows
|
||||
=====================
|
||||
|
||||
Native Windows
|
||||
--------------
|
||||
|
||||
..
|
||||
If you have a Windows that is not the Home edition, you can use Windows Sandbox to test on a blank Windows instance.
|
||||
https://learn.microsoft.com/en-us/windows/security/application-security/application-isolation/windows-sandbox/
|
||||
|
||||
.. note::
|
||||
|
||||
Administrator privileges will be required for some of these steps.
|
||||
|
||||
You must install the following for Windows:
|
||||
|
||||
* Python 64-bit
|
||||
* Tesseract 64-bit
|
||||
* Ghostscript 64-bit
|
||||
|
||||
Using the `winget <https://docs.microsoft.com/en-us/windows/package-manager/winget/>`_
|
||||
package manager:
|
||||
|
||||
* ``winget install -e --id Python.Python.3.11``
|
||||
* ``winget install -e --id UB-Mannheim.TesseractOCR``
|
||||
|
||||
You will need to install Ghostscript manually, `since it does not support automated
|
||||
installs anymore <https://artifex.com/news/ghostscript-10.01.0-disabling-silent-install-option>`_.
|
||||
|
||||
* `Ghostscript download page <https://ghostscript.com/releases/gsdnld.html>`_.
|
||||
|
||||
(Or alternatively, using the `Chocolatey <https://chocolatey.org/>`_ package manager, install
|
||||
the following when running in an Administrator command prompt):
|
||||
|
||||
* ``choco install python3``
|
||||
* ``choco install --pre tesseract``
|
||||
* ``choco install pngquant`` (optional)
|
||||
|
||||
Either set of commands will install the required software. At the moment there is no
|
||||
single command to install all of these on Windows.
|
||||
|
||||
You may then use ``pip`` to install ocrmypdf. (This can be performed by a user or
|
||||
Administrator.):
|
||||
|
||||
* ``python3 -m pip install ocrmypdf``
|
||||
|
||||
..
|
||||
The Windows Python versions do not place any python or python3 executable in the path.
|
||||
They add the py launcher to the path:
|
||||
https://docs.python.org/3/using/windows.html#python-launcher-for-windows
|
||||
|
||||
If you installed Python using WinGet, then use the following command instead:
|
||||
|
||||
* ``py -m pip install ocrmypdf``
|
||||
|
||||
and use:
|
||||
|
||||
* ``py -m ocrmypdf``
|
||||
|
||||
to start OCRmyPDF.
|
||||
|
||||
If you intend to use more Python software on your Windows machine, consider the use of
|
||||
`pipx <https://pipx.pypa.io/stable/>`_ or a similar tool to create isolated Python
|
||||
environments for each Python program that you want to use.
|
||||
|
||||
OCRmyPDF will check the Windows Registry and standard locations in your Program Files
|
||||
for third party software it needs (specifically, Tesseract and Ghostscript). To
|
||||
override the versions OCRmyPDF selects, you can modify the ``PATH`` environment
|
||||
variable. `Follow these directions <https://www.computerhope.com/issues/ch000549.htm#dospath>`_
|
||||
to change the PATH.
|
||||
|
||||
.. warning::
|
||||
|
||||
As of early 2021, users have reported problems with the Microsoft Store version of
|
||||
Python and OCRmyPDF. These issues affect many other third party Python packages.
|
||||
Please download Python from Python.org or a package manager instead of the
|
||||
Microsoft Store version.
|
||||
|
||||
.. warning::
|
||||
|
||||
32-bit Windows is not supported.
|
||||
|
||||
Windows Subsystem for Linux
|
||||
---------------------------
|
||||
|
||||
#. Install Ubuntu 22.04 for Windows Subsystem for Linux, if not already installed.
|
||||
#. Follow the procedure to install :ref:`OCRmyPDF on Ubuntu 22.04 <ubuntu-lts-latest>`.
|
||||
#. Open the Windows command prompt and create a symlink:
|
||||
|
||||
.. code-block:: powershell
|
||||
|
||||
wsl sudo ln -s /home/$USER/.local/bin/ocrmypdf /usr/local/bin/ocrmypdf
|
||||
|
||||
Then confirm that the expected version from PyPI (|latest|) is installed:
|
||||
|
||||
.. code-block:: powershell
|
||||
|
||||
wsl ocrmypdf --version
|
||||
|
||||
You can then run OCRmyPDF in the Windows command prompt or PowerShell, prefixing
|
||||
``wsl``, and call it from Windows programs or batch files.
|
||||
|
||||
Cygwin64
|
||||
--------
|
||||
|
||||
First install the following prerequisite Cygwin packages using ``setup-x86_64.exe``::
|
||||
|
||||
python310 (or later)
|
||||
python3?-devel
|
||||
python3?-pip
|
||||
python3?-lxml
|
||||
python3?-imaging
|
||||
|
||||
(where 3? means match the version of python3 you installed)
|
||||
|
||||
gcc-g++
|
||||
ghostscript
|
||||
libexempi3
|
||||
libexempi-devel
|
||||
libffi6
|
||||
libffi-devel
|
||||
pngquant
|
||||
qpdf
|
||||
libqpdf-devel
|
||||
tesseract-ocr
|
||||
tesseract-ocr-devel
|
||||
|
||||
Then open a Cygwin terminal (i.e. ``mintty``) and run the following commands. Note
|
||||
that if you are using the version of ``pip`` that was installed with the Cygwin
|
||||
Python package, the command name will be ``pip3``. If you have since updated
|
||||
``pip`` (with, for instance, ``pip3 install --upgrade pip``), the command is
|
||||
likely just ``pip`` instead of ``pip3``:
|
||||
|
||||
.. code-block:: bash
|
||||
|
||||
pip3 install wheel
|
||||
pip3 install ocrmypdf
|
||||
|
||||
The optional dependency "unpaper" is currently not available under Cygwin.
|
||||
Without it, certain options such as ``--clean`` will produce an error message.
|
||||
However, the OCR-to-text-layer functionality is available.
|
||||
|
||||
Docker
|
||||
------
|
||||
|
||||
You can also :ref:`Install the Docker <docker>` container on Windows. Ensure that
|
||||
your command prompt can run the docker "hello world" container.
|
||||
|
||||
Installing on FreeBSD
|
||||
=====================
|
||||
|
||||
.. image:: https://repology.org/badge/version-for-repo/freebsd/ocrmypdf.svg
|
||||
:alt: FreeBSD
|
||||
:target: https://repology.org/project/ocrmypdf/versions
|
||||
|
||||
.. code-block:: bash
|
||||
|
||||
pkg install textproc/py-ocrmypdf
|
||||
|
||||
To install a more recent version, you could attempt to first install the system
|
||||
version with ``pkg``, then use ``pip install --user ocrmypdf``.
|
||||
|
||||
Installing the Docker image
|
||||
===========================
|
||||
|
||||
For some users, installing the Docker image will be easier than
|
||||
installing all of OCRmyPDF's dependencies.
|
||||
|
||||
See :ref:`docker` for more information.
|
||||
|
||||
Installing with Python pip
|
||||
==========================
|
||||
|
||||
OCRmyPDF is delivered by PyPI because it is a convenient way to install
|
||||
the latest version. However, PyPI and ``pip`` cannot address the fact
|
||||
that ``ocrmypdf`` depends on certain non-Python system libraries and
|
||||
programs being installed.
|
||||
|
||||
For best results, first install `your platform's
|
||||
version <https://repology.org/metapackage/ocrmypdf/versions>`__ of
|
||||
``ocrmypdf``, using the instructions elsewhere in this document. Then
|
||||
you can use ``pip`` to get the latest version if your platform version
|
||||
is out of date. Chances are that this will satisfy most dependencies.
|
||||
|
||||
Use ``ocrmypdf --version`` to confirm what version was installed.
|
||||
|
||||
Then you can install the latest OCRmyPDF from the Python wheels. First
|
||||
try:
|
||||
|
||||
.. code-block:: bash
|
||||
|
||||
pip install --user ocrmypdf
|
||||
|
||||
(If the message ``Requirement already satisfied: ocrmypdf in...`` appears,
|
||||
you will need to use ``pip install --user --upgrade ocrmypdf``.)
|
||||
|
||||
You should then be able to run ``ocrmypdf --version`` and see that the
|
||||
latest version was located.
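
When scripting an installation, it can be useful to check whether the installed version is at least some minimum. The sketch below is one way to do that. ``version_ge`` is a hypothetical helper; it relies on ``sort -V`` (present in GNU coreutils and in recent BSD and macOS ``sort``, which is an assumption about your system), and it assumes that ``ocrmypdf --version`` prints a bare version number.

```bash
# version_ge A B: succeed when version A is greater than or equal to version B.
version_ge() {
    # sort -V orders version strings numerically; if B sorts first (or ties), A >= B.
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Example minimum, chosen only for illustration.
wanted="13.4.0"

if command -v ocrmypdf >/dev/null 2>&1; then
    installed="$(ocrmypdf --version)"
    if version_ge "$installed" "$wanted"; then
        echo "ocrmypdf $installed is new enough"
    else
        echo "ocrmypdf $installed is older than $wanted; consider upgrading with pip"
    fi
else
    echo "ocrmypdf was not found on PATH"
fi
```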
|
||||
|
||||
Installing with pipx
|
||||
====================
|
||||
|
||||
Some users may prefer pipx. As with the method above, you will need to
|
||||
satisfy all non-Python dependencies. Then if pipx is installed, you
|
||||
can use
|
||||
|
||||
.. code-block:: bash
|
||||
|
||||
pipx run ocrmypdf
|
||||
|
||||
(If OCRmyPDF is not yet installed, pipx will install it first.)
|
||||
|
||||
Requirements for pip and HEAD install
|
||||
-------------------------------------
|
||||
|
||||
OCRmyPDF currently requires these external programs and libraries to be
|
||||
installed; they must be provided by the operating system package
|
||||
manager. ``pip`` cannot provide them.
|
||||
|
||||
The following versions are required:
|
||||
|
||||
- Python 3.10 or newer
|
||||
- Ghostscript 9.54 or newer
|
||||
- Tesseract 4.1.1 or newer
|
||||
- jbig2enc 0.29 or newer
|
||||
- pngquant 2.5 or newer
|
||||
- unpaper 6.1
|
||||
|
||||
We recommend 64-bit versions of all software. (32-bit versions are not
|
||||
supported, although on Linux, they may still work.)
|
||||
|
||||
jbig2enc, pngquant, and unpaper are optional. If any are missing, certain
|
||||
features are disabled. OCRmyPDF will discover them as soon as they are
|
||||
available.
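
A quick way to see which of these programs are already present is to look for their executables on the ``PATH``. The following sketch assumes the usual executable names on Unix-like systems (``tesseract``, ``gs`` for Ghostscript, ``jbig2`` for jbig2enc, ``pngquant``, ``unpaper``). ``check_tool`` is a hypothetical helper, and the check does not verify version numbers.

```bash
# check_tool NAME: report whether an executable called NAME is on PATH.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1: found"
    else
        echo "$1: missing"
    fi
}

echo "Required:"
for tool in tesseract gs; do
    check_tool "$tool"
done

echo "Optional:"
for tool in jbig2 pngquant unpaper; do
    check_tool "$tool"
done
```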
|
||||
|
||||
**jbig2enc**, if present, will be used to optimize the encoding of
|
||||
monochrome images. This can significantly reduce the file size of the
|
||||
output file. It is not required.
|
||||
`jbig2enc <https://github.com/agl/jbig2enc>`__ is not generally
|
||||
available for Ubuntu or Debian due to lingering concerns about patent
|
||||
issues, but can easily be built from source. To add JBIG2 encoding, see
|
||||
:ref:`jbig2`.
|
||||
|
||||
**pngquant**, if present, is optionally used to optimize the encoding of
|
||||
PNG-style images in PDFs (actually, any that are losslessly
|
||||
encoded) by lossily quantizing to a smaller color palette. It is only
|
||||
activated when the ``--optimize`` argument is ``2`` or ``3``.
|
||||
|
||||
**unpaper**, if present, enables the ``--clean`` and ``--clean-final``
|
||||
command line options.
|
||||
|
||||
These are in addition to the Python packaging dependencies, meaning that
|
||||
unfortunately, the ``pip install`` command cannot satisfy all of them.
|
||||
|
||||
Installing HEAD revision from sources
|
||||
=====================================
|
||||
|
||||
If you have ``git`` and Python 3.10 or newer installed, you can install
|
||||
from source. When the ``pip`` installer runs, it will alert you if
|
||||
dependencies are missing.
|
||||
|
||||
If you prefer to build everything from source, you will need to `build
|
||||
pikepdf from
|
||||
source <https://pikepdf.readthedocs.io/en/latest/installation.html#building-from-source>`__.
|
||||
First ensure you can build and install pikepdf.
|
||||
|
||||
To install the HEAD revision from sources in the current Python 3
|
||||
environment:
|
||||
|
||||
.. code-block:: bash
|
||||
|
||||
pip install git+https://github.com/ocrmypdf/OCRmyPDF.git
|
||||
|
||||
Or, to install in editable mode
|
||||
allowing customization of OCRmyPDF, use the ``-e`` flag:
|
||||
|
||||
.. code-block:: bash
|
||||
|
||||
pip install -e git+https://github.com/ocrmypdf/OCRmyPDF.git#egg=ocrmypdf
|
||||
|
||||
You may find it easiest to install in a virtual environment, rather than
|
||||
system-wide:
|
||||
|
||||
.. code-block:: bash
|
||||
|
||||
git clone -b main https://github.com/ocrmypdf/OCRmyPDF.git
|
||||
python3 -m venv .venv
|
||||
source .venv/bin/activate
|
||||
cd OCRmyPDF
|
||||
pip install .
|
||||
|
||||
However, ``ocrmypdf`` will only be accessible on the system PATH when
|
||||
you activate the virtual environment.
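
One way for a script to guard against this is to test the ``VIRTUAL_ENV`` variable, which the ``activate`` script sets. This is a minimal sketch, assuming the ``.venv`` directory created above; ``in_venv`` is a hypothetical helper.

```bash
# in_venv: succeed when a virtual environment is active in this shell.
in_venv() {
    [ -n "${VIRTUAL_ENV:-}" ]
}

if in_venv; then
    echo "virtual environment active: $VIRTUAL_ENV"
else
    echo "no virtual environment active; run: source .venv/bin/activate"
fi
```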
|
||||
|
||||
To run the program:
|
||||
|
||||
.. code-block:: bash
|
||||
|
||||
ocrmypdf --help
|
||||
|
||||
If not yet installed, the script will notify you about dependencies that
|
||||
need to be installed. The script requires specific versions of the
|
||||
dependencies. Versions older than those mentioned in the release notes
are likely to be incompatible with OCRmyPDF.
|
||||
|
||||
For development
|
||||
---------------
|
||||
|
||||
To install all of the development and test requirements:
|
||||
|
||||
.. code-block:: bash
|
||||
|
||||
git clone -b main https://github.com/ocrmypdf/OCRmyPDF.git
|
||||
python -m venv .venv
|
||||
source .venv/bin/activate
|
||||
cd OCRmyPDF
|
||||
pip install -e .[test]
|
||||
|
||||
To add JBIG2 encoding, see :ref:`jbig2`.
|
||||
|
||||
Shell completions
|
||||
=================
|
||||
|
||||
Completions for ``bash`` and ``fish`` are available in the project's
|
||||
``misc/completion`` folder. The ``bash`` completions are likely ``zsh``
|
||||
compatible but this has not been confirmed. Package maintainers, please
|
||||
install these at the appropriate locations for your system.
|
||||
|
||||
To manually install the ``bash`` completion, copy
|
||||
``misc/completion/ocrmypdf.bash`` to ``/etc/bash_completion.d/ocrmypdf``
|
||||
(rename the file).
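
The copy-and-rename step can be wrapped in a small function. This is a sketch only: ``install_bash_completion`` is a hypothetical helper, it assumes the source path is given relative to an OCRmyPDF checkout, and writing to ``/etc/bash_completion.d`` normally requires root privileges.

```bash
# install_bash_completion SRC [DEST_DIR]
# Copy the completion script SRC into DEST_DIR under the name "ocrmypdf".
install_bash_completion() {
    src="$1"
    dest_dir="${2:-/etc/bash_completion.d}"
    if [ ! -f "$src" ]; then
        echo "completion file not found: $src" >&2
        return 1
    fi
    mkdir -p "$dest_dir" && cp "$src" "$dest_dir/ocrmypdf"
}
```

For example, ``install_bash_completion misc/completion/ocrmypdf.bash`` run as root installs to the default location; pass a second argument to install somewhere else.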
|
||||
|
||||
To manually install the ``fish`` completion, copy
|
||||
``misc/completion/ocrmypdf.fish`` to
|
||||
``~/.config/fish/completions/ocrmypdf.fish``.
|
||||
|
||||
Note on 32-bit support
|
||||
======================
|
||||
|
||||
Many Python libraries no longer provide 32-bit binary wheels for Linux. This
|
||||
includes many of the libraries that OCRmyPDF depends on, such as
|
||||
Pillow. The easiest way to express this to end users is to say we don't
|
||||
support 32-bit Linux.
|
||||
|
||||
However, if your Linux distribution still supports 32-bit binaries, you
|
||||
can still install and use OCRmyPDF. A warning message will appear.
|
||||
In practice, OCRmyPDF may need more than 32-bit memory space to run when
|
||||
large documents are processed, so there are practical limitations to what
|
||||
users can accomplish with it. Still, for the common use case of a 32-bit
|
||||
ARM NAS or Raspberry Pi processing small documents, it should work.
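
To find out which case applies to a given machine, the word size of the userland can be queried. The sketch below assumes ``getconf LONG_BIT`` is available, as it is on most Linux systems; ``word_size`` is a hypothetical helper, and it prints ``unknown`` when the query fails.

```bash
# word_size: print 32 or 64 for the userland word size, or "unknown".
word_size() {
    getconf LONG_BIT 2>/dev/null || echo unknown
}

case "$(word_size)" in
    64) echo "64-bit userland: supported" ;;
    32) echo "32-bit userland: not supported, but small documents may still work" ;;
    *)  echo "could not determine the word size" ;;
esac
```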
|
||||
@@ -1,10 +1,14 @@
|
||||
.. SPDX-FileCopyrightText: 2022 James R. Barlow
|
||||
..
|
||||
.. SPDX-License-Identifier: CC-BY-SA-4.0
|
||||
---
|
||||
substitutions:
|
||||
image: |-
|
||||
```{image} images/bitmap_vs_svg.svg
|
||||
```
|
||||
---
|
||||
|
||||
============
|
||||
Introduction
|
||||
============
|
||||
% SPDX-FileCopyrightText: 2022 James R. Barlow
|
||||
% SPDX-License-Identifier: CC-BY-SA-4.0
|
||||
|
||||
# Introduction
|
||||
|
||||
OCRmyPDF is a Python application and library that adds text "layers" to images in
|
||||
PDFs, making scanned image PDFs searchable. It uses OCR to guess the text
|
||||
@@ -13,31 +17,30 @@ that enable customization of its processing steps, and it is highly tolerant
|
||||
of PDFs containing scanned images and "born digital" content that doesn't
|
||||
require text recognition.
|
||||
|
||||
About OCR
|
||||
=========
|
||||
## About OCR
|
||||
|
||||
`Optical character
|
||||
recognition <https://en.wikipedia.org/wiki/Optical_character_recognition>`__
|
||||
[Optical character
|
||||
recognition](https://en.wikipedia.org/wiki/Optical_character_recognition)
|
||||
is a technology that converts images of typed or handwritten text, such as
|
||||
in a scanned document, into computer text that can be selected, searched and copied.
|
||||
|
||||
OCRmyPDF uses
|
||||
`Tesseract <https://github.com/tesseract-ocr/tesseract>`__, a widely
|
||||
[Tesseract](https://github.com/tesseract-ocr/tesseract), a widely
|
||||
available open source OCR engine, to perform OCR.
|
||||
|
||||
.. _raster-vector:
|
||||
(raster-vector)=
|
||||
|
||||
About PDFs
|
||||
==========
|
||||
## About PDFs
|
||||
|
||||
PDFs are page description files that attempt to preserve a layout
|
||||
exactly. They contain `vector
|
||||
graphics <http://vector-conversions.com/vectorizing/raster_vs_vector.html>`__
|
||||
exactly. They contain [vector
|
||||
graphics](http://vector-conversions.com/vectorizing/raster_vs_vector.html)
|
||||
that can contain raster objects, such as scanned images. Because PDFs can
|
||||
contain multiple pages (unlike many image formats) and can contain fonts
|
||||
and text, they are a suitable format for exchanging scanned documents.
|
||||
|
||||
|image|
|
||||
:::{image} images/bitmap_vs_svg.svg
|
||||
:::
|
||||
|
||||
A PDF page may contain multiple images, even if it appears to have only
|
||||
one image. Some scanners or scanning software may segment pages into
|
||||
@@ -48,10 +51,9 @@ Rasterizing a PDF is the process of generating corresponding raster images.
|
||||
OCR engines like Tesseract work with images, not scalable vector graphics
|
||||
or mixed raster-vector-text graphics such as PDF.
|
||||
|
||||
About PDF/A
|
||||
===========
|
||||
## About PDF/A
|
||||
|
||||
`PDF/A <https://en.wikipedia.org/wiki/PDF/A>`__ is an ISO-standardized
|
||||
[PDF/A](https://en.wikipedia.org/wiki/PDF/A) is an ISO-standardized
|
||||
subset of the full PDF specification that is designed for archiving (the
|
||||
'A' stands for Archive). PDF/A differs from PDF primarily by omitting
|
||||
features that could complicate future file readability,
|
||||
@@ -63,8 +65,8 @@ of embedded content, it is likely more secure.
|
||||
There are various conformance levels and versions, such as "PDF/A-2b".
|
||||
|
||||
In general, the preferred format for scanned documents is PDF/A. Some
|
||||
governments and jurisdictions, US Courts in particular, `mandate the use
|
||||
of PDF/A <https://pdfblog.com/2012/02/13/what-is-pdfa/>`__ for scanned
|
||||
governments and jurisdictions, US Courts in particular, [mandate the use
|
||||
of PDF/A](https://pdfblog.com/2012/02/13/what-is-pdfa/) for scanned
|
||||
documents.
|
||||
|
||||
Since most individuals scanning documents aim for long-term readability,
|
||||
@@ -78,13 +80,12 @@ files can be digitally signed but may not be encrypted to ensure future
|
||||
readability. Fortunately, converting from PDF/A to a regular PDF is
|
||||
straightforward, and any PDF viewer can handle PDF/A files.
|
||||
|
||||
What OCRmyPDF does
|
||||
==================
|
||||
## What OCRmyPDF does
|
||||
|
||||
OCRmyPDF analyzes each page of a PDF to determine the required colorspace
|
||||
and resolution (DPI) for capturing all the information on that page without
|
||||
losing content. It uses
|
||||
`Ghostscript <http://ghostscript.com/>`__ to rasterize each page and subsequently
|
||||
[Ghostscript](http://ghostscript.com/) to rasterize each page and subsequently
|
||||
performs OCR on the rasterized image to generate an OCR "layer." This layer
|
||||
is then integrated back into the original PDF.
|
||||
|
||||
@@ -101,10 +102,9 @@ options are utilized, the OCR layer is integrated into the processed image.
|
||||
By default, OCRmyPDF generates archival PDFs in the PDF/A format, which is
|
||||
a more rigid subset of PDF features designed for long-term archives. If you
|
||||
prefer regular PDFs, you can disable this feature using the
|
||||
``--output-type pdf`` option.
|
||||
`--output-type pdf` option.
|
||||
|
||||
Why you shouldn't do this manually
|
||||
==================================
|
||||
## Why you shouldn't do this manually
|
||||
|
||||
A PDF is similar to an HTML file, in that it contains document structure
|
||||
along with images. While some PDFs may solely display a full-page image,
|
||||
@@ -142,55 +142,53 @@ like pikepdf and QPDF, it can auto-repair damaged PDFs. You don't need to
|
||||
understand the intricacies of these issues; you should be able to use
|
||||
OCRmyPDF with any PDF file, and expect reasonable results.
|
||||
|
||||
Limitations
|
||||
===========
|
||||
## Limitations
|
||||
|
||||
OCRmyPDF is subject to limitations imposed by the Tesseract OCR engine.
|
||||
These limitations are inherent to any software relying on Tesseract:
|
||||
|
||||
- The OCR accuracy may not match that of commercial OCR solutions.
|
||||
- It is incapable of recognizing handwriting.
|
||||
- It may detect gibberish and report it as OCR output.
|
||||
- Results may be subpar when a document contains languages not specified
|
||||
in the ``-l LANG`` argument.
|
||||
- Tesseract may struggle to analyze the natural reading order of documents.
|
||||
For instance, it might fail to recognize two columns in a document and
|
||||
attempt to join text across columns.
|
||||
- Poor quality scans can result in subpar OCR quality. In other words, the
|
||||
quality of the OCR output depends on the quality of the input.
|
||||
- Tesseract does not provide information about the font family to which text
|
||||
belongs.
|
||||
- Tesseract does not divide text into paragraphs or headings. It only provides
|
||||
the text and its bounding box. As such, the generated PDF does not
|
||||
contain any information about the document's structure.
|
||||
- The OCR accuracy may not match that of commercial OCR solutions.
|
||||
- It is incapable of recognizing handwriting.
|
||||
- It may detect gibberish and report it as OCR output.
|
||||
- Results may be subpar when a document contains languages not specified
|
||||
in the `-l LANG` argument.
|
||||
- Tesseract may struggle to analyze the natural reading order of documents.
|
||||
For instance, it might fail to recognize two columns in a document and
|
||||
attempt to join text across columns.
|
||||
- Poor quality scans can result in subpar OCR quality. In other words, the
|
||||
quality of the OCR output depends on the quality of the input.
|
||||
- Tesseract does not provide information about the font family to which text
|
||||
belongs.
|
||||
- Tesseract does not divide text into paragraphs or headings. It only provides
|
||||
the text and its bounding box. As such, the generated PDF does not
|
||||
contain any information about the document's structure.
|
||||
|
||||
Ghostscript also imposes some limitations:
|
||||
|
||||
- PDFs containing JPEG 2000-encoded content may be converted to JPEG
|
||||
encoding, which may introduce compression artifacts, if Ghostscript
|
||||
PDF/A is enabled.
|
||||
- Ghostscript may transcode grayscale and color images, potentially
|
||||
lossily, based on an internal algorithm. This
|
||||
behavior can be suppressed by setting ``--pdfa-image-compression`` to
|
||||
``jpeg`` or ``lossless`` to set all images to one type or the other.
|
||||
Ghostscript lacks an option to maintain the input image's format.
|
||||
(Modern Ghostscript can copy JPEG images without transcoding them.)
|
||||
- Ghostscript's PDF/A conversion removes any XMP metadata that is not
|
||||
one of the standard XMP metadata namespaces for PDFs. In particular,
|
||||
PRISM Metadata is removed.
|
||||
- Ghostscript's PDF/A conversion may remove or deactivate
|
||||
hyperlinks and other active content.
|
||||
- PDFs containing JPEG 2000-encoded content may be converted to JPEG
|
||||
encoding, which may introduce compression artifacts, if Ghostscript
|
||||
PDF/A is enabled.
|
||||
- Ghostscript may transcode grayscale and color images, potentially
|
||||
lossily, based on an internal algorithm. This
|
||||
behavior can be suppressed by setting `--pdfa-image-compression` to
|
||||
`jpeg` or `lossless` to set all images to one type or the other.
|
||||
Ghostscript lacks an option to maintain the input image's format.
|
||||
(Modern Ghostscript can copy JPEG images without transcoding them.)
|
||||
- Ghostscript's PDF/A conversion removes any XMP metadata that is not
|
||||
one of the standard XMP metadata namespaces for PDFs. In particular,
|
||||
PRISM Metadata is removed.
|
||||
- Ghostscript's PDF/A conversion may remove or deactivate
|
||||
hyperlinks and other active content.
|
||||
|
||||
You can use ``--output-type pdf`` to disable PDF/A conversion and produce
|
||||
You can use `--output-type pdf` to disable PDF/A conversion and produce
|
||||
a standard, non-archival PDF.
|
||||
|
||||
Regarding OCRmyPDF itself:
|
||||
|
||||
- PDFs using transparency are not currently represented in the test
|
||||
suite
|
||||
- PDFs using transparency are not currently represented in the test
|
||||
suite
|
||||
|
||||
Similar programs
|
||||
================
|
||||
## Similar programs
|
||||
|
||||
To the author's knowledge, OCRmyPDF is the most feature-rich and
|
||||
thoroughly tested command line OCR PDF conversion tool. If it does not
|
||||
@@ -199,8 +197,7 @@ meet your needs, contributions and suggestions are welcome.
|
||||
Ghostscript recently added three "pdfocr" output devices. They work by
|
||||
rasterizing all content and converting all pages to a single colour space.
|
||||
|
||||
Web front-ends
|
||||
==============
|
||||
## Web front-ends
|
||||
|
||||
The Docker image of OCRmyPDF provides a web service front-end
|
||||
that allows files to be submitted over HTTP and the results to be downloaded.
|
||||
@@ -210,16 +207,14 @@ public internet and does not provide any security measures.
|
||||
|
||||
In addition, the following third-party integrations are available:
|
||||
|
||||
- `Paperless-ngx <https://docs.paperless-ngx.com/>`__ is a free software
|
||||
document management system that uses OCRmyPDF to perform OCR on
|
||||
uploaded documents.
|
||||
- `Nextcloud OCR <https://github.com/janis91/ocr>`__ is a free software
|
||||
plugin for the Nextcloud private cloud software.
|
||||
- [Paperless-ngx](https://docs.paperless-ngx.com/) is a free software
|
||||
document management system that uses OCRmyPDF to perform OCR on
|
||||
uploaded documents.
|
||||
- [Nextcloud OCR](https://github.com/janis91/ocr) is a free software
|
||||
plugin for the Nextcloud private cloud software.
|
||||
|
||||
OCRmyPDF is not designed to be secure against malware-bearing PDFs (see
|
||||
`Using OCRmyPDF online <ocr-service>`__). Users should ensure they
|
||||
[Using OCRmyPDF online](ocr-service)). Users should ensure they
|
||||
comply with OCRmyPDF's licenses and the licenses of all dependencies. In
|
||||
particular, OCRmyPDF requires Ghostscript, which is licensed under
|
||||
AGPLv3.
|
||||
|
||||
.. |image| image:: images/bitmap_vs_svg.svg
|
||||
129
docs/languages.md
Normal file
@@ -0,0 +1,129 @@
|
||||
% SPDX-FileCopyrightText: 2022 James R. Barlow
|
||||
% SPDX-License-Identifier: CC-BY-SA-4.0
|
||||
|
||||
(lang-packs)=
|
||||
|
||||
# Installing additional language packs
|
||||
|
||||
OCRmyPDF uses Tesseract for OCR, and relies on its language packs for all languages.
|
||||
On most platforms, English is installed with Tesseract by default, but not always.
|
||||
|
||||
Tesseract supports [most
|
||||
languages](https://github.com/tesseract-ocr/tesseract/blob/main/doc/tesseract.1.asc#languages).
|
||||
Languages are identified by standardized three-letter codes (called ISO 639-2 Alpha-3).
|
||||
Tesseract's documentation also lists the three-letter code for your language.
|
||||
Some are anglicized, e.g. Spanish is `spa` rather than `esp`, while others
|
||||
are not, e.g. German is `deu` and French is `fra`.
|
||||
|
||||
Language packs (strictly speaking, Tesseract "traineddata" files) generally correspond
|
||||
to the language in question, but different language packs are used in certain
|
||||
situations. For German, the "Fraktur" language pack can assist with reading older
|
||||
materials in the Fraktur typeface family (`deu_frak`). Some communities have changed
|
||||
their script from Cyrillic to Latin; the Cyrillic version of Uzbek is available
|
||||
as `uzb_cyrl` and the Latin version is `uzb`.
|
||||
|
||||
After you have installed a language pack, you can use it with `ocrmypdf -l <language>`,
|
||||
for example `ocrmypdf -l spa`. For multilingual documents, you can specify
|
||||
all languages to be expected, e.g. `ocrmypdf -l eng+fra` for English and French.
|
||||
English is assumed by default unless other language(s) are specified.
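
The `-l` syntax described above can be sketched in Python. This is a minimal illustration, assuming a hypothetical helper named `language_argument` that is not part of OCRmyPDF; it only assembles the argument list and does not check whether the language packs are installed.

```python
def language_argument(codes):
    """Build the `-l` argument for ocrmypdf from Tesseract language codes.

    Hypothetical helper for illustration; not part of OCRmyPDF.
    """
    if not codes:
        # No -l argument at all: English is assumed by default.
        return []
    # Multiple languages are joined with "+", for example eng+fra.
    return ["-l", "+".join(codes)]


command = ["ocrmypdf", *language_argument(["eng", "fra"]), "input.pdf", "output.pdf"]
print(" ".join(command))
```

The resulting list could be handed to `subprocess.run` if OCRmyPDF is installed; the file names are placeholders.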
|
||||
|
||||
For Linux users, you can often find packages that provide language
|
||||
packs.
|
||||
|
||||
## Platform install steps
|
||||
|
||||
### Debian and Ubuntu (apt)
|
||||
|
||||
```bash
# Display a list of all Tesseract language packs
apt-cache search tesseract-ocr

# Install Chinese Simplified language pack
apt-get install tesseract-ocr-chi-sim
```
|
||||
|
||||
You can then pass the `-l LANG` argument to OCRmyPDF to give a hint as
|
||||
to what languages it should search for. Multiple languages can be
|
||||
requested using either `-l eng+fra` (English and French) or
|
||||
`-l eng -l fra`.
|
||||
|
||||
### Fedora
|
||||
|
||||
```bash
# Display a list of all Tesseract language packs
dnf search tesseract

# Install Chinese Simplified language pack
dnf install tesseract-langpack-chi_sim
```
|
||||
|
||||
You can then pass the `-l LANG` argument to OCRmyPDF to give a hint as
|
||||
to what languages it should search for. Multiple languages can be
|
||||
requested using either `-l eng+fra` (English and French) or
|
||||
`-l eng -l fra`.
|
||||
|
||||
### Arch Linux
|
||||
|
||||
```bash
# Display a list of all Tesseract language packs
pacman -Ss tesseract-data

# Install German language pack
pacman -S tesseract-data-deu
```
|
||||
|
||||
You can then pass the `-l LANG` argument to OCRmyPDF to give a hint as
|
||||
to what languages it should search for. Multiple languages can be
|
||||
requested using either `-l eng+fra` (English and French) or
|
||||
`-l eng -l fra`.
|
||||
|
||||
### Gentoo
|
||||
|
||||
On Gentoo the package `app-text/tessdata_fast`, which `app-text/tesseract` depends on, handles Tesseract languages.
|
||||
It accepts USE flags to select which languages should be installed; these can be set in `/etc/portage/package.use`.
|
||||
Alternatively one can globally set the [L10N use extension](https://wiki.gentoo.org/wiki/Localization/Guide#L10N) in `/etc/portage/make.conf`.
|
||||
This enables these languages for all packages (e.g. including aspell).
|
||||
|
||||
```bash
# Display a list of all Tesseract language packs
equery uses app-text/tessdata_fast

# Add English and German language support for Tesseract only
echo 'app-text/tessdata_fast l10n_de l10n_en' >> /etc/portage/package.use

# Add global English and German language support (the `l10n_` from equery has to be omitted)
# The outer single quotes keep the inner double quotes in the written line
echo 'L10N="de en"' >> /etc/portage/make.conf

# Update system to reflect changed USE flags
emerge --update --deep --newuse @world
```
|
||||
|
||||
You can then pass the `-l LANG` argument to OCRmyPDF to give a hint as
|
||||
to what languages it should search for. Multiple languages can be
|
||||
requested using either `-l eng+fra` (English and French) or
|
||||
`-l eng -l fra`.
|
||||
|
||||
### macOS
|
||||
|
||||
You can install additional language packs by
|
||||
{ref}`installing Tesseract using Homebrew with all language packs <macos-all-languages>`.
|
||||
|
||||
### Docker
|
||||
|
||||
Users of the OCRmyPDF Docker image should install language packs into a
|
||||
derived Docker image as
|
||||
{ref}`described in that section <docker-lang-packs>`.
|
||||
|
||||
### Windows
|
||||
|
||||
The Tesseract installer provided by Chocolatey currently includes only the English language pack.
|
||||
To install other languages, download the respective language pack (`.traineddata` file)
|
||||
from <https://github.com/tesseract-ocr/tessdata/> and place it in
|
||||
`C:\Program Files\Tesseract-OCR\tessdata` (or wherever Tesseract OCR is installed).
|
||||
|
||||
## Custom language packs
|
||||
|
||||
If you have fine-tuned or trained Tesseract and generated custom trained data, you can
|
||||
copy your `customlang.traineddata` file into your Tesseract "tessdata" folder, and
|
||||
then use the `-l customlang` argument to tell OCRmyPDF to pass that language on to
|
||||
Tesseract.
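
The copy step can be sketched in Python as follows. This is an illustration under stated assumptions: `install_traineddata` is a hypothetical helper, and temporary folders stand in for the real "tessdata" location, which varies by platform.

```python
import shutil
import tempfile
from pathlib import Path


def install_traineddata(traineddata_file, tessdata_dir):
    """Copy a .traineddata file into a tessdata folder and return the language name.

    Hypothetical helper for illustration; not part of OCRmyPDF or Tesseract.
    """
    source = Path(traineddata_file)
    if source.suffix != ".traineddata":
        raise ValueError(f"not a traineddata file: {source.name}")
    shutil.copyfile(source, Path(tessdata_dir) / source.name)
    # The name passed to -l is the file name without its suffix.
    return source.stem


# Demonstration with temporary folders standing in for real paths.
with tempfile.TemporaryDirectory() as tmp:
    tmp_path = Path(tmp)
    tessdata = tmp_path / "tessdata"
    tessdata.mkdir()
    custom = tmp_path / "customlang.traineddata"
    custom.write_bytes(b"placeholder, not real trained data")
    language = install_traineddata(custom, tessdata)
    installed = (tessdata / "customlang.traineddata").exists()
    print(language, installed)
```

The returned name (`customlang` here) is what would follow `-l` on the command line.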
|
||||
@@ -1,141 +0,0 @@
|
||||
.. SPDX-FileCopyrightText: 2022 James R. Barlow
|
||||
..
|
||||
.. SPDX-License-Identifier: CC-BY-SA-4.0
|
||||
|
||||
.. _lang-packs:
|
||||
|
||||
====================================
|
||||
Installing additional language packs
|
||||
====================================
|
||||
|
||||
OCRmyPDF uses Tesseract for OCR, and relies on its language packs for all languages.
|
||||
On most platforms, English is installed with Tesseract by default, but not always.
|
||||
|
||||
Tesseract supports `most
|
||||
languages <https://github.com/tesseract-ocr/tesseract/blob/main/doc/tesseract.1.asc#languages>`__.
|
||||
Languages are identified by standardized three-letter codes (called ISO 639-2 Alpha-3).
|
||||
Tesseract's documentation also lists the three-letter code for your language.
|
||||
Some are anglicized, e.g. Spanish is ``spa`` rather than ``esp``, while others
|
||||
are not, e.g. German is ``deu`` and French is ``fra``.
|
||||
|
||||
Language packs (strictly speaking, Tesseract "traineddata" files) generally correspond
|
||||
to the language in question, but different language packs are used in certain
|
||||
situations. For German, the "Fraktur" language pack can assist with reading older
|
||||
materials in the Fraktur typeface family (``deu_frak``). Some communities have changed
|
||||
their script from Cyrillic to Latin; the Cyrillic version of Uzbek is available
|
||||
as ``uzb_cyrl`` and the Latin version is ``uzb``.
|
||||
|
||||
After you have installed a language pack, you can use it with ``ocrmypdf -l <language>``,
|
||||
for example ``ocrmypdf -l spa``. For multilingual documents, you can specify
|
||||
all languages to be expected, e.g. ``ocrmypdf -l eng+fra`` for English and French.
|
||||
English is assumed by default unless other language(s) are specified.
|
||||
|
||||
For Linux users, you can often find packages that provide language
|
||||
packs.
|
||||
|
||||
Platform install steps
|
||||
======================
|
||||
|
||||
Debian and Ubuntu (apt)
|
||||
-----------------------
|
||||
|
||||
.. code-block:: bash
|
||||
|
||||
# Display a list of all Tesseract language packs
|
||||
apt-cache search tesseract-ocr
|
||||
|
||||
# Install Chinese Simplified language pack
|
||||
apt-get install tesseract-ocr-chi-sim
|
||||
|
||||
You can then pass the ``-l LANG`` argument to OCRmyPDF to give a hint as
|
||||
to what languages it should search for. Multiple languages can be
|
||||
requested using either ``-l eng+fra`` (English and French) or
|
||||
``-l eng -l fra``.
|
||||
|
||||
Fedora
|
||||
------
|
||||
|
||||
.. code-block:: bash
|
||||
|
||||
# Display a list of all Tesseract language packs
|
||||
dnf search tesseract
|
||||
|
||||
# Install Chinese Simplified language pack
|
||||
dnf install tesseract-langpack-chi_sim
|
||||
|
||||
You can then pass the ``-l LANG`` argument to OCRmyPDF to give a hint as
|
||||
to what languages it should search for. Multiple languages can be
|
||||
requested using either ``-l eng+fra`` (English and French) or
|
||||
``-l eng -l fra``.
|
||||
|
||||
Arch Linux
|
||||
----------
|
||||
|
||||
.. code-block:: bash
|
||||
|
||||
# Display a list of all Tesseract language packs
|
||||
pacman -Ss tesseract-data
|
||||
|
||||
# Install German language pack
|
||||
pacman -S tesseract-data-deu
|
||||
|
||||
You can then pass the ``-l LANG`` argument to OCRmyPDF to give a hint as
|
||||
to what languages it should search for. Multiple languages can be
|
||||
requested using either ``-l eng+fra`` (English and French) or
|
||||
``-l eng -l fra``.
|
||||
|
||||
Gentoo
|
||||
------
|
||||
|
||||
On Gentoo the package ``app-text/tessdata_fast``, which ``app-text/tesseract`` depends on, handles Tesseract languages.
|
||||
It accepts USE flags to select what languages should be installed, these can be set in ``/etc/portage/package.use``.
|
||||
Alternatively one can globally set the `L10N use extension <https://wiki.gentoo.org/wiki/Localization/Guide#L10N>`__ in ``/etc/portage/make.conf``.
|
||||
This enables these languages for all packages (e.g. including aspell).
|
||||
|
||||
.. code-block:: bash
|
||||
|
||||
# Display a list of all Tesseract language packs
|
||||
equery uses app-text/tessdata_fast
|
||||
|
||||
# Add English and German language support for Tesseract only
|
||||
echo 'app-text/tessdata_fast l10n_de l10n_en' >> /etc/portage/package.use
|
||||
|
||||
# Add global English and German language support (the `l10n_` from equery has to be omitted)
|
||||
echo L10N="de en" >> /etc/portage/make.conf
|
||||
|
||||
# update system to reflect changed USE flags
|
||||
emerge --update --deep --newuse @world
|
||||
|
||||
You can then pass the ``-l LANG`` argument to OCRmyPDF to give a hint as
|
||||
to what languages it should search for. Multiple languages can be
|
||||
requested using either ``-l eng+fra`` (English and French) or
|
||||
``-l eng -l fra``.
|
||||
|
||||
macOS
|
||||
-----
|
||||
|
||||
You can install additional language packs by
|
||||
:ref:`installing Tesseract using Homebrew with all language packs <macos-all-languages>`.
|
||||
|
||||
Docker
|
||||
------
|
||||
|
||||
Users of the OCRmyPDF Docker image should install language packs into a
|
||||
derived Docker image as
|
||||
:ref:`described in that section <docker-lang-packs>`.
|
||||
|
||||
Windows
|
||||
-------
|
||||
|
||||
The Tesseract installer provided by Chocolatey currently includes only English language.
|
||||
To install other languages, download the respective language pack (``.traineddata`` file)
|
||||
from https://github.com/tesseract-ocr/tessdata/ and place it in
|
||||
``C:\\Program Files\\Tesseract-OCR\\tessdata`` (or wherever Tesseract OCR is installed).
|
||||
|
||||
Custom language packs
|
||||
=====================
|
||||
|
||||
If you have fine-tuned or trained Tesseract and generated custom trained data, you can
|
||||
copy your ``customlang.traineddata`` file into your Tesseract "tessdata" folder, and
|
||||
then use the ``-l customlang`` argument to tell OCRmyPDF to pass that language on to
|
||||
Tesseract.
|
||||
24
docs/performance.md
Normal file
@@ -0,0 +1,24 @@
|
||||
% SPDX-FileCopyrightText: 2022 James R. Barlow
|
||||
% SPDX-License-Identifier: CC-BY-SA-4.0
|
||||
|
||||
# Performance
|
||||
|
||||
Some users have noticed that current versions of OCRmyPDF do not run as
|
||||
quickly as some older versions (specifically 6.x and older). This is
|
||||
because OCRmyPDF added image optimization as a postprocessing step, and
|
||||
it is enabled by default.
|
||||
|
||||
## Speed
|
||||
|
||||
If running OCRmyPDF quickly is your main goal, you can use settings such
|
||||
as:
|
||||
|
||||
- `--optimize 0` to disable file size optimization
|
||||
- `--output-type pdf` to disable PDF/A generation
|
||||
- `--fast-web-view 999999` to disable fast web view optimization
|
||||
- `--skip-big` to skip large images, if some pages have large images
|
||||
|
||||
You can also avoid:
|
||||
|
||||
- `--force-ocr`
|
||||
- Image preprocessing
|
||||
@@ -1,26 +0,0 @@
|
||||
.. SPDX-FileCopyrightText: 2022 James R. Barlow
|
||||
..
|
||||
.. SPDX-License-Identifier: CC-BY-SA-4.0
|
||||
|
||||
===========
|
||||
Performance
|
||||
===========
|
||||
|
||||
Some users have noticed that current versions of OCRmyPDF do not run as quickly
|
||||
as some older versions (specifically 6.x and older). This is because OCRmyPDF
|
||||
added image optimization as a postprocessing step, and it is enabled by default.
|
||||
|
||||
Speed
|
||||
=====
|
||||
|
||||
If running OCRmyPDF quickly is your main goal, you can use settings such as:
|
||||
|
||||
* ``--optimize 0`` to disable file size optimization
|
||||
* ``--output-type pdf`` to disable PDF/A generation
|
||||
* ``--fast-web-view 999999`` to disable fast web view optimization
|
||||
* ``--skip-big`` to skip large images, if some pages have large images
|
||||
|
||||
You can also avoid:
|
||||
|
||||
* ``--force-ocr``
|
||||
* Image preprocessing
|
||||
@@ -1,15 +1,12 @@
|
||||
.. SPDX-FileCopyrightText: 2022 James R. Barlow
|
||||
..
|
||||
.. SPDX-License-Identifier: CC-BY-SA-4.0
|
||||
% SPDX-FileCopyrightText: 2022 James R. Barlow
|
||||
% SPDX-License-Identifier: CC-BY-SA-4.0
|
||||
|
||||
=======
|
||||
Plugins
|
||||
=======
|
||||
# Plugins
|
||||
|
||||
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL
|
||||
NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and
|
||||
"OPTIONAL" in this document are to be interpreted as described in
|
||||
RFC 2119.
|
||||
> The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL
|
||||
> NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and
|
||||
> "OPTIONAL" in this document are to be interpreted as described in
|
||||
> RFC 2119.
|
||||
|
||||
You can use plugins to customize the behavior of OCRmyPDF at certain points of
|
||||
interest.
|
||||
@@ -24,75 +21,71 @@ Currently, it is possible to:
|
||||
- replace Ghostscript with another PDF to image converter (rasterizer) or
|
||||
PDF/A generator
|
||||
|
||||
OCRmyPDF plugins are based on the Python ``pluggy`` package and conform to its
|
||||
OCRmyPDF plugins are based on the Python `pluggy` package and conform to its
|
||||
conventions. Note that plugins installed as setuptools entrypoints are
|
||||
not checked currently, because OCRmyPDF assumes you may not want to enable
|
||||
plugins for all files.
|
||||
|
||||
See [OCRmyPDF-EasyOCR](https://github.com/ocrmypdf/OCRmyPDF-EasyOCR) for an
|
||||
See [OCRmyPDF-EasyOCR](https://github.com/ocrmypdf/OCRmyPDF-EasyOCR) for an
|
||||
example of a straightforward, fully working plugin.
|
||||
|
||||
Script plugins
|
||||
==============
|
||||
## Script plugins
|
||||
|
||||
Script plugins may be called from the command line, by specifying the name of a file.
|
||||
Script plugins may be convenient for informal or "one-off" plugins, when a certain
|
||||
batch of files needs a special processing step for example.
|
||||
|
||||
.. code-block:: bash
|
||||
```bash
ocrmypdf --plugin ocrmypdf_example_plugin.py input.pdf output.pdf
```
|
||||
|
||||
ocrmypdf --plugin ocrmypdf_example_plugin.py input.pdf output.pdf
|
||||
Multiple plugins may be installed by issuing the `--plugin` argument multiple times.
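
Repeating the `--plugin` argument can be sketched in Python. The helper name `plugin_arguments` is hypothetical; the two plugin names are the ones used as examples in this document.

```python
def plugin_arguments(plugins):
    """Repeat --plugin once for each script path or module name.

    Hypothetical helper for illustration; not part of OCRmyPDF.
    """
    args = []
    for plugin in plugins:
        args += ["--plugin", plugin]
    return args


command = [
    "ocrmypdf",
    *plugin_arguments([
        "ocrmypdf_example_plugin.py",            # script plugin, given as a file name
        "ocrmypdf_fancypants.pockets.contents",  # packaged plugin, given as a module
    ]),
    "input.pdf",
    "output.pdf",
]
print(" ".join(command))
```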
|
||||
|
||||
Multiple plugins may be installed by issuing the ``--plugin`` argument multiple times.
|
||||
|
||||
Packaged plugins
|
||||
================
|
||||
## Packaged plugins
|
||||
|
||||
Packaged plugins may be installed into the same virtual environment as OCRmyPDF.
They may be invoked using standard Python module naming.
|
||||
If you are intending to distribute a plugin, please package it.
|
||||
|
||||
.. code-block:: bash
|
||||
|
||||
ocrmypdf --plugin ocrmypdf_fancypants.pockets.contents input.pdf output.pdf
|
||||
```bash
ocrmypdf --plugin ocrmypdf_fancypants.pockets.contents input.pdf output.pdf
```
|
||||
|
||||
OCRmyPDF does not automatically import plugins, because the assumption is that
|
||||
plugins affect different files differently and you may not want them activated
|
||||
all the time. The command line or ``ocrmypdf.ocr(plugin='...')`` must call
|
||||
all the time. The command line or `ocrmypdf.ocr(plugin='...')` must call
|
||||
for them.
|
||||
|
||||
Third parties that wish to distribute packages for ocrmypdf should package them
|
||||
as packaged plugins, and these modules should begin with the name ``ocrmypdf_``
|
||||
similar to ``pytest`` packages such as ``pytest-cov`` (the package) and
|
||||
``pytest_cov`` (the module).
|
||||
as packaged plugins, and these modules should begin with the name `ocrmypdf_`
|
||||
similar to `pytest` packages such as `pytest-cov` (the package) and
|
||||
`pytest_cov` (the module).
|
||||
|
||||
.. note::
|
||||
:::{note}
|
||||
We recommend plugin authors name their plugins with the prefix
|
||||
`ocrmypdf-` (for the package name on PyPI) and `ocrmypdf_` (for the
|
||||
module), just like pytest plugins. At the same time, please make it clear
|
||||
that your package is not official.
|
||||
:::
|
||||
|
||||
We recommend plugin authors name their plugins with the prefix
|
||||
``ocrmypdf-`` (for the package name on PyPI) and ``ocrmypdf_`` (for the
|
||||
module), just like pytest plugins. At the same time, please make it clear
|
||||
that your package is not official.
|
||||
|
||||
Plugins
|
||||
=======
|
||||
## Plugins
|
||||
|
||||
You can also create a plugin that OCRmyPDF will always automatically load if both are
|
||||
installed in the same virtual environment, using a project entrypoint.
|
||||
OCRmyPDF uses the entrypoint namespace "ocrmypdf".
|
||||
|
||||
For example, ``pyproject.toml`` would need to contain the following, for a plugin named
|
||||
``ocrmypdf-exampleplugin``:
|
||||
For example, `pyproject.toml` would need to contain the following, for a plugin named
|
||||
`ocrmypdf-exampleplugin`:
|
||||
|
||||
.. code-block:: toml
|
||||
```toml
|
||||
[project]
|
||||
name = "ocrmypdf-exampleplugin"
|
||||
|
||||
[project]
|
||||
name = "ocrmypdf-exampleplugin"
|
||||
[project.entry-points."ocrmypdf"]
|
||||
exampleplugin = "exampleplugin.pluginmodule"
|
||||
```
|
||||
|
||||
[project.entry-points."ocrmypdf"]
|
||||
exampleplugin = "exampleplugin.pluginmodule"
|
||||
|
||||
Plugin requirements
|
||||
===================
|
||||
## Plugin requirements
|
||||
|
||||
OCRmyPDF generally uses multiple worker processes. When a new worker is started,
|
||||
Python will import all plugins again, including all plugins that were imported earlier.
|
||||
@@ -103,14 +96,14 @@ to obtain a reference to shared state prepared by another hook implementation.
|
||||
Plugins must expect that other instances of the plugin will be running
|
||||
simultaneously.
|
||||
|
||||
The ``context`` object that is passed to many hooks can be used to share information
|
||||
The `context` object that is passed to many hooks can be used to share information
|
||||
about a file being worked on. Plugins must write private, plugin-specific data to
|
||||
a subfolder named ``{options.work_folder}/ocrmypdf-plugin-name``. Plugins MAY
|
||||
read and write files in ``options.work_folder``, but should be aware that their
|
||||
a subfolder named `{options.work_folder}/ocrmypdf-plugin-name`. Plugins MAY
|
||||
read and write files in `options.work_folder`, but should be aware that their
|
||||
semantics are subject to change.
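
A plugin might create its private subfolder as sketched below. The helper name and the plugin name `exampleplugin` are hypothetical, and a temporary directory stands in for `options.work_folder`. Because several workers may run at once, the folder is created with `exist_ok=True`.

```python
import tempfile
from pathlib import Path


def plugin_private_folder(work_folder, plugin_name):
    """Create (if needed) and return the plugin's private subfolder.

    Hypothetical helper for illustration; not part of OCRmyPDF.
    """
    folder = Path(work_folder) / f"ocrmypdf-{plugin_name}"
    # Another worker may already have created it, so tolerate that.
    folder.mkdir(parents=True, exist_ok=True)
    return folder


# A temporary directory stands in for options.work_folder.
with tempfile.TemporaryDirectory() as work_folder:
    first = plugin_private_folder(work_folder, "exampleplugin")
    second = plugin_private_folder(work_folder, "exampleplugin")
    folder_name = first.name
    created = first.is_dir()
    same_folder = first == second
    print(folder_name, created, same_folder)
```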
|
||||
|
||||
OCRmyPDF will delete ``options.work_folder`` when it has finished OCRing
|
||||
a file, unless invoked with ``--keep-temporary-files``.
|
||||
OCRmyPDF will delete `options.work_folder` when it has finished OCRing
|
||||
a file, unless invoked with `--keep-temporary-files`.
|
||||
|
||||
The documentation for some plugin hooks contains a detailed description of the
|
||||
execution context in which they will be called.
|
||||
@@ -119,114 +112,139 @@ Plugins should be prepared to work whether executed in worker threads or worker
|
||||
processes. Generally, OCRmyPDF uses processes, but has a semi-hidden threaded
|
||||
argument that simplifies debugging.
|
||||
|
||||
|
||||
Plugin hooks
|
||||
============
|
||||
## Plugin hooks
|
||||
|
||||
A plugin may provide the following hooks. Hooks must be decorated with
|
||||
``ocrmypdf.hookimpl``, for example:
|
||||
`ocrmypdf.hookimpl`, for example:
|
||||
|
||||
.. code-block:: python
|
||||
```python
|
||||
from ocrmypdf import hookimpl
|
||||
|
||||
from ocrmypdf import hookimpl
|
||||
|
||||
@hookimpl
|
||||
def add_options(parser):
|
||||
pass
|
||||
@hookimpl
def add_options(parser):
    pass
|
||||
```
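
As a sketch of what the body of an `add_options` hook might do, the example below registers one option, assuming the `parser` argument behaves like a standard `argparse.ArgumentParser`. The option name `--example-threshold` is hypothetical, and the `@hookimpl` decorator is left out so that the snippet runs without OCRmyPDF installed; a real plugin would need it.

```python
import argparse


def add_options(parser):
    """Register a plugin-specific command line option.

    In a real plugin this function would be decorated with @hookimpl;
    the decorator is omitted so the sketch runs without OCRmyPDF.
    """
    group = parser.add_argument_group("Example plugin")
    # --example-threshold is a hypothetical option name.
    group.add_argument("--example-threshold", type=int, default=10)


# A plain ArgumentParser stands in for the parser OCRmyPDF would pass.
parser = argparse.ArgumentParser(prog="ocrmypdf")
add_options(parser)
options = parser.parse_args(["--example-threshold", "25"])
print(options.example_threshold)
```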
|
||||
|
||||
The following is a complete list of hooks that are available, and when
|
||||
they are called.
|
||||
|
||||
.. _firstresult:
|
||||
(firstresult)=
|
||||
|
||||
**Note on firstresult hooks**
|
||||
|
||||
If multiple plugins install implementations for this hook, they will be called in
|
||||
the reverse of the order in which they are installed (i.e., last plugin wins).
|
||||
When each hook implementation is called in order, the first implementation that
|
||||
returns a value other than ``None`` will "win" and prevent execution of all other
|
||||
returns a value other than `None` will "win" and prevent execution of all other
|
||||
hooks. As such, you cannot "chain" a series of plugin filters together in this
|
||||
way. Instead, a single hook implementation should be responsible for any such
|
||||
chaining operations.
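
The dispatch rule described in this note can be simulated in a few lines of plain Python. This is a simplified model for illustration, not `pluggy`'s implementation; all function names are hypothetical, and images are represented as lists of pixel values.

```python
def call_firstresult(implementations, *args):
    """Simulate firstresult dispatch.

    `implementations` is in installation order; the most recently
    installed one is tried first, and the first non-None result wins.
    """
    for implementation in reversed(implementations):
        result = implementation(*args)
        if result is not None:
            return result
    return None


def declines(image):
    # Returning None lets the next implementation try.
    return None


def invert(image):
    return [255 - pixel for pixel in image]


def brighten_then_invert(image):
    # Chaining has to happen inside a single implementation.
    brightened = [min(255, pixel + 10) for pixel in image]
    return invert(brightened)


print(call_firstresult([invert, declines], [0, 100]))
print(call_firstresult([invert, brighten_then_invert], [0, 100]))
```

In the second call only `brighten_then_invert` runs, because it was installed last and returns a value; `invert` is never called as a separate hook.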
|
||||
|
||||
Examples
|
||||
========
|
||||
## Examples
|
||||
|
||||
* OCRmyPDF's test suite contains several plugins that are used to simulate certain
|
||||
- OCRmyPDF's test suite contains several plugins that are used to simulate certain
|
||||
test conditions.
|
||||
* `ocrmypdf-papermerge <https://github.com/papermerge/OCRmyPDF_papermerge>`_ is
|
||||
- [ocrmypdf-papermerge](https://github.com/papermerge/OCRmyPDF_papermerge) is
|
||||
a production plugin that integrates OCRmyPDF and the Papermerge document
|
||||
management system.
|
||||
|
||||
### Suppressing or overriding other plugins
|
||||
|
||||
Suppressing or overriding other plugins
|
||||
---------------------------------------
|
||||
|
||||
```{eval-rst}
|
||||
.. autofunction:: ocrmypdf.pluginspec.initialize
|
||||
```
|
||||
|
||||
Custom command line arguments
|
||||
-----------------------------
|
||||
### Custom command line arguments
|
||||
|
||||
```{eval-rst}
|
||||
.. autofunction:: ocrmypdf.pluginspec.add_options
|
||||
```
|
||||
|
||||
```{eval-rst}
|
||||
.. autofunction:: ocrmypdf.pluginspec.check_options
|
||||
```
|
||||
|
||||
Execution and progress reporting
|
||||
--------------------------------
|
||||
### Execution and progress reporting
|
||||
|
||||
```{eval-rst}
|
||||
.. autoclass:: ocrmypdf.pluginspec.ProgressBar
|
||||
:members:
|
||||
:special-members: __init__, __enter__, __exit__
|
||||
```
|
||||
|
||||
```{eval-rst}
|
||||
.. autoclass:: ocrmypdf.pluginspec.Executor
|
||||
:members:
|
||||
:special-members: __call__
|
||||
```
|
||||
|
||||
```{eval-rst}
|
||||
.. autofunction:: ocrmypdf.pluginspec.get_logging_console
|
||||
```
|
||||
|
||||
```{eval-rst}
|
||||
.. autofunction:: ocrmypdf.pluginspec.get_executor
|
||||
```
|
||||
|
||||
```{eval-rst}
|
||||
.. autofunction:: ocrmypdf.pluginspec.get_progressbar_class
|
||||
```
|
||||
|
||||
Applying special behavior before processing
|
||||
-------------------------------------------
|
||||
### Applying special behavior before processing
|
||||
|
||||
```{eval-rst}
|
||||
.. autofunction:: ocrmypdf.pluginspec.validate
|
||||
```
|
||||
|
||||
PDF page to image
|
||||
-----------------
|
||||
### PDF page to image
|
||||
|
||||
```{eval-rst}
|
||||
.. autofunction:: ocrmypdf.pluginspec.rasterize_pdf_page
|
||||
```
|
||||
|
||||
Modifying intermediate images
|
||||
-----------------------------
|
||||
### Modifying intermediate images
|
||||
|
||||
```{eval-rst}
|
||||
.. autofunction:: ocrmypdf.pluginspec.filter_ocr_image
|
||||
```
|
||||
|
||||
```{eval-rst}
|
||||
.. autofunction:: ocrmypdf.pluginspec.filter_page_image
|
||||
```
|
||||
|
||||
```{eval-rst}
|
||||
.. autofunction:: ocrmypdf.pluginspec.filter_pdf_page
|
||||
```
|
||||
|
||||
OCR engine
|
||||
----------
|
||||
### OCR engine
|
||||
|
||||
```{eval-rst}
|
||||
.. autofunction:: ocrmypdf.pluginspec.get_ocr_engine
|
||||
```
|
||||
|
||||
```{eval-rst}
|
||||
.. autoclass:: ocrmypdf.pluginspec.OcrEngine
|
||||
:members:
|
||||
|
||||
.. automethod:: __str__
|
||||
```
|
||||
|
||||
```{eval-rst}
|
||||
.. autoclass:: ocrmypdf.pluginspec.OrientationConfidence
|
||||
```
|
||||
|
||||
PDF/A production
|
||||
----------------
|
||||
### PDF/A production
|
||||
|
||||
```{eval-rst}
|
||||
.. autofunction:: ocrmypdf.pluginspec.generate_pdfa
|
||||
```
|
||||
|
||||
PDF optimization
|
||||
----------------
|
||||
### PDF optimization
|
||||
|
||||
```{eval-rst}
|
||||
.. autofunction:: ocrmypdf.pluginspec.optimize_pdf
|
||||
```
|
||||
|
||||
.. autofunction:: ocrmypdf.pluginspec.is_optimization_enabled
|
||||
```{eval-rst}
|
||||
.. autofunction:: ocrmypdf.pluginspec.is_optimization_enabled
|
||||
```
|
||||
2840
docs/release_notes.md
Normal file
File diff suppressed because it is too large