mirror/OCRmyPDF

Fork 0

mirror of https://github.com/ocrmypdf/OCRmyPDF.git synced 2026-02-07 21:03:59 -05:00

Files

James R. Barlow 2f4280b66c Comprrehensive documentation update in preparation for v17

2026-01-16 01:38:47 -08:00

12 KiB

Raw Blame History

% SPDX-FileCopyrightText: 2022 James R. Barlow % SPDX-License-Identifier: CC-BY-SA-4.0

Plugins

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119.

You can use plugins to customize the behavior of OCRmyPDF at certain points of interest.

Currently, it is possible to:

add new command line arguments
override the decision for whether or not to perform OCR on a particular file
modify the image is about to be sent for OCR
modify the page image before it is converted to PDF
replace the Tesseract OCR with another OCR engine that has similar behavior
replace Ghostscript with another PDF to image converter (rasterizer) or PDF/A generator

OCRmyPDF plugins are based on the Python pluggy package and conform to its conventions. Note that: plugins installed with as setuptools entrypoints are not checked currently, because OCRmyPDF assumes you may not want to enable plugins for all files.

See [OCRmyPDF-EasyOCR](https://github.com/ocrmypdf/OCRmyPDF-EasyOCR) for an example of a straightforward, fully working plugin.

Script plugins

Script plugins may be called from the command line, by specifying the name of a file. Script plugins may be convenient for informal or "one-off" plugins, when a certain batch of files needs a special processing step for example.

ocrmypdf --plugin ocrmypdf_example_plugin.py input.pdf output.pdf

Multiple plugins may be installed by issuing the --plugin argument multiple times.

Packaged plugins

Installed plugins may be installed into the same virtual environment as OCRmyPDF is installed into. They may be invoked using Python standard module naming. If you are intending to distribute a plugin, please package it.

ocrmypdf --plugin ocrmypdf_fancypants.pockets.contents input.pdf output.pdf

OCRmyPDF does not automatically import plugins, because the assumption is that plugins affect different files differently and you may not want them activated all the time. The command line or ocrmypdf.ocr(plugin='...') must call for them.

Third parties that wish to distribute packages for ocrmypdf should package them as packaged plugins, and these modules should begin with the name ocrmypdf_ similar to pytest packages such as pytest-cov (the package) and pytest_cov (the module).

:::{note} We recommend plugin authors name their plugins with the prefix ocrmypdf- (for the package name on PyPI) and ocrmypdf_ (for the module), just like pytest plugins. At the same time, please make it clear that your package is not official. :::

Plugins

You can also create a plugin that OCRmyPDF will always automatically load if both are installed in the same virtual environment, using a project entrypoint. OCRmyPDF uses the entrypoint namespace "ocrmypdf".

For example, pyproject.toml would need to contain the following, for a plugin named ocrmypdf-exampleplugin:

[project]
name = "ocrmypdf-exampleplugin"

[project.entry-points."ocrmypdf"]
exampleplugin = "exampleplugin.pluginmodule"

Plugin requirements

OCRmyPDF generally uses multiple worker processes. When a new worker is started, Python will import all plugins again, including all plugins that were imported earlier. This means that the global state of a plugin in one worker will not be shared with other workers. As such, plugin hook implementations should be stateless, relying only on their inputs. Hook implementations may use their input parameters to to obtain a reference to shared state prepared by another hook implementation. Plugins must expect that other instances of the plugin will be running simultaneously.

The context object that is passed to many hooks can be used to share information about a file being worked on. Plugins must write private, plugin-specific data to a subfolder named {options.work_folder}/ocrmypdf-plugin-name. Plugins MAY read and write files in options.work_folder, but should be aware that their semantics are subject to change.

OCRmyPDF will delete options.work_folder when it has finished OCRing a file, unless invoked with --keep-temporary-files.

The documentation for some plugin hooks contain a detailed description of the execution context in which they will be called.

Plugins should be prepared to work whether executed in worker threads or worker processes. Generally, OCRmyPDF uses processes, but has a semi-hidden threaded argument that simplifies debugging.

Plugin hooks

A plugin may provide the following hooks. Hooks must be decorated with ocrmypdf.hookimpl, for example:

from ocrmypdf import hookimpl

@hookimpl
def add_options(parser):
    pass

The following is a complete list of hooks that are available, and when they are called.

(firstresult)=

Note on firstresult hooks

If multiple plugins install implementations for this hook, they will be called in the reverse of the order in which they are installed (i.e., last plugin wins). When each hook implementation is called in order, the first implementation that returns a value other than None will "win" and prevent execution of all other hooks. As such, you cannot "chain" a series of plugin filters together in this way. Instead, a single hook implementation should be responsible for any such chaining operations.

Examples

OCRmyPDF's test suite contains several plugins that are used to simulate certain test conditions.
ocrmypdf-papermerge is a production plugin that integrates OCRmyPDF and the Papermerge document management system.

Suppressing or overriding other plugins

.. autofunction:: ocrmypdf.pluginspec.initialize

Custom command line arguments

.. autofunction:: ocrmypdf.pluginspec.add_options

.. autofunction:: ocrmypdf.pluginspec.check_options

Plugin option models

Plugins can define their own option models using Pydantic. This allows plugins to:

Define type-safe option structures with validation
Add CLI arguments that map to their option model fields
Access options via nested namespaces (e.g., options.tesseract.timeout)

.. autofunction:: ocrmypdf.pluginspec.register_options

Plugin options can be accessed in two ways:

Flat access (backward compatible): options.tesseract_timeout
Nested access: options.tesseract.timeout

Both access patterns are equivalent and return the same values.

:::{note} Plugin Interface Change: Starting in OCRmyPDF v17.0.0, plugin hooks receive OcrOptions objects instead of argparse.Namespace objects. Most plugins will continue working due to duck-typing compatibility, but plugin developers should update their type hints accordingly. :::

Migration guide for plugin developers

:::{versionadded} 17.0.0 :::

Update imports:

from ocrmypdf._options import OcrOptions

Update type hints:

# Before (v16 and earlier)
def check_options(options: argparse.Namespace) -> None:
    ...

# After (v17+)
def check_options(options: OcrOptions) -> None:
    ...

Attribute access unchanged:

# These work exactly as before
options.languages
options.output_type
options.tesseract_timeout

Remove in-place modifications:

# Before (v16 pattern - no longer recommended)
def check_options(options):
    options.some_computed_value = compute_value(options)

# After (v17 pattern - compute at point of use)
def some_function(options):
    computed = compute_value(options)
    use_computed(computed)

Execution and progress reporting

.. autoclass:: ocrmypdf.pluginspec.ProgressBar
    :members:
    :special-members: __init__, __enter__, __exit__

.. autoclass:: ocrmypdf.pluginspec.Executor
    :members:
    :special-members: __call__

.. autofunction:: ocrmypdf.pluginspec.get_logging_console

.. autofunction:: ocrmypdf.pluginspec.get_executor

.. autofunction:: ocrmypdf.pluginspec.get_progressbar_class

Applying special behavior before processing

.. autofunction:: ocrmypdf.pluginspec.validate

PDF page to image

.. autofunction:: ocrmypdf.pluginspec.rasterize_pdf_page

Modifying intermediate images

.. autofunction:: ocrmypdf.pluginspec.filter_ocr_image

.. autofunction:: ocrmypdf.pluginspec.filter_page_image

.. autofunction:: ocrmypdf.pluginspec.filter_pdf_page

OCR engine

.. autofunction:: ocrmypdf.pluginspec.get_ocr_engine

.. autoclass:: ocrmypdf.pluginspec.OcrEngine
    :members:

    .. automethod:: __str__

.. autoclass:: ocrmypdf.pluginspec.OrientationConfidence

PDF/A production

.. autofunction:: ocrmypdf.pluginspec.generate_pdfa

PDF optimization

.. autofunction:: ocrmypdf.pluginspec.optimize_pdf

.. autofunction:: ocrmypdf.pluginspec.is_optimization_enabled

Working with OcrElement trees

:::{versionadded} 17.0.0 :::

OCRmyPDF v17 introduces the OcrElement dataclass for representing OCR output in an engine-agnostic format. This enables plugins to work with OCR results without parsing hOCR XML.

Key classes:

from ocrmypdf import OcrElement, OcrClass, BoundingBox

# OcrElement - represents any OCR structural unit
page = OcrElement(
    ocr_class=OcrClass.PAGE,
    bbox=BoundingBox(0, 0, 612, 792),
    children=[...]
)

# BoundingBox - axis-aligned bounding box (left, top, right, bottom)
bbox = BoundingBox(left=100, top=50, right=300, bottom=80)

# OcrClass - constants for element types
OcrClass.PAGE      # "ocr_page"
OcrClass.LINE      # "ocr_line"
OcrClass.WORD      # "ocrx_word"
OcrClass.PARAGRAPH # "ocr_par"

Navigating the tree:

# Get all words in a page
words = page.words  # Returns list[OcrElement]

# Get all lines
lines = page.lines

# Get combined text
text = page.get_text_recursive()

# Iterate by class
for para in page.paragraphs:
    print(para.get_text_recursive())

OCR engine plugins:

Plugins implementing custom OCR engines can now output OcrElement trees directly via the generate_ocr() method, bypassing hOCR entirely:

from pathlib import Path
from ocrmypdf.pluginspec import OcrEngine
from ocrmypdf import OcrElement, OcrClass, BoundingBox

class MyOcrEngine(OcrEngine):
    def generate_ocr(
        self,
        input_file: Path,
        options,
        context,
    ) -> OcrElement:
        # Perform OCR and return OcrElement tree directly
        # No need to generate hOCR XML
        return OcrElement(
            ocr_class=OcrClass.PAGE,
            bbox=BoundingBox(0, 0, width, height),
            dpi=300,
            children=[
                OcrElement(
                    ocr_class=OcrClass.LINE,
                    bbox=BoundingBox(100, 50, 500, 80),
                    children=[
                        OcrElement(
                            ocr_class=OcrClass.WORD,
                            bbox=BoundingBox(100, 50, 200, 80),
                            text="Hello",
                        ),
                        # ... more words
                    ]
                ),
                # ... more lines
            ]
        )

    def supports_generate_ocr(self) -> bool:
        return True  # Indicate this engine uses generate_ocr()

This approach is simpler than generating hOCR and allows modern OCR engines to integrate more naturally with OCRmyPDF.

12 KiB Raw Blame History

Plugins

Script plugins

Packaged plugins

Plugins

Plugin requirements

Plugin hooks

Examples

Suppressing or overriding other plugins

Custom command line arguments

Plugin option models

Migration guide for plugin developers

Execution and progress reporting

Applying special behavior before processing

PDF page to image

Modifying intermediate images

OCR engine

PDF/A production

PDF optimization

Working with OcrElement trees

12 KiB

Raw Blame History