mirror of
https://github.com/ocrmypdf/OCRmyPDF.git
synced 2026-02-08 05:13:50 -05:00
Document the new v17 API style where OcrOptions can be passed directly to ocr(). Mark the positional argument style as legacy API for <v17 compatibility. Update examples to use modern syntax.
178 lines
6.1 KiB
Markdown
% SPDX-FileCopyrightText: 2022 James R. Barlow
% SPDX-License-Identifier: CC-BY-SA-4.0

# Using the OCRmyPDF API

OCRmyPDF originated as a command line program and still reflects that
heritage, but parts of it can be imported and used in other Python
applications.

Some applications may prefer to run ocrmypdf through a
subprocess call anyway, as this isolates its activities from the caller.

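The subprocess approach can be sketched as follows. This is a minimal illustration, not part of OCRmyPDF: the helper names `build_ocrmypdf_command` and `run_ocrmypdf` are hypothetical, and it assumes OCRmyPDF is installed in the same environment as the calling interpreter so that `python -m ocrmypdf` works.

```python
import subprocess
import sys


def build_ocrmypdf_command(input_file, output_file, *, deskew=False, languages=()):
    """Build an argument list that runs the ocrmypdf CLI (hypothetical helper)."""
    # "python -m ocrmypdf" reuses the caller's interpreter and environment.
    command = [sys.executable, "-m", "ocrmypdf"]
    if deskew:
        command.append("--deskew")
    if languages:
        # The CLI accepts several languages joined with "+", e.g. "eng+deu".
        command += ["--language", "+".join(languages)]
    command += [str(input_file), str(output_file)]
    return command


def run_ocrmypdf(input_file, output_file, **options):
    """Run OCRmyPDF in a separate process and return its exit code."""
    command = build_ocrmypdf_command(input_file, output_file, **options)
    completed = subprocess.run(command, check=False)
    return completed.returncode  # 0 means success
```

A caller would then use something like `run_ocrmypdf('input.pdf', 'output.pdf', deskew=True)` and treat any nonzero return value as a failure.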
## Example

OCRmyPDF provides one high-level function to run its main engine from an
application.

```{versionchanged} 17.0
The {func}`ocrmypdf.ocr` function now accepts an {class}`~ocrmypdf.OcrOptions`
object as its first argument, providing a cleaner API with full type hints
and validation. The previous positional argument style remains supported.
```

### Modern API (recommended)

The recommended way to call {func}`ocrmypdf.ocr` is to construct an
{class}`~ocrmypdf.OcrOptions` object with all settings, then pass it
as the sole argument:

```python
import ocrmypdf
from ocrmypdf import OcrOptions

if __name__ == '__main__':  # To ensure correct behavior on Windows and macOS
    options = OcrOptions(
        input_file='input.pdf',
        output_file='output.pdf',
        deskew=True,
        languages=['eng'],
    )
    ocrmypdf.ocr(options)
```

{class}`~ocrmypdf.OcrOptions` is a Pydantic model that provides:

- Full type hints and IDE autocompletion
- Validation of option values at construction time
- Clear documentation of all available options

```{versionadded} 17.0
The {class}`~ocrmypdf.OcrOptions` class is now exported from the top-level
`ocrmypdf` module.
```

### Legacy API

For compatibility with OCRmyPDF < v17, the traditional calling style
with positional arguments is still fully supported:

```python
import ocrmypdf

if __name__ == '__main__':  # To ensure correct behavior on Windows and macOS
    ocrmypdf.ocr('input.pdf', 'output.pdf', deskew=True)
```

With this style, all of the command line arguments are available
and may be passed as equivalent keywords.

The exceptions are `verbose` and `quiet`, which are not available.
Instead, output should be managed by configuring logging.

### Parent process requirements

The {func}`ocrmypdf.ocr` function runs OCRmyPDF much as command line
execution does. To do this, it will:

- create worker processes or threads
- manage the signal flags of its worker processes
- execute other subprocesses (forking and executing other programs)

The Python process that calls {func}`ocrmypdf.ocr()` must be sufficiently
privileged to perform these actions.

There is currently no option to manage how jobs are scheduled other
than the argument `jobs=`, which limits the number of worker
processes.

Creating a child process to call {func}`ocrmypdf.ocr()` is suggested. That
way your application will survive and remain interactive even if
OCRmyPDF fails for any reason. For example:

```python
from multiprocessing import Process
import ocrmypdf
from ocrmypdf import OcrOptions

def ocrmypdf_process():
    options = OcrOptions(input_file='input.pdf', output_file='output.pdf')
    ocrmypdf.ocr(options)

def call_ocrmypdf_from_my_app():
    p = Process(target=ocrmypdf_process)
    p.start()
    p.join()
```

Programs that call {func}`ocrmypdf.ocr()` should also install a SIGBUS signal
handler (except on Windows), to raise an exception if access to a memory
mapped file fails. OCRmyPDF may use memory mapping.

{func}`ocrmypdf.ocr()` takes a threading lock to prevent multiple runs of itself
in the same Python interpreter process. OCRmyPDF is not thread-safe, because of how
its plugins and Python's library import system work. If you need to parallelize
OCRmyPDF, use processes.

:::{warning}
On Windows and macOS, the script that calls {func}`ocrmypdf.ocr()` must be
protected by an "ifmain" guard (`if __name__ == '__main__'`). Without this
guard, process semantics will prevent OCRmyPDF from working correctly.
:::

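As a rough sketch of process-based parallelism, the following uses `concurrent.futures.ProcessPoolExecutor` so that each document is processed in its own interpreter process. The helper names (`output_path_for`, `ocr_one`, `ocr_many`) and the `_ocr.pdf` naming scheme are assumptions made for illustration, and passing `jobs=1` to {class}`~ocrmypdf.OcrOptions` is assumed to mirror the `jobs=` argument so that each run limits its own worker processes.

```python
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path


def output_path_for(input_path: Path, output_dir: Path) -> Path:
    """Hypothetical naming scheme: scan.pdf -> <output_dir>/scan_ocr.pdf."""
    return output_dir / f"{input_path.stem}_ocr.pdf"


def ocr_one(job):
    """Run one OCR job; executed inside a worker process."""
    # Imported here so each worker process loads OCRmyPDF itself.
    import ocrmypdf
    from ocrmypdf import OcrOptions

    input_file, output_file = job
    # jobs=1 is an assumption: one OCRmyPDF worker per pool process.
    ocrmypdf.ocr(OcrOptions(input_file=input_file, output_file=output_file, jobs=1))
    return output_file


def ocr_many(input_dir: Path, output_dir: Path, max_workers: int = 2):
    """OCR every PDF in input_dir, one process per document at a time."""
    jobs = [
        (str(path), str(output_path_for(path, output_dir)))
        for path in sorted(input_dir.glob("*.pdf"))
    ]
    if not jobs:
        return []  # nothing to do, so no pool is created
    with ProcessPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(ocr_one, jobs))
```

As with any script that starts processes, the call to `ocr_many(...)` should sit under an `if __name__ == '__main__':` guard.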
### Logging

OCRmyPDF logs under loggers named `ocrmypdf`. In addition, it
imports `pdfminer` and `PIL`, both of which post log messages under
those logging namespaces.

You can configure logging as desired for your application or call
{func}`ocrmypdf.configure_logging` to configure logging the same way
OCRmyPDF itself does. Command line parameters such as `--quiet`
and `--verbose` have no equivalents in the API; you must use the
provided configuration function or configure logging in a way that suits
your use case.

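For applications that prefer to configure logging themselves, a minimal sketch with the standard `logging` module might look like this. The function name `configure_ocr_logging`, the message format, and the default levels are illustrative choices, not OCRmyPDF defaults; only the logger names `ocrmypdf`, `pdfminer`, and `PIL` come from the text above.

```python
import logging
import sys


def configure_ocr_logging(level=logging.INFO, dependency_level=logging.WARNING):
    """Send OCRmyPDF log messages to stderr and quiet its dependencies."""
    handler = logging.StreamHandler(sys.stderr)  # stderr keeps stdout clean
    handler.setFormatter(logging.Formatter("%(name)s %(levelname)s: %(message)s"))

    ocr_logger = logging.getLogger("ocrmypdf")
    ocr_logger.setLevel(level)
    ocr_logger.addHandler(handler)

    # pdfminer and PIL log under their own namespaces.
    for name in ("pdfminer", "PIL"):
        logging.getLogger(name).setLevel(dependency_level)
    return ocr_logger
```

Calling `configure_ocr_logging(logging.DEBUG)` roughly approximates a verbose run, while raising the level to `logging.ERROR` approximates a quiet one.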
### Progress monitoring

OCRmyPDF uses the `rich` package to implement its progress bars.
{func}`ocrmypdf.configure_logging` will set up logging output to
`sys.stderr` in a way that is compatible with the display of the
progress bar. Use `ocrmypdf.ocr(..., progress_bar=False)` to disable
the progress bar.

### Standard output

OCRmyPDF is strict about not writing to standard output so that
users can safely use it in a pipeline and produce a valid output
file. A caller application must ensure that it does not write to
standard output either if it wants to be compatible with this
behavior and support piping to a file. Another benefit of running
OCRmyPDF in a child process, as recommended above, is that it will
not interfere with the parent process's standard output.

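One way a caller might keep its own Python-level output off standard output is sketched below with `contextlib.redirect_stdout`. The helper name `run_without_stdout` is hypothetical. This only redirects writes made through Python's `sys.stdout`; it does not affect output written directly to the file descriptor by child processes or C libraries.

```python
import contextlib
import sys


def run_without_stdout(task, *args, **kwargs):
    """Call task, sending anything it prints to stderr instead of stdout."""
    with contextlib.redirect_stdout(sys.stderr):
        return task(*args, **kwargs)
```

Status messages emitted with `print()` inside `task` then end up on standard error, leaving standard output free for the PDF data in a pipeline.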
### Exceptions

OCRmyPDF may throw standard Python exceptions, `ocrmypdf.exceptions.*`
exceptions, some exceptions related to multiprocessing, and
{exc}`KeyboardInterrupt`. The parent process should provide an exception
handler. OCRmyPDF will clean up its temporary files and worker processes
automatically when an exception occurs.

When OCRmyPDF succeeds conditionally, it returns an integer exit code.

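A parent-side exception handler could be structured along these lines. This is a generic sketch: `OcrOutcome` and `run_guarded` are hypothetical names, and the callable is passed in as a parameter (for example `ocrmypdf.ocr`) so the wrapper does not depend on any particular exception class.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass
class OcrOutcome:
    """Hypothetical result record for one guarded OCR run."""
    succeeded: bool
    result: Any = None
    error: Optional[BaseException] = None


def run_guarded(run_ocr: Callable[..., Any], *args, **kwargs) -> OcrOutcome:
    """Call run_ocr and capture its return value or the exception it raised."""
    try:
        return OcrOutcome(True, result=run_ocr(*args, **kwargs))
    except (KeyboardInterrupt, Exception) as exc:
        # KeyboardInterrupt is not an Exception subclass, so name it explicitly.
        return OcrOutcome(False, error=exc)
```

An application might call `run_guarded(ocrmypdf.ocr, options)` and then inspect `outcome.result` for the exit code or `outcome.error` for the failure.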
### Plugin Development Changes

```{versionchanged} 16.13
Plugin hooks now receive {class}`~ocrmypdf.OcrOptions` objects instead of
`argparse.Namespace`.
```

- {class}`~ocrmypdf.OcrOptions` provides the same attribute access as `Namespace` (duck-typing compatible)
- Plugin developers should update type hints: `from ocrmypdf import OcrOptions`
- Built-in plugins no longer modify options in place, so options can be treated as immutable

Most existing plugins will continue working without modification due to the
duck-typing compatibility between {class}`~ocrmypdf.OcrOptions` and `Namespace`.
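The duck-typing compatibility described above can be illustrated with a small helper that only reads attributes. The function `summarize_options` is hypothetical, and `SimpleNamespace` stands in here for an {class}`~ocrmypdf.OcrOptions` object so the sketch runs without OCRmyPDF installed; the attribute names `languages` and `deskew` are assumed to exist on both kinds of object.

```python
from argparse import Namespace
from types import SimpleNamespace


def summarize_options(options) -> str:
    """Describe options using attribute access only (no isinstance checks)."""
    languages = "+".join(options.languages)
    return f"languages={languages} deskew={options.deskew}"


# Old-style options object, as plugins received before the change.
legacy = Namespace(languages=["eng"], deskew=True)
# Stand-in for an OcrOptions object exposing the same attributes.
modern = SimpleNamespace(languages=["eng"], deskew=True)
```

Because the helper never modifies `options` and relies only on attribute reads, it behaves identically for both objects.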