mirror of
https://github.com/ocrmypdf/OCRmyPDF.git
synced 2026-01-27 23:44:23 -05:00
Add two missing available parameters for watcher.py (used with docker): - OCR_LOGLEVEL - OCR_JSON_SETTINGS
253 lines
8.4 KiB
Markdown
253 lines
8.4 KiB
Markdown
% SPDX-FileCopyrightText: 2022 James R. Barlow
|
|
% SPDX-License-Identifier: CC-BY-SA-4.0
|
|
|
|
Batch processing
|
|
================
|
|
|
|
This article provides information about running OCRmyPDF on multiple
|
|
files or configuring it as a service triggered by file system events.
|
|
|
|
Batch jobs
|
|
----------
|
|
|
|
Consider using the excellent [GNU
|
|
Parallel](https://www.gnu.org/software/parallel/) to apply OCRmyPDF to
|
|
multiple files at once.
|
|
|
|
Both `parallel` and `ocrmypdf` will try to use all available processors.
|
|
To maximize parallelism without overloading your system with processes,
|
|
consider using `parallel -j 2` to limit parallel to running two jobs at
|
|
once.
|
|
|
|
This command will run `ocrmypdf` on all files named `*.pdf` in the
|
|
current directory and write them to the previously created `output/`
|
|
folder. It will not search subdirectories.
|
|
|
|
The `--tag` argument tells parallel to print the filename as a prefix
|
|
whenever a message is printed, so that one can trace any errors to the
|
|
file that produced them.
|
|
|
|
:::{code} bash
|
|
parallel --tag -j 2 ocrmypdf '{}' 'output/{}' ::: *.pdf
|
|
:::
|
|
|
|
OCRmyPDF automatically repairs PDFs before parsing and gathering
|
|
information from them.
|
|
|
|
Directory trees
|
|
---------------
|
|
|
|
This will walk through a directory tree and run OCR on all files in
|
|
place, and printing each filename in between runs:
|
|
|
|
:::{code} bash
|
|
find . -name '*.pdf' -printf '%p\n' -exec ocrmypdf '{}' '{}' \;
|
|
:::
|
|
|
|
This only runs one `ocrmypdf` process at a time. This variation uses
|
|
`find` to create a directory list and `parallel` to parallelize runs of
|
|
`ocrmypdf`, again updating files in place.
|
|
|
|
:::{code} bash
|
|
find . -name '*.pdf' | parallel --tag -j 2 ocrmypdf '{}' '{}'
|
|
:::
|
|
|
|
In a Windows batch file, use
|
|
|
|
:::{code} bat
|
|
for /r %%f in (*.pdf) do ocrmypdf %%f %%f
|
|
:::
|
|
|
|
With a Docker container, you will need to stream through standard input
|
|
and output:
|
|
|
|
:::{code} bash
|
|
find . -name '*.pdf' -print0 | xargs -0 | while read pdf; do
|
|
pdfout=$(mktemp)
|
|
docker run --rm -i jbarlow83/ocrmypdf - - <$pdf >$pdfout && cp $pdfout $pdf
|
|
done
|
|
:::
|
|
|
|
### Sample script
|
|
|
|
This user contributed script also provides an example of batch
|
|
processing.
|
|
|
|
:::{literalinclude} ../misc/batch.py
|
|
---
|
|
caption: misc/batch.py
|
|
---
|
|
:::
|
|
|
|
### Synology DiskStations
|
|
|
|
Synology DiskStations (Network Attached Storage devices) can run the
|
|
Docker image of OCRmyPDF if the Synology [Docker
|
|
package](https://www.synology.com/en-global/dsm/packages/Docker) is
|
|
installed. Attached is a script to address particular quirks of using
|
|
OCRmyPDF on one of these devices.
|
|
|
|
At the time this script was written, it only worked for x86-based
|
|
Synology products. It is not known if it will work on ARM-based Synology
|
|
products. Further adjustments might be needed to deal with the
|
|
Synology\'s relatively limited CPU and RAM.
|
|
|
|
:::{literalinclude} ../misc/synology.py
|
|
---
|
|
caption: misc/synology.py - Sample script for Synology DiskStations
|
|
---
|
|
:::
|
|
|
|
### Huge batch jobs
|
|
|
|
If you have thousands of files to work with, contact the author.
|
|
Consulting work related to OCRmyPDF helps fund this open source project
|
|
and all inquiries are appreciated.
|
|
|
|
Hot (watched) folders
|
|
---------------------
|
|
|
|
### Watched folders with watcher.py
|
|
|
|
OCRmyPDF has a folder watcher called watcher.py, which is currently
|
|
included in source distributions but not part of the main program. It
|
|
may be used natively or may run in a Docker container. Native instances
|
|
tend to give better performance. watcher.py works on all platforms.
|
|
|
|
Users may need to customize the script to meet their requirements.
|
|
|
|
:::{code} bash
|
|
pip3 install ocrmypdf[watcher]
|
|
|
|
env OCR_INPUT_DIRECTORY=/mnt/input-pdfs \
|
|
OCR_OUTPUT_DIRECTORY=/mnt/output-pdfs \
|
|
OCR_OUTPUT_DIRECTORY_YEAR_MONTH=1 \
|
|
python3 watcher.py
|
|
:::
|
|
|
|
:::{list-table} watcher.py environment variables
|
|
---
|
|
header-rows: 1
|
|
---
|
|
|
|
* - Environment variable
|
|
- Description
|
|
* - OCR\_INPUT\_DIRECTORY
|
|
- Set input directory to monitor (recursive)
|
|
* - OCR\_OUTPUT\_DIRECTORY
|
|
- Set output directory (should not be under input)
|
|
* - OCR\_ARCHIVE\_DIRECTORY
|
|
- Set archive directory for processed originals (should not be under input, requires `OCR_ON_SUCCESS_ARCHIVE` to be set)
|
|
* - OCR\_ON\_SUCCESS\_DELETE
|
|
- This will move the processed original file to `OCR_ARCHIVE_DIRECTORY` if the exit code is 0 (OK). Note that `OCR_ON_SUCCESS_DELETE` takes precedence over this option, i.e. if both options are set, the input file will be deleted.
|
|
* - OCR\_OUTPUT\_DIRECTORY\_YEAR\_MONTH
|
|
- This will place files in the output in `{output}/{year}/{month}/{filename}`
|
|
* - OCR\_DESKEW
|
|
- Apply deskew to crooked input PDFs
|
|
* - OCR\_JSON\_SETTINGS
|
|
- A JSON string specifying any other arguments for `ocrmypdf.ocr`, e.g. `'OCR_JSON_SETTINGS={"rotate_pages": true, "optimize": "3"}'`.
|
|
* - OCR\_POLL\_NEW\_FILE\_SECONDS
|
|
- Polling interval
|
|
* - OCR\_LOGLEVEL
|
|
- Level of log messages t
|
|
:::
|
|
|
|
One could configure a networked scanner or scanning computer to drop
|
|
files in the watched folder.
|
|
|
|
### Watched folders with Docker
|
|
|
|
The watcher service is included in the OCRmyPDF Docker image. To run it:
|
|
|
|
:::{code} bash
|
|
docker run \
|
|
--volume <path to files to convert>:/input \
|
|
--volume <path to store results>:/output \
|
|
--volume <path to store processed originals>:/processed \
|
|
--env OCR_OUTPUT_DIRECTORY_YEAR_MONTH=1 \
|
|
--env OCR_ON_SUCCESS_ARCHIVE=1 \
|
|
--env OCR_DESKEW=1 \
|
|
--env PYTHONUNBUFFERED=1 \
|
|
--interactive --tty --entrypoint python3 \
|
|
jbarlow83/ocrmypdf \
|
|
watcher.py
|
|
:::
|
|
|
|
This service will watch for a file that matches `/input/\*.pdf`, convert
|
|
it to a OCRed PDF in `/output/`, and move the processed original to
|
|
`/processed`. The parameters to this image are:
|
|
|
|
:::{list-table} Watcher Docker Parameters
|
|
:header-rows: 1
|
|
|
|
* - Parameter
|
|
- Description
|
|
* - `--volume <path to files to convert>:/input`
|
|
- Files placed in this location will be OCRed
|
|
* - `--volume <path to store results>:/output`
|
|
- This is where OCRed files will be stored
|
|
* - `--volume <path to store processed originals>:/processed`
|
|
- Archive processed originals here
|
|
* - `--env OCR_OUTPUT_DIRECTORY_YEAR_MONTH=1`
|
|
- Define environment variable `OCR_OUTPUT_DIRECTORY_YEAR_MONTH=1` to place files in the output in `{output}/{year}/{month}/{filename}`
|
|
* - `--env OCR_ON_SUCCESS_ARCHIVE=1`
|
|
- Define environment variable `OCR_ON_SUCCESS_ARCHIVE` to move processed originals
|
|
* - `--env OCR_DESKEW=1`
|
|
- Define environment variable `OCR_DESKEW` to apply deskew to crooked input PDFs
|
|
* - `--env PYTHONBUFFERED=1`
|
|
- This will force `STDOUT` to be unbuffered and allow you to see messages in docker logs
|
|
* - `--env OCR_LOGLEVEL='DEBUG'`
|
|
- Level of log messages
|
|
* - `--env OCR_JSON_SETTINGS={"language":"deu+eng", "rotate_pages": true}`
|
|
- A JSON string specifying any other arguments for `ocrmypdf.ocr`
|
|
:::
|
|
|
|
This service relies on polling to check for changes to the filesystem.
|
|
It may not be suitable for some environments, such as filesystems shared
|
|
on a slow network.
|
|
|
|
A configuration manager such as Docker Compose could be used to ensure
|
|
that the service is always available.
|
|
|
|
:::{literalinclude} ../misc/docker-compose.example.yml
|
|
---
|
|
caption: misc/docker-compose.example.yml
|
|
---
|
|
:::
|
|
|
|
### Caveats
|
|
|
|
- `watchmedo` may not work properly on a networked file system,
|
|
depending on the capabilities of the file system client and server.
|
|
- This simple recipe does not filter for the type of file system
|
|
event, so file copies, deletes and moves, and directory operations,
|
|
will all be sent to ocrmypdf, producing errors in several cases.
|
|
Disable your watched folder if you are doing anything other than
|
|
copying files to it.
|
|
- If the source and destination directory are the same, watchmedo may
|
|
create an infinite loop.
|
|
- On BSD, FreeBSD and older versions of macOS, you may need to
|
|
increase the number of file descriptors to monitor more files, using
|
|
`ulimit -n 1024` to watch a folder of up to 1024 files.
|
|
|
|
### Alternatives
|
|
|
|
- On Linux, [systemd user
|
|
services](https://wiki.archlinux.org/index.php/Systemd/User) can be
|
|
configured to automatically perform OCR on a collection of files.
|
|
- [Watchman](https://facebook.github.io/watchman/) is a more powerful
|
|
alternative to `watchmedo`.
|
|
|
|
macOS Automator
|
|
---------------
|
|
|
|
You can use the Automator app with macOS, to create a Workflow or Quick
|
|
Action. Use a *Run Shell Script* action in your workflow. In the context
|
|
of Automator, the `PATH` may be set differently your Terminal\'s `PATH`;
|
|
you may need to explicitly set the PATH to include `ocrmypdf`. The
|
|
following example may serve as a starting point:
|
|
|
|

|
|
|
|
You may customize the command sent to ocrmypdf.
|