% SPDX-FileCopyrightText: 2022 James R. Barlow
% SPDX-License-Identifier: CC-BY-SA-4.0

# PDF optimization

OCRmyPDF includes an image-oriented PDF optimizer. By default, the
optimizer runs with safe settings with the goal of improving compression
at no loss of quality. At higher optimization levels, lossy
optimizations may be applied and tuned. Optimization occurs after OCR,
and only if OCR succeeded. It does not perform other possible
optimizations such as deduplicating resources, consolidating fonts,
simplifying vector drawings, or anything of that nature.

:::{list-table} OCRmyPDF optimization settings
---
widths: 33 6 60
header-rows: 1
---

* - Optimization level
  - Shorthand
  - Description
* - ``--optimize 0``
  - ``-O0``
  - Disables most optimizations.
* - ``--optimize 1`` (default)
  - ``-O1``
  - Enables lossless optimizations, such as transcoding images to more
    efficient formats. Also compresses other uncompressed objects in the
    PDF and enables the more efficient "object streams" within the PDF.
* - ``--optimize 2``
  - ``-O2``
  - All of the above, and enables lossy optimizations and color quantization.
* - ``--optimize 3``
  - ``-O3``
  - All of the above, and enables more aggressive optimizations and targets lower
    image quality.
:::

The exact optimizations performed will vary over time and depend on
which third-party tools are installed.

Despite optimizations, OCRmyPDF might still increase the overall file
size, since it must embed information about the recognized text, and
depending on the settings chosen, may not be able to represent the
output file as compactly as the input file.

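Because the output can end up larger than the input, it can be useful to compare the two files after a run. The following is a minimal sketch using only the Python standard library; the helper names `size_change` and `describe_size_change` are hypothetical and are not part of OCRmyPDF.

```python
import os


def size_change(input_path: str, output_path: str) -> float:
    """Return the output file size as a fraction of the input file size."""
    in_size = os.path.getsize(input_path)
    out_size = os.path.getsize(output_path)
    if in_size == 0:
        raise ValueError("input file is empty")
    return out_size / in_size


def describe_size_change(input_path: str, output_path: str) -> str:
    """Summarize whether the output grew or shrank relative to the input."""
    ratio = size_change(input_path, output_path)
    if ratio > 1.0:
        return f"output grew to {ratio:.0%} of the input size"
    return f"output is {ratio:.0%} of the input size"
```

Calling `describe_size_change("input.pdf", "output.pdf")` after OCRmyPDF finishes gives a quick indication of whether the chosen optimization level paid off for that particular file.
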
## Optimizations that always occur

OCRmyPDF will automatically replace obsolete or inferior compression
schemes such as RLE or LZW with superior schemes such as Deflate, and
convert monochrome images to CCITT G4. Since this is lossless, it always
occurs and there is no way to disable it. Other non-image compressed
objects are compressed as well.

## Fast web view

OCRmyPDF automatically optimizes PDFs for "fast web view" in Adobe
Acrobat's parlance, or equivalently, linearizes PDFs so that the
resources they reference are presented in the order a viewer needs them
for sequential display. This reduces the latency of viewing a PDF both
online and from local storage, in exchange for a slight increase in file
size.

To disable this optimization and all others, use
`ocrmypdf --optimize 0 ...` or the shorthand `-O0`.

Adobe Acrobat might not report the file as being "fast web view".

## Lossless optimizations

At optimization level `-O1` (the default), OCRmyPDF will also attempt
lossless image optimization.

If a JBIG2 encoder is available, then monochrome images will be
converted to JBIG2, with the potential for huge savings on large black
and white images, since JBIG2 is far more efficient than any other
monochrome (bi-level) compression. (All known US patents related to
JBIG2 have probably expired, but it remains the responsibility of the
user to supply a JBIG2 encoder such as
[jbig2enc](https://github.com/agl/jbig2enc). OCRmyPDF does not implement
JBIG2 encoding on its own.)

OCRmyPDF currently does not attempt to recompress losslessly compressed
objects more aggressively.

## Lossy optimizations

At optimization levels `-O1`, `-O2` and `-O3`, OCRmyPDF will attempt
some lossy image optimization.

If Ghostscript is used to create a PDF/A (the default), Ghostscript will
optimize some images by converting them to JPEG, which is lossy. If
`--output-type pdf` is used, there are no lossy optimizations. Ghostscript's
JPEG conversion is quite safe.

If `pngquant` is installed, OCRmyPDF will use it to quantize
paletted images to reduce their size.

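Since some optimizations depend on optional third-party tools, a quick check of what is installed can explain why results differ between machines. The sketch below assumes the tools are on `PATH` under the executable names `jbig2` (as installed by jbig2enc) and `pngquant`; those names, and the helper `find_optional_tools`, are assumptions for illustration and do not describe how OCRmyPDF itself locates the tools.

```python
import shutil


def find_optional_tools(names=("jbig2", "pngquant")):
    """Map each executable name to its full path on PATH, or None if missing."""
    # The default names are assumptions about how the tools are installed.
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    for name, path in find_optional_tools().items():
        status = path if path else "not found"
        print(f"{name}: {status}")
```

A `None` entry only means the executable was not found under that name on the current `PATH`.
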
The quality of JPEGs may be lowered, on the assumption that a lower
quality image may be suitable for storage after OCR. Use `--jpeg-quality`
to control the optimizer's JPEG quality target. The optimizer is the
recommended way to reduce JPEG image sizes: it applies consistently
regardless of whether Ghostscript was used to produce a PDF/A.

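As an illustration of combining an optimization level with a JPEG quality target, here is a minimal sketch that assembles an `ocrmypdf` command line in Python. The helper `build_command`, the file names, and the values `2` and `60` are hypothetical and chosen only for the example; the sketch uses only the `--optimize` and `--jpeg-quality` options described above and does not run anything.

```python
def build_command(input_pdf, output_pdf, optimize=2, jpeg_quality=None):
    """Build an ocrmypdf argument list for the given optimization settings."""
    # Levels 0 through 3 are the ones listed in the optimization settings table.
    if optimize not in (0, 1, 2, 3):
        raise ValueError("optimize must be 0, 1, 2 or 3")
    cmd = ["ocrmypdf", "--optimize", str(optimize)]
    if jpeg_quality is not None:
        cmd += ["--jpeg-quality", str(jpeg_quality)]
    cmd += [input_pdf, output_pdf]
    return cmd


if __name__ == "__main__":
    # Illustrative values only; suitable settings depend on the documents.
    print(" ".join(build_command("input.pdf", "output.pdf", 2, 60)))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` on a system where `ocrmypdf` is installed.
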
If you specifically need to tune Ghostscript's own PDF/A image handling
(for example, to force a hard DPI cap), see
[Advanced Ghostscript tuning](advanced.md#advanced-ghostscript-tuning)
for the separate `--ghostscript-jpeg-quality` and
`--ghostscript-jpeg-maxdpi` options.

It is not possible to optimize all image types. Uncommon image types may
be skipped by the optimizer.