Convert remaining rst -> md

2026-02-13 15:52:50 -05:00 · 2025-04-17 15:03:21 -07:00
parent 3b9367fc69
commit d1a45e4abc
18 changed files with 4538 additions and 4783 deletions
--- a/docs/advanced.md
+++ b/docs/advanced.md
@@ -0,0 +1,461 @@
+% SPDX-FileCopyrightText: 2022 James R. Barlow
+% SPDX-License-Identifier: CC-BY-SA-4.0
+
+# Advanced features
+
+## Control of unpaper
+
+OCRmyPDF uses `unpaper` to provide the implementation of the
+`--clean` and `--clean-final` arguments.
+[unpaper](https://github.com/Flameeyes/unpaper/blob/main/doc/basic-concepts.md)
+provides a variety of image processing filters to improve images.
+
+By default, OCRmyPDF uses only `unpaper` arguments that were found to
+be safe to use on almost all files without having to inspect every page
+of the file afterwards. This is particularly true when only `--clean`
+is used, since that instructs OCRmyPDF to only clean the image before
+OCR and not the final image.
+
+However, if you wish to use the more aggressive options in `unpaper`,
+you may use `--unpaper-args '...'` to override OCRmyPDF's defaults
+and forward other arguments to unpaper. This option will forward
+arguments to `unpaper` without any knowledge of what that program
+considers to be valid arguments. The string of arguments must be quoted
+as shown in the examples below. No filename arguments may be included.
+OCRmyPDF will assume it can append the input and output filenames of
+intermediate images to the `--unpaper-args` string.
+
+In this example, we tell `unpaper` to expect two pages of text on a
+sheet (image), such as occurs when two facing pages of a book are
+scanned. `unpaper` uses this information to deskew each independently
+and clean up the margins of both.
+
+```bash
+ocrmypdf --clean --clean-final --unpaper-args '--layout double' input.pdf output.pdf
+ocrmypdf --clean --clean-final --unpaper-args '--layout double --no-noisefilter' input.pdf output.pdf
+```
+
+:::{warning}
+Some `unpaper` features will reposition text within the image.
+`--clean-final` is recommended to avoid this issue.
+:::
+
+:::{warning}
+Some `unpaper` features cause multiple input or output files to be
+consumed or produced. OCRmyPDF requires `unpaper` to consume one
+file and produce one file; errors will result if this assumption is not
+met.
+:::
+
+:::{note}
+`unpaper` uses uncompressed PBM/PGM/PPM files for its intermediate
+files. For large images or documents, it can take a lot of temporary
+disk space.
+:::
+
+## Control of OCR options
+
+OCRmyPDF provides many features to control the behavior of the OCR
+engine, Tesseract.
+
+### When OCR is skipped
+
+If a page in a PDF seems to have text, by default OCRmyPDF will exit
+without modifying the PDF. This is to ensure that PDFs that were
+previously OCRed or were "born digital" rather than scanned are not
+processed.
+
+If `--skip-text` is issued, then no image processing or OCR will be
+performed on pages that already have text. The page will be copied to
+the output. This may be useful for documents that contain both "born
+digital" and scanned content, or to use OCRmyPDF to normalize and
+convert to PDF/A regardless of their contents.
+
+If `--redo-ocr` is issued, then a detailed text analysis is performed.
+Text is categorized as either visible or invisible. Invisible text (OCR)
+is stripped out. Then an image of each page is created with visible text
+masked out. The page image is sent for OCR, and any additional text is
+inserted as OCR. If a file contains a mix of text and bitmap images that
+contain text, OCRmyPDF will locate the additional text in images without
+disrupting the existing text. Some PDF OCR solutions render text as
+technically printable or visible in some way, perhaps by drawing it and
+then painting over it. OCRmyPDF cannot distinguish this type of OCR
+text from real text, so it will not be "redone".
+
+If `--force-ocr` is issued, then all pages will be rasterized to
+images, discarding any hidden OCR text, rasterizing any printable
+text, and flattening form fields or interactive objects into their visual
+representation. This is useful for redoing OCR, for fixing OCR text
+with a damaged character map (text is selectable but not searchable),
+and destroying redacted information.
+
+### Time and image size limits
+
+By default, OCRmyPDF permits tesseract to run for three minutes (180
+seconds) per page. This is usually more than enough time to find all
+text on a reasonably sized page with modern hardware.
+
+If a page is skipped, it will be inserted without OCR. If preprocessing
+was requested, the preprocessed image layer will be inserted.
+
+If you want to adjust the amount of time spent on OCR, change
+`--tesseract-timeout`. You can also automatically skip images that
+exceed a certain number of megapixels with `--skip-big`. (A 300 DPI,
+8.5×11" page image is 8.4 megapixels.)
+
+```bash
+# Allow 300 seconds for OCR; skip any page larger than 50 megapixels
+ocrmypdf --tesseract-timeout 300 --skip-big 50 bigfile.pdf output.pdf
+```
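+
+The megapixel figure quoted above follows from simple arithmetic. As a rough sketch (the `megapixels` helper name is ours, not part of OCRmyPDF), the pixel count of a page can be estimated from its size in inches and the scan resolution, which may help when choosing a `--skip-big` value:
+
+```bash
+# Hypothetical helper: estimate megapixels for a page of WIDTH x HEIGHT
+# inches scanned at DPI dots per inch.
+megapixels() {
+    awk -v w="$1" -v h="$2" -v dpi="$3" \
+        'BEGIN { printf "%.1f\n", (w * dpi) * (h * dpi) / 1000000 }'
+}
+
+megapixels 8.5 11 300   # US Letter at 300 DPI
+megapixels 8.5 11 600   # the same page at 600 DPI
+```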
+
+### OCR for huge images
+
+Tesseract has internal limits on the size
+of images it will process. By default,
+`--tesseract-downsample-large-images` is enabled, and OCRmyPDF will
+downsample images to fit Tesseract limits. (The limits are usually encountered
+only for scanned images of oversized media, such as large maps or blueprints exceeding
+110 cm or 43 inches in either dimension, and at high DPI.) This feature can be disabled
+using `--no-tesseract-downsample-large-images`.
+
+`--tesseract-downsample-above Npixels` adjusts the threshold at which images
+will be downsampled. By default, only images that exceed any of Tesseract's
+internal limits are downsampled (32767 pixels on either dimension).
+
+You will also need to set `--tesseract-timeout` high enough to allow
+for processing.
+
+Only the image sent for OCR is downsampled. The original image is
+preserved.
+
+```bash
+# Allow 600 seconds for OCR on huge images
+ocrmypdf --tesseract-timeout 600 \
+    --tesseract-downsample-large-images \
+    bigfile.pdf output.pdf
+
+# Downsample images above 5000 pixels on the longest dimension to
+# 5000 pixels
+ocrmypdf --tesseract-timeout 120 \
+    --tesseract-downsample-large-images \
+    --tesseract-downsample-above 5000 \
+    bigfile.pdf output_downsampled_ocr.pdf
+```
+
+### Overriding default tesseract
+
+OCRmyPDF checks the system `PATH` for the `tesseract` binary.
+
+Some relevant environment variables that influence Tesseract's behavior
+include:
+
+```{eval-rst}
+.. envvar:: TESSDATA_PREFIX
+
+   Overrides the path to Tesseract's data files. This can allow
+   simultaneous installation of the "best" and "fast" training data
+   sets. OCRmyPDF does not manage this environment variable.
+```
+
+```{eval-rst}
+.. envvar:: OMP_THREAD_LIMIT
+
+   Controls the number of threads Tesseract will use. OCRmyPDF will
+   manage this environment variable if it is not already set.
+```
+
+For example, if you have a development build of Tesseract and don't wish to
+use the system installation, you can launch OCRmyPDF as follows:
+
+```bash
+env \
+    PATH=/home/user/src/tesseract/api:$PATH \
+    TESSDATA_PREFIX=/home/user/src/tesseract \
+    ocrmypdf input.pdf output.pdf
+```
+
+In this example `TESSDATA_PREFIX` is required to redirect Tesseract to
+an alternate folder for its "tessdata" files.
+
+### Overriding other support programs
+
+In addition to tesseract, OCRmyPDF uses the following external binaries:
+
+- `gs` (Ghostscript)
+- `unpaper`
+- `pngquant`
+- `jbig2`
+
+In each case OCRmyPDF will search the `PATH` environment variable to
+locate the binaries. By modifying the `PATH` environment variable, you
+can override the binaries that OCRmyPDF uses.
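+
+To see which copy of each program would be picked up, a quick check such as the following can help. This is only a sketch using the shell's standard `command -v` lookup; the `which_tools` name is ours, and OCRmyPDF's own lookup may differ in details.
+
+```bash
+# Hypothetical helper: report where each external program resolves on PATH.
+which_tools() {
+    for tool in "$@"; do
+        if path=$(command -v "$tool"); then
+            printf '%s: %s\n' "$tool" "$path"
+        else
+            printf '%s: not found\n' "$tool"
+        fi
+    done
+}
+
+which_tools tesseract gs unpaper pngquant jbig2
+```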
+
+### Changing Tesseract configuration variables
+
+You can override Tesseract's default [control
+parameters](https://tesseract-ocr.github.io/tessdoc/tess3/ControlParams.html)
+with a configuration file.
+
+As an example, this configuration will disable Tesseract's dictionary
+for the current language. Normally the dictionary is helpful for
+interpolating words that are unclear, but it may interfere with OCR if
+the document does not contain many words (for example, a list of part
+numbers).
+
+Create a file named "no-dict.cfg" with these contents:
+
+```
+load_system_dawg 0
+language_model_penalty_non_dict_word 0
+language_model_penalty_non_freq_dict_word 0
+```
+
+then run ocrmypdf as follows (along with any other desired arguments):
+
+```bash
+ocrmypdf --tesseract-config no-dict.cfg input.pdf output.pdf
+```
+
+:::{warning}
+Some combinations of control parameters will break Tesseract or break
+assumptions that OCRmyPDF makes about Tesseract's output.
+:::
+
+### Changing page segmentation mode
+
+The directive `--tesseract-pagesegmode Nmode` forwards the desired page segmentation
+mode to Tesseract OCR. The default is 3.
+
+Page segmentation can improve OCR results when you know that a PDF ought to be
+analyzed a particular way, such as PDFs whose pages contain only a single line of
+text. For the vast majority of users, changing the page segmentation mode will only
+make things worse.
+
+As of June 2024, the Tesseract page segmentation modes are:
+
+| ID  | Description                                                                                   |
+| --- | --------------------------------------------------------------------------------------------- |
+| 0   | Orientation and script detection (OSD) only.                                                  |
+| 1   | Automatic page segmentation with OSD.                                                         |
+| 2   | Automatic page segmentation, but no OSD, or OCR. (not implemented)                            |
+| 3   | Fully automatic page segmentation, but no OSD. (Default)                                      |
+| 4   | Assume a single column of text of variable sizes.                                             |
+| 5   | Assume a single uniform block of vertically aligned text.                                     |
+| 6   | Assume a single uniform block of text.                                                        |
+| 7   | Treat the image as a single text line.                                                        |
+| 8   | Treat the image as a single word.                                                             |
+| 9   | Treat the image as a single word in a circle.                                                 |
+| 10  | Treat the image as a single character.                                                        |
+| 11  | Sparse text. Find as much text as possible in no particular order.                            |
+| 12  | Sparse text with OSD.                                                                         |
+| 13  | Raw line. Treat the image as a single text line, bypassing hacks that are Tesseract-specific. |
+
+Modes 0, 1, 2, and 12 (all of those that enable orientation and script detection)
+are not compatible with OCRmyPDF, which performs OSD in a separate step from OCR.
+Their use may interfere with `--rotate-pages` and other features.
+
+It is currently not possible to use advanced Tesseract OCR features, such as creating
+OCR information, when using Tesseract through OCRmyPDF.
+
+## Changing the PDF renderer
+
+rasterizing
+
+: Converting a PDF to an image for display.
+
+rendering
+
+: Creating a new PDF from other data (such as an existing PDF).
+
+OCRmyPDF has these PDF renderers: `sandwich` and `hocr`. The
+renderer may be selected using `--pdf-renderer`. The default is
+`auto` which lets OCRmyPDF select the renderer to use. Currently,
+`auto` always selects `hocr`.
+
+### The `hocr` renderer
+
+:::{versionchanged} 16.0.0
+:::
+
+In both renderers, a text-only layer is rendered and sandwiched (overlaid)
+on to either the original PDF page, or a newly rasterized version of the
+original PDF page (when `--force-ocr` is used). In this way, loss
+of PDF information is generally avoided. (You may need to disable PDF/A
+conversion and optimization to eliminate all lossy transformations.)
+
+The current approach used by the new hOCR renderer is a re-implementation
+of Tesseract's PDF renderer, using the same Glyphless font and general
+ideas, but fixing many technical issues that impeded it. The new hocr
+provides better text placement accuracy, avoids issues with word
+segmentation, and provides better positioning of skewed text.
+
+Using the experimental API, it is also possible to edit the OCR output
+from Tesseract, using any tool that is capable of editing hOCR files.
+
+Older versions of this renderer did not support non-Latin languages, but
+it is now universal.
+
+### The `sandwich` renderer
+
+The `sandwich` renderer uses Tesseract's text-only PDF feature,
+which produces a PDF page that lays out the OCR in invisible text.
+
+Currently some problematic PDF viewers like Mozilla PDF.js and macOS
+Preview have problems with segmenting its text output, and
+mightrunseveralwordstogether. It also does not implement right to left
+fonts (Arabic, Hebrew, Persian). The output of this renderer cannot
+be edited. The sandwich renderer is retained for testing.
+
+When image preprocessing features like `--deskew` are used, the
+original PDF will be rendered as a full page and the OCR layer will be
+placed on top.
+
+## Rendering and rasterizing options
+
+:::{versionadded} 14.3.0
+:::
+
+The `--continue-on-soft-render-error` option allows OCRmyPDF to
+proceed if a page cannot be rasterized/rendered. This is useful if you are
+trying to get the best possible OCR from a PDF that is not well-formed,
+and you are willing to accept some pages that may not visually match the
+input, and that may not OCR well.
+
+## Color conversion strategy
+
+:::{versionadded} 15.0.0
+:::
+
+OCRmyPDF uses Ghostscript to convert PDF to PDF/A. In some cases, this
+conversion requires color conversion. The default strategy is to convert
+using the `LeaveColorUnchanged` strategy, which preserves the original
+color space wherever possible (some rare color spaces might still be
+converted).
+
+Usually document scanners produce PDFs in the sRGB color space, and do
+not need to be converted, so the default strategy is appropriate.
+
+Suppose that you have a document that was prepared for professional
+printing in a Separation or CMYK color space, and text was converted to
+curves. In this case, you may want to use a different color conversion
+strategy. The `--color-conversion-strategy` option allows you to select a
+different strategy, such as `RGB`.
+
+## Return code policy
+
+OCRmyPDF writes all messages to `stderr`. `stdout` is reserved for
+piping output files. `stdin` is reserved for piping input files.
+
+The return codes generated by the OCRmyPDF are considered part of the
+stable user interface. They may be imported from
+`ocrmypdf.exceptions`.
+
+```{eval-rst}
+.. list-table:: Return codes
+    :widths: 5 35 60
+    :header-rows: 1
+
+    *   - Code
+        - Name
+        - Interpretation
+    *   - 0
+        - ``ExitCode.ok``
+        - Everything worked as expected.
+    *   - 1
+        - ``ExitCode.bad_args``
+        - Invalid arguments, exited with an error.
+    *   - 2
+        - ``ExitCode.input_file``
+        - The input file does not seem to be a valid PDF.
+    *   - 3
+        - ``ExitCode.missing_dependency``
+        - An external program required by OCRmyPDF is missing.
+    *   - 4
+        - ``ExitCode.invalid_output_pdf``
+        - An output file was created, but it does not seem to be a valid PDF. The file will be available.
+    *   - 5
+        - ``ExitCode.file_access_error``
+        - The user running OCRmyPDF does not have sufficient permissions to read the input file and write the output file.
+    *   - 6
+        - ``ExitCode.already_done_ocr``
+        - The file already appears to contain text so it may not need OCR. See output message.
+    *   - 7
+        - ``ExitCode.child_process_error``
+        - An error occurred in an external program (child process) and OCRmyPDF cannot continue.
+    *   - 8
+        - ``ExitCode.encrypted_pdf``
+        - The input PDF is encrypted. OCRmyPDF does not read encrypted PDFs. Use another program such as ``qpdf`` to remove encryption.
+    *   - 9
+        - ``ExitCode.invalid_config``
+        - A custom configuration file was forwarded to Tesseract using ``--tesseract-config``, and Tesseract rejected this file.
+    *   - 10
+        - ``ExitCode.pdfa_conversion_failed``
+        - A valid PDF was created, but PDF/A conversion failed. The file will be available.
+    *   - 15
+        - ``ExitCode.other_error``
+        - Some other error occurred.
+    *   - 130
+        - ``ExitCode.ctrl_c``
+        - The program was interrupted by pressing Ctrl+C.
+
+```
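+
+Because the codes are stable, a wrapper script can branch on them. The sketch below is illustrative only: the `describe_exit` function and its messages are ours, and it covers just a few of the codes from the table above.
+
+```bash
+# Hypothetical helper: turn an OCRmyPDF exit status into a short message.
+describe_exit() {
+    case "$1" in
+        0)   echo "ok" ;;
+        2)   echo "input is not a valid PDF" ;;
+        6)   echo "already has text; consider --skip-text or --redo-ocr" ;;
+        8)   echo "encrypted PDF; remove encryption first" ;;
+        130) echo "interrupted" ;;
+        *)   echo "failed with exit code $1" ;;
+    esac
+}
+
+# Example use in a script (file names are placeholders):
+#   ocrmypdf input.pdf output.pdf; describe_exit "$?"
+describe_exit 6
+```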
+
+(tmpdir)=
+
+## Changing temporary storage location
+
+OCRmyPDF generates many temporary files during processing.
+
+To change where temporary files are stored, change the `TMPDIR`
+environment variable for ocrmypdf's environment. (Python's
+`tempfile.gettempdir()` returns the root directory in which temporary
+files will be stored.) For example, one could redirect `TMPDIR` to a
+large RAM disk to avoid wear on HDD/SSD and potentially improve
+performance.
+
+On Windows, the `TEMP` environment variable is used instead.
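+
+As a sketch, the variable can be set for a single invocation. The scratch directory here is created with `mktemp -d` purely for illustration, `python3` is assumed to be on the `PATH`, and the commented `ocrmypdf` line uses placeholder file names.
+
+```bash
+# Create a scratch directory and point TMPDIR at it for one command only.
+scratch=$(mktemp -d)
+
+# Confirm what Python would report as its temporary directory root.
+TMPDIR="$scratch" python3 -c 'import tempfile; print(tempfile.gettempdir())'
+
+# Then run OCRmyPDF the same way:
+#   TMPDIR="$scratch" ocrmypdf input.pdf output.pdf
+```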
+
+## Debugging the intermediate files
+
+OCRmyPDF normally saves its intermediate results to a temporary folder
+and deletes this folder when it exits, whether it succeeded or failed.
+
+If the `--keep-temporary-files` (`-k`) argument is issued on the
+command line, OCRmyPDF will keep the temporary folder and print the location,
+whether it succeeded or failed. An example message is:
+
+```none
+Temporary working files retained at:
+/tmp/ocrmypdf.io.u20wpz07
+```
+
+When OCRmyPDF is launched as a snap, this corresponds to the snap filesystem, for instance:
+
+> /tmp/snap-private-tmp/snap.ocrmypdf/tmp/ocrmypdf.io.u20wpz07
+
+The organization of this folder is an implementation detail and subject
+to change between releases. However the general organization is that
+working files on a per page basis have the page number as a prefix
+(starting with page 1), an infix indicates the processing stage, and a
+suffix indicates the file type. Some important files include:
+
+- `_rasterize.png` - what the input page looks like
+- `_ocr.png` - the file that is sent to Tesseract for OCR; depending
+  on arguments this may differ from the presentation image
+- `_pp_deskew.png` - the image, after deskewing
+- `_pp_clean.png` - the image, after cleaning with unpaper
+- `_ocr_hocr.pdf` - the OCR file; appears as a blank page with invisible
+  text embedded
+- `_ocr_hocr.txt` - the OCR text (not necessarily all text on the page,
+  if the page is mixed format)
+- `fix_docinfo.pdf` - a temporary file created to fix the PDF DocumentInfo
+  data structure
+- `graft_layers.pdf` - the rendered PDF with OCR layers grafted on
+- `pdfa.pdf` - `graft_layers.pdf` after conversion to PDF/A
+- `pdfa.ps` - a PostScript file used by Ghostscript for PDF/A conversion
+- `optimize.pdf` - the PDF generated before optimization
+- `optimize.out.pdf` - the PDF generated by optimization
+- `origin` - the input file
+- `origin.pdf` - the input file or the input image converted to PDF
+- `images/*` - images extracted during the optimization process; here
+  the prefix indicates a PDF object ID not a page number
--- a/docs/advanced.rst
+++ b/docs/advanced.rst
@@ -1,486 +0,0 @@
-.. SPDX-FileCopyrightText: 2022 James R. Barlow
-.. SPDX-License-Identifier: CC-BY-SA-4.0
-
-=================
-Advanced features
-=================
-
-Control of unpaper
-==================
-
-OCRmyPDF uses ``unpaper`` to provide the implementation of the
-``--clean`` and ``--clean-final`` arguments.
-`unpaper <https://github.com/Flameeyes/unpaper/blob/main/doc/basic-concepts.md>`__
-provides a variety of image processing filters to improve images.
-
-By default, OCRmyPDF uses only ``unpaper`` arguments that were found to
-be safe to use on almost all files without having to inspect every page
-of the file afterwards. This is particularly true when only ``--clean``
-is used, since that instructs OCRmyPDF to only clean the image before
-OCR and not the final image.
-
-However, if you wish to use the more aggressive options in ``unpaper``,
-you may use ``--unpaper-args '...'`` to override the OCRmyPDF's defaults
-and forward other arguments to unpaper. This option will forward
-arguments to ``unpaper`` without any knowledge of what that program
-considers to be valid arguments. The string of arguments must be quoted
-as shown in the examples below. No filename arguments may be included.
-OCRmyPDF will assume it can append input and output filename of
-intermediate images to the ``--unpaper-args`` string.
-
-In this example, we tell ``unpaper`` to expect two pages of text on a
-sheet (image), such as occurs when two facing pages of a book are
-scanned. ``unpaper`` uses this information to deskew each independently
-and clean up the margins of both.
-
-.. code-block:: bash
-
-    ocrmypdf --clean --clean-final --unpaper-args '--layout double' input.pdf output.pdf
-    ocrmypdf --clean --clean-final --unpaper-args '--layout double --no-noisefilter' input.pdf output.pdf
-
-.. warning::
-
-   Some ``unpaper`` features will reposition text within the image.
-   ``--clean-final`` is recommended to avoid this issue.
-
-.. warning::
-
-   Some ``unpaper`` features cause multiple input or output files to be
-   consumed or produced. OCRmyPDF requires ``unpaper`` to consume one
-   file and produce one file; errors will result if this assumption is not
-   met.
-
-.. note::
-
-   ``unpaper`` uses uncompressed PBM/PGM/PPM files for its intermediate
-   files. For large images or documents, it can take a lot of temporary
-   disk space.
-
-Control of OCR options
-======================
-
-OCRmyPDF provides many features to control the behavior of the OCR
-engine, Tesseract.
-
-When OCR is skipped
-------------------
-
-If a page in a PDF seems to have text, by default OCRmyPDF will exit
-without modifying the PDF. This is to ensure that PDFs that were
-previously OCRed or were "born digital" rather than scanned are not
-processed.
-
-If ``--skip-text`` is issued, then no image processing or OCR will be
-performed on pages that already have text. The page will be copied to
-the output. This may be useful for documents that contain both "born
-digital" and scanned content, or to use OCRmyPDF to normalize and
-convert to PDF/A regardless of their contents.
-
-If ``--redo-ocr`` is issued, then a detailed text analysis is performed.
-Text is categorized as either visible or invisible. Invisible text (OCR)
-is stripped out. Then an image of each page is created with visible text
-masked out. The page image is sent for OCR, and any additional text is
-inserted as OCR. If a file contains a mix of text and bitmap images that
-contain text, OCRmyPDF will locate the additional text in images without
-disrupting the existing text. Some PDF OCR solutions render text as
-technically printable or visible in some way, perhaps by drawing it and
-then painting over it. OCRmyPDF cannot distinguish this type of OCR
-text from real text, so it will not be "redone".
-
-If ``--force-ocr`` is issued, then all pages will be rasterized to
-images, discarding any hidden OCR text, rasterizing any printable
-text, and flattening form fields or interactive objects into their visual
-representation. This is useful for redoing OCR, for fixing OCR text
-with a damaged character map (text is selectable but not searchable),
-and destroying redacted information.
-
-Time and image size limits
--------------------------
-
-By default, OCRmyPDF permits tesseract to run for three minutes (180
-seconds) per page. This is usually more than enough time to find all
-text on a reasonably sized page with modern hardware.
-
-If a page is skipped, it will be inserted without OCR. If preprocessing
-was requested, the preprocessed image layer will be inserted.
-
-If you want to adjust the amount of time spent on OCR, change
-``--tesseract-timeout``. You can also automatically skip images that
-exceed a certain number of megapixels with ``--skip-big``. (A 300 DPI,
-8.5×11" page image is 8.4 megapixels.)
-
-.. code-block:: bash
-
-    # Allow 300 seconds for OCR; skip any page larger than 50 megapixels
-    ocrmypdf --tesseract-timeout 300 --skip-big 50 bigfile.pdf output.pdf
-
-OCR for huge images
-------------------
-
-Tesseract has internal limits on the size
-of images it will process. By default,
-``--tesseract-downsample-large-images`` is enabled, and OCRmyPDF will
-downsample images to fit Tesseract limits. (The limits are usually encountered
-only for scanned images of oversized media, such as large maps or blueprints exceeding
-110 cm or 43 inches in either dimension, and at high DPI.) This feature can disabled
-using ``--no-tesseract-downsample-large-images``.
-
-``--tesseract-downsample-above Npixels`` adjusts the threshold at which images
-will be downsampled. By default, only images that exceed any of Tesseract's
-internal limits are downsampled (32767 pixels on either dimension).
-
-You will also need to set ``--tesseract-timeout`` high enough to allow
-for processing.
-
-Only the image sent for OCR is downsampled. The original image is
-preserved.
-
-.. code-block:: bash
-
-    # Allow 600 seconds for OCR on huge images
-    ocrmypdf --tesseract-timeout 600 \
-        --tesseract-downsample-large-images \
-        bigfile.pdf output.pdf
-
-    # Downsample images above 5000 pixels on the longest dimension to
-    # 5000 pixels
-    ocrmypdf --tesseract-timeout 120 \
-        --tesseract-downsample-large-images \
-        --tesseract-downsample-above 5000 \
-        bigfile.pdf output_downsampled_ocr.pdf
-
-
-Overriding default tesseract
----------------------------
-
-OCRmyPDF checks the system ``PATH`` for the ``tesseract`` binary.
-
-Some relevant environment variables that influence Tesseract's behavior
-include:
-
-.. envvar:: TESSDATA_PREFIX
-
-   Overrides the path to Tesseract's data files. This can allow
-   simultaneous installation of the "best" and "fast" training data
-   sets. OCRmyPDF does not manage this environment variable.
-
-.. envvar:: OMP_THREAD_LIMIT
-
-   Controls the number of threads Tesseract will use. OCRmyPDF will
-   manage this environment variable if it is not already set.
-
-For example, if you have a development build of Tesseract don't wish to
-use the system installation, you can launch OCRmyPDF as follows:
-
-.. code-block:: bash
-
-    env \
-        PATH=/home/user/src/tesseract/api:$PATH \
-        TESSDATA_PREFIX=/home/user/src/tesseract \
-        ocrmypdf input.pdf output.pdf
-
-In this example ``TESSDATA_PREFIX`` is required to redirect Tesseract to
-an alternate folder for its "tessdata" files.
-
-Overriding other support programs
---------------------------------
-
-In addition to tesseract, OCRmyPDF uses the following external binaries:
-
-  ``gs`` (Ghostscript)
-  ``unpaper``
-  ``pngquant``
-  ``jbig2``
-
-In each case OCRmyPDF will search the ``PATH`` environment variable to
-locate the binaries. By modifying the ``PATH`` environment variable, you
-can override the binaries that OCRmyPDF uses.
-
-Changing Tesseract configuration variables
------------------------------------------
-
-You can override Tesseract's default `control
-parameters <https://tesseract-ocr.github.io/tessdoc/tess3/ControlParams.html>`__
-with a configuration file.
-
-As an example, this configuration will disable Tesseract's dictionary
-for current language. Normally the dictionary is helpful for
-interpolating words that are unclear, but it may interfere with OCR if
-the document does not contain many words (for example, a list of part
-numbers).
-
-Create a file named "no-dict.cfg" with these contents:
-
-::
-
-    load_system_dawg 0
-    language_model_penalty_non_dict_word 0
-    language_model_penalty_non_freq_dict_word 0
-
-then run ocrmypdf as follows (along with any other desired arguments):
-
-.. code-block:: bash
-
-    ocrmypdf --tesseract-config no-dict.cfg input.pdf output.pdf
-
-.. warning::
-
-   Some combinations of control parameters will break Tesseract or break
-   assumptions that OCRmyPDF makes about Tesseract's output.
-
-Changing page segmentation mode
-------------------------------
-
-The directive ``--tesseract-pagesegmode Nmode`` forwards the desired page segmentation
-mode to Tesseract OCR. The default is 3.
-
-Page segmentation can improve OCR results when you know that a PDF ought to be
-analyzed a particular way, such as PDFs whose pages contain only a single line of
-text. For the vast majority of users, changing the page segmentation mode will only
-make things worse.
-
-As of June 2024, the Tesseract page segmentation modes are:
-
-+-----+----------------------------------------------------------------------------------+
-| ID  | Description                                                                      |
-+=====+==================================================================================+
-|  0  | Orientation and script detection (OSD) only.                                     |
-+-----+----------------------------------------------------------------------------------+
-|  1  | Automatic page segmentation with OSD.                                            |
-+-----+----------------------------------------------------------------------------------+
-|  2  | Automatic page segmentation, but no OSD, or OCR. (not implemented)               |
-+-----+----------------------------------------------------------------------------------+
-|  3  | Fully automatic page segmentation, but no OSD. (Default)                         |
-+-----+----------------------------------------------------------------------------------+
-|  4  | Assume a single column of text of variable sizes.                                |
-+-----+----------------------------------------------------------------------------------+
-|  5  | Assume a single uniform block of vertically aligned text.                        |
-+-----+----------------------------------------------------------------------------------+
-|  6  | Assume a single uniform block of text.                                           |
-+-----+----------------------------------------------------------------------------------+
-|  7  | Treat the image as a single text line.                                           |
-+-----+----------------------------------------------------------------------------------+
-|  8  | Treat the image as a single word.                                                |
-+-----+----------------------------------------------------------------------------------+
-|  9  | Treat the image as a single word in a circle.                                    |
-+-----+----------------------------------------------------------------------------------+
-| 10  | Treat the image as a single character.                                           |
-+-----+----------------------------------------------------------------------------------+
-| 11  | Sparse text. Find as much text as possible in no particular order.               |
-+-----+----------------------------------------------------------------------------------+
-| 12  | Sparse text with OSD.                                                            |
-+-----+----------------------------------------------------------------------------------+
-| 13  | Raw line. Treat the image as a single text line, bypassing hacks that are        |
-|     | Tesseract-specific.                                                              |
-+-----+----------------------------------------------------------------------------------+
-
-Modes 0, 1, 2, and 12 (all of those that enable orientation and script detection)
-are not compatible with OCRmyPDF, which performs OSD in a separate step from OCR.
-Their use may interfere with ``--rotate-pages`` and other features.
-
-It is currently not possible to use advanced Tesseract OCR features, such as creating
-OCR information, when using Tesseract through OCRmyPDF.
-
-Changing the PDF renderer
-=========================
-
-rasterizing
-  Converting a PDF to an image for display.
-
-rendering
-  Creating a new PDF from other data (such as an existing PDF).
-
-OCRmyPDF has these PDF renderers: ``sandwich`` and ``hocr``. The
-renderer may be selected using ``--pdf-renderer``. The default is
-``auto`` which lets OCRmyPDF select the renderer to use. Currently,
-``auto`` always selects ``hocr``.
-
-The ``hocr`` renderer
---------------------
-
-.. versionchanged:: 16.0.0
-
-In both renderers, a text-only layer is rendered and sandwiched (overlaid)
-on to either the original PDF page, or newly rasterized version of the
-original PDF page (when ``--force-ocr`` is used). In this way, loss
-of PDF information is generally avoided. (You may need to disable PDF/A
-conversion and optimization to eliminate all lossy transformations.)
-
-The current approach used by the new hOCR renderer is a re-implementation
-of Tesseract's PDF renderer, using the same Glyphless font and general
-ideas, but fixing many technical issues that impeded it. The new hocr
-provides better text placement accuracy, avoids issues with word
-segmentation, and provides better positioning of skewed text.
-
-Using the experimental API, it is also possible to edit the OCR output
-from Tesseract, using any tool that is capable of editing hOCR files.
-
-Older versions of this renderer did not support non-Latin languages, but
-it is now universal.
-
-The ``sandwich`` renderer
-------------------------
-
-The ``sandwich`` renderer uses Tesseract's text-only PDF feature,
-which produces a PDF page that lays out the OCR in invisible text.
-
-Currently some problematic PDF viewers like Mozilla PDF.js and macOS
-Preview have problems with segmenting its text output, and
-mightrunseveralwordstogether. It also does not implement right to left
-fonts (Arabic, Hebrew, Persian). The output of this renderer cannot
-be edited. The sandwich renderer is retained for testing.
-
-When image preprocessing features like ``--deskew`` are used, the
-original PDF will be rendered as a full page and the OCR layer will be
-placed on top.
-
-Rendering and rasterizing options
-=================================
-
-.. versionadded:: 14.3.0
-
-The ``--continue-on-soft-render-error`` option allows OCRmyPDF to
-proceed if a page cannot be rasterized/rendered. This is useful if you are
-trying to get the best possible OCR from a PDF that is not well-formed,
-and you are willing to accept some pages that may not visually match the
-input, and that may not OCR well.
-
-Color conversion strategy
-=========================
-
-.. versionadded:: 15.0.0
-
-OCRmyPDF uses Ghostscript to convert PDF to PDF/A. In some cases, this
-conversion requires color conversion. The default strategy is to convert
-using the ``LeaveColorUnchanged`` strategy, which preserves the original
-color space wherever possible (some rare color spaces might still be
-converted).
-
-Usually document scanners produce PDFs in the sRGB color space, and do
-not need to be converted, so the default strategy is appropriate.
-
-Suppose that you have a document that was prepared for professional
-printing in a Separation or CMYK color space, and text was converted to
-curves. In this case, you may want to use a different color conversion
-strategy. The ``--color-conversion-strategy`` option allows you to select a
-different strategy, such as ``RGB``.
-
-Return code policy
-==================
-
-OCRmyPDF writes all messages to ``stderr``. ``stdout`` is reserved for
-piping output files. ``stdin`` is reserved for piping input files.
-
-The return codes generated by the OCRmyPDF are considered part of the
-stable user interface. They may be imported from
-``ocrmypdf.exceptions``.
-
-.. list-table:: Return codes
-    :widths: 5 35 60
-    :header-rows: 1
-
-    *	- Code
-        - Name
-        - Interpretation
-    *	- 0
-        - ``ExitCode.ok``
-        - Everything worked as expected.
-    *	- 1
-        - ``ExitCode.bad_args``
-        - Invalid arguments, exited with an error.
-    *	- 2
-        - ``ExitCode.input_file``
-        - The input file does not seem to be a valid PDF.
-    *	- 3
-        - ``ExitCode.missing_dependency``
-        - An external program required by OCRmyPDF is missing.
-    *	- 4
-        - ``ExitCode.invalid_output_pdf``
-        - An output file was created, but it does not seem to be a valid PDF. The file will be available.
-    *	- 5
-        - ``ExitCode.file_access_error``
-        - The user running OCRmyPDF does not have sufficient permissions to read the input file and write the output file.
-    *	- 6
-        - ``ExitCode.already_done_ocr``
-        - The file already appears to contain text so it may not need OCR. See output message.
-    *	- 7
-        - ``ExitCode.child_process_error``
-        - An error occurred in an external program (child process) and OCRmyPDF cannot continue.
-    *	- 8
-        - ``ExitCode.encrypted_pdf``
-        - The input PDF is encrypted. OCRmyPDF does not read encrypted PDFs. Use another program such as ``qpdf`` to remove encryption.
-    *	- 9
-        - ``ExitCode.invalid_config``
-        - A custom configuration file was forwarded to Tesseract using ``--tesseract-config``, and Tesseract rejected this file.
-    *   - 10
-        - ``ExitCode.pdfa_conversion_failed``
-        - A valid PDF was created, PDF/A conversion failed. The file will be available.
-    *	- 15
-        - ``ExitCode.other_error``
-        - Some other error occurred.
-    *	- 130
-        - ``ExitCode.ctrl_c``
-        - The program was interrupted by pressing Ctrl+C.
-
-
-.. _tmpdir:
-
-Changing temporary storage location
-===================================
-
-OCRmyPDF generates many temporary files during processing.
-
-To change where temporary files are stored, change the ``TMPDIR``
-environment variable for ocrmypdf's environment. (Python's
-``tempfile.gettempdir()`` returns the root directory in which temporary
-files will be stored.) For example, one could redirect ``TMPDIR`` to a
-large RAM disk to avoid wear on HDD/SSD and potentially improve
-performance.
-
-On Windows, the ``TEMP`` environment variable is used instead.
-
-Debugging the intermediate files
-================================
-
-OCRmyPDF normally saves its intermediate results to a temporary folder
-and deletes this folder when it exits, whether it succeeded or failed.
-
-If the ``--keep-temporary-files`` (``-k``) argument is issued on the
-command line, OCRmyPDF will keep the temporary folder and print the location,
-whether it succeeded or failed. An example message is:
-
-.. code-block:: none
-
-    Temporary working files retained at:
-    /tmp/ocrmypdf.io.u20wpz07
-
-When OCRmyPDF is launched as a snap, this corresponds to the snap filesystem, for instance:
-
-    /tmp/snap-private-tmp/snap.ocrmypdf/tmp/ocrmypdf.io.u20wpz07
-
-The organization of this folder is an implementation detail and subject
-to change between releases. However the general organization is that
-working files on a per page basis have the page number as a prefix
-(starting with page 1), an infix indicates the processing stage, and a
-suffix indicates the file type. Some important files include:
-
-  ``_rasterize.png`` - what the input page looks like
-  ``_ocr.png`` - the file that is sent to Tesseract for OCR; depending
-   on arguments this may differ from the presentation image
-  ``_pp_deskew.png`` - the image, after deskewing
-  ``_pp_clean.png`` - the image, after cleaning with unpaper
-  ``_ocr_hocr.pdf`` - the OCR file; appears as a blank page with invisible
-   text embedded
-  ``_ocr_hocr.txt`` - the OCR text (not necessarily all text on the page,
-   if the page is mixed format)
-  ``fix_docinfo.pdf`` - a temporary file created to fix the PDF DocumentInfo
-   data structure
-  ``graft_layers.pdf`` - the rendered PDF with OCR layers grafted on
-  ``pdfa.pdf`` - ``graft_layers.pdf`` after conversion to PDF/A
-  ``pdfa.ps`` - a PostScript file used by Ghostscript for PDF/A conversion
-  ``optimize.pdf`` - the PDF generated before optimization
-  ``optimize.out.pdf`` - the PDF generated by optimization
-  ``origin`` - the input file
-  ``origin.pdf`` - the input file or the input image converted to PDF
-  ``images/*`` - images extracted during the optimization process; here
-   the prefix indicates a PDF object ID not a page number
--- a/docs/api.rst
+++ b/docs/api.rst
@@ -1,10 +1,7 @@
-.. SPDX-FileCopyrightText: 2022 James R. Barlow
-..
-.. SPDX-License-Identifier: CC-BY-SA-4.0
+% SPDX-FileCopyrightText: 2022 James R. Barlow
+% SPDX-License-Identifier: CC-BY-SA-4.0

-======================
-Using the OCRmyPDF API
-======================
+# Using the OCRmyPDF API

 OCRmyPDF originated as a command line program and continues to have this
 legacy, but parts of it can be imported and used in other Python
@@ -13,100 +10,95 @@ applications.
 Some applications may want to consider running ocrmypdf from a
 subprocess call anyway, as this provides isolation of its activities.

-Example
-=======
+## Example

 OCRmyPDF provides one high-level function to run its main engine from an
 application. The parameters are symmetric to the command line arguments
 and largely have the same functions.

-.. code-block:: python
+```python
+import ocrmypdf

-    import ocrmypdf
-
-    if __name__ == '__main__':  # To ensure correct behavior on Windows and macOS
-        ocrmypdf.ocr('input.pdf', 'output.pdf', deskew=True)
+if __name__ == '__main__':  # To ensure correct behavior on Windows and macOS
+    ocrmypdf.ocr('input.pdf', 'output.pdf', deskew=True)
+```

 With some exceptions, all of the command line arguments are available
 and may be passed as equivalent keywords.

-A few differences are that ``verbose`` and ``quiet`` are not available.
+A few differences are that `verbose` and `quiet` are not available.
 Instead, output should be managed by configuring logging.

-Parent process requirements
---------------------------
+### Parent process requirements

-The :func:`ocrmypdf.ocr` function runs OCRmyPDF similar to command line
+The {func}`ocrmypdf.ocr` function runs OCRmyPDF similar to command line
 execution. To do this, it will:

 - create worker processes or threads
 - manage the signal flags of its worker processes
 - execute other subprocesses (forking and executing other programs)

-The Python process that calls :func:`ocrmypdf.ocr()` must be sufficiently
+The Python process that calls {func}`ocrmypdf.ocr()` must be sufficiently
 privileged to perform these actions.

 There currently is no option to manage how jobs are scheduled other
-than the argument ``jobs=`` which will limit the number of worker
+than the argument `jobs=` which will limit the number of worker
 processes.

-Creating a child process to call :func:`ocrmypdf.ocr()` is suggested. That
+Creating a child process to call {func}`ocrmypdf.ocr()` is suggested. That
 way your application will survive and remain interactive even if
 OCRmyPDF fails for any reason. For example:

-.. code-block:: python
+```python
+from multiprocessing import Process
+
+import ocrmypdf

-    from multiprocessing import Process
+def ocrmypdf_process():
+    ocrmypdf.ocr('input.pdf', 'output.pdf')

-    def ocrmypdf_process():
-        ocrmypdf.ocr('input.pdf', 'output.pdf')
+def call_ocrmypdf_from_my_app():
+    p = Process(target=ocrmypdf_process)
+    p.start()
+    p.join()
+```

-    def call_ocrmypdf_from_my_app():
-        p = Process(target=ocrmypdf_process)
-        p.start()
-        p.join()
-
-Programs that call :func:`ocrmypdf.ocr()` should also install a SIGBUS signal
+Programs that call {func}`ocrmypdf.ocr()` should also install a SIGBUS signal
 handler (except on Windows), to raise an exception if access to a memory
 mapped file fails. OCRmyPDF may use memory mapping.

-:func:`ocrmypdf.ocr()` will take a threading lock to prevent multiple runs of itself
+{func}`ocrmypdf.ocr()` will take a threading lock to prevent multiple runs of itself
 in the same Python interpreter process. This is not thread-safe, because of how
 OCRmyPDF's plugins and Python's library import system work. If you need to parallelize
 OCRmyPDF, use processes.

-.. warning::
+:::{warning}
+On Windows and macOS, the script that calls {func}`ocrmypdf.ocr()` must be
+protected by an "ifmain" guard (`if __name__ == '__main__'`). If you do
+not take at least one of these steps, process semantics will prevent
+OCRmyPDF from working correctly.
+:::

-    On Windows and macOS, the script that calls :func:`ocrmypdf.ocr()` must be
-    protected by an "ifmain" guard (``if __name__ == '__main__'``). If you do
-    not take at least one of these steps, process semantics will prevent
-    OCRmyPDF from working correctly.
+### Logging

-Logging
-------
-
-OCRmyPDF will log under loggers named ``ocrmypdf``. In addition, it
-imports ``pdfminer`` and ``PIL``, both of which post log messages under
+OCRmyPDF will log under loggers named `ocrmypdf`. In addition, it
+imports `pdfminer` and `PIL`, both of which post log messages under
 those logging namespaces.

 You can configure the logging as desired for your application or call
-:func:`ocrmypdf.configure_logging` to configure logging the same way
-OCRmyPDF itself does. The command line parameters such as ``--quiet``
-and ``--verbose`` have no equivalents in the API; you must use the
+{func}`ocrmypdf.configure_logging` to configure logging the same way
+OCRmyPDF itself does. The command line parameters such as `--quiet`
+and `--verbose` have no equivalents in the API; you must use the
 provided configuration function or do configuration in a way that suits
 your use case.
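
If you configure logging yourself, one possible arrangement is sketched below. It uses only the standard `logging` module and the logger names mentioned above. The function name `configure_app_logging`, the message format, and the decision to quiet `pdfminer` and `PIL` to warnings are assumptions made for illustration.

```python
import logging


def configure_app_logging(level: int = logging.INFO) -> logging.Logger:
    """Send OCRmyPDF log records to stderr and quiet its noisier imports."""
    handler = logging.StreamHandler()  # StreamHandler writes to sys.stderr by default
    handler.setFormatter(logging.Formatter("%(name)s %(levelname)s: %(message)s"))

    ocr_log = logging.getLogger("ocrmypdf")
    ocr_log.setLevel(level)
    ocr_log.addHandler(handler)

    # These libraries log under their own namespaces.
    for name in ("pdfminer", "PIL"):
        logging.getLogger(name).setLevel(logging.WARNING)
    return ocr_log
```

Because the handler is attached only to the `ocrmypdf` logger, the rest of the application's logging configuration is left untouched.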

-Progress monitoring
-------------------
+### Progress monitoring

-OCRmyPDF uses the ``rich`` package to implement its progress bars.
-:func:`ocrmypdf.configure_logging` will set up logging output to
-``sys.stderr`` in a way that is compatible with the display of the
-progress bar. Use ``ocrmypdf.ocr(...progress_bar=False)`` to disable
+OCRmyPDF uses the `rich` package to implement its progress bars.
+{func}`ocrmypdf.configure_logging` will set up logging output to
+`sys.stderr` in a way that is compatible with the display of the
+progress bar. Use `ocrmypdf.ocr(...progress_bar=False)` to disable
 the progress bar.

-Standard output
---------------
+### Standard output

 OCRmyPDF is strict about not writing to standard output so that
 users can safely use it in a pipeline and produce a valid output
@@ -116,12 +108,11 @@ behavior and support piping to a file. Another benefit of running
 OCRmyPDF in a child process, as recommended above, is that it will
 not interfere with the parent process's standard output.

-Exceptions
----------
+### Exceptions

-OCRmyPDF may throw standard Python exceptions, ``ocrmypdf.exceptions.*``
+OCRmyPDF may throw standard Python exceptions, `ocrmypdf.exceptions.*`
 exceptions, some exceptions related to multiprocessing, and
-:exc:`KeyboardInterrupt`. The parent process should provide an exception
+{exc}`KeyboardInterrupt`. The parent process should provide an exception
 handler. OCRmyPDF will clean up its temporary files and worker processes
 automatically when an exception occurs.

--- a/docs/apiref.rst
+++ b/docs/apiref.rst
@@ -1,56 +1,60 @@
-.. SPDX-FileCopyrightText: 2022 James R. Barlow
-..
-.. SPDX-License-Identifier: CC-BY-SA-4.0
+% SPDX-FileCopyrightText: 2022 James R. Barlow
+% SPDX-License-Identifier: CC-BY-SA-4.0

-=============
-API reference
-=============
+# API reference

 This page summarizes the rest of the public API. Generally speaking this
 should be mainly of interest to plugin developers.

-ocrmypdf.api
-============
+## ocrmypdf.api

+```{eval-rst}
 .. automodule:: ocrmypdf.api
    :members:
+```

-ocrmypdf.exceptions
-===================
+## ocrmypdf.exceptions

+```{eval-rst}
 .. automodule:: ocrmypdf.exceptions
    :members:
    :undoc-members:
+```

-ocrmypdf.helpers
-================
+## ocrmypdf.helpers

+```{eval-rst}
 .. automodule:: ocrmypdf.helpers
    :members:
    :noindex: deprecated

    .. autodecorator:: deprecated
+```

-ocrmypdf.hocrtransform
-======================
+## ocrmypdf.hocrtransform

+```{eval-rst}
 .. automodule:: ocrmypdf.hocrtransform
    :members:
+```

-ocrmypdf.pdfa
-=============
+## ocrmypdf.pdfa

+```{eval-rst}
 .. automodule:: ocrmypdf.pdfa
    :members:
+```

-ocrmypdf.quality
-================
+## ocrmypdf.quality

+```{eval-rst}
 .. automodule:: ocrmypdf.quality
    :members:
+```

-ocrmypdf.subprocess
-===================
+## ocrmypdf.subprocess

+```{eval-rst}
 .. automodule:: ocrmypdf.subprocess
    :members:
+```
--- a/docs/conf.py
+++ b/docs/conf.py
@@ -45,7 +45,7 @@ extensions = [
    'sphinx_issues',
 ]

-myst_enable_extensions = ['colon_fence', 'attrs_block', 'attrs_inline']
+myst_enable_extensions = ['colon_fence', 'attrs_block', 'attrs_inline', 'substitution']

 # Extension settings
 intersphinx_mapping = {'python': ('https://docs.python.org/3', None)}
--- a/docs/cookbook.md
+++ b/docs/cookbook.md
@@ -1,45 +1,43 @@
 % SPDX-FileCopyrightText: 2025 James R. Barlow
 % SPDX-License-Identifier: CC-BY-SA-4.0

-Cookbook
-========
+# Cookbook

-Basic examples
--------------
+## Basic examples

 ### Help!

 ocrmypdf has built-in help.

-:::{code} bash
+```bash
 ocrmypdf --help
-:::
+```

 ### Add an OCR layer and convert to PDF/A

-:::{code} bash
+```bash
 ocrmypdf input.pdf output.pdf
-:::
+```

 ### Add an OCR layer and output a standard PDF

-:::{code} bash
+```bash
 ocrmypdf --output-type pdf input.pdf output.pdf
-:::
+```

 ### Create a PDF/A with all color and grayscale images converted to JPEG

-:::{code} bash
+```bash
 ocrmypdf --output-type pdfa --pdfa-image-compression jpeg input.pdf output.pdf
-:::
+```

 ### Modify a file in place

 The file will only be overwritten if OCRmyPDF is successful.

-:::{code} bash
+```bash
 ocrmypdf myfile.pdf myfile.pdf
-:::
+```

 ### Correct page rotation

@@ -47,9 +45,9 @@ OCR will attempt to automatic correct the rotation of each page. This
 can help fix a scanning job that contains a mix of landscape and
 portrait pages.

-:::{code} bash
+```bash
 ocrmypdf --rotate-pages myfile.pdf myfile.pdf
-:::
+```

 You can increase (decrease) the parameter `--rotate-pages-threshold` to
 make page rotation more (less) aggressive. The threshold number is the
@@ -70,10 +68,10 @@ angle is wrong.
 OCRmyPDF assumes the document is in English unless told otherwise. OCR
 quality may be poor if the wrong language is used.

-:::{code} bash
+```bash
 ocrmypdf -l fra LeParisien.pdf LeParisien.pdf
 ocrmypdf -l eng+fra Bilingual-English-French.pdf Bilingual-English-French.pdf
-:::
+```

 Language packs must be installed for all languages specified. See
 `Installing additional language packs <lang-packs>`{.interpreted-text
@@ -87,9 +85,9 @@ language when it is unknown.
 This produces a file named \"output.pdf\" and a companion text file
 named \"output.txt\".

-:::{code} bash
+```bash
 ocrmypdf --sidecar output.txt input.pdf output.pdf
-:::
+```

 :::{note}
 The sidecar file contains the **OCR text** found by OCRmyPDF. If the
@@ -114,14 +112,14 @@ use a program like Poppler\'s `pdftotext` or `pdfgrep`.
 If you are starting with images, you can just use Tesseract directly to
 convert images to PDFs:

-:::{code} bash
+```bash
 tesseract my-image.jpg output-prefix pdf
-:::
+```

-:::{code} bash
+```bash
 # When there are multiple images
 tesseract text-file-containing-list-of-image-filenames.txt output-prefix pdf
-:::
+```

 Tesseract\'s PDF output is quite good -- OCRmyPDF uses it internally, in
 some cases. However, OCRmyPDF has many features not available in
@@ -134,9 +132,9 @@ You can also use a program like
 images to PDFs, and then pipe the results to run ocrmypdf. The `-` tells
 ocrmypdf to read standard input.

-:::{code} bash
+```bash
 img2pdf my-images*.jpg | ocrmypdf - myfile.pdf
-:::
+```

 `img2pdf` is recommended because it does an excellent job at generating
 PDFs without transcoding images.
@@ -148,9 +146,9 @@ own. If the resolution (dots per inch, DPI) of an image is not set or is
 incorrect, it can be overridden with `--image-dpi`. (As 1 inch is 2.54
 cm, 1 dpi = 0.39 dpcm).

-:::{code} bash
+```bash
 ocrmypdf --image-dpi 300 image.png myfile.pdf
-:::
+```
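
If your scanner reports resolution in dots per centimetre, the conversion mentioned above can be sketched as follows. The helper names `dpcm_to_dpi` and `dpi_to_dpcm` are hypothetical and not part of OCRmyPDF.

```python
CM_PER_INCH = 2.54


def dpcm_to_dpi(dpcm: float) -> int:
    """Convert dots per centimetre to the nearest whole dots per inch."""
    return round(dpcm * CM_PER_INCH)


def dpi_to_dpcm(dpi: float) -> float:
    """Convert dots per inch to dots per centimetre."""
    return dpi / CM_PER_INCH
```

For example, a scan at 118 dpcm corresponds to roughly 300 dpi, which is the value to pass to `--image-dpi`.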

 If you have multiple images, you must use `img2pdf` to convert the
 images to PDF.
@@ -161,8 +159,9 @@ We caution against using ImageMagick or Ghostscript to convert images to
 PDF, since they may transcode images or produce downsampled images,
 sometimes without warning.

-Image processing
----------------
+(image-processing)=
+
+## Image processing

 OCRmyPDF performs some image processing on each page of a PDF, if
 desired. The same processing is applied to each page. It is suggested
@@ -200,18 +199,18 @@ should be visually reviewed after using these options.

 Deskew:

-:::{code} bash
+```bash
 ocrmypdf --deskew input.pdf output.pdf
-:::
+```

 Image processing commands can be combined. The order in which options
 are given does not matter. OCRmyPDF always applies the steps of the
 image processing pipeline in the same order (rotate, remove background,
 deskew, clean).

-:::{code} bash
+```bash
 ocrmypdf --deskew --clean --rotate-pages input.pdf output.pdf
-:::
+```

 Don\'t actually OCR my PDF
 --------------------------
@@ -221,12 +220,11 @@ processing without performing OCR (by causing OCR to time out). This
 works if all you want to is to apply image processing or PDF/A
 conversion.

-:::{code} bash
+```bash
 ocrmypdf --tesseract-timeout=0 --remove-background input.pdf output.pdf
-:::
+```

-::: {.versionchanged}
-v14.1.0
+:::{versionchanged} v14.1.0

 Prior to this version, `--tesseract-timeout 0` would prevent other uses
 of Tesseract, such as deskewing, from working. This is no longer the
@@ -239,9 +237,9 @@ non-OCR operations, if needed.
 This is getting ridiculous, but OCRmyPDF can completely strip all textual
 information from a PDF and reconstruct it as a \"bag of images\" PDF.

-:::{code} bash
+```bash
 ocrmypdf --tesseract-timeout 0 --force-ocr input.pdf output.pdf
-:::
+```

 Why would you want to do this? Perhaps you have a PDF where OCR fails to
 produce useful results, and just want to get rid of all OCR information.
@@ -251,18 +249,18 @@ This command also removes OCR generated by third party tools.

 You can also optimize all images without performing any OCR:

-:::{code} bash
+```bash
 ocrmypdf --tesseract-timeout=0 --optimize 3 --skip-text input.pdf output.pdf
-:::
+```

 ### Process only certain pages

 You can ask OCRmyPDF to only apply [image processing](#image-processing)
 and OCR to certain pages.

-:::{code} bash
+```bash
 ocrmypdf --pages 2,3,13-17 input.pdf output.pdf
-:::
+```

 Hyphens denote a range of pages and commas separate page numbers. If you
 prefer to use spaces, quote all of the page numbers:
@@ -281,9 +279,9 @@ those options. Both of these steps are \"whole file\" operations. In
 this example, we want to OCR only the title and otherwise change the PDF
 as little as possible:

-:::{code} bash
+```bash
 ocrmypdf --pages 1 --output-type pdf --optimize 0 input.pdf output.pdf
-:::
+```

 Redo existing OCR
 -----------------
@@ -297,9 +295,9 @@ This may be helpful for users who want to take advantage of accuracy
 improvements in Tesseract for files they previously OCRed with an
 earlier version of Tesseract and OCRmyPDF.

-:::{code} bash
+```bash
 ocrmypdf --redo-ocr input.pdf output.pdf
-:::
+```

 This method will replace OCR without rasterizing, reducing quality or
 removing vector content. If a file contains a mix of pure digital text
@@ -351,18 +349,18 @@ header-rows: 1

 *   - Level
    - Comments
-*   - ``--optimize=0``
+*   - <nobr>``--optimize=0``</nobr>
    - Disables optimization.
-*   - ``--optimize 1``
+*   - <nobr>``--optimize 1``</nobr>
    - Enables lossless optimizations, such as transcoding images to more
        efficient formats. Also compress other uncompressed objects in the
        PDF and enables the more efficient "object streams" within the PDF.
        (If ``--jbig2-lossy`` is issued, then lossy JBIG2 optimization is used.
        The decision to use lossy JBIG2 is separate from standard optimization
        settings.)
-*   - ``--optimize 2``
+*   - <nobr>``--optimize 2``</nobr>
    - All of the above, and enables lossy optimizations and color quantization.
-*   - ``--optimize 3``
+*   - <nobr>``--optimize 3``</nobr>
    - All of the above, and enables more aggressive optimizations and targets lower image quality.
 :::

@@ -376,9 +374,9 @@ inefficient compression modes to more modern versions. A program like
 `qpdf` can be used to change encodings, e.g. to inspect the internals
 for a PDF.

-:::{code} bash
+```bash
 ocrmypdf --optimize 3 in.pdf out.pdf  # Make it small
-:::
+```

 Some users may consider enabling lossy JBIG2. See:
 `jbig2-lossy`{.interpreted-text role="ref"}.
--- a/docs/index.md
+++ b/docs/index.md
@@ -0,0 +1,57 @@
+% SPDX-FileCopyrightText: 2022 James R. Barlow
+% SPDX-License-Identifier: CC-BY-SA-4.0
+
+# OCRmyPDF documentation
+
+:::{figure} images/logo.svg
+:::
+
+OCRmyPDF adds an optical character recognition (OCR) text layer to scanned PDF
+files, allowing them to be searched.
+
+PDF is the best format for storing and exchanging scanned documents.
+Unfortunately, PDFs can be difficult to modify. OCRmyPDF makes it easy to apply
+image processing and OCR (recognized, searchable text) to existing PDFs.
+
+```{toctree}
+:maxdepth: 1
+
+introduction
+release_notes
+installation
+languages
+jbig2
+```
+
+```{toctree}
+:caption: Usage
+:maxdepth: 2
+
+cookbook
+optimizer
+docker
+advanced
+batch
+cloud
+performance
+pdfsecurity
+errors
+```
+
+```{toctree}
+:caption: Developers
+:maxdepth: 2
+
+api
+plugins
+apiref
+design_notes
+contributing
+maintainers
+```
+
+# Indices and tables
+
+- {ref}`genindex`
+- {ref}`modindex`
+- {ref}`search`
--- a/docs/index.rst
+++ b/docs/index.rst
@@ -1,56 +0,0 @@
-.. SPDX-FileCopyrightText: 2022 James R. Barlow
-..
-.. SPDX-License-Identifier: CC-BY-SA-4.0
-
-OCRmyPDF documentation
-======================
-
-.. figure:: images/logo.svg
-
-OCRmyPDF adds an optical character recognition (OCR) text layer to scanned PDF
-files, allowing them to be searched.
-
-PDF is the best format for storing and exchanging scanned documents.
-Unfortunately, PDFs can be difficult to modify. OCRmyPDF makes it easy to apply
-image processing and OCR (recognized, searchable text) to existing PDFs.
-
-.. toctree::
-   :maxdepth: 1
-
-   introduction
-   release_notes
-   installation
-   languages
-   jbig2
-
-.. toctree::
-   :caption: Usage
-   :maxdepth: 2
-
-   cookbook
-   optimizer
-   docker
-   advanced
-   batch
-   cloud
-   performance
-   pdfsecurity
-   errors
-
-.. toctree::
-   :caption: Developers
-   :maxdepth: 2
-
-   api
-   plugins
-   apiref
-   design_notes
-   contributing
-   maintainers
-
-Indices and tables
-==================
-
-* :ref:`genindex`
-* :ref:`modindex`
-* :ref:`search`
--- a/docs/installation.md
+++ b/docs/installation.md
@@ -0,0 +1,730 @@
+---
+myst:
+  substitutions:
+    deb_11: |-
+      :::{image} https://repology.org/badge/version-for-repo/debian_11/ocrmypdf.svg
+      :alt: Debian 11
+      :::
+    deb_12: |-
+      :::{image} https://repology.org/badge/version-for-repo/debian_12/ocrmypdf.svg
+      :alt: Debian 12
+      :::
+    deb_unstable: |-
+      :::{image} https://repology.org/badge/version-for-repo/debian_unstable/ocrmypdf.svg
+      :alt: Debian unstable
+      :::
+    fedora_38: |-
+      :::{image} https://repology.org/badge/version-for-repo/fedora_38/ocrmypdf.svg
+      :alt: Fedora 38
+      :::
+    fedora_39: |-
+      :::{image} https://repology.org/badge/version-for-repo/fedora_39/ocrmypdf.svg
+      :alt: Fedora 39
+      :::
+    fedora_rawhide: |-
+      :::{image} https://repology.org/badge/version-for-repo/fedora_rawhide/ocrmypdf.svg
+      :alt: Fedora Rawhide
+      :::
+    latest: |-
+      :::{image} https://img.shields.io/pypi/v/ocrmypdf.svg
+      :alt: OCRmyPDF latest released version on PyPI
+      :::
+    ubu_2004: |-
+      :::{image} https://repology.org/badge/version-for-repo/ubuntu_20_04/ocrmypdf.svg
+      :alt: Ubuntu 20.04 LTS
+      :::
+    ubu_2204: |-
+      :::{image} https://repology.org/badge/version-for-repo/ubuntu_22_04/ocrmypdf.svg
+      :alt: Ubuntu 22.04 LTS
+      :::
+---
+
+% SPDX-FileCopyrightText: 2022 James R. Barlow
+% SPDX-License-Identifier: CC-BY-SA-4.0
+
+# Installing OCRmyPDF
+
+(latest)=
+
+The easiest way to install OCRmyPDF is to follow the steps for your operating
+system/platform. This version may be out of date, however.
+
+These platforms have one-liner installs:
+
+:::{list-table}
+:header-rows: 0
+
+* - Debian, Ubuntu
+  - ``apt install ocrmypdf``
+* - Windows Subsystem for Linux
+  - ``apt install ocrmypdf``
+* - Fedora
+  - ``dnf install ocrmypdf tesseract-osd``
+* - macOS (Homebrew)
+  - ``brew install ocrmypdf``
+* - macOS (MacPorts)
+  - ``port install ocrmypdf``
+* - LinuxBrew
+  - ``brew install ocrmypdf``
+* - FreeBSD
+  - ``pkg install textproc/py-ocrmypdf``
+* - Snap (snapcraft packaging)
+  - ``snap install ocrmypdf``
+:::
+
+More detailed procedures are outlined below. If you want to do a manual
+install, or install a more recent version than your platform provides, read on.
+
+:::{contents} Platform-specific steps
+:depth: 2
+:local: true
+:::
+
+## Installing on Linux
+
+### Debian and Ubuntu 20.04 or newer
+
+:::{list-table}
+:header-rows: 1
+
+* - OCRmyPDF versions in Debian & Ubuntu
+* - {{ latest }}
+* - {{ deb_11 }} {{ deb_12 }} {{ deb_unstable }}
+* - {{ ubu_2004 }} {{ ubu_2204 }}
+:::
+
+Users of Debian or Ubuntu may simply
+
+```bash
+apt install ocrmypdf
+```
+
+As indicated in the table above, Debian and Ubuntu releases may lag
+behind the latest version. If the version available for your platform is
+out of date, you could opt to install the latest version from source.
+See [Installing HEAD revision from
+sources](#installing-head-revision-from-sources).
+
+For full details on version availability for your platform, check the
+[Debian Package Tracker](https://tracker.debian.org/pkg/ocrmypdf) or
+[Ubuntu launchpad.net](https://launchpad.net/ocrmypdf).
+
+:::{note}
+OCRmyPDF for Debian and Ubuntu currently omits the JBIG2 encoder.
+OCRmyPDF works fine without it but will produce larger output files.
+If you build jbig2enc from source, ocrmypdf will
+automatically detect it (specifically the `jbig2` binary) on the
+`PATH`. To add JBIG2 encoding, see {ref}`jbig2`.
+:::
+
+### Fedora
+
+:::{list-table}
+:header-rows: 1
+
+* - OCRmyPDF version
+* - {{latest}}
+* - {{fedora_38}} {{fedora_39}} {{fedora_rawhide}}
+:::
+
+Users of Fedora may simply
+
+```bash
+dnf install ocrmypdf tesseract-osd
+```
+
+For full details on version availability, check the [Fedora Package
+Tracker](https://packages.fedoraproject.org/pkgs/ocrmypdf/ocrmypdf/).
+
+If the version available for your platform is out of date, you could opt
+to install the latest version from source. See [Installing HEAD revision
+from sources](#installing-head-revision-from-sources).
+
+:::{note}
+OCRmyPDF for Fedora currently omits the JBIG2 encoder due to patent
+issues. OCRmyPDF works fine without it but will produce larger output
+files. If you build jbig2enc from source, ocrmypdf 7.0.0 and later
+will automatically detect it on the `PATH`. To add JBIG2 encoding,
+see {ref}`Installing the JBIG2 encoder <jbig2>`.
+:::
+
+### RHEL 9
+
+Prepare the environment by getting Python 3.11:
+
+```bash
+dnf install python3.11 python3.11-pip
+```
+
+Then, follow [Requirements for pip and HEAD install](#requirements-for-pip-and-head-install) to install dependencies:
+
+```bash
+dnf install ghostscript tesseract
+```
+
+and install ocrmypdf in a virtual environment:
+
+```bash
+python3.11 -m venv .venv
+.venv/bin/pip install ocrmypdf  # install from PyPI into the new environment
+```
+
+To add JBIG2 encoding, see {ref}`Installing the JBIG2 encoder <jbig2>`.
+
+Note that Fedora packages for language data haven't been branched for RHEL/EPEL, but you can get traineddata files directly from the [tesseract tessdata repository](https://github.com/tesseract-ocr/tessdata/) and place them in `/usr/share/tesseract/tessdata`.
+
+(ubuntu-lts-latest)=
+
+### Installing the latest version on Ubuntu 22.04 LTS
+
+Ubuntu 22.04 includes ocrmypdf 13.4.0 - you can install that with
+`apt install ocrmypdf`. To install a more recent version for the current
+user, follow these steps:
+
+```bash
+sudo apt-get update
+sudo apt-get -y install ocrmypdf python3-pip
+
+pip install --user --upgrade ocrmypdf
+```
+
+If you get the message `WARNING: The script ocrmypdf is installed in
+'/home/$USER/.local/bin' which is not on PATH.`, you may need to re-login
+or open a new shell, or manually adjust your PATH.
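+
+As a minimal sketch of the manual adjustment, assuming a Bourne-style shell such as `bash` and the default per-user script directory `~/.local/bin`, you could prepend that directory to `PATH` for the current session:
+
+```bash
+# Prepend the per-user script directory so the shell can find `ocrmypdf`.
+export PATH="$HOME/.local/bin:$PATH"
+```
+
+To make the change persistent, the same line can be added to your shell startup file, such as `~/.profile`.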
+
+To add JBIG2 encoding, see {ref}`jbig2`.
+
+### Ubuntu 20.04 LTS
+
+Ubuntu 20.04 includes ocrmypdf 9.6.0 - you can install that with `apt`. The
+most convenient way to install recent OCRmyPDF on older Ubuntu is to use
+Homebrew on Linux (Linuxbrew).
+
+```bash
+brew install ocrmypdf
+```
+
+### Arch Linux (AUR)
+
+:::{image} https://repology.org/badge/version-for-repo/aur/ocrmypdf.svg
+:alt: ArchLinux
+:target: https://repology.org/metapackage/ocrmypdf
+:::
+
+There is an [Arch User Repository (AUR) package for OCRmyPDF](https://aur.archlinux.org/packages/ocrmypdf/).
+
+Installing AUR packages as root is not allowed, so you must first [set up a
+non-root user](https://wiki.archlinux.org/index.php/Users_and_groups#User_management) and
+[configure sudo](https://wiki.archlinux.org/index.php/Sudo#Configuration).
+The standard Docker image, `archlinux/base:latest`, does **not** have a
+non-root user configured, so users of that image must follow these guides. If
+you are using a VM image, such as [the official Vagrant image](https://app.vagrantup.com/archlinux/boxes/archlinux), this work may already
+be completed for you.
+
+Next you should install the [base-devel package group](https://archlinux.org/packages/core/any/base-devel/). This includes the
+standard tooling needed to build packages, such as a compiler and binary tools.
+
+```bash
+sudo pacman -S --needed base-devel
+```
+
+Now you are ready to install the OCRmyPDF package.
+
+```bash
+curl -O https://aur.archlinux.org/cgit/aur.git/snapshot/ocrmypdf.tar.gz
+tar xvzf ocrmypdf.tar.gz
+cd ocrmypdf
+makepkg -sri
+```
+
+At this point you will have a working install of OCRmyPDF, but the Tesseract
+install won’t include any OCR language data. You can install [the
+tesseract-data package group](https://www.archlinux.org/groups/any/tesseract-data/) to add all supported
+languages, or use that package listing to identify the appropriate package for
+your desired language.
+
+```bash
+sudo pacman -S tesseract-data-eng
+```
+
+As an alternative to this manual procedure, consider using an [AUR helper](https://wiki.archlinux.org/index.php/AUR_helpers). Such a tool will
+automatically fetch, build and install the AUR package, resolve dependencies
+(including dependencies on AUR packages), and ease the upgrade procedure.
+
+If you have any difficulties with installation, check the repository package
+page.
+
+:::{note}
+The OCRmyPDF AUR package currently omits the JBIG2 encoder. OCRmyPDF works
+fine without it but will produce larger output files. The encoder is
+available from [the jbig2enc-git AUR package](https://aur.archlinux.org/packages/jbig2enc-git/) and may be installed
+using the same series of steps as for installing the OCRmyPDF AUR
+package. Alternatively, it may be built manually from source following the
+instructions in {ref}`Installing the JBIG2 encoder <jbig2>`. If JBIG2 is
+installed, OCRmyPDF 7.0.0 and later will automatically detect it.
+:::
+
+### Alpine Linux
+
+:::{image} https://repology.org/badge/version-for-repo/alpine_edge/ocrmypdf.svg
+:alt: Alpine Linux
+:target: https://repology.org/metapackage/ocrmypdf
+:::
+
+To install OCRmyPDF for Alpine Linux:
+
+```bash
+apk add ocrmypdf
+```
+
+### Gentoo Linux
+
+:::{image} https://repology.org/badge/version-for-repo/gentoo_ovl_guru/ocrmypdf.svg
+:alt: Gentoo Linux
+:target: https://repology.org/metapackage/ocrmypdf
+:::
+
+To install OCRmyPDF on Gentoo Linux, use the following commands:
+
+```bash
+eselect repository enable guru
+emaint sync --repo guru
+emerge --ask app-text/OCRmyPDF
+```
+
+### Other Linux packages
+
+See the
+[Repology](https://repology.org/metapackage/ocrmypdf/versions) page.
+
+In general, first install the OCRmyPDF package for your system, then
+optionally use the procedure [Installing with Python
+pip](#installing-with-python-pip) to install a more recent version.
+
+## Installing on macOS
+
+### Homebrew
+
+:::{image} https://img.shields.io/homebrew/v/ocrmypdf.svg
+:alt: homebrew
+:target: https://formulae.brew.sh/formula/ocrmypdf
+:::
+
+OCRmyPDF is now a standard [Homebrew](https://brew.sh) formula. To
+install on macOS:
+
+```bash
+brew install ocrmypdf
+```
+
+This will include only the English language pack. If you need other
+languages you can optionally install them all:
+
+```bash
+brew install tesseract-lang  # Optional: Install all language packs
+```
+
+### MacPorts
+
+:::{image} https://img.shields.io/badge/dynamic/json?url=https%3A%2F%2Fports.macports.org%2Fapi%2Fv1%2Fports%2Focrmypdf%2F%3Fformat%3Djson&query=version&label=MacPorts
+:alt: Macports Version Information
+:target: https://ports.macports.org/port/ocrmypdf
+:::
+
+OCRmyPDF is included in MacPorts:
+
+```bash
+sudo port install ocrmypdf
+```
+
+Note that while this will install tesseract, you will still need to install
+the appropriate tesseract [language ports](https://ports.macports.org/search/?selected_facets=categories_exact%3Atextproc&installed_file=&q=tesseract&name=on).
+
+### Manual installation on macOS
+
+These instructions probably work on all macOS supported by Homebrew, and are
+for installing a more current version of OCRmyPDF than is available from
+Homebrew. Note that the Homebrew versions usually track the release versions
+fairly closely.
+
+If it's not already present, [install Homebrew](http://brew.sh/).
+
+Update Homebrew:
+
+```bash
+brew update
+```
+
+Install or upgrade the required Homebrew packages, if any are missing.
+To do this, use `brew edit ocrmypdf` to obtain a recent list of Homebrew
+dependencies. You could also check the `.workflows/build.yml`.
+
+This will include the English, French, German and Spanish language
+packs. If you need other languages you can optionally install them all:
+
+(macos-all-languages)=
+
+```bash
+brew install tesseract-lang  # Optional: install all language packs
+```
+
+Update Homebrew's pip:
+
+```bash
+pip install --upgrade pip
+```
+
+You can then install OCRmyPDF from PyPI for the current user:
+
+```bash
+pip install --user ocrmypdf
+```
+
+The command line program should now be available:
+
+```bash
+ocrmypdf --help
+```
+
+## Installing on Windows
+
+### Native Windows
+
+% If you have a Windows that is not the Home edition, you can use Windows Sandbox to test on a blank Windows instance.
+% https://learn.microsoft.com/en-us/windows/security/application-security/application-isolation/windows-sandbox/
+
+:::{note}
+Administrator privileges will be required for some of these steps.
+:::
+
+You must install the following for Windows:
+
+- Python 64-bit
+- Tesseract 64-bit
+- Ghostscript 64-bit
+
+Using the [winget](https://docs.microsoft.com/en-us/windows/package-manager/winget/)
+package manager:
+
+- `winget install -e --id Python.Python.3.11`
+- `winget install -e --id UB-Mannheim.TesseractOCR`
+
+You will need to install Ghostscript manually, [since it does not support automated
+installs anymore](https://artifex.com/news/ghostscript-10.01.0-disabling-silent-install-option).
+
+- [Ghostscript download page](https://ghostscript.com/releases/gsdnld.html)
+
+Alternatively, using the [Chocolatey](https://chocolatey.org/) package manager, install
+the following from an Administrator command prompt:
+
+- `choco install python3`
+- `choco install --pre tesseract`
+- `choco install pngquant` (optional)
+
+Either set of commands will install the required software. At the moment there is no
+single command to install OCRmyPDF on Windows.
+
+You may then use `pip` to install ocrmypdf (this can be performed by a user or an
+Administrator):
+
+- `python3 -m pip install ocrmypdf`
+
+% The Windows Python versions do not place any python or python3 executable in the path.
+% They add the py launcher to the path:
+% https://docs.python.org/3/using/windows.html#python-launcher-for-windows
+
+If you installed Python using WinGet, then use the following command instead:
+
+- `py -m pip install ocrmypdf`
+
+and start OCRmyPDF with:
+
+- `py -m ocrmypdf`
+
+If you intend to use more Python software on your Windows machine, consider the use of
+[pipx](https://pipx.pypa.io/stable/) or a similar tool to create isolated Python
+environments for each Python application that you want to use.
+
+OCRmyPDF will check the Windows Registry and standard locations in your Program Files
+for third party software it needs (specifically, Tesseract and Ghostscript). To
+override the versions OCRmyPDF selects, you can modify the `PATH` environment
+variable. [Follow these directions](https://www.computerhope.com/issues/ch000549.htm#dospath)
+to change the PATH.
+
+:::{warning}
+As of early 2021, users have reported problems with the Microsoft Store version of
+Python and OCRmyPDF. These issues affect many other third party Python packages.
+Please download Python from Python.org or a package manager instead of the
+Microsoft Store version.
+:::
+
+:::{warning}
+32-bit Windows is not supported.
+:::
+
+### Windows Subsystem for Linux
+
+1. Install Ubuntu 22.04 for Windows Subsystem for Linux, if not already installed.
+2. Follow the procedure to install {ref}`OCRmyPDF on Ubuntu 22.04 <ubuntu-lts-latest>`.
+3. Open the Windows command prompt and create a symlink:
+
+```powershell
+wsl sudo ln -s  /home/$USER/.local/bin/ocrmypdf /usr/local/bin/ocrmypdf
+```
+
+Then confirm that the expected version from PyPI ({{ latest }}) is installed:
+
+```powershell
+wsl ocrmypdf --version
+```
+
+You can then run OCRmyPDF from the Windows command prompt or PowerShell by prefixing
+the command with `wsl`, and call it from Windows programs or batch files.
+
+### Cygwin64
+
+First install the following prerequisite Cygwin packages using `setup-x86_64.exe`:
+
+```
+python310 (or later)
+python3?-devel
+python3?-pip
+python3?-lxml
+python3?-imaging
+
+   (where 3? means match the version of python3 you installed)
+
+gcc-g++
+ghostscript
+libexempi3
+libexempi-devel
+libffi6
+libffi-devel
+pngquant
+qpdf
+libqpdf-devel
+tesseract-ocr
+tesseract-ocr-devel
+```
+
+Then open a Cygwin terminal (i.e. `mintty`) and run the following commands. Note
+that if you are using the version of `pip` that was installed with the Cygwin
+Python package, the command name will be `pip3`. If you have since updated
+`pip` (with, for instance, `pip3 install --upgrade pip`), the command is
+likely just `pip` instead of `pip3`:
+
+```bash
+pip3 install wheel
+pip3 install ocrmypdf
+```
+
+The optional dependency `unpaper` is currently not available under Cygwin.
+Without it, certain options such as `--clean` will produce an error message.
+However, the OCR-to-text-layer functionality is available.
+
+### Docker
+
+You can also [install the Docker image](docker) on Windows. Ensure that
+your command prompt can run the docker "hello world" container.
+
+## Installing on FreeBSD
+
+:::{image} https://repology.org/badge/version-for-repo/freebsd/ocrmypdf.svg
+:alt: FreeBSD
+:target: https://repology.org/project/ocrmypdf/versions
+:::
+
+```bash
+pkg install textproc/py-ocrmypdf
+```
+
+To install a more recent version, you could attempt to first install the system
+version with `pkg`, then use `pip install --user ocrmypdf`.
+
+## Installing the Docker image
+
+For some users, installing the Docker image will be easier than
+installing all of OCRmyPDF's dependencies.
+
+See [Installing the Docker image](docker) for more information.
+
+(installing-with-python-pip)=
+
+## Installing with Python pip
+
+OCRmyPDF is delivered by PyPI because it is a convenient way to install
+the latest version. However, PyPI and `pip` cannot address the fact
+that `ocrmypdf` depends on certain non-Python system libraries and
+programs being installed.
+
+For best results, first install [your platform's
+version](https://repology.org/metapackage/ocrmypdf/versions) of
+`ocrmypdf`, using the instructions elsewhere in this document. Then
+you can use `pip` to get the latest version if your platform version
+is out of date. Chances are that this will satisfy most dependencies.
+
+Use `ocrmypdf --version` to confirm what version was installed.
+
+Then you can install the latest OCRmyPDF from the Python wheels. First
+try:
+
+```bash
+pip install --user ocrmypdf
+```
+
+(If the message `Requirement already satisfied: ocrmypdf in...` appears,
+you will need to use `pip install --user --upgrade ocrmypdf`.)
+
+You should then be able to run `ocrmypdf --version` and see that the
+latest version was located.
+
+## Installing with pipx
+
+Some users may prefer pipx. As with the method above, you will need to
+satisfy all non-Python dependencies. Then if pipx is installed, you
+can use
+
+```bash
+pipx run ocrmypdf
+```
+
+(If ocrmypdf is not already available, pipx will install it first.)
+
+(requirements-for-pip-and-head-install)=
+
+### Requirements for pip and HEAD install
+
+OCRmyPDF currently requires these external programs and libraries to be
+installed. They must be provided by the operating system package
+manager; `pip` cannot provide them.
+
+The following versions are required:
+
+- Python 3.10 or newer
+- Ghostscript 9.54 or newer
+- Tesseract 4.1.1 or newer
+- jbig2enc 0.29 or newer
+- pngquant 2.5 or newer
+- unpaper 6.1
+
+We recommend 64-bit versions of all software. (32-bit versions are not
+supported, although on Linux, they may still work.)
+
+jbig2enc, pngquant, and unpaper are optional. If they are missing, certain
+features are disabled. OCRmyPDF will discover them as soon as they are
+available.
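+
+A minimal sketch for checking which of these programs are currently on your `PATH`, assuming a POSIX shell. The helper name `check_tool` is hypothetical, and the executable names `gs` and `tesseract` are the usual ones on Linux and macOS but may differ on your platform. The sketch only tests for presence and does not compare versions:
+
+```bash
+# Report whether a program can be found on PATH.
+check_tool() {
+    if command -v "$1" >/dev/null 2>&1; then
+        echo "$1: found"
+    else
+        echo "$1: missing"
+    fi
+}
+
+# Required programs first, then the optional ones.
+for tool in gs tesseract jbig2 pngquant unpaper; do
+    check_tool "$tool"
+done
+```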
+
+**jbig2enc**, if present, will be used to optimize the encoding of
+monochrome images. This can significantly reduce the file size of the
+output file. It is not required.
+[jbig2enc](https://github.com/agl/jbig2enc) is not generally
+available for Ubuntu or Debian due to lingering concerns about patent
+issues, but can easily be built from source. To add JBIG2 encoding, see
+{ref}`jbig2`.
+
+**pngquant**, if present, is optionally used to optimize the encoding of
+PNG-style images in PDFs (actually, any that are losslessly
+encoded) by lossily quantizing to a smaller color palette. It is only
+activated when the `--optimize` argument is `2` or `3`.
+
+**unpaper**, if present, enables the `--clean` and `--clean-final`
+command line options.
+
+These are in addition to the Python packaging dependencies, meaning that
+unfortunately, the `pip install` command cannot satisfy all of them.
+
+(installing-head-revision-from-sources)=
+
+## Installing HEAD revision from sources
+
+If you have `git` and Python 3.10 or newer installed, you can install
+from source. When the `pip` installer runs, it will alert you if
+dependencies are missing.
+
+If you prefer to build everything from source, you will need to [build
+pikepdf from
+source](https://pikepdf.readthedocs.io/en/latest/installation.html#building-from-source).
+First ensure you can build and install pikepdf.
+
+To install the HEAD revision from sources in the current Python 3
+environment:
+
+```bash
+pip install git+https://github.com/ocrmypdf/OCRmyPDF.git
+```
+
+Or, to install in editable mode
+allowing customization of OCRmyPDF, use the `-e` flag:
+
+```bash
+pip install -e "git+https://github.com/ocrmypdf/OCRmyPDF.git#egg=ocrmypdf"
+```
+
+You may find it easiest to install in a virtual environment, rather than
+system-wide:
+
+```bash
+git clone -b main https://github.com/ocrmypdf/OCRmyPDF.git
+python3 -m venv .venv
+source .venv/bin/activate
+cd OCRmyPDF
+pip install .
+```
+
+However, `ocrmypdf` will only be accessible on the system PATH when
+you activate the virtual environment.
+
+To run the program:
+
+```bash
+ocrmypdf --help
+```
+
+If any dependencies are not yet installed, the script will notify you about
+those that need to be installed. The script requires specific versions of the
+dependencies. Versions older than the ones mentioned in the release notes
+are unlikely to be compatible with OCRmyPDF.
+
+### For development
+
+To install all of the development and test requirements:
+
+```bash
+git clone -b main https://github.com/ocrmypdf/OCRmyPDF.git
+python -m venv .venv
+source .venv/bin/activate
+cd OCRmyPDF
+pip install -e '.[test]'  # quoted so the shell does not treat the brackets as a glob
+```
+
+To add JBIG2 encoding, see {ref}`jbig2`.
+
+## Shell completions
+
+Completions for `bash` and `fish` are available in the project's
+`misc/completion` folder. The `bash` completions are likely `zsh`
+compatible but this has not been confirmed. Package maintainers, please
+install these at the appropriate locations for your system.
+
+To manually install the `bash` completion, copy
+`misc/completion/ocrmypdf.bash` to `/etc/bash_completion.d/ocrmypdf`
+(rename the file).
+
+To manually install the `fish` completion, copy
+`misc/completion/ocrmypdf.fish` to
+`~/.config/fish/completions/ocrmypdf.fish`.
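+
+The manual steps can be sketched as a small helper, assuming the commands are run from the root of an OCRmyPDF source checkout. The function name `install_completion` is hypothetical and is not part of the project:
+
+```bash
+# Copy a completion file to its destination, creating the directory if needed.
+install_completion() {
+    src="$1"
+    dest="$2"
+    mkdir -p "$(dirname "$dest")" && cp "$src" "$dest"
+}
+
+# Example calls (the first usually needs root privileges):
+#   install_completion misc/completion/ocrmypdf.bash /etc/bash_completion.d/ocrmypdf
+#   install_completion misc/completion/ocrmypdf.fish ~/.config/fish/completions/ocrmypdf.fish
+```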
+
+## Note on 32-bit support
+
+Many Python libraries no longer provide 32-bit binary wheels for Linux. This
+includes many of the libraries that OCRmyPDF depends on, such as
+Pillow. The easiest way to express this to end users is to say we don't
+support 32-bit Linux.
+
+However, if your Linux distribution still supports 32-bit binaries, you
+can still install and use OCRmyPDF. A warning message will appear.
+In practice, OCRmyPDF may need more than 32-bit memory space to run when
+large documents are processed, so there are practical limitations to what
+users can accomplish with it. Still, for the common use case of a 32-bit
+ARM NAS or Raspberry Pi processing small documents, it should work.
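+
+To check whether your Linux userland is 32-bit or 64-bit, here is a quick sketch using the standard `getconf` utility (the variable name `bits` is only for illustration):
+
+```bash
+# Prints 32 or 64 depending on the word size of the running userland.
+bits="$(getconf LONG_BIT)"
+echo "This system is ${bits}-bit"
+```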
--- a/docs/installation.rst
+++ b/docs/installation.rst
@@ -1,740 +0,0 @@
-.. SPDX-FileCopyrightText: 2022 James R. Barlow
-..
-.. SPDX-License-Identifier: CC-BY-SA-4.0
-
-===================
-Installing OCRmyPDF
-===================
-
-.. |latest| image:: https://img.shields.io/pypi/v/ocrmypdf.svg
-    :alt: OCRmyPDF latest released version on PyPI
-
-|latest|
-
-The easiest way to install OCRmyPDF is to follow the steps for your operating
-system/platform. This version may be out of date, however.
-
-These platforms have one-liner installs:
-
-+-------------------------------+-----------------------------------------+
-| Debian, Ubuntu                | ``apt install ocrmypdf``                |
-+-------------------------------+-----------------------------------------+
-| Windows Subsystem for Linux   | ``apt install ocrmypdf``                |
-+-------------------------------+-----------------------------------------+
-| Fedora                        | ``dnf install ocrmypdf tesseract-osd``  |
-+-------------------------------+-----------------------------------------+
-| macOS (Homebrew)              | ``brew install ocrmypdf``               |
-+-------------------------------+-----------------------------------------+
-| macOS (MacPorts)              | ``port install ocrmypdf``               |
-+-------------------------------+-----------------------------------------+
-| LinuxBrew                     | ``brew install ocrmypdf``               |
-+-------------------------------+-----------------------------------------+
-| FreeBSD                       | ``pkg install textproc/py-ocrmypdf``    |
-+-------------------------------+-----------------------------------------+
-| Snap (snapcraft packaging)    | ``snap install ocrmypdf``               |
-+-------------------------------+-----------------------------------------+
-
-More detailed procedures are outlined below. If you want to do a manual
-install, or install a more recent version than your platform provides, read on.
-
-.. contents:: Platform-specific steps
-    :depth: 2
-    :local:
-
-Installing on Linux
-===================
-
-Debian and Ubuntu 20.04 or newer
--------------------------------
-
-.. |deb-11| image:: https://repology.org/badge/version-for-repo/debian_11/ocrmypdf.svg
-    :alt: Debian 11
-
-.. |deb-12| image:: https://repology.org/badge/version-for-repo/debian_12/ocrmypdf.svg
-    :alt: Debian 12
-
-.. |deb-unstable| image:: https://repology.org/badge/version-for-repo/debian_unstable/ocrmypdf.svg
-    :alt: Debian unstable
-
-.. |ubu-2004| image:: https://repology.org/badge/version-for-repo/ubuntu_20_04/ocrmypdf.svg
-    :alt: Ubuntu 20.04 LTS
-
-.. |ubu-2204| image:: https://repology.org/badge/version-for-repo/ubuntu_22_04/ocrmypdf.svg
-    :alt: Ubuntu 22.04 LTS
-
-+-----------------------------------------------+
-| **OCRmyPDF versions in Debian & Ubuntu**      |
-+-----------------------------------------------+
-| |latest|                                      |
-+-----------------------------------------------+
-| |deb-11| |deb-12| |deb-unstable|              |
-+-----------------------------------------------+
-| |ubu-2004| |ubu-2204|                         |
-+-----------------------------------------------+
-
-Users of Debian or Ubuntu may simply
-
-.. code-block:: bash
-
-    apt install ocrmypdf
-
-As indicated in the table above, Debian and Ubuntu releases may lag
-behind the latest version. If the version available for your platform is
-out of date, you could opt to install the latest version from source.
-See `Installing HEAD revision from
-sources <#installing-head-revision-from-sources>`__.
-
-For full details on version availability for your platform, check the
-`Debian Package Tracker <https://tracker.debian.org/pkg/ocrmypdf>`__ or
-`Ubuntu launchpad.net <https://launchpad.net/ocrmypdf>`__.
-
-.. note::
-
-   OCRmyPDF for Debian and Ubuntu currently omit the JBIG2 encoder.
-   OCRmyPDF works fine without it but will produce larger output files.
-   If you build jbig2enc from source, ocrmypdf will
-   automatically detect it (specifically the ``jbig2`` binary) on the
-   ``PATH``. To add JBIG2 encoding, see :ref:`jbig2`.
-
-Fedora
------
-
-.. |fedora-38| image:: https://repology.org/badge/version-for-repo/fedora_38/ocrmypdf.svg
-    :alt: Fedora 38
-
-.. |fedora-39| image:: https://repology.org/badge/version-for-repo/fedora_39/ocrmypdf.svg
-    :alt: Fedora 39
-
-.. |fedora-rawhide| image:: https://repology.org/badge/version-for-repo/fedora_rawhide/ocrmypdf.svg
-    :alt: Fedore Rawhide
-
-+-----------------------------------------------+
-| **OCRmyPDF version**                          |
-+-----------------------------------------------+
-| |latest|                                      |
-+-----------------------------------------------+
-| |fedora-38| |fedora-39| |fedora-rawhide|      |
-+-----------------------------------------------+
-
-Users of Fedora may simply
-
-.. code-block:: bash
-
-    dnf install ocrmypdf tesseract-osd
-
-For full details on version availability, check the `Fedora Package
-Tracker <https://packages.fedoraproject.org/pkgs/ocrmypdf/ocrmypdf/>`__.
-
-If the version available for your platform is out of date, you could opt
-to install the latest version from source. See `Installing HEAD revision
-from sources <#installing-head-revision-from-sources>`__.
-
-.. note::
-
-   OCRmyPDF for Fedora currently omits the JBIG2 encoder due to patent
-   issues. OCRmyPDF works fine without it but will produce larger output
-   files. If you build jbig2enc from source, ocrmypdf 7.0.0 and later
-   will automatically detect it on the ``PATH``. To add JBIG2 encoding,
-   see :ref:`Installing the JBIG2 encoder <jbig2>`.
-
-.. _ubuntu-lts-latest:
-
-RHEL 9
------
-
-Prepare the environment by getting Python 3.11:
-
-.. code-block:: bash
-
-    dnf install python3.11 python3.11-pip
-
-Then, follow `Requirements for pip and HEAD install <#requirements-for-pip-and-head-install>`__ to install dependencies:
-
-.. code-block:: bash
-
-    dnf install ghostscript tesseract
-
-and build ocrmypdf in virtual environment:
-
-.. code-block:: bash
-
-    python3.11 -m venv .venv
-
-To add JBIG2 encoding, see :ref:`Installing the JBIG2 encoder <jbig2>`.
-
-Note Fedora packages for language data haven't been branched for RHEL/EPEL, but you can get traineddata files directly from `tesseract
-<https://github.com/tesseract-ocr/tessdata/>`__ and place them in ``/usr/share/tesseract/tessdata``.
-
-Installing the latest version on Ubuntu 22.04 LTS
-------------------------------------------------
-
-Ubuntu 22.04 includes ocrmypdf 13.4.0 - you can install that with
-``apt install ocrmypdf``. To install a more recent version for the current
-user, follow these steps:
-
-.. code-block:: bash
-
-    sudo apt-get update
-    sudo apt-get -y install ocrmypdf python3-pip
-
-    pip install --user --upgrade ocrmypdf
-
-If you get the message ``WARNING: The script ocrmypdf is installed in
-'/home/$USER/.local/bin' which is not on PATH.``, you may need to re-login
-or open a new shell, or manually adjust your PATH.
-
-To add JBIG2 encoding, see :ref:`jbig2`.
-
-Ubuntu 20.04 LTS
----------------
-
-Ubuntu 20.04 includes ocrmypdf 9.6.0 - you can install that with ``apt``. The
-most convenient way to install recent OCRmyPDF on older Ubuntu is to use
-Homebrew on Linux (Linuxbrew).
-
-.. code-block:: bash
-
-    brew install ocrmypdf
-
-Arch Linux (AUR)
----------------
-
-.. image:: https://repology.org/badge/version-for-repo/aur/ocrmypdf.svg
-    :alt: ArchLinux
-    :target: https://repology.org/metapackage/ocrmypdf
-
-There is an `Arch User Repository (AUR) package for OCRmyPDF
-<https://aur.archlinux.org/packages/ocrmypdf/>`__.
-
-Installing AUR packages as root is not allowed, so you must first `setup a
-non-root user
-<https://wiki.archlinux.org/index.php/Users_and_groups#User_management>`__ and
-`configure sudo <https://wiki.archlinux.org/index.php/Sudo#Configuration>`__.
-The standard Docker image, ``archlinux/base:latest``, does **not** have a
-non-root user configured, so users of that image must follow these guides. If
-you are using a VM image, such as `the official Vagrant image
-<https://app.vagrantup.com/archlinux/boxes/archlinux>`__, this work may already
-be completed for you.
-
-Next you should install the `base-devel package group
-<https://archlinux.org/packages/core/any/base-devel/>`__. This includes the
-standard tooling needed to build packages, such as a compiler and binary tools.
-
-.. code-block:: bash
-
-   sudo pacman -S --needed base-devel
-
-Now you are ready to install the OCRmyPDF package.
-
-.. code-block:: bash
-
-   curl -O https://aur.archlinux.org/cgit/aur.git/snapshot/ocrmypdf.tar.gz
-   tar xvzf ocrmypdf.tar.gz
-   cd ocrmypdf
-   makepkg -sri
-
-At this point you will have a working install of OCRmyPDF, but the Tesseract
-install won’t include any OCR language data. You can install `the
-tesseract-data package group
-<https://www.archlinux.org/groups/any/tesseract-data/>`__ to add all supported
-languages, or use that package listing to identify the appropriate package for
-your desired language.
-
-.. code-block:: bash
-
-   sudo pacman -S tesseract-data-eng
-
-As an alternative to this manual procedure, consider using an `AUR helper
-<https://wiki.archlinux.org/index.php/AUR_helpers>`__. Such a tool will
-automatically fetch, build and install the AUR package, resolve dependencies
-(including dependencies on AUR packages), and ease the upgrade procedure.
-
-If you have any difficulties with installation, check the repository package
-page.
-
-.. note::
-
-    The OCRmyPDF AUR package currently omits the JBIG2 encoder. OCRmyPDF works
-    fine without it but will produce larger output files. The encoder is
-    available from `the jbig2enc-git AUR package
-    <https://aur.archlinux.org/packages/jbig2enc-git/>`__ and may be installed
-    using the same series of steps as for the installation OCRmyPDF AUR
-    package. Alternatively, it may be built manually from source following the
-    instructions in :ref:`Installing the JBIG2 encoder <jbig2>`.  If JBIG2 is
-    installed, OCRmyPDF 7.0.0 and later will automatically detect it.
-
-Alpine Linux
------------
-
-.. image:: https://repology.org/badge/version-for-repo/alpine_edge/ocrmypdf.svg
-    :alt: Alpine Linux
-    :target: https://repology.org/metapackage/ocrmypdf
-
-To install OCRmyPDF for Alpine Linux:
-
-.. code-block:: bash
-
-    apk add ocrmypdf
-
-Gentoo Linux
------------
-
-.. image:: https://repology.org/badge/version-for-repo/gentoo_ovl_guru/ocrmypdf.svg
-    :alt: Gentoo Linux
-    :target: https://repology.org/metapackage/ocrmypdf
-
-To install OCRmyPDF on Gentoo Linux, use the following commands:
-
-.. code-block:: bash
-
-    eselect repository enable guru
-    emaint sync --repo guru
-    emerge --ask app-text/OCRmyPDF
-
-Other Linux packages
--------------------
-
-See the
-`Repology <https://repology.org/metapackage/ocrmypdf/versions>`__ page.
-
-In general, first install the OCRmyPDF package for your system, then
-optionally use the procedure `Installing with Python
-pip <#installing-with-python-pip>`__ to install a more recent version.
-
-Installing on macOS
-===================
-
-Homebrew
--------
-
-.. image:: https://img.shields.io/homebrew/v/ocrmypdf.svg
-    :alt: homebrew
-    :target: https://formulae.brew.sh/formula/ocrmypdf
-
-OCRmyPDF is now a standard `Homebrew <https://brew.sh>`__ formula. To
-install on macOS:
-
-.. code-block:: bash
-
-    brew install ocrmypdf
-
-This will include only the English language pack. If you need other
-languages you can optionally install them all:
-
-.. code-block:: bash
-
-    brew install tesseract-lang  # Optional: Install all language packs
-
-MacPorts
--------
-
-.. image:: https://img.shields.io/badge/dynamic/json?url=https%3A%2F%2Fports.macports.org%2Fapi%2Fv1%2Fports%2Focrmypdf%2F%3Fformat%3Djson&query=version&label=MacPorts
-   :alt: Macports Version Information
-   :target: https://ports.macports.org/port/ocrmypdf
-
-OCRmyPDF is includes in MacPorts:
-
-.. code-block:: bash
-
-    sudo port install ocrmypdf
-
-Note that while this will install tesseract you will need to install
-the appropriate tesseract `language ports <https://ports.macports.org/search/?selected_facets=categories_exact%3Atextproc&installed_file=&q=tesseract&name=on>`__.
-
-Manual installation on macOS
----------------------------
-
-These instructions probably work on all macOS supported by Homebrew, and are
-for installing a more current version of OCRmyPDF than is available from
-Homebrew. Note that the Homebrew versions usually track the release versions
-fairly closely.
-
-If it's not already present, `install Homebrew <http://brew.sh/>`__.
-
-Update Homebrew:
-
-.. code-block:: bash
-
-    brew update
-
-Install or upgrade the required Homebrew packages, if any are missing.
-To do this, use ``brew edit ocrmypdf`` to obtain a recent list of Homebrew
-dependencies. You could also check the ``.workflows/build.yml``.
-
-This will include the English, French, German and Spanish language
-packs. If you need other languages you can optionally install them all:
-
-.. _macos-all-languages:
-
-   .. code-block:: bash
-
-    brew install tesseract-lang  # Option 2: for all language packs
-
-Update the homebrew pip:
-
-.. code-block:: bash
-
-    pip install --upgrade pip
-
-You can then install OCRmyPDF from PyPI for the current user:
-
-.. code-block:: bash
-
-    pip install --user ocrmypdf
-
-The command line program should now be available:
-
-.. code-block:: bash
-
-    ocrmypdf --help
-
-Installing on Windows
-=====================
-
-Native Windows
--------------
-
-..
-  If you have a Windows that is not the Home edition, you can use Windows Sandbox to test on a blank Windows instance.
-  https://learn.microsoft.com/en-us/windows/security/application-security/application-isolation/windows-sandbox/
-
-.. note::
-
-    Administrator privileges will be required for some of these steps.
-
-You must install the following for Windows:
-
-* Python 64-bit
-* Tesseract 64-bit
-* Ghostscript 64-bit
-
-Using the `winget <https://docs.microsoft.com/en-us/windows/package-manager/winget/>`_
-package manager:
-
-* ``winget install -e --id Python.Python.3.11``
-* ``winget install -e --id UB-Mannheim.TesseractOCR``
-
-You will need to install Ghostscript manually, `since it does not support automated
-installs anymore <https://artifex.com/news/ghostscript-10.01.0-disabling-silent-install-option>`_.
-
-* `Ghostscript download page <https://ghostscript.com/releases/gsdnld.html>`_.
-
-(Or alternately, using the `Chocolatey <https://chocolatey.org/>`_ package manager, install
-the following when running in an Administrator command prompt):
-
-* ``choco install python3``
-* ``choco install --pre tesseract``
-* ``choco install pngquant`` (optional)
-
-Either set of commands will install the required software. At the moment there is no
-single command that installs everything on Windows.
-
-You may then use ``pip`` to install ocrmypdf. (This can be performed by a user or
-Administrator.):
-
-* ``python3 -m pip install ocrmypdf``
-
-..
-  The Windows Python versions do not place any python or python3 executable in the path.
-  They add the py launcher to the path:
-  https://docs.python.org/3/using/windows.html#python-launcher-for-windows
-
-If you installed Python using WinGet, then use the following command instead:
-
-* ``py -m pip install ocrmypdf``
-
-and use:
-
-* ``py -m ocrmypdf``
-
-To start OCRmyPDF.
-
-If you intend to use more Python software on your Windows machine, consider the use of
-`pipx <https://pipx.pypa.io/stable/>`_ or a similar tool to create isolated Python
-environments for each Python software that you want to use.
-
-OCRmyPDF will check the Windows Registry and standard locations in your Program Files
-for third party software it needs (specifically, Tesseract and Ghostscript). To
-override the versions OCRmyPDF selects, you can modify the ``PATH`` environment
-variable. `Follow these directions <https://www.computerhope.com/issues/ch000549.htm#dospath>`_
-to change the PATH.
-
-.. warning::
-
-    As of early 2021, users have reported problems with the Microsoft Store version of
-    Python and OCRmyPDF. These issues affect many other third party Python packages.
-    Please download Python from Python.org or a package manager instead of the
-    Microsoft Store version.
-
-.. warning::
-
-    32-bit Windows is not supported.
-
-Windows Subsystem for Linux
---------------------------
-
-#. Install Ubuntu 22.04 for Windows Subsystem for Linux, if not already installed.
-#. Follow the procedure to install :ref:`OCRmyPDF on Ubuntu 22.04 <ubuntu-lts-latest>`.
-#. Open the Windows command prompt and create a symlink:
-
-.. code-block:: powershell
-
-    wsl sudo ln -s  /home/$USER/.local/bin/ocrmypdf /usr/local/bin/ocrmypdf
-
-Then confirm that the expected version from PyPI (|latest|) is installed:
-
-.. code-block:: powershell
-
-    wsl ocrmypdf --version
-
-You can then run OCRmyPDF in the Windows command prompt or Powershell, prefixing
-``wsl``, and call it from Windows programs or batch files.
-
-Cygwin64
--------
-
-First install the following prerequisite Cygwin packages using ``setup-x86_64.exe``::
-
-    python310 (or later)
-    python3?-devel
-    python3?-pip
-    python3?-lxml
-    python3?-imaging
-
-       (where 3? means match the version of python3 you installed)
-
-    gcc-g++
-    ghostscript
-    libexempi3
-    libexempi-devel
-    libffi6
-    libffi-devel
-    pngquant
-    qpdf
-    libqpdf-devel
-    tesseract-ocr
-    tesseract-ocr-devel
-
-Then open a Cygwin terminal (e.g. ``mintty``) and run the following commands. Note
-that if you are using the version of ``pip`` that was installed with the Cygwin
-Python package, the command name will be ``pip3``. If you have since updated
-``pip`` (with, for instance, ``pip3 install --upgrade pip``), the command is
-likely just ``pip`` instead of ``pip3``:
-
-.. code-block:: bash
-
-    pip3 install wheel
-    pip3 install ocrmypdf
-
-The optional dependency "unpaper" is currently not available under Cygwin.
-Without it, certain options such as ``--clean`` will produce an error message.
-However, the OCR-to-text-layer functionality is available.
-
-Docker
------
-
-You can also :ref:`Install the Docker <docker>` container on Windows. Ensure that
-your command prompt can run the docker "hello world" container.
-
-Installing on FreeBSD
-=====================
-
-.. image:: https://repology.org/badge/version-for-repo/freebsd/ocrmypdf.svg
-    :alt: FreeBSD
-    :target: https://repology.org/project/ocrmypdf/versions
-
-.. code-block:: bash
-
-    pkg install textproc/py-ocrmypdf
-
-To install a more recent version, you could attempt to first install the system
-version with ``pkg``, then use ``pip install --user ocrmypdf``.
-
-Installing the Docker image
-===========================
-
-For some users, installing the Docker image will be easier than
-installing all of OCRmyPDF's dependencies.
-
-See :ref:`docker` for more information.
-
-Installing with Python pip
-==========================
-
-OCRmyPDF is delivered by PyPI because it is a convenient way to install
-the latest version. However, PyPI and ``pip`` cannot address the fact
-that ``ocrmypdf`` depends on certain non-Python system libraries and
-programs being installed.
-
-For best results, first install `your platform's
-version <https://repology.org/metapackage/ocrmypdf/versions>`__ of
-``ocrmypdf``, using the instructions elsewhere in this document. Then
-you can use ``pip`` to get the latest version if your platform version
-is out of date. Chances are that this will satisfy most dependencies.
-
-Use ``ocrmypdf --version`` to confirm what version was installed.
-
-Then you can install the latest OCRmyPDF from the Python wheels. First
-try:
-
-.. code-block:: bash
-
-    pip install --user ocrmypdf
-
-(If the message appears ``Requirement already satisfied: ocrmypdf in...``,
-you will need to use ``pip install --user --upgrade ocrmypdf``.)
-
-You should then be able to run ``ocrmypdf --version`` and see that the
-latest version was located.
-
-Installing with pipx
-====================
-
-Some users may prefer pipx. As with the method above, you will need to
-satisfy all non-Python dependencies. Then if pipx is installed, you
-can use
-
-.. code-block:: bash
-
-    pipx run ocrmypdf
-
-(If not installed, pipx will install first.)
-
-Requirements for pip and HEAD install
-------------------------------------
-
-OCRmyPDF currently requires these external programs and libraries to be
-installed, and must be satisfied using the operating system package
-manager. ``pip`` cannot provide them.
-
-The following versions are required:
-
-  Python 3.10 or newer
-  Ghostscript 9.54 or newer
-  Tesseract 4.1.1 or newer
-  jbig2enc 0.29 or newer
-  pngquant 2.5 or newer
-  unpaper 6.1
-
-We recommend 64-bit versions of all software. (32-bit versions are not
-supported, although on Linux, they may still work.)
-
-jbig2enc, pngquant, and unpaper are optional. If they are missing, certain
-features are disabled. OCRmyPDF will discover them as soon as they are
-available.
-
-**jbig2enc**, if present, will be used to optimize the encoding of
-monochrome images. This can significantly reduce the file size of the
-output file. It is not required.
-`jbig2enc <https://github.com/agl/jbig2enc>`__ is not generally
-available for Ubuntu or Debian due to lingering concerns about patent
-issues, but can easily be built from source. To add JBIG2 encoding, see
-:ref:`jbig2`.
-
-**pngquant**, if present, is optionally used to optimize the encoding of
-PNG-style images in PDFs (actually, any that are losslessly
-encoded) by lossily quantizing to a smaller color palette. It is only
-activated when the ``--optimize`` argument is ``2`` or ``3``.
-
-**unpaper**, if present, enables the ``--clean`` and ``--clean-final``
-command line options.
-
-These are in addition to the Python packaging dependencies, meaning that
-unfortunately, the ``pip install`` command cannot satisfy all of them.
-
-Installing HEAD revision from sources
-=====================================
-
-If you have ``git`` and Python 3.10 or newer installed, you can install
-from source. When the ``pip`` installer runs, it will alert you if
-dependencies are missing.
-
-If you prefer to build everything from source, you will need to `build
-pikepdf from
-source <https://pikepdf.readthedocs.io/en/latest/installation.html#building-from-source>`__.
-First ensure you can build and install pikepdf.
-
-To install the HEAD revision from sources in the current Python 3
-environment:
-
-.. code-block:: bash
-
-    pip install git+https://github.com/ocrmypdf/OCRmyPDF.git
-
-Or, to install in editable mode
-allowing customization of OCRmyPDF, use the ``-e`` flag:
-
-.. code-block:: bash
-
-    pip install -e git+https://github.com/ocrmypdf/OCRmyPDF.git
-
-You may find it easiest to install in a virtual environment, rather than
-system-wide:
-
-.. code-block:: bash
-
-    git clone -b main https://github.com/ocrmypdf/OCRmyPDF.git
-    python3 -m venv .venv
-    source .venv/bin/activate
-    cd OCRmyPDF
-    pip install .
-
-However, ``ocrmypdf`` will only be accessible on the system PATH when
-you activate the virtual environment.
-
-To run the program:
-
-.. code-block:: bash
-
-    ocrmypdf --help
-
-If not yet installed, the script will notify you about dependencies that
-need to be installed. The script requires specific versions of the
-dependencies. Versions older than the ones mentioned in the release notes
-are unlikely to be compatible with OCRmyPDF.
-
-For development
---------------
-
-To install all of the development and test requirements:
-
-.. code-block:: bash
-
-    git clone -b main https://github.com/ocrmypdf/OCRmyPDF.git
-    python -m venv .venv
-    source .venv/bin/activate
-    cd OCRmyPDF
-    pip install -e .[test]
-
-To add JBIG2 encoding, see :ref:`jbig2`.
-
-Shell completions
-=================
-
-Completions for ``bash`` and ``fish`` are available in the project's
-``misc/completion`` folder. The ``bash`` completions are likely ``zsh``
-compatible but this has not been confirmed. Package maintainers, please
-install these at the appropriate locations for your system.
-
-To manually install the ``bash`` completion, copy
-``misc/completion/ocrmypdf.bash`` to ``/etc/bash_completion.d/ocrmypdf``
-(rename the file).
-
-To manually install the ``fish`` completion, copy
-``misc/completion/ocrmypdf.fish`` to
-``~/.config/fish/completions/ocrmypdf.fish``.
-
-Note on 32-bit support
-======================
-
-Many Python libraries no longer provide 32-bit binary wheels for Linux. This
-includes many of the libraries that OCRmyPDF depends on, such as
-Pillow. The easiest way to express this to end users is to say we don't
-support 32-bit Linux.
-
-However, if your Linux distribution still supports 32-bit binaries, you
-can still install and use OCRmyPDF. A warning message will appear.
-In practice, OCRmyPDF may need more than 32-bit memory space to run when
-large documents are processed, so there are practical limitations to what
-users can accomplish with it. Still, for the common use case of a 32-bit
-ARM NAS or Raspberry Pi processing small documents, it should work.
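The minimum versions listed under "Requirements for pip and HEAD install" in the removed `installation.rst` text above can be checked with a short script. The following is a minimal sketch, not part of this commit and not OCRmyPDF's own detection logic. The helper names are hypothetical, and the executable names (`gs`, `tesseract`, `jbig2`, `pngquant`, `unpaper`) are assumed to be the usual Unix names for these programs.

```python
import re
import shutil

# Minimum versions copied from the requirements list in the text above.
# Keys are assumed executable names, not package names.
MINIMUM_VERSIONS = {
    "gs": "9.54",
    "tesseract": "4.1.1",
    "jbig2": "0.29",
    "pngquant": "2.5",
    "unpaper": "6.1",
}
# The text describes these three as optional.
OPTIONAL = {"jbig2", "pngquant", "unpaper"}


def parse_version(text):
    """Return the first dotted number found in text as a tuple of ints."""
    match = re.search(r"\d+(?:\.\d+)+", text)
    if match is None:
        raise ValueError(f"no version number found in {text!r}")
    return tuple(int(part) for part in match.group().split("."))


def meets_minimum(found, minimum):
    """Compare two version strings, padding the shorter one with zeros."""
    a, b = parse_version(found), parse_version(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def missing_programs(which=shutil.which):
    """Split programs not found on PATH into (required, optional) lists."""
    absent = [name for name in MINIMUM_VERSIONS if which(name) is None]
    required = [name for name in absent if name not in OPTIONAL]
    optional = [name for name in absent if name in OPTIONAL]
    return required, optional
```

Calling `missing_programs()` reports what is absent from the current `PATH`; `meets_minimum("5.3.0", MINIMUM_VERSIONS["tesseract"])` compares a version string you have obtained yourself. On Windows the Ghostscript console executable is typically named `gswin64c` rather than `gs`, so the table would need adjusting there.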
--- a/docs/introduction.rst
+++ b/docs/introduction.rst
@@ -1,10 +1,14 @@
-.. SPDX-FileCopyrightText: 2022 James R. Barlow
-..
-.. SPDX-License-Identifier: CC-BY-SA-4.0
+---
+substitutions:
+  image: |-
+    ```{image} images/bitmap_vs_svg.svg
+    ```
+---

-============
-Introduction
-============
+% SPDX-FileCopyrightText: 2022 James R. Barlow
+% SPDX-License-Identifier: CC-BY-SA-4.0
+
+# Introduction

 OCRmyPDF is a Python application and library that adds text "layers" to images in
 PDFs, making scanned image PDFs searchable. It uses OCR to guess the text
@@ -13,31 +17,30 @@ that enable customization of its processing steps, and it is highly tolerant
 of PDFs containing scanned images and "born digital" content that doesn't
 require text recognition.

-About OCR
-=========
+## About OCR

-`Optical character
-recognition <https://en.wikipedia.org/wiki/Optical_character_recognition>`__
+[Optical character
+recognition](https://en.wikipedia.org/wiki/Optical_character_recognition)
 is a technology that converts images of typed or handwritten text, such as
 in a scanned document, into computer text that can be selected, searched and copied.

 OCRmyPDF uses
-`Tesseract <https://github.com/tesseract-ocr/tesseract>`__, a widely
+[Tesseract](https://github.com/tesseract-ocr/tesseract), a widely
 available open source OCR engine, to perform OCR.

-.. _raster-vector:
+(raster-vector)=

-About PDFs
-==========
+## About PDFs

 PDFs are page description files that attempt to preserve a layout
-exactly. They contain `vector
-graphics <http://vector-conversions.com/vectorizing/raster_vs_vector.html>`__
+exactly. They contain [vector
+graphics](http://vector-conversions.com/vectorizing/raster_vs_vector.html)
 that can contain raster objects, such as scanned images. Because PDFs can
 contain multiple pages (unlike many image formats) and can contain fonts
 and text, they are a suitable format for exchanging scanned documents.

-|image|
+:::{image} images/bitmap_vs_svg.svg
+:::

 A PDF page may contain multiple images, even if it appears to have only
 one image. Some scanners or scanning software may segment pages into
@@ -48,10 +51,9 @@ Rasterizing a PDF is the process of generating corresponding raster images.
 OCR engines like Tesseract work with images, not scalable vector graphics
 or mixed raster-vector-text graphics such as PDF.

-About PDF/A
-===========
+## About PDF/A

-`PDF/A <https://en.wikipedia.org/wiki/PDF/A>`__ is an ISO-standardized
+[PDF/A](https://en.wikipedia.org/wiki/PDF/A) is an ISO-standardized
 subset of the full PDF specification that is designed for archiving (the
 'A' stands for Archive). PDF/A differs from PDF primarily by omitting
 features that could complicate future file readability,
@@ -63,8 +65,8 @@ of embedded content, it is likely more secure.
 There are various conformance levels and versions, such as "PDF/A-2b".

 In general, the preferred format for scanned documents is PDF/A. Some
-governments and jurisdictions, US Courts in particular, `mandate the use
-of PDF/A <https://pdfblog.com/2012/02/13/what-is-pdfa/>`__ for scanned
+governments and jurisdictions, US Courts in particular, [mandate the use
+of PDF/A](https://pdfblog.com/2012/02/13/what-is-pdfa/) for scanned
 documents.

 Since most individuals scanning documents aim for long-term readability,
@@ -78,13 +80,12 @@ files can be digitally signed but may not be encrypted to ensure future
 readability. Fortunately, converting from PDF/A to a regular PDF is
 straightforward, and any PDF viewer can handle PDF/A files.

-What OCRmyPDF does
-==================
+## What OCRmyPDF does

 OCRmyPDF analyzes each page of a PDF to determine the required colorspace
 and resolution (DPI) for capturing all the information on that page without
 losing content. It uses
-`Ghostscript <http://ghostscript.com/>`__ to rasterize each page and subsequently
+[Ghostscript](http://ghostscript.com/) to rasterize each page and subsequently
 performs OCR on the rasterized image to generate an OCR "layer." This layer
 is then integrated back into the original PDF.

@@ -101,10 +102,9 @@ options are utilized, the OCR layer is integrated into the processed image.
 By default, OCRmyPDF generates archival PDFs in the PDF/A format, which is
 a more rigid subset of PDF features designed for long-term archives. If you
 prefer regular PDFs, you can disable this feature using the
-``--output-type pdf`` option.
+`--output-type pdf` option.

-Why you shouldn't do this manually
-==================================
+## Why you shouldn't do this manually

 A PDF is similar to an HTML file, in that it contains document structure
 along with images. While some PDFs may solely display a full-page image,
@@ -142,55 +142,53 @@ like pikepdf and QPDF, it can auto-repair damaged PDFs. You don't need to
 understand the intricacies of these issues; you should be able to use
 OCRmyPDF with any PDF file, and expect reasonable results.

-Limitations
-===========
+## Limitations

 OCRmyPDF is subject to limitations imposed by the Tesseract OCR engine.
 These limitations are inherent to any software relying on Tesseract:

-  The OCR accuracy may not match that of commercial OCR solutions.
-  It is incapable of recognizing handwriting.
-  It may detect gibberish and report it as OCR output.
-  Results may be subpar when a document contains languages not specified
-   in the ``-l LANG`` argument.
-  Tesseract may struggle to analyze the natural reading order of documents.
-   For instance, it might fail to recognize two columns in a document and
-   attempt to join text across columns.
-  Poor quality scans can result in subpar OCR quality. In other words, the
-   quality of the OCR output depends on the quality of the input.
-  Tesseract does not provide information about the font family to which text
-   belongs.
-  Tesseract does not divide text into paragraphs or headings. It only provides
-   the text and its bounding box. As such, the generated PDF does not
-   contain any information about the document's structure.
+- The OCR accuracy may not match that of commercial OCR solutions.
+- It is incapable of recognizing handwriting.
+- It may detect gibberish and report it as OCR output.
+- Results may be subpar when a document contains languages not specified
+  in the `-l LANG` argument.
+- Tesseract may struggle to analyze the natural reading order of documents.
+  For instance, it might fail to recognize two columns in a document and
+  attempt to join text across columns.
+- Poor quality scans can result in subpar OCR quality. In other words, the
+  quality of the OCR output depends on the quality of the input.
+- Tesseract does not provide information about the font family to which text
+  belongs.
+- Tesseract does not divide text into paragraphs or headings. It only provides
+  the text and its bounding box. As such, the generated PDF does not
+  contain any information about the document's structure.

 Ghostscript also imposes some limitations:

-  PDFs containing JPEG 2000-encoded content may be converted to JPEG
-   encoding, which may introduce compression artifacts, if Ghostscript
-   PDF/A is enabled.
-  Ghostscript may transcode grayscale and color images, potentially
-   lossily, based on an internal algorithm. This
-   behavior can be suppressed by setting ``--pdfa-image-compression`` to
-   ``jpeg`` or ``lossless`` to set all images to one type or the other.
-   Ghostscript lacks an option to maintain the input image's format.
-   (Modern Ghostscript can copy JPEG images without transcoding them.)
-  Ghostscript's PDF/A conversion removes any XMP metadata that is not
-   one of the standard XMP metadata namespaces for PDFs. In particular,
-   PRISM Metadata is removed.
-  Ghostscript's PDF/A conversion may remove or deactivate
-   hyperlinks and other active content.
+- PDFs containing JPEG 2000-encoded content may be converted to JPEG
+  encoding, which may introduce compression artifacts, if Ghostscript
+  PDF/A is enabled.
+- Ghostscript may transcode grayscale and color images, potentially
+  lossily, based on an internal algorithm. This
+  behavior can be suppressed by setting `--pdfa-image-compression` to
+  `jpeg` or `lossless` to set all images to one type or the other.
+  Ghostscript lacks an option to maintain the input image's format.
+  (Modern Ghostscript can copy JPEG images without transcoding them.)
+- Ghostscript's PDF/A conversion removes any XMP metadata that is not
+  one of the standard XMP metadata namespaces for PDFs. In particular,
+  PRISM Metadata is removed.
+- Ghostscript's PDF/A conversion may remove or deactivate
+  hyperlinks and other active content.

-You can use ``--output-type pdf`` to disable PDF/A conversion and produce
+You can use `--output-type pdf` to disable PDF/A conversion and produce
 a standard, non-archival PDF.

 Regarding OCRmyPDF itself:

-  PDFs using transparency are not currently represented in the test
-   suite
+- PDFs using transparency are not currently represented in the test
+  suite

-Similar programs
-================
+## Similar programs

 To the author's knowledge, OCRmyPDF is the most feature-rich and
 thoroughly tested command line OCR PDF conversion tool. If it does not
@@ -199,8 +197,7 @@ meet your needs, contributions and suggestions are welcome.
 Ghostscript recently added three "pdfocr" output devices. They work by
 rasterizing all content and converting all pages to a single colour space.

-Web front-ends
-==============
+## Web front-ends

 The Docker image of OCRmyPDF provides a web service front-end
 that allows files to be submitted over HTTP and the results to be downloaded.
@@ -210,16 +207,14 @@ public internet and does not provide any security measures.

 In addition, the following third-party integrations are available:

-  `Paperless-ngx <https://docs.paperless-ngx.com/>`__ is a free software
-   document management system that uses OCRmyPDF to perform OCR on
-   uploaded documents.
-  `Nextcloud OCR <https://github.com/janis91/ocr>`__ is a free software
-   plugin for the Nextcloud private cloud software.
+- [Paperless-ngx](https://docs.paperless-ngx.com/) is a free software
+  document management system that uses OCRmyPDF to perform OCR on
+  uploaded documents.
+- [Nextcloud OCR](https://github.com/janis91/ocr) is a free software
+  plugin for the Nextcloud private cloud software.

 OCRmyPDF is not designed to be secure against malware-bearing PDFs (see
-`Using OCRmyPDF online <ocr-service>`__). Users should ensure they
+[Using OCRmyPDF online](ocr-service)). Users should ensure they
 comply with OCRmyPDF's licenses and the licenses of all dependencies. In
 particular, OCRmyPDF requires Ghostscript, which is licensed under
 AGPLv3.
-
-.. |image| image:: images/bitmap_vs_svg.svg
--- a/docs/languages.md
+++ b/docs/languages.md
@@ -0,0 +1,129 @@
+% SPDX-FileCopyrightText: 2022 James R. Barlow
+% SPDX-License-Identifier: CC-BY-SA-4.0
+
+(lang-packs)=
+
+# Installing additional language packs
+
+OCRmyPDF uses Tesseract for OCR, and relies on its language packs for all languages.
+On most platforms, English is installed with Tesseract by default, but not always.
+
+Tesseract supports [most
+languages](https://github.com/tesseract-ocr/tesseract/blob/main/doc/tesseract.1.asc#languages).
+Languages are identified by standardized three-letter codes (called ISO 639-2 Alpha-3).
+Tesseract's documentation also lists the three-letter code for your language.
+Some are anglicized, e.g. Spanish is `spa` rather than `esp`, while others
+are not, e.g. German is `deu` and French is `fra`.
+
+Language packs (strictly speaking, Tesseract "traineddata" files) generally correspond
+to the language in question, but different language packs are used in certain
+situations. For German, the "Fraktur" language pack can assist with reading older
+materials in the Fraktur typeface family (`deu_frak`). Some communities have changed
+their script from Cyrillic to Latin; the Cyrillic version of Uzbek is available
+as `uzb_cyrl` and the Latin version is `uzb`.
+
+After you have installed a language pack, you can use it with `ocrmypdf -l <language>`,
+for example `ocrmypdf -l spa`. For multilingual documents, you can specify
+all languages to be expected, e.g. `ocrmypdf -l eng+fra` for English and French.
+English is assumed by default unless other language(s) are specified.
+
+For Linux users, you can often find packages that provide language
+packs.
+
+## Platform install steps
+
+### Debian and Ubuntu (apt)
+
+```bash
+# Display a list of all Tesseract language packs
+apt-cache search tesseract-ocr
+
+# Install Chinese Simplified language pack
+apt-get install tesseract-ocr-chi-sim
+```
+
+You can then pass the `-l LANG` argument to OCRmyPDF to give a hint as
+to what languages it should search for. Multiple languages can be
+requested using either `-l eng+fra` (English and French) or
+`-l eng -l fra`.
+
+### Fedora
+
+```bash
+# Display a list of all Tesseract language packs
+dnf search tesseract
+
+# Install Chinese Simplified language pack
+dnf install tesseract-langpack-chi_sim
+```
+
+You can then pass the `-l LANG` argument to OCRmyPDF to give a hint as
+to what languages it should search for. Multiple languages can be
+requested using either `-l eng+fra` (English and French) or
+`-l eng -l fra`.
+
+### Arch Linux
+
+```bash
+# Display a list of all Tesseract language packs
+pacman -Ss tesseract-data
+
+# Install German language pack
+pacman -S tesseract-data-deu
+```
+
+You can then pass the `-l LANG` argument to OCRmyPDF to give a hint as
+to what languages it should search for. Multiple languages can be
+requested using either `-l eng+fra` (English and French) or
+`-l eng -l fra`.
+
+### Gentoo
+
+On Gentoo the package `app-text/tessdata_fast`, which `app-text/tesseract` depends on, handles Tesseract languages.
+It accepts USE flags to select which languages should be installed; these can be set in `/etc/portage/package.use`.
+Alternatively one can globally set the [L10N use extension](https://wiki.gentoo.org/wiki/Localization/Guide#L10N) in `/etc/portage/make.conf`.
+This enables these languages for all packages (e.g. including aspell).
+
+```bash
+# Display a list of all Tesseract language packs
+equery uses app-text/tessdata_fast
+
+# Add English and German language support for Tesseract only
+echo 'app-text/tessdata_fast l10n_de l10n_en' >> /etc/portage/package.use
+
+# Add global English and German language support (the `l10n_` from equery has to be omitted)
+echo 'L10N="de en"' >> /etc/portage/make.conf
+
+# update system to reflect changed USE flags
+emerge --update --deep --newuse @world
+```
+
+You can then pass the `-l LANG` argument to OCRmyPDF to give a hint as
+to what languages it should search for. Multiple languages can be
+requested using either `-l eng+fra` (English and French) or
+`-l eng -l fra`.
+
+### macOS
+
+You can install additional language packs by
+{ref}`installing Tesseract using Homebrew with all language packs <macos-all-languages>`.
+
+### Docker
+
+Users of the OCRmyPDF Docker image should install language packs into a
+derived Docker image as
+{ref}`described in that section <docker-lang-packs>`.
+
+### Windows
+
+The Tesseract installer provided by Chocolatey currently includes only the English language pack.
+To install other languages, download the respective language pack (`.traineddata` file)
+from <https://github.com/tesseract-ocr/tessdata/> and place it in
+`C:\Program Files\Tesseract-OCR\tessdata` (or wherever Tesseract OCR is installed).
+
+## Custom language packs
+
+If you have fine-tuned or trained Tesseract and generated custom trained data, you can
+copy your `customlang.traineddata` file into your Tesseract "tessdata" folder, and
+then use the `-l customlang` argument to tell OCRmyPDF to pass that language on to
+Tesseract.
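The new `languages.md` above explains that installed packs are selected with `-l` and joined with `+` for multilingual documents. The following is a minimal sketch of that step, not part of this commit. The helper names are hypothetical, and the parser assumes that `tesseract --list-langs` prints a header line beginning with "List of" followed by one language code per line, which may vary between Tesseract versions.

```python
import subprocess


def parse_list_langs(output):
    """Extract language codes from `tesseract --list-langs` style output."""
    codes = []
    for line in output.splitlines():
        line = line.strip()
        # Skip blank lines and the assumed header ("List of available languages ...").
        if not line or line.startswith("List of"):
            continue
        codes.append(line)
    return codes


def installed_languages():
    """Ask the local Tesseract which language packs it can see."""
    result = subprocess.run(
        ["tesseract", "--list-langs"],
        capture_output=True,
        text=True,
        check=True,
    )
    # Read both streams in case the list is written to stderr
    # (an assumption about some releases, not verified here).
    return parse_list_langs(result.stdout + result.stderr)


def language_args(requested, installed):
    """Build the `-l` argument, refusing packs that are not installed."""
    missing = [code for code in requested if code not in installed]
    if missing:
        raise ValueError("language packs not installed: " + ", ".join(missing))
    return ["-l", "+".join(requested)]
```

For example, `language_args(["eng", "fra"], installed_languages())` yields `["-l", "eng+fra"]` when both packs are present, and raises before OCR starts when one is missing.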
--- a/docs/languages.rst
+++ b/docs/languages.rst
@@ -1,141 +0,0 @@
-.. SPDX-FileCopyrightText: 2022 James R. Barlow
-..
-.. SPDX-License-Identifier: CC-BY-SA-4.0
-
-.. _lang-packs:
-
-====================================
-Installing additional language packs
-====================================
-
-OCRmyPDF uses Tesseract for OCR, and relies on its language packs for all languages.
-On most platforms, English is installed with Tesseract by default, but not always.
-
-Tesseract supports `most
-languages <https://github.com/tesseract-ocr/tesseract/blob/main/doc/tesseract.1.asc#languages>`__.
-Languages are identified by standardized three-letter codes (called ISO 639-2 Alpha-3).
-Tesseract's documentation also lists the three-letter code for your language.
-Some are anglicized, e.g. Spanish is ``spa`` rather than ``esp``, while others
-are not, e.g. German is ``deu`` and French is ``fra``.
-
-Language packs (strictly speaking, Tesseract "traineddata" files) generally correspond
-to the language in question, but different language packs are used in certain
-situations. For German, the "Fraktur" language pack can assist with reading older
-materials in the Fraktur typeface family (``deu_frak``). Some communities have changed
-their script from Cyrillic to Latin; the Cyrillic version of Uzbek is available
-as ``uzb_cyrl`` and the Latin version is ``uzb``.
-
-After you have installed a language pack, you can use it with ``ocrmypdf -l <language>``,
-for example ``ocrmypdf -l spa``. For multilingual documents, you can specify
-all languages to be expected, e.g. ``ocrmypdf -l eng+fra`` for English and French.
-English is assumed by default unless other language(s) are specified.
-
-For Linux users, you can often find packages that provide language
-packs.
-
-Platform install steps
-======================
-
-Debian and Ubuntu (apt)
-----------------------
-
-.. code-block:: bash
-
-   # Display a list of all Tesseract language packs
-   apt-cache search tesseract-ocr
-
-   # Install Chinese Simplified language pack
-   apt-get install tesseract-ocr-chi-sim
-
-You can then pass the ``-l LANG`` argument to OCRmyPDF to give a hint as
-to what languages it should search for. Multiple languages can be
-requested using either ``-l eng+fra`` (English and French) or
-``-l eng -l fra``.
-
-Fedora
------
-
-.. code-block:: bash
-
-   # Display a list of all Tesseract language packs
-   dnf search tesseract
-
-   # Install Chinese Simplified language pack
-   dnf install tesseract-langpack-chi_sim
-
-You can then pass the ``-l LANG`` argument to OCRmyPDF to give a hint as
-to what languages it should search for. Multiple languages can be
-requested using either ``-l eng+fra`` (English and French) or
-``-l eng -l fra``.
-
-Arch Linux
----------
-
-.. code-block:: bash
-
-   # Display a list of all Tesseract language packs
-   pacman -Ss tesseract-data
-
-   # Install German language pack
-   pacman -S tesseract-data-deu
-
-You can then pass the ``-l LANG`` argument to OCRmyPDF to give a hint as
-to what languages it should search for. Multiple languages can be
-requested using either ``-l eng+fra`` (English and French) or
-``-l eng -l fra``.
-
-Gentoo
------
-
-On Gentoo the package ``app-text/tessdata_fast``, which ``app-text/tesseract`` depends on, handles Tesseract languages.
-It accepts USE flags to select what languages should be installed, these can be set in ``/etc/portage/package.use``.
-Alternatively one can globally set the `L10N use extension <https://wiki.gentoo.org/wiki/Localization/Guide#L10N>`__ in ``/etc/portage/make.conf``.
-This enables these languages for all packages (e.g. including aspell).
-
-.. code-block:: bash
-
-   # Display a list of all Tesseract language packs
-   equery uses app-text/tessdata_fast
-
-   # Add English and German language support for Tesseract only
-   echo 'app-text/tessdata_fast l10n_de l10n_en' >> /etc/portage/package.use
-
-   # Add global English and German language support (the `l10n_` from equery has to be omitted)
-   echo L10N="de en" >> /etc/portage/make.conf
-
-   # update system to reflect changed USE flags
-   emerge --update --deep --newuse @world
-
-You can then pass the ``-l LANG`` argument to OCRmyPDF to give a hint as
-to what languages it should search for. Multiple languages can be
-requested using either ``-l eng+fra`` (English and French) or
-``-l eng -l fra``.
-
-macOS
-----
-
-You can install additional language packs by
-:ref:`installing Tesseract using Homebrew with all language packs <macos-all-languages>`.
-
-Docker
------
-
-Users of the OCRmyPDF Docker image should install language packs into a
-derived Docker image as
-:ref:`described in that section <docker-lang-packs>`.
-
-Windows
-------
-
-The Tesseract installer provided by Chocolatey currently includes only English language.
-To install other languages, download the respective language pack (``.traineddata`` file)
-from https://github.com/tesseract-ocr/tessdata/ and place it in
-``C:\\Program Files\\Tesseract-OCR\\tessdata`` (or wherever Tesseract OCR is installed).
-
-Custom language packs
-=====================
-
-If you have fine-tuned or trained Tesseract and generated custom trained data, you can
-copy your ``customlang.traineddata`` file into your Tesseract "tessdata" folder, and
-then use the ``-l customlang`` argument to tell OCRmyPDF to pass that language on to
-Tesseract.
--- a/docs/performance.md
+++ b/docs/performance.md
@@ -0,0 +1,24 @@
+% SPDX-FileCopyrightText: 2022 James R. Barlow
+% SPDX-License-Identifier: CC-BY-SA-4.0
+
+# Performance
+
+Some users have noticed that current versions of OCRmyPDF do not run as
+quickly as some older versions (specifically 6.x and older). This is
+because OCRmyPDF added image optimization as a postprocessing step, and
+it is enabled by default.
+
+## Speed
+
+If running OCRmyPDF quickly is your main goal, you can use settings such
+as:
+
+-   `--optimize 0` to disable file size optimization
+-   `--output-type pdf` to disable PDF/A generation
+-   `--fast-web-view 999999` to disable fast web view optimization
+-   `--skip-big MPIXELS` to skip OCR on pages larger than the given number of megapixels
+
+You can also avoid:
+
+-   `--force-ocr`
+-   Image preprocessing
--- a/docs/performance.rst
+++ b/docs/performance.rst
@@ -1,26 +0,0 @@
-.. SPDX-FileCopyrightText: 2022 James R. Barlow
-..
-.. SPDX-License-Identifier: CC-BY-SA-4.0
-
-===========
-Performance
-===========
-
-Some users have noticed that current versions of OCRmyPDF do not run as quickly
-as some older versions (specifically 6.x and older). This is because OCRmyPDF
-added image optimization as a postprocessing step, and it is enabled by default.
-
-Speed
-=====
-
-If running OCRmyPDF quickly is your main goal, you can use settings such as:
-
-* ``--optimize 0`` to disable file size optimization
-* ``--output-type pdf`` to disable PDF/A generation
-* ``--fast-web-view 999999`` to disable fast web view optimization
-* ``--skip-big`` to skip large images, if some pages have large images
-
-You can also avoid:
-
-* ``--force-ocr``
-* Image preprocessing
--- a/docs/plugins.rst
+++ b/docs/plugins.rst
@@ -1,15 +1,12 @@
-.. SPDX-FileCopyrightText: 2022 James R. Barlow
-..
-.. SPDX-License-Identifier: CC-BY-SA-4.0
+% SPDX-FileCopyrightText: 2022 James R. Barlow
+% SPDX-License-Identifier: CC-BY-SA-4.0

-=======
-Plugins
-=======
+# Plugins

-    The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL
-    NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED",  "MAY", and
-    "OPTIONAL" in this document are to be interpreted as described in
-    RFC 2119.
+> The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL
+> NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and
+> "OPTIONAL" in this document are to be interpreted as described in
+> RFC 2119.

 You can use plugins to customize the behavior of OCRmyPDF at certain points of
 interest.
@@ -24,75 +21,71 @@ Currently, it is possible to:
 - replace Ghostscript with another PDF to image converter (rasterizer) or
  PDF/A generator

-OCRmyPDF plugins are based on the Python ``pluggy`` package and conform to its
+OCRmyPDF plugins are based on the Python `pluggy` package and conform to its
 conventions. Note that plugins installed as setuptools entrypoints are
 not checked currently, because OCRmyPDF assumes you may not want to enable
 plugins for all files.

-See [OCRmyPDF-EasyOCR](https://github.com/ocrmypdf/OCRmyPDF-EasyOCR) for an
+See [OCRmyPDF-EasyOCR](https://github.com/ocrmypdf/OCRmyPDF-EasyOCR) for an
 example of a straightforward, fully working plugin.

-Script plugins
-==============
+## Script plugins

 Script plugins may be called from the command line, by specifying the name of a file.
 Script plugins may be convenient for informal or "one-off" plugins, when a certain
 batch of files needs a special processing step for example.

-.. code-block:: bash
+```bash
+ocrmypdf --plugin ocrmypdf_example_plugin.py input.pdf output.pdf
+```

-    ocrmypdf --plugin ocrmypdf_example_plugin.py input.pdf output.pdf
+Multiple plugins may be installed by issuing the `--plugin` argument multiple times.

-Multiple plugins may be installed by issuing the ``--plugin`` argument multiple times.
-
-Packaged plugins
-================
+## Packaged plugins

 Installed plugins may be installed into the same virtual environment as OCRmyPDF
 is installed into. They may be invoked using Python standard module naming.
 If you are intending to distribute a plugin, please package it.

-.. code-block:: bash
-
-    ocrmypdf --plugin ocrmypdf_fancypants.pockets.contents input.pdf output.pdf
+```bash
+ocrmypdf --plugin ocrmypdf_fancypants.pockets.contents input.pdf output.pdf
+```

 OCRmyPDF does not automatically import plugins, because the assumption is that
 plugins affect different files differently and you may not want them activated
-all the time. The command line or ``ocrmypdf.ocr(plugin='...')`` must call
+all the time. The command line or `ocrmypdf.ocr(plugin='...')` must call
 for them.

 Third parties that wish to distribute packages for ocrmypdf should package them
-as packaged plugins, and these modules should begin with the name ``ocrmypdf_``
-similar to ``pytest`` packages such as ``pytest-cov`` (the package) and
-``pytest_cov`` (the module).
+as packaged plugins, and these modules should begin with the name `ocrmypdf_`
+similar to `pytest` packages such as `pytest-cov` (the package) and
+`pytest_cov` (the module).

-.. note::
+:::{note}
+We recommend plugin authors name their plugins with the prefix
+`ocrmypdf-` (for the package name on PyPI) and `ocrmypdf_` (for the
+module), just like pytest plugins. At the same time, please make it clear
+that your package is not official.
+:::

-    We recommend plugin authors name their plugins with the prefix
-    ``ocrmypdf-`` (for the package name on PyPI) and ``ocrmypdf_`` (for the
-    module), just like pytest plugins. At the same time, please make it clear
-    that your package is not official.
-
-Plugins
-=======
+## Plugins

 You can also create a plugin that OCRmyPDF will always automatically load if both are
 installed in the same virtual environment, using a project entrypoint.
 OCRmyPDF uses the entrypoint namespace "ocrmypdf".

-For example, ``pyproject.toml`` would need to contain the following, for a plugin named
-``ocrmypdf-exampleplugin``:
+For example, `pyproject.toml` would need to contain the following, for a plugin named
+`ocrmypdf-exampleplugin`:

-.. code-block:: toml
+```toml
+[project]
+name = "ocrmypdf-exampleplugin"

-    [project]
-    name = "ocrmypdf-exampleplugin"
+[project.entry-points."ocrmypdf"]
+exampleplugin = "exampleplugin.pluginmodule"
+```

-    [project.entry-points."ocrmypdf"]
-    exampleplugin = "exampleplugin.pluginmodule"
-
-Plugin requirements
-===================
+## Plugin requirements

 OCRmyPDF generally uses multiple worker processes. When a new worker is started,
 Python will import all plugins again, including all plugins that were imported earlier.
@@ -103,14 +96,14 @@ to obtain a reference to shared state prepared by another hook implementation.
 Plugins must expect that other instances of the plugin will be running
 simultaneously.

-The ``context`` object that is passed to many hooks can be used to share information
+The `context` object that is passed to many hooks can be used to share information
 about a file being worked on. Plugins must write private, plugin-specific data to
-a subfolder named ``{options.work_folder}/ocrmypdf-plugin-name``. Plugins MAY
-read and write files in ``options.work_folder``, but should be aware that their
+a subfolder named `{options.work_folder}/ocrmypdf-plugin-name`. Plugins MAY
+read and write files in `options.work_folder`, but should be aware that their
 semantics are subject to change.

-OCRmyPDF will delete ``options.work_folder`` when it has finished OCRing
-a file, unless invoked with ``--keep-temporary-files``.
+OCRmyPDF will delete `options.work_folder` when it has finished OCRing
+a file, unless invoked with `--keep-temporary-files`.

 The documentation for some plugin hooks contains a detailed description of the
 execution context in which they will be called.
@@ -119,114 +112,139 @@ Plugins should be prepared to work whether executed in worker threads or worker
 processes. Generally, OCRmyPDF uses processes, but has a semi-hidden threaded
 argument that simplifies debugging.

-
-Plugin hooks
-============
+## Plugin hooks

 A plugin may provide the following hooks. Hooks must be decorated with
-``ocrmypdf.hookimpl``, for example:
+`ocrmypdf.hookimpl`, for example:

-.. code-block:: python
+```python
+from ocrmypdf import hookimpl

-    from ocrmpydf import hookimpl
-
-    @hookimpl
-    def add_options(parser):
-        pass
+@hookimpl
+def add_options(parser):
+    pass
+```

 The following is a complete list of hooks that are available, and when
 they are called.

-.. _firstresult:
+(firstresult)=

 **Note on firstresult hooks**

 If multiple plugins install implementations for this hook, they will be called in
 the reverse of the order in which they are installed (i.e., last plugin wins).
 When each hook implementation is called in order, the first implementation that
-returns a value other than ``None`` will "win" and prevent execution of all other
+returns a value other than `None` will "win" and prevent execution of all other
 hooks. As such, you cannot "chain" a series of plugin filters together in this
 way. Instead, a single hook implementation should be responsible for any such
 chaining operations.

-Examples
-========
+## Examples

-* OCRmyPDF's test suite contains several plugins that are used to simulate certain
+- OCRmyPDF's test suite contains several plugins that are used to simulate certain
  test conditions.
-* `ocrmypdf-papermerge <https://github.com/papermerge/OCRmyPDF_papermerge>`_ is
+- [ocrmypdf-papermerge](https://github.com/papermerge/OCRmyPDF_papermerge) is
  a production plugin that integrates OCRmyPDF and the Papermerge document
  management system.

+### Suppressing or overriding other plugins

-Suppressing or overriding other plugins
---------------------------------------
-
+```{eval-rst}
 .. autofunction:: ocrmypdf.pluginspec.initialize
+```

-Custom command line arguments
-----------------------------
+### Custom command line arguments

+```{eval-rst}
 .. autofunction:: ocrmypdf.pluginspec.add_options
+```

+```{eval-rst}
 .. autofunction:: ocrmypdf.pluginspec.check_options
+```

-Execution and progress reporting
--------------------------------
+### Execution and progress reporting

+```{eval-rst}
 .. autoclass:: ocrmypdf.pluginspec.ProgressBar
    :members:
    :special-members: __init__, __enter__, __exit__
+```

+```{eval-rst}
 .. autoclass:: ocrmypdf.pluginspec.Executor
    :members:
    :special-members: __call__
+```

+```{eval-rst}
 .. autofunction:: ocrmypdf.pluginspec.get_logging_console
+```

+```{eval-rst}
 .. autofunction:: ocrmypdf.pluginspec.get_executor
+```

+```{eval-rst}
 .. autofunction:: ocrmypdf.pluginspec.get_progressbar_class
+```

-Applying special behavior before processing
-------------------------------------------
+### Applying special behavior before processing

+```{eval-rst}
 .. autofunction:: ocrmypdf.pluginspec.validate
+```

-PDF page to image
-----------------
+### PDF page to image

+```{eval-rst}
 .. autofunction:: ocrmypdf.pluginspec.rasterize_pdf_page
+```

-Modifying intermediate images
-----------------------------
+### Modifying intermediate images

+```{eval-rst}
 .. autofunction:: ocrmypdf.pluginspec.filter_ocr_image
+```

+```{eval-rst}
 .. autofunction:: ocrmypdf.pluginspec.filter_page_image
+```

+```{eval-rst}
 .. autofunction:: ocrmypdf.pluginspec.filter_pdf_page
+```

-OCR engine
----------
+### OCR engine

+```{eval-rst}
 .. autofunction:: ocrmypdf.pluginspec.get_ocr_engine
+```

+```{eval-rst}
 .. autoclass:: ocrmypdf.pluginspec.OcrEngine
    :members:

    .. automethod:: __str__
+```

+```{eval-rst}
 .. autoclass:: ocrmypdf.pluginspec.OrientationConfidence
+```

-PDF/A production
----------------
+### PDF/A production

+```{eval-rst}
 .. autofunction:: ocrmypdf.pluginspec.generate_pdfa
+```

-PDF optimization
----------------
+### PDF optimization

+```{eval-rst}
 .. autofunction:: ocrmypdf.pluginspec.optimize_pdf
+```

-.. autofunction:: ocrmypdf.pluginspec.is_optimization_enabled
+```{eval-rst}
+.. autofunction:: ocrmypdf.pluginspec.is_optimization_enabled
+```
--- a/docs/release_notes.md
+++ b/docs/release_notes.md
--- a/docs/release_notes.rst
+++ b/docs/release_notes.rst