diff --git a/docs/installation.rst b/docs/installation.rst
index 8b0149ad..cd9871ab 100644
--- a/docs/installation.rst
+++ b/docs/installation.rst
@@ -12,6 +12,35 @@ If you want to use the latest version of OCRmyPDF, your best bet is to install t
 Installing on Debian and Ubuntu 16.10 or newer
 ----------------------------------------------
 
+.. |deb-stable| image:: https://repology.org/badge/version-only-for-repo/debian_stable/ocrmypdf.svg
+    :alt: Debian 9 stable ("stretch")
+
+.. |deb-testing| image:: https://repology.org/badge/version-only-for-repo/debian_testing/ocrmypdf.svg
+    :alt: Debian 10 testing ("buster")
+
+.. |deb-unstable| image:: https://repology.org/badge/version-only-for-repo/debian_unstable/ocrmypdf.svg
+    :alt: Debian unstable
+
+.. |ubu-1710| image:: https://repology.org/badge/version-only-for-repo/ubuntu_17_10/ocrmypdf.svg
+    :alt: Ubuntu 17.10
+
+.. |ubu-1804| image:: https://repology.org/badge/version-only-for-repo/ubuntu_18_04/ocrmypdf.svg
+    :alt: Ubuntu 18.04 LTS
+
++------------------------------+-------------------------+
+| OS                           | OCRmyPDF Version        |
++------------------------------+-------------------------+
+| Debian 9 stable ("stretch")  | |deb-stable|            |
++------------------------------+-------------------------+
+| Debian 10 testing ("buster") | |deb-testing|           |
++------------------------------+-------------------------+
+| Debian unstable ("sid")      | |deb-unstable|          |
++------------------------------+-------------------------+
+| Ubuntu 17.10                 | |ubu-1710|              |
++------------------------------+-------------------------+
+| Ubuntu 18.04 LTS             | |ubu-1804|              |
++------------------------------+-------------------------+
+
 Users of Debian 9 ("stretch") or later or Ubuntu 16.10 or later may simply
 
 .. code-block:: bash
@@ -157,24 +186,14 @@ Update Homebrew:
 
     brew update
 
-Install or upgrade the required Homebrew packages, if any are missing:
+Install or upgrade the required Homebrew packages, if any are missing. To do this, download the ``Brewfile``, which lists all of the dependencies, to the current directory and run ``brew bundle`` to process it (installing or upgrading packages as needed). The ``Brewfile`` is a plain text file.
 
 .. code-block:: bash
 
-    brew install libpng openjpeg jbig2dec libtiff # image libraries
-    brew install qpdf
-    brew install ghostscript
-    brew install python3
-    brew install libxml2 libffi leptonica
-    brew install unpaper # optional
+    wget https://github.com/jbarlow83/OCRmyPDF/raw/master/.travis/Brewfile
+    brew bundle
 
-Python 3.5, 3.6 and 3.7 are supported.
-
-Install the required Tesseract OCR engine with the language packs you plan to use:
-
-.. code-block:: bash
-
-    brew install tesseract # Option 1: for English, French, German, Spanish
+This will include the English, French, German and Spanish language packs. If you need other languages you can optionally install them all:
 
 .. _macos-all-languages:
 
@@ -210,7 +229,7 @@ Installing the latest version on Ubuntu 18.04 LTS
 -------------------------------------------------
 
 Ubuntu 18.04 includes ocrmypdf 6.1.2. To install a more recent version, first
-install the system version to get all the dependencies:
+install the system version to get most of the dependencies:
 
 .. code-block:: bash
 
@@ -219,6 +238,15 @@ install the system version to get all the dependencies:
         ocrmypdf \
         python3-pip
 
+There are a few dependency changes between ocrmypdf 6.1.2 and 7.x. Let's get
+these, too.
+
+.. code-block:: bash
+
+    sudo apt-get install \
+        libexempi3 \
+        pngquant
+
 Then install the most recent ocrmypdf for the local user and set the user's ``PATH`` to check for the user's Python packages.
 
 .. code-block:: bash
@@ -226,6 +254,7 @@ Then install the most recent ocrmypdf for the local user and set the user's ``PA
     export PATH=$HOME/.local/bin:$PATH
     pip3 install --user ocrmypdf
 
+To add JBIG2 encoding, see `Optional: installing the JBIG2 encoder`_.
 
 Installing on Ubuntu 16.04 LTS
 ------------------------------
@@ -236,32 +265,39 @@ No package is currently available for Ubuntu 16.04, but you can install the depe
 
     sudo apt-get update
     sudo apt-get install \
-        unpaper \
         ghostscript \
-        tesseract-ocr \
-        qpdf \
+        libexempi3 \
+        pngquant \
+        python3-cffi \
         python3-pip \
-        python3-cffi
+        qpdf \
+        tesseract-ocr \
+        unpaper
 
-If you wish install OCRmyPDF for the current user:
+If you wish to install OCRmyPDF for the current user, ensure that the ``PATH``
+environment variable contains ``$HOME/.local/bin``:
 
 .. code-block:: bash
 
+    export PATH=$HOME/.local/bin:$PATH
     pip3 install --user ocrmypdf
 
-Alternately, system-wide. Note that this may modify the system Python environment:
+Alternately, you can install ocrmypdf system-wide. (Not recommended.)
 
 .. code-block:: bash
 
     sudo pip3 install ocrmypdf
 
-If you wish to install OCRmyPDF to a virtual environment to isolate the system Python, you can follow these steps.
+At your option, you may upgrade Ubuntu 16.04 LTS to Tesseract 4.0 for improved OCR results.
 
 .. code-block:: bash
 
-    python3 -m venv venv-ocrmypdf
-    source venv-ocrmypdf/bin/activate
-    pip3 install ocrmypdf
+    sudo apt-get install -y software-properties-common python-software-properties
+    sudo add-apt-repository ppa:alex-p/tesseract-ocr -y
+    sudo apt-get update
+    sudo apt-get upgrade tesseract-ocr
+
+To add JBIG2 encoding, see `Optional: installing the JBIG2 encoder`_.
 
 Installing on Ubuntu 14.04 LTS
 ------------------------------
@@ -281,11 +317,13 @@ Install system dependencies:
     sudo apt-get install \
         software-properties-common python-software-properties \
         zlib1g-dev \
+        libexempi3 \
         libjpeg-dev \
         libffi-dev \
+        pngquant \
         qpdf
 
-We will need backports of Ghostscript 9.16, libav-11 (for unpaper 6.1), Tesseract 4.00 (alpha), and Python 3.6. This will replace Ghostscript and Tesseract 3.x on your system. Python 3.6 will be installed alongside the system Python 3.
+We will need backports of Ghostscript 9.16, libav-11 (for unpaper 6.1), Tesseract 4.00 (alpha), and Python 3.6. This will replace Ghostscript and Tesseract 3.x on your system. Python 3.6 will be installed alongside the system Python 3.4.
 
 If you prefer to not modify your system in this matter, consider using a Docker container.
 
@@ -322,6 +360,8 @@ These installation instructions omit the optional dependency ``unpaper``, which
     wget -q 'https://www.dropbox.com/s/vaq0kbwi6e6au80/unpaper_6.1-1.deb?raw=1' -O unpaper_6.1-1.deb
     sudo dpkg -i unpaper_6.1-1.deb
 
+To add JBIG2 encoding, see `Optional: installing the JBIG2 encoder`_.
+
 Installing on ArchLinux
 -----------------------
 
@@ -377,27 +417,34 @@ Since ``pip3 install --user`` does not work correctly on some platforms, notably
 Requirements for pip and HEAD install
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
-OCRmyPDF currently requires these external programs to be installed:
+OCRmyPDF currently requires these external programs and libraries to be installed:
 
 - Python 3.5 or newer
-- Tesseract 3.04 or newer
 - Ghostscript 9.15 or newer
+- libexempi3 2.2.0 or newer
 - qpdf 7.0.0 or newer
+- Tesseract 3.04 or newer
 
-The following dependencies are recommended:
+As of ocrmypdf 7.0.0, the following dependencies are recommended:
 
-- Python 3.6
-- Tesseract 4.00 or newer
-- Ghostscript 9.22 or newer
-- qpdf 8.0.2 or newer
-- unpaper 6.1
+- Python 3.7
+- Ghostscript 9.23
+- jbig2enc 0.29 or newer
+- pngquant 2.5 or newer
 - PyMuPDF 1.12.5 or newer
+- qpdf 8.0.2 or newer
+- Tesseract 4.0.0-beta1 or newer
+- unpaper 6.1
 
 These are in addition to the Python packaging dependencies, meaning that unfortunately, the ``pip install`` command cannot satisfy all of them.
 
-Python 3.6 and Tesseract 4.0.0-beta.1 are recommended for best OCR results and best performance.
+Python 3.7 and Tesseract 4.0.0-beta.1 are recommended for best OCR results and best performance.
 
-The library PyMuPDF is not widely available in platform distributions, and it improves OCRmyPDF in certain conditions. Consider installing OCRmyPDF from the Python binary wheels, which include a precompiled version of this library.
+**jbig2enc**, if present, will be used to optimize the encoding of monochrome images. This can significantly reduce the file size of the output file. It is not required. `jbig2enc `_ is not generally available for Ubuntu or Debian due to lingering concerns about patent issues, but can easily be built from source. To add JBIG2 encoding, see `Optional: installing the JBIG2 encoder`_.
+
+**pngquant**, if present, is optionally used to optimize the encoding of PNG-style images in PDFs (actually, any that are losslessly encoded) by lossily quantizing to a smaller color palette. It is only activated when the ``--optimize`` argument is ``2`` or ``3``.
+
+**unpaper**, if present, enables the ``--clean`` and ``--clean-final`` command line options.
 
 
 Installing HEAD revision from sources
@@ -441,8 +488,25 @@ need to be installed.
 The script requires specific versions of the dependencies. Older version than
 the ones mentioned in the release notes are likely not to be compatible to OCRmyPDF.
 
+To add JBIG2 encoding, see `Optional: installing the JBIG2 encoder`_.
 
 Other Linux packages
 --------------------
 
 See the `Repology `_ page.
+
+
+Optional: installing the JBIG2 encoder
+--------------------------------------
+
+Most Linux distributions do not include a JBIG2 encoder since JBIG2 encoding was patented for a long time. All known JBIG2 US patents have expired as of 2017, but it is possible that unknown patents exist.
+
+To build a JBIG2 encoder from source:
+
+1. ``git clone https://github.com/agl/jbig2enc``
+2. ``cd jbig2enc``
+3. ``./autogen.sh``
+4. ``./configure && make``
+5. ``[sudo] make install``
+
+On macOS, Homebrew packages jbig2enc and OCRmyPDF includes it by default.
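
Reviewer note: the five build steps added in the last hunk could be wrapped in one script, roughly as sketched below. This is only an illustration and not part of the patch. It assumes the encoder installs an executable named ``jbig2``, and the ``BUILD_JBIG2ENC`` and ``SUDO`` environment variables are hypothetical knobs invented for this example. Nothing is cloned or installed unless ``BUILD_JBIG2ENC=1`` is set.

```shell
#!/bin/sh
# Sketch: build and install jbig2enc from source when `jbig2` is missing.
# BUILD_JBIG2ENC and SUDO are hypothetical variables used only in this example.
set -eu

# Succeeds if the named program can be found on PATH.
have() {
    command -v "$1" >/dev/null 2>&1
}

# Runs the documented build steps in a subshell, so the caller's
# working directory is left unchanged.
build_jbig2enc() (
    git clone https://github.com/agl/jbig2enc
    cd jbig2enc
    ./autogen.sh
    ./configure
    make
    # Set SUDO to an empty string to install without sudo.
    ${SUDO-sudo} make install
)

if have jbig2; then
    echo "jbig2 is already installed; nothing to do"
elif [ "${BUILD_JBIG2ENC:-0}" = "1" ]; then
    build_jbig2enc
else
    echo "jbig2 not found; rerun with BUILD_JBIG2ENC=1 to build it"
fi
```

``./configure`` and ``make`` are run as separate commands rather than joined with ``&&`` so that ``set -e`` stops the script if ``./configure`` fails.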