mirror of
https://github.com/ocrmypdf/OCRmyPDF.git
synced 2026-05-24 06:25:26 -04:00
Merge commit '68cf9cbd87c188823027f9d1bfe9029017e7281f' into develop
116
README.rst
@@ -54,6 +54,8 @@ Debian and Ubuntu
 Users of Debian 9 or later or Ubuntu 16.10 or later may simply
 ``apt-get install ocrmypdf``.

+.. _Docker:
+
 Installing the Docker image
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~

@@ -63,11 +65,15 @@ If you have `Docker <https://docs.docker.com/>`__ installed on your system, you
 a Docker image of the latest release.

 Follow the Docker installation instructions for your platform. If you can run this command
-successfully, your system is ready to download and execute the image::
+successfully, your system is ready to download and execute the image:

+.. code-block:: bash
+
     docker run hello-world

-OCRmyPDF will use all available CPU cores. By default, the VirtualBox machine instance on Windows and OS X has only a single CPU core enabled. Use the VirtualBox Manager to determine the name of your Docker engine host, and then follow these optional steps to enable multiple CPUs::
+OCRmyPDF will use all available CPU cores. By default, the VirtualBox machine instance on Windows and OS X has only a single CPU core enabled. Use the VirtualBox Manager to determine the name of your Docker engine host, and then follow these optional steps to enable multiple CPUs:

+.. code-block:: bash
+
     # Optional step for Mac OS X users
     docker-machine stop "yourVM"
@@ -76,29 +82,41 @@ OCRmyPDF will use all available CPU cores. By default, the VirtualBox machine i
     eval $(docker-machine env "yourVM")

 Assuming you have a Docker engine running somewhere, you can run these commands to download
-the image::
+the image:

+.. code-block:: bash
+
     docker pull jbarlow83/ocrmypdf

-Then tag it to give a more convenient name, just ocrmypdf::
+Then tag it to give a more convenient name, just ocrmypdf:

+.. code-block:: bash
+
     docker tag jbarlow83/ocrmypdf ocrmypdf

-This image contains language packs for English, French, Spanish and German. The alternative "polyglot" image provides `all available language packs <https://github.com/tesseract-ocr/tesseract/blob/master/doc/tesseract.1.asc#languages>`__::
+This image contains language packs for English, French, Spanish and German. The alternative "polyglot" image provides `all available language packs <https://github.com/tesseract-ocr/tesseract/blob/master/doc/tesseract.1.asc#languages>`__:

+.. code-block:: bash
+
     # Alternative step: If you need all language packs
     docker pull jbarlow83/ocrmypdf-polyglot
     docker tag jbarlow83/ocrmypdf-polyglot ocrmypdf

-You can then run ocrmypdf using the command::
+You can then run ocrmypdf using the command:

+.. code-block:: bash
+
     docker run ocrmypdf --help

-To run OCRmyPDF on a local file, you must `provide a writable volume to the Docker image <https://docs.docker.com/userguide/dockervolumes/>`__, as in this template::
+To run OCRmyPDF on a local file, you must `provide a writable volume to the Docker image <https://docs.docker.com/userguide/dockervolumes/>`__, as in this template:

+.. code-block:: bash
+
     docker run -v "$(pwd):/home/docker" <other docker arguments> ocrmypdf <your arguments to ocrmypdf>

-In this worked example, the current working directory contains an input file called ``test.pdf`` and the output will go to ``output.pdf``::
+In this worked example, the current working directory contains an input file called ``test.pdf`` and the output will go to ``output.pdf``:

+.. code-block:: bash
+
     docker run -v "$(pwd):/home/docker" ocrmypdf --skip-text test.pdf output.pdf

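Review note: the volume template in this hunk can be wrapped in a small dry-run helper. This is only a sketch; the function name ``ocrmypdf_docker_cmd`` is hypothetical and not part of OCRmyPDF. It assumes the image has been tagged ``ocrmypdf`` as described in the README text, and it prints the command instead of running it, so Docker is not required to try it.

```shell
# Hypothetical helper (not part of OCRmyPDF): print the docker command that
# would OCR a file from the current directory, following the README template.
# $1 = input PDF name, $2 = output PDF name, both relative to the current directory.
ocrmypdf_docker_cmd() {
    printf 'docker run -v "%s:/home/docker" ocrmypdf --skip-text %s %s\n' \
        "$(pwd)" "$1" "$2"
}

# Dry run: show the command for the worked example in the README.
ocrmypdf_docker_cmd test.pdf output.pdf
```

Printing the command first makes it easy to check that the mounted directory is the one that actually contains the input file.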
@@ -112,11 +130,15 @@ These instructions probably work on all Mac OS X versions later than 10.7 (Lion)

 If it's not already present, `install Homebrew <http://brew.sh/>`__.

-Update Homebrew::
+Update Homebrew:

+.. code-block:: bash
+
     brew update

-Install or upgrade the required Homebrew packages, if any are missing::
+Install or upgrade the required Homebrew packages, if any are missing:

+.. code-block:: bash
+
     brew install libpng openjpeg jbig2dec     # image libraries
     brew install qpdf
@@ -126,16 +148,22 @@ Install or upgrade the required Homebrew packages, if any are missing::
     brew install unpaper                      # optional
     brew install tesseract

-Update the homebrew pip and install Pillow::
+Update the homebrew pip and install Pillow:

+.. code-block:: bash
+
     pip3 install --upgrade pip
     pip3 install --upgrade pillow

-You can then install OCRmyPDF from PyPI::
+You can then install OCRmyPDF from PyPI:

+.. code-block:: bash
+
     pip3 install ocrmypdf

-The command line program should now be available::
+The command line program should now be available:

+.. code-block:: bash
+
     ocrmypdf --help

@@ -144,12 +172,16 @@ Installing on Ubuntu 14.04 LTS

 Installing on Ubuntu 14.04 LTS (trusty) is more difficult than other options, because of certain bugs in Python package installation.

-Update apt-get::
+Update apt-get:

+.. code-block:: bash
+
     sudo apt-get update
     sudo apt-get upgrade

-Install system dependencies::
+Install system dependencies:

+.. code-block:: bash
+
     sudo apt-get install \
         zlib1g-dev \
@@ -165,13 +197,17 @@ Install system dependencies::
         python3-reportlab

 If you wish to install OCRmyPDF to the system Python, then install as follows (note this installs new packages
-into your system Python, which could interfere with other programs)::
+into your system Python, which could interfere with other programs):

+.. code-block:: bash
+
     sudo pip3 install ocrmypdf

 If you wish to install OCRmyPDF to a virtual environment to isolate the system Python from modification, you can
 follow these steps. This includes a workaround `for a known, unresolved issue in Ubuntu 14.04's ensurepip
-package <http://www.thefourtheye.in/2014/12/Python-venv-problem-with-ensurepip-in-Ubuntu.html>`__::
+package <http://www.thefourtheye.in/2014/12/Python-venv-problem-with-ensurepip-in-Ubuntu.html>`__:

+.. code-block:: bash
+
     sudo apt-get install python3-venv
     python3 -m venv venv-ocrmypdf --without-pip
@@ -187,14 +223,18 @@ Ubuntu 14.04 only installs ``unpaper`` version 0.4.2, which is not supported by
 Installing on Windows
 ~~~~~~~~~~~~~~~~~~~~~

-Direct installation on Windows is not possible. Install the Docker container as described above. Ensure that your command prompt can run the docker "hello world" container.
+Direct installation on Windows is not possible. Install the _`Docker` container as described above. Ensure that your command prompt can run the docker "hello world" container.

-The command line syntax to run ocrmypdf from a command prompt will resemble::
+Running on Windows
+~~~~~~~~~~~~~~~~~~
+
+The command line syntax to run ocrmypdf from a command prompt will resemble:

+.. code-block:: bat
+
     docker run -v /c/Users/sampleuser:/home/docker ocrmypdf --skip-text test.pdf output.pdf

-where /c/Users/sampleuser is a Unix representation of the Windows path C:\Users\sampleuser, assuming a user named "sampleuser" is running ocrmypdf on a file in their home directory, and the files "test.pdf" and "output.pdf" are in the sampleuser folder. The Windows user must have read and write permissions.
-
+where /c/Users/sampleuser is a Unix representation of the Windows path C:\\Users\\sampleuser, assuming a user named "sampleuser" is running ocrmypdf on a file in their home directory, and the files "test.pdf" and "output.pdf" are in the sampleuser folder. The Windows user must have read and write permissions.

 Installing HEAD revision from sources
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
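Review note: the Windows hunk maps ``C:\Users\sampleuser`` to ``/c/Users/sampleuser`` by hand. A minimal sketch of that mapping is shown below, assuming a simple ``X:\...`` drive path. The function name ``win_to_docker_path`` is hypothetical, and UNC paths or paths without a drive letter are not handled.

```shell
# Hypothetical helper: convert a Windows drive path such as C:\Users\sampleuser
# into the Unix-style form used in the docker -v argument (/c/Users/sampleuser).
win_to_docker_path() {
    # First character is the drive letter; lowercase it.
    drive=$(printf '%s' "$1" | cut -c1 | tr 'A-Z' 'a-z')
    # Everything after "X:" is the path; turn backslashes into forward slashes.
    rest=$(printf '%s' "$1" | cut -c3- | tr '\\' '/')
    printf '/%s%s\n' "$drive" "$rest"
}

win_to_docker_path 'C:\Users\sampleuser'   # prints /c/Users/sampleuser
```

The result can be substituted for the host side of the ``-v`` volume argument shown in the hunk.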
@@ -202,21 +242,29 @@ Installing HEAD revision from sources
 If you have ``git`` and ``python3.4`` or ``python3.5`` installed, you can install from source. When the ``pip`` installer runs,
 it will alert you if dependencies are missing.

-To install the HEAD revision from sources in the current Python 3 environment::
+To install the HEAD revision from sources in the current Python 3 environment:

+.. code-block:: bash
+
     pip3 install git+https://github.com/jbarlow83/OCRmyPDF.git

-Or, to install in `development mode <https://pythonhosted.org/setuptools/setuptools.html#development-mode>`__, allowing customization of OCRmyPDF, use the ``-e`` flag::
+Or, to install in `development mode <https://pythonhosted.org/setuptools/setuptools.html#development-mode>`__, allowing customization of OCRmyPDF, use the ``-e`` flag:

+.. code-block:: bash
+
     pip3 install -e git+https://github.com/jbarlow83/OCRmyPDF.git

 On certain Linux distributions such as Ubuntu, you may need to
-run the install command as superuser::
+run the install command as superuser:

+.. code-block:: bash
+
     sudo pip3 install [-e] git+https://github.com/jbarlow83/OCRmyPDF.git

 Note that this will alter your system's Python distribution. If you prefer
-to not install as superuser, you can install the package in a Python virtual environment::
+to not install as superuser, you can install the package in a Python virtual environment:

+.. code-block:: bash
+
     git clone -b master https://github.com/jbarlow83/OCRmyPDF.git
     pyvenv venv
@@ -227,7 +275,9 @@ to not install as superuser, you can install the package in a Python virtual env
 However, ``ocrmypdf`` will only be accessible on the system PATH after
 you activate the virtual environment.

-To run the program::
+To run the program:

+.. code-block:: bash
+
     ocrmypdf --help

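Review note: because ``ocrmypdf`` is only on the PATH after the virtual environment is activated, a quick check can save a confusing "command not found". The sketch below uses the POSIX ``command -v`` builtin; the helper name ``on_path`` is hypothetical.

```shell
# Hypothetical helper: succeed if the named command is reachable on the current PATH.
on_path() {
    command -v "$1" >/dev/null 2>&1
}

if on_path ocrmypdf; then
    echo "ocrmypdf found at $(command -v ocrmypdf)"
else
    echo "ocrmypdf is not on PATH; activate the virtual environment first"
fi
```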
@@ -240,7 +290,9 @@ Languages
 ---------

 OCRmyPDF uses Tesseract for OCR, and relies on its language packs. For Linux users,
-you can often find packages that provide language packs::
+you can often find packages that provide language packs:

+.. code-block:: bash
+
     # Debian/Ubuntu users
     sudo apt-get install tesseract-ocr-chi-sim
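Review note: after installing a language pack, it can help to confirm that Tesseract actually sees it. The sketch below filters a language listing for one code. The helper name ``has_lang`` is hypothetical; ``tesseract --list-langs`` is the Tesseract option that prints installed languages, and the assumption here is that ``chi_sim`` is the code provided by the ``tesseract-ocr-chi-sim`` package.

```shell
# Hypothetical helper: succeed if the language code in $1 appears as a whole line on stdin.
has_lang() {
    grep -qx -- "$1"
}

# Intended use (needs tesseract installed, so it is left commented out here).
# Some Tesseract versions print the list on stderr, hence the 2>&1.
#   tesseract --list-langs 2>&1 | has_lang chi_sim && echo "chi_sim is available"

# Self-contained demonstration on sample text shaped like a language listing.
printf 'eng\nfra\nchi_sim\n' | has_lang chi_sim && echo "chi_sim is available"
```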
@@ -251,9 +303,15 @@ languages can be requested.
 Support
 -------

-In case you detect an issue, please:
+Once ocrmypdf is installed, the built-in help which explains the command syntax and options can be accessed via:

-- Check if your issue is already known
+.. code-block:: bash
+
+    ocrmypdf --help
+
+If you detect an issue, please:
+
+- Check whether your issue is already known
 - If no problem report exists on github, please create one here:
   https://github.com/jbarlow83/OCRmyPDF/issues
 - Describe your problem thoroughly

@@ -47,12 +47,13 @@ These test resources are assemblies from other previously mentioned files, relea

 - cardinal.pdf (four cardinal directions, rotated copies of LinnSequencer.jpg)
 - ccitt.pdf (LinnSequencer.jpg, converted to CCITT encoding)
+- encrypted_algo4.pdf (congress.jpg, encrypted with algorithm 4 - not supported by PyPDF2)
 - graph_ocred.pdf (from graph.pdf)
 - jbig2.pdf (congress.jpg, converted to JBIG2 encoding)
 - multipage.pdf (from several other files)
 - palette.pdf (congress.jpg, converted to a 256-color palette)
 - skew.pdf (from c02-22.pdf)
-- skew-encrypted.pdf (skew.pdf with encrypted applied)
+- skew-encrypted.pdf (skew.pdf with encryption - access supported by PyPDF2)


 .. _`Wikimedia: LinnSequencer`: https://upload.wikimedia.org/wikipedia/en/b/b7/LinnSequencer_hardware_MIDI_sequencer_brochure_page_2_300dpi.jpg
@@ -63,4 +64,4 @@ These test resources are assemblies from other previously mentioned files, relea

 .. _`Wikimedia: Pandas text analysis.png`: https://en.wikipedia.org/wiki/File:Pandas_text_analysis.png

-.. _`Wikimedia: JPEG2000 Lichtenstein`: https://en.wikipedia.org/wiki/JPEG_2000#/media/File:Jpeg2000_2-level_wavelet_transform-lichtenstein.png
+.. _`Wikimedia: JPEG2000 Lichtenstein`: https://en.wikipedia.org/wiki/JPEG_2000#/media/File:Jpeg2000_2-level_wavelet_transform-lichtenstein.png