Merge commit '68cf9cbd87c188823027f9d1bfe9029017e7281f' into develop

James R. Barlow
2016-07-17 00:29:48 -07:00
2 changed files with 90 additions and 31 deletions

@@ -54,6 +54,8 @@ Debian and Ubuntu
Users of Debian 9 or later or Ubuntu 16.10 or later may simply
``apt-get install ocrmypdf``.
.. _Docker:
Installing the Docker image
~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -63,11 +65,15 @@ If you have `Docker <https://docs.docker.com/>`__ installed on your system, you
a Docker image of the latest release.
Follow the Docker installation instructions for your platform. If you can run this command
successfully, your system is ready to download and execute the image::
successfully, your system is ready to download and execute the image:
.. code-block:: bash
docker run hello-world
OCRmyPDF will use all available CPU cores. By default, the VirtualBox machine instance on Windows and OS X has only a single CPU core enabled. Use the VirtualBox Manager to determine the name of your Docker engine host, and then follow these optional steps to enable multiple CPUs::
OCRmyPDF will use all available CPU cores. By default, the VirtualBox machine instance on Windows and OS X has only a single CPU core enabled. Use the VirtualBox Manager to determine the name of your Docker engine host, and then follow these optional steps to enable multiple CPUs:
.. code-block:: bash
# Optional step for Mac OS X users
docker-machine stop "yourVM"
@@ -76,29 +82,41 @@ OCRmyPDF will use all available CPU cores. By default, the VirtualBox machine i
eval $(docker-machine env "yourVM")
Assuming you have a Docker engine running somewhere, you can run these commands to download
the image::
the image:
.. code-block:: bash
docker pull jbarlow83/ocrmypdf
Then tag it to give a more convenient name, just ocrmypdf::
Then tag it to give a more convenient name, just ocrmypdf:
.. code-block:: bash
docker tag jbarlow83/ocrmypdf ocrmypdf
This image contains language packs for English, French, Spanish and German. The alternative "polyglot" image provides `all available language packs <https://github.com/tesseract-ocr/tesseract/blob/master/doc/tesseract.1.asc#languages>`__::
This image contains language packs for English, French, Spanish and German. The alternative "polyglot" image provides `all available language packs <https://github.com/tesseract-ocr/tesseract/blob/master/doc/tesseract.1.asc#languages>`__:
.. code-block:: bash
# Alternative step: If you need all language packs
docker pull jbarlow83/ocrmypdf-polyglot
docker tag jbarlow83/ocrmypdf-polyglot ocrmypdf
You can then run ocrmypdf using the command::
You can then run ocrmypdf using the command:
.. code-block:: bash
docker run ocrmypdf --help
To execute OCRmyPDF on a local file, you must `provide a writable volume to the Docker image <https://docs.docker.com/userguide/dockervolumes/>`__, as in this template::
To execute OCRmyPDF on a local file, you must `provide a writable volume to the Docker image <https://docs.docker.com/userguide/dockervolumes/>`__, as in this template:
.. code-block:: bash
docker run -v "$(pwd):/home/docker" <other docker arguments> ocrmypdf <your arguments to ocrmypdf>
In this worked example, the current working directory contains an input file called ``test.pdf`` and the output will go to ``output.pdf``::
In this worked example, the current working directory contains an input file called ``test.pdf`` and the output will go to ``output.pdf``:
.. code-block:: bash
docker run -v "$(pwd):/home/docker" ocrmypdf --skip-text test.pdf output.pdf
@@ -112,11 +130,15 @@ These instructions probably work on all Mac OS X versions later than 10.7 (Lion)
If it's not already present, `install Homebrew <http://brew.sh/>`__.
Update Homebrew::
Update Homebrew:
.. code-block:: bash
brew update
Install or upgrade the required Homebrew packages, if any are missing::
Install or upgrade the required Homebrew packages, if any are missing:
.. code-block:: bash
brew install libpng openjpeg jbig2dec # image libraries
brew install qpdf
@@ -126,16 +148,22 @@ Install or upgrade the required Homebrew packages, if any are missing::
brew install unpaper # optional
brew install tesseract
Update the homebrew pip and install Pillow::
Update the homebrew pip and install Pillow:
.. code-block:: bash
pip3 install --upgrade pip
pip3 install --upgrade pillow
You can then install OCRmyPDF from PyPI::
You can then install OCRmyPDF from PyPI:
.. code-block:: bash
pip3 install ocrmypdf
The command line program should now be available::
The command line program should now be available:
.. code-block:: bash
ocrmypdf --help
@@ -144,12 +172,16 @@ Installing on Ubuntu 14.04 LTS
Installing on Ubuntu 14.04 LTS (trusty) is more difficult than other options, because of certain bugs in Python package installation.
Update apt-get::
Update apt-get:
.. code-block:: bash
sudo apt-get update
sudo apt-get upgrade
Install system dependencies::
Install system dependencies:
.. code-block:: bash
sudo apt-get install \
zlib1g-dev \
@@ -165,13 +197,17 @@ Install system dependencies::
python3-reportlab
If you wish to install OCRmyPDF to the system Python, then install as follows (note this installs new packages
into your system Python, which could interfere with other programs)::
into your system Python, which could interfere with other programs):
.. code-block:: bash
sudo pip3 install ocrmypdf
If you wish to install OCRmyPDF to a virtual environment to isolate the system Python from modification, you can
follow these steps. This includes a workaround `for a known, unresolved issue in Ubuntu 14.04's ensurepip
package <http://www.thefourtheye.in/2014/12/Python-venv-problem-with-ensurepip-in-Ubuntu.html>`__::
package <http://www.thefourtheye.in/2014/12/Python-venv-problem-with-ensurepip-in-Ubuntu.html>`__:
.. code-block:: bash
sudo apt-get install python3-venv
python3 -m venv venv-ocrmypdf --without-pip
@@ -187,14 +223,18 @@ Ubuntu 14.04 only installs ``unpaper`` version 0.4.2, which is not supported by
Installing on Windows
~~~~~~~~~~~~~~~~~~~~~
Direct installation on Windows is not possible. Install the Docker container as described above. Ensure that your command prompt can run the docker "hello world" container.
Direct installation on Windows is not possible. Install the Docker_ container as described above. Ensure that your command prompt can run the docker "hello world" container.
The command line syntax to run ocrmypdf from a command prompt will resemble::
Running on Windows
~~~~~~~~~~~~~~~~~~
The command line syntax to run ocrmypdf from a command prompt will resemble:
.. code-block:: bat
docker run -v /c/Users/sampleuser:/home/docker ocrmypdf --skip-text test.pdf output.pdf
where /c/Users/sampleuser is a Unix representation of the Windows path C:\Users\sampleuser, assuming a user named "sampleuser" is running ocrmypdf on a file in their home directory, and the files "test.pdf" and "output.pdf" are in the sampleuser folder. The Windows user must have read and write permissions.
where /c/Users/sampleuser is a Unix representation of the Windows path C:\\Users\\sampleuser, assuming a user named "sampleuser" is running ocrmypdf on a file in their home directory, and the files "test.pdf" and "output.pdf" are in the sampleuser folder. The Windows user must have read and write permissions.
Installing HEAD revision from sources
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -202,21 +242,29 @@ Installing HEAD revision from sources
If you have ``git`` and ``python3.4`` or ``python3.5`` installed, you can install from source. When the ``pip`` installer runs,
it will alert you if dependencies are missing.
To install the HEAD revision from sources in the current Python 3 environment::
To install the HEAD revision from sources in the current Python 3 environment:
.. code-block:: bash
pip3 install git+https://github.com/jbarlow83/OCRmyPDF.git
Or, to install in `development mode <https://pythonhosted.org/setuptools/setuptools.html#development-mode>`__, allowing customization of OCRmyPDF, use the ``-e`` flag::
Or, to install in `development mode <https://pythonhosted.org/setuptools/setuptools.html#development-mode>`__, allowing customization of OCRmyPDF, use the ``-e`` flag:
.. code-block:: bash
pip3 install -e git+https://github.com/jbarlow83/OCRmyPDF.git
On certain Linux distributions such as Ubuntu, you may need to
run the install command as superuser::
run the install command as superuser:
.. code-block:: bash
sudo pip3 install [-e] git+https://github.com/jbarlow83/OCRmyPDF.git
Note that this will alter your system's Python distribution. If you prefer
to not install as superuser, you can install the package in a Python virtual environment::
to not install as superuser, you can install the package in a Python virtual environment:
.. code-block:: bash
git clone -b master https://github.com/jbarlow83/OCRmyPDF.git
pyvenv venv
@@ -227,7 +275,9 @@ to not install as superuser, you can install the package in a Python virtual env
However, ``ocrmypdf`` will only be accessible on the system PATH after
you activate the virtual environment.
To run the program::
To run the program:
.. code-block:: bash
ocrmypdf --help
@@ -240,7 +290,9 @@ Languages
---------
OCRmyPDF uses Tesseract for OCR, and relies on its language packs. For Linux users,
you can often find packages that provide language packs::
you can often find packages that provide language packs:
.. code-block:: bash
# Debian/Ubuntu users
sudo apt-get install tesseract-ocr-chi-sim
@@ -251,9 +303,15 @@ languages can be requested.
Support
-------
In case you detect an issue, please:
Once ocrmypdf is installed, the built-in help which explains the command syntax and options can be accessed via:
- Check if your issue is already known
.. code-block:: bash
ocrmypdf --help
If you detect an issue, please:
- Check whether your issue is already known
- If no problem report exists on GitHub, please create one here:
https://github.com/jbarlow83/OCRmyPDF/issues
- Describe your problem thoroughly

@@ -47,12 +47,13 @@ These test resources are assemblies from other previously mentioned files, relea
- cardinal.pdf (four cardinal directions, rotated copies of LinnSequencer.jpg)
- ccitt.pdf (LinnSequencer.jpg, converted to CCITT encoding)
- encrypted_algo4.pdf (congress.jpg, encrypted with algorithm 4 - not supported by PyPDF2)
- graph_ocred.pdf (from graph.pdf)
- jbig2.pdf (congress.jpg, converted to JBIG2 encoding)
- multipage.pdf (from several other files)
- palette.pdf (congress.jpg, converted to a 256-color palette)
- skew.pdf (from c02-22.pdf)
- skew-encrypted.pdf (skew.pdf with encrypted applied)
- skew-encrypted.pdf (skew.pdf with encryption - access supported by PyPDF2)
.. _`Wikimedia: LinnSequencer`: https://upload.wikimedia.org/wikipedia/en/b/b7/LinnSequencer_hardware_MIDI_sequencer_brochure_page_2_300dpi.jpg
@@ -63,4 +64,4 @@ These test resources are assemblies from other previously mentioned files, relea
.. _`Wikimedia: Pandas text analysis.png`: https://en.wikipedia.org/wiki/File:Pandas_text_analysis.png
.. _`Wikimedia: JPEG2000 Lichtenstein`: https://en.wikipedia.org/wiki/JPEG_2000#/media/File:Jpeg2000_2-level_wavelet_transform-lichtenstein.png