mirror of
https://github.com/ocrmypdf/OCRmyPDF.git
synced 2026-05-04 20:54:18 -04:00
Update README to point to ReadTheDocs
297
README.rst
@@ -2,7 +2,20 @@ OCRmyPDF
========

OCRmyPDF adds an OCR text layer to scanned PDF files, allowing them to
be searched.
be searched or copy-pasted.

.. code-block:: bash

    ocrmypdf                      # it's a scriptable command line program
       -l eng+fra                 # it supports multiple languages
       --rotate-pages             # it can fix pages that are misrotated
       --deskew                   # it can deskew crooked PDFs!
       --title "My PDF"           # it can change output metadata
       --jobs 4                   # it uses multiple cores by default
       --output-type pdfa         # it produces PDF/A by default
       input_scanned.pdf          # takes PDF input (or images)
       output_searchable.pdf      # produces validated PDF output


Main features
-------------
@@ -38,286 +51,18 @@ Linux/UNIX: I found many, but none of them were really satisfying.
- Or they did not produce valid PDF files (even though they were readable with my current PDF reader)
- On top of that none of them produced PDF/A files (format dedicated for long time storage)

... so I decided to develop my own tool (using various existing scripts
as an inspiration)
...so I decided to develop my own tool (using various existing scripts
as an inspiration).

Installation
------------

Download OCRmyPDF here: https://github.com/jbarlow83/OCRmyPDF/releases

These steps describe how to install OCRmyPDF on your system.

- `Installing on Debian and Ubuntu`_ (Debian stretch and Ubuntu 16.10 or later)
- `Installing the Docker image`_
- `Installing on Mac OS X`_
- `Installing on Ubuntu 14.04 LTS`_
- Installing and running on `Windows`_ using the Docker image

If you prefer to install from source or install OCRmyPDF to a Python virtual environment, see steps for `Installing HEAD revision from sources`_.

.. _Windows: `Installing on Windows`_


Installing on Debian and Ubuntu
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Linux, UNIX, and macOS are supported. Windows is not directly supported but there is a Docker image available that runs on Windows.

Users of Debian 9 or later or Ubuntu 16.10 or later may simply
``apt-get install ocrmypdf``.

.. _Docker:

Installing the Docker image
~~~~~~~~~~~~~~~~~~~~~~~~~~~

For many users, installing the Docker image will be easier than installing all of OCRmyPDF's dependencies. For Windows, it is the only option.

If you have `Docker <https://docs.docker.com/>`_ installed on your system, you can install
a Docker image of the latest release.

Follow the Docker installation instructions for your platform. If you can run this command
successfully, your system is ready to download and execute the image:

.. code-block:: bash

    docker run hello-world

OCRmyPDF will use all available CPU cores. By default, the VirtualBox machine instance on Windows and OS X has only a single CPU core enabled. Use the VirtualBox Manager to determine the name of your Docker engine host, and then follow these optional steps to enable multiple CPUs:

.. code-block:: bash

    # Optional step for Mac OS X users
    docker-machine stop "yourVM"
    VBoxManage modifyvm "yourVM" --cpus 2  # or whatever number of cores is desired
    docker-machine start "yourVM"
    eval $(docker-machine env "yourVM")

Assuming you have a Docker engine running somewhere, you can run this command to download
the image:

.. code-block:: bash

    docker pull jbarlow83/ocrmypdf

Then tag it to give it a more convenient name, just ``ocrmypdf``:

.. code-block:: bash

    docker tag jbarlow83/ocrmypdf ocrmypdf

This image contains language packs for English, French, Spanish and German. The alternative "polyglot" image provides `all available language packs <https://github.com/tesseract-ocr/tesseract/blob/master/doc/tesseract.1.asc#languages>`_:

.. code-block:: bash

    # Alternative step: If you need all language packs
    docker pull jbarlow83/ocrmypdf-polyglot
    docker tag jbarlow83/ocrmypdf-polyglot ocrmypdf

You can then run ocrmypdf using the command:

.. code-block:: bash

    docker run ocrmypdf --help

To execute OCRmyPDF on a local file, you must `provide a writable volume to the Docker image <https://docs.docker.com/userguide/dockervolumes/>`_, as in this template:

.. code-block:: bash

    docker run -v "$(pwd):/home/docker" <other docker arguments> ocrmypdf <your arguments to ocrmypdf>

In this worked example, the current working directory contains an input file called ``test.pdf`` and the output will go to ``output.pdf``:

.. code-block:: bash

    docker run -v "$(pwd):/home/docker" ocrmypdf --skip-text test.pdf output.pdf

Note that ``ocrmypdf`` has its own separate ``-v VERBOSITYLEVEL`` argument to control debug verbosity. All Docker arguments should come before the ``ocrmypdf`` image name and all arguments to ``ocrmypdf`` should be listed after it.

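
The ordering rule above can be captured in a small shell function. This is only a sketch: the name ``ocrmypdf_docker`` and the ``DRY_RUN`` variable are hypothetical and not part of OCRmyPDF or Docker. It assumes the image has been tagged ``ocrmypdf`` as described earlier, and forwards every argument it receives to ``ocrmypdf`` after the image name:

.. code-block:: bash

    # Hypothetical wrapper (not shipped with OCRmyPDF): mounts the current
    # directory at /home/docker and passes all arguments to ocrmypdf.
    ocrmypdf_docker() {
        if [ "${DRY_RUN:-0}" = "1" ]; then
            # Print the command instead of running it, to check the ordering.
            printf 'docker run -v %s:/home/docker ocrmypdf' "$(pwd)"
            printf ' %s' "$@"
            printf '\n'
        else
            # Docker's own -v (volume) option comes before the image name;
            # everything in "$@" comes after it and is handed to ocrmypdf.
            docker run -v "$(pwd):/home/docker" ocrmypdf "$@"
        fi
    }

With ``DRY_RUN=1 ocrmypdf_docker --skip-text test.pdf output.pdf`` the function should only print the ``docker run`` command it would otherwise execute, which makes it easy to confirm that a ``-v`` meant for ``ocrmypdf`` ends up after the image name.
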

Installing on Mac OS X
~~~~~~~~~~~~~~~~~~~~~~

These instructions probably work on all Mac OS X versions later than 10.7 (Lion). OCRmyPDF is known to work on Yosemite and El Capitan, and is regularly tested on El Capitan.

If it's not already present, `install Homebrew <http://brew.sh/>`_.

Update Homebrew:

.. code-block:: bash

    brew update

Install or upgrade the required Homebrew packages, if any are missing:

.. code-block:: bash

    brew install libpng openjpeg jbig2dec libtiff     # image libraries
    brew install qpdf
    brew install ghostscript
    brew install python3
    brew install libxml2 libffi leptonica
    brew install unpaper   # optional

Install the required Tesseract OCR engine with the language packs you plan to use:

.. code-block:: bash

    brew install tesseract                       # Option 1: for English

.. code-block:: bash

    brew install tesseract --with-all-languages  # Option 2: for all language packs

Update the Homebrew pip and install Pillow:

.. code-block:: bash

    pip3 install --upgrade pip
    pip3 install --upgrade pillow

You can then install OCRmyPDF from PyPI:

.. code-block:: bash

    pip3 install ocrmypdf

The command line program should now be available:

.. code-block:: bash

    ocrmypdf --help

Installing on Ubuntu 14.04 LTS
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Installing on Ubuntu 14.04 LTS (trusty) is more difficult than some other options, because of bugs in Python package installation.

Add new "apt" repositories needed for backports of Ghostscript 9.16 and libav-11, which supports unpaper 6.1. This will replace Ghostscript on your system.

.. code-block:: bash

    sudo add-apt-repository ppa:vshn/ghostscript -y
    sudo add-apt-repository ppa:heyarje/libav-11 -y

Update apt-get:

.. code-block:: bash

    sudo apt-get update
    sudo apt-get upgrade

Install system dependencies:

.. code-block:: bash

    sudo apt-get install \
        zlib1g-dev \
        libjpeg-dev \
        libffi-dev \
        libavformat56 libavcodec56 libavutil54 \
        ghostscript \
        tesseract-ocr \
        qpdf \
        python3-pip \
        python3-pil \
        python3-pytest \
        python3-reportlab

If you wish to install OCRmyPDF to the system Python, then install as follows (note this installs new packages
into your system Python, which could interfere with other programs):

.. code-block:: bash

    sudo pip3 install ocrmypdf

If you wish to install OCRmyPDF to a virtual environment to isolate the system Python, you can
follow these steps. This includes a workaround `for a known, unresolved issue in Ubuntu 14.04's ensurepip
package <http://www.thefourtheye.in/2014/12/Python-venv-problem-with-ensurepip-in-Ubuntu.html>`_:

.. code-block:: bash

    sudo apt-get install python3-venv
    python3 -m venv venv-ocrmypdf --without-pip
    source venv-ocrmypdf/bin/activate
    wget -O - -o /dev/null https://bootstrap.pypa.io/get-pip.py | python
    deactivate
    python3 -m venv --system-site-packages venv-ocrmypdf
    source venv-ocrmypdf/bin/activate
    pip install ocrmypdf

These installation instructions omit the optional dependency ``unpaper``, which is only available at version 0.4.2 in Ubuntu 14.04. The author could not find a backport of ``unpaper`` and is not motivated to figure out how to set up an Ubuntu PPA to distribute it. You can create a .deb package to do the job of installing unpaper 6.1 (for x86 64-bit only):

.. code-block:: bash

    wget -q https://dl.dropboxusercontent.com/u/28971240/unpaper_6.1-1.deb -O unpaper_6.1-1.deb
    sudo dpkg -i unpaper_6.1-1.deb


Installing on Windows
~~~~~~~~~~~~~~~~~~~~~

Direct installation on Windows is not possible. Install the `Docker`_ container as described above. Ensure that your command prompt can run the docker "hello world" container.

Running on Windows
~~~~~~~~~~~~~~~~~~

The command line syntax to run ocrmypdf from a command prompt will resemble:

.. code-block:: bat

    docker run -v /c/Users/sampleuser:/home/docker ocrmypdf --skip-text test.pdf output.pdf

where ``/c/Users/sampleuser`` is a Unix representation of the Windows path ``C:\Users\sampleuser``, assuming a user named "sampleuser" is running ocrmypdf on a file in their home directory, and the files "test.pdf" and "output.pdf" are in the sampleuser folder. The Windows user must have read and write permissions for that folder.

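
The path mapping described above can be sketched as a small shell function for a POSIX shell. The name ``to_docker_path`` is hypothetical, and the function assumes a plain absolute path that starts with a single drive letter and a colon (no UNC paths):

.. code-block:: bash

    # Hypothetical helper: convert C:\Users\sampleuser to /c/Users/sampleuser.
    to_docker_path() {
        # Lower-case the drive letter (the first character of the path).
        drive=$(printf '%s' "$1" | cut -c1 | tr 'A-Z' 'a-z')
        # Drop the "C:" prefix and turn backslashes into forward slashes.
        rest=$(printf '%s' "$1" | cut -c3- | tr '\\' '/')
        printf '/%s%s\n' "$drive" "$rest"
    }

    to_docker_path 'C:\Users\sampleuser'   # prints /c/Users/sampleuser
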

Installing HEAD revision from sources
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

If you have ``git`` and ``python3.4`` or ``python3.5`` installed, you can install from source. When the ``pip`` installer runs,
it will alert you if dependencies are missing.

To install the HEAD revision from sources in the current Python 3 environment:

.. code-block:: bash

    pip3 install git+https://github.com/jbarlow83/OCRmyPDF.git

Or, to install in `development mode <https://pythonhosted.org/setuptools/setuptools.html#development-mode>`_, allowing customization of OCRmyPDF, use the ``-e`` flag:

.. code-block:: bash

    pip3 install -e git+https://github.com/jbarlow83/OCRmyPDF.git

On certain Linux distributions such as Ubuntu, you may need to
run the install command as superuser:

.. code-block:: bash

    sudo pip3 install [-e] git+https://github.com/jbarlow83/OCRmyPDF.git

Note that this will alter your system's Python distribution. If you prefer
to not install as superuser, you can install the package in a Python virtual environment:

.. code-block:: bash

    git clone -b master https://github.com/jbarlow83/OCRmyPDF.git
    python3 -m venv venv   # create the environment in ./venv
    source venv/bin/activate
    cd OCRmyPDF
    pip3 install .

However, ``ocrmypdf`` will only be accessible on the system PATH after
you activate the virtual environment.

To run the program:

.. code-block:: bash

    ocrmypdf --help

If not yet installed, the script will notify you about dependencies that
need to be installed. The script requires specific versions of the
dependencies. Versions older than the ones mentioned in the release notes
are likely not to be compatible with OCRmyPDF.
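
A quick way to see which external programs are absent before installing is sketched below. The function name ``missing_tools`` is hypothetical, and the list of executables (``tesseract``, ``gs``, ``qpdf``, ``unpaper``) is an assumption based on the dependencies named in this README. It checks only for presence on ``PATH``, not for the specific versions OCRmyPDF requires:

.. code-block:: bash

    # Hypothetical helper: print each named program that is not on PATH.
    missing_tools() {
        for tool in "$@"; do
            if ! command -v "$tool" >/dev/null 2>&1; then
                printf '%s\n' "$tool"
            fi
        done
    }

    # Assumed executable names; prints nothing if all of them are found.
    missing_tools tesseract gs qpdf unpaper
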

For everyone else, `see our documentation <https://ocrmypdf.readthedocs.io/en/latest/installation.html>`_ for installation steps.

Languages
---------

@@ -336,8 +81,8 @@ you can often find packages that provide language packs:
You can then pass the ``-l LANG`` argument to OCRmyPDF to give a hint as to what languages it should search for. Multiple
languages can be requested by joining them with ``+``, for example ``-l eng+fra``.
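
Since several language codes are passed as one ``+``-separated value (as in ``-l eng+fra``), a script that collects languages one at a time can build the argument with a helper like this sketch. ``join_langs`` is a hypothetical name, not an OCRmyPDF feature:

.. code-block:: bash

    # Hypothetical helper: join its arguments with "+".
    join_langs() {
        # "$*" joins the arguments using the first character of IFS.
        local IFS='+'
        printf '%s\n' "$*"
    }

    join_langs eng fra   # prints eng+fra
    # Possible use (not run here):
    # ocrmypdf -l "$(join_langs eng fra)" input_scanned.pdf output_searchable.pdf
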

Support
-------
Documentation and support
-------------------------

Once ocrmypdf is installed, the built-in help which explains the command syntax and options can be accessed via:

@@ -345,7 +90,7 @@ Once ocrmypdf is installed, the built-in help which explains the command syntax

    ocrmypdf --help

The `Wiki <https://github.com/jbarlow83/OCRmyPDF/wiki>`_ page also contains some tips and suggests.
Our `documentation is served on Read the Docs <https://ocrmypdf.readthedocs.io/en/latest/index.html>`_.

If you detect an issue, please:
