mirror of
https://github.com/ocrmypdf/OCRmyPDF.git
synced 2026-05-05 13:16:55 -04:00
Readme: Add table of contents, brew install tesseract --with-language packs
69
README.rst
@@ -8,7 +8,7 @@ Main features
 -------------

 - Generates a searchable
-  `PDF/A <https://en.wikipedia.org/?title=PDF/A>`__ file from a regular PDF
+  `PDF/A <https://en.wikipedia.org/?title=PDF/A>`_ file from a regular PDF
 - Places OCR text accurately below the image to ease copy / paste
 - Keeps the exact resolution of the original embedded images
 - When possible, inserts OCR information as a "lossless" operation without rendering vector information
@@ -18,11 +18,11 @@ Main features
 - Provides debug mode to enable easy verification of the OCR results
 - Processes pages in parallel when more than one CPU core is
   available
-- Uses `Tesseract OCR <https://github.com/tesseract-ocr/tesseract>`__ engine
-- Supports the `39 languages <https://code.google.com/p/tesseract-ocr/downloads/list>`__ recognized by Tesseract
+- Uses `Tesseract OCR <https://github.com/tesseract-ocr/tesseract>`_ engine
+- Supports the `39 languages <https://code.google.com/p/tesseract-ocr/downloads/list>`_ recognized by Tesseract
 - Battle-tested on thousands of PDFs, a test suite and continuous integration

-For details: please consult the `release notes <RELEASE_NOTES.rst>`__.
+For details: please consult the `release notes <RELEASE_NOTES.rst>`_.

 Motivation
 ----------
@@ -31,9 +31,9 @@ I searched the web for a free command line tool to OCR PDF files on
 Linux/UNIX: I found many, but none of them were really satisfying.

 - Either they produced PDF files with misplaced text under the image (making copy/paste impossible)
 - Or they did not correctly display some escaped HTML characters located in the hOCR file produced by the OCR engine
 - Or they did not handle accents and multilingual characters
 - Or they changed the resolution of the embedded images
-- Or they generated PDF files having a ridiculous big size
+- Or they generated ridiculously large PDF files
 - Or they crashed when trying to OCR some of my PDF files
 - Or they did not produce valid PDF files (even though they were readable with my current PDF reader)
 - On top of that, none of them produced PDF/A files (a format dedicated to long-term storage)
@@ -46,10 +46,21 @@ Installation

 Download OCRmyPDF here: https://github.com/jbarlow83/OCRmyPDF/releases

-You can install it to a Python virtual environment or system-wide.
+These steps describe how to install OCRmyPDF on your system.

-Debian and Ubuntu
-~~~~~~~~~~~~~~~~~
+- `Installing on Debian and Ubuntu`_ (Debian stretch and Ubuntu 16.10 or later)
+- `Installing the Docker image`_
+- `Installing on Mac OS X`_
+- `Installing on Ubuntu 14.04 LTS`_
+- Installing and running on `Windows`_ using the Docker image
+
+If you prefer to install from source or install OCRmyPDF to a Python virtual environment, see steps for `Installing HEAD revision from sources`_.
+
+.. _Windows: `Installing on Windows`_
+
+
+Installing on Debian and Ubuntu
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

 Users of Debian 9 or later or Ubuntu 16.10 or later may simply
 ``apt-get install ocrmypdf``.
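
After a package install like the one above, it can help to confirm that the command-line tools are reachable. The following is a minimal sketch and not part of the README: ``missing_tools`` is a hypothetical helper, and the binary names (``ocrmypdf``, ``tesseract``, ``gs`` for Ghostscript, ``qpdf``) are assumptions about how these packages appear on PATH.

```bash
# Hypothetical helper (not part of OCRmyPDF): print each named command
# that cannot be found on PATH, one per line.
missing_tools() {
    local tool
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Assumed binary names; adjust them to match your system.
missing_tools ocrmypdf tesseract gs qpdf
```

No output means every listed command was found; any name printed is one that still needs installing or adding to PATH.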
@@ -61,7 +72,7 @@ Installing the Docker image

 For many users, installing the Docker image will be easier than installing all of OCRmyPDF's dependencies. For Windows, it is the only option.

-If you have `Docker <https://docs.docker.com/>`__ installed on your system, you can install
+If you have `Docker <https://docs.docker.com/>`_ installed on your system, you can install
 a Docker image of the latest release.

 Follow the Docker installation instructions for your platform. If you can run this command
@@ -94,7 +105,7 @@ Then tag it to give a more convenient name, just ocrmypdf:

     docker tag jbarlow83/ocrmypdf ocrmypdf

-This image contains language packs for English, French, Spanish and German. The alternative "polyglot" image provides `all available language packs <https://github.com/tesseract-ocr/tesseract/blob/master/doc/tesseract.1.asc#languages>`__:
+This image contains language packs for English, French, Spanish and German. The alternative "polyglot" image provides `all available language packs <https://github.com/tesseract-ocr/tesseract/blob/master/doc/tesseract.1.asc#languages>`_:

 .. code-block:: bash

@@ -108,7 +119,7 @@ You can then run ocrmypdf using the command:

     docker run ocrmypdf --help

-To execute OCRmyPDF on a local file, you must `provide a writable volume to the Docker image <https://docs.docker.com/userguide/dockervolumes/>`__, as in this template:
+To execute OCRmyPDF on a local file, you must `provide a writable volume to the Docker image <https://docs.docker.com/userguide/dockervolumes/>`_, as in this template:

 .. code-block:: bash

@@ -128,7 +139,7 @@ Installing on Mac OS X

 These instructions probably work on all Mac OS X versions later than 10.7 (Lion). OCRmyPDF is known to work on Yosemite and El Capitan, and regularly tested on El Capitan.

-If it's not already present, `install Homebrew <http://brew.sh/>`__.
+If it's not already present, `install Homebrew <http://brew.sh/>`_.

 Update Homebrew:

@@ -140,13 +151,22 @@ Install or upgrade the required Homebrew packages, if any are missing:

 .. code-block:: bash

-    brew install libpng openjpeg jbig2dec # image libraries
+    brew install libpng openjpeg jbig2dec libtiff # image libraries
     brew install qpdf
     brew install ghostscript
     brew install python3
     brew install libxml2 libffi leptonica
-    brew install unpaper # optional
-    brew install tesseract
+    brew install unpaper # optional

+Install the required Tesseract OCR engine with the language packs you plan to use:
+
+.. code-block:: bash
+
+    brew install tesseract # Option 1: for English, French, German, Spanish
+
+.. code-block:: bash
+
+    brew install tesseract --with-all-languages # Option 2: for all language packs
+
 Update the homebrew pip and install Pillow:

@@ -205,7 +225,7 @@ into your system Python, which could interfere with other programs):

 If you wish to install OCRmyPDF to a virtual environment to isolate the system Python from modification, you can
 follow these steps. This includes a workaround `for a known, unresolved issue in Ubuntu 14.04's ensurepip
-package <http://www.thefourtheye.in/2014/12/Python-venv-problem-with-ensurepip-in-Ubuntu.html>`__:
+package <http://www.thefourtheye.in/2014/12/Python-venv-problem-with-ensurepip-in-Ubuntu.html>`_:

 .. code-block:: bash

@@ -218,7 +238,7 @@ package <http://www.thefourtheye.in/2014/12/Python-venv-problem-with-ensurepip-i
     source venv-ocrmypdf/bin/activate
     pip install ocrmypdf

-Ubuntu 14.04 only installs ``unpaper`` version 0.4.2, which is not supported by OCRmyPDF because it produces invalid output. This program is an optional dependency, and provides page deskewing and cleaning. See `Dockerfile <Dockerfile>`__ for an example of how to build unpaper 6.1 from source. If you choose to install unpaper later, OCRmyPDF will use the foremost version on the system PATH.
+Ubuntu 14.04 only installs ``unpaper`` version 0.4.2, which is not supported by OCRmyPDF because it produces invalid output. This program is an optional dependency, and provides page deskewing and cleaning. See `Dockerfile <Dockerfile>`_ for an example of how to build unpaper 6.1 from source. If you choose to install unpaper later, OCRmyPDF will use the foremost version on the system PATH.

 Installing on Windows
 ~~~~~~~~~~~~~~~~~~~~~
@@ -248,7 +268,7 @@ To install the HEAD revision from sources in the current Python 3 environment:

     pip3 install git+https://github.com/jbarlow83/OCRmyPDF.git

-Or, to install in `development mode <https://pythonhosted.org/setuptools/setuptools.html#development-mode>`__, allowing customization of OCRmyPDF, use the ``-e`` flag:
+Or, to install in `development mode <https://pythonhosted.org/setuptools/setuptools.html#development-mode>`_, allowing customization of OCRmyPDF, use the ``-e`` flag:

 .. code-block:: bash

@@ -294,8 +314,11 @@ you can often find packages that provide language packs:

 .. code-block:: bash

+    # Display a list of all Tesseract language packs
+    apt-cache search tesseract-ocr
+
     # Debian/Ubuntu users
-    sudo apt-get install tesseract-ocr-chi-sim
+    sudo apt-get install tesseract-ocr-chi-sim # Example: Install Chinese Simplified language pack

 You can then pass the ``-l LANG`` argument to OCRmyPDF to give a hint as to what languages it should search for. Multiple
 languages can be requested.
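
As an illustration of requesting several languages at once, the sketch below joins Tesseract language codes with ``+``, the form Tesseract itself accepts, and prints the resulting command instead of running it. ``join_langs``, ``input.pdf`` and ``output.pdf`` are hypothetical names, and it is an assumption that your OCRmyPDF version accepts the ``+``-joined form for ``-l``; ``ocrmypdf --help`` should confirm the exact syntax.

```bash
# Hypothetical helper: join language codes with "+" (eng fra -> eng+fra).
join_langs() {
    local IFS='+'
    printf '%s\n' "$*"
}

langs=$(join_langs eng fra deu)

# Print the command rather than run it; the file names are placeholders.
echo "ocrmypdf -l ${langs} input.pdf output.pdf"
```

Each code must correspond to an installed language pack, for example one installed through the ``tesseract-ocr-*`` packages listed above.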
@@ -309,6 +332,8 @@ Once ocrmypdf is installed, the built-in help which explains the command syntax

     ocrmypdf --help

+The `Wiki <https://github.com/jbarlow83/OCRmyPDF/wiki>`_ page also contains some tips and suggestions.
+
 If you detect an issue, please:

 - Check whether your issue is already known
@@ -323,11 +348,11 @@ If you detect an issue, please:
 Press & Media
 -------------

-- `c't 1-2014, page 59 <http://heise.de/-2279695>`__:
+- `c't 1-2014, page 59 <http://heise.de/-2279695>`_:
   Detailed presentation of OCRmyPDF v1.0 in the leading German IT
   magazine c't
 - `heise Open Source, 09/2014: Texterkennung mit
-  OCRmyPDF <http://heise.de/-2356670>`__
+  OCRmyPDF <http://heise.de/-2356670>`_

 Disclaimer
 ----------
