mirror/OCRmyPDF

mirror of https://github.com/ocrmypdf/OCRmyPDF.git synced 2026-05-05 13:16:55 -04:00

fritz-hh ae716a91cb jhove paths corrected

2013-04-26 16:11:59 +02:00

jhove paths corrected

2013-04-26 16:11:59 +02:00

Support for additional tesseract config files

2013-04-23 21:36:34 +02:00

added readme

2013-04-26 14:28:04 +02:00

.gitattributes

gitignore, gitattributes and releaseNotes added

2013-04-09 18:54:14 +02:00

.gitignore

gitignore, gitattributes and releaseNotes added

2013-04-09 18:54:14 +02:00

COPYRIGHT.md

Update COPYRIGHT.md

2013-04-26 12:19:28 +03:00

hocrTransform.py

hocrTransform: font changed to Helvetica

2013-04-26 11:49:21 +02:00

OCRmyPDF.sh

jhove paths corrected

2013-04-26 16:11:59 +02:00

README.md

Update README.md

2013-04-25 12:20:26 +03:00

RELEASE_NOTES.md

gitignore, gitattributes and releaseNotes added

2013-04-09 18:54:14 +02:00

README.md

OCRmyPDF

OCRmyPDF adds an OCR text layer to scanned PDF files, allowing them to be searched.

To get the script usage, call: ./OCRmyPDF.sh -h

Features

Generate a searchable PDF/A file from a PDF file containing only images
Keep the exact resolution of the original embedded images
If requested, deskew and/or clean the image before performing OCR
Validate the generated file against the PDF/A specification using jhove
Provide a debug mode to enable easy verification of the OCR results
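
As an illustration of the validation feature, here is a minimal Python sketch of checking a generated file with jhove. It is not taken from OCRmyPDF.sh, and how the script itself invokes jhove may differ. The helper names `jhove_command` and `validate_pdf` are hypothetical, and the sketch assumes a `jhove` launcher is on the PATH and that the standard `PDF-hul` module and `XML` output handler are enabled in the jhove configuration.

```python
import shutil
import subprocess


def jhove_command(pdf_path, jhove_bin="jhove"):
    """Build a jhove command line that checks a file with the PDF module.

    Hypothetical helper: -m selects the module (PDF-hul is assumed to be
    enabled) and -h selects the output handler (XML here).
    """
    return [jhove_bin, "-m", "PDF-hul", "-h", "XML", pdf_path]


def validate_pdf(pdf_path, jhove_bin="jhove"):
    """Run jhove on pdf_path and return its report as text.

    Returns None when the jhove launcher cannot be found, so callers can
    tell "not validated" apart from an empty report.
    """
    if shutil.which(jhove_bin) is None:
        return None
    result = subprocess.run(
        jhove_command(pdf_path, jhove_bin),
        capture_output=True,
        text=True,
        check=False,  # the report is inspected by the caller, not the exit code
    )
    return result.stdout
```

The report returned by `validate_pdf("output.pdf")` states the detected format and whether the file is well-formed and valid; interpreting the reported PDF/A profile is left to the caller.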

Motivation

I searched the web for a free tool to OCR PDF files on Linux/Unix: I found many, but none of them was satisfactory.

Either they produced PDF files with misplaced text below the image (making copy/paste impossible)
Or they changed the resolution of the embedded images
Or they generated PDF files of a ridiculously large size
Or they crashed when trying to OCR some of my PDF files
Or they did not produce valid PDF files (even though they were readable with my current PDF reader)

On top of that, none of them produced PDF/A files (a format dedicated to long-term storage / archiving).

... so I decided to develop my own tool (using various existing scripts as inspiration).

Install

TODO

Install jhove: download jhove from http://sourceforge.net/projects/jhove/files/jhove/

After extracting the jhove files to some directory "jhove", edit the file "jhove/conf/jhove.conf" and change something in "something" to the actual directory (ending in "/jhove").