OCRmyPDF

mirror of https://github.com/ocrmypdf/OCRmyPDF.git synced 2026-02-08 21:33:02 -05:00

Author	SHA1	Message	Date
James R. Barlow	740f67091c	Rename OCROptions to OcrOptions for consistency. Technically OCROptions is more Pythonic, but we have several pre-existing classes named OcrWhatever. Go with the local flow.	2026-01-12 23:37:54 -08:00
James R. Barlow	0c3745a1a4	Add OCR engine selection framework and null OCR engine. Introduce --ocr-engine option to select between OCR engines: 'auto' (default) uses Tesseract; 'tesseract' is explicit Tesseract selection; 'none' skips OCR entirely (for PDF processing only). Key changes: extend OcrEngine ABC with generate_ocr() and supports_generate_ocr() for direct OcrElement tree output (bypasses hOCR); add get_ocr_engine(options) hook parameter for engine selection; implement NullOcrEngine for --ocr-engine none; export OcrElement, OcrClass, BoundingBox from ocrmypdf package; add ocr_tree support to grafting pipeline. This prepares the foundation for pluggable OCR engines while maintaining full backward compatibility with existing Tesseract-based workflows.	2026-01-12 10:11:14 -08:00
James R. Barlow	16c2604a07	Remove lossy JBIG2 support, retain lossless JBIG2 only. Lossy JBIG2 has been removed due to well-documented risks of character substitution errors (e.g., 6/8 confusion). The --jbig2-lossy and --jbig2-page-group-size arguments are now deprecated and ignored with a warning. Changes: remove jbig2_lossy and jbig2_page_group_size from OCROptions; simplify optimize.py to use single-image JBIG2 encoding only (no symbol dictionaries/JBIG2Globals); remove convert_group() from jbig2enc.py; deprecate CLI args with warnings for backward compatibility; update documentation to explain lossless-only JBIG2.	2025-12-23 02:45:07 -08:00
James R. Barlow	0ad7f5fc13	feat: add dynamic nested access to plugin options. Completes Phase 5 of the CLI refactoring plan by enabling nested plugin option access (e.g., options.tesseract.timeout) alongside the legacy flat access (options.tesseract_timeout). Changes: add module-level plugin option model registry in _options.py; add __getattr__ to OCROptions for dynamic namespace access; register plugin models in setup_plugin_infrastructure(); add test for nested plugin option access. Plugin option instances are lazily created from flat field values and cached for subsequent access. 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2025-12-21 12:21:48 -08:00
James R. Barlow	dbd3c93757	Fix issue with unpickling HOCRResult. Fixes "[Bug]: HOCRResult.from_json() not unpickling correctly" #1427	2024-11-10 02:05:57 -08:00
James R. Barlow	7a8cc21e31	Add support for sidecar output to io.BytesIO. Closes #1252	2024-04-07 01:38:55 -07:00
James R. Barlow	71166f7be8	Make hocr API experimental for now. This commit can be reverted when we are ready to release a new version.	2023-10-30 00:07:10 -07:00
James R. Barlow	fbf0674189	hocr_to_ocr_pdf: handle missing hocr json file	2023-10-24 00:54:31 -07:00
James R. Barlow	23951c9e38	Working HOCR folder to PDF converter	2023-10-24 00:54:30 -07:00
James R. Barlow	68bb38d0ad	pdf_to_hocr: improve plugin handling	2023-10-24 00:52:31 -07:00
James R. Barlow	0443e87345	Introduce pdf_to_hocr API	2023-10-24 00:52:31 -07:00
James R. Barlow	de2bb5ce8c	Remove tqdm dependency and TqdmConsole. Might be too aggressive? No deprecation warning....	2023-09-21 00:05:39 -07:00
James R. Barlow	80ed2117cc	Change to SPDX license tracking	2022-07-28 01:10:07 -07:00
James R. Barlow	dc6f1a266a	Modernize type annotations	2022-07-23 00:39:24 -07:00
James R. Barlow	aa0ec40102	Change license of all GPLv3 files to MPL-2.0. https://github.com/jbarlow83/OCRmyPDF/issues/600	2020-08-05 00:44:42 -07:00
James R. Barlow	f4cb424451	Support input/output streams at API level	2020-06-22 02:02:18 -07:00
James R. Barlow	58abb5785c	pytest picky about list vs tuple	2020-04-15 03:16:51 -07:00
James R. Barlow	4a640b8dcd	Fix language argument not working as list. Fixes #523	2020-04-14 23:18:52 -07:00
James R. Barlow	c36e9950ae	tests: test TqdmConsole	2019-12-30 17:51:09 -08:00
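
Commit 0ad7f5fc13 describes nested plugin option access (options.tesseract.timeout) living alongside flat access (options.tesseract_timeout), with namespace objects created lazily and cached. The following is only a rough sketch of that general pattern, not the code from the commit: the names FlatOptions, TesseractNamespace, register_plugin_model and _PLUGIN_MODELS are hypothetical, and the default values are made up for illustration.

```python
from dataclasses import dataclass, fields
from typing import Any, Dict, Optional

# Hypothetical module-level registry: namespace name -> option model class.
_PLUGIN_MODELS: Dict[str, type] = {}


def register_plugin_model(namespace: str, model: type) -> None:
    """Record which model class backs a given option namespace."""
    _PLUGIN_MODELS[namespace] = model


@dataclass
class TesseractNamespace:
    # Illustrative fields and defaults only (assumed, not taken from the project).
    timeout: float = 180.0
    oem: Optional[int] = None


class FlatOptions:
    """Stores flat fields and exposes them through nested namespaces too."""

    def __init__(self, **flat: Any) -> None:
        self._flat = dict(flat)
        self._cache: Dict[str, Any] = {}

    def __getattr__(self, name: str) -> Any:
        # __getattr__ runs only when normal attribute lookup fails.
        # Private names are refused early so a missing _flat or _cache
        # cannot cause infinite recursion.
        if name.startswith("_"):
            raise AttributeError(name)
        if name in self._flat:
            return self._flat[name]  # legacy flat access
        if name in _PLUGIN_MODELS:
            if name not in self._cache:
                model = _PLUGIN_MODELS[name]
                prefix = name + "_"
                # Build the namespace from flat fields such as tesseract_timeout.
                kwargs = {
                    f.name: self._flat[prefix + f.name]
                    for f in fields(model)
                    if prefix + f.name in self._flat
                }
                self._cache[name] = model(**kwargs)  # created lazily, then cached
            return self._cache[name]
        raise AttributeError(name)


register_plugin_model("tesseract", TesseractNamespace)
opts = FlatOptions(tesseract_timeout=30.0)
print(opts.tesseract_timeout, opts.tesseract.timeout)
```

Because the namespace object is cached after the first access, repeated lookups return the same instance. In this sketch, fields missing from the flat values fall back to the model's defaults.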

19 Commits
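
Commit 0c3745a1a4 describes selecting an OCR engine by name, where 'auto' resolves to Tesseract and 'none' selects a null engine that skips OCR. A minimal sketch of that kind of name-based selection is shown below. It does not reproduce the project's OcrEngine interface or hook signatures: Engine, FakeTesseract, NullEngine, select_engine and the recognize() method are all hypothetical stand-ins.

```python
from abc import ABC, abstractmethod


class Engine(ABC):
    """Hypothetical stand-in for an OCR engine interface."""

    @abstractmethod
    def recognize(self, page_image: str) -> str:
        """Return the text recognized on one page image."""


class FakeTesseract(Engine):
    # Placeholder only; a real engine would invoke an OCR backend here.
    def recognize(self, page_image: str) -> str:
        return f"<text recognized from {page_image}>"


class NullEngine(Engine):
    # Null-object pattern: satisfies the interface but produces no text.
    def recognize(self, page_image: str) -> str:
        return ""


_ENGINES = {"tesseract": FakeTesseract, "none": NullEngine}


def select_engine(name: str = "auto") -> Engine:
    """Map an engine name to an instance; 'auto' falls back to Tesseract."""
    if name == "auto":
        name = "tesseract"
    try:
        return _ENGINES[name]()
    except KeyError:
        raise ValueError(f"unknown OCR engine: {name!r}") from None


engine = select_engine("none")
print(repr(engine.recognize("page-001.png")))
```

A null engine lets the rest of a pipeline run unchanged when OCR is disabled, so callers need no special case for "no engine".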