mirror of
https://github.com/ocrmypdf/OCRmyPDF.git
synced 2026-04-18 05:00:01 -04:00
Fixes #1630, where --redo-ocr shifted OCR text vertically on PDFs whose mediabox origin is non-zero (e.g., [0, 100, width, height+100]).

The bug was in _graft_fpdf2_text_layer: the Form XObject BBox was set to the text layer's mediabox [0, 0, w, h] instead of the base page's mediabox [0, 100, w, h+100]. The BBox then disagreed with the transformation matrix, so the text was positioned incorrectly.

The fix changes line 450 in _graft.py to use base_mediabox instead of mediabox, making the fpdf2 renderer consistent with the sandwich renderer, which already used base_mediabox correctly.

This issue commonly affected:
- JSTOR PDFs (generated by iText with cropping)
- Cropped PDFs from various tools
- PDFs with non-standard coordinate systems

Added a regression test that creates a PDF with an offset mediabox origin and verifies that --redo-ocr preserves coordinates correctly.
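
The mismatch described above can be illustrated with a small standalone sketch. It is not the code in _graft.py: Box, origin_offset, and choose_form_bbox are hypothetical names introduced here, and the page size of 612 x 792 is an assumed example. The sketch models only which rectangle the BBox is taken from and how far its origin sits from the base page's mediabox origin.

```python
from typing import NamedTuple, Tuple


class Box(NamedTuple):
    """A PDF rectangle [llx, lly, urx, ury] in user space units."""
    llx: float
    lly: float
    urx: float
    ury: float


def origin_offset(base_mediabox: Box, bbox: Box) -> Tuple[float, float]:
    """Return (dx, dy) from the bbox's lower-left corner to the base page's."""
    return (base_mediabox.llx - bbox.llx, base_mediabox.lly - bbox.lly)


def choose_form_bbox(base_mediabox: Box, layer_mediabox: Box, *, fixed: bool) -> Box:
    """Hypothetical stand-in for the BBox choice described in the commit.

    fixed=False models the old behaviour (text layer's mediabox);
    fixed=True models the new behaviour (base page's mediabox).
    """
    return base_mediabox if fixed else layer_mediabox


# Assumed example: a US Letter page whose mediabox origin is shifted up by 100.
w, h = 612.0, 792.0
base = Box(0.0, 100.0, w, h + 100.0)   # base page: [0, 100, w, h+100]
layer = Box(0.0, 0.0, w, h)            # text layer: [0, 0, w, h]

buggy_bbox = choose_form_bbox(base, layer, fixed=False)
fixed_bbox = choose_form_bbox(base, layer, fixed=True)

# With the old choice the BBox origin is 100 units away from the base page's
# origin; with the new choice the two origins coincide.
buggy_offset = origin_offset(base, buggy_bbox)
fixed_offset = origin_offset(base, fixed_bbox)
print("old BBox offset:", buggy_offset)
print("new BBox offset:", fixed_offset)
```

A non-zero offset in the old case is the disagreement between the BBox and the coordinate system that the transformation matrix targets. How that offset translates into the visible shift depends on the matrix computed in the real grafting code, which this sketch does not reproduce.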