OCRmyPDF/tests
James R. Barlow d68e2f6e34 Fix OCR text layer misalignment with non-zero mediabox origins
Fixes #1630 where --redo-ocr would shift OCR text vertically on PDFs
with non-zero mediabox origins (e.g., [0, 100, width, height+100]).

The bug occurred in _graft_fpdf2_text_layer, where the Form XObject BBox
was set to the text layer's mediabox [0, 0, w, h] instead of the base
page's mediabox [0, 100, w, h+100]. This caused a coordinate mismatch
between the BBox and the transformation matrix, so the text was
positioned incorrectly.
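
As a rough illustration of why the two boxes are easy to confuse, the
sketch below compares them numerically. It is a simplified model, not
the code in _graft.py: the helper names box_size and origin_shift and
the 612x792 page size are assumptions made for the example, and only the
100-unit offset comes from the description above.

```python
# Illustrative model only. Boxes are [llx, lly, urx, ury], as in PDF.
# box_size and origin_shift are hypothetical helpers, not OCRmyPDF APIs.


def box_size(box):
    """Return (width, height) of a PDF rectangle."""
    llx, lly, urx, ury = box
    return (urx - llx, ury - lly)


def origin_shift(box_a, box_b):
    """Return (dx, dy) from box_a's lower-left corner to box_b's."""
    return (box_b[0] - box_a[0], box_b[1] - box_a[1])


# Assumed page size (US Letter in points); the offset of 100 matches
# the example mediabox [0, 100, width, height+100].
width, height = 612, 792
text_layer_mediabox = [0, 0, width, height]
base_mediabox = [0, 100, width, height + 100]

# Both boxes have identical dimensions...
same_size = box_size(text_layer_mediabox) == box_size(base_mediabox)

# ...but their origins differ by the mediabox offset.
shift = origin_shift(text_layer_mediabox, base_mediabox)

print(same_size, shift)
```

Because the dimensions agree, a BBox taken from the wrong page still
looks plausible. Only the origin differs, and in this model that
difference equals the mediabox offset.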

The fix changes line 450 in _graft.py to use base_mediabox instead of
mediabox, making the fpdf2 renderer consistent with the sandwich
renderer, which already used base_mediabox correctly.

This issue commonly affected:
- JSTOR PDFs (generated by iText with cropping)
- Cropped PDFs from various tools
- PDFs with non-standard coordinate systems

Added a regression test that creates a PDF with an offset mediabox
origin and verifies that --redo-ocr preserves coordinates correctly.
2026-02-08 23:55:26 -08:00