OCRmyPDF/tests
James R. Barlow 5be368fe75 Fix RTL text extraction order in fpdf2 renderer (#1655)
fpdf2's shape_text() produces RTL ligature glyphs (e.g. lam-alef) with
multi-character CMap entries whose character order gets reversed by the
bidi algorithm during text extraction, producing garbled output like
"سالح" instead of "سلاح".
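The failure mode can be reproduced with a toy model. This is only a sketch: the glyph names and the `TO_UNICODE` table are hypothetical stand-ins for a font's ToUnicode CMap, and `extract` approximates an extractor's bidi handling with a plain character reversal of a single RTL run.

```python
# Hypothetical ToUnicode table: glyph name -> text it maps to.
TO_UNICODE = {
    "seen": "\u0633",
    "hah": "\u062d",
    "lam_alef": "\u0644\u0627",  # one ligature glyph, two characters (lam, alef)
}


def extract(visual_glyphs):
    """Simplified extractor: concatenate each glyph's text in stream order,
    then reverse the characters to turn the RTL run into logical order."""
    text = "".join(TO_UNICODE[g] for g in visual_glyphs)
    return text[::-1]


# Glyph stream in visual (left-to-right) order for the word seen-lam-alef-hah.
visual = ["hah", "lam_alef", "seen"]
garbled = extract(visual)

expected = "\u0633\u0644\u0627\u062d"  # seen, lam, alef, hah
# The two characters inside the ligature entry are swapped by the reversal.
print(garbled == expected)  # False
```

The reversal works on characters, not glyphs, so the lam and alef inside the ligature's two-character entry trade places while the single-character glyphs come out correctly.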

For invisible text (the production OCR overlay path), bypass text shaping
and use encode_text() with pre-reversed strings. encode_text() maps
characters 1:1 in logical order, avoiding the ligature CMap issue. The
pre-reversal compensates for the bidi reversal applied by text extractors.
Because the text is invisible (Tr=3), the lack of joining forms is harmless.
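A minimal sketch of why the pre-reversal works, under the same simplifying assumption that an extractor reverses the characters of a pure RTL run. The helper names `pre_reverse` and `extract_one_to_one` are illustrative only and are not part of fpdf2 or OCRmyPDF; real input with mixed directions or combining marks would need more care.

```python
def pre_reverse(logical: str) -> str:
    """Reverse a pure-RTL string so its characters are emitted in visual order."""
    return logical[::-1]


def extract_one_to_one(glyph_chars):
    """Simplified extractor for a stream where every glyph maps to exactly
    one character: join in stream order, then reverse the RTL run."""
    return "".join(glyph_chars)[::-1]


logical = "\u0633\u0644\u0627\u062d"  # seen, lam, alef, hah

# With 1:1 encoding there is no ligature glyph: one glyph per character.
stream = list(pre_reverse(logical))
recovered = extract_one_to_one(stream)

print(recovered == logical)  # True
```

Because each glyph carries a single character, reversing the stream is a pure inverse of the pre-reversal, and no multi-character entry exists for the reversal to scramble.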

Add RTL text extraction tests that verify glyph stream order, ToUnicode
CMap 1:1 mappings, and correct logical order for Arabic (including
lam-alef ligature) and Hebrew scripts.
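The 1:1 CMap property described above could be checked with a helper along these lines. This is a sketch, not the actual test code: `non_one_to_one` is a hypothetical name, and it assumes the ToUnicode CMap has already been parsed into a dict of glyph ID to text by some other means.

```python
def non_one_to_one(cmap: dict) -> dict:
    """Return the CMap entries whose text is not exactly one character."""
    return {gid: text for gid, text in cmap.items() if len(text) != 1}


# Hypothetical parsed CMaps (glyph ID -> text); the IDs are made up.
shaped_cmap = {
    1: "\u0633",        # seen
    2: "\u0644\u0627",  # lam-alef ligature: two characters
    3: "\u062d",        # hah
}
unshaped_cmap = {
    1: "\u0633",  # seen
    2: "\u0644",  # lam
    3: "\u0627",  # alef
    4: "\u062d",  # hah
}

offenders_shaped = non_one_to_one(shaped_cmap)
offenders_unshaped = non_one_to_one(unshaped_cmap)
```

A test would then assert that the offender dict is empty for the invisible-text path, and report the offending glyph IDs otherwise.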
2026-04-04 01:40:38 -07:00