mirror of
https://github.com/ocrmypdf/OCRmyPDF.git
synced 2026-05-03 20:24:33 -04:00
fpdf2's shape_text() produces RTL ligature glyphs (e.g. lam-alef) with multi-character CMap entries whose character order gets reversed by the bidi algorithm during text extraction, producing garbled output like "سالح" instead of "سلاح".

For invisible text (the production OCR overlay path), bypass text shaping and use encode_text() with pre-reversed strings. encode_text() maps characters 1:1 in logical order, avoiding the ligature CMap issue. The pre-reversal compensates for bidi reversal by text extractors. Since the text is invisible (Tr=3), the lack of joining forms is harmless.

Add RTL text extraction tests that verify glyph stream order, ToUnicode CMap 1:1 mappings, and correct logical order for Arabic (including lam-alef ligature) and Hebrew scripts.
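
The pre-reversal idea can be illustrated with a minimal, self-contained sketch. It is not the code from this change: the helper names (is_rtl, to_visual_order, simulate_extractor) are hypothetical, no fpdf2 calls are made, and the "extractor" is a stand-in that simply reverses any run containing RTL characters. Real text may also contain digits, combining marks, and mixed-direction runs, which this sketch does not handle.

```python
import unicodedata

# Bidi classes for strongly right-to-left characters:
# "R" covers Hebrew letters, "AL" covers Arabic letters.
RTL_CLASSES = {"R", "AL"}


def is_rtl(text: str) -> bool:
    """Return True if any character in text is strongly right-to-left."""
    return any(unicodedata.bidirectional(ch) in RTL_CLASSES for ch in text)


def to_visual_order(word: str) -> str:
    """Pre-reverse an RTL word so each code point is emitted as one glyph,
    in the order an extractor is assumed to reverse back (hypothetical helper)."""
    return word[::-1] if is_rtl(word) else word


def simulate_extractor(glyph_order: str) -> str:
    """Stand-in for a text extractor that reverses RTL runs (an assumption,
    not the behavior of any specific tool)."""
    return glyph_order[::-1] if is_rtl(glyph_order) else glyph_order


# Escapes are used so the source reads the same regardless of editor bidi rendering.
arabic = "\u0633\u0644\u0627\u062d"  # contains the lam + alef pair
hebrew = "\u05e9\u05dc\u05d5\u05dd"

for word in (arabic, hebrew, "abc"):
    # Round trip: pre-reverse on write, extractor reverses on read.
    assert simulate_extractor(to_visual_order(word)) == word
```

Because every code point is written as its own glyph, lam and alef stay separate entries and there is no multi-character mapping for the extractor to reverse incorrectly.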