pageinfo: work around PyPDF extractText limitations on hidden text

It appears that extractText() does not find all text. At a glance, it may be that Tesseract's PDF renderer generates a font whose glyphs map to different Unicode code points than PyPDF expects, so PyPDF discards the content and finds nothing. As a proxy, in lieu of better PDF parsing, assume that a "GlyphLessFont" means there is text on the page. I had previously found that checking for the presence of any font on the page does not work: some PDF generators create a font resource entry even if the font is never used.
2026-05-19 20:14:53 -04:00 · 2015-12-02 23:16:36 -08:00
parent d6124c1787
commit df1fda7438
1 changed file with 10 additions and 6 deletions
--- a/ocrmypdf/pageinfo.py
+++ b/ocrmypdf/pageinfo.py
@@ -112,12 +112,16 @@ def _page_has_text(pdf, page):

    # More nuanced test to deal with quirks of Tesseract PDF generation
    # Check if there's a Glyphless font
-    font = page['/Resources']['/Font']
-    font_objects = list(font.keys())
-    for font_object in font_objects:
-        basefont = font[font_object]['/BaseFont']
-        if basefont.endswith('GlyphLessFont'):
-            return True
+    try:
+        font = page['/Resources']['/Font']
+    except KeyError:
+        pass
+    else:
+        font_objects = list(font.keys())
+        for font_object in font_objects:
+            basefont = font[font_object]['/BaseFont']
+            if basefont.endswith('GlyphLessFont'):
+                return True

    return False
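
The check introduced above can be tried in isolation. The following is a minimal sketch, not the project's code: the helper name `has_glyphless_font` is hypothetical, and plain dicts stand in for PyPDF page objects, which are assumed here to support the same key lookups. It also assumes a font entry may lack `/BaseFont` and treats that case as "not glyphless" rather than raising, which the committed code does not do.

```python
def has_glyphless_font(page):
    """Return True if any font resource on the page ends with 'GlyphLessFont'.

    `page` is assumed to be a dict-like object keyed by PDF names.
    """
    try:
        fonts = page['/Resources']['/Font']
    except KeyError:
        # No /Resources or no /Font entry: nothing to inspect.
        return False

    for name in list(fonts.keys()):
        # Assumption: tolerate font dictionaries without /BaseFont.
        basefont = fonts[name].get('/BaseFont', '')
        if str(basefont).endswith('GlyphLessFont'):
            return True
    return False


# Hypothetical stand-ins for pages, built from plain dicts.
ocr_page = {'/Resources': {'/Font': {'/f-0-0': {'/BaseFont': '/GlyphLessFont'}}}}
image_only_page = {'/Resources': {'/XObject': {}}}
other_font_page = {'/Resources': {'/Font': {'/F1': {'/BaseFont': '/Helvetica'}}}}
no_basefont_page = {'/Resources': {'/Font': {'/T3': {'/Subtype': '/Type3'}}}}

for label, page in [('ocr', ocr_page), ('image only', image_only_page),
                    ('other font', other_font_page), ('no basefont', no_basefont_page)]:
    print(label, has_glyphless_font(page))
```

Returning `False` directly from the `except` branch is equivalent to the `try`/`except`/`else` layout in the diff when nothing follows the font check; the `else` form leaves room for further tests after it.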