pageinfo: work around PyPDF extractText limitations on hidden text

It appears that extractText() does not find all text. At a glance, it
may be that Tesseract's PDF renderer generates a font and uses glyphs
that map to different Unicode code points than PyPDF expects, so PyPDF
discards the content and finds nothing. As a proxy, in lieu of better
PDF parsing, assume that a "GlyphLessFont" means there is text on the page.

I had previously found that checking for the presence of a font on the
page does not work: some PDF generators create a font resource entry
even if the font is never used.
James R. Barlow
2015-12-02 23:16:36 -08:00
parent d6124c1787
commit df1fda7438

@@ -112,12 +112,16 @@ def _page_has_text(pdf, page):
     # More nuanced test to deal with quirks of Tesseract PDF generation
     # Check if there's a Glyphless font
-    font = page['/Resources']['/Font']
-    font_objects = list(font.keys())
-    for font_object in font_objects:
-        basefont = font[font_object]['/BaseFont']
-        if basefont.endswith('GlyphLessFont'):
-            return True
+    try:
+        font = page['/Resources']['/Font']
+    except KeyError:
+        pass
+    else:
+        font_objects = list(font.keys())
+        for font_object in font_objects:
+            basefont = font[font_object]['/BaseFont']
+            if basefont.endswith('GlyphLessFont'):
+                return True
     return False
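
For illustration only, the heuristic in this change can be sketched as a standalone function. This is not the project's code: the name has_glyphless_font is hypothetical, plain dicts stand in for PyPDF page objects, and the font resource name '/F1' is made up. Unlike the commit, the sketch also tolerates a font dictionary that has no /BaseFont entry, which is an added assumption.

```python
def has_glyphless_font(page):
    """Return True if any font resource on the page ends with 'GlyphLessFont'.

    `page` is assumed to be a mapping shaped like a PDF page dictionary;
    plain dicts stand in for PyPDF objects in this sketch.
    """
    try:
        fonts = page['/Resources']['/Font']
    except KeyError:
        # No /Resources or no /Font entry, so there is nothing to inspect
        return False
    for font in fonts.values():
        # Assumption beyond the commit: a missing /BaseFont is treated as
        # "not GlyphLessFont" instead of raising KeyError
        basefont = str(font.get('/BaseFont', ''))
        if basefont.endswith('GlyphLessFont'):
            return True
    return False


# Hypothetical page dictionaries for demonstration
ocr_page = {'/Resources': {'/Font': {'/F1': {'/BaseFont': '/GlyphLessFont'}}}}
image_only_page = {'/Resources': {'/XObject': {}}}
other_font_page = {'/Resources': {'/Font': {'/F1': {'/BaseFont': '/Helvetica'}}}}

print(has_glyphless_font(ocr_page))
print(has_glyphless_font(image_only_page))
print(has_glyphless_font(other_font_page))
```

A page with no /Font entry falls through to False, which is the case the try/except in the diff was added to handle. As the commit message notes, this check is only a proxy for real text detection.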