mirror of
https://github.com/ocrmypdf/OCRmyPDF.git
synced 2026-05-19 20:14:53 -04:00
pageinfo: workaround PyPDF extractText limitations on hidden text
It appears that extractText() does not find all text. At a glance, it may be that Tesseract's PDF renderer generates a font whose glyphs map to different Unicode code points than PyPDF expects, so PyPDF discards the content and finds nothing.

As a proxy in lieu of better PDF parsing, assume that a "GlyphLessFont" means there is text on the page.

I had previously found that it does not work to check for the presence of a font on the page, because some PDF generators create a font resource entry even if the font is never used.
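The heuristic described above can be sketched in isolation. This is a minimal illustration, not the project's code: plain dicts stand in for PyPDF page objects, the function name `has_glyphless_font` is hypothetical, and the fallback for a font entry without a `/BaseFont` key is an added assumption that the commit itself does not make.

```python
def has_glyphless_font(page):
    """Return True if any font resource on the page ends with 'GlyphLessFont'.

    `page` is assumed to be a dict-like object shaped like a PDF page
    dictionary: page['/Resources']['/Font'] maps font names to font dicts.
    """
    try:
        fonts = page['/Resources']['/Font']
    except KeyError:
        # No resources or no font resources: nothing to inspect.
        return False

    for font_name in fonts:
        # Assumption (not in the commit): tolerate fonts lacking /BaseFont.
        basefont = fonts[font_name].get('/BaseFont', '')
        if str(basefont).endswith('GlyphLessFont'):
            return True
    return False


# Hypothetical page dictionaries for illustration only.
tesseract_page = {
    '/Resources': {'/Font': {'/f-0-0': {'/BaseFont': '/GlyphLessFont'}}}
}
image_only_page = {'/Resources': {'/XObject': {}}}

print(has_glyphless_font(tesseract_page))
print(has_glyphless_font(image_only_page))
```

Matching on the font name rather than on the mere presence of a `/Font` entry reflects the point made in the message: a font resource can exist without ever being used, so its presence alone is not evidence of text.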
@@ -112,12 +112,16 @@ def _page_has_text(pdf, page):
     # More nuanced test to deal with quirks of Tesseract PDF generation
     # Check if there's a Glyphless font
-    font = page['/Resources']['/Font']
-    font_objects = list(font.keys())
-    for font_object in font_objects:
-        basefont = font[font_object]['/BaseFont']
-        if basefont.endswith('GlyphLessFont'):
-            return True
+    try:
+        font = page['/Resources']['/Font']
+    except KeyError:
+        pass
+    else:
+        font_objects = list(font.keys())
+        for font_object in font_objects:
+            basefont = font[font_object]['/BaseFont']
+            if basefont.endswith('GlyphLessFont'):
+                return True
 
     return False
 
 