Font replacement issue for Tesseract OCR PDFs

Hi there,

I am having an issue using Aspose.PDF to convert a PDF with a selectable text layer to PDF/A.

When I use the Tesseract OCR fork from UB-Mannheim to OCR images and output them to PDF, the resulting PDFs produce an error when I attempt to convert them to PDF/A with Aspose.PDF. The converted PDFs still retain the selectable text layer, but the position and width of the text in that layer are incorrect.

This seems to be an issue with font replacement when handling the PDF output of Tesseract OCR, as shown by this error logged in the ConversionLog.xml file:
“Width information for glyphs is inconsistent in embedded font ‘GlyphLessFont’”
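For anyone triaging similar files, one quick way to confirm that a PDF carries this font before conversion is to look for the font name in the raw file bytes. The following is only a minimal Python sketch and is not part of the attached project: the function names `contains_glyphless_font` and `uses_glyphless_font` are hypothetical, and it assumes the font name is stored uncompressed in the file, which may not hold for PDFs that use compressed object streams.

```python
from pathlib import Path

# Font name reported in the conversion log for the Tesseract-generated PDF.
GLYPHLESS_FONT_NAME = b"GlyphLessFont"


def contains_glyphless_font(pdf_bytes: bytes) -> bool:
    """Return True if the raw PDF bytes mention the GlyphLessFont name."""
    return GLYPHLESS_FONT_NAME in pdf_bytes


def uses_glyphless_font(pdf_path: str) -> bool:
    """Read a PDF from disk and check its bytes for the GlyphLessFont name."""
    return contains_glyphless_font(Path(pdf_path).read_bytes())
```

Calling `uses_glyphless_font("samplepdf.pdf")` on the sample input would be one way to flag files likely to hit this conversion error; a negative result is not conclusive, since the name could sit inside a compressed stream.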

I have attached a .NET Framework console application, AsposeConversionError.zip (6.0 MB), which can be used to reproduce the issue. The license is not included, so it will need to be added as AsposeConversionError.Properties.Resources.Aspose_Total_NET. The sample input PDF is taken directly from the output of Tesseract OCR.

To make the issue easier to see at a glance, the ConversionLog.xml, samplepdf.pdf and output.pdf files are also included in the zip file.

It would be nice to see this issue addressed, as Tesseract is a commonly used OCR engine.

@frimbingpickering
I checked the file produced by the attached application in Adobe Acrobat Pro 2023 and in Preflight. Both show that the document complies with the PDF/A-1B standard.
image.png (92.1 KB)