Aspose.pdf destroys PDF Text

BenjaminA · December 7, 2018, 9:35am

Hi!
I have a scanned PDF that I converted to a searchable PDF with Tesseract: tesseract.pdf (85.6 KB). When I remove the image from the PDF, the text layer looks like this: tesseract_text.jpg (196.0 KB)

When I run TextAbsorber or TextFragmentAbsorber over the document and save it as a new PDF, the text layer gets corrupted (it no longer matches the original PDF).

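using Aspose.Pdf;
using Aspose.Pdf.Text;

// Load the PDF, run the absorber over all pages, and save without changing anything.
Document doc = new Aspose.Pdf.Document(@"C:/tesseract.pdf");
TextFragmentAbsorber textFragmentAbsorber = new TextFragmentAbsorber();
doc.Pages.Accept(textFragmentAbsorber);
doc.Save(@"C:/aspose.pdf", SaveFormat.Pdf);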

Output PDF: aspose.pdf (86.2 KB)
Corrupted text: aspose_text.png (251.3 KB)

I don't modify the PDF in any way between loading and saving.

I think this problem is related to:

BenjaminA · December 7, 2018, 11:01am

I tested a little more and found that the text is not corrupted but there is something wrong with the output font and font size.

Tesseract uses an embedded font called GlyphLessFont. Perhaps Aspose.PDF doesn't recognize this font.
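
One way to check the font and font size difference is to print the font name and size that Aspose.PDF reports for each text fragment in both files and compare the two listings. This is only a rough sketch: it assumes the two file paths from the first post, and the class name FontReport is made up for illustration.

```csharp
using System;
using Aspose.Pdf;
using Aspose.Pdf.Text;

class FontReport // hypothetical helper, not part of Aspose.PDF
{
    static void Main()
    {
        // Paths taken from the first post; adjust them to your environment.
        foreach (string path in new[] { @"C:/tesseract.pdf", @"C:/aspose.pdf" })
        {
            Document doc = new Document(path);

            // Collect every text fragment on every page.
            TextFragmentAbsorber absorber = new TextFragmentAbsorber();
            doc.Pages.Accept(absorber);

            Console.WriteLine(path);
            foreach (TextFragment fragment in absorber.TextFragments)
            {
                // Font name and size as Aspose.PDF reports them for this fragment.
                Console.WriteLine("  {0} | {1} | {2}",
                    fragment.TextState.Font?.FontName,
                    fragment.TextState.FontSize,
                    fragment.Text);
            }
        }
    }
}
```

If the listing for the original file shows GlyphLessFont and the listing for the saved file shows a different font name or size for the same text, that would support the guess that the font is not being preserved on save.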

asad.ali · December 7, 2018, 7:13pm

@benjamin.a

Thanks for contacting support.

We were able to reproduce the issue in our environment using Aspose.PDF for .NET 18.12, and we have logged it as PDFNET-45793 in our issue tracking system. We will look into it further and keep you posted on the status of the fix. Please be patient and spare us a little time.

We have also logged the above details with the issue, which will help us investigate and resolve it.

We are sorry for the inconvenience.