Outstanding Issue with Aspose PDF TextAbsorber

Hi, back on 28-05-2013 (PDFNEWNET-35382) I posted an issue where extracting text from PDFs with Aspose.Pdf.Text.TextAbsorber returns garbage.



Attached is the example PDF and output.



We are using Aspose.PDF 9.5.0.0 with an Aspose.Total licence.



Test environments have been on Windows 8.1 and Server 2012.



We have had a client waiting on a solution for far too long now. Can you please provide an update on a resolution?



Regards,

Bryant.

Hi Bryant,


Thanks for your patience.

The development team started investigating the issue described above, but because it has a lower priority than other open issues, its resolution has been postponed. We understand that it has been a couple of months since this problem was reported, but other issues take precedence, so this one will be resolved after the high-priority issues (as per schedule). Please be patient and spare us a little time.

I was able to resolve the issue today after much investigation. Please pass this on to your developers.

The PDFs being scanned had encoded font types that were evidently being decoded as glyphs rather than text.

I am not sure whether you would be able to translate the glyphs without using OCR.

My solution, since I was generating the initial PDF from a PostScript file produced by a Windows printer driver, was to change the printer driver setting “PostScript Output Option” to “Optimise for Portability” rather than the default “Optimise for Speed”.

Changing this setting ensures TrueType fonts are used and Aspose can decode the text.
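
For anyone hitting the same problem, here is a rough sketch of a check that flags extracted text that looks like undecoded glyph codes. It is only a heuristic and rests on an assumption: that the garbage shows up as control characters, private-use code points or replacement characters. Output that consists of ordinary but wrong letters would not be caught. The function name `looks_like_garbage` and the 0.3 threshold are made up for illustration and are not part of Aspose.

```python
import unicodedata


def looks_like_garbage(text, threshold=0.3):
    """Return True when a large share of the non-space characters are
    control, private-use, unassigned or replacement characters.

    Hypothetical helper: the categories and threshold are assumptions,
    not something taken from the Aspose API.
    """
    chars = [c for c in text if not c.isspace()]
    if not chars:
        # Nothing to judge, so do not flag empty output as garbage.
        return False

    suspicious = 0
    for c in chars:
        category = unicodedata.category(c)
        # Cc = control, Co = private use, Cn = unassigned
        if category in ("Cc", "Co", "Cn") or c == "\ufffd":
            suspicious += 1

    return suspicious / len(chars) >= threshold
```

You could run this over the string returned by the text extraction step and, when it returns True, treat the file as a candidate for re-generating with the portability setting or for OCR.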

Cheers.

Hi Bryant,


Thanks for sharing the details.

We are glad to hear that you have managed to resolve this issue. I have also shared these details with the development team, and they will definitely consider this information while resolving the problem. As soon as we have further updates, we will let you know.