Aspose can't extract text (probably compressed)

tomasgrosup1 · September 12, 2012, 9:25am

Hello,

I have a PDF created by Bullzip PDF Printer and it opens perfectly in PDF viewers.

However, when I try to extract its text using Aspose (6.6.0), some of the text comes out completely random.

Please find the PDF file attached.

Here is the text I get from page 3:

7KLVHOHFWURQLFKRPHWLPHUV\VWHPLVDQHOHFWULFDO GHYLFHDQGVKRXOGEHRSHUDWHGZLWKFDXWLRQ … and continues like this

Here is the code I use to extract the text:

var plainTextAbsorber = new TextAbsorber();

document.Pages[3].Accept(plainTextAbsorber);

var text = plainTextAbsorber.Text;

Is there anything I can do to change this behaviour?

Thank you.

Tomas

nausherwan.aslam · September 12, 2012, 12:48pm

Hi Tomas,

Thank you for sharing the template file.

I tested your template file with the latest version of Aspose.Pdf and it does not extract the data properly. Your issue has been registered in our issue tracking system with issue id: PDFNEWNET-34240. You will be notified via this forum thread regarding any updates against your issue.

Sorry for the inconvenience.

codewarior · November 20, 2012, 9:38am

Hi Tomas,

We have investigated this issue further and found that it is impossible to extract text from the document because its fonts do not contain a Unicode mapping. Adobe shows the same behavior.

The issue might be resolved by contacting the Bullzip developers and asking them to make the printer include a Unicode mapping for the fonts in its output documents.
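
For this particular sample, the text quoted from page 3 looks like the original text with every character code lowered by 29 ("7KLV" becomes "This" when 29 is added back). That pattern would be consistent with raw glyph indices being returned when no Unicode mapping is available, but that is an assumption based only on the sample in this thread. The sketch below is a heuristic for this one document, not an Aspose feature and not a general fix; the class name GlyphShiftDecoder and the offset of 29 are hypothetical and may not hold for other fonts or pages.

```csharp
using System;
using System.Text;

// Hypothetical helper, not part of Aspose.Pdf.
public static class GlyphShiftDecoder
{
    // Adds a fixed offset to every non-whitespace character.
    // Whitespace is left alone so line breaks in the extracted text survive.
    public static string Shift(string garbled, int offset)
    {
        var sb = new StringBuilder(garbled.Length);
        foreach (char c in garbled)
        {
            sb.Append(char.IsWhiteSpace(c) ? c : (char)(c + offset));
        }
        return sb.ToString();
    }
}
```

It could be applied to the extracted text as GlyphShiftDecoder.Shift(plainTextAbsorber.Text, 29). The spaces between words appear to be missing from the extracted sample, so even a correct shift would produce run-together words that need to be separated some other way.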

We are sorry for this inconvenience.