TextFragmentAbsorber does not work on certain documents created via selectPDF

sireesha.charyulu · April 14, 2016, 6:10pm

Hi,

Env: .NET 4.0; tried with Aspose.Pdf 11.5.0 and Aspose.Pdf 10.3.0.

I cannot get TextFragmentAbsorber to extract text from the attached document.

String MYTEXT = "{{t:s;r:y;o:\"Role\";}}";

Aspose.Pdf.Document pdfDocument = new Aspose.Pdf.Document(filename);
Aspose.Pdf.Text.TextFragmentAbsorber textFragmentAbsorber = new Aspose.Pdf.Text.TextFragmentAbsorber(MYTEXT);
Aspose.Pdf.Text.TextOptions.TextSearchOptions textSearchOptions =
new Aspose.Pdf.Text.TextOptions.TextSearchOptions(true);
textFragmentAbsorber.TextSearchOptions = textSearchOptions;
pdfDocument.Pages.Accept(textFragmentAbsorber);
Aspose.Pdf.Text.TextFragmentCollection textFragmentCollection = textFragmentAbsorber.TextFragments;

foreach (Aspose.Pdf.Text.TextFragment textFragment in textFragmentCollection)
{
    // Execution never enters this loop
}

Is this a bug? When I copy the same text into a Word document and save it as PDF via Print on Mac, it works fine.

Thank you,
Sireesha

tilal.ahmad · April 17, 2016, 11:11pm

Hi Sireesha,

Thanks for your inquiry. I have tested the scenario with both Aspose.Pdf for .NET 11.5.0 and 10.3.0 and was unable to reproduce the reported issue. Please share some more details or a sample console application that reproduces the issue, so that we can look into it and guide you accordingly.

We are sorry for the inconvenience caused.

Best Regards,

sireesha.charyulu · April 19, 2016, 12:43pm

Hi,

Attached are two documents. 1. ‘working.pdf’ and 2. ‘not_working.pdf’

Here is the code I am using for extracting text:

String MYTEXT = "{{t:s;r:y;o:\"Role\";}}";
Aspose.Pdf.Document pdfDocument = new Aspose.Pdf.Document(filename);
Aspose.Pdf.Text.TextFragmentAbsorber textFragmentAbsorber = new Aspose.Pdf.Text.TextFragmentAbsorber(MYTEXT);
Aspose.Pdf.Text.TextOptions.TextSearchOptions textSearchOptions =
new Aspose.Pdf.Text.TextOptions.TextSearchOptions(true);
textFragmentAbsorber.TextSearchOptions = textSearchOptions;
pdfDocument.Pages.Accept(textFragmentAbsorber);
Aspose.Pdf.Text.TextFragmentCollection textFragmentCollection = textFragmentAbsorber.TextFragments;
Aspose.Pdf.PageCollection pageCollection = pdfDocument.Pages;
foreach (Aspose.Pdf.Text.TextFragment textFragment in textFragmentCollection)
{

}

My environment is .NET 4.0. I tried both Aspose.Pdf 10.3.0 and Aspose.Pdf 11.5.0 and get the same result: the code extracts the text from ‘working.pdf’ but not from ‘not_working.pdf’.

What is different about ‘not_working.pdf’ that prevents the same code from absorbing the text fragment?

FYI, I created ‘not_working.pdf’ using the http://selectpdf.com/ website, which converts HTML code to PDF.

Thank you,
Sireesha

codewarior · April 20, 2016, 1:14pm

Hi Sireesha,

Thanks for using our APIs.

I have tested the scenario and managed to reproduce the same problem. I have logged it as PDFNEWNET-40630 in our issue tracking system so that it can be corrected. We will look further into the details of this problem and keep you posted on the status of the fix. Please be patient and allow us a little time. We are sorry for the inconvenience.
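
While the issue is open, one way to narrow down how the two files differ is to dump all the text the library extracts from each page and check whether the search string appears in it. The following is only a minimal diagnostic sketch, not a fix: it assumes the PDF path is passed as the first command-line argument, and the class name DumpPdfText and the variable needle are made up for illustration. It uses TextAbsorber (which extracts all page text) instead of TextFragmentAbsorber.

```csharp
using System;

class DumpPdfText
{
    static void Main(string[] args)
    {
        // Assumption: the path to the PDF is passed as the first argument.
        string filename = args[0];

        // The same search string as in the posts above (hypothetical variable name).
        const string needle = "{{t:s;r:y;o:\"Role\";}}";

        Aspose.Pdf.Document pdfDocument = new Aspose.Pdf.Document(filename);

        foreach (Aspose.Pdf.Page page in pdfDocument.Pages)
        {
            // Use a fresh absorber per page so that Text holds only this page's text.
            Aspose.Pdf.Text.TextAbsorber textAbsorber = new Aspose.Pdf.Text.TextAbsorber();
            page.Accept(textAbsorber);

            string pageText = textAbsorber.Text;
            Console.WriteLine("Page {0}: {1} characters, contains search string: {2}",
                page.Number, pageText.Length, pageText.Contains(needle));
            Console.WriteLine(pageText);
        }
    }
}
```

Running this against both ‘working.pdf’ and ‘not_working.pdf’ and comparing the output may show whether the text in the second file is extracted at all, or is extracted in a form (for example, split up or with different characters) that no longer matches the search string.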