Unable to find text on document

bvk · April 30, 2020, 12:40am

Hi all,

In the attached document, I have not been able to find any text on the second page, even though that text is searchable in PDF viewers.

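using System.Linq;
using Aspose.Pdf.Text;

var filePath = @"FormerHeaderFooter.pdf";
using (var pdf = new Aspose.Pdf.Document(filePath))
{
    // Match every run of non-whitespace characters (regex mode enabled).
    var absorber = new TextFragmentAbsorber(@"\S+", new TextSearchOptions(true));
    absorber.Visit(pdf);

    // Fragments containing "Center" anywhere in the document.
    var textFragmentList = absorber.TextFragments.Where(x => x.Text.Contains("Center")).ToList();
    // Same filter, restricted to page 2.
    var textFragmentsOnPage2 = absorber.TextFragments
        .Where(x => x.Text.Contains("Center") && x.Page.Number == 2);
}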

For reference, this document was produced by first creating headers/footers and then removing those objects from the document. The text was retained but became unsearchable. Interestingly, text that was formerly in left headers/footers is still found, but text from former center or right headers/footers is not.

FormerHeaderFooter.zip (563.5 KB)

Thanks!

asad.ali · April 30, 2020, 7:39pm

@bvk

We tested the scenario in our environment with Aspose.PDF for .NET 20.4 and could not reproduce the issue. The API was able to extract text from the 2nd page of the document. Please make sure that you are using the latest version of the API, and let us know if the issue still persists.

bvk · May 1, 2020, 2:07pm

@asad.ali,

I am on Aspose.PDF for .NET 20.4. What code did you use to find the text? When I used the code above, or even just a basic TextFragmentAbsorber without any text search options, it was unable to find any text fragments on the 2nd page; the textFragmentsOnPage2 variable was empty.

asad.ali · May 2, 2020, 5:43pm

@bvk

We tested the scenario by searching for text at the page level, using the following modified code snippet:

using (var pdf = new Aspose.Pdf.Document(dataDir + "FormerHeaderFooter.pdf"))
{
  var absorber = new TextFragmentAbsorber(@"\S+", new TextSearchOptions(true));
  pdf.Pages[2].Accept(absorber);
  var textFragmentList = absorber.TextFragments.Where(x => x.Text.Contains("Center")).ToList();
  var textFragmentsOnPage2 = absorber.TextFragments.Where(x => x.Text.Contains("Center") && x.Page.Number == 2);
}

We also noticed that the API does not extract this text when searching the whole PDF. An investigation ticket, PDFNET-48126, has been logged in our issue tracking system so that this can be corrected. We will inform you as soon as it is resolved. Please allow us some time.

bvk · May 4, 2020, 4:44pm

@asad.ali

Thank you for the update. We will implement the workaround of only using the TextFragmentAbsorber one page at a time for now.
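
For anyone else who hits this, here is a rough sketch of what that page-by-page workaround could look like. It only uses the calls already shown in this thread (TextFragmentAbsorber, TextSearchOptions, Page.Accept). The loop over pdf.Pages, the matches list, and the console output are my own additions for illustration, and the top-level statements assume C# 9 or later. I have not verified this beyond the attached document.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using Aspose.Pdf;
using Aspose.Pdf.Text;

var filePath = @"FormerHeaderFooter.pdf";

using (var pdf = new Aspose.Pdf.Document(filePath))
{
    // Illustrative accumulator for results gathered across all pages.
    var matches = new List<TextFragment>();

    // Visit each page on its own instead of calling absorber.Visit(pdf).
    foreach (Page page in pdf.Pages)
    {
        // A fresh absorber per page, so each pass only holds that page's fragments.
        var absorber = new TextFragmentAbsorber(@"\S+", new TextSearchOptions(true));
        page.Accept(absorber);

        matches.AddRange(absorber.TextFragments.Where(x => x.Text.Contains("Center")));
    }

    // Work with the fragments while the document is still open.
    var textFragmentsOnPage2 = matches.Where(x => x.Page.Number == 2).ToList();
    Console.WriteLine($"Total matches: {matches.Count}, on page 2: {textFragmentsOnPage2.Count}");
}
```

Creating a new absorber for every page is slower than a single document-wide pass, but it keeps each search at the page level, which is the case that was reported to work above.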

asad.ali · May 5, 2020, 12:13am

@bvk

Sure, and we will also let you know as soon as the logged ticket is resolved.