TextFragmentAbsorber is not working when PDF font is Arial

Alice_Kim · September 29, 2022, 8:54am

I am using TextFragmentAbsorber to search for keywords and add annotations.
When the PDF font is Courier, TextFragmentAbsorber works fine.
But when the PDF font is Arial, TextFragmentAbsorber can’t find some keywords, even though the page content is exactly the same.

I search for words that start with some spaces, and TextFragments.Count is zero when the PDF uses the Arial font.

The sample code follows.

Document doc = new Document(Path.Combine(path, fileName));

// Plain-text search (regular expressions disabled)
TextFragmentAbsorber absorber = new TextFragmentAbsorber("Exhibit 1", new TextSearchOptions(false));

for (int i = 1; i <= doc.Pages.Count; i++)
{
    doc.Pages[i].Accept(absorber);
}

if (absorber.TextFragments.Count > 0)
{
    AddLinkAnnotations(doc, "Exhibit 1", "Exhibit 1.jpg", absorber.TextFragments);
}
else
{
    System.Diagnostics.Debug.WriteLine("There are no searched results.");
}

The left screenshot is Arial, and the right screenshot is Courier.
Sample.png (40.8 KB)

asad.ali · September 29, 2022, 6:33pm

@Alice_Kim

Would you kindly share your sample PDF document with us so that we can test the scenario in our environment and address it accordingly?

Alice_Kim · October 4, 2022, 4:16am

The search keywords are “Plaintiff’s 2”, “Plaintiff’s 3”, “Plaintiff’s 6”, “Plaintiff’s 7”, “Plaintiff’s 8”, “Plaintiff’s 9”, “Plaintiff’s 11”, “Plaintiff’s 13”, “Plaintiff’s 14”.

TextFragmentAbsorber does not find “Plaintiff’s 2”, “Plaintiff’s 6”, “Plaintiff’s 8”, or “Plaintiff’s 14” when the PDF font is Arial.

This PDF “Sample-CT(Arial).pdf” does not work. (12.1 KB)

This PDF “Sample-CT(Courier).pdf” works fine. (11.9 KB)

asad.ali · October 4, 2022, 3:58pm

@Alice_Kim

We are checking it and will get back to you shortly.

Alice_Kim · November 10, 2022, 7:10am

Hello,
Are you checking my sample PDF?
Let me know how to solve this problem, please.

asad.ali · November 10, 2022, 7:42pm

@Alice_Kim

We used the code snippet below in our environment with Aspose.PDF for .NET 22.10 and did not notice any issues. The API was able to find the text in both PDFs. Can you please try the code below and let us know if you still face any issues?

Document doc = new Document(dataDir + "Sample-CT(Arial).pdf");

TextFragmentAbsorber absorber = new TextFragmentAbsorber("Plaintiff's 2");
doc.Pages.Accept(absorber);

if (absorber.TextFragments.Count > 0)
{
 Console.WriteLine("Text is found." + absorber.TextFragments.Count);
}
else
{
 Console.WriteLine("There are no searched results.");
}
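
The snippet above checks a single keyword in a single file. A minimal sketch that repeats the same check for every keyword listed earlier in the thread, against both sample PDFs, might look like the following. The `dataDir` value and the `KeywordCheck` class name are assumptions for illustration, and the straight apostrophe in the search string follows the snippet above.

```csharp
using System;
using System.IO;
using Aspose.Pdf;
using Aspose.Pdf.Text;

class KeywordCheck
{
    static void Main()
    {
        // Assumption: both sample PDFs sit in the current directory.
        string dataDir = ".";
        string[] files = { "Sample-CT(Arial).pdf", "Sample-CT(Courier).pdf" };
        int[] numbers = { 2, 3, 6, 7, 8, 9, 11, 13, 14 };

        foreach (string file in files)
        {
            using (var doc = new Document(Path.Combine(dataDir, file)))
            {
                foreach (int n in numbers)
                {
                    string keyword = "Plaintiff's " + n;

                    // Fresh absorber per keyword so counts do not mix.
                    var absorber = new TextFragmentAbsorber(keyword);
                    doc.Pages.Accept(absorber);

                    Console.WriteLine($"{file}: \"{keyword}\" -> {absorber.TextFragments.Count} match(es)");
                }
            }
        }
    }
}
```

Comparing the per-keyword counts between the two files should show whether only some keywords are affected in the Arial PDF.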

Alice_Kim · March 20, 2023, 1:33am

I upgraded Aspose.PDF to 23.3.0.0 for my sample code.
But this problem still exists.
I am using regular expressions for specific conditions.
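
For readers who also search with regular expressions, here is a minimal sketch of one way to build a more tolerant pattern from a keyword. It assumes, without confirmation from this thread, that the extracted text of the Arial PDF may contain a different number of spaces between words, or a different apostrophe character, than the keyword. `KeywordPattern` and `Build` are hypothetical names, not part of Aspose.PDF.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class KeywordPattern
{
    // Turns a keyword such as "Plaintiff’s 2" into a regex that accepts
    // any run of whitespace between words and either apostrophe form.
    public static string Build(string keyword)
    {
        // Split on any whitespace and drop empty entries.
        string[] tokens = keyword.Split((char[])null, StringSplitOptions.RemoveEmptyEntries);

        var parts = tokens.Select(t =>
            Regex.Escape(t)            // escape regex metacharacters in the token
                .Replace("’", "'")     // normalize a curly apostrophe to a straight one
                .Replace("'", "['’]")  // then accept either form
        );

        // Words may be separated by one or more whitespace characters.
        return string.Join(@"\s+", parts);
    }
}
```

The resulting string could then be passed as `new TextFragmentAbsorber(KeywordPattern.Build("Plaintiff’s 2"), new TextSearchOptions(true))`, where `true` enables regular-expression search. Whether this works around the Arial case is not confirmed here.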

I am attaching my sample code and sample PDF files again.
When the font family is Arial and the font size is small, TextFragmentAbsorber.TextFragments.Count is zero for “Plaintiff’s 2” and “Plaintiff’s 6”.
I have attached a screenshot of the comparison.
Compare Airal with Courier.png (48.9 KB)

I have attached my sample console program, sample PDF files, and image files.
I removed the Aspose.PDF package because of the upload file size limit, so you have to install the Aspose.PDF package from NuGet.

Could you explain why the results differ depending on the font family?

LinkAnnotationTest(RemovePackage).zip (719.5 KB)
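
One way to investigate the difference is to print the text as the library extracts it from each PDF, with spaces made visible, and compare the two outputs. The sketch below assumes the PDF path is given as the first command-line argument. The `DumpPageText` class name and the `·` marker are illustrative choices. It is only a diagnostic aid and makes no claim about the actual cause.

```csharp
using System;
using Aspose.Pdf;
using Aspose.Pdf.Text;

class DumpPageText
{
    static void Main(string[] args)
    {
        // Assumption: args[0] is the path to one of the sample PDFs.
        using (var doc = new Document(args[0]))
        {
            foreach (Page page in doc.Pages)
            {
                // Extract the page text as the library sees it.
                var textAbsorber = new TextAbsorber();
                page.Accept(textAbsorber);

                foreach (string line in textAbsorber.Text.Split('\n'))
                {
                    if (!line.Contains("Plaintiff")) continue;

                    // Replace each space with a visible dot so runs of spaces can be counted.
                    string visible = line.TrimEnd('\r').Replace(' ', '·');
                    Console.WriteLine($"page {page.Number}: [{visible}]");
                }
            }
        }
    }
}
```

Running it once per sample file and comparing the lines for “Plaintiff’s 2” and “Plaintiff’s 6” should show whether the extracted text itself differs between the Arial and Courier versions.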

asad.ali · March 20, 2023, 9:22am

@Alice_Kim

We have opened the following new ticket(s) in our internal issue tracking system and will deliver their fixes according to the terms mentioned in Free Support Policies.

Issue ID(s): PDFNET-53966

You can obtain Paid Support Services if you need support on a priority basis, along with the direct access to our Paid Support management team.