Search Non-English (Arabic) text in PDF using Aspose.PDF - Text does not match the search string

bvk · January 20, 2020, 10:04pm

Hi all,

I created a few Arabic documents and have not been able to find text matches in them with TextFragmentAbsorber. I have attached two documents that show this: an English version that works fine and an Arabic version that returns no matches.

Repro code:

using Aspose.Pdf;
using Aspose.Pdf.Text;

// Arabic: does not find matches
using (var document = new Document(@"LoremIpsum_Arabic.pdf"))
{
	// Last word of first sentence, before first comma, end of second sentence
	var textToFind = new[] { "الألم", "السنین", "التعلیمیة" };
	var matchCount = 0;
	foreach (var text in textToFind)
	{
		var textFragmentAbsorber = new TextFragmentAbsorber(text);
		document.Pages.Accept(textFragmentAbsorber);
		
		matchCount += textFragmentAbsorber.TextFragments.Count;
	}
	
	// Expected 3 matches, but 0 are returned. Each of these terms is findable when viewing the file in Acrobat.
	var finalMatchCount = matchCount;
}

English matching, works fine:

using (var document = new Document(@"LoremIpsum_English.pdf"))
{
	// Last word of first sentence, before first comma, end of second sentence
	var textToFind = new[] { "aliqua", "veniam", "ex" };
	var matchCount = 0;
	foreach (var text in textToFind)
	{
		var textFragmentAbsorber = new TextFragmentAbsorber(text);
		document.Pages.Accept(textFragmentAbsorber);
		
		matchCount += textFragmentAbsorber.TextFragments.Count;
	}
	
	// Returns 4 matches as expected (the English text contains one extra match for "ex")
	var finalMatchCount = matchCount;
}
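
One thing that may be worth ruling out on the caller's side (this is an assumption, not a confirmed cause of the behaviour above): Arabic-script strings can look identical on screen while being built from different code points, for example Arabic yeh (U+064A) versus Farsi yeh (U+06CC). If the search terms and the PDF text differ in that way, an exact match would fail. Below is a minimal sketch for dumping the code units of each search term. It uses only the .NET base class library, and the class and method names are hypothetical, chosen for illustration.

```csharp
using System;
using System.Linq;

public static class CodePointDump
{
    // Hypothetical helper: lists each UTF-16 code unit of the input as "U+XXXX",
    // separated by single spaces. For characters in the Basic Multilingual Plane
    // (which includes the Arabic block) a code unit equals the code point.
    public static string Describe(string text)
    {
        if (text == null)
        {
            throw new ArgumentNullException(nameof(text));
        }

        return string.Join(" ", text.Select(c => "U+" + ((int)c).ToString("X4")));
    }
}
```

Calling CodePointDump.Describe(text) for each entry of textToFind, and for the same word copied out of the PDF in a viewer, makes any difference between the two visible.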

Thanks!
LoremIpsum.zip (41.4 KB)

asad.ali · January 21, 2020, 10:54am

@bvk

We were able to reproduce the issue in our environment using Aspose.PDF for .NET 20.1 and have logged it as PDFNET-47574 in our issue tracking system. We will investigate it and keep you posted on the status of the fix. Please be patient and allow us some time.

We are sorry for the inconvenience.