Hi all,
I have created a few Arabic documents and have not been able to find text matches in them via TextFragmentAbsorber. I have attached a couple of documents where this is the case: an English version that works fine and an Arabic version that returns no matches.
Repro code:
using Aspose.Pdf;
using Aspose.Pdf.Text;

// Arabic: does not find matches
using (var document = new Document(@"LoremIpsum_Arabic.pdf"))
{
    // Search terms: last word of first sentence, word before first comma, end of second sentence
    var textToFind = new[] { "الألم", "السنین", "التعلیمیة" };
    var matchCount = 0;
    foreach (var text in textToFind)
    {
        // A fresh absorber per term, run over every page of the document
        var textFragmentAbsorber = new TextFragmentAbsorber(text);
        document.Pages.Accept(textFragmentAbsorber);
        matchCount += textFragmentAbsorber.TextFragments.Count;
    }
    // Should have 3 matches, but returns 0. Each of these terms is findable when viewing the file in Acrobat.
    var finalMatchCount = matchCount;
}
English matching, works fine:
using (var document = new Document(@"LoremIpsum_English.pdf"))
{
    // Search terms: last word of first sentence, word before first comma, end of second sentence
    var textToFind = new[] { "aliqua", "veniam", "ex" };
    var matchCount = 0;
    foreach (var text in textToFind)
    {
        var textFragmentAbsorber = new TextFragmentAbsorber(text);
        document.Pages.Accept(textFragmentAbsorber);
        matchCount += textFragmentAbsorber.TextFragments.Count;
    }
    // Returns 4 matches as expected (one extra "ex" match is found in the English text)
    var finalMatchCount = matchCount;
}
Thanks!
LoremIpsum.zip (41.4 KB)