Search in PDF for non-Latin characters

tkaufmann · June 16, 2015, 10:17am

Hi,

I want to use Aspose to search a PDF document that contains Arabic and Hebrew characters and find a given word.
At the moment it only finds Latin (English) words.

How can I achieve this?
Regards.

codewarior · June 17, 2015, 7:51am

Hi Tzach,

Thanks for contacting support.

I tested the scenario with one of my sample PDF documents containing English and Arabic text, using the following code snippet, and as per my observations the Arabic text is recognized properly. Can you please share your sample PDF files so that we can test the scenario in our environment? We are sorry for this inconvenience.

[C#]

// requires: using System; using Aspose.Pdf; using Aspose.Pdf.Text;

// open the document
Document pdfDocument = new Document("c:/pdftest/For+Arabic.pdf");

// create a TextFragmentAbsorber to find all instances of the input search phrase
TextFragmentAbsorber textFragmentAbsorber = new TextFragmentAbsorber("التوقيع");

// accept the absorber for all the pages
pdfDocument.Pages.Accept(textFragmentAbsorber);

// get the extracted text fragments
TextFragmentCollection textFragmentCollection = textFragmentAbsorber.TextFragments;

// loop through the fragments and print the details of each segment
foreach (TextFragment textFragment in textFragmentCollection)
{
    foreach (TextSegment textSegment in textFragment.Segments)
    {
        Console.WriteLine("Text : {0} ", textSegment.Text);
        Console.WriteLine("Position : {0} ", textSegment.Position);
        Console.WriteLine("Font - Name : {0}", textSegment.TextState.Font.FontName);
        Console.WriteLine("Font Size : {0} ", textSegment.TextState.FontSize);
        Console.WriteLine("Page Number : {0} ", textFragment.Page.Number);
    }
}

tkaufmann · June 18, 2015, 7:53am

Hi Nayyer,

Thank you for your reply.

I found that my issue happens when I search for words that might be contained within other words, i.e. when the search is not a whole-word match.

In this case we need to use a regex as part of the TextFragmentAbsorber settings, e.g. textFragmentAbsorber = new TextFragmentAbsorber(@"[\S](?i)" + textToSearch + @"[\S]");

Now my issue is supporting any language. It seems that I will need to switch to Unicode support.
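
For reference, here is a minimal sketch of how such a pattern could be built so that it works for any script. It uses only System.Text.RegularExpressions and no Aspose types; the class and method names (SearchPattern, ContainingWord) are hypothetical. Two things differ from the pattern above: the search term is passed through Regex.Escape so that characters like '.' or '(' stay literal, and \S* is used instead of [\S], because [\S] demands exactly one non-whitespace character on each side and so misses a term at the start or end of a word. .NET regexes operate on Unicode strings, so \S covers Arabic, Hebrew and CJK characters without any extra switch.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SearchPattern
{
    // Hypothetical helper: returns a pattern that matches the whole
    // whitespace-delimited word that contains `term`, ignoring case.
    public static string ContainingWord(string term)
    {
        // Regex.Escape keeps regex metacharacters in the term literal.
        return @"(?i)\S*" + Regex.Escape(term) + @"\S*";
    }

    public static void Main()
    {
        Console.OutputEncoding = System.Text.Encoding.UTF8;

        // Arabic: the term is a prefix of a longer word.
        string arabic = "xx التوقيعات yy";
        Console.WriteLine(Regex.Match(arabic, ContainingWord("التوقيع")).Value);

        // Latin: the term sits in the middle of a word, with different casing.
        string latin = "The Designer left";
        Console.WriteLine(Regex.Match(latin, ContainingWord("sign")).Value);
    }
}
```

The resulting string can then be handed to the absorber in place of the hand-built pattern. One assumption to verify against your Aspose.Pdf version: regex mode is normally switched on by assigning new TextSearchOptions(true) to the absorber's TextSearchOptions property.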

Do you have any example for that?

Thanks,

Tzach

codewarior · June 19, 2015, 5:53am

Hi Tzach,

Thanks for sharing the details.

Aspose.Pdf supports all languages, and there should be no issue with non-English text. In case you are facing any issue, please share the input PDF files so that we can test the scenario in our environment. We are sorry for the inconvenience.

tkaufmann · June 19, 2015, 5:57am

Hi,

Do you have a sample in which you use Unicode escape sequences (such as \u70B9)?
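
To make the question concrete, here is a small sketch of how \u70B9 behaves in plain C#, independent of Aspose. The class and method names (UnicodeEscapeDemo and its members) are hypothetical. In a regular string literal the compiler turns "\u70B9" into the single character 点, so it is the same search phrase as typing the character directly. In a verbatim string, @"\u70B9" stays as six characters, but the .NET regex engine understands that escape itself. The named blocks \p{IsArabic}, \p{IsHebrew} and \p{IsCJKUnifiedIdeographs} are standard .NET regex character classes.

```csharp
using System;
using System.Text.RegularExpressions;

public static class UnicodeEscapeDemo
{
    // The compiler resolves \u70B9 in a regular literal to one character.
    public static bool LiteralEscapeEqualsCharacter()
    {
        return "\u70B9" == "点" && "\u70B9".Length == 1;
    }

    // In a verbatim string the escape reaches the regex engine unchanged,
    // and the engine interprets it as the character U+70B9.
    public static bool RegexEscapeMatches(string input)
    {
        return Regex.IsMatch(input, @"\u70B9");
    }

    // Rough script detection using .NET named Unicode blocks.
    public static string DetectScript(string word)
    {
        if (Regex.IsMatch(word, @"\p{IsArabic}")) return "Arabic";
        if (Regex.IsMatch(word, @"\p{IsHebrew}")) return "Hebrew";
        if (Regex.IsMatch(word, @"\p{IsCJKUnifiedIdeographs}")) return "CJK";
        return "Other";
    }

    public static void Main()
    {
        Console.OutputEncoding = System.Text.Encoding.UTF8;
        Console.WriteLine(LiteralEscapeEqualsCharacter());
        Console.WriteLine(RegexEscapeMatches("起点"));
        Console.WriteLine(DetectScript("التوقيع"));
        Console.WriteLine(DetectScript("שלום"));
        Console.WriteLine(DetectScript("\u70B9"));
    }
}
```

So passing "\u70B9" or "点" to a search API gives it exactly the same string; the escape only matters for how the source file is written.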

Regards,

Tzach

codewarior · June 22, 2015, 7:13am

Hi Tzach,

Thanks for sharing the details.

I tested the scenario again: I first inserted the U+70B9 (点) character into a newly created PDF file and then searched for the same character in that document using the code snippet shared earlier. As per my observations, the text is identified properly. Can you please share your source/sample PDF files, which would help us replicate the problem in our environment? We are sorry for this inconvenience.

For your reference, I have attached my sample PDF file containing the Unicode character.

[C#]

// requires: using System; using Aspose.Pdf; using Aspose.Pdf.Text;

// open the document
Document pdfDocument = new Document("c:/pdftest/UniCode_Character.pdf");

// create a TextFragmentAbsorber to find all instances of the input search phrase
TextFragmentAbsorber textFragmentAbsorber = new TextFragmentAbsorber("点");

// accept the absorber for all the pages
pdfDocument.Pages.Accept(textFragmentAbsorber);

// get the extracted text fragments
TextFragmentCollection textFragmentCollection = textFragmentAbsorber.TextFragments;

// loop through the fragments and print the details of each segment
foreach (TextFragment textFragment in textFragmentCollection)
{
    foreach (TextSegment textSegment in textFragment.Segments)
    {
        Console.WriteLine("Text : {0} ", textSegment.Text);
        Console.WriteLine("Position : {0} ", textSegment.Position);
        Console.WriteLine("Font - Name : {0}", textSegment.TextState.Font.FontName);
        Console.WriteLine("Font Size : {0} ", textSegment.TextState.FontSize);
        Console.WriteLine("Page Number : {0} ", textFragment.Page.Number);
    }
}