Turkish text was not found in PDF with TextAbsorber

PatrickITech · December 3, 2023, 9:18pm

When I used the TextAbsorber to search for the text “Şikayet Edilen : Adı Soyadı – T.C. Kimlik numarası – Adres ve telefon bilgileri” without specifying a rectangle, it could not find it in my PDF.

Why does it not find it? Do I need to change the encoding somewhere, or something similar?

I cannot attach the PDF because it contains sensitive data that I cannot make public. If you give me an email address, I am happy to share the PDF privately.

asad.ali · December 3, 2023, 11:06pm

@PatrickITech

Please note that files shared in this forum thread are secured. They can only be accessed by Aspose staff and the post/topic creator, so you can safely share your file with us here in your reply. If you still want to share the file privately, you can click on the username and choose the Message option.

PatrickITech · December 3, 2023, 11:18pm

Many thanks for the explanation. I have sent you the PDF and the screenshot via Private Message.

asad.ali · December 4, 2023, 10:52am

@PatrickITech

We have received the sample file and are testing the case. We will share our feedback with you shortly.

asad.ali · December 4, 2023, 9:34pm

@PatrickITech

We have tested the scenario in our environment and did not notice any issue. The code snippet below was able to extract the complete text from the PDF you shared:

Document pdfDocument = new Document(dataDir + "PDF_TEST_1 - Copy.pdf");
TextAbsorber ta = new TextAbsorber();
pdfDocument.Pages.Accept(ta);
Console.WriteLine(ta.Text);

Would you please make sure to use the API with a valid license or a 30-day free temporary license? If the issue still persists, please let us know.

PatrickITech · December 4, 2023, 9:42pm

Thanks for your answer. I used different code; sorry for not mentioning that in the beginning.

When I search for the text “Şikayet Edilen : Adı Soyadı – T.C. Kimlik numarası – Adres ve telefon bilgileri” with the TextFragmentAbsorber, I get no results back. That is, in the code below, TextFragmentAbsorberAddress.TextFragments is empty.

// pdfViewer and newText come from the surrounding application code (not shown here)
var fileName = pdfViewer.DocumentSource.ToString();
Aspose.Pdf.Document pdf = new Aspose.Pdf.Document(fileName);

var textSelection = "Şikayet Edilen : Adı Soyadı – T.C. Kimlik numarası – Adres ve telefon bilgileri";
Aspose.Pdf.Text.TextFragmentAbsorber TextFragmentAbsorberAddress = new Aspose.Pdf.Text.TextFragmentAbsorber(textSelection);
TextFragmentAbsorberAddress.TextSearchOptions.LimitToPageBounds = true;
pdf.Pages.Accept(TextFragmentAbsorberAddress);

// Replace the text of every matching fragment
foreach (Aspose.Pdf.Text.TextFragment tf in TextFragmentAbsorberAddress.TextFragments)
{
    tf.Text = newText;
}

pdf.Save(fileName);

asad.ali · December 4, 2023, 11:04pm

@PatrickITech

Please allow us to test from this perspective as well and we will get back to you shortly.

asad.ali · December 13, 2023, 8:04pm

@PatrickITech

Text can only be found in a PDF in exactly the form in which it was added, including whitespace. For example, please check the code below:

Document pdfDocument = new Document(dataDir + "PDF_TEST_1 - Copy.pdf");
TextFragmentAbsorber ta = new TextFragmentAbsorber(@"Şikayet Edilen  : Adı Soyadı – T.C. Kimlik numarası – Adres ve telefon bilgileri");
pdfDocument.Pages.Accept(ta);
Console.WriteLine(ta.TextFragments.Count);

The above code is able to find the text. You will notice additional spaces between some of the words in the line (for example, two spaces before the colon after “Edilen”). This is how the text is actually stored in the PDF. To see how the text is stored, you can use the TextAbsorber class as shown below:

Document pdfDocument = new Document(dataDir + "PDF_TEST_1 - Copy.pdf");
TextAbsorber textAbsorber = new TextAbsorber();
pdfDocument.Pages.Accept(textAbsorber);
var t = textAbsorber.Text; // It gives the text with original formatting

Furthermore, please note that Adobe Reader is also unable to find the text when it is typed with the spacing you are using for the search through the API. Please use the code snippet and search term suggested above, and let us know if you still face any issues.
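
If the exact spacing in a PDF is not known in advance, one possible workaround is to search with a whitespace-tolerant pattern instead of a literal string. The sketch below is only an illustration: ToWhitespaceTolerantPattern is a hypothetical helper (not part of Aspose.PDF), and the sample strings are shortened versions of the search term from this thread. It uses only System.Text.RegularExpressions to show the idea. Passing such a pattern to TextFragmentAbsorber assumes regular-expression search is enabled (for example via TextSearchOptions(true)), which has not been verified against the PDF discussed here.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class SearchPatternSketch
{
    // Hypothetical helper (not part of Aspose.PDF): splits the phrase on whitespace,
    // escapes each word, and joins the words with \s+ so that any run of
    // whitespace between two words is accepted.
    public static string ToWhitespaceTolerantPattern(string phrase)
    {
        var words = phrase.Split((char[])null, StringSplitOptions.RemoveEmptyEntries);
        return string.Join(@"\s+", words.Select(Regex.Escape));
    }

    public static void Main()
    {
        var typed = "Şikayet Edilen : Adı Soyadı";    // single spaces, as typed by the user
        var stored = "Şikayet Edilen  : Adı Soyadı";  // two spaces before the colon, as in the PDF

        var pattern = ToWhitespaceTolerantPattern(typed);

        Console.WriteLine(stored.Contains(typed));         // False: the literal search fails
        Console.WriteLine(Regex.IsMatch(stored, pattern)); // True: the tolerant pattern matches
    }
}
```

With Aspose.PDF, the resulting pattern would then be used as the search phrase in place of the literal text, under the assumption stated above.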

PatrickITech · January 2, 2024, 10:27pm

Thanks, this resolved my problem!