TextFragmentAbsorber select exact phrase

Hi, I am trying to select an exact phrase in a PDF generated by my own procedure. I also tried a simpler PDF downloaded from the internet, and there it seems to work. I'm going mad: why does it work on some PDFs and not on others?

Best regards

@francescoesposito,

Kindly send us the complete details of the use case, including problematic PDF documents and code. We will investigate and share our findings with you.

Here are the code and the PDF. Thank you for your support!

    Try
        ' Load the Aspose.PDF license
        Dim licence As Aspose.Pdf.License = New Aspose.Pdf.License
        licence.SetLicense(Application.StartupPath + "\lib\Aspose.Pdf.lic")

        ' Open the document
        Dim document As New Document(strPathPdf)

        ' Search for the phrase
        Dim textFragmentAbsorber As New TextFragmentAbsorber("Garrett Nevels")

        ' Set the text search option so the phrase is treated as a regular expression
        Dim textSearchOptions As New TextSearchOptions(True)
        textFragmentAbsorber.TextSearchOptions = textSearchOptions

        document.Pages.Accept(textFragmentAbsorber)

        Dim textFragmentCollection1 As TextFragmentCollection = textFragmentAbsorber.TextFragments
        If textFragmentCollection1.Count > 0 Then
            Dim objxml As New ClsXml
            objxml.CreateFileXml(System.IO.Path.GetFileNameWithoutExtension(strPathPdf))
            For Each textFragment As TextFragment In textFragmentCollection1
                ' Write the XML file used to highlight the article
                Dim freeText As New Aspose.Pdf.Annotations.HighlightAnnotation(textFragment.Page, New Aspose.Pdf.Rectangle(textFragment.Position.XIndent, textFragment.Position.YIndent, textFragment.Position.XIndent + textFragment.Rectangle.Width, textFragment.Position.YIndent + textFragment.Rectangle.Height))
                objxml.createNode(textFragment.Page.Number.ToString, textFragment.Position.XIndent.ToString, textFragment.Position.YIndent.ToString, (textFragment.Position.XIndent + textFragment.Rectangle.Width).ToString, (textFragment.Position.YIndent + textFragment.Rectangle.Height).ToString)

                ' Semi-transparent light blue highlight
                freeText.Opacity = 0.5
                freeText.Color = Aspose.Pdf.Color.FromRgb(0.6, 0.8, 0.98)

                textFragment.Page.Annotations.Add(freeText)
            Next
            objxml.closeFileXml()
        End If

        document.Save(strPathPdf)
    Catch ex As Exception
        'MyLog.Error(ex.Message)
        ' InnerException can be Nothing, so guard it before reading its message
        Dim innerMessage As String = If(ex.InnerException IsNot Nothing, ex.InnerException.Message, String.Empty)
        Dim sendmail As New ClsMail
        sendmail.fsettabody = " Conversione Web2PDF Timer1_Tick() -  errore" + ex.Message + "<br>" + "fWriteHighLight() -  errore" + innerMessage
        sendmail.SendEmail()
    End Try

wwwemiliaromagnanews24itgrissinbongarrettnevelsduemesi59591htmlGVMWEB.pdf (151.8 KB)

@francescoesposito

Thanks for sharing sample code snippet and PDF document.

We tested the scenario in our environment and noticed that the API did not find the desired text in the shared PDF document. We have therefore logged the issue as PDFNET-43652 in our issue tracking system. We will investigate the cause further and keep you informed about the status of the fix. Please be patient and give us a little time.

We are sorry for the inconvenience.

Hi Asad,

Is there any news regarding this problem?

Best regards

@francescoes

Thanks for your patience.

We have investigated the issue. The expression was not found because the document text does not contain a regular space (U+0020) between “Garrett” and “Nevels”; it contains a no-break space (U+00A0) instead. As a result, the regular expression “Garrett Nevels” matched nothing. Please use the expression “Garrett\sNevels”, in which \s matches any white-space character between the words.
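The difference between the two space characters can be reproduced without any PDF at all. The following is a minimal sketch using only the standard .NET regular expression engine; the class `NbspDemo` and its helpers `Matches` and `DumpCodePoints` are hypothetical names introduced for illustration, and the sample string simply stands in for the text as it is stored in the document.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class NbspDemo
{
    // Hypothetical helper: true when the regex pattern matches anywhere in the text.
    public static bool Matches(string text, string pattern)
    {
        return Regex.IsMatch(text, pattern);
    }

    // Hypothetical helper: lists the code point of every character,
    // so that visually identical characters can be told apart.
    public static string DumpCodePoints(string text)
    {
        var parts = new List<string>();
        foreach (char c in text)
        {
            parts.Add("U+" + ((int)c).ToString("X4"));
        }
        return string.Join(" ", parts);
    }

    public static void Main()
    {
        // Stand-in for the document text: a no-break space (U+00A0) between the words.
        string extracted = "Garrett\u00A0Nevels";

        // A literal space (U+0020) in the pattern does not match U+00A0.
        Console.WriteLine(Matches(extracted, "Garrett Nevels"));          // False

        // \s matches any white-space character, including U+00A0.
        Console.WriteLine(Matches(extracted, @"Garrett\sNevels"));        // True

        // A stricter alternative: accept only a regular or a no-break space.
        Console.WriteLine(Matches(extracted, @"Garrett[ \u00A0]Nevels")); // True

        // Show the characters around the word boundary: 't', NBSP, 'N'.
        Console.WriteLine(DumpCodePoints(extracted.Substring(6, 3)));     // U+0074 U+00A0 U+004E
    }
}
```

Dumping the code points of the extracted text in this way is a quick check for other PDFs where a phrase that is visible on the page is unexpectedly not found.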

We used the following code for testing, and the resulting document looked fine:

// Requires: using Aspose.Pdf; using Aspose.Pdf.Text;
Document document = new Document(myDir + "TestInputPDF.pdf");

// \s matches any white-space character, including the no-break space (U+00A0)
TextFragmentAbsorber textFragmentAbsorber = new TextFragmentAbsorber(@"Garrett\sNevels");

// true = treat the search phrase as a regular expression
TextSearchOptions textSearchOptions = new TextSearchOptions(true);
textFragmentAbsorber.TextSearchOptions = textSearchOptions;
document.Pages.Accept(textFragmentAbsorber);
TextFragmentCollection textFragmentCollection1 = textFragmentAbsorber.TextFragments;

if (textFragmentCollection1.Count > 0)
{
    //ClsXml objxml = new ClsXml();
    //objxml.CreateFileXml(System.IO.Path.GetFileNameWithoutExtension(strPathPdf));
    foreach (TextFragment textFragment in textFragmentCollection1)
    {
        // Highlight the rectangle occupied by the matched fragment
        Aspose.Pdf.Annotations.HighlightAnnotation freeText = new Aspose.Pdf.Annotations.HighlightAnnotation(textFragment.Page, new Aspose.Pdf.Rectangle(textFragment.Position.XIndent, textFragment.Position.YIndent, textFragment.Position.XIndent + textFragment.Rectangle.Width, textFragment.Position.YIndent + textFragment.Rectangle.Height));
        //objxml.createNode(textFragment.Page.Number.ToString, textFragment.Position.XIndent.ToString, textFragment.Position.YIndent.ToString, (textFragment.Position.XIndent + textFragment.Rectangle.Width).ToString, (textFragment.Position.YIndent + textFragment.Rectangle.Height).ToString);
        freeText.Opacity = 0.5;
        freeText.Color = Aspose.Pdf.Color.FromRgb(0.6, 0.8, 0.98);
        textFragment.Page.Annotations.Add(freeText);
    }
    //objxml.closeFileXml();
}
document.Save(myDir + "43652_out.pdf");

43652_out.pdf (153.6 KB)

In case of any further assistance, please feel free to let us know.