Wrong Text Fragment text

rinnodata · November 10, 2014, 12:27am

Hi All,

I would like to ask a question regarding Aspose.Pdf for .NET. When I access a text fragment from the PDF, the wrong text is returned. For more details, please see the attached sample PDF.

Please refer to the code below.

Aspose.Pdf.Document pdfDocument = new Aspose.Pdf.Document(pdfFile);
TextFragmentAbsorber textFragmentAbsorber = new Aspose.Pdf.Text.TextFragmentAbsorber(@"((http:|www)([^ (,#)$]+))");
Aspose.Pdf.Text.TextOptions.TextSearchOptions textsearchoptions = new Aspose.Pdf.Text.TextOptions.TextSearchOptions(true);
textFragmentAbsorber.TextSearchOptions = textsearchoptions;
pdfDocument.Pages.Accept(textFragmentAbsorber);
foreach (TextFragment textFragment in textFragmentAbsorber.TextFragments)
{
    // ... some code ...
}

Any help is highly anticipated.

Thanks in advance.

Regards,

Gurjeet Singh

tilal.ahmad · November 10, 2014, 10:26pm

Hi Gurjeet,

Thanks for your inquiry. Please use the following regular expression to find URLs in text. Hopefully it will help you accomplish the task.

TextFragmentAbsorber textFragmentAbsorber = new TextFragmentAbsorber(@"(http|ftp|https):\/\/([\w\-_]+(?:(?:\.[\w\-_]+)+))([\w\-\.,@?^=%&:/~\+#]*[\w\-\@?^=%&/~\+#])?");
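
As a quick sanity check, the pattern can be tried against a plain string before it is passed to TextFragmentAbsorber. The sketch below uses only System.Text.RegularExpressions and no Aspose types; the class name UrlPatternCheck, the helper FindUrls, and the sample sentence are assumptions made for illustration, and text extracted from a real PDF may behave differently.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class UrlPatternCheck
{
    // The URL pattern from the reply above, with "&" written literally.
    public const string UrlPattern =
        @"(http|ftp|https):\/\/([\w\-_]+(?:(?:\.[\w\-_]+)+))([\w\-\.,@?^=%&:/~\+#]*[\w\-\@?^=%&/~\+#])?";

    // Hypothetical helper: returns every substring of 'text' that the pattern matches.
    public static List<string> FindUrls(string text)
    {
        var urls = new List<string>();
        foreach (Match match in Regex.Matches(text, UrlPattern))
        {
            urls.Add(match.Value);
        }
        return urls;
    }

    public static void Main()
    {
        // Made-up sample text, not taken from the attached PDF.
        string text = "See http://www.example.com/docs/page.aspx?id=7, or (https://example.org).";
        foreach (string url in FindUrls(text))
        {
            Console.WriteLine(url);
        }
    }
}
```

Because the last character class in the pattern excludes the comma and the closing parenthesis, trailing punctuation is left out of each match. The pattern only matches within one line of text, so it does not address URLs that wrap across lines.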

Please feel free to contact us for any further assistance.

Best Regards,

rinnodata · November 10, 2014, 10:46pm

Hi All,

Thanks for your response, but my purpose is to find a link whether it sits on a single line or spans multiple lines, and for that I need a regular expression like the one I am using.

I need to handle both cases, i.e., when the link is in the same text fragment segment or in different text fragment segments.

Any help is highly anticipated.

Thanks in advance.

Regards,

Gurjeet Singh

codewarior · November 11, 2014, 1:42am

Hi Gurjeet,

I am afraid Aspose.Pdf for .NET does not currently support extracting or searching for a TextFragment that spans multiple lines. However, we have already logged this requirement as PDFNEWNET-37696 in our issue tracking system. We will look further into the details of this requirement and keep you posted on its status.

We are sorry for this inconvenience.

rinnodata · November 11, 2014, 4:36am

Hi All,

In the attached file, I have highlighted a link that I fetch through the text fragment object with the following code:

Aspose.Pdf.Document pdfDocument = new Aspose.Pdf.Document(pdfFile);
TextFragmentAbsorber textFragmentAbsorber = new Aspose.Pdf.Text.TextFragmentAbsorber(@"((http:|www)([^ (,#)$]+))");
Aspose.Pdf.Text.TextOptions.TextSearchOptions textsearchoptions = new Aspose.Pdf.Text.TextOptions.TextSearchOptions(true);
textFragmentAbsorber.TextSearchOptions = textsearchoptions;
pdfDocument.Pages.Accept(textFragmentAbsorber);
foreach (TextFragment textFragment in textFragmentAbsorber.TextFragments) {
    // ... some code ...
}

When I read the text from the text fragment object, a character is missing from the link text.

Any help is highly anticipated.
Thanks in advance.
Regards,
Gurjeet Singh

tilal.ahmad · November 11, 2014, 10:37pm

Hi Gurjeet,

Thanks for your inquiry. I have tested the scenario using Aspose.Pdf for .NET 9.7.0 and was unable to reproduce the reported issue of a missing character in the link text “nfl3jzn”; a screenshot is attached for reference. If you are using an older version of Aspose.Pdf for .NET, please upgrade to the latest version. If the issue persists, please share your environment details so we can investigate further.

Best Regards,

codewarior · July 25, 2016, 12:28pm

Hi Gurjeet,

Thanks for your patience. Here are our observations. Please note that a regex pattern for a multi-line search must contain an expression such as (.|\n)*? because the extracted text contains \n characters.

Also, there is no way to determine where a multi-line URL ends and the ordinary text resumes when you do not know the full text of the URL. However, if the text of the URL is known, you will be able to form a regular expression to find it.

If ‘http://mypagetl.tenaris.ot/UserInformation/UIService.asmx?op=GetXmlUserProfile’ spans multiple lines, the regular expression will be

(?i)http:(.|\n)*?file

Please try the latest release, Aspose.Pdf for .NET 11.8.0, with the following code snippet to generate the correct output.

[C#]

//open document
Document pdfDocument = new Document("c:/pdftest/Prueba3.pdf");
//create TextAbsorber object to find all the phrases matching the regular expression
TextFragmentAbsorber textFragmentAbsorber = new TextFragmentAbsorber(@"(?i)http:(.|\n)*?file");
//set text search option to specify regular expression usage
TextSearchOptions textSearchOptions = new TextSearchOptions(true);
textFragmentAbsorber.TextSearchOptions = textSearchOptions;
//accept the absorber for all the pages
pdfDocument.Pages.Accept(textFragmentAbsorber);
//get the extracted text fragments
TextFragmentCollection textFragmentCollection = textFragmentAbsorber.TextFragments;
//loop through the fragments
foreach (TextFragment textFragment in textFragmentCollection)
{
    Console.WriteLine("Text : {0} ", textFragment.Text);
    Console.WriteLine("Font Size : {0} ", textFragment.TextState.FontSize);
}
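
The behavior of the multi-line pattern can also be checked without a PDF by running it against a plain string with System.Text.RegularExpressions. This is only a sketch under stated assumptions: the class name MultiLineUrlCheck, the helper FindJoinedUrl, and the sample text (with a line break placed inside the URL) are made up for illustration, and it does not show how Aspose.Pdf itself returns the fragment.

```csharp
using System;
using System.Text.RegularExpressions;

public static class MultiLineUrlCheck
{
    // Same pattern as in the snippet above: the lazy "(.|\n)*?" lets the match
    // cross line breaks and stops at the first "file" (case-insensitive).
    public const string Pattern = @"(?i)http:(.|\n)*?file";

    // Hypothetical helper: returns the first match with line breaks removed,
    // or null when the pattern does not match.
    public static string FindJoinedUrl(string extractedText)
    {
        Match match = Regex.Match(extractedText, Pattern);
        if (!match.Success)
        {
            return null;
        }
        return match.Value.Replace("\r", "").Replace("\n", "");
    }

    public static void Main()
    {
        // Made-up extracted text in which the URL wraps onto a second line.
        string text = "Service: http://mypagetl.tenaris.ot/UserInformation/\n"
                    + "UIService.asmx?op=GetXmlUserProfile is the endpoint.";
        Console.WriteLine(FindJoinedUrl(text));
    }
}
```

The match ends at the first occurrence of "file" after "http:", so the closing literal has to be chosen so that it does not appear earlier in the URL or in the text between the wrapped lines.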