Multiline TextFragmentAbsorber

james.sales · May 22, 2018, 1:25pm

Hi Aspose support!

The attached PDF contains two instances of the string “When you click Online Video, you can paste in the embed”, but the TextFragmentAbsorber finds only one. I’m guessing that’s because the first instance of the string in the PDF spans two lines. Is it possible to search across multiple lines with TextFragmentAbsorber?

Aspose.Pdf.Document pdf = new Document(@"C:\RedactionTemp\TestLetter.pdf");

TextFragmentAbsorber textFragmentAbsorber = new TextFragmentAbsorber(
@"(?i)When you click Online Video, you can paste in the embed",
new TextSearchOptions(true));

// Accept the absorber for all the pages
pdf.Pages.Accept(textFragmentAbsorber);

TestLetter.pdf (320.1 KB)

asad.ali · May 22, 2018, 7:15pm

@james.sales

Thanks for contacting support.

Please note that we use '\r\n' as the newline in the extracted text (though this may depend on the platform). We therefore recommend the expression '(?i)firstline\r\nsecondline\b' to find words separated by a newline marker, or the expression '(?i)firstline\s+secondline\b' to find the text whether it sits on a single line or spans two.

Please consider the following expression in order to search your string:

"(?i)When you click Online Video, you can\s+paste in the embed\b"

In case of any further assistance, please feel free to let us know.
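The suggested expression can be checked outside Aspose.Pdf with plain .NET regular expressions. The sketch below is only an illustration: the class name NewlineTolerantDemo and the sample strings in the usage note are made up for this example, and it assumes the extracted text uses '\r\n' at the line break, as described above.

```csharp
using System.Text.RegularExpressions;

public static class NewlineTolerantDemo
{
    // Pattern from the reply above: \s+ covers the one place where the
    // line is expected to break (a space, or "\r\n" in extracted text).
    public const string Pattern =
        @"(?i)When you click Online Video, you can\s+paste in the embed\b";

    // Returns true when the pattern is found anywhere in the given text.
    public static bool Matches(string extractedText)
    {
        return Regex.IsMatch(extractedText, Pattern);
    }
}
```

Calling NewlineTolerantDemo.Matches with the phrase on one line, or with "\r\n" between "can" and "paste", returns true. A break at any other position (for example between "click" and "Online") returns false, because the remaining spaces in the pattern are literal.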

james.sales · January 14, 2019, 2:46pm

@asad.ali

Sorry, it’s been a while since I looked at this. The issue we have is that when we search for text, we don’t know in advance whether the text is split across multiple lines, and if it is, we don’t know where in the string the split occurs.

I’ve noted that the standard search functionality using TextFragmentAbsorber doesn’t locate a string if it extends over multiple lines. Is there any other way of searching for text that doesn’t require the string to be prepared in advance with numerous different regular expression possibilities?

Thanks

asad.ali · January 14, 2019, 8:15pm

@james.sales

Thanks for getting back to us.

If you do not know where the split will occur in the string, you can put the whitespace expression '\s*' after each word in the string, which works for both single-line and multi-line text. Your final regular expression would look like the following:

var textFragmentAbsorber = new TextFragmentAbsorber(@"(?i)When\s*you\s*click\s*Online\s*Video,\s*you\s*can\s*paste\s*in\s*the\s*embed\b");
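Since the search phrase is not known in advance, an expression of this shape could be generated rather than written by hand. The following is a minimal sketch using only System.Text.RegularExpressions; the SearchPattern class and its Build method are hypothetical helpers, not part of Aspose.Pdf. It assumes the phrase ends in a word character, because the trailing \b (kept from the expression above) would otherwise not behave as intended.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class SearchPattern
{
    // Builds a case-insensitive pattern that tolerates optional whitespace
    // (including line breaks) between the words of the phrase.
    public static string Build(string phrase)
    {
        var words = phrase
            // A null separator array splits on any whitespace.
            .Split((char[])null, StringSplitOptions.RemoveEmptyEntries)
            // Escape each word so characters such as '.' or '(' are literal.
            .Select(w => Regex.Escape(w));

        return "(?i)" + string.Join(@"\s*", words) + @"\b";
    }
}
```

For the phrase discussed in this thread, SearchPattern.Build("When you click Online Video, you can paste in the embed") produces the same expression as the one shown above, and the result can be passed to the TextFragmentAbsorber constructor together with new TextSearchOptions(true).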

In case you still face any issue, please feel free to let us know.

james.sales · January 15, 2019, 9:33am

@asad.ali

Thanks, that works. I think I need to practice my regular expressions!

james.sales · January 15, 2019, 2:52pm

@asad.ali

Sorry, one final question: I am trying to locate and replace email addresses in a file using the TextFragmentAbsorber. I’ve validated the regex I am using, and it correctly identifies email addresses in a regex tester; however, when I pass it to the TextFragmentAbsorber I’m not getting any results. Example code below and document attached.

var searchText = @"^(?("")("".+?(?<!\)""@)|((0-9a-z)(?<=[0-9a-z])@))(?([)([(\d{1,3}.){3}\d{1,3}])|(([0-9a-z][-0-9a-z][0-9a-z]*.)+[a-z0-9][-a-z0-9]{0,22}[a-z0-9]))$"

Aspose.Pdf.Document pdfDoc = new Aspose.Pdf.Document(@"C:\emailaddress.pdf");
TextFragmentAbsorber textFragmentAbsorber = null;
textFragmentAbsorber =
new TextFragmentAbsorber(searchText, new TextSearchOptions(true));
pdfDoc.Pages.Accept(textFragmentAbsorber);

TextFragmentCollection textFragments = textFragmentAbsorber.TextFragments;

EmailAddress.pdf (204.9 KB)

asad.ali · January 15, 2019, 8:02pm

@james.sales

Thanks for your inquiry.

Please use the following regular expression to find email addresses in the document:

var textFragmentAbsorber = new TextFragmentAbsorber(@"[a-zA-Z0-9.!#$%&'*+/=?^_`{|}~-]+@[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?(?:\.[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?)*");

The above regular expression was tested in our environment with your PDF, as well as with a free online regex tester service.
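One visible difference between the two expressions is that the original starts with '^' and ends with '$', while the one above has no anchors. Whether that is what caused the empty result is an assumption, not something confirmed in this thread. The sketch below, in plain .NET with a deliberately simplified email pattern and a hypothetical AnchorDemo class, shows how an anchored expression behaves when the address sits inside a longer run of text.

```csharp
using System.Text.RegularExpressions;

public static class AnchorDemo
{
    // Deliberately simplified email pattern, for illustration only.
    private const string Core = @"[a-z0-9._-]+@[a-z0-9-]+(?:\.[a-z0-9-]+)+";

    // Anchored: the whole input must be an email address.
    public static bool MatchesAnchored(string text)
    {
        return Regex.IsMatch(text, "^" + Core + "$", RegexOptions.IgnoreCase);
    }

    // Unanchored: an email address anywhere in the input is enough.
    public static bool MatchesUnanchored(string text)
    {
        return Regex.IsMatch(text, Core, RegexOptions.IgnoreCase);
    }
}
```

A regex tester fed a bare address satisfies the anchored form, which may explain why the expression validated there. Text extracted from a page usually surrounds the address with other words, where only the unanchored form finds a match.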

james.sales · January 16, 2019, 9:27am

@asad.ali

Thanks, works perfectly!