TextFragmentAbsorber using Regular Expression not spanning multiple pages

I’m trying to extract text from a PDF based on a regular expression. It works for the most part, but I have encountered some strange behavior.



If the text that I’m looking for spans multiple pages, does TextFragmentAbsorber treat it as continuous text? It looks like it stops at the end of page 1 even though I specified all pages. In fact, it picked up text from the bottom of the first page that matches my regular expression and then a paragraph from the top of the first page, all in a single TextFragment.



Below is the relevant section of the code for your reference. I’ve also attached the complete CS code and the PDF file used to test this. I was expecting text from page 3 to be captured as well, since I would like to incorporate the “Recipe Tip” into my regular expression.



//DIRECTIONS

//Create TextFragmentAbsorber object to extract text (verbatim string so the regex backslashes compile)
TextFragmentAbsorber textFragmentAbsorberDirections = new TextFragmentAbsorber(@"Directions(\r\n|\r|\n)[0-9a-zA-Z(\r\n|\r|\n)°@#$%&+\-_(),+’:;?.,!\[\]\s\/ è]");

//Set text search option to specify regular expression usage
TextSearchOptions textSearchOptionsDirections = new TextSearchOptions(true);
textFragmentAbsorberDirections.TextSearchOptions = textSearchOptionsDirections;

//Accept the absorber for all the pages
pdfDocument.Pages.Accept(textFragmentAbsorberDirections);

//Get the extracted text from the first fragment (the TextFragments collection is 1-based)
Console.WriteLine("{0}", textFragmentAbsorberDirections.TextFragments[1].Text);

Hi there,

Thanks for your inquiry. I have tested your scenario with the shared document using Aspose.Pdf for .NET 10.6.0 and was able to reproduce the reported issue. For further investigation, I have logged it in our issue tracking system as PDFNEWNET-39052 and linked your request to it. We will keep you updated via this thread regarding the issue status.

Please feel free to contact us for any further assistance.

Best Regards,

Hi,


Thanks for your patience.

The earlier reported issue is still pending review, as the team has been busy resolving other previously reported issues. However, as soon as we have further updates, we will let you know.

Approximately when will you get to this? We need this functionality for a project we are working on right now. Is there any way to escalate this using our support agreement?

Hi There,


Thanks for your inquiry. I am afraid we cannot share an ETA at the moment, as your issue is pending in the queue with other issues awaiting investigation. As soon as our development team completes the issue analysis, we will share our findings and an ETA with you. In the meantime, we have recorded your concern and raised the issue priority within normal support. We will keep you updated on the resolution progress.

We are sorry for this delay and inconvenience.

Best Regards,

Hi there,


Thanks for your patience. We have investigated the issue and found that it is not a TextFragmentAbsorber issue. Please find the complete details in your other related thread.

Best Regards,

The issues you have found earlier (filed as ) have been fixed in this update. This message was posted using BugNotificationTool from Downloads module by MuzammilKhan