Extract content between two text identifiers

sonorc · February 13, 2013, 7:29am

Hello,
I need advice on how to extract parts of a document between two text identifiers, keeping all the formatting intact.
For example, I have a document with a tagged part, formed like this:

Irrelevant contentFormatted content that I need to extract.
It can span multiple paragraphs, and is only limited by the identifiers.Irrelevant content

The irrelevant content, the tag, and the needed content can all be contained in one run. The tags themselves aren’t needed. There can be multiple tagged parts in a single document, and I need to extract them all.

I can find a tagged part by using a regex, like this: (?<=)(.*?)(?=). Then in the ReplaceAction I can get all the needed runs, but I need the paragraphs, to keep the formatting.
And if I get the paragraphs that contain the runs, I don’t know how to remove the tags and the irrelevant content from them.
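
For reference, the regex side of this can be done without lookaround assertions by escaping the delimiters and capturing the text between them. The following is a minimal Python sketch over plain text, not Aspose.Words code. The delimiter strings `<part>` and `</part>` are hypothetical placeholders, since the actual tag names are not shown in this thread. It recovers only the text, so it does not address keeping the formatting.

```python
import re

def extract_tagged(text, start, end):
    """Return every substring found between start and end (delimiters excluded)."""
    # re.escape makes characters such as '<' and '>' literal.
    # DOTALL lets '.' cross line breaks; '.*?' keeps each match as short as possible.
    pattern = re.compile(re.escape(start) + r"(.*?)" + re.escape(end), re.DOTALL)
    return pattern.findall(text)

# Hypothetical sample with two tagged parts, the first spanning two lines.
sample = "junk<part>First\nblock</part>more junk<part>Second</part>tail"
print(extract_tagged(sample, "<part>", "</part>"))
```

Because the delimiters are consumed by the match rather than asserted around it, the captured group already excludes them, and every tagged part in the text is returned in order.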

I also tried extracting the paragraphs as described at https://docs.aspose.com/words/net/how-to-extract-selected-content-between-nodes-in-a-document/, but I don’t know how to find the start and the end nodes with the tags in the first place, and, once again, how to remove the tags and the irrelevant content from them when I get the needed nodes.

awais.hafeez · February 13, 2013, 9:37am

Hi,

Thanks for your inquiry.

To extract all document elements (e.g. Shapes, Tables, Paragraphs) enclosed between the start and end tags, I think you need to implement the following workflow:

1. Find the Node/Paragraph that contains the starting keyword, i.e. the opening tag.
2. Find the Node/Paragraph that contains the ending keyword, i.e. the closing tag.
3. Use the code suggested in this article to extract the content between the start and end nodes.
4. Generate a temporary Document consisting of the nodes you just extracted.
5. Iterate through the Paragraph collection and remove the opening and closing tag strings from them.
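
The steps above can be sketched on a simplified model to show the logic. This is not Aspose.Words code: a document is modelled here as a list of paragraphs, each paragraph as a list of `(text, fmt)` runs, and the markers `<part>` and `</part>` are hypothetical. The sketch also assumes a marker never straddles two runs, which a real document does not guarantee.

```python
def extract_parts(paragraphs, start, end):
    """Collect the runs between start and end markers, keeping each run's format.

    paragraphs: list of paragraphs; each paragraph is a list of (text, fmt) runs.
    Returns a list of parts; each part is a list of paragraphs of (text, fmt) runs.
    A part that is opened but never closed is discarded.
    """
    parts, current, inside = [], None, False
    for para in paragraphs:
        kept = []  # runs of this paragraph that fall inside a tagged part
        for text, fmt in para:
            pos = 0
            while pos < len(text):
                marker = end if inside else start
                hit = text.find(marker, pos)
                if inside:
                    # Keep the text up to the closing marker (or the rest of the run).
                    piece = text[pos:hit] if hit != -1 else text[pos:]
                    if piece:
                        kept.append((piece, fmt))
                if hit == -1:
                    break
                pos = hit + len(marker)  # skip the marker itself
                if inside:
                    # Closing marker reached: finish the current part.
                    if kept:
                        current.append(kept)
                    parts.append(current)
                    current, kept = None, []
                else:
                    # Opening marker reached: begin a new part.
                    current = []
                inside = not inside
        if inside and kept:
            current.append(kept)  # the part continues into the next paragraph
    return parts

# Hypothetical document: one tagged part spanning two paragraphs and three runs.
doc = [
    [("Irrelevant <part>Bold text", "bold")],
    [("plain ", "normal"), ("italic</part> trailing", "italic")],
]
parts = extract_parts(doc, "<part>", "</part>")
```

Each run is split at the marker boundaries, so the irrelevant text and the markers are dropped while the kept pieces retain the formatting of the run they came from.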

Please let me know if I can be of any further assistance.

Best regards,

sonorc · February 14, 2013, 2:06am

Hello,
In this case, could you please advise me on the fastest and most robust way of finding the node that contains the specified keyword? I can’t use regex and Range.Replace, because the version of Aspose.Words I’m using throws a NullReferenceException on a regex like (?<=keyword)(.*) if the keyword contains an angle bracket. So, do I have to iterate through all the Paragraph nodes and scan their Range property for the keyword, or is there some other way to do it?

awais.hafeez · February 14, 2013, 4:34am

Hi,

Thanks for your request. Sure, you can iterate through all the Paragraph nodes and scan their Range property for the keyword. You can also use DocumentVisitor to achieve this. You can find a very good example, which demonstrates the technique, here:
https://docs.aspose.com/words/net/how-to-extract-selected-content-between-nodes-in-a-document/
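
To illustrate the scanning approach without regex, here is a minimal Python sketch on a simplified model, not Aspose.Words code. Each paragraph is modelled as a list of run strings, and the keyword `<part>` is a hypothetical placeholder. The run texts are joined before searching, so a keyword that is split across two runs is still found.

```python
from bisect import bisect_right

def find_keyword(paragraphs, keyword):
    """Yield (paragraph_index, run_index, offset_in_run) for each keyword occurrence."""
    if not keyword:
        return
    for p_idx, runs in enumerate(paragraphs):
        text = "".join(runs)
        # Start offset of every run inside the joined paragraph text.
        starts, total = [], 0
        for run in runs:
            starts.append(total)
            total += len(run)
        pos = text.find(keyword)  # plain substring search, no regex involved
        while pos != -1:
            # The keyword begins in the last run whose start offset is <= pos.
            r_idx = bisect_right(starts, pos) - 1
            yield p_idx, r_idx, pos - starts[r_idx]
            pos = text.find(keyword, pos + len(keyword))

# Hypothetical document: the keyword is split across two runs of the second paragraph.
paragraphs = [["no marker here"], ["before <pa", "rt>after"]]
print(list(find_keyword(paragraphs, "<part>")))
```

The returned run index and offset mark where the keyword begins, which is the position at which a run would need to be split to separate the keyword from the surrounding text.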

Best regards,

javidp84 · June 12, 2018, 8:47pm

Is this link broken, or has it moved somewhere else?

awais.hafeez · June 13, 2018, 1:04am

@javidp84,

Please check the following article:
How to Extract Selected Content Between Nodes in a Document