Hello,
I need advice on how to extract parts of a document between two text identifiers, keeping all the formatting intact.
For example, I have a document with a tagged part, formed like this:
Irrelevant contentFormatted content that I need to extract.
It can span multiple paragraphs, and is only limited by the identifiers.Irrelevant content
The irrelevant content, the tag, and the needed content can all be contained in one run. The tags themselves aren’t needed. There can be multiple tagged parts in a single document, and I need to extract them all.
I can find a tagged part by using a regex like this: (?<=)(.*?)(?=). In the ReplaceAction I can then get all the matching runs, but what I need are the paragraphs, in order to keep the formatting.
And if I get the paragraphs that contain the runs, I don’t know how to remove the tags and the irrelevant content from them.
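For illustration, here is a minimal, library-independent sketch of the matching step on plain text, written in Python. It escapes the markers and uses a capture group instead of lookarounds, so markers containing angle brackets are treated literally and a tagged part may span line breaks. The marker strings `<keep>` and `</keep>` and the function name `find_tagged_parts` are hypothetical placeholders, not part of Aspose.Words.

```python
import re


def find_tagged_parts(text, start_marker, end_marker):
    """Return the text between each start/end marker pair, markers excluded."""
    # re.escape makes characters such as '<', '>' and '/' match literally.
    # The non-greedy group stops at the first end marker after each start marker,
    # and re.DOTALL lets '.' cross line breaks so a part can span paragraphs.
    pattern = re.compile(
        re.escape(start_marker) + r"(.*?)" + re.escape(end_marker),
        re.DOTALL,
    )
    return pattern.findall(text)


# Hypothetical sample: two tagged parts, the first spanning a line break.
sample = "junk<keep>first part\nsecond paragraph</keep>junk<keep>another</keep>tail"
parts = find_tagged_parts(sample, "<keep>", "</keep>")
```

This only recovers the text, not the formatting, so it would serve to locate the parts rather than to extract them.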
I think that, to extract all document elements (e.g. Shapes, Tables, Paragraphs) enclosed between the opening and closing tags, you need to implement the following workflow:
1. Find the Node/Paragraph that contains the starting keyword, i.e. the opening tag.
2. Find the Node/Paragraph that contains the ending keyword, i.e. the closing tag.
3. Use the code suggested in this article to extract the content between the start and end nodes.
4. Generate a temporary Document consisting of the nodes you just extracted.
5. Iterate through the Paragraph collection and remove the opening and closing tag strings from the paragraphs.
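The logic of this workflow can be sketched on plain data, independent of any document library. In the Python sketch below, a document is modelled as a list of paragraph strings, and the markers `<keep>` and `</keep>` as well as the function name `extract_between` are hypothetical. It only shows how the start and end positions determine what is kept and what is trimmed; real code would operate on document nodes so that formatting survives.

```python
def extract_between(paragraphs, start_marker, end_marker):
    """Return one list of paragraph texts per tagged part.

    Markers and any text outside the markers are dropped.
    A part that is opened but never closed is discarded.
    """
    results = []
    current = None  # paragraphs of the part being read, or None when outside a part
    for text in paragraphs:
        pos = 0
        while True:
            if current is None:
                # Outside a part: look for the next start marker in this paragraph.
                start = text.find(start_marker, pos)
                if start == -1:
                    break
                current = []
                pos = start + len(start_marker)
            else:
                # Inside a part: look for the end marker.
                end = text.find(end_marker, pos)
                if end == -1:
                    # The part continues in the next paragraph; keep the rest of this one.
                    current.append(text[pos:])
                    break
                current.append(text[pos:end])
                results.append(current)
                current = None
                pos = end + len(end_marker)
    return results


# Hypothetical sample document with two tagged parts.
doc = [
    "Irrelevant <keep>Needed one.",
    "Needed two.",
    "Needed three.</keep> Irrelevant",
    "x <keep>solo</keep> y",
]
extracted = extract_between(doc, "<keep>", "</keep>")
```

Each inner list corresponds to one temporary document in step 4, with the tags already stripped as in step 5.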
Please let me know if I can be of any further assistance.
Hello,
In this case, could you please advise me on what is the fastest and most robust way of finding the node with the specified keyword? I can’t use regex and Range.Replace, because the version of Aspose.Words I’m using throws a NullReferenceException on a regex like (?<=keyword)(.*) if keyword has an angle bracket. So, do I have to iterate through all the Paragraph nodes and scan their Range property for the keyword, or is there some other way to do it?
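One way to picture the scan described above, without any regex, is a plain substring search over each paragraph's text. The Python sketch below is only an illustration under stated assumptions: a paragraph is modelled as a list of run strings (stand-ins for whatever the library exposes), the keyword `<keep>` and the function name `find_keyword` are hypothetical, and the keyword is allowed to be split across adjacent runs, which can happen in word-processor documents.

```python
def find_keyword(paragraphs, keyword):
    """Yield (paragraph_index, run_index, offset_in_run) for each keyword start.

    `paragraphs` is a list of paragraphs, each given as a list of run strings.
    """
    if not keyword:
        raise ValueError("keyword must not be empty")
    for p_idx, runs in enumerate(paragraphs):
        # Search the joined text so a keyword split across runs is still found.
        joined = "".join(runs)
        pos = joined.find(keyword)
        while pos != -1:
            # Map the offset in the joined text back to the run that contains it.
            offset = pos
            for r_idx, run_text in enumerate(runs):
                if offset < len(run_text):
                    yield p_idx, r_idx, offset
                    break
                offset -= len(run_text)
            pos = joined.find(keyword, pos + len(keyword))


# Hypothetical sample: the keyword is split across two runs in the second paragraph.
doc = [
    ["no match here"],
    ["before <ke", "ep>inside"],
    ["<keep>", "x"],
]
hits = list(find_keyword(doc, "<keep>"))
```

Because the search is a literal substring match, angle brackets in the keyword need no escaping. The cost is one pass over the text of every paragraph.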