I need advice on how to extract parts of a document between two text identifiers, keeping all the formatting intact.
For example, I have a document with a tagged part, structured like this:
Irrelevant contentFormatted content that I need to extract.
It can span multiple paragraphs, and is only limited by the identifiers.Irrelevant content
The irrelevant content, the tag, and the needed content can all be contained in one run. The tags themselves aren’t needed. There can be multiple tagged parts in a single document, and I need to extract them all.
I can find a tagged part with a regex like this: (?<=)(.*?)(?=). Then, in the ReplaceAction, I can get all the runs I need, but to keep the formatting I need the paragraphs rather than just the runs.
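To make the regex idea concrete, here is a minimal sketch on plain text only, written in Python rather than against Aspose.Words. The markers `[START]` and `[END]` and the function `find_tagged_parts` are hypothetical stand-ins, not the real tags or any library API. The pattern is non-greedy and uses DOTALL so that a match can span line breaks.

```python
import re

# Hypothetical markers; they stand in for the real tags in the document.
START, END = "[START]", "[END]"


def find_tagged_parts(text, start=START, end=END):
    """Return the text between each start/end pair, without the markers."""
    pattern = re.compile(
        # Lookbehind/lookahead keep the markers out of the match itself.
        "(?<=" + re.escape(start) + ")(.*?)(?=" + re.escape(end) + ")",
        re.DOTALL,  # let "." match newlines so a part can span paragraphs
    )
    return [m.group(1) for m in pattern.finditer(text)]


if __name__ == "__main__":
    sample = "Irrelevant [START]first\nsecond[END] irrelevant [START]third[END]"
    for part in find_tagged_parts(sample):
        print(repr(part))
```

This only recovers the text of each tagged part; it says nothing about formatting, which is the part I am stuck on.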
And if I get the paragraphs that contain the runs, I don’t know how to remove the tags and the irrelevant content from them.
I also tried extracting the paragraphs as described at https://docs.aspose.com/words/net/how-to-extract-selected-content-between-nodes-in-a-document/. However, I don’t know how to find the start and end nodes that contain the tags in the first place, or, once again, how to remove the tags and the irrelevant content from those nodes once I have them.
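The trimming step I have in mind might look roughly like the sketch below. It is not Aspose.Words code: it uses a plain Python model in which a paragraph is a list of runs, and the `Run` class, the `bold` field (standing in for any run-level formatting), the `[START]`/`[END]` markers, and `extract_tagged` are all hypothetical. It also assumes each marker sits entirely inside one run and is never split across runs.

```python
from dataclasses import dataclass, replace

# Hypothetical markers; they stand in for the real tags in the document.
START, END = "[START]", "[END]"


@dataclass(frozen=True)
class Run:
    text: str
    bold: bool = False  # stands in for any run-level formatting


def extract_tagged(paragraphs, start=START, end=END):
    """Return one entry per tagged part; each entry is a list of paragraphs
    (lists of Run) trimmed to the content between the markers."""
    parts = []
    current = None  # paragraphs of the part being collected, or None if outside
    for para in paragraphs:
        if current is not None:
            current.append([])  # the part continues into a new paragraph
        for run in para:
            text = run.text
            while text:
                if current is None:
                    idx = text.find(start)
                    if idx < 0:
                        break  # the rest of this run is irrelevant
                    text = text[idx + len(start):]  # drop content before the tag
                    current = [[]]
                else:
                    idx = text.find(end)
                    piece = text if idx < 0 else text[:idx]
                    if piece:
                        # Copy the run so its formatting is kept, with trimmed text.
                        current[-1].append(replace(run, text=piece))
                    if idx < 0:
                        break  # the part continues in the next run
                    text = text[idx + len(end):]
                    parts.append(current)
                    current = None
    return parts
```

In this model, a run that contains irrelevant content, a tag, and needed content is split at the tag boundaries, and only the needed piece is kept with the original run's formatting. What I don’t know is how to do the equivalent with real document nodes.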