Free Support Forum - aspose.com

Extract content between two text identifiers

Hello,
I need advice on how to extract parts of a document between two text identifiers, keeping all the formatting intact.
For example, I have a document with a tagged part, formed like this:

Irrelevant content [start tag] Formatted content that I need to extract.
It can span multiple paragraphs, and is only limited by the identifiers.
[end tag] Irrelevant content

The irrelevant content, the tag, and the needed content can all be contained in one run. The tags themselves aren’t needed. There can be multiple tagged parts in a single document, and I need to extract them all.

I can find a tagged part by using a regex with a lookbehind for the start tag and a lookahead for the end tag, of the form (?<=START)(.*?)(?=END), where START and END stand for the two identifiers. Then in the ReplaceAction I can get all the needed runs, but I need the paragraphs, to keep the formatting.
And if I get the paragraphs that contain the runs, I don’t know how to remove the tags and the irrelevant content from them.

I also tried extracting the paragraphs as described at http://www.aspose.com/docs/display/wordsnet/Extract+Content+Overview+and+Code, but I don’t know how to find the start and the end nodes with the tags in the first place, and, once again, how to remove the tags and the irrelevant content from them when I get the needed nodes.

Hi,


Thanks for your inquiry.

To extract all document elements (e.g. Shapes, Tables, Paragraphs) enclosed between the start and end tags, I think you need to implement the following workflow:

  1. Find the Node/Paragraph that contains the starting keyword, i.e. the start tag.
  2. Find the Node/Paragraph that contains the ending keyword, i.e. the end tag.
  3. Use the code suggested in this article to extract the content between the start and end nodes.
  4. Generate a temporary Document comprising the nodes you just extracted.
  5. Iterate through the Paragraph collection and remove the start and end tag strings from them.
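
The steps above can be sketched on a simplified model. This is only an illustration in Python, not Aspose.Words code: a paragraph is modelled as a list of `Run` objects, `bold` stands in for any run-level formatting, and the `[[START]]` / `[[END]]` markers are hypothetical placeholders for the real tags. It also assumes that a marker is never split across two runs, which a real document does not guarantee.

```python
from dataclasses import dataclass

# Hypothetical markers: the real tags from the thread are not shown.
START, END = "[[START]]", "[[END]]"


@dataclass
class Run:
    text: str
    bold: bool = False  # stands in for any run-level formatting


def extract_tagged(paragraphs):
    """Return one list of paragraphs (each a list of Run) per tagged region.

    Markers and everything outside them are dropped; the formatting of the
    kept text is copied from the run it came from. Empty paragraphs and a
    region that is never closed are ignored in this sketch.
    """
    regions, current, inside = [], None, False
    for para in paragraphs:
        kept = []  # pieces of this paragraph that lie inside a region
        for run in para:
            text = run.text
            while text:
                marker = END if inside else START
                pos = text.find(marker)
                if inside:
                    chunk = text if pos < 0 else text[:pos]
                    if chunk:
                        kept.append(Run(chunk, run.bold))
                if pos < 0:
                    break  # no more markers in this run
                text = text[pos + len(marker):]  # skip past the marker
                if inside:
                    # End marker found: close the current region.
                    if kept:
                        current.append(kept)
                    regions.append(current)
                    current, kept = None, []
                else:
                    # Start marker found: open a new region.
                    current = []
                inside = not inside
        if inside and kept:
            current.append(kept)  # region continues in the next paragraph
    return regions


doc = [
    [Run("Irrelevant [[START]]Formatted "), Run("content", bold=True)],
    [Run("spanning paragraphs[[END]] trailing")],
]
regions = extract_tagged(doc)
```

Because the text is cut inside the run rather than at paragraph level, the irrelevant content and the markers disappear while each kept piece retains the formatting of its original run.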

Please let me know if I can be of any further assistance.

Best regards,

Hello,
In this case, could you please advise me on what is the fastest and most robust way of finding the node with the specified keyword? I can’t use regex and Range.Replace, because the version of Aspose.Words I’m using throws a NullReferenceException on a regex like (?<=keyword)(.*) if keyword has an angle bracket. So, do I have to iterate through all the Paragraph nodes and scan their Range property for the keyword, or is there some other way to do it?

Hi,


Thanks for your request. Sure, you can iterate through all the Paragraph nodes and scan their Range property for the keyword. You can also use DocumentVisitor to achieve this. You can find a very good example, which demonstrates the technique, here:
http://www.aspose.com/docs/display/wordsnet/How+to++Extract+Content+using+DocumentVisitor
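
As a rough illustration of the paragraph-scanning approach, here is a minimal Python sketch that uses a plain substring search instead of a regex, so angle brackets in the keyword need no special handling. It is not Aspose.Words code: each paragraph is modelled as a list of run strings, and `find_keyword` and the `<start>` keyword are hypothetical names used only for this example. The search runs over the joined paragraph text, so a keyword split across runs is still found, and the match offset is then mapped back to the run where it begins.

```python
from bisect import bisect_right
from itertools import accumulate


def find_keyword(paragraphs, keyword):
    """Return (paragraph index, run index, offset in run) for each match start."""
    hits = []
    for p_idx, runs in enumerate(paragraphs):
        text = "".join(runs)
        # starts[i] is the offset of run i within the paragraph text.
        starts = [0, *accumulate(len(r) for r in runs)][:-1]
        pos = text.find(keyword)
        while pos >= 0:
            # Pick the last run whose start offset is <= pos.
            r_idx = bisect_right(starts, pos) - 1
            hits.append((p_idx, r_idx, pos - starts[r_idx]))
            pos = text.find(keyword, pos + len(keyword))
    return hits


paragraphs = [
    ["no match here"],
    ["before <st", "art> after"],  # keyword split across two runs
]
hits = find_keyword(paragraphs, "<start>")
```

The returned run index and offset identify where a run would have to be split to separate the keyword from the surrounding text.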

Best regards,

Is this link broken, or has it moved somewhere else?

@javidp84,

Please check the following article:
How to Extract Selected Content Between Nodes in a Document