Extract content based on Heading style using C#

Gptrnt · April 28, 2020, 5:50am

Hi,

I want to get all the content (including paragraphs, tables, charts, pictures, and shapes) under each heading. This is the sample document:
Sample_doc_.zip (3.6 MB)

I want to get all the content that belongs to each heading. Please help me find a way to do this.

Also, the content is sometimes hidden. Can you please tell me how to identify (or extract) the hidden content?

tahir.manzoor · April 28, 2020, 4:06pm

@Gptrnt

In your case, we suggest the following solution.

Iterate over the paragraph nodes of the document.
Bookmark the paragraphs that have the style “Heading 1”, e.g. bookmark1, bookmark2, etc.
Extract the content between the bookmarks.

Please read the following article about extracting content from the document.
How to Extract Selected Content Between Nodes in a Document
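
On the hidden content part of the question, here is a minimal sketch, assuming the hidden text in the document is marked with Word's hidden font attribute, which Aspose.Words exposes as `Font.Hidden`. The names `HiddenText` and `FindHiddenRuns` are made up for illustration and are not part of the library.

```csharp
using System.Collections.Generic;
using Aspose.Words;

public static class HiddenText
{
    // Returns the text of every run in the document that is formatted as hidden.
    // Hypothetical helper: only text runs are inspected here.
    public static List<string> FindHiddenRuns(Document doc)
    {
        var hidden = new List<string>();
        foreach (Run run in doc.GetChildNodes(NodeType.Run, true))
        {
            if (run.Font.Hidden)
                hidden.Add(run.Text);
        }
        return hidden;
    }
}
```

It could be called as `HiddenText.FindHiddenRuns(new Document(MyDir + "Sample_doc_.docx"))`. This only covers runs of text; hidden paragraph marks or shapes would likely need separate checks (for example `Paragraph.ParagraphBreakFont.Hidden`).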

Gptrnt · April 28, 2020, 4:52pm

Hi,

I used the same method, but I am not able to get the content of the last heading, because there is no mark for the end. Can you please tell me how to find the last node of the document (it could be a table, a picture, etc.)?

tahir.manzoor · April 28, 2020, 6:54pm

@Gptrnt

The following code example shows how to bookmark the paragraphs that have a heading style and extract the content. You can get the code of the ExtractContent and GenerateDocument methods from the article shared in my previous post.

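using System.Collections;
using System.Linq;
using Aspose.Words;

Document doc = new Document(MyDir + "Sample_doc_.docx");
DocumentBuilder builder = new DocumentBuilder(doc);
int i = 1;
NodeCollection nodes = doc.GetChildNodes(NodeType.Paragraph, true);
// Insert an empty bookmark at the start of every "Heading 1" paragraph.
foreach (Paragraph paragraph in nodes.Cast<Paragraph>().Where(p => p.ParagraphFormat.StyleIdentifier == StyleIdentifier.Heading1))
{
    builder.MoveToParagraph(nodes.IndexOf(paragraph), 0);
    builder.StartBookmark("bookmark" + i);
    builder.EndBookmark("bookmark" + i);
    i++;
}
doc.UpdatePageLayout();

// Extract the content from each heading up to the next one.
for (int j = 1; j < i; j++)
{
    BookmarkStart start = doc.Range.Bookmarks["bookmark" + j].BookmarkStart;
    // (j + 1) must be parenthesized: "bookmark" + j + 1 concatenates to e.g. "bookmark11".
    // The last heading has no following bookmark, so end at the last paragraph instead
    // (assumes the content of interest is in the body of the last section).
    Node end = j + 1 < i
        ? (Node)doc.Range.Bookmarks["bookmark" + (j + 1)].BookmarkStart
        : doc.LastSection.Body.LastParagraph;

    ArrayList extractedNodesInclusive = Common.ExtractContent(start, end, true);
    Document dstDoc = Common.GenerateDocument(doc, extractedNodesInclusive);
    dstDoc.Save(MyDir + "out" + j + ".docx");
}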
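
For the question about the last heading, the grouping idea can also be sketched without bookmarks: walk the blocks in order, start a new group at every heading, and let the final group run to the end. The sketch below is plain C# with stand-in string data rather than Aspose.Words nodes; `HeadingSplitter` and `SplitAtHeadings` are hypothetical names used only for illustration.

```csharp
using System;
using System.Collections.Generic;

public static class HeadingSplitter
{
    // Splits a flat sequence into groups, starting a new group at every item
    // for which isHeading returns true. Items before the first heading are skipped.
    // The final group runs to the end of the sequence, so no end marker is needed.
    public static List<List<T>> SplitAtHeadings<T>(IEnumerable<T> items, Func<T, bool> isHeading)
    {
        var groups = new List<List<T>>();
        List<T> current = null;
        foreach (T item in items)
        {
            if (isHeading(item))
            {
                current = new List<T>();
                groups.Add(current);
            }
            if (current != null)
                current.Add(item);
        }
        return groups;
    }

    public static void Main()
    {
        // Stand-in data: strings starting with "#" play the role of "Heading 1" paragraphs.
        var blocks = new[] { "intro", "# One", "text", "table", "# Two", "picture" };
        foreach (var group in SplitAtHeadings(blocks, b => b.StartsWith("#")))
            Console.WriteLine(string.Join(" | ", group));
        // Prints:
        // # One | text | table
        // # Two | picture
    }
}
```

With Aspose.Words, the items would presumably be the direct children of each section's Body (paragraphs and tables, with pictures, charts, and shapes sitting inside paragraphs), and the predicate would be the same `StyleIdentifier.Heading1` test used in the code above.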