Document segmentation on the basis of style

swipezy · February 15, 2016, 11:08pm

Hi,
I have a well-formatted Word document. It has Heading 1 through Heading 3, normal text, tables and images. I need to split the document so that each heading of a given level (e.g. Heading 1) and its content becomes a separate document.
THIS IS HEADING 1-----(i)
<contents of heading 1…>
THIS IS ANOTHER HEADING 1-----(ii)
<contents…>
I want to turn (i) and (ii) into two separate documents, each containing its own contents. Rather than banging my head in the wrong direction, I thought I'd ask for help.
Thanks in advance.

tahir.manzoor · February 16, 2016, 9:30am

Hi Julian,

Thanks for your inquiry. The following code example shows how to get each Heading 1 and its contents from the document. I hope this helps. If you still face problems, please share your input document here for our reference, and we will provide more information about your query along with code.

Please get the code of the ExtractContent and GenerateDocument methods from the following documentation article.
Extract Content Overview and Code

Document doc = new Document(MyDir + "input.docx");
int i = 1;
DocumentBuilder builder = new DocumentBuilder(doc);
NodeCollection paragraphs = doc.GetChildNodes(NodeType.Paragraph, true);
foreach (Paragraph para in paragraphs)
{
    if (para.ParagraphFormat.StyleIdentifier == StyleIdentifier.Heading1)
    {
        builder.MoveToParagraph(paragraphs.IndexOf(para), 0);
        builder.StartBookmark("bm_extractcontents" + i);
        builder.EndBookmark("bm_extractcontents" + i);
        i++;
    }
}
builder.MoveToDocumentEnd();
builder.StartBookmark("bm_extractcontents" + i);
builder.EndBookmark("bm_extractcontents" + i);
for (int bm = 1; bm < i; bm++)
{
    BookmarkStart bookmarkStart = doc.Range.Bookmarks["bm_extractcontents" + bm].BookmarkStart;
    BookmarkStart bookmarkEnd = doc.Range.Bookmarks["bm_extractcontents" + (bm + 1)].BookmarkStart;
    // Extract the content between the two bookmarks; the last argument (false) excludes the bookmark nodes themselves.
    ArrayList extractedNodes = ExtractContent(bookmarkStart, bookmarkEnd, false);
    Document dstDoc = GenerateDocument(doc, extractedNodes);
    dstDoc.Save(MyDir + "Out" + bm + ".docx");
}

swipezy · February 16, 2016, 6:10pm

Hi,
Thanks for the prompt reply. I could run the code and see the output. It seems to be a bit dodgy when the content within a heading has multiple paragraphs. I have uploaded a test document for your reference. Please consider that.
Thanks.

tahir.manzoor · February 17, 2016, 11:21am

Hi Julian,

Thanks for sharing the document. We have noticed that the DocumentBuilder.MoveToParagraph method moves the cursor to an incorrect position in your document, so the bookmark is inserted at the wrong place. We have logged this problem in our issue tracking system as WORDSNET-13136. You will be notified via this forum thread once this issue is resolved. We apologize for the inconvenience.

Could you please share your expected output documents here for our reference? Thanks for your cooperation.

swipezy · February 17, 2016, 4:40pm

All good, thanks for your help anyway.

swipezy · February 24, 2016, 8:44pm

Hi,
While attempting to break documents on the basis of headings, the code works fine with some documents and gives an out of range error with others. Here is where I get the error:
builder.MoveToParagraph(paragraphs.IndexOf(para), 0);
Any help?

tahir.manzoor · February 25, 2016, 2:57am

Hi Julian,

Thanks for your inquiry. Could you please attach the input Word document that produces the error so that we can test it? We will investigate the issue on our side and provide you more information.

tahir.manzoor · March 17, 2016, 12:19pm

Hi Julian,

Thanks for your patience. We have completed the work on issue WORDSNET-13136 and have come to the conclusion that it is not actually a bug in Aspose.Words. So, we have closed this issue as 'Not a Bug'.

Please note that Document.GetChildNodes(NodeType.Paragraph, true) gets the paragraphs from each section's body and from its headers/footers. DocumentBuilder.MoveToParagraph, however, takes an index within the current section, so an index taken from the document-wide collection can point at the wrong paragraph or be out of range. Please use Body.GetChildNodes(NodeType.Paragraph, true) to get the paragraphs only from the section's body, as shown in the following code snippet.

Document doc = new Document(MyDir + "testOr.docx");
DocumentBuilder builder = new DocumentBuilder(doc);
int i = 1;
for (int sectionIndex = 0; sectionIndex < doc.Sections.Count; sectionIndex++)
{
    builder.MoveToSection(sectionIndex);
    NodeCollection paragraphs = doc.Sections[sectionIndex].Body.GetChildNodes(NodeType.Paragraph, true);
    foreach (Paragraph para in paragraphs)
    {
        if (para.ParagraphFormat.StyleIdentifier == StyleIdentifier.Heading1 && para.ToString(SaveFormat.Text).Trim().Length > 1)
        {
            builder.MoveToParagraph(paragraphs.IndexOf(para), 0);
            builder.StartBookmark("bm_extractcontents" + i);
            builder.EndBookmark("bm_extractcontents" + i);
            i++;
        }
    }
}
builder.MoveToDocumentEnd();
builder.StartBookmark("bm_extractcontents" + i);
builder.EndBookmark("bm_extractcontents" + i);
doc.Save(MyDir + "Out.docx");
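
A possible variation on the snippet above, offered as a sketch rather than an official recommendation: instead of passing an index to MoveToParagraph, the bookmark nodes can be attached directly to each heading paragraph, so no index has to match the builder's current section. The names HeadingBookmarks and Insert are hypothetical, and the check Length > 0 (instead of > 1) is an assumption about what should count as an empty heading.

```csharp
using Aspose.Words;

// Hypothetical helper, not part of Aspose.Words or of the code posted in this thread.
public static class HeadingBookmarks
{
    // Puts an empty bookmark at the start of every non-empty Heading 1 paragraph
    // in the section bodies, plus one at the document end.
    // Returns the number of headings that were marked.
    public static int Insert(Document doc, string prefix = "bm_extractcontents")
    {
        int i = 1;
        foreach (Section section in doc.Sections)
        {
            // Body only: header/footer paragraphs are not part of this collection.
            // ToArray() takes a snapshot so the tree can be modified while looping.
            Node[] paragraphs = section.Body.GetChildNodes(NodeType.Paragraph, true).ToArray();
            foreach (Paragraph para in paragraphs)
            {
                bool isHeading1 = para.ParagraphFormat.StyleIdentifier == StyleIdentifier.Heading1;
                // Assumption: any heading with visible text qualifies.
                bool hasText = para.ToString(SaveFormat.Text).Trim().Length > 0;
                if (!isHeading1 || !hasText)
                    continue;

                // Attach the bookmark pair as the first two children of the heading,
                // so no paragraph index is needed.
                BookmarkStart start = new BookmarkStart(doc, prefix + i);
                para.PrependChild(start);
                para.InsertAfter(new BookmarkEnd(doc, prefix + i), start);
                i++;
            }
        }

        // Closing bookmark at the very end, as in the snippets above.
        DocumentBuilder builder = new DocumentBuilder(doc);
        builder.MoveToDocumentEnd();
        builder.StartBookmark(prefix + i);
        builder.EndBookmark(prefix + i);
        return i - 1;
    }
}
```

After calling HeadingBookmarks.Insert(doc), the extraction loop from the first reply (ExtractContent between bookmark bm and bookmark bm + 1, then GenerateDocument) should be usable unchanged, because the bookmark names follow the same numbering.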