Extract Paragraphs under different headings from word Document

Hi,

I want to extract the content (text, images, embedded OLE objects, etc.) that appears under each heading of a Word document.

Example:

Input:

  1. Heading 1

This paragraph is under heading 1

1.1 Heading 1.1

This paragraph is under heading 1.1

1.1.1 Heading 1.1.1

This paragraph is under heading 1.1.1

  2. Heading 2

This paragraph is under Heading 2

Expected Output:

This paragraph is under Heading 1

This paragraph is under Heading 1.1

This paragraph is under Heading 1.1.1

This paragraph is under Heading 2

Please let me know whether I can do this with Aspose.Words for Java.

Thanks,

Priyanka


Hi there,

Thanks for your inquiry. The following article explains how to extract content between paragraphs based on style, and it also contains the code of the extractContent method:
https://docs.aspose.com/words/java/extract-selected-content-between-nodes/

I also suggest checking the Aspose.Words code examples here:
https://github.com/asposewords/Aspose_Words_Java

Hope this helps you. Please let us know if you have any more queries.

Thanks for the reply. I have already seen these examples, but it looks like I need more than that, since the paragraphs under these headings use the Normal style.

Could you please guide me on how to proceed?

Hi there,

Thanks for your inquiry. Could you please attach your input and expected output Word documents here for our reference? We will then provide more information along with code.

Please find enclosed the Input Document and expected output documents.

I want to sequentially read a file and generate multiple documents based on headings.

Hi there,

Thanks for sharing the documents. Please use the following code example to achieve your requirements.

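// Requires: import com.aspose.words.*; import java.util.ArrayList;
Document doc = new Document(MyDir + "Input+Document.docx");
int i = 1;
DocumentBuilder builder = new DocumentBuilder(doc);
NodeCollection nodes = doc.getChildNodes(NodeType.PARAGRAPH, true);
// Insert an empty marker bookmark at the start of every heading paragraph.
for (Paragraph para : (Iterable<Paragraph>) nodes)
{
    if (para.getParagraphFormat().isHeading())
    {
        // Note: moveToParagraph counts paragraphs within the current story of the current section,
        // so this assumes a single-section document whose paragraphs all sit directly in the body.
        builder.moveToParagraph(nodes.indexOf(para), 0);
        builder.startBookmark("bm_extractcontents" + i);
        builder.endBookmark("bm_extractcontents" + i);
        i++;
    }
}
// Add a final marker at the end of the document to close the last section.
builder.moveToDocumentEnd();
builder.startBookmark("bm_extractcontents" + i);
builder.endBookmark("bm_extractcontents" + i);
for (int bm = 1; bm < i; bm++)
{
    // The marker of this heading and the marker of the next heading (or of the document end).
    BookmarkStart bookmarkStart = doc.getRange().getBookmarks().get("bm_extractcontents" + bm).getBookmarkStart();
    BookmarkStart bookmarkEnd = doc.getRange().getBookmarks().get("bm_extractcontents" + (bm + 1)).getBookmarkStart();
    // Extract the content between the two markers; the last argument (false) leaves the marker nodes out.
    ArrayList extractedNodes = extractContent(bookmarkStart, bookmarkEnd, false);
    Document dstDoc = generateDocument(doc, extractedNodes);
    dstDoc.save(MyDir + "Out" + bm + ".docx");
}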

Please get the code of extractContent method from here:
https://docs.aspose.com/words/java/extract-selected-content-between-nodes/

Hope this helps you. Please let us know if you have any more queries.
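
Independent of the Aspose.Words API, the splitting logic behind the code above can be sketched in plain Java. This is only an illustration under stated assumptions: `HeadingSplitter` and `Para` are hypothetical names, and `Para` is a stand-in for a document paragraph that carries just its text and a flag saying whether it uses a heading style. As in the bookmark-based code, a new section starts at every heading, and anything before the first heading is not extracted.

```java
import java.util.ArrayList;
import java.util.List;

public class HeadingSplitter {
    /** Hypothetical stand-in for a paragraph: its text and whether it has a heading style. */
    public static final class Para {
        public final String text;
        public final boolean heading;

        public Para(String text, boolean heading) {
            this.text = text;
            this.heading = heading;
        }
    }

    /** Groups paragraphs into sections; each section starts at a heading and runs to the next heading. */
    public static List<List<Para>> split(List<Para> paragraphs) {
        List<List<Para>> sections = new ArrayList<>();
        List<Para> current = null;
        for (Para p : paragraphs) {
            if (p.heading) {
                // A heading closes the previous section and opens a new one.
                current = new ArrayList<>();
                sections.add(current);
            }
            if (current != null) {
                // Paragraphs that appear before the first heading are skipped.
                current.add(p);
            }
        }
        return sections;
    }

    public static void main(String[] args) {
        List<Para> input = new ArrayList<>();
        input.add(new Para("Heading 1", true));
        input.add(new Para("This paragraph is under heading 1", false));
        input.add(new Para("Heading 1.1", true));
        input.add(new Para("This paragraph is under heading 1.1", false));

        int n = 1;
        for (List<Para> section : split(input)) {
            System.out.println("Section " + n++ + ":");
            for (Para p : section) {
                System.out.println("  " + p.text);
            }
        }
    }
}
```

In the real code, each section would be turned into its own document by extractContent and generateDocument rather than collected in a list.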

Thanks for your quick response.

Could you also tell me whether it is possible to extract only the paragraphs, without the headings?

Please find input and expected output attached.

Hi there,

Thanks for your inquiry. Please check the code below. The isHeading() check added just before the save call removes the first paragraph from the extracted contents when that paragraph is a heading. Hope this helps you. Please let us know if you have any more queries.

// Requires: import com.aspose.words.*; import java.util.ArrayList;
Document doc = new Document(MyDir + "Input+Document.docx");
int i = 1;
DocumentBuilder builder = new DocumentBuilder(doc);
NodeCollection nodes = doc.getChildNodes(NodeType.PARAGRAPH, true);
// Insert an empty marker bookmark at the start of every heading paragraph.
for (Paragraph para : (Iterable<Paragraph>) nodes)
{
    if (para.getParagraphFormat().isHeading())
    {
        builder.moveToParagraph(nodes.indexOf(para), 0);
        builder.startBookmark("bm_extractcontents" + i);
        builder.endBookmark("bm_extractcontents" + i);
        i++;
    }
}
// Add a final marker at the end of the document to close the last section.
builder.moveToDocumentEnd();
builder.startBookmark("bm_extractcontents" + i);
builder.endBookmark("bm_extractcontents" + i);
for (int bm = 1; bm < i; bm++)
{
    BookmarkStart bookmarkStart = doc.getRange().getBookmarks().get("bm_extractcontents" + bm).getBookmarkStart();
    BookmarkStart bookmarkEnd = doc.getRange().getBookmarks().get("bm_extractcontents" + (bm + 1)).getBookmarkStart();
    // Extract the content between the two markers; the last argument (false) leaves the marker nodes out.
    ArrayList extractedNodes = extractContent(bookmarkStart, bookmarkEnd, false);
    Document dstDoc = generateDocument(doc, extractedNodes);
    // Remove the heading so that only the paragraphs under it remain.
    Paragraph first = dstDoc.getFirstSection().getBody().getFirstParagraph();
    if (first != null && first.getParagraphFormat().isHeading())
        first.remove();
    dstDoc.save(MyDir + "Out" + bm + ".docx");
}

The code from “Extract Paragraphs under different headings from word Document” is not working for the attached template. Please help.

Hi there,

Thanks for your inquiry.
Please use the attached modified extractContent method in your application. It should fix the exception.

Please let us know if you have any more queries.

Hi,

Thanks for your quick response. This did solve the exception issue, but it didn’t give me the desired output.

Please find attached the output that I am expecting. I just want to generate a document for each heading along with its paragraphs. Everything else (header, footer, table of contents) should be ignored, and I don’t want blank documents to be generated for those parts. Thanks.

Hi there,

Thanks for your inquiry.
The issue you reported is caused by an IF field inside a heading. Please ignore the extractContent method shared in my previous post and use the following code example instead. The original version of the extractContent method is in the attachment. Hope this helps you. Please let us know if you have any more queries.

// Requires: import com.aspose.words.*; import java.util.ArrayList;
Document doc = new Document(MyDir + "Template_copy_as(2).docx");
doc.updateFields();
int i = 1;
DocumentBuilder builder = new DocumentBuilder(doc);
// Take a snapshot first: the loop inserts paragraphs, and the NodeCollection is live.
Node[] paragraphs = doc.getChildNodes(NodeType.PARAGRAPH, true).toArray();
for (Node node : paragraphs)
{
    Paragraph para = (Paragraph) node;
    if (para.hasChildNodes() && para.getParagraphFormat().isHeading())
    {
        // Put the marker bookmark in a new empty paragraph before the heading,
        // so that fields inside the heading are not split.
        Paragraph paragraph = new Paragraph(doc);
        para.getParentNode().insertBefore(paragraph, para);
        builder.moveTo(paragraph);
        builder.startBookmark("bm_extractcontents" + i);
        builder.endBookmark("bm_extractcontents" + i);
        i++;
    }
}
// Add a final marker at the end of the document to close the last section.
builder.moveToDocumentEnd();
builder.startBookmark("bm_extractcontents" + i);
builder.endBookmark("bm_extractcontents" + i);
for (int bm = 1; bm < i; bm++)
{
    BookmarkStart bookmarkStart = doc.getRange().getBookmarks().get("bm_extractcontents" + bm).getBookmarkStart();
    BookmarkStart bookmarkEnd = doc.getRange().getBookmarks().get("bm_extractcontents" + (bm + 1)).getBookmarkStart();
    // Extract the content between the two markers; the last argument (false) leaves the marker nodes out.
    ArrayList extractedNodes = extractContent(bookmarkStart, bookmarkEnd, false);
    Document dstDoc = generateDocument(doc, extractedNodes);
    Paragraph first = dstDoc.getFirstSection().getBody().getFirstParagraph();
    if (first != null && first.getParagraphFormat().isHeading())
        first.remove();
    // Save only documents that contain some text, so no blank files are produced.
    if (dstDoc.toString(SaveFormat.TEXT).trim().length() > 0)
        dstDoc.save(MyDir + "Out" + bm + ".docx");
}

Can I get output without headings?

Hi there,

Thanks for your inquiry. Please use the following modified code example to achieve your requirements. Hope this helps you.

Please let us know if you have any more queries.

// Continues the previous example: doc and i are defined there.
for (int bm = 1; bm < i; bm++)
{
    BookmarkStart bookmarkStart = doc.getRange().getBookmarks().get("bm_extractcontents" + bm).getBookmarkStart();
    BookmarkStart bookmarkEnd = doc.getRange().getBookmarks().get("bm_extractcontents" + (bm + 1)).getBookmarkStart();
    // Extract the content between the two markers; the last argument (false) leaves the marker nodes out.
    ArrayList extractedNodes = extractContent(bookmarkStart, bookmarkEnd, false);
    Document dstDoc = generateDocument(doc, extractedNodes);
    dstDoc.updatePageLayout();
    Body body = dstDoc.getFirstSection().getBody();
    // Drop a leading empty paragraph (the marker paragraph inserted before the heading).
    if (body.getFirstParagraph() != null && body.getFirstParagraph().getChildNodes().getCount() == 0)
        body.getFirstParagraph().remove();
    // Then drop the heading itself, so only the paragraphs under it remain.
    if (body.getFirstParagraph() != null && body.getFirstParagraph().getParagraphFormat().isHeading())
        body.getFirstParagraph().remove();
    // Save only documents that still contain some text.
    if (dstDoc.toString(SaveFormat.TEXT).trim().length() > 0)
        dstDoc.save(MyDir + "Out" + bm + ".docx");
}
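
The post-processing order in the loop above (drop a leading empty paragraph, then drop the heading, then skip the section if no text is left) can be sketched in plain Java without the Aspose.Words API. This is a minimal illustration under stated assumptions: `SectionBodyFilter` and `Para` are hypothetical names, and `Para` models a paragraph only by its text and a heading flag.

```java
import java.util.ArrayList;
import java.util.List;

public class SectionBodyFilter {
    /** Hypothetical stand-in for a paragraph: its text and whether it has a heading style. */
    public static final class Para {
        public final String text;
        public final boolean heading;

        public Para(String text, boolean heading) {
            this.text = text;
            this.heading = heading;
        }
    }

    /** Returns a copy of the section without a leading empty paragraph and without a leading heading. */
    public static List<Para> bodyOf(List<Para> section) {
        List<Para> body = new ArrayList<>(section);
        // Step 1: drop a leading empty paragraph, such as a marker paragraph.
        if (!body.isEmpty() && body.get(0).text.isEmpty()) {
            body.remove(0);
        }
        // Step 2: drop the heading if it is now the first paragraph.
        if (!body.isEmpty() && body.get(0).heading) {
            body.remove(0);
        }
        return body;
    }

    /** True if at least one paragraph contains non-whitespace text; otherwise the section is skipped. */
    public static boolean hasText(List<Para> body) {
        for (Para p : body) {
            if (!p.text.trim().isEmpty()) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<Para> section = new ArrayList<>();
        section.add(new Para("", false));
        section.add(new Para("Heading 2", true));
        section.add(new Para("This paragraph is under Heading 2", false));

        List<Para> body = bodyOf(section);
        if (hasText(body)) {
            for (Para p : body) {
                System.out.println(p.text);
            }
        }
    }
}
```

A heading with nothing under it ends up with an empty body, so no output is produced for it.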

Please find attached my expected output (output.zip) and input (Template.docx) documents. Also, please note that “out_4.docx” is blank because Interfaces does not have any body content.

Hi there,

Thanks for your inquiry. We have tested the scenario using the latest version, Aspose.Words for Java 15.10.0, and could not reproduce the issue. Please use Aspose.Words for Java 15.10.0.

Please let us know if you have any more queries.

Could you please tell me the Maven dependency declaration for this version?

Hi there,

Thanks for your inquiry. Please check the Aspose.Words v15.10.0 Maven repository at the following link.

Aspose.Words Maven Repository

We have attached a sample pom.xml to this post for your reference.

I changed the aspose version and it still didn’t work for me.

Please see the actual output I am getting (Actual output.zip) and the expected output (Expected Output.zip).