Extract Selected Content Between Nodes generates incorrect output

purusadh · November 21, 2019, 12:29pm

Hi Team,

I used the API below to extract documents based on bookmarks. It works for other documents, but it does not work for the attached document.
https://reference.aspose.com/java/words/com.aspose.words/Bookmark

https://github.com/aspose-words/Aspose.Words-for-Java

An exception is thrown at line 124 for the attached Word document.

Please find the attached reference for your understanding.

query on document extraction.zip (1.0 MB)

Thanks,

Purushottam Sadh

tahir.manzoor · November 21, 2019, 3:38pm

@purusadh

Please create a simple Java application (source code without compilation errors) that helps us reproduce your problem on our end and attach it here for testing. We will investigate the issue and provide you with more information on it.

purusadh · November 22, 2019, 5:23am

Hi Tahir,

Please find attached the source code without compilation errors.

Thanks
Purushottam
query on document extraction.zip (1.1 MB)

tahir.manzoor · November 22, 2019, 8:55am

@purusadh

Thanks for sharing the code. We are getting compilation errors for “Constants”. Please share its code also. We will investigate the issue and share the solution with you.

purusadh · November 22, 2019, 9:16am

Hi Tahir,

Sorry for the inconvenience. Please find the Constants.java file included in the zip as well.

Thanks
Purushottam

query on document extraction.zip (1.1 MB)

tahir.manzoor · November 22, 2019, 5:46pm

@purusadh

In your code, you insert the bookmarks in the header and in the body of the section. The start and end nodes should both be child nodes of the section’s body. You need to extract the header’s content separately and import it into the destination document.

purusadh · November 25, 2019, 8:02am

Hi Tahir,

I didn’t follow. Please provide some reference code for it, as I am using the same code that is available in the reference example.

https://github.com/aspose-words/Aspose.Words-for-Java

Thanks
Purushottam

tahir.manzoor · November 25, 2019, 3:15pm

@purusadh

Please save your document after inserting the bookmarks. Open the document in MS Word and check the bookmarks’ positions. One bookmark is in the header of the document and one is inside the body. The extract content utility code does not extract the content between these two inserted bookmarks. The start and end nodes should be child nodes of the section’s body. We suggest you read about the document object model and the extract content article here:
Aspose.Words Document Object Model
Extract Selected Content Between Nodes

If you still face a problem, please ZIP and attach your expected output document. We will then provide a code example according to your requirement.
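
To make the body requirement concrete, here is a minimal sketch of the check in plain Java. It uses a hypothetical stand-in tree (`DocNode`, `BodyCheck`, and the string kind names are assumptions for illustration, not Aspose.Words classes): a marker qualifies as a start or end node only if one of its ancestors is a body.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified stand-in for a document tree; NOT the Aspose.Words API.
class DocNode {
    final String kind; // e.g. "Section", "Body", "HeaderFooter", "Paragraph", "BookmarkStart"
    final List<DocNode> children = new ArrayList<>();
    DocNode parent;

    DocNode(String kind) {
        this.kind = kind;
    }

    // Attaches a child and returns it, so calls can be chained downwards.
    DocNode add(DocNode child) {
        child.parent = this;
        children.add(child);
        return child;
    }

    // Walks up the parent chain and returns the first ancestor of the given kind, or null.
    DocNode ancestor(String wanted) {
        for (DocNode n = parent; n != null; n = n.parent) {
            if (n.kind.equals(wanted)) {
                return n;
            }
        }
        return null;
    }
}

class BodyCheck {
    // A marker is usable for extraction only if it lives somewhere under a Body.
    static boolean isInBody(DocNode marker) {
        return marker.ancestor("Body") != null;
    }

    // Both the start and the end marker must be inside a body, not a header or footer.
    static boolean canExtractBetween(DocNode start, DocNode end) {
        return isInBody(start) && isInBody(end);
    }
}
```

With one bookmark start placed under a header and the other under the body, `canExtractBetween` returns false, which mirrors the situation described above.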

purusadh · November 26, 2019, 7:15am

Hi Tahir,

I went through the above references and also checked the bookmarks. The code adds the bookmarks on “Heading 1” as per my requirement. The same code works for other documents.

Please check: I have shared the updated code with the bookmarks and the expected output, which should be two separate documents.

Thanks
Purushottam

query on document extraction.zip (2.8 MB)

tahir.manzoor · November 26, 2019, 3:17pm

@purusadh

In your expected output document, the first paragraph has the text “RATIONALE” and the last paragraph has the text “FIGURE 1 - FUSE CONFIGURATION/DIMENSIONS”. You need to insert the bookmarks for these two paragraphs in your input document.

We have inserted the bookmarks into these paragraphs in your document using MS Word and attached it along with the output document.
Docs.zip (1.1 MB)

The following code example shows how to extract the content between two bookmarks and copy the headers and footers from one document into another. Hope this helps you.

Document doc = new Document(MyDir + "modified AS28938A.doc");

Bookmark bm1 = doc.getRange().getBookmarks().get("extractcontent1");
Bookmark bm2 = doc.getRange().getBookmarks().get("extractcontent2");

ArrayList<Node> extractedNodes = extractContent(bm1.getBookmarkStart(), bm2.getBookmarkStart(), false);

Document dstDoc = generateDocument(doc, extractedNodes);

//Copy header and footer
Section section = (Section) bm1.getBookmarkStart().getAncestor(NodeType.SECTION);
for(HeaderFooter headerFooter : section.getHeadersFooters())
{
    HeaderFooter header = dstDoc.getFirstSection().getHeadersFooters().getByHeaderFooterType(headerFooter.getHeaderFooterType());
    if (header == null)
    {
        // There is no header of the specified type in the current section, create it.
        header = new HeaderFooter(dstDoc, headerFooter.getHeaderFooterType());
        dstDoc.getFirstSection().getHeadersFooters().add(header);
    }

    for (Node srcNode :  (Iterable<Node>)headerFooter.getChildNodes())
    {
        Node dstNode = dstDoc.importNode(srcNode, true, ImportFormatMode.KEEP_SOURCE_FORMATTING);
        header.appendChild(dstNode);
    }
}

dstDoc.getFirstSection().getPageSetup().setDifferentFirstPageHeaderFooter(section.getPageSetup().getDifferentFirstPageHeaderFooter());
dstDoc.save(MyDir + "output.docx");
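
The header/footer loop above follows a get-or-create pattern: look up the destination header or footer of the same type, create it if it is missing, then append the imported nodes. The plain-Java sketch below illustrates only that pattern. The map keys and the `HeaderFooterCopy` class are hypothetical stand-ins, not Aspose.Words types.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class HeaderFooterCopy {
    // Keys are hypothetical header/footer type names; values are the paragraphs each one holds.
    static void copyInto(Map<String, List<String>> src, Map<String, List<String>> dst) {
        for (Map.Entry<String, List<String>> entry : src.entrySet()) {
            // Reuse the destination entry of this type, or create it if absent.
            List<String> target = dst.computeIfAbsent(entry.getKey(), k -> new ArrayList<>());
            // Append the source content after whatever the destination already holds.
            target.addAll(entry.getValue());
        }
    }
}
```

Note that existing destination content is kept and the copied content is appended after it. If the destination header should be replaced rather than extended, it would need to be cleared before appending.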

purusadh · November 27, 2019, 12:38pm

Hi Tahir,

Thanks for the update.

In our case, adding bookmarks is a dynamic process at run time, and we need to add the bookmarks on Heading 1, not on paragraphs, as I mentioned in my previous mail.

Our expected output should be the data between two headings (Heading 1).
I attached the output in my previous mail.

Thanks,
Purushottam

tahir.manzoor · November 27, 2019, 4:55pm

@purusadh

Yes, we noticed that you are inserting bookmarks dynamically for paragraphs that have the style “Heading 1”. As shared in my previous posts, you are inserting a bookmark in the footer of the document. This is not the correct approach for extracting content with the Extract Content utility code.

The start and end nodes should be child nodes of the Section’s Body node. Import/extract the header and footer separately into the destination document, as shared in my previous post.

Moreover, please read the following suggested articles.

Hope this clears the detail of your query.

purusadh · November 28, 2019, 1:10pm

Hi Tahir,

There is some miscommunication in our discussion. I don’t want to extract the header and footer based on “Heading 1”; I just need to extract content based on the Word style.

The same code works properly for the attached document, but not for the one I shared in my previous mail.

So, I just want to recheck with you whether something is missing in the previous document “AS28938A.doc” or whether I need to change my code, which works for other files (Working document.doc).

Thanks
Purushottam Sadh

Docs.zip (785.2 KB)


tahir.manzoor · November 28, 2019, 3:45pm

@purusadh

In your case, the paragraphs with the text “RATIONALE” and “SCOPE” have a heading style in the document’s body. So, you need to insert the bookmarks at these paragraphs. The following code shows how to iterate over the body nodes of the document and insert the bookmarks.

Please note the for(Section section : doc.getSections()) loop in the following code example. Hope this helps you.

Document doc = new Document(MyDir + "AS28938A.doc");
 java.util.List<String> uploadedSections = new ArrayList<>();

System.out.print("AsposeUtils.extractSections() Before getting document");
// Gets uploaded document object

System.out.print("AsposeUtils.extractSections() Remove Nonbreaking Space Characters");

System.out.print("AsposeUtils.extractSections() uploaded document object {}"+ doc);
doc.updateListLabels();

// TODO: for ckeditor
doc.joinRunsWithSameFormatting();

DocumentBuilder builder = new DocumentBuilder(doc);

int bmo = 1;
for (Section asposeSection : doc.getSections()) {
    if (doc.getFirstSection() == asposeSection)
        continue;

    builder.moveTo(((Section) asposeSection.getPreviousSibling()).getBody().getLastParagraph());
    int orientation = asposeSection.getPageSetup().getOrientation();

    // The extractContent method does not retain section breaks in the
    // extracted sections (e.g. TEST), so we add bookmarks here to record
    // the position of each section break.
    if (asposeSection.getPageSetup().getSectionStart() == SectionStart.CONTINUOUS) {
        builder.startBookmark("BM_BreakC" + bmo);
        builder.endBookmark("BM_BreakC" + bmo);
        builder.startBookmark(orientation + "Orientation" + bmo);
        builder.endBookmark(orientation + "Orientation" + bmo);
    }
    if (asposeSection.getPageSetup().getSectionStart() == SectionStart.NEW_PAGE) {
        builder.startBookmark("BM_BreakNewPage" + bmo);
        builder.endBookmark("BM_BreakNewPage" + bmo);
        builder.startBookmark(orientation + "Orientation" + bmo);
        builder.endBookmark(orientation + "Orientation" + bmo);
    }
    bmo++;
}

int i = 1;
 
for(Section section : doc.getSections())
{
    NodeCollection<Paragraph> nodes = section.getBody().getChildNodes(NodeType.PARAGRAPH, true);

    for (Paragraph para : (Iterable<Paragraph>) nodes) {

        if (para.getParagraphFormat().isHeading()
                && para.getParagraphFormat().getStyle().getName().equals(Constants.HEADING_STYLE)) {

            uploadedSections.add(para.getText().toLowerCase().trim());

            Paragraph paragraph = new Paragraph(doc);

            para.getParentNode().insertBefore(paragraph, para);

            builder.moveTo(paragraph);

            builder.startBookmark(Constants.BOOKMARK_NAME + i);

            builder.endBookmark(Constants.BOOKMARK_NAME + i);

            // increase counter
            i++;

        }

    }

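}

// Runs once, after all sections are processed: add the closing bookmark at the
// end of the document so its name stays unique, then generate the output.
System.out.print("uploadedSections {}"+ uploadedSections);
builder.moveToDocumentEnd();
builder.startBookmark(Constants.BOOKMARK_NAME + i);
builder.endBookmark(Constants.BOOKMARK_NAME + i);

System.out.print("AsposeUtils.extractSections() going for extract html content and xml {}");
generateXml(doc, i);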
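
The idea behind the heading bookmarks is that every “Heading 1” paragraph in the body opens a new range, and everything up to the next such heading belongs to it. As a rough sketch of that grouping in plain Java, with hypothetical `Para` and `HeadingSplitter` classes standing in for real document paragraphs (these are assumptions for illustration, not Aspose.Words types):

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a body paragraph: a style name plus its text.
class Para {
    final String style;
    final String text;

    Para(String style, String text) {
        this.style = style;
        this.text = text;
    }
}

class HeadingSplitter {
    // Groups paragraphs under the most recent paragraph whose style equals headingStyle.
    // Keys are the normalized heading texts in document order.
    // Paragraphs that appear before the first heading are skipped.
    static Map<String, List<String>> splitByHeading(List<Para> paras, String headingStyle) {
        Map<String, List<String>> chunks = new LinkedHashMap<>();
        List<String> current = null;
        for (Para p : paras) {
            if (headingStyle.equals(p.style)) {
                // A heading starts a new chunk, keyed like the uploadedSections entries.
                current = new ArrayList<>();
                chunks.put(p.text.toLowerCase().trim(), current);
            } else if (current != null) {
                current.add(p.text);
            }
        }
        return chunks;
    }
}
```

In this sketch two headings with the same text would share one key and the later one would overwrite the earlier one, so a real implementation would need a unique key per heading, such as the numbered bookmark names used above.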

purusadh · November 29, 2019, 10:18am

Hi Tahir,

The above approach works for me for both types of documents.

Thanks,
Purushottam

tahir.manzoor · November 29, 2019, 1:32pm

@purusadh

Thanks for your feedback. Please let us know if you have any more queries.