Copy/Extract shape using paragraph node in Java

@priyadharshini

Thanks for sharing your requirement in detail. Please allow us some time to analyze your desired output. We will get back to you soon with a code example that meets your requirement.

Best Regards,
Tahir Manzoor

Thank you… waiting eagerly for reply…

@priyadharshini

Thanks for your patience. Please use the following code example to achieve your requirement. Hope this helps you.

Document doc = new Document(MyDir + "Imageproblem.docx");
int i = 1;
ArrayList<Paragraph> nodes = new ArrayList<Paragraph>();

//Remove empty paragraphs (no text, no shapes, no page break)
for (Paragraph paragraph : (Iterable<Paragraph>) doc.getChildNodes(NodeType.PARAGRAPH, true)) {
    if (paragraph.toString(SaveFormat.TEXT).trim().length() == 0
            && paragraph.getChildNodes(NodeType.SHAPE, true).getCount() == 0
            && !paragraph.getText().contains(ControlChar.PAGE_BREAK)) {
        paragraph.remove();
    }
}

//Get the paragraphs that start with "Fig".
for (Paragraph paragraph : (Iterable<Paragraph>) doc.getChildNodes(NodeType.PARAGRAPH, true))
{
    if (paragraph.toString(SaveFormat.TEXT).trim().startsWith("Fig"))
    {
        //Walk backwards over the shape-only paragraphs directly before the caption
        Node previousPara = paragraph.getPreviousSibling();
        while (previousPara != null
                && previousPara.getNodeType() == NodeType.PARAGRAPH
                && previousPara.toString(SaveFormat.TEXT).trim().length() == 0
                && ((Paragraph) previousPara).getChildNodes(NodeType.SHAPE, true).getCount() > 0)
        {
            nodes.add((Paragraph) previousPara);
            previousPara = previousPara.getPreviousSibling();
        }

        //Extract the consecutive shapes and export them into new document
        Document dstDoc = new Document();
        for (Paragraph para : nodes)
        {
            NodeImporter importer = new NodeImporter(doc, dstDoc, ImportFormatMode.KEEP_SOURCE_FORMATTING);
            Node newNode = importer.importNode(para, true);
            dstDoc.getFirstSection().getBody().appendChild(newNode);
        }
        dstDoc.save(MyDir + "output" + i + ".docx");
        i++;
        nodes.clear();
    }
}

Thank you Tahir… It is absolutely working.

Regards
Priya Dharshini J P

@priyadharshini

Thanks for your feedback. Please feel free to ask if you have any questions about Aspose.Words; we will be happy to help you.

Thank you tahir.

When I run the above logic on the attached input document, test15.zip (363.8 KB), the output is created as expected, but many blank documents are created as well. Is there a way to create only the documents that contain images and avoid creating blank ones?

Regards
Priya Dharshini J P

Mismatch.zip (448.0 KB)
Hi team,

With the above logic, all group images from the problem document come out mismatched: the images are written in reverse order. The expected output is attached. Kindly help.

Regards
Priya Dharshini J P

Hi team,

I also request a way to delete the extracted content from the source document once it has been placed in the new document, so that images are not repeated.

Thanking you

Due to time constraints, I request a solution as soon as possible.

Regards
Priya Dharshini J P

@priyadharshini

Thanks for your inquiry. You want to extract the images that appear before text starting with “Fig” or “Figure” in a Word document. You also want to remove the empty paragraphs from the output document. We already shared the solution to your queries in the following thread. Please use the same approach and modify the code according to your use cases.

Best Regards,
Tahir Manzoor

But my problem is that the group images are created in reverse order compared with the source document, and many blank documents are created during execution. In addition, I request a way to delete the extracted images from the source document after execution.

Thanking you

Thank you @tahir.manzoor

Using the code mentioned above,

Document doc = new Document(MyDir + "Imageproblem.docx");
int i = 1;
ArrayList nodes = new ArrayList();

//Remove empty paragraphs
for (Paragraph paragraph : (Iterable<Paragraph>) doc.getChildNodes(NodeType.PARAGRAPH, true)) {
    if (paragraph.toString(SaveFormat.TEXT).trim().length() == 0
            && paragraph.getChildNodes(NodeType.SHAPE, true).getCount() == 0
            && paragraph.getText().contains(ControlChar.PAGE_BREAK) == false) {
        paragraph.remove();
    }
}

//Get the paragraphs that start with "Fig".
for (Paragraph paragraph : (Iterable<Paragraph>) doc.getChildNodes(NodeType.PARAGRAPH, true))
{
    if(paragraph.toString(SaveFormat.TEXT).trim().startsWith("Fig"))
    {
        Node previousPara = paragraph.getPreviousSibling();
        while (previousPara != null
                && previousPara.getNodeType() == NodeType.PARAGRAPH
                && previousPara.toString(SaveFormat.TEXT).trim().length() == 0
                && ((Paragraph)previousPara).getChildNodes(NodeType.SHAPE, true).getCount() > 0)
        {
            if(previousPara != null)
                nodes.add(previousPara);
            previousPara = previousPara.getPreviousSibling();
        }

        //Extract the consecutive shapes and export them into new document
        Document dstDoc = new Document();
        for (Paragraph para : (Iterable<Paragraph>)nodes)
        {
            NodeImporter importer = new NodeImporter(doc, dstDoc, ImportFormatMode.KEEP_SOURCE_FORMATTING);
            Node newNode = importer.importNode(para, true);
            dstDoc.getFirstSection().getBody().appendChild(newNode);
        }
        dstDoc.save(MyDir + "output"+i+".docx");
        i++;
        nodes.clear();
    }
}

I have the following difficulties:

  1. Many blank/empty documents are created during execution.

  2. Consecutive images (group images) appear in reverse order.
    (For example: if 3 consecutive images are in the source document, the first image appears last and the last image appears first in the output document.)

  3. After images are extracted to a new document, I request a workaround to delete those images from the source document, so that the same image is not extracted again.

I need such a workaround and hope you can help me out.
Thank you for helping out.
The above-mentioned solution works fine for consecutive images, except for the reversed order.

Regards
Priya

@priyadharshini,

Thanks for your inquiry. We have modified the code according to your requirements. Please use the following modified code example.

Document doc = new Document(MyDir + "Problem.docx");
int i = 1;
ArrayList<Paragraph> nodes = new ArrayList<Paragraph>();

//Remove empty paragraphs (no text, no shapes, no page break)
for (Paragraph paragraph : (Iterable<Paragraph>) doc.getChildNodes(NodeType.PARAGRAPH, true)) {
    if (paragraph.toString(SaveFormat.TEXT).trim().length() == 0
            && paragraph.getChildNodes(NodeType.SHAPE, true).getCount() == 0
            && !paragraph.getText().contains(ControlChar.PAGE_BREAK)) {
        paragraph.remove();
    }
}

//Get the paragraphs that start with "Fig".
for (Paragraph paragraph : (Iterable<Paragraph>) doc.getChildNodes(NodeType.PARAGRAPH, true))
{
    if (paragraph.toString(SaveFormat.TEXT).trim().startsWith("Fig"))
    {
        //Walk backwards over the shape-only paragraphs directly before the caption
        Node previousPara = paragraph.getPreviousSibling();
        while (previousPara != null
                && previousPara.getNodeType() == NodeType.PARAGRAPH
                && previousPara.toString(SaveFormat.TEXT).trim().length() == 0
                && ((Paragraph) previousPara).getChildNodes(NodeType.SHAPE, true).getCount() > 0)
        {
            nodes.add((Paragraph) previousPara);
            previousPara = previousPara.getPreviousSibling();
        }

        //Only create an output document when at least one shape paragraph was found
        if (nodes.size() > 0)
        {
            //Reverse the node collection to restore the source document order.
            Collections.reverse(nodes);

            //Extract the consecutive shapes and export them into new document
            Document dstDoc = new Document();
            for (Paragraph para : nodes)
            {
                NodeImporter importer = new NodeImporter(doc, dstDoc, ImportFormatMode.KEEP_SOURCE_FORMATTING);
                Node newNode = importer.importNode(para, true);
                dstDoc.getFirstSection().getBody().appendChild(newNode);
            }
            //Remove the first empty paragraph
            if (dstDoc.getFirstSection().getBody().getFirstParagraph().toString(SaveFormat.TEXT).trim().length() == 0)
                dstDoc.getFirstSection().getBody().getFirstParagraph().remove();
            dstDoc.save(MyDir + "output" + i + ".docx");
            i++;
            nodes.clear();
        }
    }
}

The above code example does not import duplicate images into the output document.
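To make the two changes in the modified code easier to follow (the `nodes.size() > 0` check and `Collections.reverse`), here is a minimal sketch that uses plain strings in place of paragraphs. It does not use Aspose.Words at all; the class name `FigureGroupSketch`, the method names, and the `IMG:` marker for a shape-only paragraph are hypothetical and exist only for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class FigureGroupSketch {
    // Hypothetical stand-in: "IMG:<name>" models a paragraph that holds only a shape.
    static boolean isImageOnly(String paragraph) {
        return paragraph.startsWith("IMG:");
    }

    // For every caption starting with "Fig", walk backwards over the image-only
    // entries directly before it, then restore document order.
    static List<List<String>> extractGroups(List<String> paragraphs) {
        List<List<String>> groups = new ArrayList<>();
        for (int i = 0; i < paragraphs.size(); i++) {
            if (!paragraphs.get(i).trim().startsWith("Fig")) {
                continue;
            }
            List<String> group = new ArrayList<>();
            for (int j = i - 1; j >= 0 && isImageOnly(paragraphs.get(j)); j--) {
                group.add(paragraphs.get(j)); // collected last-to-first
            }
            if (group.isEmpty()) {
                continue; // nothing before this caption: skip, so no blank output
            }
            Collections.reverse(group); // back to first-to-last
            groups.add(group);
        }
        return groups;
    }

    public static void main(String[] args) {
        List<String> doc = Arrays.asList(
                "Intro text", "IMG:a", "IMG:b", "IMG:c", "Fig 1. Three panels",
                "Fig 2 is discussed here", "IMG:d", "Figure 3");
        System.out.println(extractGroups(doc)); // [[IMG:a, IMG:b, IMG:c], [IMG:d]]
    }
}
```

The backward walk is what collects a group last-to-first, so reversing the list before import restores the source order. The empty check is what stops a "Fig" paragraph with no images before it (such as a sentence that merely starts with "Fig") from producing a blank document.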

Best Regards,
Tahir Manzoor

Thank you @tahir.manzoor, the code works excellently, just as we expected. However, because inline images are also extracted by the different mechanisms you mentioned at earlier stages, we get duplicate images. To avoid that, we request a way to delete the extracted images from the source document after extraction. I am very thankful for your continuous support and solutions. We are able to perform well thanks to the absolutely perfect replies from @tahir.manzoor
Thanking You
Priya Dharshini J P
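The request above (delete content from the source once it has been extracted) can be pictured as "copy first, then remove the originals". The sketch below shows that idea on a plain list of strings; it is not Aspose.Words code, and the class name `ExtractAndRemoveSketch`, the method `moveGroupBefore`, and the `IMG:` marker are hypothetical. In the code posted in this thread, the corresponding step would presumably be calling `remove()` on each collected paragraph after it has been imported, but that is an assumption and was not tested against the library here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class ExtractAndRemoveSketch {
    // Moves the image-only entries directly before captionIndex out of the source.
    // "IMG:" is a hypothetical marker for a paragraph that holds only a shape.
    static List<String> moveGroupBefore(List<String> source, int captionIndex) {
        int start = captionIndex;
        while (start > 0 && source.get(start - 1).startsWith("IMG:")) {
            start--;
        }
        // Copy first (the "import" step), then delete the originals from the source.
        List<String> extracted = new ArrayList<>(source.subList(start, captionIndex));
        source.subList(start, captionIndex).clear();
        return extracted;
    }

    public static void main(String[] args) {
        List<String> source = new ArrayList<>(Arrays.asList("Intro", "IMG:a", "IMG:b", "Fig 1"));
        List<String> extracted = moveGroupBefore(source, 3);
        System.out.println(extracted); // [IMG:a, IMG:b]
        System.out.println(source);    // [Intro, Fig 1]
    }
}
```

Because the originals are gone after the call, a later extraction pass over the same source cannot pick up the same images again.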

Hi @tahir.manzoor,
Can I expect a solution soon? Our showcase is nearing…

Regards
Priya

@priyadharshini,

Thanks for your inquiry. We have not found the duplicate images issue in output documents. Could you please share the following resources here for testing?

  • Your input document.
  • The page numbers of the input document whose content is duplicated.
  • The output documents that show the undesired behavior.

Thanks for your cooperation.

Best Regards,
Tahir Manzoor

Code.zip (7.1 KB)
Support.zip (1.9 MB)
test (15).zip (363.8 KB)
test (8).zip (2.7 MB)
test (2).zip (2.7 MB)

Hi @tahir.manzoor,

I have attached the code we have been using for extraction, based on your solutions, along with test files that produce duplicates when executed.

Thank you for all the help, @tahir.manzoor. We await a solution.
Regards
Priya Dharshini J P

@priyadharshini,

Thanks for sharing the input documents. In case you are using an old version of Aspose.Words, we suggest you use the latest version, Aspose.Words for Java 17.6. We have not found any duplicate images in the output documents when using the code example shared in the following post.

Output documents : output test 8.zip (1.9 MB)
output test (2).zip (1.6 MB)
output test (15).zip (260.5 KB)

Best Regards,
Tahir Manzoor

Thank you for all the help and support @tahir.manzoor

Regards
Priya Dharshini J P