Copy/Extract shape using paragraph node in JAVA

priyadharshini · June 22, 2017, 6:19am

image.zip (48.4 KB)
Hi team,
I need a workaround to copy/extract shapes from a DOCX file into a new DOCX file, based on a figure caption starting with “Fig”, using paragraph and curNode.
Group images should be extracted as well.

Regards
Priya Dharshini J P

tahir.manzoor · June 22, 2017, 10:21am

Hi Priya,

Thanks for your inquiry. In your document, there are two shapes and many empty paragraphs. The last paragraph’s text starts with “Fig”. Both shapes are in separate paragraphs. Could you please share the parameters that you want to use to extract the contents?

You can simply remove the paragraph whose text starts with “Fig” to get the desired output. Please check the following code example.

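import com.aspose.words.*;

// MyDir is the folder that holds the input and output documents.
Document doc = new Document(MyDir + "in.docx");
// Iterate over a snapshot (toArray) because nodes are removed inside the loop.
for (Node node : doc.getChildNodes(NodeType.PARAGRAPH, true).toArray())
{
    Paragraph paragraph = (Paragraph) node;
    // Remove every paragraph whose text starts with "Fig".
    if (paragraph.toString(SaveFormat.TEXT).trim().startsWith("Fig"))
    {
        paragraph.remove();
    }
}
doc.save(MyDir + "output.docx");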

priyadharshini · June 22, 2017, 10:31am

Hi tahir,

The document shared here is just one page of a journal document, so removing the paragraph and saving the result as a new document does not work. I need consecutive images like these to be extracted with a filter such as: a shape followed by another shape, and so on, until text starting with “Fig” is found. The selected shapes should then be saved together in a new document as a group image.
Based on previous searches: if the next sibling of a shape is again a shape, treat that shape as curNode and keep searching until a string starting with “Fig” is found.
I need such a workaround and hope you can help me out.

Thanks
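
The rule described above (collect a run of consecutive shape-only paragraphs, and close the run when a paragraph starting with “Fig” is reached) can be sketched in plain Java. This is only a model of the logic under stated assumptions: FigureGrouper and Block are hypothetical names standing in for body-level paragraphs, not Aspose.Words types, and the sketch does not read or write DOCX files.

```java
import java.util.ArrayList;
import java.util.List;

final class FigureGrouper {

    /** Hypothetical stand-in for a body-level paragraph. */
    static final class Block {
        final String name;      // label used only to identify the block in this sketch
        final String text;      // visible text of the paragraph
        final boolean hasShape; // true if the paragraph contains at least one shape

        Block(String name, String text, boolean hasShape) {
            this.name = name;
            this.text = text;
            this.hasShape = hasShape;
        }

        boolean isShapeOnly() { return hasShape && text.trim().isEmpty(); }
        boolean isCaption()   { return text.trim().startsWith("Fig"); }
    }

    /** Returns one group per caption: the run of shape-only blocks directly above it. */
    static List<List<String>> groupShapes(List<Block> blocks) {
        List<List<String>> groups = new ArrayList<>();
        List<String> run = new ArrayList<>();
        for (Block block : blocks) {
            if (block.isShapeOnly()) {
                run.add(block.name);          // extend the current run of shapes
                continue;
            }
            if (block.isCaption() && !run.isEmpty()) {
                groups.add(run);              // a caption closes a non-empty run
            }
            run = new ArrayList<>();          // any other block ends the run
        }
        return groups;
    }
}
```

Because this model walks forward through the blocks, each group is already in document order, and a caption with no shapes directly above it produces no group at all.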

tahir.manzoor · June 22, 2017, 6:38pm

@priyadharshini

Thanks for sharing your requirement in detail. Please allow us some time to analyze your desired output. We will get back to you soon with a code example that meets your requirement.

Best Regards,
Tahir Manzoor

priyadharshini · June 23, 2017, 2:27am

Thank you… waiting eagerly for reply…

tahir.manzoor · June 23, 2017, 4:32pm

@priyadharshini

Thanks for your patience. Please use the following code example to achieve your requirement. Hope this helps you.

Document doc = new Document(MyDir + "Imageproblem.docx");
int i = 1;
ArrayList nodes = new ArrayList();

//Remove empty paragraphs. Iterate over a snapshot (toArray) because nodes are removed inside the loop.
for (Node node : doc.getChildNodes(NodeType.PARAGRAPH, true).toArray()) {
    Paragraph paragraph = (Paragraph) node;
    if (paragraph.toString(SaveFormat.TEXT).trim().length() == 0
            && paragraph.getChildNodes(NodeType.SHAPE, true).getCount() == 0
            && !paragraph.getText().contains(ControlChar.PAGE_BREAK)) {
        paragraph.remove();
    }
}

//Get the paragraphs that start with "Fig".
for (Paragraph  paragraph : (Iterable<Paragraph>) doc.getChildNodes(NodeType.PARAGRAPH, true))
{
    if(paragraph.toString(SaveFormat.TEXT).trim().startsWith("Fig"))
    {
        Node previousPara = paragraph.getPreviousSibling();
        while (previousPara != null
                && previousPara.getNodeType() == NodeType.PARAGRAPH
                && previousPara.toString(SaveFormat.TEXT).trim().length() == 0
                && ((Paragraph)previousPara).getChildNodes(NodeType.SHAPE, true).getCount() > 0)
        {
            if(previousPara != null)
                nodes.add(previousPara);
            previousPara = previousPara.getPreviousSibling();
        }

        //Extract the consecutive shapes and export them into new document
        Document dstDoc = new Document();
        for (Paragraph para : (Iterable<Paragraph>)nodes)
        {
            NodeImporter importer = new NodeImporter(doc, dstDoc, ImportFormatMode.KEEP_SOURCE_FORMATTING);
            Node newNode = importer.importNode(para, true);
            dstDoc.getFirstSection().getBody().appendChild(newNode);
        }
        dstDoc.save(MyDir + "output"+i+".docx");
        i++;
        nodes.clear();
    }
}

priyadharshini · June 24, 2017, 3:02pm

Thank you Tahir… It is absolutely working.

Regards
Priya Dharshini J P

tahir.manzoor · June 24, 2017, 5:58pm

@priyadharshini

Thanks for your feedback. Please feel free to ask if you have any questions about Aspose.Words; we will be happy to help you.

priyadharshini · June 25, 2017, 4:57am

Thank you tahir.

When I run the logic above on the attached input document test15.zip (363.8 KB), the output is created as expected, but many blank documents are created as well. Is there a way to create only the documents that contain images and avoid creating blank documents?

Regards
Priya Dharshini J P

priyadharshini · June 25, 2017, 6:42am

Mismatch.zip (448.0 KB)
Hi team,

With the above logic, the group images from the problem document come out mismatched: the images are created in reverse order. The expected output is attached. Kindly help.

Regards
Priya Dharshini J P

priyadharshini · June 25, 2017, 8:07am

Hi team,

I am also requesting a way to delete/remove the extracted contents from the source document once they have been copied into the new document, to avoid repeated images.

Thanking you

priyadharshini · June 26, 2017, 10:11am

Due to time constraints, I request a solution as soon as possible.

Regards
Priya Dharshini J P

tahir.manzoor · June 26, 2017, 6:18pm

@priyadharshini

Thanks for your inquiry. You want to extract the images that appear before text starting with “Fig” or “Figure” in a Word document. You also want to remove the empty paragraphs from the output document. We already shared the solution to your queries in the following thread. Please use the same approach and modify the code according to your use cases.

Best Regards,
Tahir Manzoor

priyadharshini · June 26, 2017, 6:35pm

But my problem is that the group images are created in reverse order compared with the source document, and many blank documents are created during execution. In addition, I request a way to delete/remove the extracted images from the source document after execution.

Thanking you

priyadharshini · June 27, 2017, 5:07am

Thank you @tahir.manzoor

using the code mentioned above,

Document doc = new Document(MyDir + "Imageproblem.docx");
int i = 1;
ArrayList nodes = new ArrayList();

//Remove empty paragraphs
for (Paragraph paragraph : (Iterable<Paragraph>) doc.getChildNodes(NodeType.PARAGRAPH, true)) {
    if (paragraph.toString(SaveFormat.TEXT).trim().length() == 0
            && paragraph.getChildNodes(NodeType.SHAPE, true).getCount() == 0
            && paragraph.getText().contains(ControlChar.PAGE_BREAK) == false) {
        paragraph.remove();
    }
}

//Get the paragraphs that start with "Fig".
for (Paragraph paragraph : (Iterable<Paragraph>) doc.getChildNodes(NodeType.PARAGRAPH, true))
{
    if(paragraph.toString(SaveFormat.TEXT).trim().startsWith("Fig"))
    {
        Node previousPara = paragraph.getPreviousSibling();
        while (previousPara != null
                && previousPara.getNodeType() == NodeType.PARAGRAPH
                && previousPara.toString(SaveFormat.TEXT).trim().length() == 0
                && ((Paragraph)previousPara).getChildNodes(NodeType.SHAPE, true).getCount() > 0)
        {
            if(previousPara != null)
                nodes.add(previousPara);
            previousPara = previousPara.getPreviousSibling();
        }

        //Extract the consecutive shapes and export them into new document
        Document dstDoc = new Document();
        for (Paragraph para : (Iterable<Paragraph>)nodes)
        {
            NodeImporter importer = new NodeImporter(doc, dstDoc, ImportFormatMode.KEEP_SOURCE_FORMATTING);
            Node newNode = importer.importNode(para, true);
            dstDoc.getFirstSection().getBody().appendChild(newNode);
        }
        dstDoc.save(MyDir + "output"+i+".docx");
        i++;
        nodes.clear();
    }
}

I have the following difficulties:

Many blank/empty documents are created during execution.
Consecutive images (group images) appear in reverse order.
(For example, if there are 3 consecutive images in the source document, the first image appears last and the last image appears first in the output document.)
After images are extracted to a new document, I request a workaround to delete/remove them from the source document, in order to avoid the same image being extracted again.

I need such a workaround and hope you can help me out.
Thank you for your help.
The solution mentioned above works fine for consecutive images, except for the reversed order.

Regards
Priya

tahir.manzoor · June 27, 2017, 8:36am

@priyadharshini,

Thanks for your inquiry. We have modified the code according to your requirements. Please use the following modified code example.

Document doc = new Document(MyDir + "Problem.docx");
int i = 1;
ArrayList nodes = new ArrayList();

//Remove empty paragraphs. Iterate over a snapshot (toArray) because nodes are removed inside the loop.
for (Node node : doc.getChildNodes(NodeType.PARAGRAPH, true).toArray()) {
    Paragraph paragraph = (Paragraph) node;
    if (paragraph.toString(SaveFormat.TEXT).trim().length() == 0
            && paragraph.getChildNodes(NodeType.SHAPE, true).getCount() == 0
            && !paragraph.getText().contains(ControlChar.PAGE_BREAK)) {
        paragraph.remove();
    }
}

//Get the paragraphs that start with "Fig".
for (Paragraph  paragraph : (Iterable<Paragraph>) doc.getChildNodes(NodeType.PARAGRAPH, true))
{
    if(paragraph.toString(SaveFormat.TEXT).trim().startsWith("Fig"))
    {
        Node previousPara = paragraph.getPreviousSibling();
        while (previousPara != null
                && previousPara.getNodeType() == NodeType.PARAGRAPH
                && previousPara.toString(SaveFormat.TEXT).trim().length() == 0
                && ((Paragraph)previousPara).getChildNodes(NodeType.SHAPE, true).getCount() > 0)
        {
            if(previousPara != null)
                nodes.add(previousPara);
            previousPara = previousPara.getPreviousSibling();
        }

        if(nodes.size() > 0)
        {
            //Reverse the node collection.
            Collections.reverse(nodes);

            //Extract the consecutive shapes and export them into new document
            Document dstDoc = new Document();
            for (Paragraph para : (Iterable<Paragraph>)nodes)
            {
                NodeImporter importer = new NodeImporter(doc, dstDoc, ImportFormatMode.KEEP_SOURCE_FORMATTING);
                Node newNode = importer.importNode(para, true);
                dstDoc.getFirstSection().getBody().appendChild(newNode);
            }
            //Remove the first empty paragraph
            if(dstDoc.getFirstSection().getBody().getFirstParagraph().toString(SaveFormat.TEXT).trim().length() == 0)
                dstDoc.getFirstSection().getBody().getFirstParagraph().remove();
            dstDoc.save(MyDir + "output"+i+".docx");
            i++;
            nodes.clear();
        }
    }
}

The above code example does not import duplicate images into the output document.
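
The reversal in the code above is needed because the loop walks backwards from the caption through getPreviousSibling(), so the paragraphs are collected last-to-first. The effect can be illustrated with a minimal plain-Java sketch that uses strings in place of paragraph nodes. BackwardWalkDemo and collectBefore are hypothetical names for illustration only, not Aspose.Words API.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

final class BackwardWalkDemo {

    /** Walks backwards from captionIndex and collects items until a non-"shape" item is hit. */
    static List<String> collectBefore(List<String> items, int captionIndex) {
        List<String> collected = new ArrayList<>();
        for (int i = captionIndex - 1; i >= 0 && items.get(i).startsWith("shape"); i--) {
            collected.add(items.get(i));   // appended in reverse document order
        }
        return collected;
    }

    public static void main(String[] args) {
        List<String> items = List.of("text", "shape1", "shape2", "shape3", "Fig 1");
        List<String> group = collectBefore(items, 4);
        System.out.println(group);         // [shape3, shape2, shape1]
        Collections.reverse(group);        // restore document order
        System.out.println(group);         // [shape1, shape2, shape3]
    }
}
```

When no shape precedes the caption, collectBefore returns an empty list. That is the case the nodes.size() > 0 check guards against, so that no blank output document is saved.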

Best Regards,
Tahir Manzoor

priyadharshini · June 27, 2017, 9:30am

Thank you @tahir.manzoor, the code works excellently, as we expected. However, inline images are also extracted using the other mechanisms you mentioned earlier, so we get duplicate images. To avoid that, we request a way to delete/remove images from the source document after they have been extracted. I am very thankful for your continuous support and solutions. We are able to perform well thanks to the absolutely perfect replies from @tahir.manzoor.
Thanking You
Priya Dharshini J P
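
The requested behavior (take the extracted items out of the source so that a later pass cannot pick them up again) can be modeled with a minimal plain-Java sketch. ExtractAndRemoveDemo and extractBefore are hypothetical names, and strings stand in for paragraphs. In the Aspose.Words code from this thread, the corresponding step would presumably be to call remove() on each collected paragraph after it has been imported into the new document; that is an assumption and is not confirmed anywhere in the thread.

```java
import java.util.ArrayList;
import java.util.List;

final class ExtractAndRemoveDemo {

    /**
     * Removes the run of "shape" items directly before captionIndex from source
     * and returns them in document order.
     */
    static List<String> extractBefore(List<String> source, int captionIndex) {
        int start = captionIndex;
        while (start > 0 && source.get(start - 1).startsWith("shape")) {
            start--;                                   // extend the run backwards
        }
        List<String> run = source.subList(start, captionIndex);
        List<String> extracted = new ArrayList<>(run); // copy first, in document order
        run.clear();                                   // then delete the run from the source
        return extracted;
    }
}
```

After one call, the shapes are gone from the source list, so running the same extraction again for the same caption returns an empty list instead of a duplicate.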

priyadharshini · June 27, 2017, 4:23pm

Hi @tahir.manzoor,
Can I expect a solution soon? Our showcase is nearing…

Regards
Priya

tahir.manzoor · June 28, 2017, 7:40am

@priyadharshini,

Thanks for your inquiry. We have not found the duplicate images issue in the output documents. Could you please share the following resources here for testing?

Your input document.
The page numbers of the input document whose content is duplicated.
The output documents that show the undesired behavior.

Thanks for your cooperation.

Best Regards,
Tahir Manzoor

priyadharshini · June 28, 2017, 8:57am

Code.zip (7.1 KB)
Support.zip (1.9 MB)
test (15).zip (363.8 KB)
test (8).zip (2.7 MB)
test (2).zip (2.7 MB)

Hi @tahir.manzoor,

I have attached the code we have been using for extraction, built from your solutions, and the test files that produce duplication when executed.

Thank you for all the help @tahir.manzoor. We await a solution.
Regards
Priya Dharshini J P