Removing empty pages

Hi Team,
The requirement is to extract the images and save them into new documents. The extraction process uses paragraph nodes, with the figure caption as the keyword. In my code I have separated the image handling in the following ways:
Section A - handling figures with the caption as the previous sibling
Section B - handling images with the caption as the next sibling
Section C - handling images inside a table
Section D - handling images in landscape mode
Section E - handling label images
The input document has images inside tables and images with the figure caption as the next sibling. These images are extracted.

Please help me resolve the following issues:
Issue 1 - In section A, some empty documents are created along with the output. How can I delete the empty documents created during execution?
Issue 2 - In section D, some figure captions are extracted along with the output. How can I delete these figure captions?

The source code Source.zip (8.4 KB)

The input test.zip (1.9 MB)

The actual output Actual Output.zip (2.1 MB)

The expected output Expected Output.zip (1.9 MB)

Thank you very much,
Regards,
Pria.

@akshayapria,

Thanks for your inquiry.

For issue 1, you can check whether the extracted document contains any shape nodes. If there is no shape node in the document, do not save it. Please check the second IF condition in the following code snippet.

For issue 2, remove the first paragraph of the extracted document when it starts with the caption text. Please check the first IF condition in the following code snippet.

Document dstDoc = new Document();

//Your code to extract the content


// Issue 2: remove the leading paragraph when it is a figure caption.
if (dstDoc.getFirstSection().getBody().getFirstParagraph().toString(SaveFormat.TEXT).trim().startsWith("Figure"))
{
    dstDoc.getFirstSection().getBody().getFirstParagraph().remove();
}

// Issue 1: save the document only when it contains at least one shape, so no empty documents are created.
if (dstDoc.getChildNodes(NodeType.SHAPE, true).getCount() > 0)
    dstDoc.save("output.docx");

Hi @tahir.manzoor

Thank you very much.

Both issues have been resolved.

Regards,
pria

Hi @tahir.manzoor,

Thank you very much.

Oops, another document has the same issue.

The input sample is test.zip (1.1 MB)

The actual output is actual output.zip (1.1 MB)

Thanks in advance,
pria

@akshayapria,

Thanks for your inquiry.

We already shared a similar code snippet with you. In the shared document, the caption paragraph starts with the text “Fig”. You can change the IF condition according to your requirement. Please replace “Figure” with “Fig” in the above IF condition.
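
Since the caption prefix differs between documents (“Figure” in one, “Fig” in another), the check could be moved into a small helper so it is changed in one place only. The following is a minimal sketch in plain Java, not part of Aspose.Words: the class name CaptionMatcher and the method isFigureCaption are hypothetical, and the case-insensitive match is an assumption (the snippet above uses a case-sensitive startsWith).

```java
import java.util.regex.Pattern;

final class CaptionMatcher {
    // Matches "Fig" or "Figure" at the start of the text, in any letter case,
    // followed by a word boundary (so "Fig. 1" matches but "Figment" does not).
    // Assumption: captions in the documents follow this pattern.
    private static final Pattern CAPTION_PREFIX =
            Pattern.compile("^fig(?:ure)?\\b", Pattern.CASE_INSENSITIVE);

    private CaptionMatcher() {
    }

    // Hypothetical helper: returns true when the trimmed paragraph text
    // starts with a figure caption prefix.
    static boolean isFigureCaption(String paragraphText) {
        if (paragraphText == null) {
            return false;
        }
        return CAPTION_PREFIX.matcher(paragraphText.trim()).find();
    }
}
```

With such a helper, the IF condition could pass the paragraph text, for example the result of toString(SaveFormat.TEXT) on the first paragraph, to isFigureCaption instead of calling startsWith directly.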

Hi @tahir.manzoor ,

Thanks for your feedback

As per your feedback, I changed “Figure” to “Fig”. However, the fig captions are extracted as separate documents.

The actual output is actual output.zip (1.1 MB)

Thanks & regards ,
pria

@akshayapria,

Thanks for your inquiry. Please share your expected output documents. What will be the expected output document for “_Fig_land10_Fig_1.docx”? We will then provide you more information on this.

Hi @tahir,

Yes, exactly.

The requirement is to extract the images only, but some fig captions are also extracted as separate documents, like _Fig_land10_Fig_1.docx.

So please help me remove those documents from the output folder.

The expected output is expected output.zip (1.0 MB)

regards,
pria.

@akshayapria,

Thanks for sharing the documents. Please use the following code example to get the desired output documents.

Document interimdoc = new Document(MyDir + "test.docx");
int i = 1;
ArrayList nodes = new ArrayList();
// Get the paragraphs that start with "Fig".
for (Paragraph paragraph : (Iterable<Paragraph>) interimdoc.getChildNodes(NodeType.PARAGRAPH, true)) {
    if (paragraph.toString(SaveFormat.TEXT).trim().startsWith("Fig")) {

        Node previousPara = paragraph.getPreviousSibling();

        while (previousPara != null && previousPara.getNodeType() == NodeType.PARAGRAPH
                && !previousPara.toString(SaveFormat.TEXT).trim().startsWith("Fig")
                && ((Paragraph) previousPara).getChildNodes(NodeType.SHAPE, true).getCount() > 0) {
            if (previousPara != null)
                nodes.add(previousPara);
            previousPara = previousPara.getPreviousSibling();
        }

        if (nodes.size() > 0) {
            // Reverse the node collection.
            Collections.reverse(nodes);

            // Extract the consecutive shapes and export them into new document
            Document dstDoc = new Document();
            // Create the importer once so styles and lists are not re-imported for every node.
            NodeImporter importer = new NodeImporter(interimdoc, dstDoc, ImportFormatMode.KEEP_SOURCE_FORMATTING);
            for (Paragraph para : (Iterable<Paragraph>) nodes) {
                Node newNode = importer.importNode(para, true);
                dstDoc.getFirstSection().getBody().appendChild(newNode);
            }
            dstDoc.save(MyDir + "output"+i+".docx");
            i++;
        }
        nodes.clear();
    }
}
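
To make the backward walk in the code above easier to follow, here is a minimal sketch of the same logic in plain Java, without Aspose.Words. The names FigureGrouping, Para and imagesAbove are hypothetical stand-ins: Para models a body paragraph only by its text and by whether it contains a shape.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

final class FigureGrouping {
    // Hypothetical stand-in for a body paragraph.
    static final class Para {
        final String text;
        final boolean hasShape;

        Para(String text, boolean hasShape) {
            this.text = text;
            this.hasShape = hasShape;
        }
    }

    private FigureGrouping() {
    }

    static boolean isCaption(Para p) {
        return p.text.trim().startsWith("Fig");
    }

    // Walks backwards from the caption at captionIndex and collects the
    // consecutive shape-bearing, non-caption paragraphs directly above it.
    // The result is reversed so it is returned in document order.
    static List<Para> imagesAbove(List<Para> body, int captionIndex) {
        List<Para> collected = new ArrayList<>();
        for (int i = captionIndex - 1; i >= 0; i--) {
            Para p = body.get(i);
            if (isCaption(p) || !p.hasShape) {
                break;
            }
            collected.add(p);
        }
        Collections.reverse(collected);
        return collected;
    }
}
```

An empty result corresponds to the nodes.size() > 0 check above: when nothing is collected, no document is created, so a caption without an image directly above it does not produce an output file.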

Hi @tahir.manzoor,

Thank you very much.

Yes, this gives the exact output.

However, it does not work for another document. Removing nodes.add(paragraph) does not work for that document; the output folder is empty.

I have attached the input Test.zip (661.7 KB)

Thanks and regards,
pria

@akshayapria,

Thanks for your inquiry. Please note that the same code will not work for all of your use cases. You need to modify the code according to your requirements. We have shared similar code examples with you in a different forum thread. We suggest you list all of your use cases and write the code accordingly.

The NodeImporter class allows you to efficiently perform repeated imports of nodes from one document to another. Please read the details of this class.

Moreover, we suggest you please read the following article about extracting the contents.
Extract Selected Content Between Nodes

Please let us know if you have any more queries.

Hi @tahir.manzoor ,

I need both cases.

If I use both cases, the same images are extracted twice.

Please help me resolve this.

regards,
pria

@akshayapria,

Thanks for your inquiry.

For the second use case you shared in this thread, please use the following conditions in the while loop.

while (previousPara != null && previousPara.getNodeType() == NodeType.PARAGRAPH
        && (previousPara.toString(SaveFormat.TEXT).trim().length() == 0
        || !previousPara.toString(SaveFormat.TEXT).trim().startsWith("Fig"))
        && ((Paragraph) previousPara).getChildNodes(NodeType.SHAPE, true).getCount() > 0) {
    if (previousPara != null)
        nodes.add(previousPara);
    previousPara = previousPara.getPreviousSibling();
}

In this case, the Fig caption is inside the shape node. Please check the attached DOM image. DOM.png (30.5 KB)

Please use the following code example for this use case. Hope this helps you.

Document interimdoc = new Document(MyDir + "Test2.docx");
int i = 1;
ArrayList nodes = new ArrayList();
// Get the paragraphs that start with "Fig".
for (Paragraph paragraph : (Iterable<Paragraph>) interimdoc.getChildNodes(NodeType.PARAGRAPH, true)) {
    if (paragraph.toString(SaveFormat.TEXT).trim().startsWith("Fig")) {
        System.out.println(paragraph.toString(SaveFormat.TEXT).trim());
        Node parentNode = paragraph.getAncestor(NodeType.SHAPE);

        if (parentNode != null && parentNode.getNodeType() == NodeType.SHAPE) {
            Paragraph parentPara = ((Shape) parentNode).getParentParagraph();
            paragraph.remove();

            Document dstDoc = new Document();

            NodeImporter importer = new NodeImporter(interimdoc, dstDoc, ImportFormatMode.KEEP_SOURCE_FORMATTING);
            Node newNode = importer.importNode(parentPara, true);
            dstDoc.getFirstSection().getBody().appendChild(newNode);

            dstDoc.save(MyDir + "output" + i + ".docx");
            i++;
        }
    }
}
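
If both use cases are run over the same document, one possible way to avoid extracting the same image twice is to remember which source paragraphs have already been exported and skip them on the second pass. This is only a sketch in plain Java under that assumption; ExportTracker and markExported are hypothetical names, not Aspose.Words API.

```java
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.Set;

final class ExportTracker<T> {
    // Identity-based set: two distinct nodes with equal content are still
    // treated as different nodes, while the same node is recorded only once.
    private final Set<T> exported =
            Collections.newSetFromMap(new IdentityHashMap<T, Boolean>());

    // Returns true the first time a given object is offered, false afterwards.
    boolean markExported(T sourceNode) {
        return exported.add(sourceNode);
    }
}
```

In the extraction code, the import and save steps could then be wrapped in a check such as if (tracker.markExported(parentPara)) { ... }, with one tracker instance shared by both use cases.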

Hi @tahir.manzoor ,

Thank you very much.

The above two issues are solved.

However, some images are still not extracted. Please let me know how to extract those images.

The source code source.zip (8.8 KB)

The input test.zip (1.9 MB)

The actual output Actual output.zip (248.6 KB)

The expected output expected output.zip (2.0 MB)

Thank you very much.
pria.

@akshayapria,

Thanks for your inquiry.

Please use the following code example to get the desired output.

Document doc = new Document(MyDir + "test.doc");
DocumentBuilder builder = new DocumentBuilder(doc);
int i = 1;
NodeCollection paragraphs = doc.getChildNodes(NodeType.PARAGRAPH, true);
for (Paragraph paragraph : (Iterable<Paragraph>) paragraphs)
{
    if(paragraph.toString(SaveFormat.TEXT).trim().startsWith("Fig")
            && paragraph.getNextSibling() != null
            &&  paragraph.getNextSibling().getNodeType() == NodeType.PARAGRAPH
            &&  ((Paragraph)paragraph.getNextSibling()).getChildNodes(NodeType.SHAPE, true).getCount() > 0)
    {
        Document dstDoc = new Document();
        NodeCollection shapes = ((Paragraph)paragraph.getNextSibling()).getChildNodes(NodeType.SHAPE, true);
        for (Shape shape : (Iterable<Shape>) shapes)
        {
            NodeImporter importer = new NodeImporter(doc, dstDoc, ImportFormatMode.KEEP_SOURCE_FORMATTING);
            Node newNode = importer.importNode(shape, true);
            dstDoc.getFirstSection().getBody().getFirstParagraph().appendChild(newNode);
            dstDoc.save(MyDir + "output"+i+".docx");
            i++;
        }
    }
}

Hi Tahir,

Thank you for your feedback.

Thank you very much. It extracts the images, but there are some issues.
The issues are:
Issue 1 - It extracts only image (b). The source document (same input) has 3(a) and 3(b), but only 3(b) is extracted; 3(a) is handled separately. How can I get the previous sibling image? Please help me extract 3(a) and 3(b) as a single figure.

Issue 2 - In another document, some images are not extracted. Please help me extract those images.

The input is test.zip (505.4 KB)
The actual output actual output.zip (384.1 KB)
The expected output expected_output.zip (655.2 KB)

Thanks & regards,
pria

@akshayapria,

Thanks for your inquiry. In this case, some Fig captions are list labels. Please use the following code example to get the desired output.

Document interimdoc = new Document(MyDir + "test.docx");
interimdoc.updateListLabels();
int i = 1;
ArrayList nodes = new ArrayList();
// Get the paragraphs that start with "Fig".
for (Paragraph paragraph : (Iterable<Paragraph>) interimdoc.getChildNodes(NodeType.PARAGRAPH, true)) {
    if (paragraph.toString(SaveFormat.TEXT).trim().startsWith("Fig")
            || (paragraph.isListItem() && paragraph.getListLabel().getLabelString().startsWith("Fig"))) {

        Node previousPara = paragraph.getPreviousSibling();

        while (previousPara != null && previousPara.getNodeType() == NodeType.PARAGRAPH
                && (previousPara.toString(SaveFormat.TEXT).trim().length() == 0
                || !previousPara.toString(SaveFormat.TEXT).trim().startsWith("Fig"))
                && ((Paragraph) previousPara).getChildNodes(NodeType.SHAPE, true).getCount() > 0)
        {
            if (previousPara != null)
                nodes.add(previousPara);
            previousPara = previousPara.getPreviousSibling();
        }

        if (nodes.size() > 0) {
            // Reverse the node collection.
            Collections.reverse(nodes);

            // Extract the consecutive shapes and export them into new document
            Document dstDoc = new Document();
            // Create the importer once so styles and lists are not re-imported for every node.
            NodeImporter importer = new NodeImporter(interimdoc, dstDoc, ImportFormatMode.KEEP_SOURCE_FORMATTING);
            for (Paragraph para : (Iterable<Paragraph>) nodes) {
                Node newNode = importer.importNode(para, true);
                dstDoc.getFirstSection().getBody().appendChild(newNode);
            }
            dstDoc.save(MyDir + "output"+i+".docx");
            i++;
        }
        nodes.clear();
    }
}

Hi @tahir.manzoor,

Thanks for your feedback.

It's really helpful for me.

I have another issue. The input sample has two images, but only one image is extracted, and some fig captions are also extracted as separate documents.

Please help me extract the remaining image and also remove the documents that contain only a fig caption.

The input input.zip (195.6 KB)

the expected output expected output.zip (165.0 KB)

The actual output actual output.zip (99.8 KB)

Thanks & regards,
pria

@akshayapria,

Thanks for your inquiry. We are working on your query and will get back to you soon.

@akshayapria,

Thanks for your patience. In this case, please use the following code example to get the desired output. Hope this helps you.

Document doc = new Document(MyDir + "input.docx");
int i = 1;
NodeCollection paragraphs = doc.getChildNodes(NodeType.PARAGRAPH, true);
for (Paragraph paragraph : (Iterable<Paragraph>) paragraphs)
{
    if(paragraph.toString(SaveFormat.TEXT).trim().contains("Figure"))
    {
        Node node = paragraph.getNextSibling();
        // getNextSibling() returns null after the last node, so guard against it.
        while (node != null && node.getNodeType() == NodeType.PARAGRAPH
                && node.toString(SaveFormat.TEXT).trim().length() == 0
                && ((Paragraph)node).getChildNodes(NodeType.SHAPE, true).getCount() == 0)
        {
            node = node.getNextSibling();
        }

        if(node != null
            &&  node.getNodeType() == NodeType.PARAGRAPH
            &&  ((Paragraph)node).getChildNodes(NodeType.SHAPE, true).getCount() > 0)
        {
            Document dstDoc = new Document();
            NodeCollection shapes = ((Paragraph)node).getChildNodes(NodeType.SHAPE, true);
            for (Shape shape : (Iterable<Shape>) shapes)
            {
                NodeImporter importer = new NodeImporter(doc, dstDoc, ImportFormatMode.KEEP_SOURCE_FORMATTING);
                Node newNode = importer.importNode(shape, true);
                dstDoc.getFirstSection().getBody().getFirstParagraph().appendChild(newNode);
                dstDoc.save(MyDir + "output"+i+".docx");
                i++;
            }
        }
    }
}
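
The forward scan in the code above (skip empty paragraphs after the caption, then take the first paragraph that holds a shape) can be sketched in plain Java as follows, including the case where the caption is the last content in the document. NextImageFinder, Para and nextImageIndex are hypothetical names used only for illustration; Para models a paragraph by its text and by whether it contains a shape.

```java
import java.util.List;

final class NextImageFinder {
    // Hypothetical stand-in for a body paragraph.
    static final class Para {
        final String text;
        final boolean hasShape;

        Para(String text, boolean hasShape) {
            this.text = text;
            this.hasShape = hasShape;
        }
    }

    private NextImageFinder() {
    }

    // Returns the index of the first shape-bearing paragraph after captionIndex,
    // skipping only empty paragraphs without shapes. Returns -1 when the scan
    // reaches the end of the body or stops at a paragraph without a shape.
    static int nextImageIndex(List<Para> body, int captionIndex) {
        int i = captionIndex + 1;
        while (i < body.size()
                && body.get(i).text.trim().isEmpty()
                && !body.get(i).hasShape) {
            i++;
        }
        return (i < body.size() && body.get(i).hasShape) ? i : -1;
    }
}
```

The bounds check in the loop plays the role of a null check on the sibling node: without it, a caption followed only by empty paragraphs would run past the end of the body.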