Extract unnumbered image

Hi,

Please help me extract the unnumbered images from the attached Word document.
Sample.zip (156.5 KB)

Thanks in advance.

@MikeLak

Thanks for your inquiry. Please ZIP and attach your expected output documents here for our reference. We will then provide you with more information about your query, along with code.

Sample.zip (484.1 KB)
Here I have attached the expected output.

@MikeLak

Thanks for sharing the details. Please use the following code example to get the desired output.

Document doc = new Document(MyDir + "sample.docx");
int i = 1;

// Visit every shape in the document, including shapes nested in other nodes.
for (Shape shape : (Iterable<Shape>) doc.getChildNodes(NodeType.SHAPE, true))
{
    // Copy the paragraph that contains the shape into a new, empty document.
    Document dstDoc = new Document();
    NodeImporter importer = new NodeImporter(doc, dstDoc, ImportFormatMode.KEEP_SOURCE_FORMATTING);
    Node newNode = importer.importNode(shape.getParentParagraph(), true);
    dstDoc.getFirstSection().getBody().appendChild(newNode);

    // One output file per shape: output1.docx, output2.docx, ...
    dstDoc.save(MyDir + "output" + i + ".docx");
    i++;
}

Hi Tahir

I am able to extract only two images. The third image is not extracted. Sample.zip (121.1 KB) sample2.zip (424.7 KB)
In sample2.zip not all the images are extracted.

@MikeLak

Thanks for your inquiry. The code example shared in my previous post works for the shared document (sample.docx).

For sample2.docx, we suggest the following approach:

  1. Iterate over all paragraphs of the document and check whether each one contains the text “Fig”, “(a)”, or “(b)”.
  2. Import each matching paragraph node into a new document using the NodeImporter.importNode method.
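
The check in step 1 is a plain string test, so it can be tried out without Aspose.Words. Below is a minimal sketch, assuming the paragraph text has already been obtained (for example with paragraph.toString(SaveFormat.TEXT)). CaptionFilter and looksLikeCaption are hypothetical names used only for illustration.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Aspose.Words: decides whether a paragraph's
// plain text looks like a figure caption or a sub-figure label line.
final class CaptionFilter {
    // Markers taken from step 1: "Fig", "(a)" and "(b)".
    private static final List<String> MARKERS = Arrays.asList("Fig", "(a)", "(b)");

    private CaptionFilter() {}

    // Returns true if the text contains at least one of the markers.
    static boolean looksLikeCaption(String paragraphText) {
        if (paragraphText == null) {
            return false;
        }
        for (String marker : MARKERS) {
            if (paragraphText.contains(marker)) {
                return true;
            }
        }
        return false;
    }
}
```

A contains test also matches body text that merely mentions a figure, such as "see Fig. 2", so a stricter test may be needed for documents with such references.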

Could you kindly share the code?

@MikeLak

Thanks for your inquiry. Please use the following code example to get the desired output.

Document doc = new Document(MyDir + "sample2.docx");
int i = 1;

for (Paragraph paragraph : (Iterable<Paragraph>) doc.getChildNodes(NodeType.PARAGRAPH, true))
{
    // Convert the paragraph to text once and test it for the caption markers.
    String text = paragraph.toString(SaveFormat.TEXT);
    if (!(text.contains("Fig") || text.contains("(a)") || text.contains("(b)")))
        continue;

    System.out.println(paragraph.getText());

    // Use the matching paragraph itself if it holds a shape; otherwise fall
    // back to the paragraph immediately before it.
    Paragraph source = null;
    Node previous = paragraph.getPreviousSibling();
    if (paragraph.getChildNodes(NodeType.SHAPE, true).getCount() > 0)
        source = paragraph;
    else if (previous != null
            && previous.getNodeType() == NodeType.PARAGRAPH
            && ((Paragraph) previous).getChildNodes(NodeType.SHAPE, true).getCount() > 0)
        source = (Paragraph) previous;

    if (source != null)
    {
        // Copy the paragraph into a new document and save it as its own file.
        Document dstDoc = new Document();
        NodeImporter importer = new NodeImporter(doc, dstDoc, ImportFormatMode.KEEP_SOURCE_FORMATTING);
        Node newNode = importer.importNode(source, true);
        dstDoc.getFirstSection().getBody().appendChild(newNode);
        dstDoc.save(MyDir + "output" + i + ".docx");
        i++;
    }
}

@tahir.manzoor

I am not able to extract the labelled images (a) and (b). I have attached the input and output files.
ReferenceDoclet_withSingleCell.zip (22 Bytes)
doc2_output.zip (303.6 KB)

@MikeLak

Thanks for your inquiry. The ReferenceDoclet_withSingleCell.zip contains no document. Could you please ZIP and attach your input Word document for testing? We will investigate the issue on our side and provide you with more information.

doc2_Sample.zip (297.4 KB)
Please find attached the input document. The expected output must contain every image as a separate image.
The output should match doc2_output.zip (303.6 KB)

@MikeLak

Thanks for your inquiry. Please use the following code example to get the desired output. Hope this helps you.

Document doc = new Document(MyDir + "doc2_Sample.docx");

int i = 1;
NodeCollection paragraphs = doc.getChildNodes(NodeType.PARAGRAPH, true);
for (Paragraph paragraph : (Iterable<Paragraph>) paragraphs)
{
    // Only "Fig..." caption paragraphs are of interest.
    if (!paragraph.toString(SaveFormat.TEXT).trim().startsWith("Fig"))
        continue;

    // The paragraph before the caption should carry the "(a)" / "(b)" labels.
    Node labelPara = paragraph.getPreviousSibling();
    if (labelPara == null)
        continue;
    String labelText = labelPara.toString(SaveFormat.TEXT);
    if (!(labelText.contains("(a)") || labelText.contains("(b)")))
        continue;

    // The paragraph before the labels should hold the images. Check the node
    // type before casting, because the sibling may not be a paragraph.
    Node imagePara = labelPara.getPreviousSibling();
    if (imagePara == null || imagePara.getNodeType() != NodeType.PARAGRAPH)
        continue;

    // Save each shape as its own document.
    for (Shape shape : (Iterable<Shape>) ((Paragraph) imagePara).getChildNodes(NodeType.SHAPE, true))
    {
        Document dstDoc = new Document();
        NodeImporter importer = new NodeImporter(doc, dstDoc, ImportFormatMode.KEEP_SOURCE_FORMATTING);
        Node newNode = importer.importNode(shape, true);
        dstDoc.getFirstSection().getBody().getFirstParagraph().appendChild(newNode);
        dstDoc.save(MyDir + "output" + i + ".docx");
        i++;
    }
}

Hi @tahir.manzoor

Thanks for the feedback. I have a document from which I am not able to extract the (a) and (b) images separately. The sample input is 3.zip (642.2 KB)
Expected output is Expected_Output.zip (664.4 KB)
Thanks in advance.

@MikeLak

Thanks for your inquiry. In this case, the images are inside a table node. You need to list all of your use cases and handle each one when extracting the images. Please use the following modified code to get the desired output.

Document doc = new Document(MyDir + "3.docx");

int i = 1;
NodeCollection paragraphs = doc.getChildNodes(NodeType.PARAGRAPH, true);
for (Paragraph paragraph : (Iterable<Paragraph>) paragraphs)
{
    // Only "Fig..." caption paragraphs are of interest.
    if (!paragraph.toString(SaveFormat.TEXT).trim().startsWith("Fig"))
        continue;

    // Here the node before the caption is a table that holds both the images
    // and their "(a)" / "(b)" labels.
    Node previous = paragraph.getPreviousSibling();
    if (previous == null || !previous.isComposite())
        continue;
    String previousText = previous.toString(SaveFormat.TEXT);
    if (!(previousText.contains("(a)") || previousText.contains("(b)")))
        continue;

    // Save every shape found anywhere inside that node as its own document.
    for (Shape shape : (Iterable<Shape>) ((CompositeNode) previous).getChildNodes(NodeType.SHAPE, true))
    {
        Document dstDoc = new Document();
        NodeImporter importer = new NodeImporter(doc, dstDoc, ImportFormatMode.KEEP_SOURCE_FORMATTING);
        Node newNode = importer.importNode(shape, true);
        dstDoc.getFirstSection().getBody().getFirstParagraph().appendChild(newNode);
        dstDoc.save(MyDir + "output" + i + ".docx");
        i++;
    }
}
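
One of the use cases worth listing is a figure with more than two panels, where testing only for "(a)" and "(b)" is not enough. As a rough sketch, assuming a sub-label is always a single lowercase letter in parentheses, a regular expression can collect whatever labels a node's text contains. SubLabels and findAll are hypothetical names, not part of the Aspose.Words API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: finds sub-figure labels such as "(a)" or "(c)" in text.
final class SubLabels {
    // Assumption: a sub-label is one lowercase ASCII letter in parentheses.
    private static final Pattern LABEL = Pattern.compile("\\([a-z]\\)");

    private SubLabels() {}

    // Returns the labels in the order in which they appear in the text.
    static List<String> findAll(String text) {
        List<String> labels = new ArrayList<>();
        if (text == null) {
            return labels;
        }
        Matcher matcher = LABEL.matcher(text);
        while (matcher.find()) {
            labels.add(matcher.group());
        }
        return labels;
    }
}
```

A non-empty result from findAll could stand in for the contains("(a)") || contains("(b)") test, and the size of the list gives the number of panels to expect.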

Hi @tahir.manzoor

I have extracted the images using the code below. I now need the matching label, (a) or (b), to be included with each extracted image. Input document is Sample1.zip (596.6 KB)
Expected output is sample_output.zip (610.6 KB)

// filearg and folderName are assumed to be fields defined elsewhere in the class.
private static void unNumberedImageExtrac(Document interimdoc) throws Exception
{
	Document doc = new Document(filearg);

	int i = 1;
	NodeCollection paragraphs = doc.getChildNodes(NodeType.PARAGRAPH, true);
	for (Paragraph paragraph : (Iterable<Paragraph>) paragraphs)
	{
		// Only "Fig..." caption paragraphs are of interest.
		if (!paragraph.toString(SaveFormat.TEXT).trim().startsWith("Fig"))
			continue;

		// The paragraph before the caption should carry the sub-labels.
		Node labelPara = paragraph.getPreviousSibling();
		if (labelPara == null)
			continue;
		String labelText = labelPara.toString(SaveFormat.TEXT);
		if (!(labelText.contains("(a)") || labelText.contains("(b)")
				|| labelText.contains("(c)") || labelText.contains("(d)")))
			continue;

		// The paragraph before the labels should hold the images. Check the
		// node type before casting, because the sibling may not be a paragraph.
		Node imagePara = labelPara.getPreviousSibling();
		if (imagePara == null || imagePara.getNodeType() != NodeType.PARAGRAPH)
			continue;

		// Save each shape as its own document in three formats.
		for (Shape shape : (Iterable<Shape>) ((Paragraph) imagePara).getChildNodes(NodeType.SHAPE, true))
		{
			Document dstDoc = new Document();
			NodeImporter importer = new NodeImporter(doc, dstDoc, ImportFormatMode.KEEP_SOURCE_FORMATTING);
			Node newNode = importer.importNode(shape, true);
			dstDoc.getFirstSection().getBody().getFirstParagraph().appendChild(newNode);
			dstDoc.save(folderName + "output_B" + i + ".docx");
			dstDoc.save(folderName + "output_B" + i + ".jpeg");
			dstDoc.save(folderName + "output_B" + i + ".pdf");
			i++;
		}
	}
}

@MikeLak

Thanks for your inquiry. We are working on your query and will get back to you with a code example. Thanks for your cooperation.

Thanks @tahir.manzoor for all the support… Kindly help.

@MikeLak

For this case, please use the following code example. Hope this helps you.

Document doc = new Document(MyDir + "Sample1.docx");

int i = 1;
NodeCollection paragraphs = doc.getChildNodes(NodeType.PARAGRAPH, true);
for (Paragraph paragraph : (Iterable<Paragraph>) paragraphs)
{
    // Only "Fig..." caption paragraphs are of interest.
    if (!paragraph.toString(SaveFormat.TEXT).trim().startsWith("Fig"))
        continue;

    // The node before the caption carries the "(a)" / "(b)" labels.
    Node label = paragraph.getPreviousSibling();
    if (label == null)
        continue;
    String labelText = label.toString(SaveFormat.TEXT);
    if (!(labelText.contains("(a)") || labelText.contains("(b)")))
        continue;

    // The node before the labels holds the images.
    Node imageNode = label.getPreviousSibling();
    if (imageNode == null || !imageNode.isComposite())
        continue;

    // Position of the shape within this figure: 0 keeps "(a)", 1 keeps "(b)".
    int indexInFigure = 0;
    for (Shape shape : (Iterable<Shape>) ((CompositeNode) imageNode).getChildNodes(NodeType.SHAPE, true))
    {
        Document dstDoc = new Document();
        NodeImporter importer = new NodeImporter(doc, dstDoc, ImportFormatMode.KEEP_SOURCE_FORMATTING);

        // Put the image in the first paragraph and the label node after it.
        Node newNode = importer.importNode(shape, true);
        dstDoc.getFirstSection().getBody().getFirstParagraph().appendChild(newNode);
        newNode = importer.importNode(label, true);
        dstDoc.getFirstSection().getBody().appendChild(newNode);

        // Strip the label that belongs to the other image of the pair.
        String otherLabel = (indexInFigure % 2 == 0) ? "(b)" : "(a)";
        dstDoc.getFirstSection().getBody().getLastParagraph().getRange().replace(otherLabel, "", new FindReplaceOptions());

        dstDoc.save(MyDir + "output" + i + ".docx");
        i++;
        indexInFigure++;
    }
}
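
The alternating removal of "(a)" and "(b)" above assumes that every figure has exactly two panels. One possible generalization, sketched here in plain Java on the label text alone, keeps a single label and drops all the others. LabelText and keepOnly are hypothetical names, and writing the result back into the imported label paragraph is left out of the sketch.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Aspose.Words.
final class LabelText {
    // Assumption: a sub-label is one lowercase ASCII letter in parentheses.
    private static final Pattern LABEL = Pattern.compile("\\([a-z]\\)");

    private LabelText() {}

    // Removes every sub-label except `keep`, then trims surrounding whitespace.
    static String keepOnly(String labelText, String keep) {
        Matcher matcher = LABEL.matcher(labelText);
        StringBuffer out = new StringBuffer();
        while (matcher.find()) {
            String replacement = matcher.group().equals(keep) ? matcher.group() : "";
            // quoteReplacement guards against '$' or '\' being read as group references.
            matcher.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        matcher.appendTail(out);
        return out.toString().trim();
    }
}
```

For a label line such as "(a)      (b)", keepOnly(text, "(a)") gives "(a)" for the first image and keepOnly(text, "(b)") gives "(b)" for the second; the same call works unchanged for "(c)" and "(d)".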

@tahir.manzoor Thanks for sharing the code. It is working fine.