Extract unnumbered image

MikeLak · September 10, 2018, 9:22am

Hi,

Please help me extract the unnumbered images from the Word document.
Sample.zip (156.5 KB)

Thanks in advance.

tahir.manzoor · September 10, 2018, 3:48pm

Thanks for your inquiry. Please ZIP and attach your expected output documents here for our reference. We will then provide more information about your query, along with code.

MikeLak · September 11, 2018, 4:08am

Sample.zip (484.1 KB)
Here I have attached the expected output.

tahir.manzoor · September 11, 2018, 11:38am

@MikeLak

Thanks for sharing the details. Please use the following code example to get the desired output.

Document doc = new Document(MyDir + "sample.docx");
int i = 1;

for (Shape shape : (Iterable<Shape>) doc.getChildNodes(NodeType.SHAPE, true))
{
    Document dstDoc = new Document();
    NodeImporter importer = new NodeImporter(doc, dstDoc, ImportFormatMode.KEEP_SOURCE_FORMATTING);
    Node newNode = importer.importNode(shape.getParentParagraph(), true);
    dstDoc.getFirstSection().getBody().appendChild(newNode);
    dstDoc.save(MyDir + "output"+i+".docx");
    i++;
}

MikeLak · September 12, 2018, 2:38am

Hi Tahir

I am able to extract only two images; the 3rd image is not extracted. Sample.zip (121.1 KB) sample2.zip (424.7 KB)
In sample2.zip, not all the images are extracted.

tahir.manzoor · September 12, 2018, 11:37am

@MikeLak

Thanks for your inquiry. The code example shared in my previous post works for the shared document (sample.docx).

In this case, we suggest the following solution.

Iterate over all paragraphs of the document and check whether each one contains the text “Fig”, “(a)”, or “(b)”.
Import each matching paragraph node into the new document using the NodeImporter.importNode method.
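
The text check in the first step can be sketched in plain Java, without any Aspose.Words types. This is only an illustration of the matching rule: the class name CaptionFilter and the method isCaptionOrLabel are hypothetical, and the string passed in is assumed to be the paragraph's plain text.

```java
// Plain-Java sketch of the caption/label text test only.
// CaptionFilter and isCaptionOrLabel are hypothetical names for illustration;
// they are not part of Aspose.Words.
final class CaptionFilter {
    // Markers named in the suggestion: a figure caption or a sub-figure label.
    private static final String[] MARKERS = {"Fig", "(a)", "(b)"};

    /** Returns true if the paragraph text contains any of the markers. */
    static boolean isCaptionOrLabel(String paragraphText) {
        if (paragraphText == null) {
            return false;
        }
        for (String marker : MARKERS) {
            if (paragraphText.contains(marker)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(isCaptionOrLabel("Fig. 3 Test setup"));
        System.out.println(isCaptionOrLabel("(a) (b)"));
        System.out.println(isCaptionOrLabel("Plain body text"));
    }
}
```

Note that a contains check also matches body text that merely mentions a figure, such as "See Fig. 2 for details"; testing whether the trimmed text starts with "Fig" is stricter.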

MikeLak · September 12, 2018, 11:38am

Can you kindly share the code?

tahir.manzoor · September 12, 2018, 3:33pm

@MikeLak

Thanks for your inquiry. Please use the following code example to get the desired output.

Document doc = new Document(MyDir + "sample2.docx");
int i = 1;

for (Paragraph paragraph : (Iterable<Paragraph>) doc.getChildNodes(NodeType.PARAGRAPH, true))
{
    if(paragraph.toString(SaveFormat.TEXT).contains("Fig")
            || paragraph.toString(SaveFormat.TEXT).contains("(a)")
            || paragraph.toString(SaveFormat.TEXT).contains("(b)"))
    {
        System.out.println(paragraph.getText());
        if(paragraph.getChildNodes(NodeType.SHAPE, true).getCount() > 0)
        {
            Document dstDoc = new Document();
            NodeImporter importer = new NodeImporter(doc, dstDoc, ImportFormatMode.KEEP_SOURCE_FORMATTING);
            Node newNode = importer.importNode(paragraph, true);
            dstDoc.getFirstSection().getBody().appendChild(newNode);
            dstDoc.save(MyDir + "output"+i+".docx");
            i++;
        }
        else if(paragraph.getChildNodes(NodeType.SHAPE, true).getCount() == 0
                && paragraph.getPreviousSibling() != null
                && paragraph.getPreviousSibling().getNodeType() == NodeType.PARAGRAPH
                && ((Paragraph)paragraph.getPreviousSibling()).getChildNodes(NodeType.SHAPE, true).getCount() > 0)
        {
            Document dstDoc = new Document();
            NodeImporter importer = new NodeImporter(doc, dstDoc, ImportFormatMode.KEEP_SOURCE_FORMATTING);
            Node newNode = importer.importNode((Paragraph)paragraph.getPreviousSibling(), true);
            dstDoc.getFirstSection().getBody().appendChild(newNode);
            dstDoc.save(MyDir + "output"+i+".docx");
            i++;
        }
    }
}

MikeLak · September 19, 2018, 5:35am

@tahir.manzoor

I am not able to extract the labelled images (a) and (b). I have attached the input and output files.
ReferenceDoclet_withSingleCell.zip (22 Bytes)
doc2_output.zip (303.6 KB)

tahir.manzoor · September 19, 2018, 12:12pm

@MikeLak

Thanks for your inquiry. The ReferenceDoclet_withSingleCell.zip contains no document. Could you please ZIP and attach your input Word document for testing? We will investigate the issue on our side and provide you more information.

MikeLak · September 19, 2018, 12:17pm

doc2_Sample.zip (297.4 KB)
Please find attached the input document. The expected output must contain each image as a separate image.
The output must be doc2_output.zip (303.6 KB)

tahir.manzoor · September 19, 2018, 5:03pm

@MikeLak

Thanks for your inquiry. Please use the following code example to get the desired output. Hope this helps you.

Document doc = new Document(MyDir + "doc2_Sample.docx");

DocumentBuilder builder = new DocumentBuilder(doc);
int i = 1;
NodeCollection paragraphs = doc.getChildNodes(NodeType.PARAGRAPH, true);
for (Paragraph  paragraph : (Iterable<Paragraph>) paragraphs)
{
    if(paragraph.toString(SaveFormat.TEXT).trim().startsWith("Fig"))
    {

        Node PreviousPara = paragraph.getPreviousSibling();

        if (PreviousPara != null &&
              (PreviousPara.toString(SaveFormat.TEXT).trim().contains("(a)") ||
                         PreviousPara.toString(SaveFormat.TEXT).trim().contains("(b)") )

                ) {
            PreviousPara = PreviousPara.getPreviousSibling();

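            // Check the node type before casting: the sibling may be a table rather than a paragraph.
            if (PreviousPara != null && PreviousPara.getNodeType() == NodeType.PARAGRAPH && ((Paragraph) PreviousPara).getChildNodes(NodeType.SHAPE, true).getCount() > 0) {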

                for (Shape shape : (Iterable<Shape>) ((Paragraph) PreviousPara).getChildNodes(NodeType.SHAPE, true))
                {
                    Document dstDoc = new Document();
                    NodeImporter importer = new NodeImporter(doc, dstDoc, ImportFormatMode.KEEP_SOURCE_FORMATTING);
                    Node newNode = importer.importNode(shape, true);
                    dstDoc.getFirstSection().getBody().getFirstParagraph().appendChild(newNode);
                    dstDoc.save(MyDir + "output" + i + ".docx");
                    i++;
                }
            }

        }
    }
}

MikeLak · September 20, 2018, 9:25am

Hi @tahir.manzoor

Thanks for the feedback. I have a document where I am not able to extract the (a) and (b) images separately. The sample input is 3.zip (642.2 KB)
Expected output is Expected_Output.zip (664.4 KB)
Thanks in advance.

tahir.manzoor · September 20, 2018, 4:09pm

@MikeLak

Thanks for your inquiry. In this case, the images are inside a table node. You need to list all your use cases and extract the images accordingly. Please use the following modified code to get the desired output.

Document doc = new Document(MyDir + "3.docx");

DocumentBuilder builder = new DocumentBuilder(doc);
int i = 1;
NodeCollection paragraphs = doc.getChildNodes(NodeType.PARAGRAPH, true);
for (Paragraph  paragraph : (Iterable<Paragraph>) paragraphs)
{
    if(paragraph.toString(SaveFormat.TEXT).trim().startsWith("Fig"))
    {
        Node PreviousPara = paragraph.getPreviousSibling();

        if (PreviousPara != null &&
              (PreviousPara.toString(SaveFormat.TEXT).trim().contains("(a)") ||
                         PreviousPara.toString(SaveFormat.TEXT).trim().contains("(b)") )

                )
        {
            if (PreviousPara != null && PreviousPara.isComposite() && ((CompositeNode) PreviousPara).getChildNodes(NodeType.SHAPE, true).getCount() > 0) {
                for (Shape shape : (Iterable<Shape>) ((CompositeNode) PreviousPara).getChildNodes(NodeType.SHAPE, true))
                {
                    Document dstDoc = new Document();
                    NodeImporter importer = new NodeImporter(doc, dstDoc, ImportFormatMode.KEEP_SOURCE_FORMATTING);
                    Node newNode = importer.importNode(shape, true);
                    dstDoc.getFirstSection().getBody().getFirstParagraph().appendChild(newNode);
                    dstDoc.save(MyDir + "output" + i + ".docx");
                    i++;
                }
            }

        }
    }
}
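
The two layouts seen so far in this thread (a label paragraph sitting between the image paragraph and the caption, and a single container such as a table holding both labels and images) can be written down as explicit cases. The following is a plain-Java model of that decision only, not Aspose.Words code: Block and ImageLocator are hypothetical stand-ins, and each Block is assumed to carry a node's text and the number of shapes found inside it.

```java
import java.util.Arrays;
import java.util.List;

// Simplified stand-in for a body-level node: its text and how many images it holds.
// Block and ImageLocator are hypothetical; they are not Aspose.Words classes.
final class Block {
    final String text;
    final int imageCount;

    Block(String text, int imageCount) {
        this.text = text;
        this.imageCount = imageCount;
    }
}

final class ImageLocator {
    static boolean hasSubLabel(String text) {
        return text.contains("(a)") || text.contains("(b)");
    }

    /** Returns the index of the block holding the images for the caption at captionIndex, or -1. */
    static int findImageBlock(List<Block> body, int captionIndex) {
        if (captionIndex <= 0 || captionIndex >= body.size()) {
            return -1;
        }
        if (!body.get(captionIndex).text.trim().startsWith("Fig")) {
            return -1;
        }
        int prev = captionIndex - 1;
        Block previous = body.get(prev);
        if (!hasSubLabel(previous.text)) {
            return -1;
        }
        // Case 1: labels and images share one container (for example a table).
        if (previous.imageCount > 0) {
            return prev;
        }
        // Case 2: a label-only paragraph sits between the images and the caption.
        if (prev - 1 >= 0 && body.get(prev - 1).imageCount > 0) {
            return prev - 1;
        }
        return -1;
    }

    public static void main(String[] args) {
        List<Block> body = Arrays.asList(
                new Block("", 2), new Block("(a) (b)", 0), new Block("Fig. 1 Overview", 0));
        System.out.println(findImageBlock(body, 2));
    }
}
```

Any further layout (for example, labels placed after the caption) would need its own case added to this kind of check.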

MikeLak · September 24, 2018, 8:39am

Hi @tahir.manzoor

I have extracted the images using this code. I need the label (a) or (b) to be included with each extracted labelled image. Input document is Sample1.zip (596.6 KB)
Expected output is sample_output.zip (610.6 KB)

private static void unNumberedImageExtrac(Document interimdoc) throws Exception 
{
	Document doc = new Document(filearg);

	DocumentBuilder builder = new DocumentBuilder(doc);
	int i = 1;
	NodeCollection paragraphs = doc.getChildNodes(NodeType.PARAGRAPH, true);
	for (Paragraph  paragraph : (Iterable<Paragraph>) paragraphs)
	{
	    if(paragraph.toString(SaveFormat.TEXT).trim().startsWith("Fig"))
	    {

	        Node PreviousPara = paragraph.getPreviousSibling();

	        if (PreviousPara != null &&
	              (PreviousPara.toString(SaveFormat.TEXT).trim().contains("(a)") ||
	                         PreviousPara.toString(SaveFormat.TEXT).trim().contains("(b)")||
	                         PreviousPara.toString(SaveFormat.TEXT).trim().contains("(c)")||
	                         PreviousPara.toString(SaveFormat.TEXT).trim().contains("(d)"))

	                ) {
	            PreviousPara = PreviousPara.getPreviousSibling();
	            try{
	            if (PreviousPara != null && ((Paragraph) PreviousPara).getChildNodes(NodeType.SHAPE, true).getCount() > 0) {

	                for (Shape shape : (Iterable<Shape>) ((Paragraph) PreviousPara).getChildNodes(NodeType.SHAPE, true))
	                {
	                	Document dstDoc = new Document();
	                    NodeImporter importer = new NodeImporter(doc, dstDoc, ImportFormatMode.KEEP_SOURCE_FORMATTING);
	                    Node newNode = importer.importNode(shape, true);
	                    dstDoc.getFirstSection().getBody().getFirstParagraph().appendChild(newNode);
	                    dstDoc.getPreviousSibling();
	                    dstDoc.save(folderName + "output_B" + i + ".docx");
	                    dstDoc.save(folderName + "output_B" + i + ".jpeg");
	                    dstDoc.save(folderName + "output_B" + i + ".pdf");
	                    i++;
	                }
	            }
	            }
	            catch(Exception e){
	                // Report the failure instead of silently swallowing it.
	                e.printStackTrace();
	            }
	        }
	    }
	}
	
}

tahir.manzoor · September 24, 2018, 5:54pm

@MikeLak

Thanks for your inquiry. We are working on your query and will get back to you with a code example. Thanks for your cooperation.

MikeLak · September 25, 2018, 3:25am

Thanks @tahir.manzoor for all the support… Kindly help.

tahir.manzoor · September 25, 2018, 9:54am

@MikeLak

For this case, please use the following code example. Hope this helps you.

Document doc = new Document(MyDir + "Sample1.docx");

DocumentBuilder builder = new DocumentBuilder(doc);
int i = 1;
NodeCollection paragraphs = doc.getChildNodes(NodeType.PARAGRAPH, true);
for (Paragraph  paragraph : (Iterable<Paragraph>) paragraphs)
{
    if(paragraph.toString(SaveFormat.TEXT).trim().startsWith("Fig"))
    {
        Node PreviousPara = paragraph.getPreviousSibling();

        if (PreviousPara != null &&
                (PreviousPara.toString(SaveFormat.TEXT).trim().contains("(a)") ||
                        PreviousPara.toString(SaveFormat.TEXT).trim().contains("(b)") )

                )
        {
            Node label = PreviousPara;
            if(label != null)
            {
                PreviousPara = label.getPreviousSibling();
                if (PreviousPara != null && PreviousPara.isComposite() && ((CompositeNode) PreviousPara).getChildNodes(NodeType.SHAPE, true).getCount() > 0) {

                    for (Shape shape : (Iterable<Shape>) ((CompositeNode) PreviousPara).getChildNodes(NodeType.SHAPE, true))
                    {
                        Document dstDoc = new Document();
                        NodeImporter importer = new NodeImporter(doc, dstDoc, ImportFormatMode.KEEP_SOURCE_FORMATTING);
                        Node newNode = importer.importNode(shape, true);
                        dstDoc.getFirstSection().getBody().getFirstParagraph().appendChild(newNode);


                        newNode = importer.importNode(label, true);
                        dstDoc.getFirstSection().getBody().appendChild(newNode);

                        if(i%2 == 0)
                            dstDoc.getFirstSection().getBody().getLastParagraph().getRange().replace("(a)", "", new FindReplaceOptions());
                        else
                            dstDoc.getFirstSection().getBody().getLastParagraph().getRange().replace("(b)", "", new FindReplaceOptions());

                        dstDoc.save(MyDir + "output" + i + ".docx");
                        i++;
                    }
                }

            }
        }
    }
}
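
The i % 2 test above relies on each label paragraph carrying exactly two labels, (a) and (b), with images saved in that order. For documents that also use (c) or (d), one possible generalization is to keep only the label whose position matches the image's position within its group. The sketch below shows just that string handling in plain Java: SubLabel and keepOnly are hypothetical names, and the label paragraph is assumed to contain nothing but labels.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Plain-Java sketch of the label selection only; SubLabel and keepOnly are
// hypothetical names and are not part of Aspose.Words.
final class SubLabel {
    // Matches sub-figure labels such as (a), (b), (c): one lowercase letter in parentheses.
    private static final Pattern LABEL = Pattern.compile("\\(([a-z])\\)");

    /** Keeps only the label at position keepIndex (0-based) and removes the others. */
    static String keepOnly(String labelText, int keepIndex) {
        Matcher m = LABEL.matcher(labelText);
        StringBuffer out = new StringBuffer();
        int index = 0;
        while (m.find()) {
            // Re-emit the matched label only when its position is the one to keep.
            String replacement = (index == keepIndex) ? m.group() : "";
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
            index++;
        }
        m.appendTail(out);
        return out.toString().trim();
    }

    public static void main(String[] args) {
        System.out.println(keepOnly("(a) (b)", 0));
        System.out.println(keepOnly("(a)   (b)   (c)   (d)", 2));
    }
}
```

Here keepIndex would be the 0-based position of the shape within its group, tracked with a separate counter that restarts for each label paragraph rather than the global file counter i.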

MikeLak · September 25, 2018, 10:33am

@tahir.manzoor Thanks for sharing the code. It is working fine.