Extraction of images from a Word document

priyanga · June 18, 2018, 1:05pm

Hi Team,

My requirement is to extract images from a Word document, using the figure caption keyword to locate them, and save them into a new document.

My issue is that I am able to extract normal images (an image followed by a figure caption), but figures followed by legends such as (a) and (b) are not extracted. Please help me solve this issue.

Source: sample.zip (390.5 KB)

expected output: expected output.zip (665.3 KB)

Thanks & regards,
Priyanga G

tahir.manzoor · June 18, 2018, 5:43pm

@priyanga,

Thanks for your inquiry. We are working on your query and will get back to you with a code example.

tahir.manzoor · June 19, 2018, 5:20pm

@priyanga,

Please use the following code example to extract the images from the document. Hope this helps you.

Document doc = new Document(MyDir + "sample.docx");

DocumentBuilder builder = new DocumentBuilder(doc);
ArrayList tables = new ArrayList();
int bookmark = 1;
int i = 1;
NodeCollection paragraphs = doc.getChildNodes(NodeType.PARAGRAPH, true);
for (Paragraph  paragraph : (Iterable<Paragraph>) paragraphs)
{
    if(paragraph.toString(SaveFormat.TEXT).trim().startsWith("Fig"))
    {
        Node PreviousPara = paragraph.getPreviousSibling();
        while (PreviousPara != null &&

                (PreviousPara.toString(SaveFormat.TEXT).trim().length() == 0
                        || (PreviousPara.toString(SaveFormat.TEXT).trim().contains("(a)") ||
                        PreviousPara.toString(SaveFormat.TEXT).trim().contains("(b)") ||
                        PreviousPara.toString(SaveFormat.TEXT).trim().contains("(c)") ||
                        PreviousPara.toString(SaveFormat.TEXT).trim().contains("(d)")
                        )
                )
                && ((Paragraph)PreviousPara).getChildNodes(NodeType.SHAPE, true).getCount() == 0

                )
        {
            PreviousPara = PreviousPara.getPreviousSibling();
        }


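        // Step back one more node, but only if the loop did not already reach the start.
        if(PreviousPara != null)
            PreviousPara = PreviousPara.getPreviousSibling();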
        if(PreviousPara == null)
        {
            builder.moveToDocumentStart();
            builder.insertParagraph();
            builder.startBookmark("Bookmark" + bookmark);
            builder.moveToParagraph(paragraphs.indexOf(paragraph), 0);
            builder.endBookmark("Bookmark" + bookmark);
            bookmark++;
        }
        else if(PreviousPara.getNodeType() == NodeType.PARAGRAPH)
        {
            Node node = ((Paragraph)PreviousPara).getParentNode().insertAfter(new Paragraph(doc), PreviousPara);
            builder.moveTo(node);
            builder.startBookmark("Bookmark" + bookmark);
            builder.moveTo(paragraph);
            //builder.writeln();
            builder.endBookmark("Bookmark" + bookmark);
            bookmark++;
        }
        else if(PreviousPara.getNodeType() == NodeType.TABLE)
        {
            if(((Table)PreviousPara).getChildNodes(NodeType.SHAPE, true).getCount() > 0)
                tables.add(((Table)PreviousPara));
        }

    }
}

for (Bookmark bm : doc.getRange().getBookmarks())
{
    if(bm.getName().startsWith("Bookmark"))
    {
        ArrayList nodes =  ExtractContents.extractContent(bm.getBookmarkStart(), bm.getBookmarkEnd(), true);
        Document dstDoc = ExtractContents.generateDocument(doc, nodes);

        PageSetup sourcePageSetup = ((Paragraph)bm.getBookmarkStart().getParentNode()).getParentSection().getPageSetup();
        dstDoc.getFirstSection().getPageSetup().setPaperSize(sourcePageSetup.getPaperSize());
        dstDoc.getFirstSection().getPageSetup().setLeftMargin(sourcePageSetup.getLeftMargin());
        dstDoc.getFirstSection().getPageSetup().setRightMargin(sourcePageSetup.getRightMargin());
        if(dstDoc.getLastSection().getBody().getLastParagraph().toString(SaveFormat.TEXT).trim().startsWith("Fig"))
            dstDoc.getLastSection().getBody().getLastParagraph().remove();
        dstDoc.save(MyDir + "output"+i+".docx");
        i++;
    }
}
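The legend test in the loop above is a chain of `contains` calls, one per label. As a minimal sketch, the same check could be kept in one helper built on a regular expression. `LegendMatcher` and `isLegendOrEmpty` are hypothetical names and not part of Aspose.Words, and the label range (a) to (d) is an assumption based on the labels mentioned in this thread.

```java
import java.util.regex.Pattern;

public class LegendMatcher {
    // Matches a sub-figure label such as "(a)", "(b)", "(c)" or "(d)".
    // The a-d range is an assumption; widen it if the documents use more labels.
    private static final Pattern LEGEND = Pattern.compile("\\([a-d]\\)");

    /** Returns true when the text is blank or contains a legend label. */
    public static boolean isLegendOrEmpty(String text) {
        String trimmed = text.trim();
        return trimmed.isEmpty() || LEGEND.matcher(trimmed).find();
    }

    public static void main(String[] args) {
        System.out.println(isLegendOrEmpty("(a) Front view   (b) Side view"));
        System.out.println(isLegendOrEmpty("   "));
        System.out.println(isLegendOrEmpty("Fig. 1 Overview"));
    }
}
```

In the loop, the helper would be called with the paragraph text, for example `PreviousPara.toString(SaveFormat.TEXT)`.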

priyanga · July 2, 2018, 7:02am

Hi @tahir.manzoor,

Thanks for your feedback. It’s working fine for extraction.

I have an issue with one more document, which contains part figures. Please help me extract those images.

Source document: sample.zip (1.6 MB)
Actual output: actual output.zip (1.4 MB)
Expected output: Expected output.zip (1.4 MB)

Thanks & regards,
Priyanga G

tahir.manzoor · July 2, 2018, 4:31pm

@priyanga,

Thanks for your inquiry. Please use the following modified code to get the desired output. Hope this helps you.

Document doc = new Document(MyDir + "sample.docx");

DocumentBuilder builder = new DocumentBuilder(doc);
ArrayList tables = new ArrayList();
int bookmark = 1;
int i = 1;
NodeCollection paragraphs = doc.getChildNodes(NodeType.PARAGRAPH, true);
for (Paragraph  paragraph : (Iterable<Paragraph>) paragraphs)
{
    if(paragraph.toString(SaveFormat.TEXT).trim().startsWith("Fig"))
    {

        Node PreviousPara = paragraph.getPreviousSibling();

         while (PreviousPara != null &&

                (PreviousPara.toString(SaveFormat.TEXT).trim().length() == 0
                        || (PreviousPara.toString(SaveFormat.TEXT).trim().contains("(a)") ||
                        PreviousPara.toString(SaveFormat.TEXT).trim().contains("(b)") ||
                        PreviousPara.toString(SaveFormat.TEXT).trim().contains("(c)") ||
                        PreviousPara.toString(SaveFormat.TEXT).trim().contains("(d)")
                )
                )

           && ((Paragraph)PreviousPara).getChildNodes(NodeType.SHAPE, true).getCount() > 0
                )
        {
            PreviousPara = PreviousPara.getPreviousSibling();
        }

        //PreviousPara = PreviousPara.getPreviousSibling();
        if(PreviousPara == null)
        {
            builder.moveToDocumentStart();
            builder.insertParagraph();
            builder.startBookmark("Bookmark" + bookmark);
            builder.moveToParagraph(paragraphs.indexOf(paragraph), 0);
            builder.endBookmark("Bookmark" + bookmark);
            bookmark++;
        }
        else if(PreviousPara.getNodeType() == NodeType.PARAGRAPH)
        {
            Node node = ((Paragraph)PreviousPara).getParentNode().insertAfter(new Paragraph(doc), PreviousPara);
            builder.moveTo(node);
            builder.startBookmark("Bookmark" + bookmark);
            builder.moveTo(paragraph);
            builder.endBookmark("Bookmark" + bookmark);
            bookmark++;
        }
        else if(PreviousPara.getNodeType() == NodeType.TABLE)
        {
            if(((Table)PreviousPara).getChildNodes(NodeType.SHAPE, true).getCount() > 0)
                tables.add(((Table)PreviousPara));
        }

    }
}

for (Bookmark bm : doc.getRange().getBookmarks())
{
    if(bm.getName().startsWith("Bookmark"))
    {
        ArrayList nodes =  ExtractContents.extractContent(bm.getBookmarkStart(), bm.getBookmarkEnd(), true);
        Document dstDoc = ExtractContents.generateDocument(doc, nodes);

        PageSetup sourcePageSetup = ((Paragraph)bm.getBookmarkStart().getParentNode()).getParentSection().getPageSetup();
        dstDoc.getFirstSection().getPageSetup().setPaperSize(sourcePageSetup.getPaperSize());
        dstDoc.getFirstSection().getPageSetup().setLeftMargin(sourcePageSetup.getLeftMargin());
        dstDoc.getFirstSection().getPageSetup().setRightMargin(sourcePageSetup.getRightMargin());

        if(dstDoc.getLastSection().getBody().getLastParagraph().toString(SaveFormat.TEXT).trim().startsWith("Fig"))
            dstDoc.getLastSection().getBody().getLastParagraph().remove();

        if(dstDoc.getFirstSection().getBody().getFirstParagraph().getChildNodes(NodeType.SHAPE, true).getCount() == 0)
            dstDoc.getFirstSection().getBody().getFirstParagraph().remove();

        dstDoc.save(MyDir + "output"+i+".docx");
        i++;
    }
}

priyanga · July 3, 2018, 4:36am

Hi @tahir.manzoor,

It’s working absolutely fine.

After extracting the images and saving them into a new document, how do I delete the extracted images from the source document? Please help me solve this problem.

Thanks & Regards,
Priyanga G

tahir.manzoor · July 3, 2018, 10:48am

@priyanga,

Thanks for your inquiry. Please use the following code example to remove the extracted image from the source document.

Document doc = new Document(MyDir + "sample.docx");

DocumentBuilder builder = new DocumentBuilder(doc);
ArrayList tables = new ArrayList();
ArrayList shapes = new ArrayList();
int bookmark = 1;
int i = 1;
NodeCollection paragraphs = doc.getChildNodes(NodeType.PARAGRAPH, true);
for (Paragraph  paragraph : (Iterable<Paragraph>) paragraphs)
{
    if(paragraph.toString(SaveFormat.TEXT).trim().startsWith("Fig"))
    {

        Node PreviousPara = paragraph.getPreviousSibling();

        while (PreviousPara != null &&

                (PreviousPara.toString(SaveFormat.TEXT).trim().length() == 0
                        || (PreviousPara.toString(SaveFormat.TEXT).trim().contains("(a)") ||
                        PreviousPara.toString(SaveFormat.TEXT).trim().contains("(b)") ||
                        PreviousPara.toString(SaveFormat.TEXT).trim().contains("(c)") ||
                        PreviousPara.toString(SaveFormat.TEXT).trim().contains("(d)")
                )
                )

                && ((Paragraph)PreviousPara).getChildNodes(NodeType.SHAPE, true).getCount() > 0
                )
        {
            shapes.add(PreviousPara);
            PreviousPara = PreviousPara.getPreviousSibling();
        }

        if(PreviousPara == null)
        {
            builder.moveToDocumentStart();
            builder.insertParagraph();
            builder.startBookmark("Bookmark" + bookmark);
            builder.moveToParagraph(paragraphs.indexOf(paragraph), 0);
            builder.endBookmark("Bookmark" + bookmark);
            bookmark++;
        }
        else if(PreviousPara.getNodeType() == NodeType.PARAGRAPH)
        {
            Node node = ((Paragraph)PreviousPara).getParentNode().insertAfter(new Paragraph(doc), PreviousPara);
            builder.moveTo(node);
            builder.startBookmark("Bookmark" + bookmark);
            builder.moveTo(paragraph);
            builder.endBookmark("Bookmark" + bookmark);
            bookmark++;
        }

    }
}

for (Bookmark bm : doc.getRange().getBookmarks())
{
    if(bm.getName().startsWith("Bookmark"))
    {
        Node start = bm.getBookmarkStart();
        Node end = bm.getBookmarkEnd();
        ArrayList nodes =  ExtractContents.extractContent(bm.getBookmarkStart(), bm.getBookmarkEnd(), true);
        Document dstDoc = ExtractContents.generateDocument(doc, nodes);

        PageSetup sourcePageSetup = ((Paragraph)bm.getBookmarkStart().getParentNode()).getParentSection().getPageSetup();
        dstDoc.getFirstSection().getPageSetup().setPaperSize(sourcePageSetup.getPaperSize());
        dstDoc.getFirstSection().getPageSetup().setLeftMargin(sourcePageSetup.getLeftMargin());
        dstDoc.getFirstSection().getPageSetup().setRightMargin(sourcePageSetup.getRightMargin());

        if(dstDoc.getLastSection().getBody().getLastParagraph().toString(SaveFormat.TEXT).trim().startsWith("Fig"))
            dstDoc.getLastSection().getBody().getLastParagraph().remove();

        if(dstDoc.getFirstSection().getBody().getFirstParagraph().getChildNodes(NodeType.SHAPE, true).getCount() == 0)
            dstDoc.getFirstSection().getBody().getFirstParagraph().remove();

        dstDoc.save(MyDir + "output"+i+".docx");
        i++;
    }
}

for (Paragraph  paragraph : (Iterable<Paragraph>) shapes)
{
    paragraph.getChildNodes(NodeType.SHAPE, true).clear();
}

doc.save(MyDir + "out.docx");

priyanga · July 3, 2018, 12:52pm

Hi @tahir.manzoor,

In this modified code the line below was commented out, so I am able to get part figures, but figures followed by legends such as (a) and (b) are not extracted. Please help me extract both cases and finally delete all extracted images.

//PreviousPara = PreviousPara.getPreviousSibling();

Thanks & Regards,
Priyanga G

tahir.manzoor · July 3, 2018, 5:55pm

@priyanga,

Thanks for your inquiry. The following code example works for both cases shared in this forum thread.

Document doc = new Document(MyDir + "sample.docx");

DocumentBuilder builder = new DocumentBuilder(doc);
ArrayList tables = new ArrayList();
ArrayList shapes = new ArrayList();
int bookmark = 1;
int i = 1;
NodeCollection paragraphs = doc.getChildNodes(NodeType.PARAGRAPH, true);
for (Paragraph  paragraph : (Iterable<Paragraph>) paragraphs)
{
    if(paragraph.toString(SaveFormat.TEXT).trim().startsWith("Fig"))
    {
        Node PreviousPara = paragraph.getPreviousSibling();
        while (PreviousPara != null
                && PreviousPara.getNodeType() == NodeType.PARAGRAPH
                && !PreviousPara.toString(SaveFormat.TEXT).trim().contains("Fig")
                && (
                PreviousPara.toString(SaveFormat.TEXT).trim().length() == 0 ||
                        PreviousPara.toString(SaveFormat.TEXT).trim().contains("(a)") ||
                        PreviousPara.toString(SaveFormat.TEXT).trim().contains("(b)") ||
                        PreviousPara.toString(SaveFormat.TEXT).trim().contains("(c)") ||
                        PreviousPara.toString(SaveFormat.TEXT).trim().contains("(d)"))
                )
        {
            shapes.add(PreviousPara);
            PreviousPara = PreviousPara.getPreviousSibling();
        }

         
        if(PreviousPara == null)
        {
            builder.moveToDocumentStart();
            builder.insertParagraph();
            builder.startBookmark("Bookmark" + bookmark);
            builder.moveToParagraph(paragraphs.indexOf(paragraph), 0);
            builder.endBookmark("Bookmark" + bookmark);
            bookmark++;
        }
        else if(PreviousPara.getNodeType() == NodeType.PARAGRAPH)
        {
            Node node = ((Paragraph)PreviousPara).getParentNode().insertAfter(new Paragraph(doc), PreviousPara);
            builder.moveTo(node);
            builder.startBookmark("Bookmark" + bookmark);
            builder.moveTo(paragraph);
            builder.endBookmark("Bookmark" + bookmark);
            bookmark++;
        }

    }
}

for (Bookmark bm : doc.getRange().getBookmarks())
{
    if(bm.getName().startsWith("Bookmark"))
    {
        Node start = bm.getBookmarkStart();
        Node end = bm.getBookmarkEnd();
        ArrayList nodes =  ExtractContents.extractContent(bm.getBookmarkStart(), bm.getBookmarkEnd(), true);
        Document dstDoc = ExtractContents.generateDocument(doc, nodes);

        PageSetup sourcePageSetup = ((Paragraph)bm.getBookmarkStart().getParentNode()).getParentSection().getPageSetup();
        dstDoc.getFirstSection().getPageSetup().setPaperSize(sourcePageSetup.getPaperSize());
        dstDoc.getFirstSection().getPageSetup().setLeftMargin(sourcePageSetup.getLeftMargin());
        dstDoc.getFirstSection().getPageSetup().setRightMargin(sourcePageSetup.getRightMargin());

        for (Paragraph  paragraph : (Iterable<Paragraph>) dstDoc.getChildNodes(NodeType.PARAGRAPH, true))
        {
            if(paragraph.getChildNodes(NodeType.SHAPE, true).getCount() == 0)
                paragraph.remove();
        }

        dstDoc.save(MyDir + "out//output"+i+".docx");
        i++;
    }
}

for (Paragraph  paragraph : (Iterable<Paragraph>) shapes)
{
    paragraph.getChildNodes(NodeType.SHAPE, true).clear();
}

doc.save(MyDir + "out.docx");
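The backward walk in the `while` loop above can be hard to follow inside the Aspose calls. The following is a minimal sketch of the same idea on plain strings, assuming each string stands for the text of one paragraph and that an image-only paragraph yields empty text. `FigureBlockFinder` and `findFigureStart` are hypothetical names used only for illustration.

```java
import java.util.Arrays;
import java.util.List;

public class FigureBlockFinder {
    /** True when the text is blank or carries a sub-figure label (a) to (d). */
    public static boolean isLegendOrEmpty(String text) {
        String t = text.trim();
        return t.isEmpty() || t.contains("(a)") || t.contains("(b)")
                || t.contains("(c)") || t.contains("(d)");
    }

    /**
     * Walks backwards from the caption at captionIndex and returns the index
     * of the first paragraph that still belongs to the figure. The walk stops
     * at another caption ("Fig") or at ordinary body text.
     */
    public static int findFigureStart(List<String> texts, int captionIndex) {
        int start = captionIndex;
        while (start > 0
                && !texts.get(start - 1).contains("Fig")
                && isLegendOrEmpty(texts.get(start - 1))) {
            start--;
        }
        return start;
    }

    public static void main(String[] args) {
        List<String> texts = Arrays.asList(
                "Some body text.",
                "",                       // image-only paragraph (assumed empty text)
                "(a) Front   (b) Side",   // legend line
                "Fig. 1 Two views");      // caption
        System.out.println(findFigureStart(texts, 3));
    }
}
```

The range from the returned index up to the caption corresponds to the span that the posted code wraps in a bookmark.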

priyanga · July 4, 2018, 4:51am

Hi @tahir.manzoor,

Thanks a lot. It’s working absolutely fine.

Please let me know how to get the figure caption in both cases, because:

For part images, the figure caption is in the next paragraph.
For images with legends, the figure caption appears only on the second line, after the legend.
Please help me get the figure caption.

Thanks & Regards,
Priyanga G

tahir.manzoor · July 4, 2018, 11:58am

@priyanga,

Thanks for your inquiry. You can simply iterate over the paragraph nodes and check their text. If the text starts with “Fig”, get it using the Node.toString(SaveFormat.TEXT) method, as in the code examples above.
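A minimal sketch of that caption check, written on plain strings so it runs without the library. It assumes the paragraph texts have already been read, for example with `paragraph.toString(SaveFormat.TEXT)` as in the earlier posts. `CaptionCollector` and `collectCaptions` are hypothetical names, not Aspose.Words API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class CaptionCollector {
    /** Returns the trimmed text of every paragraph that starts with "Fig". */
    public static List<String> collectCaptions(List<String> paragraphTexts) {
        List<String> captions = new ArrayList<>();
        for (String text : paragraphTexts) {
            String trimmed = text.trim();
            if (trimmed.startsWith("Fig")) {
                captions.add(trimmed);
            }
        }
        return captions;
    }

    public static void main(String[] args) {
        List<String> texts = Arrays.asList(
                "",                        // image-only paragraph (assumed empty text)
                "(a) Front   (b) Side",    // legend line, skipped
                "  Fig. 1 Two views ",     // caption after the legend
                "Body text follows.");
        System.out.println(collectCaptions(texts));
    }
}
```

Because the check looks only at how each paragraph starts, it covers both layouts from the question: a caption directly after the image and a caption after a legend line.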