Image extraction from a Word document using Java

priyadharshini · January 30, 2018, 6:01pm

Expectedop2_2.zip (940.4 KB)
Expectedop2_1.zip (2.5 MB)
Source_2_2.zip (926.2 KB)
Dear Team,

Kindly provide a workaround to extract images from a Word document using paragraph nodes in Java. The images are expected to be extracted as parts and as separate images in the cases where the previous logic fails to extract them because of an anchor. The output documents are also expected to be saved using the naming pattern shown in the expected output, which is taken from the image caption. The next-sibling logic is not working to find the image caption. Kindly help.
I have attached the source and expected output documents.
Regards,
Priya
Source_1.zip (2.4 MB)
Source_2_1.zip (2.4 MB)

tahir.manzoor · January 31, 2018, 7:29am

@priyadharshini,

Thanks for your inquiry. We already shared a code example with you in your other forum threads for the same case. Please check the following code example. You can use a similar approach to get the desired output.

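// Load the source document; MyDir is the folder that holds the input files.
Document doc = new Document(MyDir + "Source_2_1.doc");
DocumentBuilder builder = new DocumentBuilder(doc);
int bookmark = 1; // suffix for the next bookmark name
int i = 1;        // suffix for the next output file name
// Collect every paragraph in the document, including nested ones.
NodeCollection paragraphs = doc.getChildNodes(NodeType.PARAGRAPH, true);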
for (Paragraph  paragraph : (Iterable<Paragraph>) paragraphs)
{
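    // A paragraph whose text starts with "Fig" is treated as a figure caption.
    if(paragraph.toString(SaveFormat.TEXT).trim().startsWith("Fig"))
    {
        // Walk backwards over blank paragraphs and sub-figure labels
        // to find the paragraph that precedes the figure block.
        Node PreviousPara = paragraph.getPreviousSibling();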
        while (PreviousPara != null
                && PreviousPara.getNodeType() == NodeType.PARAGRAPH
                && (
                PreviousPara.toString(SaveFormat.TEXT).trim().length() == 0 ||
                        PreviousPara.toString(SaveFormat.TEXT).trim().contains("(a)") ||
                        PreviousPara.toString(SaveFormat.TEXT).trim().contains("(b)") ||
                        PreviousPara.toString(SaveFormat.TEXT).trim().contains("(c)") ||
                        PreviousPara.toString(SaveFormat.TEXT).trim().contains("(e)")||
                        PreviousPara.toString(SaveFormat.TEXT).trim().contains("Setif"))
                )
        {
            PreviousPara = PreviousPara.getPreviousSibling();
        }

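        // Start the bookmark at the document start when nothing precedes the figure,
        // otherwise at the end of the preceding paragraph (character index -1).
        // Note: if the walk stopped at a non-paragraph node (for example a table),
        // the cast to Paragraph below throws a ClassCastException.
        if(PreviousPara == null)
        {
            builder.moveToDocumentStart();
            builder.startBookmark("Bookmark" + bookmark);
        }
        else
        {
            builder.moveToParagraph(paragraphs.indexOf((Paragraph)PreviousPara), -1);
            builder.startBookmark("Bookmark" + bookmark);
        }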

        // End the bookmark at the start of the caption paragraph.
        builder.moveToParagraph(paragraphs.indexOf(paragraph), 0);
        builder.endBookmark("Bookmark" + bookmark);
        bookmark++;
    }
}

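// Save the content of each generated bookmark as a separate document.
// ExtractContents is a helper class that is not defined in this snippet.
for (Bookmark bm : doc.getRange().getBookmarks())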
{
    if(bm.getName().startsWith("Bookmark"))
    {
        ArrayList nodes =  ExtractContents.extractContent(bm.getBookmarkStart(), bm.getBookmarkEnd(), true);
        Document dstDoc = ExtractContents.generateDocument(doc, nodes);
        dstDoc.save(MyDir + "output"+i+".docx");
        i++;
    }
}
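
The backwards walk in the code above can be tried without Aspose.Words by modelling the document as a plain list of paragraph texts. The following is only a minimal sketch under that assumption: the class `CaptionRanges`, its methods, and the file naming rule in `fileNameFor` are hypothetical and are not part of the Aspose.Words API or of the expected output pattern in the attachments. The skip markers are copied from the code above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-alone model of the caption search; not Aspose.Words code.
class CaptionRanges {

    // Same markers as the thread's code: paragraphs containing these are skipped.
    static final String[] MARKERS = {"(a)", "(b)", "(c)", "(e)", "Setif"};

    /** True for paragraphs the backwards walk passes over: blank text or a marker. */
    static boolean isSkippable(String text) {
        String t = text.trim();
        if (t.isEmpty()) {
            return true;
        }
        for (String marker : MARKERS) {
            if (t.contains(marker)) {
                return true;
            }
        }
        return false;
    }

    /**
     * For every paragraph starting with "Fig", returns {boundaryIndex, captionIndex}.
     * boundaryIndex is the last paragraph before the figure block, or -1 when the
     * walk reaches the start of the list (the "document start" case).
     */
    static List<int[]> findRanges(List<String> paragraphs) {
        List<int[]> ranges = new ArrayList<>();
        for (int i = 0; i < paragraphs.size(); i++) {
            if (!paragraphs.get(i).trim().startsWith("Fig")) {
                continue;
            }
            int j = i - 1;
            while (j >= 0 && isSkippable(paragraphs.get(j))) {
                j--;
            }
            ranges.add(new int[] {j, i});
        }
        return ranges;
    }

    /** Assumed naming rule: replace each run of non-alphanumeric characters with '_'. */
    static String fileNameFor(String caption) {
        String base = caption.trim()
                .replaceAll("[^A-Za-z0-9]+", "_")
                .replaceAll("^_+|_+$", "");
        return base + ".docx";
    }

    public static void main(String[] args) {
        List<String> paragraphs = Arrays.asList(
                "Intro text", "", "(a)", "", "Fig. 1 Overview", "More text", "Fig. 2 Detail");
        for (int[] r : findRanges(paragraphs)) {
            System.out.println("block after paragraph " + r[0] + ", caption at " + r[1]
                    + ", file " + fileNameFor(paragraphs.get(r[1])));
        }
    }
}
```

In the real document, the two indices would correspond to the positions where `startBookmark` and `endBookmark` are called, and a caption-derived name could replace `"output"+i+".docx"` in the save call once the actual naming pattern is confirmed against the expected output files.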