Issue on image extraction

Mahi39 · May 6, 2022, 7:26am

Hi Team,

We are extracting images from documents using Aspose.Words. We have come across a new scenario in one document: the image and its caption are combined in a single text frame. How can we extract the image in this case? Please advise.

Input doc: 2021GB007083-file001.docx (4.9 MB)

alexey.noskov · May 6, 2022, 2:46pm

@Mahesh39 You can use code like this to extract images from your document:

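// Load the document and collect every shape, including nested ones.
Document doc = new Document("C:\\Temp\\in.docx");
Iterable<Shape> shapes = doc.getChildNodes(NodeType.SHAPE, true);
int counter = 0;
for (Shape s : shapes)
{
    // Save each shape that holds an image, using the extension that matches its image type.
    if (s.hasImage())
        s.getImageData().save("C:\\Temp\\img_" + (counter++) + FileFormatUtil.imageTypeToExtension(s.getImageData().getImageType()));
}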

If you would also like to check whether the shape is inside a GroupShape and whether the parent GroupShape contains a caption, you can use code like this:

Document doc = new Document("C:\\Temp\\in.docx");
Iterable<Shape> shapes = doc.getChildNodes(NodeType.SHAPE, true);
for (Shape s : shapes)
{
    if (s.hasImage())
    {
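        // Climb to the outermost GroupShape that contains this image, if any.
        GroupShape parentShape = (GroupShape)s.getAncestor(NodeType.GROUP_SHAPE);
        while (parentShape != null && parentShape.getAncestor(NodeType.GROUP_SHAPE) != null)
            parentShape = (GroupShape)parentShape.getAncestor(NodeType.GROUP_SHAPE);

        // Skip images that are not inside a GroupShape; parentShape is null for them.
        if (parentShape == null)
            continue;

        // Print the text of every paragraph in the group; a caption would be among them.
        Iterable<Paragraph> paragraphs = parentShape.getChildNodes(NodeType.PARAGRAPH, true);
        for (Paragraph p : paragraphs)
        {
            System.out.println(p.toString(SaveFormat.TEXT));
        }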
    }
}
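
The ancestor walk in the second snippet can be tried without the library. The following is a minimal sketch that uses a hypothetical Node class as a stand-in for a document tree; Node, topmostAncestor and collectText are illustrative names and not part of Aspose.Words. It climbs to the outermost ancestor of a given type, returns null when there is none, and then gathers the paragraph text under that ancestor.

```java
import java.util.ArrayList;
import java.util.List;

public class TopmostAncestorSketch {
    // Hypothetical stand-in for a document node; not an Aspose.Words class.
    static final class Node {
        final String type;
        final String text;
        Node parent;
        final List<Node> children = new ArrayList<>();

        Node(String type, String text) {
            this.type = type;
            this.text = text;
        }

        // Attaches a child and returns this node so calls can be chained.
        Node add(Node child) {
            child.parent = this;
            children.add(child);
            return this;
        }
    }

    // Returns the outermost ancestor of the given type, or null if there is none.
    static Node topmostAncestor(Node node, String type) {
        Node found = null;
        for (Node p = node.parent; p != null; p = p.parent) {
            if (p.type.equals(type)) {
                found = p;
            }
        }
        return found;
    }

    // Collects the text of every descendant of the given type, in document order.
    static List<String> collectText(Node root, String type) {
        List<String> out = new ArrayList<>();
        for (Node c : root.children) {
            if (c.type.equals(type)) {
                out.add(c.text);
            }
            out.addAll(collectText(c, type));
        }
        return out;
    }

    public static void main(String[] args) {
        // An image nested two groups deep, with a caption paragraph in the outer group.
        Node image = new Node("SHAPE", "");
        Node inner = new Node("GROUP_SHAPE", "").add(image);
        Node outer = new Node("GROUP_SHAPE", "")
                .add(inner)
                .add(new Node("PARAGRAPH", "Figure 1. Caption"));

        Node top = topmostAncestor(image, "GROUP_SHAPE");
        if (top != null) {
            System.out.println(collectText(top, "PARAGRAPH"));
        }

        // An image that is not inside any group has no such ancestor.
        Node loose = new Node("SHAPE", "");
        System.out.println(topmostAncestor(loose, "GROUP_SHAPE") == null);
    }
}
```

The null check before using the ancestor matters for the same reason in the real code: an image that is not inside any group has no group ancestor to read paragraphs from.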