Extracting Inline images from Word document

Saranya_Sekar · October 10, 2018, 5:56am

Hi Team,

I need help extracting inline images from a Word document in Java. The sample document is Inline-Input.zip (1.6 MB)
and the expected output is Inline-Output.zip (4.3 MB)

I also need an interim document generated, with a bookmark placed at the location from which each image was extracted. I have attached a sample of the interim document as well. Kindly help.

awais.hafeez · October 10, 2018, 11:14am

@Saranya_Sekar,

You can build your logic on the following code to meet this requirement.

// Assumes: import com.aspose.words.*; and import java.awt.Color;
Document doc = new Document("D:\\temp\\Inline-Input\\Inline-Input.docx");
DocumentBuilder builder = new DocumentBuilder(doc);

// get the target Shape
Shape firstShape = (Shape) doc.getChildNodes(NodeType.SHAPE, true).get(0);

// Create new document
Document newDoc = (Document) doc.deepClone(false);
newDoc.removeAllChildren();
newDoc.ensureMinimum();

// import and insert the shape in new document
Shape importedShape = (Shape) newDoc.importNode(firstShape, true);
importedShape.setWrapType(WrapType.NONE);
importedShape.setHorizontalAlignment(HorizontalAlignment.CENTER);
importedShape.setVerticalAlignment(VerticalAlignment.CENTER);
importedShape.setRelativeHorizontalPosition(RelativeHorizontalPosition.PAGE);
importedShape.setRelativeVerticalPosition(RelativeVerticalPosition.PAGE);

newDoc.getFirstSection().getBody().getFirstParagraph().appendChild(importedShape);

// insert marker at the place of Shape
builder.moveTo(firstShape);
builder.getFont().setColor(Color.red);
builder.write(" <Inline-Img>Inline-Img1</Inline-Img> ");

// remove the shape from original document
firstShape.remove();

doc.save("D:\\temp\\Inline-Input\\out.docx");
newDoc.save("D:\\temp\\Inline-Input\\shape.docx");

Hope this helps.

Saranya_Sekar · October 10, 2018, 11:29am

@awais.hafeez
It extracts only one image; I need it for the other 3 inline images as well. My generated output is Inline-Input.zip (4.5 MB)
How can I achieve this? Thanks in advance. The bookmark in the interim document also needs to carry the corresponding image number, and each shape needs to be saved in its own .docx file. Can you provide code for extracting the other 3 images as well?

awais.hafeez · October 10, 2018, 8:24pm

@Saranya_Sekar,

You can use a loop to iterate through all the shapes. The basic workflow will remain the same. To learn about the Aspose.Words document object model, please refer to the documentation.
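
As a rough illustration of the per-image naming such a loop needs, the sketch below builds the numbered marker text and an individual file name for each extracted shape. It uses only the Java standard library. The class name `InlineImageNaming`, its method names, the `out/` folder, and the `shape_<n>.docx` pattern are assumptions made for this example; the Aspose.Words calls that move and save the shape are left out.

```java
/** Hypothetical helper for numbering markers and output files (not part of Aspose.Words). */
final class InlineImageNaming {

    private InlineImageNaming() {
    }

    /** Builds the marker text written in place of the n-th extracted image. */
    static String markerFor(int imageNumber) {
        return " <Inline-Img>Inline-Img" + imageNumber + "</Inline-Img> ";
    }

    /** Builds a distinct file name per image so that saved shapes do not overwrite each other. */
    static String fileNameFor(String folder, int imageNumber) {
        return folder + "shape_" + imageNumber + ".docx";
    }

    public static void main(String[] args) {
        // The thread mentions four inline images in the sample document.
        for (int i = 1; i <= 4; i++) {
            System.out.println(markerFor(i).trim() + " -> " + fileNameFor("out/", i));
        }
    }
}
```

Inside the real loop, the marker string would be passed to `builder.write(...)` and the file name to `newDoc.save(...)`.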

Saranya_Sekar · October 11, 2018, 3:48am

I am using the iterator code to loop through all the shapes, but it is extracting all the images. Inline-Input.zip (2.0 MB)
Shape firstShape = (Shape) doc.getChildNodes(NodeType.SHAPE, true).get(5);
This line extracts all the images, not only the inline ones. How do I write the loop condition so that only inline images are extracted? Kindly help.

private static void inlineImagesExtraction(Document interimdoc) throws Exception
{
    Document doc = interimdoc;
    DocumentBuilder builder = new DocumentBuilder(doc);
    int i = 0;
    // get the target Shape
    for (Shape shape : (Iterable<Shape>) doc.getChildNodes(NodeType.SHAPE, true))
    {
        Shape firstShape = (Shape) doc.getChildNodes(NodeType.SHAPE, true).get(i);

        // Create new document
        Document newDoc = (Document) doc.deepClone(false);
        newDoc.removeAllChildren();
        newDoc.ensureMinimum();

        // import and insert the shape in new document
        Shape importedShape = (Shape) newDoc.importNode(firstShape, true);
        importedShape.setWrapType(WrapType.NONE);
        importedShape.setHorizontalAlignment(HorizontalAlignment.CENTER);
        importedShape.setVerticalAlignment(VerticalAlignment.CENTER);
        importedShape.setRelativeHorizontalPosition(RelativeHorizontalPosition.PAGE);
        importedShape.setRelativeVerticalPosition(RelativeVerticalPosition.PAGE);

        newDoc.getFirstSection().getBody().getFirstParagraph().appendChild(importedShape);

        // insert marker at the place of Shape
        builder.moveTo(firstShape);
        builder.getFont().setColor(Color.red);
        builder.write(" <Inline-Img>Inline-Img1</Inline-Img> ");

        // remove the shape from original document
        firstShape.remove();
        interimdoc.save(interim);
        newDoc.save("D:\\temp\\Inline-Input\\shape.docx");
    }
}

awais.hafeez · October 11, 2018, 11:42am

@Saranya_Sekar,

Please try using the following code:

Document doc = new Document("D:\\Inline-Input\\Inline-Input.docx");
DocumentBuilder builder = new DocumentBuilder(doc);

int i = 1;
for (Shape shape : (Iterable<Shape>) doc.getChildNodes(NodeType.SHAPE, true)) {
    if (shape.getParentParagraph().getChildNodes().getCount() == 1 ||
            shape.getAncestor(NodeType.TABLE) != null) {
        // Skip shapes that are the only node in their paragraph or that sit inside a table
    } else {
        Document newDoc = (Document) doc.deepClone(false);
        newDoc.removeAllChildren();
        newDoc.ensureMinimum();

        Shape importedShape = (Shape) newDoc.importNode(shape, true);
        importedShape.setWrapType(WrapType.NONE);
        importedShape.setHorizontalAlignment(HorizontalAlignment.CENTER);
        importedShape.setVerticalAlignment(VerticalAlignment.CENTER);
        importedShape.setRelativeHorizontalPosition(RelativeHorizontalPosition.PAGE);
        importedShape.setRelativeVerticalPosition(RelativeVerticalPosition.PAGE);

        newDoc.getFirstSection().getBody().getFirstParagraph().appendChild(importedShape);

        builder.moveTo(shape);
        builder.getFont().setColor(Color.red);
        builder.write(" <Inline-Img>Inline-Img" + i + "</Inline-Img> ");

        shape.remove();

        newDoc.save("D:\\temp\\Inline-Input\\shape_" + i + ".docx");
    }
    i++;
}

doc.save("D:\\Inline-Input\\out.docx");
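
One detail of the loop above: `i++` runs for every shape, including the skipped ones, so extracted images are numbered by their position among all shapes rather than consecutively. If consecutive numbers are preferred, the counter can be advanced only when a shape is actually extracted. The sketch below models both variants with a plain boolean array standing in for the skip decision; the class and method names are hypothetical and no Aspose.Words types are involved.

```java
/** Hypothetical model of the two counter placements (standard library only). */
final class MarkerNumbering {

    private MarkerNumbering() {
    }

    /** Counter advances for every shape; 0 marks a skipped shape. */
    static int[] numberAllShapes(boolean[] skipped) {
        int[] numbers = new int[skipped.length];
        int i = 1;
        for (int k = 0; k < skipped.length; k++) {
            numbers[k] = skipped[k] ? 0 : i;
            i++; // runs whether or not the shape was extracted
        }
        return numbers;
    }

    /** Counter advances only when a shape is extracted; 0 marks a skipped shape. */
    static int[] numberExtractedOnly(boolean[] skipped) {
        int[] numbers = new int[skipped.length];
        int i = 1;
        for (int k = 0; k < skipped.length; k++) {
            if (!skipped[k]) {
                numbers[k] = i;
                i++; // runs only for extracted shapes
            }
        }
        return numbers;
    }

    public static void main(String[] args) {
        boolean[] skipped = {true, false, true, false};
        System.out.println(java.util.Arrays.toString(numberAllShapes(skipped)));
        System.out.println(java.util.Arrays.toString(numberExtractedOnly(skipped)));
    }
}
```

With the sample input in `main`, the first variant numbers the two extracted shapes 2 and 4, while the second numbers them 1 and 2.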

Saranya_Sekar · October 11, 2018, 11:50am

Hi @awais.hafeez

Thank you very much. For the following document, the code is extracting other images as well. The input file is Inline_Image1.zip (5.0 MB)
and the derived output is Inline-Input.zip (1.5 MB)
Kindly help me to retrieve only the inline images.

awais.hafeez · October 12, 2018, 3:51am

@Saranya_Sekar,

We are working on your query and will get back to you soon.

Saranya_Sekar · October 12, 2018, 3:52am

@awais.hafeez
Thank you so much.

awais.hafeez · October 12, 2018, 1:08pm

@Saranya_Sekar,

You have attached new test documents. Considering “Inline_Image1.docx” as an input document, can you please also create your expected intermediate Word document (Inline_Image1_Interim.docx) by using MS Word and attach it here for our reference?

Saranya_Sekar · October 15, 2018, 3:38am

@awais.hafeez
I used this condition, which solved the issue:

if (shape.getParentParagraph().getChildNodes().getCount() < 20 ||
        shape.getAncestor(NodeType.TABLE) != null) {

}

Thanks for helping.

awais.hafeez · October 15, 2018, 11:20am

@Saranya_Sekar,

It is great that you were able to resolve this issue on your end. In case you have any further inquiries or need any help, please let us know.

Saranya_Sekar · October 15, 2018, 11:20am

@awais.hafeez
Thank you for all the help.

Saranya_Sekar · October 16, 2018, 7:02am

@awais.hafeez
I have a scenario where the inline image appears in a figure caption that is placed above the figure or beside it. The code does not extract the inline images in these cases. The sample input is Inline_Image_above.zip (101.7 KB)
the expected interim document is Inline_Image_above_Interim.zip (79.3 KB)
and the expected output is Inline_Image_above_output.zip (25.2 KB)

private static void inlineImagesExtraction(Document interimdoc) throws Exception
{
    Document doc = interimdoc;
    DocumentBuilder builder = new DocumentBuilder(doc);

    try {
        int i = 1;
        for (Shape shape : (Iterable<Shape>) doc.getChildNodes(NodeType.SHAPE, true)) {
            if (shape.getParentParagraph().getChildNodes().getCount() < 20 ||
                    shape.getAncestor(NodeType.TABLE) != null) {

            } else {
                Document newDoc = (Document) doc.deepClone(false);
                newDoc.removeAllChildren();
                newDoc.ensureMinimum();

                Shape importedShape = (Shape) newDoc.importNode(shape, true);
                importedShape.setWrapType(WrapType.NONE);
                importedShape.setHorizontalAlignment(HorizontalAlignment.CENTER);
                importedShape.setVerticalAlignment(VerticalAlignment.CENTER);
                importedShape.setRelativeHorizontalPosition(RelativeHorizontalPosition.PAGE);
                importedShape.setRelativeVerticalPosition(RelativeVerticalPosition.PAGE);

                newDoc.getFirstSection().getBody().getFirstParagraph().appendChild(importedShape);

                builder.moveTo(shape);
                builder.getFont().setColor(Color.red);
                builder.write(" <Inline-Img>Inline-Img" + i + "</Inline-Img> ");

                shape.remove();

                newDoc.save(folderName + "Inline_shape_" + i + ".docx");
                newDoc.save(folderName + "Inline_shape_" + i + ".pdf");
                newDoc.save(folderName + "Inline_shape_" + i + ".jpeg");
            }
            i++;
        }
        interimdoc.save(interim);
    }
    catch (Exception e) {

    }
}

awais.hafeez · October 16, 2018, 1:36pm

@Saranya_Sekar,

You can build on the following code to get the desired output:

Document doc = new Document("D:\\temp\\Inline_Image_above\\Inline_Image_above.docx");
DocumentBuilder builder = new DocumentBuilder(doc);

int i = 1;
for (Shape shape : (Iterable<Shape>) doc.getChildNodes(NodeType.SHAPE, true)) {

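    if (shape.getParentParagraph() == null ||
            shape.getParentParagraph().getChildNodes().getCount() == 1 ||
            shape.getAncestor(NodeType.TABLE) != null ||
            shape.getShapeType() == ShapeType.TEXT_BOX) {
        // Skip shapes with no parent paragraph, shapes alone in their paragraph,
        // shapes inside a table, and text boxes
    } else {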
        Document newDoc = (Document) doc.deepClone(false);
        newDoc.removeAllChildren();
        newDoc.ensureMinimum();

        Shape importedShape = (Shape) newDoc.importNode(shape, true);
        importedShape.setWrapType(WrapType.NONE);
        importedShape.setHorizontalAlignment(HorizontalAlignment.CENTER);
        importedShape.setVerticalAlignment(VerticalAlignment.CENTER);
        importedShape.setRelativeHorizontalPosition(RelativeHorizontalPosition.PAGE);
        importedShape.setRelativeVerticalPosition(RelativeVerticalPosition.PAGE);

        newDoc.getFirstSection().getBody().getFirstParagraph().appendChild(importedShape);

        builder.moveTo(shape);
        builder.getFont().setColor(Color.red);
        builder.write(" <Inline-Img>Inline-Img" + i + "</Inline-Img> ");

        shape.remove();

        newDoc.save("D:\\temp\\Inline_Image_above\\shape_" + i + ".docx");
    }
    i++;
}

doc.save("D:\\temp\\Inline_Image_above\\out.docx");
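
The skip condition in the code above combines four checks, and the null check comes first so that the later `getParentParagraph()` call is never made on a missing paragraph. As a minimal sketch of that decision, the example below restates it over plain values so it can be tried without a document. The class `InlineShapeFilter` and its parameter names are hypothetical stand-ins for what the loop reads from each `Shape`; whether these four checks match what a given document treats as "inline" is a heuristic, not something the library guarantees.

```java
/** Hypothetical restatement of the skip condition over plain values (no Aspose.Words types). */
final class InlineShapeFilter {

    private InlineShapeFilter() {
    }

    /**
     * Returns true when the shape should be left in place.
     *
     * @param hasParentParagraph  stands in for shape.getParentParagraph() != null
     * @param paragraphChildCount stands in for the parent paragraph's child node count
     * @param insideTable         stands in for shape.getAncestor(NodeType.TABLE) != null
     * @param isTextBox           stands in for shape.getShapeType() == ShapeType.TEXT_BOX
     */
    static boolean shouldSkip(boolean hasParentParagraph, int paragraphChildCount,
                              boolean insideTable, boolean isTextBox) {
        return !hasParentParagraph        // checked first, as in the original condition
                || paragraphChildCount == 1
                || insideTable
                || isTextBox;
    }

    public static void main(String[] args) {
        System.out.println(shouldSkip(true, 1, false, false)); // shape alone in its paragraph
        System.out.println(shouldSkip(true, 5, false, false)); // shape among other nodes
    }
}
```

Keeping the decision in one small method makes it easier to adjust a single check, such as the child-count test, without touching the extraction code.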

Saranya_Sekar · October 17, 2018, 3:49am

@awais.hafeez
I am not able to extract the image above or beside the document. Please help. The generated output has no image stored in it.

awais.hafeez · October 17, 2018, 12:33pm

@Saranya_Sekar,

Please also provide a comparison screenshot highlighting the problematic areas in Aspose.Words generated output documents with respect to your expected output and attach it here for our reference.

Saranya_Sekar · October 22, 2018, 5:11am

@awais.hafeez
The screenshot of the interim image is attached here: Inline_Image_Interim.zip (196.8 KB)
Also, the folder for the extracted images is empty. Please help.

awais.hafeez · October 22, 2018, 1:33pm

@Saranya_Sekar,

I am afraid you shared only a screenshot of the first page, and it does not point out any problem. We need a comparison screenshot highlighting the problematic area(s) in the output document(s) generated by Aspose.Words, compared with your expected output.

Saranya_Sekar · October 23, 2018, 3:49am

@awais.hafeez

It is working fine for all the other cases except the one below.
For the following document, images other than the inline ones are also extracted. The input is Inline_Comparison_1 (2).zip (40.5 KB) and the derived interim output is Inline_Comparison_2 (2).zip (8.5 KB)
and the images extracted are jvs_12643_Backup_SO1_shape.zip (89.2 KB)
But the expected output is jvs_12643_Backup_SO1.zip (43.0 KB)
Kindly help. The code I am using is:

private static void inlineImagesExtraction(Document interimdoc) throws Exception
{
    Document doc = interimdoc;
    DocumentBuilder builder = new DocumentBuilder(doc);
    try {
        int i = 1;
        for (Shape shape : (Iterable<Shape>) doc.getChildNodes(NodeType.SHAPE, true)) {
            if (shape.getParentParagraph().getChildNodes().getCount() < 9 ||
                    shape.getAncestor(NodeType.TABLE) != null ||
                    shape.getShapeType() == ShapeType.TEXT_BOX) {

            } else {
                Document newDoc = (Document) doc.deepClone(false);
                newDoc.removeAllChildren();
                newDoc.ensureMinimum();

                Shape importedShape = (Shape) newDoc.importNode(shape, true);
                importedShape.setWrapType(WrapType.NONE);
                importedShape.setHorizontalAlignment(HorizontalAlignment.CENTER);
                importedShape.setVerticalAlignment(VerticalAlignment.CENTER);
                importedShape.setRelativeHorizontalPosition(RelativeHorizontalPosition.PAGE);
                importedShape.setRelativeVerticalPosition(RelativeVerticalPosition.PAGE);

                newDoc.getFirstSection().getBody().getFirstParagraph().appendChild(importedShape);

                builder.moveTo(shape);
                builder.getFont().setColor(Color.red);
                builder.write(" <Inline-Img>Inline-Img" + i + "</Inline-Img> ");

                shape.remove();

                newDoc.save(folderName + "shape_" + i + ".docx");
            }
            i++;
        }

        interimdoc.save(interim);
    }
    catch (Exception e) {

    }
}
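
A side note on the method above: the whole loop sits in a `try` block whose `catch` is empty, so any exception ends the loop without a message and before `interimdoc.save(interim)` runs. This version also calls `shape.getParentParagraph()` without a null check, which would throw if a shape has no parent paragraph. Whether that is what happens with these documents cannot be confirmed from the thread, but reporting the exception would make it visible. The sketch below shows one way to do that with the standard library only; `ExtractionRunner`, `Step`, and `runAndReport` are hypothetical names, and the lambda stands in for the extraction logic.

```java
/** Hypothetical wrapper that reports a failure instead of discarding it. */
final class ExtractionRunner {

    private ExtractionRunner() {
    }

    /** Stand-in for a unit of work that may throw, such as the extraction loop. */
    interface Step {
        void run() throws Exception;
    }

    /** Runs the step; returns null on success, or a short description of the failure. */
    static String runAndReport(Step step) {
        try {
            step.run();
            return null;
        } catch (Exception e) {
            // Report the problem so that an empty output folder can be traced to its cause.
            String message = e.getClass().getSimpleName() + ": " + e.getMessage();
            System.err.println("Extraction failed: " + message);
            return message;
        }
    }

    public static void main(String[] args) {
        String failure = runAndReport(() -> {
            throw new IllegalStateException("shape has no parent paragraph");
        });
        System.out.println(failure != null ? "failed" : "ok");
    }
}
```

In the original method, a simpler change with the same effect would be to print or log the exception inside the existing `catch` block, or to remove the `try`/`catch` entirely, since the method already declares `throws Exception`.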