Issue on Image extraction - failure

Mahi39 · May 9, 2022, 6:19am

Hi Team,

We are extracting images from documents using Aspose.Words. We have run into a new scenario with one document: the images are extracted, but the resulting PDF files contain only empty pages.

My code:

static String matches = "Fig.*(?:[ \\r\\n\\t].*)+|Scheme.*|Plate.*|Abbildung.*|Fig.*(?:[ \\r\\n\\t]*)+";
private static org.apache.logging.log4j.Logger logger = LogManager.getLogger(FixedGraphic.class);
static int count = 1;
static Resultjson rs;

public static void fixedImage(Document interimdoc) throws Exception {

    String pdf;
    NodeCollection shapes = interimdoc.getChildNodes(NodeType.SHAPE, true);
    LayoutCollector collector = new LayoutCollector(interimdoc);
    int imageIndex = 1;

    for (Shape shape : (Iterable<Shape>) shapes) {
        String text = "NoMatch";
        try {
            text = shape.getParentParagraph().getAncestor(NodeType.TABLE).getPreviousSibling().toString(SaveFormat.TEXT);
        } catch (Exception e) {
            logger.info(e.getMessage());
        }

        try {
            if (shape.hasImage() && !text.contains(AIE.docName)) {
                String imgName = "FX" + imageIndex;
                pdf = pdfFolder + imgName + ".pdf";

                Document itermDoc = (Document) interimdoc.deepClone(false);

                itermDoc.appendChild(itermDoc.importNode(
                        shape.getAncestor(NodeType.SECTION),
                        false,
                        ImportFormatMode.USE_DESTINATION_STYLES));

                itermDoc.ensureMinimum();

                Node importedShape = itermDoc.importNode(shape, true, ImportFormatMode.USE_DESTINATION_STYLES);
                itermDoc.getFirstSection().getBody().getFirstParagraph().appendChild(importedShape);

                // Save as PDF.
                itermDoc.save(pdf);
                imageIndex++;
            }
        } catch (Exception e) {
            logger.info(e.getMessage());
        }
    }
}

Input doc: davids_et_al_2022_05.docx (3.3 MB)

My output: FX1.pdf (38.7 KB)

Please do the needful. Thanks.

alexey.noskov · May 9, 2022, 2:59pm

@Mahesh39 The problem occurs because the image shapes in your document are inside group shapes. A group shape has its own coordinate system, and when you import a child shape out of the group, the shape's coordinates are not adjusted. That is why you do not see the image in the output document. You can fix this either by extracting only the image bytes and inserting the image into the destination document:

itermDoc.ensureMinimum();

DocumentBuilder builder = new DocumentBuilder(itermDoc);
builder.insertImage(shape.getImageData().getImageBytes());

// Save as PDF.
itermDoc.save(pdf);
imageIndex++;

or by importing the parent group shape:

itermDoc.ensureMinimum();

Node shapeNode = shape;
while (shapeNode.getParentNode().getNodeType() == NodeType.GROUP_SHAPE)
    shapeNode = shapeNode.getParentNode();

Node importedShape = itermDoc.importNode(shapeNode, true, ImportFormatMode.USE_DESTINATION_STYLES);
itermDoc.getFirstSection().getBody().getFirstParagraph().appendChild(importedShape);

// Save as PDF.
itermDoc.save(pdf);
imageIndex++;
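
To illustrate why a shape taken out of its group can disappear, here is a minimal sketch in plain Java. It is a toy model, not the Aspose.Words API: the `Group` class, its fields, and the sample numbers are all hypothetical, and it assumes the usual linear mapping from a group's internal coordinate space to its frame on the page.

```java
// Toy model (NOT Aspose.Words classes): shows how a child's position inside a
// group's coordinate space maps to page units, and what happens if the raw
// value is used without that mapping.
final class GroupCoordinateDemo {

    /** Hypothetical stand-in for a group shape. */
    static final class Group {
        final double left, top, width, height;     // frame on the page, in points
        final double originX, originY;             // origin of the internal space
        final double sizeX, sizeY;                 // extent of the internal space

        Group(double left, double top, double width, double height,
              double originX, double originY, double sizeX, double sizeY) {
            this.left = left; this.top = top; this.width = width; this.height = height;
            this.originX = originX; this.originY = originY;
            this.sizeX = sizeX; this.sizeY = sizeY;
        }

        /** Converts a child's horizontal position from group units to page points. */
        double toParentX(double childLeft) {
            return left + (childLeft - originX) * width / sizeX;
        }

        /** Converts a child's vertical position from group units to page points. */
        double toParentY(double childTop) {
            return top + (childTop - originY) * height / sizeY;
        }
    }

    public static void main(String[] args) {
        // A 300 x 200 pt group placed at (72, 72) with a 6000 x 4000 internal space.
        Group group = new Group(72, 72, 300, 200, 0, 0, 6000, 4000);
        double childLeft = 3000, childTop = 2000;  // child position in group units

        System.out.println(group.toParentX(childLeft)); // 222.0
        System.out.println(group.toParentY(childTop));  // 172.0
        // Read as points without the mapping, left = 3000 is far outside an
        // A4 page (about 595 pt wide), so the shape is not visible.
    }
}
```

Both fixes above sidestep this mapping: inserting the image bytes creates a fresh shape with no inherited position, and importing the whole group keeps the child inside the coordinate space it was authored in.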

Mahi39 · May 10, 2022, 3:25am

@alexey.noskov Thanks for your support. I tried the above code, but unfortunately it did not work; we still get the same result. Please suggest an alternative.

Current output: FX1.pdf (27.4 KB)

alexey.noskov · May 10, 2022, 4:33am

@Mahesh39 Which of the suggested approaches did you use? If you used the second approach, I suspect you forgot to change this line of code:

Node importedShape = itermDoc.importNode(shapeNode, true, ImportFormatMode.USE_DESTINATION_STYLES);

Note that the imported node is shapeNode, not shape. Please check.

Mahi39 · May 10, 2022, 4:52am

@alexey.noskov Thanks, it's working now. We need only the figure, not the caption. Is it possible to remove the image caption?

One more question about the above code: if a group contains more than one figure, the same figure is extracted twice.

Input: davids_et_al_2022_05.docx (731.2 KB)

Extracted images: FX3.pdf (192.6 KB)
FX4.pdf (192.6 KB)

alexey.noskov · May 10, 2022, 5:18am

@Mahesh39 In this case the first approach fits your needs better, so use the other approach I suggested: extract the image bytes and insert them with DocumentBuilder.
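
The duplicate files are a consequence of the second approach: every image shape in a group climbs to the same top-level group, so that group is saved once per image it contains. The sketch below reproduces this with a toy node tree in plain Java. The `Node` class and the `distinctTopLevel` helper are hypothetical and are not part of Aspose.Words; the helper only shows how repeats could be skipped if one stayed with group import.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Set;

// Toy model (NOT Aspose.Words classes) of shapes nested inside a group.
final class TopLevelGroupDemo {

    /** Hypothetical stand-in for a document node. */
    static final class Node {
        final String name;
        final boolean isGroup;
        final Node parent;

        Node(String name, boolean isGroup, Node parent) {
            this.name = name;
            this.isGroup = isGroup;
            this.parent = parent;
        }
    }

    /** Climbs while the parent is a group, like the while loop in the second approach. */
    static Node topLevel(Node shape) {
        Node node = shape;
        while (node.parent != null && node.parent.isGroup) {
            node = node.parent;
        }
        return node;
    }

    /** Returns each distinct top-level node once, in first-seen order. */
    static List<Node> distinctTopLevel(List<Node> imageShapes) {
        Set<Node> seen = Collections.newSetFromMap(new IdentityHashMap<>());
        List<Node> result = new ArrayList<>();
        for (Node shape : imageShapes) {
            Node top = topLevel(shape);
            if (seen.add(top)) {   // add() returns false for a group already seen
                result.add(top);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Node paragraph = new Node("paragraph", false, null);
        Node group = new Node("group", true, paragraph);
        Node imageA = new Node("imageA", false, group);
        Node imageB = new Node("imageB", false, group);

        // Both images resolve to the same group, hence two identical exports.
        System.out.println(topLevel(imageA) == topLevel(imageB));                      // true
        System.out.println(distinctTopLevel(Arrays.asList(imageA, imageB)).size());    // 1
    }
}
```

Skipping repeats would still export the whole group in one file, together with anything else it contains. Inserting each image's bytes separately, as in the first approach, yields one file per figure instead.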