Image extraction issue 7

e503824 · June 7, 2022, 10:48am

Dear team,

We are extracting images from DOCX files, but the document below contains a flowchart that we are not able to extract. Please find the source code and input file below.

Source code :

public class FixedGraphic {
	
	static String matches = "Fig.*(?:[ \\r\\n\\t].*)+|Scheme.*|Plate.*|Abbildung.*|Fig.*(?:[ \\r\\n\\t]*)+";
	private static org.apache.logging.log4j.Logger logger = LogManager.getLogger(FixedGraphic.class);
	static int count = 1;
	static Resultjson rs;
	
	public static void fixedImage(Document interimdoc) throws Exception {
		
		String pdf;
		NodeCollection shapes = interimdoc.getChildNodes(NodeType.SHAPE, true);
		LayoutCollector collector = new LayoutCollector(interimdoc);
		int imageIndex = 1;
		
		for (Shape shape : (Iterable<Shape>)shapes)
		{
			String text="NoMatch";
			try {
				text=shape.getParentParagraph().getAncestor(NodeType.TABLE).getPreviousSibling().toString(SaveFormat.TEXT);
				
				
			}
			catch(Exception e) {
				logger.info(e.getMessage());
			}
			
					try {
						
						//25.05.2022 - Mahe - 27
						boolean mathType=isOleEquation(shape);
						//System.out.println("mathType: "+mathType);
						
						if (shape.hasImage() && !text.contains(AIE.docName) && mathType==false && !AIE.supplementaryFigure)
					    {
							String imgName = "";
							String lwFilearg= AIE.filearg.toLowerCase().replaceAll("\\s", "");
							if(lwFilearg.contains("graphicalabstract")) {
								 imgName ="GA" +imageIndex;
							}
							else {
					    	 imgName ="FX" +imageIndex;
							}
					    	
					    	pdf = AIE.pdfFolder + imgName + ".pdf";
					    	imageIndex++;
					        // Create an intermediate document to where shape will be imported to.
					        Document itermDoc = (Document)interimdoc.deepClone(false);
					        // use section imported from the source document to keep the same page size and orientation.
					        itermDoc.appendChild(itermDoc.importNode(
					                shape.getAncestor(NodeType.SECTION),
					                false,
					                ImportFormatMode.USE_DESTINATION_STYLES));
					        
					        // Add required nodes since we did not import child nodes from the source document.
					        itermDoc.ensureMinimum();
					        
					        Node shapeNode = shape;
					        while (shapeNode.getParentNode().getNodeType() == NodeType.GROUP_SHAPE)
					        	{shapeNode = shapeNode.getParentNode();}
					        
					        // Import shape and put it into the document.
					        Node importedShape = itermDoc.importNode(shapeNode, true, ImportFormatMode.USE_DESTINATION_STYLES);
					        itermDoc.getFirstSection().getBody().getFirstParagraph().appendChild(importedShape);
					         
					        // Save as PDF.
					        itermDoc.save(pdf);
					        
					        int width = (int) shape.getWidth();
					        int height = (int) shape.getHeight();
							int pageNO =collector.getStartPageIndex(shape);
							AIE.extractedimage.add(imgName);
							AIE.allimages.add(imgName);
							
							String extracteddetails =Kromatrix.fixedImgCreatejson(imgName, width, height,pageNO, pdf, AIE.allimages);
							rs.setExtracteddetails(extracteddetails);
							
					        //Added bookmark in interim document
					        Paragraph pa = shape.getParentParagraph();
					        AIE.insertBookmark(interimdoc, pa, imgName);
					        
					       
					        //Remove the figure
					        shape.remove();
					    }
					}
					catch(Exception e) {
						logger.info(e.getMessage());
					}
			}
		}

input file : Manuscript RATE-protocol FINALR3.docx (73.5 KB)

alexey.noskov · June 7, 2022, 7:04pm

@e503824 The problem is that your flowchart is not an image, so the following condition will never pass:

if (shape.hasImage() && !text.contains(AIE.docName) && mathType==false && !AIE.supplementaryFigure)

None of the shapes in your document has an image. The flow chart in your document is built from many textbox shapes and arrows that are not even grouped. This makes it difficult to extract, because each part of the flow chart is a separate shape with its own absolute position. In addition, the shapes are children of different paragraphs, so to preserve their positions they must be copied together with their parent paragraphs.
If you have control over the document creation process, I would suggest that you at least group the shapes.

e503824 · June 8, 2022, 3:47am

Is there any way to extract flow charts?

alexey.noskov · June 8, 2022, 2:08pm

@e503824 Unfortunately, I cannot provide simple code to extract this kind of flow chart. Extracting absolutely positioned shapes while preserving their positions in the output document requires fairly complex analysis. A further problem is that the position of the shapes depends on the empty paragraphs between the paragraphs that contain them. If you extract only the paragraphs that contain shapes, the layout of the flow chart is distorted: out.pdf (66.3 KB).
The following code extracts the flow chart shapes correctly, but it assumes that every node between the first paragraph with a shape and the last paragraph with a shape must be extracted. That will not work in the general case, for example when the document contains several such groups of shapes:

Document doc = new Document("C:\\Temp\\in.docx");
Document tmpDoc = (Document)doc.deepClone(false);
tmpDoc.ensureMinimum();

Paragraph startPara = null;
Paragraph endPara = null;

// Get the start and end paragraphs. The code is for demonstration purposes
// and assumes that it is required to extract the paragraphs that contain
// shapes and all paragraphs between them.
Iterable<Paragraph> paragraphs = doc.getChildNodes(NodeType.PARAGRAPH, true);
for (Paragraph p : paragraphs)
{
    if (p.getChildNodes(NodeType.SHAPE, true).getCount() > 0)
    {
        if (startPara == null)
            startPara = p;

        endPara = p;
    }
}

// Copy every node from the first shape paragraph through the last one, inclusive.
// Nothing is copied when the document contains no shapes (startPara is null).
Node current = startPara;
while (current != null)
{
    Node dstNode = tmpDoc.importNode(current, true, ImportFormatMode.USE_DESTINATION_STYLES);
    tmpDoc.getFirstSection().getBody().appendChild(dstNode);
    // Stop once the last paragraph with a shape has been copied.
    if (current == endPara)
        break;
    current = current.getNextSibling();
}

tmpDoc.save("C:\\Temp\\out.pdf");

out_correct.pdf (66.6 KB)
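
For the case of several separate groups of shapes mentioned above, one possible starting point is to split the shape paragraphs into ranges before copying. The sketch below is plain Java and does not call Aspose.Words. It assumes that two flow charts are always separated by more than maxGap shape-free paragraphs, which may not hold for every document. The class ShapeRangeFinder, the method findRanges and the maxGap parameter are hypothetical names introduced only for this illustration. The hasShape array would be filled with the same check used in the code above (p.getChildNodes(NodeType.SHAPE, true).getCount() > 0).

```java
import java.util.ArrayList;
import java.util.List;

final class ShapeRangeFinder {

    /**
     * Groups shape-bearing paragraphs into inclusive [start, end] index ranges.
     * hasShape[i] is true when body paragraph i contains at least one shape.
     * Two shape paragraphs fall into the same range when at most maxGap
     * shape-free paragraphs lie between them (an assumption of this sketch).
     */
    static List<int[]> findRanges(boolean[] hasShape, int maxGap) {
        List<int[]> ranges = new ArrayList<>();
        int start = -1; // first paragraph of the range being built, -1 if none
        int last = -1;  // most recent shape paragraph in that range
        for (int i = 0; i < hasShape.length; i++) {
            if (!hasShape[i]) {
                continue;
            }
            if (start >= 0 && i - last - 1 > maxGap) {
                // The gap is too large: close the current range.
                ranges.add(new int[] {start, last});
                start = -1;
            }
            if (start < 0) {
                start = i;
            }
            last = i;
        }
        if (start >= 0) {
            ranges.add(new int[] {start, last});
        }
        return ranges;
    }

    public static void main(String[] args) {
        // Hypothetical document: shapes in paragraphs 2, 3, 5 and again in 12, 13.
        boolean[] hasShape = new boolean[15];
        for (int i : new int[] {2, 3, 5, 12, 13}) {
            hasShape[i] = true;
        }
        for (int[] r : findRanges(hasShape, 2)) {
            System.out.println(r[0] + ".." + r[1]);
        }
    }
}
```

Each returned range could then be copied into its own temporary document with the same import loop as above, using the paragraphs at the range's start and end indexes as startPara and endPara. A suitable maxGap value depends on the document and would need to be tuned.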