Extraction issue

e503824 · June 2, 2022, 12:08pm

Dear team,
We are extracting images from a DOCX file, but in the case below one image is not extracted on its own; it is extracted together with other images. Please find the source code and the input DOCX below.

Source code :

if ((paragraph.toString(SaveFormat.TEXT).toLowerCase().trim().startsWith("fig")
					|| paragraph.toString(SaveFormat.TEXT).startsWith("Scheme")
					|| paragraph.toString(SaveFormat.TEXT).startsWith("Plate")
					|| paragraph.toString(SaveFormat.TEXT).startsWith("Abb")
					|| paragraph.toString(SaveFormat.TEXT).startsWith("Abbildung"))
					// for duplicate figure caption it-15
					&& (paragraph.getNextSibling() != null
							&& !paragraph.getNextSibling().toString(SaveFormat.TEXT).trim().matches(matches)
							|| (paragraph.getNextSibling() != null
									&& paragraph.getNextSibling().getNodeType() != NodeType.TABLE
									&& paragraph.getNextSibling().toString(SaveFormat.TEXT).trim().matches(matches)
									&& (((Paragraph) paragraph.getNextSibling()).getChildNodes(NodeType.SHAPE, true)
											.getCount() > 0
											|| (paragraph.getNextSibling().getNextSibling()) != null
													&& paragraph.getNextSibling().getNextSibling()
															.getNodeType() != NodeType.TABLE
													&& ((((Paragraph) paragraph.getNextSibling().getNextSibling())
															.getChildNodes(NodeType.SHAPE, true).getCount() == 0)
															
															// this condition was added by pavi on 14-12-2021 for duplicate captions
															||(((Paragraph) paragraph.getNextSibling().getNextSibling())
																	.getChildNodes(NodeType.SHAPE, true).getCount() > 0))))
							|| paragraph.getParentSection().getBody().getLastParagraph().getText().trim()
									.matches(matches))
					// for duplicate figure caption
					&& ((paragraph.getPreviousSibling() != null
							&& paragraph.getPreviousSibling().getNodeType() != NodeType.TABLE)
							|| paragraph.getParentSection().getBody().getFirstParagraph().getText().trim()
									.matches(matches))
					&& paragraph.getNodeType() != NodeType.TABLE
					&& paragraph.getParentNode().getNodeType() != NodeType.CELL
					&& !paragraph.toString(SaveFormat.TEXT).contains(AIE.docName)
					
					// condition added by pavi on 14-12-2021; both checks are joined with &&,
					// because "!a || !b" is always true for these two prefixes
					&& !paragraph.toString(SaveFormat.TEXT).trim().startsWith("Figure Captions")
					&& !paragraph.toString(SaveFormat.TEXT).trim().startsWith("Figures"))
					
			        //|| ((paragraph.getNextSibling() == null) && (builder.getCurrentParagraph().isEndOfDocument()))
			        
					
			{

Input and output files : 12.zip (8.6 MB)

alexey.noskov · June 2, 2022, 4:19pm

@e503824 As I already mentioned several times in your other topics, an approach with such a complicated if condition will not work as expected in most cases. The more conditions you add, the harder the condition becomes to debug.
I have tested image extraction using the code I suggested here, and the images are extracted fine.
Your general task is document content analysis, and that is outside the scope of Aspose.Words. Aspose.Words gives you a tool for reading the document's content, but the analysis itself must be implemented in your code.
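
One way to make such a condition easier to debug is to split it into small named predicates that can each be checked on their own. The sketch below is plain Java and does not call Aspose.Words; `ParagraphInfo` and `CaptionRules` are hypothetical names introduced for illustration, and the fields are assumed to be filled from the document beforehand. It does not reproduce the full condition from the first post, only a few of its checks, and it compares prefixes case-insensitively as a simplification.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

/** Hypothetical, simplified view of a paragraph (not an Aspose.Words type). */
final class ParagraphInfo {
    final String text;
    final boolean insideTableCell;
    final boolean nextSiblingIsTable;

    ParagraphInfo(String text, boolean insideTableCell, boolean nextSiblingIsTable) {
        this.text = text;
        this.insideTableCell = insideTableCell;
        this.nextSiblingIsTable = nextSiblingIsTable;
    }
}

/** Each rule is a separate method, so a failing rule can be found quickly. */
final class CaptionRules {
    private static final List<String> PREFIXES =
            Arrays.asList("fig", "scheme", "plate", "abb");
    private static final List<String> LIST_HEADINGS =
            Arrays.asList("figure captions", "figures");

    private static String normalize(String text) {
        return text.trim().toLowerCase(Locale.ROOT);
    }

    /** True if the text starts with one of the caption prefixes. */
    static boolean hasCaptionPrefix(ParagraphInfo p) {
        String t = normalize(p.text);
        return PREFIXES.stream().anyMatch(t::startsWith);
    }

    /** True for headings such as "Figure Captions" that are not captions themselves. */
    static boolean isListHeading(ParagraphInfo p) {
        String t = normalize(p.text);
        return LIST_HEADINGS.stream().anyMatch(t::startsWith);
    }

    /** Combines the individual rules; every part can be logged or tested separately. */
    static boolean isFigureCaption(ParagraphInfo p) {
        return hasCaptionPrefix(p)
                && !isListHeading(p)
                && !p.insideTableCell
                && !p.nextSiblingIsTable;
    }
}
```

When a caption is missed, printing the result of each predicate for that paragraph shows which rule rejected it, which is much harder to see with a single nested expression.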

e503824 · June 3, 2022, 5:27am

Dear team,

We have already used the given source code as well, but the image is still not extracted.

class ImageExtractor extends DocumentVisitor {

    public ImageExtractor(String targetFolder, Document doc) {
        mTargetFolder = targetFolder;
        sourceDoc=doc;
    }

	    /**
	     * Removes images from the source document and inserts bookmark at the image position.
	     */
	 public void RemoveImagesFromSourceDocument()
	 {
	     try {
		 for (String key : mNodesToRemove.keySet())
	     {
	         Node nodeToRemove = mNodesToRemove.get(key);
	
	         DocumentBuilder builder = new DocumentBuilder((Document)nodeToRemove.getDocument());
	         // In case of table move cursor to the next paragraph.
	         if (nodeToRemove.getNodeType() == NodeType.TABLE)
	             builder.moveTo(nodeToRemove.getNextSibling());
	         else
	             builder.moveTo(nodeToRemove);
	         
	         // Insert bookmark.
	         /*builder.startBookmark(key);
	         builder.endBookmark(key);*/
	
	         // Remove image node.
	         nodeToRemove.remove();
	     }}
	     catch(Exception e){logger.info(e.getMessage());}
	 }

	@Override
    public int visitRowStart(Row row) throws Exception {

        if(row.getChildNodes(NodeType.SHAPE, true).getCount()>0)
            mRows.push(row);
        return super.visitRowStart(row);
    }

    @Override
    public int visitGroupShapeStart(GroupShape groupShape) throws Exception {

        if (groupShape.isTopLevel())
            mTopShapes.push(groupShape);

        saveShapeAsPdf();
        
        return super.visitGroupShapeStart(groupShape);
    }

    @Override
    public int visitShapeStart(Shape shape) throws Exception {

        if (shape.isTopLevel() &&
                shape.getChildNodes(NodeType.PARAGRAPH, true).getCount() == 0) {
            mTopShapes.push(shape);
        }
        
        saveShapeAsPdf();
        return super.visitShapeStart(shape);
    }

    @Override
    public int visitParagraphStart(Paragraph paragraph) throws Exception {

        if (isCaptionParagraph(paragraph))
            mCaptions.push(paragraph.toString(SaveFormat.TEXT).trim());

        saveShapeAsPdf();

        return super.visitParagraphStart(paragraph);
    }

    /**
     * Checks whether paragraph is likely to be an image caption.
     */
    private static boolean isCaptionParagraph(Paragraph paragraph) throws Exception {
        // Take only Run text into account, because if the caption is in a textbox,
        // paragraph.toString returns the same value for both the paragraph inside
        // the textbox shape and the paragraph that contains the textbox shape.

        // Some captions are in textboxes.
        boolean isInshape = paragraph.getAncestor(NodeType.SHAPE)!=null;

        // Some captions are marked as bold.
        boolean isBold = false;
        // More conditions might be added here to better distinguish captions.
        // .........
        
        
        
        String paraText = "";
        for (Run r : paragraph.getRuns()) {
            paraText += r.getText();
            isBold |= r.getFont().getBold();
        }

        return (isInshape || isBold) && (paraText.startsWith("Fig") ||
                paraText.startsWith("Scheme") ||
                paraText.startsWith("Plate") ||
                paraText.startsWith("Figure"));
    }

    /**
     * Save the last shape as a separate PDF document.
     */
    private void saveShapeAsPdf() throws Exception {
        try {
        	
    	if (!mTopShapes.empty() && !mCaptions.empty()) {
    		
            String caption = mCaptions.pop();
           
            Node imageNode = mTopShapes.peek();
            
            if(imageNode.getParentNode().getChildNodes(NodeType.TABLE, true) != null) {
            
            // Create a temporary document which will be exported to PDF.
            Document tmp = (Document) imageNode.getDocument().deepClone(false);
            Node tmpSection = tmp.importNode(imageNode.getAncestor(NodeType.SECTION), false, ImportFormatMode.USE_DESTINATION_STYLES);
            tmp.appendChild(tmpSection);
            tmp.ensureMinimum();

            if(mTopShapes.size() > 1 && !mRows.isEmpty())
            {
                Table imagesTable = (Table)mRows.peek().getParentTable().deepClone(false);
                while (!mRows.isEmpty())
                    imagesTable.prependChild(mRows.pop().deepClone(true));

                imageNode = imagesTable;
            }

            Node resultImage = tmp.importNode(imageNode, true, ImportFormatMode.USE_DESTINATION_STYLES);
            if(resultImage.getNodeType() == NodeType.TABLE)
                tmp.getFirstSection().getBody().prependChild(resultImage);
            else
                tmp.getFirstSection().getBody().getFirstParagraph().appendChild(resultImage);
            
            if(resultImage.isComposite()) {
                resultImage.getRange().unlinkFields();
                ((CompositeNode) resultImage).getChildNodes(NodeType.RUN, true).clear();
            }
            
            //System.out.println("caption: "+caption);
            String bookmarkname = AIE.formatImgcaption(caption, AIE.fileName);
            String newBookmarkName=bookmarkname.substring(bookmarkname.lastIndexOf('_') + 1);

            // Format the output file path.
            String outFilePath = mTargetFolder +newBookmarkName  + ".pdf";
            tmp.save(outFilePath);
            

            
            Paragraph pa=(Paragraph) imageNode.getParentNode();
            AIE.insertBookmark(sourceDoc, pa, bookmarkname);
            
           mNodesToRemove.put(newBookmarkName, imageNode);
            
            //imageNode.remove();
            
            AIE.configurationWork(bookmarkname, tmp, outFilePath);
            
            mShapeCounter++;

            // Empty stacks.
            mTopShapes.clear();
            mRows.clear();
            mCaptions.clear();
            
            }
            
        }
        }
        catch(Exception e) {
        	logger.info(e.getMessage());
        	}
        }
    

    // While visiting the document, captions and shapes are pushed onto stacks.
    // Only top-level shapes are collected.
    private Stack<ShapeBase> mTopShapes = new Stack<ShapeBase>();
    private Stack<Row> mRows = new Stack<Row>();
    private Stack<String> mCaptions = new Stack<String>();
    private int mShapeCounter = 0;
    private String mTargetFolder;
    private Document sourceDoc;
    HashMap<String, Node> mNodesToRemove = new HashMap<String, Node>();
    private static org.apache.logging.log4j.Logger logger = LogManager.getLogger(TextFrameImage.class);
    
}

alexey.noskov · June 3, 2022, 3:11pm

@e503824 In your variant of the ImageExtractor class you check whether the caption text is bold. In the attached document the caption is not bold, so it does not pass the condition in the isCaptionParagraph method. On my side the following isCaptionParagraph method is used:

/**
 * Checks whether paragraph is likely to be an image caption.
 */
private static boolean isCaptionParagraph(Paragraph paragraph) throws Exception {
    // Take only Run text into account, because if the caption is in a textbox,
    // paragraph.toString returns the same value for both the paragraph inside
    // the textbox shape and the paragraph that contains the textbox shape.

    // Captions often contain SEQ fields.
    boolean hasSeqFields = false;
    for (Field f : paragraph.getRange().getFields())
        hasSeqFields |= (f.getType() == FieldType.FIELD_SEQUENCE);
    // More conditions might be added here to better distinguish captions.
    // .........

    String paraText = "";
    for (Run r : paragraph.getRuns()) {
        paraText += r.getText();
    }

    boolean hasCaptionLikeContent = (paraText.startsWith("Fig") ||
            paraText.startsWith("Scheme") ||
            paraText.startsWith("Plate") ||
            paraText.startsWith("Figure"));

    return  hasSeqFields || hasCaptionLikeContent;
}

You can modify this method according to your document's content to catch image caption paragraphs.
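
As an illustration of such a modification, the chain of startsWith checks could be expressed as one case-insensitive pattern that also covers the German prefixes ("Abb", "Abbildung") from the first post. This is a minimal sketch in plain Java that works on the paragraph text only; the class name `CaptionText` and the pattern itself are assumptions, not part of Aspose.Words. Requiring a number after the prefix is also an assumption, so the pattern would need tuning for real documents.

```java
import java.util.regex.Pattern;

/** Hypothetical helper: decides from the text alone whether it looks like a figure caption. */
final class CaptionText {
    // A caption prefix, an optional dot, optional whitespace, then a number:
    // "Fig. 1", "Figure 2", "Abb. 3", "Abbildung 4", "Scheme 5", "Plate 6".
    private static final Pattern CAPTION = Pattern.compile(
            "^(fig(ure)?|scheme|plate|abb(ildung)?)\\.?\\s*\\d+",
            Pattern.CASE_INSENSITIVE);

    static boolean looksLikeCaption(String paragraphText) {
        if (paragraphText == null) {
            return false;
        }
        // find() is enough because the pattern is anchored at the start with ^.
        return CAPTION.matcher(paragraphText.trim()).find();
    }
}
```

Inside isCaptionParagraph, the concatenated run text (paraText) could be passed to looksLikeCaption in place of the startsWith chain. Because a number is required, headings such as "Figure Captions" or "Figures" do not match.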