Table Image Extraction issue

Dear team,

We are extracting images using Aspose.Words for Java, and we are now facing a new issue with images inside tables. Please find below the source code we are using.

Source code:

try {
	if (table.getChildNodes(NodeType.SHAPE, true).getCount() > 0) {
		String imgCaption = "";
		try {
			while (table.getNextSibling().toString(SaveFormat.TEXT).trim().length() == 0
					&& (((Paragraph) table.getNextSibling()).getChildNodes(NodeType.SHAPE, true).getCount() == 0
							&& table.getNextSibling().getNextSibling() != null
							&& table.getNextSibling().getNextSibling().getNodeType() != NodeType.TABLE)) {
				table.getNextSibling().remove();
			}
		} catch (ClassCastException e) {
			logger.info("ClassCastException occur, {0}", e.getMessage());
		}
		if (table.getNextSibling() != null && table.getNextSibling().getText().trim().matches(matches)
				&& !table.toString(SaveFormat.TEXT).toLowerCase().contains("fig")
				&& !table.toString(SaveFormat.TEXT).trim().contains(SCHEME)) {

input : Revised Manuscript(Clean version).docx (1.3 MB)

output : Fig0012.pdf (1.6 KB)
Fig0013.pdf (1.6 KB)
Fig0014.pdf (1.6 KB)
Fig0015.pdf (1.6 KB)

Please do the needful.

@e503824 In your document the images are inside a table. The table contains several images, so in this particular case I think you should extract the whole table instead of extracting an individual shape.
Could you please manually create the expected output you would like to get?

Dear team,

Is it possible to extract these images? Could you please share the source code?

@e503824 Could you please share a document that shows your expected output? More than one shape belongs to each caption, so it is not clear what the expected output should be.

Dear team,

I attached it in my first message.

@e503824 You have attached your input and current output documents. What I am asking for is the expected output. You can create such documents manually in MS Word. This will allow us to better understand your requirements.

Dear team,

Please find the manually extracted images:

gr15.pdf (67.5 KB)
gr13.pdf (68.1 KB)
gr14.pdf (246.9 KB)
gr12.pdf (414.2 KB)

@e503824 Thank you for the additional information. I have modified the ImageExtractor class I suggested earlier so that it handles images grouped by tables. It now extracts such images properly:

Document doc = new Document("C:\\Temp\\in.docx");
ImageExtractor extractor = new ImageExtractor("C:\\Temp\\");
doc.accept(extractor);
    private static class ImageExtractor extends DocumentVisitor {
        public ImageExtractor(String targetFolder) {
            mTargetFolder = targetFolder;
        }

        @Override
        public int visitRowStart(Row row) throws Exception {

            if(row.getChildNodes(NodeType.SHAPE, true).getCount()>0)
                mRows.push(row);

            return super.visitRowStart(row);
        }

        @Override
        public int visitGroupShapeStart(GroupShape groupShape) throws Exception {

            if (groupShape.isTopLevel())
                mTopShapes.push(groupShape);

            saveShapeAsPdf();

            return super.visitGroupShapeStart(groupShape);
        }

        @Override
        public int visitShapeStart(Shape shape) throws Exception {

            if (shape.isTopLevel() &&
                    shape.getChildNodes(NodeType.PARAGRAPH, true).getCount() == 0) {
                mTopShapes.push(shape);
            }

            saveShapeAsPdf();

            return super.visitShapeStart(shape);
        }

        @Override
        public int visitParagraphStart(Paragraph paragraph) throws Exception {

            if (isCaptionParagraph(paragraph))
                mCaptions.push(paragraph.toString(SaveFormat.TEXT).trim());

            saveShapeAsPdf();

            return super.visitParagraphStart(paragraph);
        }

        /**
         * Checks whether paragraph is likely to be an image caption.
         */
        private static boolean isCaptionParagraph(Paragraph paragraph) throws Exception {
            // Take only Run text into account, because if the caption is in a textbox,
            // paragraph.toString returns the same value for both the paragraph
            // inside the textbox shape and the paragraph that contains the textbox shape.

            // Some captions are in textboxes.
            boolean isInshape = paragraph.getAncestor(NodeType.SHAPE)!=null;
            // Some captions are marked as bold.
            boolean isBold = false;
            // More conditions might be added here to better distinguish captions.
            // .........
            
            String paraText = "";
            for (Run r : paragraph.getRuns()) {
                paraText += r.getText();
                isBold |= r.getFont().getBold();
            }

            return (isInshape || isBold) && (paraText.startsWith("Fig") ||
                    paraText.startsWith("Scheme") ||
                    paraText.startsWith("Plate") ||
                    paraText.startsWith("Figure"));
        }

        /**
         * Save the last shape as a separate PDF document.
         */
        private void saveShapeAsPdf() throws Exception {
            if (!mTopShapes.empty() && !mCaptions.empty()) {
                String caption = mCaptions.pop();
                System.out.println(mShapeCounter);
                System.out.println(caption);

                Node imageNode = mTopShapes.peek();
                // Create a temporary document that will be exported to PDF.
                Document tmp = (Document) imageNode.getDocument().deepClone(false);
                Node tmpSection = tmp.importNode(imageNode.getAncestor(NodeType.SECTION), false, ImportFormatMode.USE_DESTINATION_STYLES);
                tmp.appendChild(tmpSection);
                tmp.ensureMinimum();

                if(mTopShapes.size() > 1 && !mRows.isEmpty())
                {
                    Table imagesTable = (Table)mRows.peek().getParentTable().deepClone(false);
                    while (!mRows.isEmpty())
                        imagesTable.prependChild(mRows.pop().deepClone(true));

                    imageNode = imagesTable;
                }

                Node resultImage = tmp.importNode(imageNode, true, ImportFormatMode.USE_DESTINATION_STYLES);
                if(resultImage.getNodeType() == NodeType.TABLE)
                    tmp.getFirstSection().getBody().prependChild(resultImage);
                else
                    tmp.getFirstSection().getBody().getFirstParagraph().appendChild(resultImage);

                // Format the output file path.
                String outFilePath = mTargetFolder + "image_" + mShapeCounter + ".pdf";
                tmp.save(outFilePath);

                mShapeCounter++;

                // Empty stacks.
                mTopShapes.clear();
                mRows.clear();
                mCaptions.clear();
            }
        }

        // While visiting the document, shapes and captions are pushed onto stacks.
        // Only top-level shapes are collected.
        private Stack<ShapeBase> mTopShapes = new Stack<ShapeBase>();
        private Stack<Row> mRows = new Stack<Row>();
        private Stack<String> mCaptions = new Stack<String>();
        private int mShapeCounter = 0;
        private String mTargetFolder;
    }
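The caption check in isCaptionParagraph is a prefix test on the concatenated run text, combined with the "bold or inside a shape" condition. A minimal standalone sketch of the same rule is shown below so it can be tried without a document. CaptionRules and looksLikeCaption are hypothetical names used only for illustration; they are not part of Aspose.Words.

```java
import java.util.regex.Pattern;

/** Hypothetical helper for illustration; not part of Aspose.Words. */
final class CaptionRules {
    // Matches text that starts with one of the caption keywords used above.
    // "Figure" is already covered by "Fig", so it is not listed separately.
    private static final Pattern CAPTION_PREFIX =
            Pattern.compile("^(Fig|Scheme|Plate)");

    private CaptionRules() {
    }

    /**
     * Mirrors the rule in isCaptionParagraph: the paragraph must be bold or
     * inside a shape, and its text must start with a caption keyword.
     */
    static boolean looksLikeCaption(String runText, boolean isBold, boolean isInShape) {
        if (runText == null) {
            return false;
        }
        return (isInShape || isBold) && CAPTION_PREFIX.matcher(runText).find();
    }
}
```

Like the original check, this does not trim leading whitespace, so a caption that starts with a space or tab would not match. If your documents contain such captions, trimming the text first is one option.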

Dear team,

The images are now extracted, but the figure caption is included in the output. Is it possible to remove the figure caption from the extracted figure? Please find the extracted image below.

output : 5.pdf (819.9 KB)

Dear team,

In this case, once the images are extracted we also need to delete them from the document. Please do the needful.

@e503824 To remove the text content, you can remove all runs from the imported node. You can add code like the following right after importing the image node:

if(resultImage.isComposite())
    ((CompositeNode)resultImage).getChildNodes(NodeType.RUN, true).clear();

You can collect the imageNode nodes into a collection and then remove these nodes after processing the document.

Dear team,

Please share the source code to remove the images from the document.

@e503824 In the ImageExtractor class, add a private dictionary that stores the nodes to be removed:

// Dictionary with nodes that should be deleted and replaced with bookmark.
HashMap<String, Node> mNodesToRemove = new HashMap<String, Node>();

In the saveShapeAsPdf method put the image node into this dictionary just before importing it:

mNodesToRemove.put("image_" + mShapeCounter, imageNode);

In the ImageExtractor class, add a public method that loops through the dictionary, removes the image nodes, and puts a bookmark where each removed node was placed.

/**
 * Removes images from the source document and inserts a bookmark at each image position.
 */
public void RemoveImagesFromSourceDocument() throws Exception
{
    for (String key : mNodesToRemove.keySet())
    {
        Node nodeToRemove = mNodesToRemove.get(key);

        DocumentBuilder builder = new DocumentBuilder((Document)nodeToRemove.getDocument());
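        // For a table, move the cursor to the node that follows it.
        // Skip the entry if there is nothing to move to, for example when the
        // table is a clone that is not attached to the document tree.
        Node target = nodeToRemove.getNodeType() == NodeType.TABLE
                ? nodeToRemove.getNextSibling()
                : nodeToRemove;
        if (target == null)
            continue;
        builder.moveTo(target);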

        // Insert bookmark.
        builder.startBookmark(key);
        builder.endBookmark(key);

        // Remove image node.
        nodeToRemove.remove();
    }
}
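The method above follows a collect-first, modify-afterwards pattern: nodes are recorded while the visitor runs and are only removed once traversal has finished. Below is a minimal sketch of that pattern on a plain list of strings, assuming a hypothetical DeferredRemovalSketch class that stands in for the document and is not an Aspose.Words type. It uses LinkedHashMap rather than HashMap, because HashMap gives no ordering guarantee while LinkedHashMap iterates in insertion order.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for a document body; not an Aspose.Words type. */
final class DeferredRemovalSketch {
    private DeferredRemovalSketch() {
    }

    /**
     * Replaces every item that starts with "IMAGE" with a "[bookmark:key]" marker.
     * Positions are collected in a first pass and applied afterwards,
     * so the list is never modified while it is being scanned.
     */
    static List<String> replaceImagesWithMarkers(List<String> body) {
        // LinkedHashMap keeps insertion order, so entries are processed
        // in the order the images were found.
        Map<String, Integer> toRemove = new LinkedHashMap<>();
        int counter = 0;
        for (int i = 0; i < body.size(); i++) {
            if (body.get(i).startsWith("IMAGE")) {
                toRemove.put("image_" + counter++, i);
            }
        }

        // Second pass: apply the collected changes to a copy of the input.
        List<String> result = new ArrayList<>(body);
        for (Map.Entry<String, Integer> entry : toRemove.entrySet()) {
            result.set(entry.getValue(), "[bookmark:" + entry.getKey() + "]");
        }
        return result;
    }
}
```

In the extractor the bookmark names already identify each image, so the iteration order only matters if you log or post-process the removals in sequence.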

Call this method after accepting the visitor, as shown in the following code:

Document doc = new Document("C:\\Temp\\in.docx");
ImageExtractor extractor = new ImageExtractor("C:\\Temp\\");
doc.accept(extractor);
extractor.RemoveImagesFromSourceDocument();
doc.save("C:\\Temp\\out.docx");
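Note that the extractor builds each output path by concatenating mTargetFolder with the file name, so the folder string has to end with a separator, as "C:\\Temp\\" does in the call above. One way to avoid depending on that is to build the path with java.nio.file. The sketch below assumes a hypothetical OutputPaths helper, which is not part of Aspose.Words.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

/** Hypothetical helper for illustration; not part of Aspose.Words. */
final class OutputPaths {
    private OutputPaths() {
    }

    /**
     * Builds the path of the PDF for the given image index inside targetFolder.
     * Works whether or not targetFolder ends with a path separator.
     */
    static Path imagePath(String targetFolder, int index) {
        return Paths.get(targetFolder).resolve("image_" + index + ".pdf");
    }
}
```

Inside saveShapeAsPdf, the result could be passed to the save call as OutputPaths.imagePath(mTargetFolder, mShapeCounter).toString().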