Table Image Extraction issue

e503824 · May 19, 2022, 8:26am

Dear team,

We are extracting images using Aspose Java, and we are now facing a new issue with images inside tables. Please find below the source code we are using.

source code :

try {
	if (table.getChildNodes(NodeType.SHAPE, true).getCount() > 0) {
		String imgCaption = "";
		try {
			while (table.getNextSibling().toString(SaveFormat.TEXT).trim().length() == 0
					&& (((Paragraph) table.getNextSibling()).getChildNodes(NodeType.SHAPE, true).getCount() == 0
							&& table.getNextSibling().getNextSibling() != null
							&& table.getNextSibling().getNextSibling().getNodeType() != NodeType.TABLE)) {
				table.getNextSibling().remove();
			}
		} catch (ClassCastException e) {
			logger.info("ClassCastException occur, {0}", e.getMessage());
		}
		if (table.getNextSibling() != null && table.getNextSibling().getText().trim().matches(matches)
				&& !table.toString(SaveFormat.TEXT).toLowerCase().contains("fig")
				&& !table.toString(SaveFormat.TEXT).trim().contains(SCHEME)) {

input : Revised Manuscript(Clean version).docx (1.3 MB)

output : Fig0012.pdf (1.6 KB)
Fig0013.pdf (1.6 KB)
Fig0014.pdf (1.6 KB)
Fig0015.pdf (1.6 KB)

Please do the needful.

alexey.noskov · May 19, 2022, 8:23pm

@e503824 In your document the images are inside a table. The table contains several images, so I think in this particular case you should extract the whole table instead of a particular shape.
Could you please manually create the expected output you would like to get?

e503824 · May 20, 2022, 5:20am

Dear team,

Is it possible to extract these images? Could you please share the source code?

alexey.noskov · May 20, 2022, 6:13am

@e503824 Could you please share a document that shows your expected output? More than one shape belongs to the caption, so it is not quite clear what the expected output is.

e503824 · May 20, 2022, 6:15am

Dear team,

I attached it in the first message.

alexey.noskov · May 20, 2022, 6:52am

@e503824 You have attached your input and current output documents. What I am asking for is the expected output. You can create such documents manually in MS Word. This will allow us to better understand your requirements.

e503824 · May 20, 2022, 7:12am

Dear team,

Please find the manually extracted images:

gr15.pdf (67.5 KB)
gr13.pdf (68.1 KB)
gr14.pdf (246.9 KB)
gr12.pdf (414.2 KB)

alexey.noskov · May 20, 2022, 2:55pm

@e503824 Thank you for the additional information. I have modified the ImageExtractor class I suggested to you earlier so that it handles images grouped in tables. Now it extracts such images properly:

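import com.aspose.words.*;
import java.util.Stack;

// Load the source document and run the extractor over it.
Document doc = new Document("C:\\Temp\\in.docx");
ImageExtractor extractor = new ImageExtractor("C:\\Temp\\");
doc.accept(extractor);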

    private static class ImageExtractor extends DocumentVisitor {
        public ImageExtractor(String targetFolder) {
            mTargetFolder = targetFolder;
        }

        @Override
        public int visitRowStart(Row row) throws Exception {

            if(row.getChildNodes(NodeType.SHAPE, true).getCount()>0)
                mRows.push(row);

            return super.visitRowStart(row);
        }

        @Override
        public int visitGroupShapeStart(GroupShape groupShape) throws Exception {

            if (groupShape.isTopLevel())
                mTopShapes.push(groupShape);

            saveShapeAsPdf();

            return super.visitGroupShapeStart(groupShape);
        }

        @Override
        public int visitShapeStart(Shape shape) throws Exception {

            if (shape.isTopLevel() &&
                    shape.getChildNodes(NodeType.PARAGRAPH, true).getCount() == 0) {
                mTopShapes.push(shape);
            }

            saveShapeAsPdf();

            return super.visitShapeStart(shape);
        }

        @Override
        public int visitParagraphStart(Paragraph paragraph) throws Exception {

            if (isCaptionParagraph(paragraph))
                mCaptions.push(paragraph.toString(SaveFormat.TEXT).trim());

            saveShapeAsPdf();

            return super.visitParagraphStart(paragraph);
        }

        /**
         * Checks whether paragraph is likely to be an image caption.
         */
        private static boolean isCaptionParagraph(Paragraph paragraph) throws Exception {
            // Take only Run text into account, because if the caption is in a textbox,
            // paragraph.toString returns the same value for both the paragraph
            // inside the textbox shape and the paragraph that contains the textbox shape.

            // Some captions are in textboxes.
            boolean isInshape = paragraph.getAncestor(NodeType.SHAPE)!=null;
            // Some captions are marked as bold.
            boolean isBold = false;
            // More conditions might be added here to better distinguish captions.
            // .........
            
            String paraText = "";
            for (Run r : paragraph.getRuns()) {
                paraText += r.getText();
                isBold |= r.getFont().getBold();
            }

            return (isInshape || isBold) && (paraText.startsWith("Fig") ||
                    paraText.startsWith("Scheme") ||
                    paraText.startsWith("Plate") ||
                    paraText.startsWith("Figure"));
        }

        /**
         * Save the last shape as a separate PDF document.
         */
        private void saveShapeAsPdf() throws Exception {
            if (!mTopShapes.empty() && !mCaptions.empty()) {
                String caption = mCaptions.pop();
                System.out.println(mShapeCounter);
                System.out.println(caption);

                Node imageNode = mTopShapes.peek();
                // Create a temporary document which will be exported to PDF.
                Document tmp = (Document) imageNode.getDocument().deepClone(false);
                Node tmpSection = tmp.importNode(imageNode.getAncestor(NodeType.SECTION), false, ImportFormatMode.USE_DESTINATION_STYLES);
                tmp.appendChild(tmpSection);
                tmp.ensureMinimum();

                if(mTopShapes.size() > 1 && !mRows.isEmpty())
                {
                    Table imagesTable = (Table)mRows.peek().getParentTable().deepClone(false);
                    while (!mRows.isEmpty())
                        imagesTable.prependChild(mRows.pop().deepClone(true));

                    imageNode = imagesTable;
                }

                Node resultImage = tmp.importNode(imageNode, true, ImportFormatMode.USE_DESTINATION_STYLES);
                if(resultImage.getNodeType() == NodeType.TABLE)
                    tmp.getFirstSection().getBody().prependChild(resultImage);
                else
                    tmp.getFirstSection().getBody().getFirstParagraph().appendChild(resultImage);

                // Format the output file path.
                String outFilePath = mTargetFolder + "image_" + mShapeCounter + ".pdf";
                tmp.save(outFilePath);

                mShapeCounter++;

                // Empty stacks.
                mTopShapes.clear();
                mRows.clear();
                mCaptions.clear();
            }
        }

        // While visiting the document, captions and shapes are pushed onto stacks.
        // Only top-level shapes are collected.
        private Stack<ShapeBase> mTopShapes = new Stack<ShapeBase>();
        private Stack<Row> mRows = new Stack<Row>();
        private Stack<String> mCaptions = new Stack<String>();
        private int mShapeCounter = 0;
        private String mTargetFolder;
    }
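
The prefix test at the end of isCaptionParagraph can be tried on its own, without Aspose. The following is only a minimal sketch in plain Java: the class name CaptionPrefixDemo and the method hasCaptionPrefix are hypothetical names used for illustration, and the sketch leaves out the bold and in-shape conditions of the real method.

```java
public class CaptionPrefixDemo {
    // Prefixes from the isCaptionParagraph check above. "Figure" is omitted
    // because any text starting with "Figure" also starts with "Fig".
    private static final String[] PREFIXES = {"Fig", "Scheme", "Plate"};

    /** Returns true when the text starts with one of the caption prefixes. */
    public static boolean hasCaptionPrefix(String paraText) {
        if (paraText == null) {
            return false;
        }
        for (String prefix : PREFIXES) {
            if (paraText.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(hasCaptionPrefix("Fig. 12 Reaction scheme")); // true
        System.out.println(hasCaptionPrefix("Figure 3"));                // true
        System.out.println(hasCaptionPrefix("Table 2 Yields"));          // false
        System.out.println(hasCaptionPrefix(" Fig. 4"));                 // false: leading space
    }
}
```

Like the original check, this sketch is case-sensitive and does not trim the text, so a caption that starts with whitespace or is written as "FIGURE" would not be recognized. If your documents contain such captions, trimming or lower-casing the text before the comparison may be worth considering.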

e503824 · May 23, 2022, 6:40am

Dear team,

The images are now extracted, but the figure caption is extracted along with them. Is it possible to remove the figure caption from the extracted figure? Please find the extracted image below.

output : 5.pdf (819.9 KB)

e503824 · May 23, 2022, 7:53am

Dear team,

In this case, once the images are extracted, we need to delete them from the document. Please do the needful.

alexey.noskov · May 23, 2022, 5:16pm

@e503824 To remove text content, you can remove all runs from the imported node. You can add code like the following right after importing the image node:

if(resultImage.isComposite())
    ((CompositeNode)resultImage).getChildNodes(NodeType.RUN, true).clear();

You can collect the imageNode nodes into a collection and then remove these nodes after processing the document.

e503824 · May 24, 2022, 3:38am

Dear team,

Please share the source code to remove the images from the document.

alexey.noskov · May 24, 2022, 4:34pm

@e503824 In the ImageExtractor class, add a private dictionary that holds the nodes to be removed:

// Dictionary of nodes that should be deleted and replaced with a bookmark.
// Requires: import java.util.HashMap;
private HashMap<String, Node> mNodesToRemove = new HashMap<String, Node>();

In the saveShapeAsPdf method, put the image node into this dictionary just before importing it:

mNodesToRemove.put("image_" + mShapeCounter, imageNode);

In the ImageExtractor class, add a public method that loops through the dictionary, removes the image nodes, and puts a bookmark where each removed node was placed.

/**
 * Removes images from the source document and inserts a bookmark at each image position.
 */
public void RemoveImagesFromSourceDocument()
{
    for (String key : mNodesToRemove.keySet())
    {
        Node nodeToRemove = mNodesToRemove.get(key);

        DocumentBuilder builder = new DocumentBuilder((Document)nodeToRemove.getDocument());
        // For a table, move the cursor to the next paragraph.
        if (nodeToRemove.getNodeType() == NodeType.TABLE)
            builder.moveTo(nodeToRemove.getNextSibling());
        else
            builder.moveTo(nodeToRemove);

        // Insert bookmark.
        builder.startBookmark(key);
        builder.endBookmark(key);

        // Remove image node.
        nodeToRemove.remove();
    }
}

Call this method after accepting the visitor, as shown in the following code:

Document doc = new Document("C:\\Temp\\in.docx");
ImageExtractor extractor = new ImageExtractor("C:\\Temp\\");
doc.accept(extractor);
extractor.RemoveImagesFromSourceDocument();
doc.save("C:\\Temp\\out.docx");
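
The approach in this reply (record the nodes during the visit, then change the document only after the visit has finished) can be illustrated without Aspose. The following is only a minimal sketch in plain Java, assuming a list of strings stands in for the document body and an "IMG:" prefix marks an image. DeferredRemovalDemo, extractAndReplace, and the "[bookmark ...]" placeholder text are hypothetical names for illustration and are not part of any Aspose API.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class DeferredRemovalDemo {
    /**
     * Replaces every entry that starts with "IMG:" by a bookmark placeholder.
     * Entries are recorded in a first pass and replaced in a second pass,
     * so the list is never modified while it is being traversed.
     */
    public static List<String> extractAndReplace(List<String> body) {
        // LinkedHashMap keeps insertion order, so images are handled in document order.
        Map<String, String> toRemove = new LinkedHashMap<String, String>();
        int counter = 0;

        // Pass 1: only record what should be removed.
        for (String item : body) {
            if (item.startsWith("IMG:")) {
                toRemove.put("image_" + counter, item);
                counter++;
            }
        }

        // Pass 2: work on a copy and put a placeholder where each image was.
        List<String> result = new ArrayList<String>(body);
        for (Map.Entry<String, String> entry : toRemove.entrySet()) {
            int index = result.indexOf(entry.getValue());
            result.set(index, "[bookmark " + entry.getKey() + "]");
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> body = new ArrayList<String>();
        body.add("Intro");
        body.add("IMG:a");
        body.add("Fig. 1 caption");
        body.add("IMG:b");
        System.out.println(extractAndReplace(body));
    }
}
```

The sketch uses a LinkedHashMap so that entries are processed in the order they were recorded. The HashMap in the reply above does not guarantee any iteration order, which should not matter when each bookmark is named by its own key, but an ordered map may be easier to follow while debugging.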