How to remove formulas in image extraction

e503824 · May 24, 2022, 10:10am

Dear team,

We are extracting images from a DOCX file, but in the case below formulas are extracted as well. How can we skip the formulas?

Source code :

String pdf;
NodeCollection shapes = interimdoc.getChildNodes(NodeType.SHAPE, true);
LayoutCollector collector = new LayoutCollector(interimdoc);
int imageIndex = 1;

for (Shape shape : (Iterable<Shape>)shapes)
{
	String text = "NoMatch";
	try
	{
		text = shape.getParentParagraph().getAncestor(NodeType.TABLE).getPreviousSibling()
				.toString(SaveFormat.TEXT);
	}
	catch (Exception e)
	{
		logger.info(e.getMessage());
	}

	try
	{

		if (shape.hasImage() && !text.contains(AIE.docName))
		{

			String imgName = "FX" + imageIndex;
			pdf = AIE.pdfFolder + imgName + ".pdf";
			imageIndex++;
			// Create an intermediate document to where shape will be imported to.
			Document itermDoc = (Document)interimdoc.deepClone(false);
			// use section imported from the source document to keep the same page size and orientation.
			itermDoc.appendChild(itermDoc.importNode(shape.getAncestor(NodeType.SECTION), false,
					ImportFormatMode.USE_DESTINATION_STYLES));

			// Add required nodes since we did not import child nodes from the source document.
			itermDoc.ensureMinimum();

			Node shapeNode = shape;
			while (shapeNode.getParentNode().getNodeType() == NodeType.GROUP_SHAPE)
			{
				shapeNode = shapeNode.getParentNode();
			}

			// Import shape and put it into the document.
			Node importedShape = itermDoc.importNode(shapeNode, true, ImportFormatMode.USE_DESTINATION_STYLES);
			itermDoc.getFirstSection().getBody().getFirstParagraph().appendChild(importedShape);

			// Save as PDF.
			itermDoc.save(pdf);
		}

input : Manuscript (clean version) (1).docx (2.7 MB)

output : FX1.pdf (34.8 KB)
FX2.pdf (25.9 KB)
FX3.pdf (26.3 KB)
FX4.pdf (20.8 KB)
FX5.pdf (21.8 KB)
FX6.pdf (33.6 KB)
FX7.pdf (32.1 KB)
FX8.pdf (34.8 KB)
FX9.pdf (22.3 KB)

Please help.

alexey.noskov · May 24, 2022, 4:34pm

@e503824 I have answered this question in another thread of yours.

e503824 · May 25, 2022, 4:38am

Dear team,

In this case some formulas are also extracted by the caption-below and caption-above conditions. Please find the source code and a sample input file below.

Source code :

for (Paragraph paragraph : (Iterable<Paragraph>)paragraphs)
{
	try
	{
		//System.out.println("Para above: "+paragraph.getText().toString());
		if ((paragraph.toString(SaveFormat.TEXT).toLowerCase().trim().startsWith("fig")
				|| paragraph.toString(SaveFormat.TEXT).startsWith("Scheme")
				|| paragraph.toString(SaveFormat.TEXT).startsWith("Plate")
				|| paragraph.toString(SaveFormat.TEXT).startsWith("Abb")
				|| paragraph.toString(SaveFormat.TEXT).startsWith("Abbildung")
						&& paragraph.getNodeType() != NodeType.TABLE)
				//						//changes by pavi -starts check sample  D:\testing\AIE\Iteration 16_4 points\Document contains Duplicate figure captions\Revised-MANUSCRIPT
				&& ((paragraph.getNextSibling() != null
				&& paragraph.getNextSibling().getNodeType() != NodeType.TABLE)
				|| paragraph.getParentSection().getBody().getFirstParagraph().getText().trim()
						.matches(matches))

				//	&& paragraph.getNextSibling().getNodeType() != NodeType.TABLE
				//changes by pavi -end 
				&& paragraph.getChildNodes(NodeType.SHAPE, true).getCount() == 0
				&& !paragraph.toString(SaveFormat.TEXT).contains(AIE.docName)
				&& !paragraph.getNextSibling().toString(SaveFormat.TEXT).trim().matches(matches)//duplicate caption by pavi
				&& !(paragraph.toString(SaveFormat.TEXT).trim().startsWith("Figure Captions")) &&

				!(paragraph.toString(SaveFormat.TEXT).trim().startsWith("Figures")))
		{

			// supplymentry check sample: JCIS_SRE_2020_1_2nd_revision.docx
			if (AIE.supplymentryCheck(paragraph.toString(SaveFormat.TEXT).trim()))
			{
				AIE.insertBookmark(interimdoc, paragraph, AIE.fileName);
				continue;
			}

Input : new.docx (311.4 KB)

Output : Fig0015A.pdf (8.2 KB)
Fig0015.pdf (11.3 KB)

We need to skip these formulas. Please help.

alexey.noskov · May 25, 2022, 3:30pm

@e503824 As I already mentioned, increasing the number of conditions in the if statement makes your code more and more complicated and harder to maintain and debug.
I have modified the DocumentVisitor implementation I suggested earlier so that it skips equations. The code now works properly with the document you attached:

private static class ImageExtractor extends DocumentVisitor {
    public ImageExtractor(String targetFolder) {
        mTargetFolder = targetFolder;
    }

    @Override
    public int visitRowStart(Row row) throws Exception {

        if(row.getChildNodes(NodeType.SHAPE, true).getCount()>0)
            mRows.push(row);

        return super.visitRowStart(row);
    }

    @Override
    public int visitGroupShapeStart(GroupShape groupShape) throws Exception {

        if (groupShape.isTopLevel())
            mTopShapes.push(groupShape);

        saveShapeAsPdf();

        return super.visitGroupShapeStart(groupShape);
    }

    @Override
    public int visitShapeStart(Shape shape) throws Exception {

        if (shape.isTopLevel() &&
                (shape.getChildNodes(NodeType.PARAGRAPH, true).getCount() == 0) &&
                !isOleEquation(shape)) {
            mTopShapes.push(shape);
        }

        saveShapeAsPdf();

        return super.visitShapeStart(shape);
    }

    @Override
    public int visitParagraphStart(Paragraph paragraph) throws Exception {

        if (isCaptionParagraph(paragraph)) {
            mCaptions.push(paragraph.toString(SaveFormat.TEXT).trim());
            saveShapeAsPdf();
            return VisitorAction.SKIP_THIS_NODE;
        }

        return super.visitParagraphStart(paragraph);
    }

    /**
     * Removes images from the source document and inserts a bookmark at the image position.
     */
    public void RemoveImagesFromSourceDocument()
    {
        for (String key : mNodesToRemove.keySet()) {

            ArrayList<Node> nodesToRemove  = mNodesToRemove.get(key);
            if(nodesToRemove.size() == 0)
                continue;

            Node firstNode = nodesToRemove.get(0);

            DocumentBuilder builder = new DocumentBuilder((Document)firstNode.getDocument());
            // In case of table move cursor to the next paragraph.
            if(firstNode.getNodeType() == NodeType.TABLE)
                builder.moveTo(firstNode.getNextSibling());
            else
                builder.moveTo(firstNode);

            // Insert bookmark.
            builder.startBookmark(key);
            builder.endBookmark(key);

            // Remove all image nodes.
            for (Node n : nodesToRemove) {
                n.remove();
            }
        }
    }

    /**
     * Checks whether the paragraph is likely to be an image caption.
     */
    private static boolean isCaptionParagraph(Paragraph paragraph) throws Exception {
        // Take only Run text into account, because if the caption is in a textbox,
        // paragraph.toString will return the same value for both the
        // paragraph inside the textbox shape and the paragraph that contains the textbox shape.

        // Captions often contain SEQ fields.
        boolean hasSeqFields = false;
        for (Field f : paragraph.getRange().getFields())
            hasSeqFields |= (f.getType() == FieldType.FIELD_SEQUENCE);
        // More conditions might be added here to better distinguish captions.
        // .........

        String paraText = "";
        for (Run r : paragraph.getRuns()) {
            paraText += r.getText();
        }

        boolean hasCaptionLikeContent = (paraText.startsWith("Fig") ||
                paraText.startsWith("Scheme") ||
                paraText.startsWith("Plate") ||
                paraText.startsWith("Figure"));

        return  hasSeqFields || hasCaptionLikeContent;
    }

    /**
     * Checks whether the shape is an embedded Equation.DSMT4 OLE object.
     */
    private static boolean isOleEquation(Shape shape)
    {
        // Constant-first equals avoids a NullPointerException if the ProgId is null.
        return (shape.getOleFormat() != null) && "Equation.DSMT4".equals(shape.getOleFormat().getProgId());
    }

    /**
     * Saves the collected shapes as a separate PDF document.
     */
    private void saveShapeAsPdf() throws Exception {
        if (!mTopShapes.empty() && !mCaptions.empty()) {
            String caption = mCaptions.pop();
            System.out.println(mShapeCounter);
            System.out.println(caption);

            // Create a temporary document which will be exported to PDF.
            Document tmp = (Document) mTopShapes.peek().getDocument().deepClone(false);
            Node tmpSection = tmp.importNode(mTopShapes.peek().getAncestor(NodeType.SECTION), false, ImportFormatMode.USE_DESTINATION_STYLES);
            tmp.appendChild(tmpSection);
            tmp.ensureMinimum();

            // There might be several shapes to import under one caption.
            ArrayList<Node> nodesToImport = new ArrayList<Node>();
            if(mTopShapes.size() > 1 && !mRows.isEmpty())
            {
                Table imagesTable = (Table)mRows.peek().getParentTable().deepClone(false);
                while (!mRows.isEmpty()) {
                    Row r = mRows.pop();
                    imagesTable.prependChild(r.deepClone(true));
                }

                nodesToImport.add(imagesTable);
            }
            else
            {
                while (!mTopShapes.isEmpty()) {
                    ShapeBase s = mTopShapes.pop();
                    nodesToImport.add(s);
                }
            }

            String key = "image_" + mShapeCounter;
            mNodesToRemove.put(key, nodesToImport);

            for (Node imageNode :  nodesToImport) {
                Node resultImage = tmp.importNode(imageNode, true, ImportFormatMode.USE_DESTINATION_STYLES);

                if (resultImage.isComposite()) {
                    resultImage.getRange().unlinkFields();
                    ((CompositeNode) resultImage).getChildNodes(NodeType.RUN, true).clear();
                }

                if (resultImage.getNodeType() == NodeType.TABLE)
                    tmp.getFirstSection().getBody().prependChild(resultImage);
                else
                    tmp.getFirstSection().getBody().getFirstParagraph().prependChild(resultImage);
            }

            // Format the output file path.
            String outFilePath = mTargetFolder + key + ".pdf";
            tmp.save(outFilePath);

            mShapeCounter++;

            // Empty stacks.
            mTopShapes.clear();
            mRows.clear();
            mCaptions.clear();
        }
    }

    // While visiting the document, captions and shapes are pushed onto stacks.
    // Only top-level shapes are collected.
    private Stack<ShapeBase> mTopShapes = new Stack<ShapeBase>();
    private Stack<Row> mRows = new Stack<Row>();
    private Stack<String> mCaptions = new Stack<String>();
    private int mShapeCounter = 0;
    private String mTargetFolder;

    // Dictionary with nodes that should be deleted and replaced with bookmark.
    HashMap<String, ArrayList<Node>> mNodesToRemove = new HashMap<String, ArrayList<Node>>();
}
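
If a document mixes equation editors, the single ProgId comparison in isOleEquation may not catch every formula. The following is only a minimal, library-independent sketch of a null-safe ProgId check against a small set of values. The class name EquationProgIds and the method isEquationProgId are hypothetical, and the "Equation.3" entry (Microsoft Equation 3.0) is an assumption added for illustration; only "Equation.DSMT4" comes from the code above.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class EquationProgIds {
    // ProgIds treated as equations. "Equation.DSMT4" is taken from the thread;
    // "Equation.3" is an assumed extra entry, adjust the set to your documents.
    private static final Set<String> EQUATION_PROG_IDS =
            new HashSet<>(Arrays.asList("Equation.DSMT4", "Equation.3"));

    /** Returns true when progId names a known equation editor; a null progId yields false. */
    public static boolean isEquationProgId(String progId) {
        return progId != null && EQUATION_PROG_IDS.contains(progId.trim());
    }

    public static void main(String[] args) {
        System.out.println(isEquationProgId("Equation.DSMT4")); // true
        System.out.println(isEquationProgId("Excel.Sheet.12")); // false
        System.out.println(isEquationProgId(null));             // false
    }
}
```

In the visitor, isOleEquation could pass shape.getOleFormat().getProgId() to such a helper after the existing null check on getOleFormat(), so the list of skipped editors lives in one place instead of in the if conditions.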