New extraction issue

e503824 · May 25, 2022, 1:13pm

Dear team,

We are extracting images from a DOCX file, but in the case below one image is not extracted. Please find the source code below.

Source code :

if ((paragraph.toString(SaveFormat.TEXT).toLowerCase().trim().startsWith("fig")
					|| paragraph.toString(SaveFormat.TEXT).startsWith("Scheme")
					|| paragraph.toString(SaveFormat.TEXT).startsWith("Plate")
					|| paragraph.toString(SaveFormat.TEXT).startsWith("Abb")
					|| paragraph.toString(SaveFormat.TEXT).startsWith("Abbildung"))
					// for duplicate figure caption it-15
					&& (paragraph.getNextSibling() != null
							&& !paragraph.getNextSibling().toString(SaveFormat.TEXT).trim().matches(matches)
							|| (paragraph.getNextSibling() != null
									&& paragraph.getNextSibling().getNodeType() != NodeType.TABLE
									&& paragraph.getNextSibling().toString(SaveFormat.TEXT).trim().matches(matches)
									&& (((Paragraph) paragraph.getNextSibling()).getChildNodes(NodeType.SHAPE, true)
											.getCount() > 0
											|| (paragraph.getNextSibling().getNextSibling()) != null
													&& paragraph.getNextSibling().getNextSibling()
															.getNodeType() != NodeType.TABLE
													&& ((((Paragraph) paragraph.getNextSibling().getNextSibling())
															.getChildNodes(NodeType.SHAPE, true).getCount() == 0)
															
															//this condition added by pavi-14-12-2021   for duplicate captions
															||(((Paragraph) paragraph.getNextSibling().getNextSibling())
																	.getChildNodes(NodeType.SHAPE, true).getCount() > 0))))
							|| paragraph.getParentSection().getBody().getLastParagraph().getText().trim()
									.matches(matches))
					// for duplicate figure caption
					&& ((paragraph.getPreviousSibling() != null
							&& paragraph.getPreviousSibling().getNodeType() != NodeType.TABLE)
							|| paragraph.getParentSection().getBody().getFirstParagraph().getText().trim()
									.matches(matches))
					&& paragraph.getNodeType() != NodeType.TABLE
					&& paragraph.getParentNode().getNodeType() != NodeType.CELL
					&& !paragraph.toString(SaveFormat.TEXT).contains(AIE.docName)
					
					//condition added by pavi -14-12-2021
					&& (!(paragraph.toString(SaveFormat.TEXT).trim().startsWith("Figure Captions"))
							&& !(paragraph.toString(SaveFormat.TEXT).trim().startsWith("Figures"))))
					
			        //|| ((paragraph.getNextSibling() == null) && (builder.getCurrentParagraph().isEndOfDocument()))
			        
					
			{

input file : og.zip (6.8 MB)

Please help us resolve this.

alexey.noskov · May 25, 2022, 3:52pm

@e503824 The code I have provided here seems to properly extract images from your document.

e503824 · May 26, 2022, 4:49am

Dear team,

I have tried the same method, but the image was still not extracted. Can you please help me with this?

alexey.noskov · May 26, 2022, 4:06pm

@e503824 I have checked one more time, and as far as I can see the images are extracted properly. The only issue is that for figure 8 the formula is also extracted. This occurs because the equation is in the same table. It can be fixed by modifying visitRowStart as follows:

@Override
public int visitRowStart(Row row) throws Exception {

    if(rowHasImage(row))
        mRows.push(row);

    return super.visitRowStart(row);
}

/**
 * Checks whether the row contains any shape other than a formula.
 */
private static boolean rowHasImage(Row row) {
    NodeCollection shapes = row.getChildNodes(NodeType.SHAPE, true);
    if (shapes.getCount() == 0)
        return false;

    boolean hasImage = false;
    for (Shape s : (Iterable<Shape>) shapes) {
        // Any shape that is not an OLE equation counts as an image.
        hasImage |= !isOleEquation(s);
    }

    return hasImage;
}
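
The isOleEquation helper used above is not shown in this post. As a rough sketch only, assuming the formulas are embedded OLE objects whose ProgId starts with "Equation" (for example "Equation.3" or "Equation.DSMT4"), the check could come down to a string test like the one below. The class name OleEquationCheck and the method isEquationProgId are hypothetical names for illustration, and the original helper may work differently.

```java
import java.util.Locale;

// Hypothetical helper class, not part of Aspose.Words.
class OleEquationCheck {

    /**
     * Returns true when the ProgId looks like an equation editor object.
     * Assumption: equation ProgIds start with "Equation".
     */
    static boolean isEquationProgId(String progId) {
        if (progId == null)
            return false;
        return progId.trim().toLowerCase(Locale.ROOT).startsWith("equation");
    }

    // Assumed wiring with Aspose.Words (not compiled here):
    //   OleFormat ole = shape.getOleFormat();   // null when the shape is not an OLE object
    //   return ole != null && isEquationProgId(ole.getProgId());

    public static void main(String[] args) {
        System.out.println(isEquationProgId("Equation.DSMT4"));
        System.out.println(isEquationProgId("Excel.Sheet.12"));
    }
}
```

With a check like this, rows that contain only formulas are skipped, while rows that also hold a picture are still pushed to mRows.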

Could you please be more specific and describe exactly what does not work on your side?