Extraction Issue 10

e503824 · June 22, 2022, 8:59am

Dear team,

We are extracting images from DOCX files using Aspose.Words for Java, but in the case below we are not able to extract them. Please refer to the source code and input file below and advise.

Source Code :

if ((paragraph.toString(SaveFormat.TEXT).toLowerCase().trim().startsWith("fig")
		|| paragraph.toString(SaveFormat.TEXT).startsWith("Scheme")
		|| paragraph.toString(SaveFormat.TEXT).startsWith("Plate")
		|| paragraph.toString(SaveFormat.TEXT).startsWith("Abb")
		|| paragraph.toString(SaveFormat.TEXT).startsWith("Abbildung"))
		// for duplicate figure caption it-15
		&& (paragraph.getNextSibling() != null
				&& !paragraph.getNextSibling().toString(SaveFormat.TEXT).trim().matches(matches)
				|| (paragraph.getNextSibling() != null
						&& paragraph.getNextSibling().getNodeType() != NodeType.TABLE
						&& paragraph.getNextSibling().toString(SaveFormat.TEXT).trim().matches(matches)
						&& (((Paragraph)paragraph.getNextSibling()).getChildNodes(NodeType.SHAPE, true)
								.getCount() > 0
								|| (paragraph.getNextSibling().getNextSibling()) != null
										&& paragraph.getNextSibling().getNextSibling()
												.getNodeType() != NodeType.TABLE
										&& ((((Paragraph)paragraph.getNextSibling().getNextSibling())
												.getChildNodes(NodeType.SHAPE, true).getCount() == 0)

												// this condition was added by pavi on 14-12-2021 for duplicate captions
												|| (((Paragraph)paragraph.getNextSibling().getNextSibling())
														.getChildNodes(NodeType.SHAPE, true).getCount() > 0))))
				|| paragraph.getParentSection().getBody().getLastParagraph().getText().trim()
						.matches(matches))
		// for duplicate figure caption
		&& ((paragraph.getPreviousSibling() != null
				&& paragraph.getPreviousSibling().getNodeType() != NodeType.TABLE)
				|| paragraph.getParentSection().getBody().getFirstParagraph().getText().trim()
						.matches(matches))
		&& paragraph.getNodeType() != NodeType.TABLE
		&& paragraph.getParentNode().getNodeType() != NodeType.CELL
		&& !paragraph.toString(SaveFormat.TEXT).contains(AIE.docName)

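		// condition added by pavi on 14-12-2021: skip the "Figure Captions" / "Figures" headings.
		// Both negations must hold, so they are joined with && (with || the check is always true).
		&& (!(paragraph.toString(SaveFormat.TEXT).trim().startsWith("Figure Captions")) &&
				!(paragraph.toString(SaveFormat.TEXT).trim().startsWith("Figures"))))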

//|| ((paragraph.getNextSibling() == null) && (builder.getCurrentParagraph().isEndOfDocument()))


{

Input File : Manuscipt.docx (730.4 KB)

alexey.noskov · June 22, 2022, 5:50pm

@e503824 I have tested image extraction using the code I suggested here, and the images are extracted fine.
The only modification I made is to limit the maximum length of a caption paragraph in the isCaptionParagraph method:

/**
 * Checks whether a paragraph is likely to be an image caption.
 */
private static boolean isCaptionParagraph(Paragraph paragraph) throws Exception {
    // Captions often contain SEQ fields.
    boolean hasSeqFields = false;
    for (Field f : paragraph.getRange().getFields())
        hasSeqFields |= (f.getType() == FieldType.FIELD_SEQUENCE);
    // More conditions might be added here to better distinguish captions.
    // .........

    // Take only the text of the paragraph's own runs into account: if the caption
    // is in a textbox, paragraph.toString returns the same value for both the
    // paragraph inside the textbox shape and the paragraph that contains the shape.
    String paraText = "";
    for (Run r : paragraph.getRuns()) {
        paraText += r.getText();
    }

    boolean hasCaptionLikeContent = (paraText.startsWith("Fig") ||
            paraText.startsWith("Scheme") ||
            paraText.startsWith("Plate") ||
            paraText.startsWith("Figure") ||
            paraText.startsWith("Flowchart"));

    // Long paragraphs are unlikely to be captions, so the length is limited to 100 characters.
    return (hasSeqFields || hasCaptionLikeContent) && paraText.length() < 100;
}
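
If it helps, the text part of this check can be pulled out into a small helper that works on plain strings, so it can be tested without a document. This is only a rough sketch under assumptions: the class and method names (CaptionHeuristic, hasCaptionPrefix, isCaptionListHeading, looksLikeCaption) are hypothetical and not part of Aspose.Words, the prefix list is simply merged from the two snippets in this thread, and the exclusion of the "Figure Captions" / "Figures" headings is my reading of what the original condition was meant to do.

```java
import java.util.Locale;

/** Hypothetical helper (not an Aspose.Words class): caption heuristics on plain text. */
class CaptionHeuristic {
    // Prefixes merged from both snippets in this thread, lower-cased.
    // "fig" also covers "Figure"; "abb" also covers "Abbildung".
    private static final String[] PREFIXES = {"fig", "scheme", "plate", "flowchart", "abb"};

    // Same limit as in isCaptionParagraph above.
    private static final int MAX_CAPTION_LENGTH = 100;

    private static String normalize(String text) {
        return text.trim().toLowerCase(Locale.ROOT);
    }

    /** True if the text starts with one of the caption-like prefixes (case-insensitive). */
    static boolean hasCaptionPrefix(String text) {
        String normalized = normalize(text);
        for (String prefix : PREFIXES) {
            if (normalized.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }

    /** True for section headings such as "Figure Captions" or "Figures" (assumed intent). */
    static boolean isCaptionListHeading(String text) {
        String normalized = normalize(text);
        return normalized.startsWith("figure captions") || normalized.startsWith("figures");
    }

    /** Combines the SEQ-field flag, the prefix check, the heading exclusion and the length limit. */
    static boolean looksLikeCaption(String runText, boolean hasSeqFields) {
        return (hasSeqFields || hasCaptionPrefix(runText))
                && !isCaptionListHeading(runText)
                && runText.length() < MAX_CAPTION_LENGTH;
    }
}
```

In isCaptionParagraph, paraText and hasSeqFields would be passed to looksLikeCaption instead of the inline startsWith chain. Note that the case-insensitive "abb" prefix will also match words such as "Abbreviations", so the list may need tuning for your documents.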