Image extraction issue 2

e503824 · April 27, 2022, 12:15pm

Dear team,

We are extracting images from a DOCX file using Aspose.Words for Java. In this document some images have captions and a few do not. How can we also extract the images that have no caption? Please find the source code and input file below.

if ((paragraph.toString(SaveFormat.TEXT).trim().startsWith("Fig")
	|| paragraph.toString(SaveFormat.TEXT).trim().startsWith("Scheme")
	|| paragraph.toString(SaveFormat.TEXT).trim().startsWith("Plate")
	|| paragraph.toString(SaveFormat.TEXT).trim().startsWith("Abb")
	|| paragraph.toString(SaveFormat.TEXT).trim().startsWith("Abbildung"))
	// for duplicate figure caption it-15
	&& (paragraph.getNextSibling() != null
			&& !paragraph.getNextSibling().toString(SaveFormat.TEXT).trim().matches(matches)
			|| (paragraph.getNextSibling() != null
					&& paragraph.getNextSibling().getNodeType() != NodeType.TABLE
					&& paragraph.getNextSibling().toString(SaveFormat.TEXT).trim().matches(matches)
				&& (((Paragraph) paragraph.getNextSibling()).getChildNodes(NodeType.SHAPE, true)
						.getCount() > 0
							|| (paragraph.getNextSibling().getNextSibling()) != null
									&& paragraph.getNextSibling().getNextSibling()
											.getNodeType() != NodeType.TABLE
									&& ((((Paragraph) paragraph.getNextSibling().getNextSibling())
											.getChildNodes(NodeType.SHAPE, true).getCount() == 0)
											
											//this condition added by pavi-14-12-2021 for duplicate captions
											||(((Paragraph) paragraph.getNextSibling().getNextSibling())
													.getChildNodes(NodeType.SHAPE, true).getCount() > 0))))
			|| paragraph.getParentSection().getBody().getLastParagraph().getText().trim()
	.matches(matches))
	// for duplicate figure caption
	&& ((paragraph.getPreviousSibling() != null
			&& paragraph.getPreviousSibling().getNodeType() != NodeType.TABLE)
			|| paragraph.getParentSection().getBody().getFirstParagraph().getText().trim()
				.matches(matches))
	&& paragraph.getNodeType() != NodeType.TABLE
	&& paragraph.getParentNode().getNodeType() != NodeType.CELL
	&& !paragraph.toString(SaveFormat.TEXT).contains(AIE.docName)

	//condition added by pavi -14-12-2021
	// skip the "Figure Captions" and "Figures" headings: both checks must hold
	&& (!(paragraph.toString(SaveFormat.TEXT).trim().startsWith("Figure Captions"))
			&& !(paragraph.toString(SaveFormat.TEXT).trim().startsWith("Figures")))
	
       || ((paragraph.getNextSibling() == null) && (builder.getCurrentParagraph().isEndOfDocument())))
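
The prefix test that opens this condition could be isolated in a small helper, which would make the rest of the condition easier to read. The following is only a sketch in plain Java with no Aspose dependency: `CaptionPrefix` and `isCaption` are hypothetical names, the prefix list is copied from the condition, and the text passed in is assumed to come from `paragraph.toString(SaveFormat.TEXT)`.

```java
final class CaptionPrefix {
    // Prefixes copied from the condition; "Abb" also covers "Abbildung".
    private static final String[] PREFIXES = {"Fig", "Scheme", "Plate", "Abb"};
    // Headings that start with "Fig" but are not captions.
    private static final String[] HEADINGS = {"Figure Captions", "Figures"};

    private CaptionPrefix() {}

    /** Returns true if the (untrimmed) paragraph text looks like a figure caption. */
    static boolean isCaption(String paragraphText) {
        if (paragraphText == null) {
            return false;
        }
        String text = paragraphText.trim();
        // Reject the section headings first, then look for a caption prefix.
        for (String heading : HEADINGS) {
            if (text.startsWith(heading)) {
                return false;
            }
        }
        for (String prefix : PREFIXES) {
            if (text.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }
}
```

Note that, like the original check, this treats any paragraph starting with "Figures" as a heading, so a caption such as "Figures 1 and 2 ..." would be rejected.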

Input: Revised Manuscript.docx (4.1 MB)

alexey.noskov · April 27, 2022, 6:05pm

@e503824 To extract all images from the document you can use the following simple code:

Document doc = new Document("C:\\Temp\\in.docx");
Iterable<Shape> shapes = doc.getChildNodes(NodeType.SHAPE, true);
for(Shape s : shapes)
{
    // Import shape into another document and save.
}
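
To handle images with and without captions in one pass, one option is to treat the caption as optional: take every paragraph that contains an image, and attach the following paragraph as its caption only if it looks like one. The sketch below models that idea in plain Java without the Aspose API, so `CaptionPairing`, `Para`, `Figure` and `pair` are hypothetical names. With Aspose.Words, `hasImage` could presumably be filled from `paragraph.getChildNodes(NodeType.SHAPE, true).getCount() > 0` and `text` from `paragraph.toString(SaveFormat.TEXT)`, both of which appear in the question's code. It assumes the caption follows the image; if captions precede the images in your document, the previous paragraph would need to be checked instead.

```java
import java.util.ArrayList;
import java.util.List;

final class CaptionPairing {
    /** Simplified stand-in for a document paragraph (hypothetical model). */
    static final class Para {
        final String text;
        final boolean hasImage;

        Para(String text, boolean hasImage) {
            this.text = text;
            this.hasImage = hasImage;
        }
    }

    /** Index of an image paragraph plus its caption, or null when it has none. */
    static final class Figure {
        final int imageIndex;
        final String caption;

        Figure(int imageIndex, String caption) {
            this.imageIndex = imageIndex;
            this.caption = caption;
        }
    }

    private CaptionPairing() {}

    /** Same prefixes as in the question's condition. */
    static boolean looksLikeCaption(String text) {
        String t = text.trim();
        return t.startsWith("Fig") || t.startsWith("Scheme")
                || t.startsWith("Plate") || t.startsWith("Abb");
    }

    /** Emits one Figure per image paragraph; the caption is optional. */
    static List<Figure> pair(List<Para> paragraphs) {
        List<Figure> figures = new ArrayList<>();
        for (int i = 0; i < paragraphs.size(); i++) {
            if (!paragraphs.get(i).hasImage) {
                continue;
            }
            String caption = null;
            if (i + 1 < paragraphs.size()) {
                Para next = paragraphs.get(i + 1);
                // Only a text-only paragraph with a caption prefix counts.
                if (!next.hasImage && looksLikeCaption(next.text)) {
                    caption = next.text.trim();
                }
            }
            figures.add(new Figure(i, caption));
        }
        return figures;
    }
}
```

Because an image is emitted even when `caption` is null, images without captions are no longer skipped; the caller can decide how to name or label them.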