Image extraction Issue1

e503824 · April 27, 2022, 8:31am

Dear team,

We are extracting images from a DOCX file. In this case we are not able to extract each image into a single PDF. Please find below the source code we are using and the input DOCX.

if ((paragraph.toString(SaveFormat.TEXT).trim().startsWith("Fig")
	|| paragraph.toString(SaveFormat.TEXT).startsWith("Scheme")
	|| paragraph.toString(SaveFormat.TEXT).startsWith("Plate")
	|| paragraph.toString(SaveFormat.TEXT).startsWith("Abb")
	|| paragraph.toString(SaveFormat.TEXT).startsWith("Abbildung"))
	// for duplicate figure caption it-15
	&& (paragraph.getNextSibling() != null
		&& !paragraph.getNextSibling().toString(SaveFormat.TEXT).trim().matches(matches)
			|| (paragraph.getNextSibling() != null
					&& paragraph.getNextSibling().getNodeType() != NodeType.TABLE
					&& paragraph.getNextSibling().toString(SaveFormat.TEXT).trim().matches(matches)
					&& (((Paragraph) paragraph.getNextSibling()).getChildNodes(NodeType.SHAPE, true)
							.getCount() > 0
							|| (paragraph.getNextSibling().getNextSibling()) != null
									&& paragraph.getNextSibling().getNextSibling()
											.getNodeType() != NodeType.TABLE
									&& ((((Paragraph) paragraph.getNextSibling().getNextSibling())
											.getChildNodes(NodeType.SHAPE, true).getCount() == 0)
											
											// this condition added by pavi-14-12-2021 for duplicate captions
											||(((Paragraph) paragraph.getNextSibling().getNextSibling())
													.getChildNodes(NodeType.SHAPE, true).getCount() > 0))))
			|| paragraph.getParentSection().getBody().getLastParagraph().getText().trim()
					.matches(matches))
	// for duplicate figure caption
	&& ((paragraph.getPreviousSibling() != null
			&& paragraph.getPreviousSibling().getNodeType() != NodeType.TABLE)
			|| paragraph.getParentSection().getBody().getFirstParagraph().getText().trim()
					.matches(matches))
	&& paragraph.getNodeType() != NodeType.TABLE
	&& paragraph.getParentNode().getNodeType() != NodeType.CELL
	&& !paragraph.toString(SaveFormat.TEXT).contains(AIE.docName)
	
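	//condition added by pavi -14-12-2021
	// Both prefixes must be absent, so the checks are joined with && (joined with || the test is always true).
	&& (!(paragraph.toString(SaveFormat.TEXT).trim().startsWith("Figure Captions"))
			&& !(paragraph.toString(SaveFormat.TEXT).trim().startsWith("Figures")))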
	
       || ((paragraph.getNextSibling() == null) && (builder.getCurrentParagraph().isEndOfDocument())))

input : JECHEM-D-22-00426R2.docx (836.8 KB)

output : JECHEM-D-22-00426R2_Fig0001.pdf (187.0 KB)
JECHEM-D-22-00426R2_Fig0003.pdf (74.4 KB)
JECHEM-D-22-00426R2_Fig0005.pdf (65.6 KB)
JECHEM-D-22-00426R2_Fig0007.pdf (85.6 KB)
JECHEM-D-22-00426R2_Fig0008.pdf (235.4 KB)
JECHEM-D-22-00426R2_Scheme0001.pdf (110.1 KB)

Please do the needful.

alexey.noskov · April 27, 2022, 6:05pm

@e503824 The problem occurs because the shapes in your document are floating and can be anchored in any paragraph on the page. For example, see the structure of nodes where Fig 3 and Fig 4 are placed:

As you can see, both images are placed in the paragraph before the Fig 3 caption paragraph. In this case it is not possible to detect which caption each image belongs to. That is why both Fig 3 and Fig 4 are extracted into the same PDF document in your output.
The same issue occurs with the 5th and 6th figures: both images are in the paragraph that contains the Fig 5 caption.
If you simply need to extract all images from the document, you can probably use code like this:

Document doc = new Document("C:\\Temp\\in.docx");
Iterable<Shape> shapes = doc.getChildNodes(NodeType.SHAPE, true);
for(Shape s : shapes)
{
    // Import shape into another document and save.
}
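
To make the ambiguity described above concrete, here is a minimal toy model in plain Java. It does not use the Aspose.Words API: `ToyParagraph`, `assignShapesToCaptions` and `sampleLayout` are hypothetical names, and the rule "a shape belongs to the first caption at or after its anchor paragraph" is only an assumed heuristic, not the logic of the code posted in the question.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model only: ToyParagraph is a hypothetical stand-in, not an Aspose.Words class.
class CaptionAmbiguityDemo {

    static final class ToyParagraph {
        final String text;         // paragraph text, empty if the paragraph has none
        final List<String> shapes; // names of floating shapes anchored in this paragraph

        ToyParagraph(String text, String... shapes) {
            this.text = text;
            this.shapes = Arrays.asList(shapes);
        }

        boolean isCaption() {
            return text.trim().startsWith("Fig");
        }
    }

    // Assumed heuristic: a shape belongs to the first caption found at or after
    // the paragraph it is anchored in. Caption texts are assumed to be unique.
    // Shapes anchored after the last caption are not assigned to anything.
    static Map<String, List<String>> assignShapesToCaptions(List<ToyParagraph> paragraphs) {
        Map<String, List<String>> result = new LinkedHashMap<>();
        List<String> pending = new ArrayList<>();
        for (ToyParagraph p : paragraphs) {
            pending.addAll(p.shapes);
            if (p.isCaption()) {
                result.put(p.text.trim(), new ArrayList<>(pending));
                pending.clear();
            }
        }
        return result;
    }

    // Layout modelled on the description in this thread.
    static List<ToyParagraph> sampleLayout() {
        return Arrays.asList(
            new ToyParagraph("", "image3", "image4"),      // both anchored before the Fig 3 caption
            new ToyParagraph("Fig 3"),
            new ToyParagraph("Fig 4"),
            new ToyParagraph("Fig 5", "image5", "image6"), // both anchored in the Fig 5 caption paragraph
            new ToyParagraph("Fig 6"));
    }

    public static void main(String[] args) {
        assignShapesToCaptions(sampleLayout())
            .forEach((caption, shapes) -> System.out.println(caption + " -> " + shapes));
    }
}
```

In this model Fig 3 and Fig 5 each receive two images while Fig 4 and Fig 6 receive none, which is consistent with the missing Fig0004 and Fig0006 files in the output list above. Any rule based only on the anchor paragraph hits the same limit, so the reliable fix is likely on the document side: anchoring each image in or next to its own caption paragraph.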