I've extracted images from the document, but we ran into a new scenario: the figure caption is formatted as a list item. How can we extract the images in this case?
In this document:
Figs 1 and 2 are table images.
Fig 3 is a single image.
Fig 4 contains a label (the image needs to be extracted together with its label).
@Mahesh39 You should update the list labels before extracting the images by calling Document.updateListLabels, like this:
Document doc = new Document("C:\\Temp\\in.docx");
doc.updateListLabels();
ImageExtractor extractor = new ImageExtractor("C:\\Temp\\");
doc.accept(extractor);
And modify the isCaptionParagraph method like this:
/**
 * Checks whether a paragraph is likely to be an image caption.
 */
private static boolean isCaptionParagraph(Paragraph paragraph) throws Exception {
    // Captions often contain SEQ fields.
    boolean hasSeqFields = false;
    for (Field f : paragraph.getRange().getFields())
        hasSeqFields |= (f.getType() == FieldType.FIELD_SEQUENCE);

    // More conditions might be added here to better distinguish captions.
    // .........

    // Take only the Run text into account: if the caption is in a textbox,
    // paragraph.toString returns the same value for both the paragraph inside
    // the textbox shape and the paragraph that contains the textbox shape.
    // For list items, start with the list label so that a label such as "Fig 4"
    // generated by list numbering is part of the text being checked.
    String paraText = paragraph.isListItem() ? paragraph.getListLabel().getLabelString() : "";
    for (Run r : paragraph.getRuns()) {
        paraText += r.getText();
    }

    boolean hasCaptionLikeContent = (paraText.startsWith("Fig") ||
            paraText.startsWith("Scheme") ||
            paraText.startsWith("Plate") ||
            paraText.startsWith("Figure") ||
            paraText.startsWith("Flowchart"));

    return (hasSeqFields || hasCaptionLikeContent);
}
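To see why the list label matters, here is a minimal sketch of just the prefix check, written in plain Java with no Aspose.Words types so it can be run on its own. The class name CaptionHeuristic, the method looksLikeCaption, and the sample strings are hypothetical and only for illustration. It assumes the list label and the run text have already been read from the paragraph as shown above.

```java
import java.util.List;

// Hypothetical stand-alone helper; not part of Aspose.Words.
final class CaptionHeuristic {
    // Same prefixes as in isCaptionParagraph; "Figure" is already matched by "Fig".
    private static final List<String> PREFIXES =
            List.of("Fig", "Scheme", "Plate", "Flowchart");

    // listLabel: the label text of a list item, or "" for a non-list paragraph.
    // runText: the concatenated text of the paragraph's runs.
    static boolean looksLikeCaption(String listLabel, String runText) {
        String text = (listLabel == null ? "" : listLabel)
                + (runText == null ? "" : runText);
        for (String prefix : PREFIXES) {
            if (text.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Caption whose "Fig. 4" part comes only from the list numbering.
        System.out.println(looksLikeCaption("Fig. 4", " Sample label")); // true
        // The same paragraph when the list label is not taken into account.
        System.out.println(looksLikeCaption("", " Sample label"));       // false
    }
}
```

The second call shows the case from the question: when the "Fig" text exists only as a list label, the run text alone does not start with a caption prefix, so the paragraph is not recognized until the label is prepended.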