List figures

Mahi39 · July 26, 2022, 4:56am

Hi Team,

I’ve extracted images from the document, but we have run into a new scenario: the figure captions are formatted as list items. How can we extract the images in this case?

In this document,
Figs 1 and 2 are table images.
Fig 3 is a single image.
Fig 4 contains a label (we need to extract the image together with its label).

Input: Input.docx (160.5 KB)

Thanks in advance.

Regards,
Mahi

alexey.noskov · July 26, 2022, 8:00am

@Mahesh39 You should update the list labels before extracting the images by calling Document.updateListLabels, like this:

Document doc = new Document("C:\\Temp\\in.docx");
doc.updateListLabels();
ImageExtractor extractor = new ImageExtractor("C:\\Temp\\");
doc.accept(extractor);

And modify the isCaptionParagraph method like this:

/**
 * Checks whether paragraph is likely to be an image caption.
 */
private static boolean isCaptionParagraph(Paragraph paragraph) throws Exception {
    // Take only the text of the runs into account: if the caption is in a textbox,
    // paragraph.toString returns the same value for both the paragraph inside the
    // textbox shape and the paragraph that contains the textbox shape.

    // Captions often contain SEQ fields.
    boolean hasSeqFields = false;
    for (Field f : paragraph.getRange().getFields())
        hasSeqFields |= (f.getType() == FieldType.FIELD_SEQUENCE);
    // More conditions might be added here to better distinguish captions.
    // .........

    String paraText = paragraph.isListItem() ? paragraph.getListLabel().getLabelString() : "";
    for (Run r : paragraph.getRuns()) {
        paraText += r.getText();
    }

    boolean hasCaptionLikeContent = (paraText.startsWith("Fig") ||
            paraText.startsWith("Scheme") ||
            paraText.startsWith("Plate") ||
            paraText.startsWith("Figure") ||
            paraText.startsWith("Flowchart"));

    return hasSeqFields || hasCaptionLikeContent;
}
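
To see why the list label matters for the prefix check, here is a minimal Aspose-free sketch. The class and method names (CaptionTextSketch, buildText, hasCaptionLikeContent) and the sample strings are hypothetical and only mirror the string handling in isCaptionParagraph above. It assumes that when "Fig 1." is generated as a list label, the runs hold only the rest of the caption text.

```java
import java.util.Arrays;
import java.util.List;

/** Illustration only: plain strings stand in for the list label and the run texts. */
final class CaptionTextSketch {
    // Same prefixes as in isCaptionParagraph; "Figure" is already covered by "Fig".
    private static final List<String> PREFIXES =
            Arrays.asList("Fig", "Scheme", "Plate", "Flowchart");

    /** Joins the list label (empty for non-list paragraphs) with the text of the runs. */
    static String buildText(String listLabel, List<String> runTexts) {
        StringBuilder sb = new StringBuilder(listLabel);
        for (String text : runTexts) {
            sb.append(text);
        }
        return sb.toString();
    }

    /** Returns true if the text starts with one of the caption prefixes. */
    static boolean hasCaptionLikeContent(String paraText) {
        for (String prefix : PREFIXES) {
            if (paraText.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Hypothetical caption whose "Fig 1." part is a list label, not run text.
        List<String> runs = Arrays.asList(" Sample", " chart");

        // Without the label the run text alone does not look like a caption.
        System.out.println(hasCaptionLikeContent(buildText("", runs)));       // prints false
        // With the label prepended the prefix check succeeds.
        System.out.println(hasCaptionLikeContent(buildText("Fig 1.", runs))); // prints true
    }
}
```

In the real method the label comes from paragraph.getListLabel().getLabelString(), which is why doc.updateListLabels() has to be called first.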