Extract images from the document based on Fig caption using Java

Dear Team

I am facing an issue while extracting images from the document mentioned below.
1) The document contains both portrait and landscape images.
2) The Fig captions contain floating-point numbers.
I need a separate method to handle these scenarios. Kindly suggest a solution for my requirement.
Input file: CTA Chen-Chap 12–2020-0331 v.zip (4.5 MB)
Output file: correct output.zip (1.1 MB)

@jan.kathir

Please use the following code example to get the desired output. We have attached the output PDF files with this post for your kind reference.

Docs.zip (1.4 MB)

// Requires: import com.aspose.words.*; import java.util.ArrayList; import java.util.Collections;
Document doc = new Document(MyDir + "CTA Chen-Chap 12--2020-0331 v.docx");
int i = 1;
ArrayList<Paragraph> nodes = new ArrayList<Paragraph>();

for (Paragraph paragraph : (Iterable<Paragraph>) doc.getChildNodes(NodeType.PARAGRAPH, true))
{
    // A caption paragraph starts with "Fig"; its figure sits in the paragraphs just before it.
    if (paragraph.toString(SaveFormat.TEXT).trim().startsWith("Fig"))
    {
        Node previousPara = paragraph.getPreviousSibling();
        nodes.add(paragraph);
        // Walk backwards while the previous sibling is a paragraph that holds a shape or is empty.
        while (previousPara != null
                && previousPara.getNodeType() == NodeType.PARAGRAPH
                && (((Paragraph) previousPara).getChildNodes(NodeType.SHAPE, true).getCount() > 0
                    || ((Paragraph) previousPara).toString(SaveFormat.TEXT).trim().length() == 0))
        {
            nodes.add((Paragraph) previousPara);
            previousPara = previousPara.getPreviousSibling();
        }

        if (nodes.size() > 0)
        {
            // Nodes were collected bottom-up, so reverse them into document order.
            Collections.reverse(nodes);

            // Export the collected paragraphs (shapes plus caption) into a new document.
            Document dstDoc = new Document();
            NodeImporter importer = new NodeImporter(doc, dstDoc, ImportFormatMode.KEEP_SOURCE_FORMATTING);
            for (Paragraph para : nodes)
            {
                Node newNode = importer.importNode(para, true);
                // Match the page orientation (portrait or landscape) of the source section.
                dstDoc.getFirstSection().getPageSetup().setOrientation(para.getParentSection().getPageSetup().getOrientation());
                dstDoc.getFirstSection().getBody().appendChild(newNode);
            }

            // Strip page breaks so the figure does not spill onto extra pages.
            dstDoc.getRange().replace(ControlChar.PAGE_BREAK, "", new FindReplaceOptions());
            // Remove the first empty paragraph (a new Document starts with one).
            if (dstDoc.getFirstSection().getBody().getFirstParagraph().toString(SaveFormat.TEXT).trim().length() == 0)
                dstDoc.getFirstSection().getBody().getFirstParagraph().remove();

            // As written, this removes the last paragraph when it has text,
            // which is the caption paragraph appended last.
            if (dstDoc.getLastSection().getBody().getLastParagraph().toString(SaveFormat.TEXT).trim().length() > 0)
                dstDoc.getLastSection().getBody().getLastParagraph().remove();

            dstDoc.save(MyDir + "output" + i + ".pdf");
            i++;
            nodes.clear();
        }
    }
}
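
The startsWith("Fig") check above accepts any paragraph that begins with "Fig", including body text such as "Figures in this chapter". Since the captions in question carry floating-point numbers, a stricter test may help. The following is only a sketch: FigureCaptions is a hypothetical helper (not part of Aspose.Words), and the caption shape it assumes ("Fig", "Fig." or "Figure" followed by a number such as 12 or 12.3) is a guess that should be checked against the real document.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Aspose.Words.
final class FigureCaptions {
    // Assumed caption shape: "Fig", "Fig." or "Figure", optional spaces,
    // then a whole or decimal number such as 12 or 12.3.
    private static final Pattern CAPTION =
            Pattern.compile("^Fig(?:ure|\\.)?\\s*(\\d+(?:\\.\\d+)?)", Pattern.CASE_INSENSITIVE);

    /** Returns the figure number (for example "12.3"), or null if the text is not a caption. */
    static String figureNumber(String paragraphText) {
        if (paragraphText == null) {
            return null;
        }
        Matcher m = CAPTION.matcher(paragraphText.trim());
        return m.find() ? m.group(1) : null;
    }

    /** True when the paragraph text looks like a numbered figure caption. */
    static boolean isCaption(String paragraphText) {
        return figureNumber(paragraphText) != null;
    }
}
```

In the loop above, the condition could then read FigureCaptions.isCaption(paragraph.toString(SaveFormat.TEXT)), and the returned figure number could be used in the output file name instead of the running counter.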

@tahir.manzoor
I can extract images, but all of the extracted images are wrong: the output PDF files are empty. I have attached the wrong output below; please go through it. Kindly provide a solution, and refer to the expected output mentioned above.

Wrong output: out.zip (2.6 MB)

Input file: CTA Chen-Chap 12–2020-0331 v.zip (5.1 MB)

Kindly test the input file and provide a solution.

@jan.kathir

Please note that the code example shared in this forum thread will not work for all of your cases. First, list all of your use cases, and then write the code accordingly. Use the same approach shared with you earlier, e.g. bookmark the content and extract it. You only need to change the condition in the while loop.
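
To illustrate what "changing only the condition" can look like, here is a rough sketch of the same backward walk with the condition passed in as a parameter. It uses a simplified stand-in model, not Aspose.Words types: Para and FigureGrouper are hypothetical names, and in real code the two predicates would wrap the Aspose checks (shape count, paragraph text) for each use case.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.function.Predicate;

// Simplified stand-in for a document paragraph; not an Aspose.Words type.
final class Para {
    final String text;
    final boolean hasShape;

    Para(String text, boolean hasShape) {
        this.text = text;
        this.hasShape = hasShape;
    }
}

// Hypothetical helper that groups each caption with the paragraphs before it.
final class FigureGrouper {
    /**
     * For every caption, collects the caption plus the run of directly preceding
     * paragraphs that satisfy belongsToFigure. Each group is in document order.
     */
    static List<List<Para>> group(List<Para> body,
                                  Predicate<Para> isCaption,
                                  Predicate<Para> belongsToFigure) {
        List<List<Para>> groups = new ArrayList<>();
        for (int i = 0; i < body.size(); i++) {
            if (!isCaption.test(body.get(i))) {
                continue;
            }
            List<Para> group = new ArrayList<>();
            group.add(body.get(i));
            // Walk backwards until a paragraph fails the condition.
            for (int j = i - 1; j >= 0 && belongsToFigure.test(body.get(j)); j--) {
                group.add(body.get(j));
            }
            // Collected bottom-up, so reverse into document order.
            Collections.reverse(group);
            groups.add(group);
        }
        return groups;
    }
}
```

With this shape, a new use case only needs a different belongsToFigure predicate, for example p -> p.hasShape || p.text.trim().isEmpty() for the case handled earlier in this thread.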

For this new case, please use the following code snippet to get the desired output.

doc.updatePageLayout();
if(dstDoc.getChildNodes(NodeType.SHAPE, true).getCount() > 0)
    dstDoc.save(MyDir + "output"+i+".pdf");