Extract images from the document with text "Fig" using Java

jan.kathir · March 31, 2020, 1:26pm

Dear Team,

We need to extract images from the source document, using the figure caption beside each image.

I’ve attached a sample document for your reference.
Sample: beside.zip (279.1 KB)

tahir.manzoor · March 31, 2020, 6:45pm

Could you please ZIP and attach your expected output document? We will then provide you with more information about your query.

jan.kathir · April 1, 2020, 4:06am

@tahir.manzoor
I have attached the required output. Please find the attached file and provide a solution as soon as possible.
Output sample: Output.zip (141.7 KB)

tahir.manzoor · April 1, 2020, 3:32pm

@jan.kathir

Please use the same code example shared in your other thread to extract the images.

For this case, please use the following code snippet to extract the image.

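// This block runs inside the loop over the document's paragraphs;
// doc, i and MyDir are defined by the surrounding code.
// Match a paragraph whose text starts with "Fig" and that also contains a shape.
if (paragraph.toString(SaveFormat.TEXT).trim().startsWith("Fig")
        && paragraph.getChildNodes(NodeType.SHAPE, true).getCount() > 0)
{
    Document dstDoc = new Document();
    NodeImporter importer = new NodeImporter(doc, dstDoc, ImportFormatMode.KEEP_SOURCE_FORMATTING);
    // Clone the paragraph and remove its text runs, keeping the shape.
    Paragraph paragraph1 = (Paragraph) paragraph.deepClone(true);
    paragraph1.getRuns().clear();
    // Import the clone into the new document and save it.
    Node newNode = importer.importNode(paragraph1, true);
    dstDoc.getFirstSection().getBody().appendChild(newNode);
    dstDoc.save(MyDir + "out" + i + ".docx");
}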
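
Note that `startsWith("Fig")` also matches ordinary text such as "Figments". As a rough sketch only, a stricter test could be written in plain Java as below. The `CaptionMatcher` class and its `isCaption` method are hypothetical names, not part of Aspose.Words, and the pattern assumes captions look like "Fig 1", "Fig. 2" or "Figure 3"; adjust it if your captions differ.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Aspose.Words: decides whether a
// paragraph's plain text looks like a figure caption.
final class CaptionMatcher {
    // "Fig" or "Figure", an optional dot, optional whitespace, then a number.
    private static final Pattern CAPTION =
            Pattern.compile("^Fig(?:ure)?\\.?\\s*\\d+");

    static boolean isCaption(String paragraphText) {
        if (paragraphText == null) {
            return false;
        }
        // The leading ^ restricts the match to the start of the trimmed text.
        return CAPTION.matcher(paragraphText.trim()).find();
    }

    public static void main(String[] args) {
        System.out.println(isCaption("  Fig. 3 Sample image"));   // true
        System.out.println(isCaption("Figure 12: Layout"));       // true
        System.out.println(isCaption("Figments of imagination")); // false
    }
}
```

You could pass `paragraph.toString(SaveFormat.TEXT)` to such a helper in place of the `startsWith` check.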

jan.kathir · April 2, 2020, 9:40am

@tahir.manzoor
Thank you for your valuable response; it is working fine now.
I have some further issues with the attached document:

1) While extracting, an error message appears beside the extracted image.
2) I need to extract the figure caption beside the image, followed by another. I have
attached a sample document and a sample output based on my requirement.
Kindly look into it and provide a solution.
Sample: sample.zip (1.1 MB)

Output: Output.zip (226.0 KB)

Extracted wrong output: Extracted wrong output.zip (370.8 KB)

tahir.manzoor · April 2, 2020, 5:19pm

@jan.kathir

We are working on your query and will get back to you soon.

jan.kathir · April 3, 2020, 4:32am

@tahir.manzoor
Thanks in advance. Kindly provide the solution as soon as possible.

tahir.manzoor · April 3, 2020, 1:56pm

@jan.kathir

We have tested the scenario using the following code example and have not found any error message. Please use this code example to get the desired output.

// Requires: import com.aspose.words.*; import java.util.ArrayList; import java.util.Collections;
Document doc = new Document(MyDir + "Fig beside.docx");
int i = 1;
// Text-free paragraphs (the ones holding the images) found directly before a caption.
ArrayList<Paragraph> nodes = new ArrayList<Paragraph>();

for (Paragraph paragraph : (Iterable<Paragraph>) doc.getChildNodes(NodeType.PARAGRAPH, true))
{
    if (paragraph.toString(SaveFormat.TEXT).trim().startsWith("Fig"))
    {
        // Walk backwards over the consecutive text-free paragraphs before the caption.
        Node previousPara = paragraph.getPreviousSibling();
        while (previousPara != null
                && previousPara.getNodeType() == NodeType.PARAGRAPH
                && previousPara.toString(SaveFormat.TEXT).trim().length() == 0)
        {
            nodes.add((Paragraph) previousPara);
            previousPara = previousPara.getPreviousSibling();
        }

        if (nodes.size() > 0)
        {
            // Restore document order.
            Collections.reverse(nodes);

            // Export the collected paragraphs, with their shapes, into a new document.
            Document dstDoc = new Document();
            NodeImporter importer = new NodeImporter(doc, dstDoc, ImportFormatMode.KEEP_SOURCE_FORMATTING);
            for (Paragraph para : nodes)
            {
                Node newNode = importer.importNode(para, true);
                dstDoc.getFirstSection().getPageSetup().setOrientation(para.getParentSection().getPageSetup().getOrientation());
                dstDoc.getFirstSection().getBody().appendChild(newNode);
            }
            // Remove the first paragraph if it is empty.
            if (dstDoc.getFirstSection().getBody().getFirstParagraph().toString(SaveFormat.TEXT).trim().length() == 0)
                dstDoc.getFirstSection().getBody().getFirstParagraph().remove();

            dstDoc.save(MyDir + "output" + i + ".docx");
            i++;
            nodes.clear();
        }
    }

    // Caption paragraph that itself contains the image: keep the shape, drop the text.
    if (paragraph.toString(SaveFormat.TEXT).trim().startsWith("Fig")
            && paragraph.getChildNodes(NodeType.SHAPE, true).getCount() > 0)
    {
        Document dstDoc = new Document();
        NodeImporter importer = new NodeImporter(doc, dstDoc, ImportFormatMode.KEEP_SOURCE_FORMATTING);
        Paragraph paragraph1 = (Paragraph) paragraph.deepClone(true);
        paragraph1.getRuns().clear();
        Node newNode = importer.importNode(paragraph1, true);
        dstDoc.getFirstSection().getBody().appendChild(newNode);
        dstDoc.save(MyDir + "base_out" + i + ".docx");
        // Advance the counter so the next figure does not overwrite this file.
        i++;
    }
}
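
The backward walk in the code above may be easier to follow in isolation. The following is only a simplified model in plain Java, not Aspose.Words code: each string stands for one paragraph's plain text, an empty string stands for a paragraph that holds only an image, and `PrecedingBlankRun` and `blankRunBefore` are hypothetical names chosen for this illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Simplified model: each String is one paragraph's plain text, and an
// empty string stands for a paragraph that holds only an image.
final class PrecedingBlankRun {
    // Returns the indexes of the blank paragraphs directly before
    // captionIndex, in document order.
    static List<Integer> blankRunBefore(List<String> paragraphs, int captionIndex) {
        List<Integer> run = new ArrayList<>();
        // Step backwards until the start of the list or a paragraph with text.
        for (int k = captionIndex - 1; k >= 0 && paragraphs.get(k).trim().isEmpty(); k--) {
            run.add(k);
        }
        // The walk collected the indexes last-to-first, so reverse them.
        Collections.reverse(run);
        return run;
    }

    public static void main(String[] args) {
        List<String> paragraphs = Arrays.asList(
                "Intro text", "", "", "Fig 1 Two stacked images", "More text");
        System.out.println(blankRunBefore(paragraphs, 3)); // [1, 2]
    }
}
```

The reversal matters because the walk meets the paragraphs last-to-first; without it, the images would be appended to the new document in the wrong order.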

tahir.manzoor · April 3, 2020, 2:01pm

@jan.kathir

Please use the following code example to achieve this requirement. The code examples for your two cases are almost the same, so you need to build the logic according to your requirement.

// Requires: import com.aspose.words.*; import java.util.ArrayList; import java.util.Collections;
Document doc = new Document(MyDir + "input.docx");
int i = 1;
// The caption paragraph plus the paragraphs that precede it.
ArrayList<Paragraph> nodes = new ArrayList<Paragraph>();

for (Paragraph paragraph : (Iterable<Paragraph>) doc.getChildNodes(NodeType.PARAGRAPH, true))
{
    if (paragraph.toString(SaveFormat.TEXT).trim().startsWith("Figure"))
    {
        Node previousPara = paragraph.getPreviousSibling();
        // Keep the caption itself so that it is exported with the images.
        nodes.add(paragraph);
        // Collect every preceding paragraph sibling, stopping at the first
        // node that is not a paragraph.
        while (previousPara != null
                && previousPara.getNodeType() == NodeType.PARAGRAPH)
        {
            nodes.add((Paragraph) previousPara);
            previousPara = previousPara.getPreviousSibling();
        }

        if (nodes.size() > 0)
        {
            // Restore document order; the caption ends up last.
            Collections.reverse(nodes);

            // Export the collected paragraphs, with their shapes, into a new document.
            Document dstDoc = new Document();
            NodeImporter importer = new NodeImporter(doc, dstDoc, ImportFormatMode.KEEP_SOURCE_FORMATTING);
            for (Paragraph para : nodes)
            {
                Node newNode = importer.importNode(para, true);
                dstDoc.getFirstSection().getPageSetup().setOrientation(para.getParentSection().getPageSetup().getOrientation());
                dstDoc.getFirstSection().getBody().appendChild(newNode);
            }
            // Remove the first paragraph if it is empty.
            if (dstDoc.getFirstSection().getBody().getFirstParagraph().toString(SaveFormat.TEXT).trim().length() == 0)
                dstDoc.getFirstSection().getBody().getFirstParagraph().remove();

            dstDoc.save(MyDir + "output" + i + ".docx");
            i++;
            nodes.clear();
        }
    }
}

jan.kathir · April 6, 2020, 5:32am

@tahir.manzoor
Neither case is working; the result remains the same as before. For your reference, I have attached the extracted samples. Kindly provide a solution for this scenario.
For scenario 1 (an error message appears beside the extracted image while extracting), the extracted image: Fig beside_Fig0004.zip (370.8 KB)
Here you can find the error message next to the image.
For scenario 2 (I need to extract the figure caption beside the image, followed by another), the extracted image: Fig beside_Fig0004.zip (370.8 KB)
Here both images are extracted with a single caption. I need the images extracted separately, as I mentioned above. Output: [Output.zip](https://forum.aspose.com/uploads/default/36607) (226.0 KB)
It would be more helpful if you could give a single method that works for both scenarios.
Kindly provide it as soon as possible.

tahir.manzoor · April 6, 2020, 3:10pm

@jan.kathir

In this case, you can use the following code example to get the desired output.

// This block runs inside the loop over the document's paragraphs;
// doc, i and MyDir are defined by the surrounding code.
if (paragraph.toString(SaveFormat.TEXT).trim().startsWith("Fig")
        && paragraph.getChildNodes(NodeType.SHAPE, true).getCount() > 0)
{
    // Count the occurrences of "Fig" on a clone of the paragraph:
    // Range.replace returns the number of replacements it made.
    int replaces = paragraph.deepClone(true).getRange().replace("Fig", "Fig");
    if (replaces > 1)
    {
        // Several captions in one paragraph: save each shape to its own file.
        for (Shape shape : (Iterable<Shape>) paragraph.getChildNodes(NodeType.SHAPE, true))
        {
            Document dstDoc = new Document();
            NodeImporter importer = new NodeImporter(doc, dstDoc, ImportFormatMode.KEEP_SOURCE_FORMATTING);
            Node newNode = importer.importNode(shape, true);
            dstDoc.getFirstSection().getBody().getFirstParagraph().appendChild(newNode);
            dstDoc.save(MyDir + "base_out" + i + ".pdf");
            i++;
        }
    }
    else
    {
        // Single caption: keep the shape, drop the text runs.
        Document dstDoc = new Document();
        NodeImporter importer = new NodeImporter(doc, dstDoc, ImportFormatMode.KEEP_SOURCE_FORMATTING);
        Paragraph paragraph1 = (Paragraph) paragraph.deepClone(true);
        paragraph1.getRuns().clear();
        Node newNode = importer.importNode(paragraph1, true);
        dstDoc.getFirstSection().getBody().appendChild(newNode);
        // Remove all fields from the output document before saving.
        dstDoc.getRange().getFields().clear();
        dstDoc.save(MyDir + "base_out" + i + ".pdf");
        i++;
    }
}
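
The `replace("Fig", "Fig")` call above is used only to count how many times "Fig" occurs in the paragraph. If you prefer to make that intent explicit, a plain-Java count over the paragraph text is one possible alternative. This is a minimal sketch under that assumption; `CaptionCounter` and `countOccurrences` are hypothetical names, not part of Aspose.Words.

```java
// Hypothetical helper, not part of Aspose.Words: counts non-overlapping
// occurrences of a caption prefix in a paragraph's plain text.
final class CaptionCounter {
    static int countOccurrences(String text, String prefix) {
        if (text == null || prefix == null || prefix.isEmpty()) {
            return 0;
        }
        int count = 0;
        int from = text.indexOf(prefix);
        while (from >= 0) {
            count++;
            // Continue searching after the end of the current match.
            from = text.indexOf(prefix, from + prefix.length());
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countOccurrences("Fig 4a Left view Fig 4b Right view", "Fig")); // 2
        System.out.println(countOccurrences("Fig 5 Single image", "Fig"));                 // 1
    }
}
```

With such a helper, the check could read `countOccurrences(paragraph.toString(SaveFormat.TEXT), "Fig") > 1`, which avoids cloning the paragraph just to count matches.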