Fig caption problem

priyanga · November 3, 2017, 9:01am

Hi Team,

I extract images from a document by using the figure caption paragraph (text starting with "Fig") as the keyword.
The current extraction reads the next sibling and extracts the images.
But in the new source document the figure caption is the previous sibling of the image, and the document also has consecutive images.
So the extraction process skips some images.
Please help me extract the images when the figure caption is the previous sibling.

The sample code Test.zip (34.7 KB)

The input document source.zip (1.2 MB)

The actual output ActualOutput.zip (1.2 MB)

The expected output expected_output.zip (1.1 MB)

Thanks & regards,
Priyanga G

tahir.manzoor · November 3, 2017, 4:16pm

@priyanga,

Thanks for your inquiry. Please use the following code example to get the desired output.

// Requires: import com.aspose.words.*;
Document doc = new Document(MyDir + "source.docx");
int i = 1;
// Visit every paragraph and look for figure captions (text starting with "Fig").
for (Paragraph paragraph : (Iterable<Paragraph>) doc.getChildNodes(NodeType.PARAGRAPH, true))
{
    if (paragraph.toString(SaveFormat.TEXT).trim().startsWith("Fig"))
    {
        // The image is expected in the next sibling: a paragraph with no text and at least one shape.
        Node nextPara = paragraph.getNextSibling();
        if (nextPara != null
                && nextPara.getNodeType() == NodeType.PARAGRAPH
                && nextPara.toString(SaveFormat.TEXT).trim().length() == 0
                && ((Paragraph) nextPara).getChildNodes(NodeType.SHAPE, true).getCount() > 0)
        {
            // Import the image paragraph into a new document and save it.
            Document dstDoc = new Document();
            NodeImporter importer = new NodeImporter(doc, dstDoc, ImportFormatMode.KEEP_SOURCE_FORMATTING);
            Node newNode = importer.importNode(nextPara, true);
            dstDoc.getFirstSection().getBody().appendChild(newNode);

            dstDoc.save(MyDir + "output" + i + ".docx");
            i++;
        }
    }
}

priyanga · November 6, 2017, 5:20am

Hi @tahir.manzoor

Thanks a lot.

It extracts the images whose previous sibling is the caption, but only when there is a single image.

Some documents have a single figure caption followed by two images. How can I extract those images?

The input sample is Test2.zip (2.1 MB)

The expected output is expectedOutput.zip (2.1 MB)

Thanks & Regards,
Priyanga G

priyanga · November 6, 2017, 6:37am

Hi @tahir.manzoor ,

Thank you very much.

I have another issue with figure captions.

I have integrated the code into my sample code.

The input document has captions both before and after the images (previous and next captions), with consecutive images.

In that case, the images that have the caption as next sibling are processed first.

Some input samples have previous captions only. In that case too, the next-sibling logic runs first and the previous-sibling logic runs afterwards. It also takes the 2nd figure caption for the 1st image and finally skips some images. Please help me resolve this issue, and please provide a solution for both this post and the previous one.

The sample document is Test2.zip (2.1 MB)

Thanks & Regards,
priyanga G

tahir.manzoor · November 6, 2017, 6:48am

@priyanga,

Thanks for your inquiry. We already shared the same approach with you in an earlier forum post.

We have modified the same code for this scenario. Please check the following code example.

// Requires: import com.aspose.words.*; import java.util.ArrayList;
Document doc = new Document(MyDir + "Test2.docx");
int i = 1;
ArrayList<Paragraph> nodes = new ArrayList<Paragraph>();

//Get the paragraphs that start with "Fig".
for (Paragraph paragraph : (Iterable<Paragraph>) doc.getChildNodes(NodeType.PARAGRAPH, true))
{
    if (paragraph.toString(SaveFormat.TEXT).trim().startsWith("Fig"))
    {
        //Collect every consecutive image paragraph (no text, at least one shape) after the caption.
        Node nextPara = paragraph.getNextSibling();
        while (nextPara != null
                && nextPara.getNodeType() == NodeType.PARAGRAPH
                && nextPara.toString(SaveFormat.TEXT).trim().length() == 0
                && ((Paragraph) nextPara).getChildNodes(NodeType.SHAPE, true).getCount() > 0)
        {
            nodes.add((Paragraph) nextPara);
            nextPara = nextPara.getNextSibling();
        }

        //Skip captions that are not followed by images, so that no empty document is saved.
        if (!nodes.isEmpty())
        {
            //Extract the consecutive shapes and export them into a new document.
            Document dstDoc = new Document();
            NodeImporter importer = new NodeImporter(doc, dstDoc, ImportFormatMode.KEEP_SOURCE_FORMATTING);
            for (Paragraph para : nodes)
            {
                dstDoc.getFirstSection().getBody().appendChild(importer.importNode(para, true));
            }
            dstDoc.save(MyDir + "output" + i + ".docx");
            i++;
            nodes.clear();
        }
    }
}
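
For anyone who wants to check the grouping logic without Aspose.Words on the classpath, here is a minimal plain-Java sketch of the same idea. It is only a model, not Aspose API: the class `CaptionGrouping`, the method `imagesAfterCaptions`, and the `"IMG"` marker standing in for an image-only paragraph are hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for a document body: each string is one paragraph.
// "IMG" marks a paragraph that holds only a shape; anything else is text.
class CaptionGrouping {

    // Returns one list of image indexes per "Fig" caption, collecting every
    // consecutive image paragraph that directly follows the caption.
    static List<List<Integer>> imagesAfterCaptions(List<String> paragraphs) {
        List<List<Integer>> groups = new ArrayList<>();
        for (int i = 0; i < paragraphs.size(); i++) {
            if (!paragraphs.get(i).trim().startsWith("Fig")) {
                continue;
            }
            List<Integer> group = new ArrayList<>();
            int next = i + 1;
            // Walk forward while the following paragraphs are images.
            while (next < paragraphs.size() && paragraphs.get(next).equals("IMG")) {
                group.add(next);
                next++;
            }
            // A caption with no images produces no group (and so no empty output).
            if (!group.isEmpty()) {
                groups.add(group);
            }
        }
        return groups;
    }

    public static void main(String[] args) {
        List<String> body = Arrays.asList(
                "Fig 1. One image", "IMG", "Some text",
                "Fig 2. Two images", "IMG", "IMG", "Fig 3. No image");
        System.out.println(imagesAfterCaptions(body));
    }
}
```

With the sample list in `main`, this prints `[[1], [4, 5]]`: the first caption owns one image, the second owns two, and the third caption is skipped because nothing follows it.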

priyanga · November 6, 2017, 12:16pm

Hi @tahir.manzoor

Exactly, the output is good.

But once I integrated the shared code, some empty pages appear.

It also grabs images that have the figure caption as their next sibling. Please let me know how to delete the empty pages.

I call dstDoc.removeAllChildren() before appending to the document, but it does not remove the empty pages either. Also, how can I resolve the clash between previous-sibling and next-sibling captions?

I have attached the sample document Test.zip (327.2 KB)

The expected output expected output.zip (327.8 KB)

The actual output Actual output.zip (345.1 KB)

Many Thanks in advance,
Priyanga G

tahir.manzoor · November 6, 2017, 4:40pm

@priyanga,

Please make sure that you are integrating the code correctly.

We have not found this issue while using the shared code example. Could you please share some more detail about this issue? We will investigate the issues and provide you more information on this.

priyanga · November 7, 2017, 4:44am

Hi @tahir.manzoor ,

The sample you shared is fine. It gives the expected output when the figure caption is the previous sibling.

But I am extracting the images using various sections.

For some documents the output is mismatched, because the image actually has its figure caption as the next sibling, but the code treats the previous sibling as its caption. For example, Fig 5 comes out as Fig 4.

The sample code Test.zip (42.2 KB)

The actual output Actual output.zip (1.1 MB)
The output folder contains empty documents.
Also, Fig 5 is extracted as Fig 4, and Fig 8 is extracted as Fig 7.

Thanks & regards,
priyanga G

tahir.manzoor · November 7, 2017, 6:44am

@priyanga,

Thanks for your inquiry.

The code shared in this forum thread to extract the shapes works fine. We used Test2.docx as the input document and did not find any issue.

As per my understanding, you have a document that contains shapes with Fig captions. Some Fig captions are before the Shape node and some are after it. There is no exact condition on which we can decide whether the Fig caption is before or after the Shape node.

The code shared with you works fine. You just need to use it according to your requirement. Hope this answers your query.
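
If a single document mixes both layouts, one possible heuristic (an assumption, not a rule built into Aspose.Words) is to let each caption be claimed by only one run of consecutive images: a run takes the caption directly before it if that caption is still free, otherwise the caption directly after it. The plain-Java sketch below models this on a list of strings. The class `CaptionDirection`, the method `pairRuns`, and the `"IMG"` marker are hypothetical names, and the heuristic can still guess wrong when a free caption sits on both sides of a run.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model: each string is one paragraph, "IMG" is an image-only paragraph.
class CaptionDirection {

    static boolean isCaption(String p) { return p.trim().startsWith("Fig"); }

    static boolean isImage(String p) { return p.equals("IMG"); }

    // Maps caption index -> indexes of the images assigned to that caption.
    static Map<Integer, List<Integer>> pairRuns(List<String> paragraphs) {
        Map<Integer, List<Integer>> result = new LinkedHashMap<>();
        int i = 0;
        while (i < paragraphs.size()) {
            if (!isImage(paragraphs.get(i))) {
                i++;
                continue;
            }
            // Collect one run of consecutive image paragraphs.
            int before = i - 1;
            List<Integer> run = new ArrayList<>();
            while (i < paragraphs.size() && isImage(paragraphs.get(i))) {
                run.add(i);
                i++;
            }
            int after = i;
            // Prefer the caption before the run unless an earlier run already claimed it.
            if (before >= 0 && isCaption(paragraphs.get(before)) && !result.containsKey(before)) {
                result.put(before, run);
            } else if (after < paragraphs.size() && isCaption(paragraphs.get(after))) {
                result.put(after, run);
            }
            // Runs with no adjacent free caption are left unassigned.
        }
        return result;
    }

    public static void main(String[] args) {
        // Captions after the images: each image is paired with the caption below it.
        System.out.println(pairRuns(Arrays.asList("IMG", "Fig 4", "IMG", "Fig 5")));
        // Captions before the images, with two consecutive images under the first caption.
        System.out.println(pairRuns(Arrays.asList("Fig 1", "IMG", "IMG", "Fig 2", "IMG")));
    }
}
```

In the first list the second image is not paired with "Fig 4", because the first image has already claimed that caption. That is the kind of mismatch reported above, where Fig 5 came out as Fig 4.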

priyanga · November 8, 2017, 6:43am

Hi @tahir.manzoor,

Thanks for your feedback. Yes, the shared code extracts the shapes correctly.

Yes, exactly: in the input document some Fig captions are before the Shape node and some are after it.

The code shared in the previous post is working fine.

Please let me know how to bookmark the same paragraph nodes in the previously shared code and remove the extracted images.

Thanks and Regards,
Priyanga G

tahir.manzoor · November 8, 2017, 10:46am

@priyanga,

Thanks for your inquiry. The following code example shows how to bookmark the Fig caption and Shape nodes. It also removes the content of the first bookmark (the first Fig caption and its Shape nodes). Hope this helps you.

// Requires: import com.aspose.words.*; import java.util.ArrayList;
Document doc = new Document(MyDir + "Test2.docx");
DocumentBuilder builder = new DocumentBuilder(doc);
int i = 1;
ArrayList<Paragraph> nodes = new ArrayList<Paragraph>();
int bookmark = 1;
NodeCollection paragraphs = doc.getChildNodes(NodeType.PARAGRAPH, true);
//Get the paragraphs that start with "Fig".
for (Paragraph paragraph : (Iterable<Paragraph>) paragraphs)
{
    if (paragraph.toString(SaveFormat.TEXT).trim().startsWith("Fig"))
    {
        Node nextPara = paragraph.getNextSibling();
        //Start the bookmark at the beginning of the caption paragraph.
        builder.moveToParagraph(paragraphs.indexOf(paragraph), 0);
        builder.startBookmark("Bookmark" + bookmark);
        while (nextPara != null
                && nextPara.getNodeType() == NodeType.PARAGRAPH
                && nextPara.toString(SaveFormat.TEXT).trim().length() == 0
                && ((Paragraph) nextPara).getChildNodes(NodeType.SHAPE, true).getCount() > 0)
        {
            nodes.add((Paragraph) nextPara);
            nextPara = nextPara.getNextSibling();
        }

        //End the bookmark after the last collected image paragraph, or after the
        //caption itself when no image follows. Using the collected list avoids a
        //NullPointerException when the images are the last nodes (nextPara is null).
        Paragraph lastPara = nodes.isEmpty() ? paragraph : nodes.get(nodes.size() - 1);
        builder.moveToParagraph(paragraphs.indexOf(lastPara), -1);
        builder.endBookmark("Bookmark" + bookmark);
        bookmark++;

        //Skip captions that are not followed by images, so that no empty document is saved.
        if (!nodes.isEmpty())
        {
            //Extract the consecutive shapes and export them into a new document.
            Document dstDoc = new Document();
            NodeImporter importer = new NodeImporter(doc, dstDoc, ImportFormatMode.KEEP_SOURCE_FORMATTING);
            for (Paragraph para : nodes)
            {
                dstDoc.getFirstSection().getBody().appendChild(importer.importNode(para, true));
            }
            dstDoc.save(MyDir + "output" + i + ".docx");
            i++;
            nodes.clear();
        }
    }
}

//Remove the content of the first bookmark, if one was created.
Bookmark firstBookmark = doc.getRange().getBookmarks().get("Bookmark1");
if (firstBookmark != null)
    firstBookmark.setText("");

doc.save(MyDir + "output.docx");