Extracting images from a document with legends like a, b

tahir.manzoor · October 9, 2017, 4:51pm

Thanks for your inquiry. The shared expected output documents do not contain the group shape. We suggest you read the following article.
Extract Selected Content Between Nodes

If you still face problems, please share the conditions under which you want to extract the content. We will then provide more information.

priyanga · October 10, 2017, 12:29pm

Hi @tahir.manzoor

Thanks for the timely reply.

In the following document, some images are accompanied by text boxes, and in the output the text boxes appear distorted relative to the image. The affected images are fig_3 and fig_4. Is there any workaround for this problem?

Thanks & regards,
priyanga G

tahir.manzoor · October 10, 2017, 4:59pm

@priyanga,

Thanks for your inquiry. In your case, we suggest you bookmark the desired content and extract it using the code example shared in the following link. Hope this helps you.
Extract Content from a Bookmark

priyanga · October 11, 2017, 10:26am

Hi @tahir.manzoor,
Thank you very much.
I tried the extraction based on bookmarked content. It gives the same output: the output is distorted.
Please suggest any other way to work around this issue.

How can I set the current image as the current node and iterate through the shapes, and then extract it when no further shapes are found (shapes == null)?

Regards.
Priyanga.G

tahir.manzoor · October 11, 2017, 5:53pm

@priyanga,

Thanks for your inquiry. We are working on this scenario and will get back to you soon.

tahir.manzoor · October 12, 2017, 3:42pm

@priyanga,

Thanks for your patience. In this case, we suggest you bookmark the content that you want to extract, and then extract it according to your requirement. The following code example shows how to extract the content of “Figure 04: Vijaya Jadkar et al.”. Hope this helps you.

Please check the code of the extractContent and generateDocument methods.

Document doc = new Document(MyDir + "test.docx");
DocumentBuilder builder = new DocumentBuilder(doc);

//Extract the figure "Figure 04: Vijaya Jadkar et al."
for (Paragraph  paragraph : (Iterable<Paragraph>) doc.getChildNodes(NodeType.PARAGRAPH, true)) {

    if(paragraph.toString(SaveFormat.TEXT).trim().startsWith("Figure 04: Vijaya Jadkar et al."))
    {
        builder.moveTo(paragraph);
        builder.startBookmark("Figure_04");
        builder.endBookmark("Figure_04");
    }
    if(paragraph.toString(SaveFormat.TEXT).trim().startsWith("Figure 05: Vijaya Jadkar et al."))
    {
        builder.moveTo(paragraph);
        builder.startBookmark("Figure_05");
        builder.endBookmark("Figure_05");
        break;
    }
}

Node start = doc.getRange().getBookmarks().get("Figure_04").getBookmarkStart().getParentNode();
Node end = doc.getRange().getBookmarks().get("Figure_05").getBookmarkStart().getParentNode();
ArrayList nodes = extractContent(start, end, true);

Document dstDoc = generateDocument(doc, nodes);
dstDoc.getLastSection().getBody().getLastParagraph().remove();
dstDoc.save(MyDir + "output.docx");

priyanga · October 13, 2017, 7:05am

Hi @tahir.manzoor,

Thank you very much. That was very helpful.

Once I integrated it with my code, it does not work for grouped shapes.

I have attached the source code: source.zip (1.0 KB)
Please provide a solution for this.
Thanks & Regards,
priyanga G

tahir.manzoor · October 13, 2017, 5:06pm

@priyanga,

Thanks for your inquiry. Please make sure that you have integrated the code correctly. The code example shared in my previous post works fine. You can test it with your input document. First, you need to insert the bookmarks for the paragraphs, e.g. “Figure 04: Vijaya Jadkar et al.” and “Figure 05: Vijaya Jadkar et al.”, and then extract the content between the bookmarks.
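The two-step idea (mark the start and end captions, then take everything between them) can be sketched without the library. The following is only a simplified model in plain Java, not Aspose.Words code: paragraphs are represented as strings, and the class name BetweenCaptions and the method between are hypothetical names used for illustration. It keeps the start caption and drops the end caption, which mirrors removing the last paragraph after extraction in the earlier example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class BetweenCaptions {
    // Returns the paragraphs from the one starting with startCaption up to,
    // but not including, the one starting with endCaption.
    // Returns an empty list if either caption is not found.
    static List<String> between(List<String> paragraphs, String startCaption, String endCaption) {
        int start = -1;
        int end = -1;
        for (int i = 0; i < paragraphs.size(); i++) {
            String text = paragraphs.get(i).trim();
            if (start < 0 && text.startsWith(startCaption)) {
                start = i;              // first marker found
            } else if (start >= 0 && text.startsWith(endCaption)) {
                end = i;                // second marker found, stop scanning
                break;
            }
        }
        if (start < 0 || end < 0) {
            return new ArrayList<>();   // one of the markers is missing
        }
        return new ArrayList<>(paragraphs.subList(start, end));
    }

    public static void main(String[] args) {
        List<String> doc = Arrays.asList(
                "Intro",
                "Figure 04: Vijaya Jadkar et al.",
                "[content]",
                "Figure 05: Vijaya Jadkar et al.",
                "[content]");
        System.out.println(between(doc, "Figure 04", "Figure 05"));
    }
}
```

In the real document the two markers are the bookmarks, and the content between them is collected by the extractContent helper rather than by list indexes.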

priyanga · October 16, 2017, 1:12pm

Hi @tahir.manzoor,

I get the proper output before integrating the code. When I tried to remove the paragraph, it removed the paragraph along with the images.

Once integrated, the output in the output folder shows empty pages.

The source is Test.zip (42.8 KB)

The expected output is OutputFolder.zip (1.6 MB)

Please kindly suggest how to resolve the issue. I am awaiting your quick reply.

Thanks and Regards
priyanga.G

tahir.manzoor · October 16, 2017, 5:08pm

@priyanga,

Thanks for your inquiry. Please use the following code snippet to remove the empty pages from the end of the output document. Hope this helps you.

if (doc.getRange().getBookmarks().get("_GoBack") != null)
    doc.getRange().getBookmarks().get("_GoBack").remove();

// Remove empty paragraphs from the end of the document.
while (doc.getLastSection().getBody().getLastParagraph().toString(SaveFormat.TEXT).trim().length() == 0)
{
    Paragraph lastParagraph = doc.getLastSection().getBody().getLastParagraph();

    // Stop if the paragraph has no text but still holds content that must be kept.
    if (lastParagraph.getChildNodes(NodeType.SHAPE, true).getCount() > 0
            || lastParagraph.getChildNodes(NodeType.GROUP_SHAPE, true).getCount() > 0
            || lastParagraph.getChildNodes(NodeType.FORM_FIELD, true).getCount() > 0
            || lastParagraph.getChildNodes(NodeType.FOOTNOTE, true).getCount() > 0
            || lastParagraph.getChildNodes(NodeType.COMMENT, true).getCount() > 0)
        break;

    // If this is the last paragraph of the document, remove any page break from it.
    if (lastParagraph.isEndOfDocument())
        lastParagraph.getRange().replace(ControlChar.PAGE_BREAK, "", new FindReplaceOptions());

    // Stop if the preceding node is not a paragraph (for example, a table).
    Node previous = lastParagraph.getPreviousSibling();
    if (previous != null && previous.getNodeType() != NodeType.PARAGRAPH)
        break;

    lastParagraph.remove();

    // If the current section becomes empty, we should remove it.
    if (!doc.getLastSection().getBody().hasChildNodes())
        doc.getLastSection().remove();

    // We should exit the loop if the document becomes empty.
    if (!doc.hasChildNodes())
        break;
}

priyanga · October 17, 2017, 4:30am

Hi @tahir.manzoor

I am extracting the group shape using the following code. It works on its own, but once it is integrated it does not work.

Document interimdoc1 = new Document(interim);
DocumentBuilder builder1 = new DocumentBuilder(interimdoc1);
builder1.moveToDocumentStart();
builder1.startBookmark("bookmark0");
builder1.endBookmark("bookmark0");
System.out.println("execute");
ArrayList<String> bookmarks = new ArrayList<String>();
bookmarks.add("bookmark0");
System.out.println("execute2");
i = 1;
NodeCollection paragraphs = interimdoc1.getChildNodes(NodeType.PARAGRAPH, true);
for (Paragraph paragraph : (Iterable<Paragraph>)paragraphs)
{
    System.out.println("execute3");
    if (paragraph.toString(SaveFormat.TEXT).trim().contains("Fig"))
    {
        builder1.moveTo(paragraph.getRuns().get(0));
        builder1.startBookmark("bookmark" + i);
        builder1.endBookmark("bookmark" + i);
        bookmarks.add("bookmark" + i);
        System.out.println("execute4");
        i++;
    }
}

for (int b = 0; b < bookmarks.size() - 1; b++)
{
    Bookmark bookmark1 = interimdoc1.getRange().getBookmarks().get(bookmarks.get(b));
    Bookmark bookmark2 = interimdoc1.getRange().getBookmarks().get(bookmarks.get(b + 1));
    ArrayList nodes1 = extractContent(bookmark1.getBookmarkStart(), bookmark2.getBookmarkEnd(), true);
    Document dstDoc = generateDocument(interimdoc1, nodes1);
    System.out.println("execute5");
    tableGroupImage = filefoldername + page + "_" + "Fig_new" + i + "_" + "fig" + ".docx";
    //			    for (Paragraph paragraph : (Iterable<Paragraph>) dstDoc.getChildNodes(NodeType.PARAGRAPH, true))
    //			    {
    //			        if(paragraph.getChildNodes(NodeType.SHAPE, true).getCount() == 0)
    //			        {
    //			            paragraph.remove();
    //			        }
    //			    }
    dstDoc.save(tableGroupImage);

    if (interimdoc1.getRange().getBookmarks().get("_GoBack") != null)
        interimdoc1.getRange().getBookmarks().get("_GoBack").remove();

    while (interimdoc1.getLastSection().getBody().getLastParagraph().toString(SaveFormat.TEXT).trim().length() == 0)
    {
        if (interimdoc1.getLastSection().getBody().getLastParagraph().getChildNodes(NodeType.SHAPE, true).getCount() > 0
                || interimdoc1.getLastSection().getBody().getLastParagraph().getChildNodes(NodeType.GROUP_SHAPE, true).getCount() > 0
                || interimdoc1.getLastSection().getBody().getLastParagraph().getChildNodes(NodeType.FORM_FIELD, true).getCount() > 0
                || interimdoc1.getLastSection().getBody().getLastParagraph().getChildNodes(NodeType.FOOTNOTE, true).getCount() > 0
                || interimdoc1.getLastSection().getBody().getLastParagraph().getChildNodes(NodeType.COMMENT, true).getCount() > 0
                )
            break;

        //Check if last paragraph contains the page break
        if (interimdoc1.getLastSection().getBody().getLastParagraph().isEndOfDocument())
        {
            interimdoc1.getLastSection().getBody().getLastParagraph().getRange().replace(ControlChar.PAGE_BREAK, "", new FindReplaceOptions());
        }

        if (interimdoc1.getLastSection().getBody().getLastParagraph().getPreviousSibling() != null &&
                (interimdoc1.getLastSection().getBody().getLastParagraph().getPreviousSibling().getNodeType() != NodeType.PARAGRAPH))
            break;

        interimdoc1.getLastSection().getBody().getLastParagraph().remove();

        // If the current section becomes empty, we should remove it.
        if (!interimdoc1.getLastSection().getBody().hasChildNodes())
            interimdoc1.getLastSection().remove();

        // We should exit the loop if the document becomes empty.
        if (!interimdoc1.hasChildNodes())
            break;
    }



}

/** REMOVE EMPTY PAGES END **/

The input document: Test.zip (1.7 MB)
The expected output: OutputFolder.zip (1.7 MB)
I attached the source code in my previous post. Please kindly help me to extract the images.

Many thanks for your quick reply,
regards,
priyanga G

priyanga · October 17, 2017, 9:21am

Hi @tahir.manzoor

In the source document, group shapes are mixed with normal shapes (pages 15, 16 and 17).

The functions you provided extract group shapes and normal shapes separately.

Now I want to extract the images as one.

Please help me to extract the shapes.

Thanks & Regards,
priyanga G

tahir.manzoor · October 17, 2017, 4:31pm

@priyanga,

Thanks for your inquiry.

You are adding bookmarks for text that starts with “Fig”. Your document contains many paragraphs that start with the same text. You need to bookmark the text that you want to extract, e.g. “Figure 04: Vijaya Jadkar et al.” and “Figure 05: Vijaya Jadkar et al.”

You are using the trimming code in the wrong place. Please use this code before saving the document.

priyanga · October 21, 2017, 7:02am

Hi @tahir.manzoor,

Thanks for your timely solution. I am doing the same process for more samples. How can I extract the shapes without specifying the exact caption, e.g. “Figure 04: Vijaya Jadkar et al.” and “Figure 05: Vijaya Jadkar et al.”?

I am awaiting your quick reply.

Thanks a lot,
priyanga G

tilal.ahmad · October 21, 2017, 3:04pm

@priyanga

Thanks for your inquiry. If you do not want to specify the shape caption, then you need to iterate through the Shape nodes and get their parent paragraphs, as suggested in the above post, and proceed accordingly.
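A related way to avoid hard-coding each caption is to recognize caption paragraphs by pattern rather than by exact text. The sketch below is plain Java and independent of Aspose.Words. The pattern ("Fig", "Fig." or "Figure" followed by a number) is an assumption about how captions look and should be adapted to your documents; the class name CaptionMatcher and the method figureNumber are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CaptionMatcher {
    // Assumed caption shape: "Fig", "Fig." or "Figure" at the start of the
    // paragraph, followed by a number, e.g. "Figure 04: ..." or "Fig. 3 ...".
    private static final Pattern CAPTION =
            Pattern.compile("^(?:Fig\\.?|Figure)\\s*(\\d{1,6})", Pattern.CASE_INSENSITIVE);

    // Returns the figure number if the text starts like a caption, or -1 otherwise.
    static int figureNumber(String paragraphText) {
        Matcher m = CAPTION.matcher(paragraphText.trim());
        return m.find() ? Integer.parseInt(m.group(1)) : -1;
    }

    public static void main(String[] args) {
        System.out.println(figureNumber("Figure 04: Vijaya Jadkar et al."));
        System.out.println(figureNumber("As shown in Figure 4, the result differs."));
    }
}
```

The text passed in would be the paragraph text obtained with paragraph.toString(SaveFormat.TEXT), as in the code earlier in this thread. Requiring a number after the keyword keeps ordinary sentences that merely begin with "Fig" or "Figures" from being treated as captions.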

priyanga · October 23, 2017, 1:00pm

Hi team,

Thank you very much for

I am extracting images and saving each one in a separate document, using the paragraph node and the figure caption as the keyword for the extraction.

I use the page splitter to convert the document into pages, and then the extraction process begins.
The problem is that some of the images and their figure captions are separated, for example an image on page 1 and its caption on page 2. How can I extract those images?

The extracted images are saved with file names such as page1_Fig1_fig1.docx

The input document isTest.zip (351.7 KB)

Thanks & Regards,
priyanga G

tahir.manzoor · October 23, 2017, 4:59pm

@priyanga,

Thanks for your inquiry. In this scenario, we suggest the following solution.

1. Iterate through all paragraphs.
2. Get the paragraph’s text using the Node.toString method.
3. Check whether the paragraph’s text starts with “Fig.”.
4. If it does, get the previous node that contains the Shape nodes.
5. Extract the content as suggested in this forum thread. The start node will be the Shape node and the end node will be the paragraph that starts with “Fig”.

Hope this helps you.
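The steps above can be sketched as a small library-independent model. This is not Aspose.Words code: the Para class is a hypothetical stand-in for a paragraph (its text plus the number of shapes it contains), and FigureRanges and figureRanges are illustrative names. For each caption it walks backwards over the paragraphs that hold shapes and reports the start and end positions to extract.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class FigureRanges {
    // Hypothetical stand-in for a paragraph: its text and how many shapes it holds.
    static final class Para {
        final String text;
        final int shapeCount;
        Para(String text, int shapeCount) {
            this.text = text;
            this.shapeCount = shapeCount;
        }
    }

    // For each caption paragraph (text starts with "Fig"), returns {start, end}:
    // start is the index of the first shape paragraph directly before the
    // caption, end is the index of the caption itself.
    static List<int[]> figureRanges(List<Para> paragraphs) {
        List<int[]> ranges = new ArrayList<>();
        for (int end = 0; end < paragraphs.size(); end++) {
            if (!paragraphs.get(end).text.trim().startsWith("Fig")) {
                continue;
            }
            int start = end;
            // Walk back while the preceding paragraph contains shapes.
            while (start > 0 && paragraphs.get(start - 1).shapeCount > 0) {
                start--;
            }
            if (start < end) {
                ranges.add(new int[] {start, end}); // skip captions with no shapes before them
            }
        }
        return ranges;
    }

    public static void main(String[] args) {
        List<Para> doc = Arrays.asList(
                new Para("Intro", 0),
                new Para("", 1),
                new Para("", 2),
                new Para("Fig. 1 caption", 0));
        for (int[] r : figureRanges(doc)) {
            System.out.println(r[0] + " to " + r[1]);
        }
    }
}
```

In the real document, the start position corresponds to the paragraph holding the first Shape node and the end position to the caption paragraph, which are then passed to the extraction code from this thread.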

priyanga · October 25, 2017, 1:17pm

Hi @tahir.manzoor

Thank you very much.

// Requires java.util.ArrayList, java.util.Collections and com.aspose.words.*
int i = 1;

// Find each paragraph that starts with "Fig" (the caption).
for (Paragraph paragraph : (Iterable<Paragraph>) interimdoc
		.getChildNodes(NodeType.PARAGRAPH, true)) {
	if (!paragraph.toString(SaveFormat.TEXT).trim().startsWith("Fig"))
		continue;

	// Collect the caption so that it is exported together with the image.
	ArrayList<Node> nodes = new ArrayList<Node>();
	nodes.add(paragraph);

	// Walk backwards over the preceding paragraphs that have no text
	// but contain shapes.
	Node previousPara = paragraph.getPreviousSibling();
	while (previousPara != null
			&& previousPara.getNodeType() == NodeType.PARAGRAPH
			&& previousPara.toString(SaveFormat.TEXT).trim().length() == 0
			&& ((Paragraph) previousPara).getChildNodes(
					NodeType.SHAPE, true).getCount() > 0) {
		nodes.add(previousPara);
		previousPara = previousPara.getPreviousSibling();
	}

	// Restore document order: shapes first, caption last.
	Collections.reverse(nodes);

	// Export the consecutive shapes and the caption into a new document.
	Document dstDoc = new Document();
	dstDoc.removeAllChildren();
	dstDoc.ensureMinimum();

	NodeImporter importer = new NodeImporter(interimdoc, dstDoc,
			ImportFormatMode.KEEP_SOURCE_FORMATTING);
	for (Node node : nodes) {
		Node newNode = importer.importNode(node, true);
		dstDoc.getFirstSection().getBody().appendChild(newNode);
	}

	// Save once, after all nodes have been imported.
	dstDoc.save("E:/data/image_" + i + ".docx");
	i++;
}

Is this what you mentioned in the previous post?

regards,
priyanga G

tahir.manzoor · October 25, 2017, 4:22pm

@priyanga,

Thanks for your inquiry. Yes, you can use the same approach to get the desired output.

priyanga · December 26, 2017, 6:50am

Hi @tahir.manzoor,

Thanks for your great support.

I am still having some extraction problems. Some of the images are not extracted. Please kindly help me resolve this and extract those images.

source code src.zip (23.0 KB)

The input list.zip (525.9 KB)

The expected output expected output.zip (599.0 KB)

The actual output actual output.zip (119.5 KB)

The showcase is approaching, so please kindly help me.

Thanks & regards,
priyanga G