Extracting images from a document with legends like a, b

priyanga · October 21, 2017, 7:02am

Thanks for your timely solution. I am running the same process on more samples. How can I extract the shapes without specifying the exact caption, e.g. “Figure 04: Vijaya Jadkar et al.” and “Figure 05: Vijaya Jadkar et al.”?

I look forward to your quick reply.

Thanks a lot,
priyanga G

tilal.ahmad · October 21, 2017, 3:04pm

@priyanga

Thanks for your inquiry. If you do not want to specify the shape caption, you need to iterate through the Shape nodes, get each shape's parent paragraph as suggested in the post above, and proceed accordingly.
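
One way to avoid hard-coding the exact caption is to match the caption pattern instead. The sketch below is plain Java and does not use the Aspose.Words API; the class name `FigureCaptionMatcher`, the method `figureNumber` and the regular expression are assumptions for illustration, and the string passed in would be the text of a shape's parent paragraph or of the paragraph that follows it.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FigureCaptionMatcher {
    // Assumed caption shape: "Fig", "Fig." or "Figure", optional spaces, then a number.
    private static final Pattern CAPTION =
            Pattern.compile("^(?:Fig\\.?|Figure)\\s*(\\d+)", Pattern.CASE_INSENSITIVE);

    /** Returns the figure number if the text looks like a caption, or -1 otherwise. */
    static int figureNumber(String paragraphText) {
        Matcher m = CAPTION.matcher(paragraphText.trim());
        return m.find() ? Integer.parseInt(m.group(1)) : -1;
    }
}
```

With this pattern, `figureNumber("Figure 04: Vijaya Jadkar et al.")` returns 4, while a sentence that only mentions a figure in passing, such as "See Figure 4 for details", returns -1 because the match is anchored at the start of the paragraph.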

priyanga · October 23, 2017, 1:00pm

Hi team,

Thank you very much for

I am extracting images and saving each one in a separate document, using the paragraph node and the figure caption as the keyword for the extraction.

I use the page splitter to convert the document into pages, and then the extraction process begins.
The problem is that some of the images and their figure captions are separated. For example, the image is on page 1 and its caption is on page 2. How can I extract those images?

The extracted images are saved with file names such as page1_Fig1_fig1.docx.

The input document is Test.zip (351.7 KB)

Thanks & Regards,
priyanga G

tahir.manzoor · October 23, 2017, 4:59pm

@priyanga,

Thanks for your inquiry. In this scenario, we suggest the following solution.

1. Iterate through all paragraphs.
2. Get the paragraph’s text using the Node.toString method.
3. Check whether the paragraph’s text starts with “Fig.”.
4. If it does, get the previous node that contains the Shape nodes.
5. Extract the content as suggested in this forum thread. The start node will be the Shape node and the end node will be the paragraph that starts with “Fig”.
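
The grouping logic in these steps can be sketched without the Aspose.Words API. In the plain-Java model below, `Para` is a hypothetical stand-in for a document paragraph (its text plus a flag saying whether it contains a shape), and `groupFigures` is an illustrative name. It assumes that a caption starts with "Fig" and that the images sit in text-free paragraphs directly above it.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class FigureGrouper {
    /** Hypothetical stand-in for a document paragraph. */
    static final class Para {
        final String text;
        final boolean hasShape;
        Para(String text, boolean hasShape) { this.text = text; this.hasShape = hasShape; }
    }

    /** For each caption, returns the indices of the shape paragraphs above it, then the caption index. */
    static List<List<Integer>> groupFigures(List<Para> paragraphs) {
        List<List<Integer>> groups = new ArrayList<>();
        for (int i = 0; i < paragraphs.size(); i++) {
            if (!paragraphs.get(i).text.trim().startsWith("Fig")) continue;
            List<Integer> group = new ArrayList<>();
            group.add(i); // the caption is the end node
            // Walk backwards while the paragraph holds a shape and no text.
            for (int j = i - 1; j >= 0; j--) {
                Para p = paragraphs.get(j);
                if (!(p.hasShape && p.text.trim().isEmpty())) break;
                group.add(j);
            }
            Collections.reverse(group); // restore document order
            groups.add(group);
        }
        return groups;
    }
}
```

Each returned group corresponds to one start/end range: the first index is the start node (the first shape paragraph) and the last index is the caption paragraph.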

Hope this helps you.

priyanga · October 25, 2017, 1:17pm

Hi @tahir.manzoor

Thank you very much.

// Requires: import com.aspose.words.*; import java.util.ArrayList; import java.util.Collections;
int i = 1;

// Visit every paragraph and look for captions that start with "Fig".
for (Paragraph paragraph : (Iterable<Paragraph>) interimdoc
		.getChildNodes(NodeType.PARAGRAPH, true)) {
	if (!paragraph.toString(SaveFormat.TEXT).trim().startsWith("Fig"))
		continue;

	// Collect the caption first so that it is included with the image.
	ArrayList<Node> nodes = new ArrayList<Node>();
	nodes.add(paragraph);

	// Walk backwards over paragraphs that contain shapes but no text.
	Node previousPara = paragraph.getPreviousSibling();
	while (previousPara != null
			&& previousPara.getNodeType() == NodeType.PARAGRAPH
			&& previousPara.toString(SaveFormat.TEXT).trim().length() == 0
			&& ((Paragraph) previousPara).getChildNodes(
					NodeType.SHAPE, true).getCount() > 0) {
		nodes.add(previousPara);
		previousPara = previousPara.getPreviousSibling();
	}

	// The nodes were collected bottom-up; restore document order.
	Collections.reverse(nodes);

	// Export the shapes and their caption into a new document.
	Document dstDoc = new Document();
	dstDoc.removeAllChildren();
	dstDoc.ensureMinimum();

	// One importer is enough for all nodes copied between the two documents.
	NodeImporter importer = new NodeImporter(interimdoc, dstDoc,
			ImportFormatMode.KEEP_SOURCE_FORMATTING);
	for (Node node : nodes) {
		Node newNode = importer.importNode(node, true);
		dstDoc.getFirstSection().getBody().appendChild(newNode);
	}

	// Save once, after all nodes have been appended.
	dstDoc.save("E:/data/image_" + i + ".docx");
	i++;
}

Is this what you meant in the previous post?

regards,
priyanga G

tahir.manzoor · October 25, 2017, 4:22pm

@priyanga,

Thanks for your inquiry. Yes, you can use the same approach to get the desired output.

priyanga · December 26, 2017, 6:50am

Hi @tahir.manzoor,

Thanks for your great support.

I still have an extraction problem: some of the images are not extracted. Please help me resolve this and extract those images.

Source code: src.zip (23.0 KB)

Input: list.zip (525.9 KB)

Expected output: expected output.zip (599.0 KB)

Actual output: actual output.zip (119.5 KB)

The showcases are approaching, so please kindly help me.

Thanks & regards,
priyanga G

tahir.manzoor · December 26, 2017, 4:11pm

@priyanga,

Thanks for your inquiry. Your document contains empty paragraphs between the shape and the figure caption (e.g. “Figure”). In your case, we suggest bookmarking this content and extracting it using the approach shared here:
Extract Content from a Bookmark

Document doc = new Document(MyDir + "list.docx");
DocumentBuilder builder = new DocumentBuilder(doc);
int bookmark = 1; // counter for bookmark names
int i = 1;        // counter for output file names
NodeCollection paragraphs = doc.getChildNodes(NodeType.PARAGRAPH, true);
// Find the caption paragraphs, i.e. those that contain "Figure".
for (Paragraph paragraph : (Iterable<Paragraph>) paragraphs)
{
    if (paragraph.toString(SaveFormat.TEXT).trim().contains("Figure"))
    {
        // Walk backwards over empty paragraphs and sub-legend paragraphs
        // such as "(a)" or "(b)". Any other paragraph stops the walk.
        Node previousPara = paragraph.getPreviousSibling();
        while (previousPara != null
                && previousPara.getNodeType() == NodeType.PARAGRAPH
                && !previousPara.toString(SaveFormat.TEXT).trim().contains("Figure")
                && (
                    previousPara.toString(SaveFormat.TEXT).trim().length() == 0 ||
                    previousPara.toString(SaveFormat.TEXT).trim().contains("(a)") ||
                    previousPara.toString(SaveFormat.TEXT).trim().contains("(b)") ||
                    previousPara.toString(SaveFormat.TEXT).trim().contains("(c)") ||
                    previousPara.toString(SaveFormat.TEXT).trim().contains("(d)"))
                 )
        {
            previousPara = previousPara.getPreviousSibling();
        }

        // Start the bookmark at the document start, or at the end of the
        // paragraph that stopped the walk.
        if (previousPara == null)
        {
            builder.moveToDocumentStart();
        }
        else
        {
            // Note: this cast assumes the node that stopped the walk is a paragraph.
            builder.moveToParagraph(paragraphs.indexOf((Paragraph) previousPara), -1);
        }
        builder.startBookmark("Bookmark" + bookmark);

        // End the bookmark at the start of the caption paragraph.
        builder.moveToParagraph(paragraphs.indexOf(paragraph), 0);
        builder.endBookmark("Bookmark" + bookmark);
        bookmark++;
    }
}

for (Bookmark bm : doc.getRange().getBookmarks())
{
    if(bm.getName().startsWith("Bookmark"))
    {
        ArrayList nodes =  ExtractContents.extractContent(bm.getBookmarkStart(), bm.getBookmarkEnd(), true);
        Document dstDoc = ExtractContents.generateDocument(doc, nodes);
        dstDoc.save(MyDir + "output"+i+".docx");
        i++;
    }
}
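
The four `contains("(a)")` to `contains("(d)")` checks cover only the first four sub-legends. A minimal plain-Java sketch of a more general test is shown below. It does not use the Aspose.Words API: `FigureBlockStart`, `belongsToFigure` and `stopIndex` are hypothetical names, the paragraph texts are passed as plain strings, and the regular expression (any single lowercase letter in parentheses) is an assumption about how the legends are written.

```java
import java.util.List;
import java.util.regex.Pattern;

class FigureBlockStart {
    // Assumed sub-legend form: one lowercase letter in parentheses, e.g. "(a)" or "(f)".
    private static final Pattern SUB_LEGEND = Pattern.compile("\\([a-z]\\)");

    /** True for empty paragraphs and sub-legend paragraphs that are not themselves captions. */
    static boolean belongsToFigure(String text) {
        String t = text.trim();
        return !t.contains("Figure") && (t.isEmpty() || SUB_LEGEND.matcher(t).find());
    }

    /** Index of the paragraph that stops the backward walk, or -1 if the walk reaches the start. */
    static int stopIndex(List<String> texts, int captionIndex) {
        int j = captionIndex - 1;
        while (j >= 0 && belongsToFigure(texts.get(j))) j--;
        return j;
    }
}
```

The returned index plays the role of the paragraph at whose end the bookmark starts, and -1 corresponds to starting at the document start. Like the `contains` checks, this test also accepts body text that merely mentions "(a)", so a stricter pattern may be needed for documents where that occurs.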

The output document will contain the empty paragraphs. You can remove them using the Node.remove method, according to your requirements.
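
When removing empty paragraphs, note that a paragraph holding only an image also has empty text, so the text alone is not a safe test. The plain-Java sketch below models that check without the Aspose.Words API: `Para` is a hypothetical stand-in for a paragraph and `dropEmpty` is an illustrative name. In the real document, the paragraphs it filters out are the ones that would be removed with Node.remove.

```java
import java.util.ArrayList;
import java.util.List;

class EmptyParagraphFilter {
    /** Hypothetical stand-in for a document paragraph. */
    static final class Para {
        final String text;
        final boolean hasShape;
        Para(String text, boolean hasShape) { this.text = text; this.hasShape = hasShape; }
    }

    /** Keeps paragraphs that contain a shape or non-blank text; drops the truly empty ones. */
    static List<Para> dropEmpty(List<Para> paragraphs) {
        List<Para> kept = new ArrayList<>();
        for (Para p : paragraphs) {
            if (p.hasShape || !p.text.trim().isEmpty()) kept.add(p);
        }
        return kept;
    }
}
```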