Fig caption for labeled images

priyanga · October 21, 2017, 7:00am

Hi Team,

I am able to extract the next sibling of the labeled images, such as (a) and (b), but I want to get the fig caption for those labeled images. For example, fig caption 1 has three images: (a), (b), and (c). The images are extracted separately. Please let me know how to extract the whole fig caption 1. I have attached the code.

DocumentBuilder builder = new DocumentBuilder(interimdoc);
i = 1;
NodeCollection shapes = interimdoc.getChildNodes(NodeType.SHAPE, true);
for (Shape shape : (Iterable<Shape>)shapes)
{
	if (shape.hasChart() || shape.hasImage())
	{
		Paragraph paragraph = shape.getParentParagraph();

		// The paragraph after the shape's paragraph is expected to hold the label or caption.
		Node node = paragraph.getNextSibling();
		// Modify this condition according to your requirement
		if (node != null && node.getNodeType() == NodeType.PARAGRAPH
				&& (
				((Paragraph)node).isListItem() || node.toString(SaveFormat.TEXT).contains("Figure")
						|| node.toString(SaveFormat.TEXT).contains("(a)")
						|| node.toString(SaveFormat.TEXT).contains("(b)")
						|| node.toString(SaveFormat.TEXT).contains("(c)")
				))
		{
			Document dstDoc = new Document();

			NodeImporter importer = new NodeImporter(interimdoc, dstDoc, ImportFormatMode.KEEP_SOURCE_FORMATTING);
			Node newNode = importer.importNode(shape, true);
			dstDoc.getFirstSection().getBody().getFirstParagraph().appendChild(newNode);
			if (dstDoc.getFirstSection().getBody().getFirstParagraph().toString(SaveFormat.TEXT).trim()
					.length() == 0)
				dstDoc.getFirstSection().getBody().getFirstParagraph().remove();
			/** OUTPUT FILENAME START **/
			String Imgcaption = paragraph.toString(SaveFormat.TEXT);

			int k = 0;
			while (k < Imgcaption.length() && !Character.isDigit(Imgcaption.charAt(k)))
				k++;
			int j = k;
			while (j < Imgcaption.length() && Character.isDigit(Imgcaption.charAt(j)))
				j++;

			int l = Integer.parseInt(Imgcaption.substring(k, j));
			//	int l = Integer.parseInt(Imgcaption);

			strI = Integer.toString(l);
			Pattern pattern = Pattern.compile(strI);
			Matcher matcher = pattern.matcher(Imgcaption);
			while (matcher.find())
			{
				name = Imgcaption.substring(0, matcher.end());
				name = name.replace(".", "_");
			}
			if (name.startsWith("Fig"))
			{
				name = "Fig" + "_" + l;
			}

			/** OUTPUT FILENAME END **/
			filename = filefoldername + page + "_" + "Fig_" + i + "_" + "Fig_label" + name + ".docx";
			dstDoc.save(filename);
			i++;
		}

	}
}

} catch (NumberFormatException e) {
	// something went wrong
	e.printStackTrace();
}

I look forward to your quick reply.
Many thanks in advance.
Thanks & Regards,
priyanga G

tilal.ahmad · October 21, 2017, 3:06pm

@priyanga

Thanks for your inquiry. Please share your input, existing output, and expected output documents here as a ZIP file. We will look into them and guide you accordingly.

priyanga · October 23, 2017, 5:10am

Hi @tilal.ahmad,

Thanks for your timely reply. Can I expect a solution soon? Our showcase is nearing…

The input document is Wave Propagation(004).zip (1.1 MB)
The existing output Wave Propagation(2).zip (1.0 MB)
The expected output expected output.zip (1.1 MB)

Thanks & regards,
priyanga G

tilal.ahmad · October 23, 2017, 2:44pm

@priyanga

Thanks for sharing the resources. However, I am afraid the Wave Propagation(2).zip and expected output.zip files contain the same results.
page13_Fig_1_Fig_3.zip (160.1 KB)

As per my understanding, you want to extract each image together with its related caption. Please find the updated code snippet below. Hopefully it will help you accomplish the task.

import com.aspose.words.*;
import java.util.ArrayList;
import java.util.Collections;

Document interimdoc = new Document("Wave Propagation(004)_page13.docx");
int i = 1;

// Find the caption paragraphs, i.e. those that start with "Fig".
for (Paragraph paragraph : (Iterable<Paragraph>) interimdoc
        .getChildNodes(NodeType.PARAGRAPH, true))
{
    if (paragraph.toString(SaveFormat.TEXT).trim().startsWith("Fig"))
    {
        // Start with the caption so it is exported together with its images.
        ArrayList<Node> nodes = new ArrayList<Node>();
        nodes.add(paragraph);

        // Walk backwards over the preceding paragraphs that contain shapes
        // but no text.
        Node previousPara = paragraph.getPreviousSibling();
        while (previousPara != null
                && previousPara.getNodeType() == NodeType.PARAGRAPH
                && previousPara.toString(SaveFormat.TEXT).trim().length() == 0
                && ((Paragraph) previousPara).getChildNodes(
                        NodeType.SHAPE, true).getCount() > 0)
        {
            nodes.add(previousPara);
            previousPara = previousPara.getPreviousSibling();
        }

        // The nodes were collected bottom-up; restore document order.
        Collections.reverse(nodes);

        // Export the consecutive shapes and their caption into a new document.
        Document dstDoc = new Document();
        dstDoc.removeAllChildren();
        dstDoc.ensureMinimum();

        NodeImporter importer = new NodeImporter(interimdoc, dstDoc,
                ImportFormatMode.KEEP_SOURCE_FORMATTING);
        for (Node para : nodes)
        {
            Node newNode = importer.importNode(para, true);
            dstDoc.getFirstSection().getBody().appendChild(newNode);
        }

        // Save once, after all paragraphs of this figure have been appended.
        dstDoc.save("E:/data/image_" + i + ".docx");
        i++;
    }
}

priyanga · October 24, 2017, 4:13am

Hi @tilal,
Thank you very much.

It's working fine, but Fig 6 comes out empty. We were already facing the same problem. Please help me resolve it.

regards,
priyanga G

tilal.ahmad · October 24, 2017, 2:59pm

@priyanga

Thanks for your feedback. Please note that the while condition fails for Fig 6 because the image's parent paragraph contains some text runs. You can remove the following condition from the while loop; that should resolve the issue.

&& previousPara.toString(SaveFormat.TEXT).trim().length() == 0
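
To illustrate why the text check drops such images, here is a minimal, library-free sketch of the backward walk. It is only a model under stated assumptions: `Para`, `collect`, and the sample paragraphs are hypothetical stand-ins, not Aspose.Words types and not content from the attached document.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

final class CaptionGrouping {
    /** Hypothetical stand-in for a paragraph: its visible text and whether it holds a shape. */
    static final class Para {
        final String text;
        final boolean hasShape;

        Para(String text, boolean hasShape) {
            this.text = text;
            this.hasShape = hasShape;
        }
    }

    /**
     * Collects the caption at captionIndex plus the shape paragraphs directly above it.
     * requireEmptyText = true models the while condition with the text check;
     * false models the same loop with that check removed.
     */
    static List<Para> collect(List<Para> body, int captionIndex, boolean requireEmptyText) {
        List<Para> group = new ArrayList<>();
        group.add(body.get(captionIndex));
        for (int k = captionIndex - 1; k >= 0; k--) {
            Para prev = body.get(k);
            boolean textOk = !requireEmptyText || prev.text.trim().isEmpty();
            if (!prev.hasShape || !textOk) {
                break; // stop at the first paragraph that does not belong to the figure
            }
            group.add(prev);
        }
        Collections.reverse(group); // restore top-to-bottom order
        return group;
    }

    public static void main(String[] args) {
        // Made-up body: an image paragraph that also carries a text run, then a caption.
        List<Para> body = List.of(
                new Para("Body text.", false),
                new Para("(a)", true),
                new Para("Fig 6 sample caption", false));

        System.out.println(collect(body, 2, true).size());  // with the text check
        System.out.println(collect(body, 2, false).size()); // without the text check
    }
}
```

With the text check in place, the image paragraph is skipped because it is not empty, so only the caption is collected. Without the check, the image paragraph is kept as long as it contains a shape.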

Furthermore, please check the Document Explorer example; it is very useful. It will help you understand the document object model (DOM) of a document and tune your code accordingly.

priyanga · October 25, 2017, 6:15am

Hi @tilal.ahmad,

Aspose always gives good solutions.

Thank you very much.

tilal.ahmad · October 25, 2017, 2:34pm

A post was split to a new topic: Extracting Shapes and Group Shapes from a Word document