Image extraction issue 5

e503824 · May 9, 2022, 7:20am

Dear team,

We are extracting images with Aspose for Java, but some documents contain charts. How can we extract the charts using Aspose? Please find our source code and a sample document below for your reference.

if ((paragraph.toString(SaveFormat.TEXT).toLowerCase().trim().startsWith("fig")
	|| paragraph.toString(SaveFormat.TEXT).startsWith("Scheme")
	|| paragraph.toString(SaveFormat.TEXT).startsWith("Plate")
	|| paragraph.toString(SaveFormat.TEXT).startsWith("Abb")
	|| paragraph.toString(SaveFormat.TEXT).startsWith("Abbildung"))
	// for duplicate figure caption it-15
	&& (paragraph.getNextSibling() != null
			&& !paragraph.getNextSibling().toString(SaveFormat.TEXT).trim().matches(matches)
			|| (paragraph.getNextSibling() != null
					&& paragraph.getNextSibling().getNodeType() != NodeType.TABLE
					&& paragraph.getNextSibling().toString(SaveFormat.TEXT).trim().matches(matches)
					&& (((Paragraph) paragraph.getNextSibling()).getChildNodes(NodeType.SHAPE, true)
							.getCount() > 0
							|| (paragraph.getNextSibling().getNextSibling()) != null
									&& paragraph.getNextSibling().getNextSibling()
											.getNodeType() != NodeType.TABLE
									&& ((((Paragraph) paragraph.getNextSibling().getNextSibling())
											.getChildNodes(NodeType.SHAPE, true).getCount() == 0)
											
											//this condition added by pavi-14-12-2021   for duplicate captions
											||(((Paragraph) paragraph.getNextSibling().getNextSibling())
													.getChildNodes(NodeType.SHAPE, true).getCount() > 0))))
			|| paragraph.getParentSection().getBody().getLastParagraph().getText().trim()
					.matches(matches))
	// for duplicate figure caption
	&& ((paragraph.getPreviousSibling() != null
			&& paragraph.getPreviousSibling().getNodeType() != NodeType.TABLE)
			|| paragraph.getParentSection().getBody().getFirstParagraph().getText().trim()
					.matches(matches))
	&& paragraph.getNodeType() != NodeType.TABLE
	&& paragraph.getParentNode().getNodeType() != NodeType.CELL
	&& !paragraph.toString(SaveFormat.TEXT).contains(AIE.docName)
					
	//condition added by pavi -14-12-2021
	// use && here: with || the check is always true, because no text starts with both prefixes
	&& (!(paragraph.toString(SaveFormat.TEXT).trim().startsWith("Figure Captions"))
			&& !(paragraph.toString(SaveFormat.TEXT).trim().startsWith("Figures")))
				
       || ((paragraph.getNextSibling() == null) && (builder.getCurrentParagraph().isEndOfDocument())))
{

input : revised manuscript clean version.DOCX (1.4 MB)

alexey.noskov · May 9, 2022, 2:59pm

@e503824 The problem occurs because some of the shapes in your document are in a different section than their captions. Please see the attached screenshot:

In this case you cannot use Node.getNextSibling and Node.getPreviousSibling because, as the screenshot shows, the paragraph with the image caption has no sibling nodes. Instead, you can try Node.nextPreOrder and Node.previousPreOrder to traverse the document tree structure.
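To illustrate why sibling navigation stops at the section boundary while pre-order traversal does not, here is a minimal self-contained sketch. It assumes a hypothetical SimpleNode class as a stand-in for the document tree; it is not the Aspose.Words API, and the node names and the findNextShape helper are made up for the example. The same idea should carry over to Node.nextPreOrder on a real document.

```java
import java.util.ArrayList;
import java.util.List;

class PreOrderDemo {

    /** Hypothetical stand-in for a document tree node; not an Aspose.Words class. */
    static final class SimpleNode {
        final String name;
        SimpleNode parent;
        final List<SimpleNode> children = new ArrayList<>();

        SimpleNode(String name) { this.name = name; }

        /** Appends a child and returns this node, so calls can be chained. */
        SimpleNode add(SimpleNode child) {
            child.parent = this;
            children.add(child);
            return this;
        }

        /** Next node under the same parent, or null if this is the last child. */
        SimpleNode nextSibling() {
            if (parent == null) return null;
            int i = parent.children.indexOf(this);
            return i + 1 < parent.children.size() ? parent.children.get(i + 1) : null;
        }

        /** Next node in pre-order (document order) within root, or null at the end. */
        SimpleNode nextPreOrder(SimpleNode root) {
            if (!children.isEmpty()) return children.get(0);
            // No children: climb towards the root until some ancestor has a next sibling.
            for (SimpleNode n = this; n != null && n != root; n = n.parent) {
                SimpleNode sibling = n.nextSibling();
                if (sibling != null) return sibling;
            }
            return null;
        }
    }

    /** Walks forward in document order from the caption until a node named "Shape" is found. */
    static SimpleNode findNextShape(SimpleNode caption, SimpleNode root) {
        for (SimpleNode n = caption.nextPreOrder(root); n != null; n = n.nextPreOrder(root)) {
            if (n.name.equals("Shape")) return n;
        }
        return null;
    }

    /** Builds a tree where the caption and the shape are in different sections. */
    static SimpleNode[] buildSample() {
        SimpleNode caption = new SimpleNode("Caption");
        SimpleNode shape = new SimpleNode("Shape");
        SimpleNode section1 = new SimpleNode("Section1")
                .add(new SimpleNode("Body1").add(caption));
        SimpleNode section2 = new SimpleNode("Section2")
                .add(new SimpleNode("Body2").add(new SimpleNode("Paragraph").add(shape)));
        SimpleNode doc = new SimpleNode("Document").add(section1).add(section2);
        return new SimpleNode[] {doc, caption, shape};
    }

    public static void main(String[] args) {
        SimpleNode[] sample = buildSample();
        SimpleNode doc = sample[0];
        SimpleNode caption = sample[1];
        System.out.println(caption.nextSibling());              // prints "null"
        System.out.println(findNextShape(caption, doc).name);   // prints "Shape"
    }
}
```

In this sketch the caption is the only paragraph in its section, so nextSibling() has nothing to return, while the pre-order walk climbs out of the first section and finds the shape in the second one.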

You can use the code suggested in our documentation to extract images from your documents.

e503824 · May 10, 2022, 4:21am

Dear team,

Please share source code that we can use to update ours.

alexey.noskov · May 10, 2022, 2:23pm

@e503824 As I already mentioned, your code is too complicated and hard to handle. But if you would still like to continue with it, you can try removing section breaks before image extraction. You can use code like the following to concatenate all sections into one. It is also required to remove empty paragraphs from the document:

// Requires: import com.aspose.words.*;
Document doc = new Document("C:\\Temp\\in.docx");

// Merge all sections into one.
while (doc.getSections().getCount() > 1)
{
    int lastSectionIndex = doc.getSections().getCount() - 1;
    // Move the content of the second-to-last section to the start of the last one, then drop it.
    doc.getLastSection().prependContent(doc.getSections().get(lastSectionIndex - 1));
    doc.getSections().get(lastSectionIndex - 1).remove();
}

// Now remove all empty paragraphs.
// Iterate over a snapshot (toArray) so that removing nodes does not disturb the live collection.
for (Node node : doc.getChildNodes(NodeType.PARAGRAPH, true).toArray())
{
    Paragraph p = (Paragraph) node;
    if (!p.hasChildNodes())
        p.remove();
}

// Here is your code to extract images
// ...........

Alternatively, you can use the approach used by your colleague.