Image extraction Issue 4

Dear team,

We are extracting images from DOCX files using Aspose.Words for Java, but we are not able to extract the images from the document mentioned below. Please find our source code:

if ((paragraph.toString(SaveFormat.TEXT).toLowerCase().trim().startsWith("fig")
	|| paragraph.toString(SaveFormat.TEXT).startsWith("Scheme")
	|| paragraph.toString(SaveFormat.TEXT).startsWith("Plate")
	|| paragraph.toString(SaveFormat.TEXT).startsWith("Abb")
	|| paragraph.toString(SaveFormat.TEXT).startsWith("Abbildung"))
	// for duplicate figure caption it-15
	&& (paragraph.getNextSibling() != null
			&& !paragraph.getNextSibling().toString(SaveFormat.TEXT).trim().matches(matches)
			|| (paragraph.getNextSibling() != null
					&& paragraph.getNextSibling().getNodeType() != NodeType.TABLE
					&& paragraph.getNextSibling().toString(SaveFormat.TEXT).trim().matches(matches)
					&& (((Paragraph) paragraph.getNextSibling()).getChildNodes(NodeType.SHAPE, true)
							.getCount() > 0
							|| (paragraph.getNextSibling().getNextSibling()) != null
									&& paragraph.getNextSibling().getNextSibling()
											.getNodeType() != NodeType.TABLE
									&& ((((Paragraph) paragraph.getNextSibling().getNextSibling())
											.getChildNodes(NodeType.SHAPE, true).getCount() == 0)
											
											//this codition added by pavi-14-12-2021   for duplicate captions
											||(((Paragraph) paragraph.getNextSibling().getNextSibling())
													.getChildNodes(NodeType.SHAPE, true).getCount() > 0))))
			|| paragraph.getParentSection().getBody().getLastParagraph().getText().trim()
					.matches(matches))
	// for duplicate figure caption
	&& ((paragraph.getPreviousSibling() != null
			&& paragraph.getPreviousSibling().getNodeType() != NodeType.TABLE)
			|| paragraph.getParentSection().getBody().getFirstParagraph().getText().trim()
					.matches(matches))
	&& paragraph.getNodeType() != NodeType.TABLE
	&& paragraph.getParentNode().getNodeType() != NodeType.CELL
	&& !paragraph.toString(SaveFormat.TEXT).contains(AIE.docName)
	
	//condition added by pavi -14-12-2021
	&& (!(paragraph.toString(SaveFormat.TEXT).trim().startsWith("Figure Captions"))||
			!(paragraph.toString(SaveFormat.TEXT).trim().startsWith("Figures")))
	
       || ((paragraph.getNextSibling() == null) && (builder.getCurrentParagraph().isEndOfDocument())))
{

Input Document : davids_et_al_2022_05_Interim.docx (3.7 MB)

Please help us resolve this.

@e503824 In your case, both the image and its caption are inside a group shape. However, since you are iterating over the paragraphs in the document, the following condition, which I suggested earlier, works as expected and passes for the images in your document:

Table parentTable = (Table)para.getAncestor(NodeType.TABLE);
Node next = para.getNextSibling();
while (next != null && !next.isComposite())
    next = next.getNextSibling();

Node prev = para.getPreviousSibling();
while (prev != null && !prev.isComposite())
    prev = prev.getPreviousSibling();

CompositeNode nextNode = (CompositeNode)next;
CompositeNode prevNode = (CompositeNode)prev;
if ((para != null && para.getChildNodes(NodeType.SHAPE, true).getCount() > 0)
        || (nextNode != null && nextNode.getChildNodes(NodeType.SHAPE, true).getCount() > 0)
        || (prevNode != null && prevNode.getChildNodes(NodeType.SHAPE, true).getCount() > 0)
        || (parentTable != null && parentTable.getChildNodes(NodeType.SHAPE, true).getCount() > 0))
{
    System.out.println(paraText);
}

In your case, the following part of the condition passes: (para != null && para.getChildNodes(NodeType.SHAPE, true).getCount() > 0).
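
As a side note on the original condition (a sketch on plain strings, not something verified against your document): the clause that is meant to exclude the "Figure Captions" and "Figures" headings joins two negated startsWith checks with ||. No text can start with both prefixes at once, so at least one negation is always true and the clause never rejects anything. If the intent was to skip both headings, && is presumably what was meant. The class and method names below are hypothetical, and the third sample string is invented for illustration.

```java
// Hypothetical demo class; it uses plain strings only, no Aspose types.
final class HeadingFilterDemo {

    // The clause as written in the original condition: !A || !B.
    static boolean originalClause(String text) {
        String t = text.trim();
        return !t.startsWith("Figure Captions") || !t.startsWith("Figures");
    }

    // Assumed intent: reject the paragraph if it starts with either heading.
    static boolean intendedClause(String text) {
        String t = text.trim();
        return !t.startsWith("Figure Captions") && !t.startsWith("Figures");
    }

    public static void main(String[] args) {
        // The last entry is an invented caption used only for illustration.
        String[] samples = {
            "Figure Captions",
            "Figures",
            "Figure 1. Map of the study area"
        };
        for (String s : samples) {
            System.out.println(s
                    + " -> original=" + originalClause(s)
                    + ", intended=" + intendedClause(s));
        }
    }
}
```

With these samples, originalClause returns true for all three strings, while intendedClause returns true only for the invented caption.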

Dear team,

We tried this, but it is not working. Please find our updated source code below:

Table parentTable = (Table)paragraph.getAncestor(NodeType.TABLE);
		Node next = paragraph.getNextSibling();
		while (next != null && !next.isComposite())
		    next = next.getNextSibling();

		Node prev = paragraph.getPreviousSibling();
		while (prev != null && !prev.isComposite())
		    prev = prev.getPreviousSibling();

		CompositeNode nextNode = (CompositeNode)next;
		CompositeNode prevNode = (CompositeNode)prev;
		
		if ((paragraph.toString(SaveFormat.TEXT).toLowerCase().trim().startsWith("fig")
				|| paragraph.toString(SaveFormat.TEXT).startsWith("Scheme")
				|| paragraph.toString(SaveFormat.TEXT).startsWith("Plate")
				|| paragraph.toString(SaveFormat.TEXT).startsWith("Abb")
				|| paragraph.toString(SaveFormat.TEXT).startsWith("Abbildung"))
				// for duplicate figure caption it-15
				&& (paragraph.getNextSibling() != null
						&& !paragraph.getNextSibling().toString(SaveFormat.TEXT).trim().matches(matches)
						|| (paragraph.getNextSibling() != null
								&& paragraph.getNextSibling().getNodeType() != NodeType.TABLE
								&& paragraph.getNextSibling().toString(SaveFormat.TEXT).trim().matches(matches)
								&& (((Paragraph) paragraph.getNextSibling()).getChildNodes(NodeType.SHAPE, true)
										.getCount() > 0
										|| (paragraph.getNextSibling().getNextSibling()) != null
												&& paragraph.getNextSibling().getNextSibling()
														.getNodeType() != NodeType.TABLE
												&& ((((Paragraph) paragraph.getNextSibling().getNextSibling())
														.getChildNodes(NodeType.SHAPE, true).getCount() == 0)
														
														//this codition added by pavi-14-12-2021   for duplicate captions
														||(((Paragraph) paragraph.getNextSibling().getNextSibling())
																.getChildNodes(NodeType.SHAPE, true).getCount() > 0))))
						|| paragraph.getParentSection().getBody().getLastParagraph().getText().trim()
								.matches(matches))
				// for duplicate figure caption
				&& ((paragraph.getPreviousSibling() != null
				&& paragraph.getPreviousSibling().getNodeType() != NodeType.TABLE)
					|| paragraph.getParentSection().getBody().getFirstParagraph().getText().trim()
								.matches(matches))
				&& paragraph.getNodeType() != NodeType.TABLE
				&& paragraph.getParentNode().getNodeType() != NodeType.CELL
				&& !paragraph.toString(SaveFormat.TEXT).contains(AIE.docName)
				&&(paragraph != null && paragraph.getChildNodes(NodeType.SHAPE, true).getCount() > 0)
				  || (paragraph != null && nextNode.getChildNodes(NodeType.SHAPE, true).getCount() > 0)
				  || (prevNode != null && prevNode.getChildNodes(NodeType.SHAPE, true).getCount() > 0)
		          || (parentTable != null && parentTable.getChildNodes(NodeType.SHAPE, true).getCount() > 0)
				
				//condition added by pavi -14-12-2021
				&& (!(paragraph.toString(SaveFormat.TEXT).trim().startsWith("Figure Captions"))||
						!(paragraph.toString(SaveFormat.TEXT).trim().startsWith("Figures")))
				
		        || ((paragraph.getNextSibling() == null) && (builder.getCurrentParagraph().isEndOfDocument())))
		{

@e503824 You have simply appended the new conditions to your existing ones, but you should replace your conditions with the ones I suggested. As I already mentioned, you have too many conditions in a single if statement, which is a very error-prone practice. You should refactor your code to make it easier to debug and maintain. Here is code that works on my side (the same as yours, but slightly refactored):

if ((paraText.startsWith(FIG) || paraText.startsWith(SCHEME) || paraText.startsWith(PLATE)))
{
    if (!paraText.startsWith("Figure Captions") && !(paraText.startsWith("Figures and captions")))
    {
        try
        {
            // Enclosing table, if the paragraph sits inside one (null otherwise).
            Table parentTable = (Table)para.getAncestor(NodeType.TABLE);
            // Nearest following sibling that is a composite node.
            Node next = para.getNextSibling();
            while (next != null && !next.isComposite())
                next = next.getNextSibling();

            // Nearest preceding sibling that is a composite node.
            Node prev = para.getPreviousSibling();
            while (prev != null && !prev.isComposite())
                prev = prev.getPreviousSibling();

            CompositeNode nextNode = (CompositeNode)next;
            CompositeNode prevNode = (CompositeNode)prev;
            if ((para != null && para.getChildNodes(NodeType.SHAPE, true).getCount() > 0)
                    || (nextNode != null && nextNode.getChildNodes(NodeType.SHAPE, true).getCount() > 0)
                    || (prevNode != null && prevNode.getChildNodes(NodeType.SHAPE, true).getCount() > 0)
                    || (parentTable != null && parentTable.getChildNodes(NodeType.SHAPE, true).getCount() > 0))
            {
                System.out.println(paraText);
            }

        }
        catch (NullPointerException e)
        {
            System.out.println("Exception " + e.getMessage());
        }

    }
}
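
One likely reason the combined version misbehaves, shown as a minimal sketch with plain booleans rather than Aspose nodes: in Java, && binds tighter than ||, so shape checks appended with || at the top level of a long condition become independent alternatives instead of extra requirements on a caption paragraph. The names below (PrecedenceDemo, isCaption, paraHasShape, nextHasShape) are hypothetical stand-ins for the real checks.

```java
// Hypothetical demo class; the booleans stand in for the real Aspose checks.
final class PrecedenceDemo {

    // Shape checks appended without grouping.
    // Java parses this as (isCaption && paraHasShape) || nextHasShape.
    static boolean ungrouped(boolean isCaption, boolean paraHasShape, boolean nextHasShape) {
        return isCaption && paraHasShape || nextHasShape;
    }

    // Shape checks grouped so that they only refine caption paragraphs.
    static boolean grouped(boolean isCaption, boolean paraHasShape, boolean nextHasShape) {
        return isCaption && (paraHasShape || nextHasShape);
    }

    public static void main(String[] args) {
        // A non-caption paragraph that happens to sit just before a picture.
        boolean isCaption = false;
        boolean paraHasShape = false;
        boolean nextHasShape = true;

        System.out.println("ungrouped: " + ungrouped(isCaption, paraHasShape, nextHasShape));
        System.out.println("grouped:   " + grouped(isCaption, paraHasShape, nextHasShape));
    }
}
```

For this input the ungrouped form returns true and the grouped form returns false, so a non-caption paragraph next to a picture slips through the ungrouped condition. Nesting the checks, as in the refactored code above, avoids the problem because each if holds only one kind of test. Also note that in the combined code the second shape check tests paragraph != null before calling nextNode.getChildNodes(...), so it throws a NullPointerException when there is no following composite sibling; the suggested condition tests nextNode != null at that point.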