Find image and Image extraction issue

e503824 · May 20, 2022, 11:31am

Dear team,

We are finding and extracting images from DOCX files, but in the case below we are not able to find and extract the images. Please find the source code below for your reference.

Source code to find all images:

private static void findAllfigures(Document initDoc, String nameAppend) throws NullPointerException {
	String matches = "Fig.*(?:[ \\r\\n\\t].*)+|Scheme.*|Plate.*|Abbildung.*|Fig.*(?:[ \\r\\n\\t]*)+|FIG.*(?:[ \\r\\n\\t].*)+";

	try {
		for (Paragraph para : (Iterable<Paragraph>) initDoc.getChildNodes(NodeType.PARAGRAPH, true)) {
			if ((para.getText().trim().startsWith(FIG) || para.getText().trim().startsWith(CFIG)
					|| para.getText().trim().startsWith(SCHEME) || para.getText().trim().startsWith(PLATE))
					&& !AIE.supplymentryCheck(para.toString(SaveFormat.TEXT).trim())) {
				if (!(para.toString(SaveFormat.TEXT).trim().startsWith("Figure Captions"))
						&& (!(para.toString(SaveFormat.TEXT).trim().startsWith("Figures and captions")))) {
					try {
						Table parentTable = (Table) para.getAncestor(NodeType.TABLE);
						if ((((Paragraph) para.getNextSibling()).getChildNodes(NodeType.SHAPE, true).getCount() > 0)
								|| (((Paragraph) para.getPreviousSibling()).getChildNodes(NodeType.SHAPE, true).getCount() > 0)
								&& parentTable != null && parentTable.getChildNodes(NodeType.SHAPE, true).getCount() > 0) {
							String allFignames = formatImgcaption(para.toString(SaveFormat.TEXT).trim(), nameAppend);
							allimages.add(allFignames);
						}
					} catch (NullPointerException e) {
						logger.info("Exception ", e.getMessage());
//						e.printStackTrace();
					}
				}
			}
		}
		initDoc.save(interim);
	} catch (Exception e) {
		logger.info("Exception ", e.getMessage());
//		e.printStackTrace();
	}
}
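
Two things in the method above may be worth checking. The sibling test mixes || and && without grouping, and Java gives && higher precedence, so the table conditions apply only to the previous-sibling branch. The casts to Paragraph also assume that both neighbours are paragraphs. The following is a minimal sketch on plain Java objects. Node, ParagraphNode and TableNode are hypothetical stand-ins, not Aspose.Words types, and whether the grouped form is what was intended is an assumption.

```java
class SiblingCheckSketch {
    // Hypothetical stand-ins for document nodes; these are not Aspose.Words types.
    static class Node {}

    static class ParagraphNode extends Node {
        final int shapeCount;

        ParagraphNode(int shapeCount) {
            this.shapeCount = shapeCount;
        }
    }

    static class TableNode extends Node {}

    // True only for a paragraph holding at least one shape. A null sibling or a
    // non-paragraph sibling (for example a table) yields false instead of throwing.
    static boolean hasShapes(Node node) {
        return node instanceof ParagraphNode && ((ParagraphNode) node).shapeCount > 0;
    }

    // How Java parses "next || prev && inTable": && binds tighter than ||.
    static boolean ungrouped(boolean next, boolean prev, boolean inTable) {
        return next || prev && inTable;
    }

    // Explicitly grouped alternative (assumed intent): either neighbour has a
    // shape, and the table condition applies to both branches.
    static boolean grouped(boolean next, boolean prev, boolean inTable) {
        return (next || prev) && inTable;
    }

    public static void main(String[] args) {
        boolean next = hasShapes(new ParagraphNode(1)); // true
        boolean prev = hasShapes(new TableNode());      // false, no ClassCastException
        boolean inTable = false;

        System.out.println(ungrouped(next, prev, inTable)); // prints true
        System.out.println(grouped(next, prev, inTable));   // prints false
    }
}
```

With the instanceof guard, a caption whose neighbour is a table or is missing is simply skipped, rather than raising an exception that ends the loop early.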

Extraction conditions:

if ((paragraph.toString(SaveFormat.TEXT).toLowerCase().trim().startsWith("fig")
					|| paragraph.toString(SaveFormat.TEXT).startsWith("Scheme")
					|| paragraph.toString(SaveFormat.TEXT).startsWith("Plate")
					|| paragraph.toString(SaveFormat.TEXT).startsWith("Abb")
					|| paragraph.toString(SaveFormat.TEXT).startsWith("Abbildung"))
					// for duplicate figure caption it-15
					&& (paragraph.getNextSibling() != null
							&& !paragraph.getNextSibling().toString(SaveFormat.TEXT).trim().matches(matches)
							|| (paragraph.getNextSibling() != null
									&& paragraph.getNextSibling().getNodeType() != NodeType.TABLE
									&& paragraph.getNextSibling().toString(SaveFormat.TEXT).trim().matches(matches)
									&& (((Paragraph) paragraph.getNextSibling()).getChildNodes(NodeType.SHAPE, true)
											.getCount() > 0
											|| (paragraph.getNextSibling().getNextSibling()) != null
													&& paragraph.getNextSibling().getNextSibling()
															.getNodeType() != NodeType.TABLE
													&& ((((Paragraph) paragraph.getNextSibling().getNextSibling())
															.getChildNodes(NodeType.SHAPE, true).getCount() == 0)
															
															// this condition added by pavi-14-12-2021 for duplicate captions
															||(((Paragraph) paragraph.getNextSibling().getNextSibling())
																	.getChildNodes(NodeType.SHAPE, true).getCount() > 0))))
							|| paragraph.getParentSection().getBody().getLastParagraph().getText().trim()
									.matches(matches))
					// for duplicate figure caption
					&& ((paragraph.getPreviousSibling() != null
							&& paragraph.getPreviousSibling().getNodeType() != NodeType.TABLE)
							|| paragraph.getParentSection().getBody().getFirstParagraph().getText().trim()
									.matches(matches))
					&& paragraph.getNodeType() != NodeType.TABLE
					&& paragraph.getParentNode().getNodeType() != NodeType.CELL
					&& !paragraph.toString(SaveFormat.TEXT).contains(AIE.docName)
					
					//condition added by pavi -14-12-2021
					&& (!(paragraph.toString(SaveFormat.TEXT).trim().startsWith("Figure Captions"))||
							!(paragraph.toString(SaveFormat.TEXT).trim().startsWith("Figures"))))
					
			        //|| ((paragraph.getNextSibling() == null) && (builder.getCurrentParagraph().isEndOfDocument()))
			        
					
			{
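
Two details in the condition above may be worth checking. Only the "fig" test trims and lowercases the text, so a leading space defeats the other prefix tests. The last clause joins two negated startsWith calls with ||, which is true for every paragraph because no text starts with both "Figure Captions" and "Figures". Below is a minimal sketch of the prefix test on plain strings, assuming the intent is to accept the caption prefixes and reject those two headings. isCaption is a hypothetical helper, not part of Aspose.Words.

```java
import java.util.Locale;

class CaptionCheckSketch {
    // Hypothetical helper: decides whether a paragraph's text looks like a
    // figure caption. The prefixes and headings are taken from the condition above.
    static boolean isCaption(String paragraphText) {
        // Trim once so that every prefix test sees the same text.
        String text = paragraphText.trim();

        boolean hasCaptionPrefix = text.toLowerCase(Locale.ROOT).startsWith("fig")
                || text.startsWith("Scheme")
                || text.startsWith("Plate")
                || text.startsWith("Abb"); // also covers "Abbildung"

        // Assumed intent: skip the paragraph if it is either heading,
        // i.e. !(a || b), which is the same as !a && !b.
        boolean isHeading = text.startsWith("Figure Captions")
                || text.startsWith("Figures");

        return hasCaptionPrefix && !isHeading;
    }

    public static void main(String[] args) {
        System.out.println(isCaption("Fig. 1 Overview"));         // prints true
        System.out.println(isCaption("  Scheme 2. Synthesis\r")); // prints true
        System.out.println(isCaption("Figure Captions"));         // prints false
        System.out.println(isCaption("Table 1"));                 // prints false
    }
}
```

Note that rejecting every paragraph starting with "Figures" would also drop a caption such as "Figures 1 and 2", so the heading list may need to be more specific for real manuscripts.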

Input docx : Revised Manuscript_58 (Clean version).docx (5.2 MB)

alexey.noskov · May 20, 2022, 2:56pm

@e503824 The image extraction method I suggested in your other thread works perfectly for this case:
https://forum.aspose.com/t/table-image-extraction-issue/246418/8