How to remove math formulas

e503824 · May 26, 2022, 5:18am

Dear team,

We are extracting images from a DOCX file, but in this case the formulas are extracted as well. We need to skip these formulas. Please find the source code below.

Source code :

if ((paragraph.toString(SaveFormat.TEXT).toLowerCase().trim().startsWith("fig")
				|| paragraph.toString(SaveFormat.TEXT).startsWith("Scheme")
				|| paragraph.toString(SaveFormat.TEXT).startsWith("Plate")
				|| paragraph.toString(SaveFormat.TEXT).startsWith("Abb")
				|| paragraph.toString(SaveFormat.TEXT).startsWith("Abbildung"))
				// for duplicate figure caption it-15
				&& (paragraph.getNextSibling() != null
						&& !paragraph.getNextSibling().toString(SaveFormat.TEXT).trim().matches(matches)
						|| (paragraph.getNextSibling() != null
								&& paragraph.getNextSibling().getNodeType() != NodeType.TABLE
								&& paragraph.getNextSibling().toString(SaveFormat.TEXT).trim().matches(matches)
								&& (((Paragraph) paragraph.getNextSibling()).getChildNodes(NodeType.SHAPE, true)
										.getCount() > 0
										|| (paragraph.getNextSibling().getNextSibling()) != null
												&& paragraph.getNextSibling().getNextSibling()
														.getNodeType() != NodeType.TABLE
												&& ((((Paragraph) paragraph.getNextSibling().getNextSibling())
														.getChildNodes(NodeType.SHAPE, true).getCount() == 0)
														
														// this condition added by pavi 14-12-2021 for duplicate captions
														||(((Paragraph) paragraph.getNextSibling().getNextSibling())
																.getChildNodes(NodeType.SHAPE, true).getCount() > 0))))
						|| paragraph.getParentSection().getBody().getLastParagraph().getText().trim()
								.matches(matches))
				// for duplicate figure caption
				&& ((paragraph.getPreviousSibling() != null
						&& paragraph.getPreviousSibling().getNodeType() != NodeType.TABLE)
						|| paragraph.getParentSection().getBody().getFirstParagraph().getText().trim()
								.matches(matches))
				&& paragraph.getNodeType() != NodeType.TABLE
				&& paragraph.getParentNode().getNodeType() != NodeType.CELL
				&& !paragraph.toString(SaveFormat.TEXT).contains(AIE.docName)
				
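				// condition added by pavi 14-12-2021: skip the "Figure Captions" and "Figures" headings
				// (both checks must hold, so they are joined with &&; with || the test is always true)
				&& (!(paragraph.toString(SaveFormat.TEXT).trim().startsWith("Figure Captions"))
						&& !(paragraph.toString(SaveFormat.TEXT).trim().startsWith("Figures"))))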
				
		        //|| ((paragraph.getNextSibling() == null) && (builder.getCurrentParagraph().isEndOfDocument()))
		        
				
		{

Input and output documents: Fig0015.pdf (11.3 KB)
Fig0015A.pdf (8.2 KB)
new.docx (311.4 KB)
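
The prefix checks at the start of the condition above could be pulled into a small helper, which makes the caption test easier to read and to verify on its own. The following is only a sketch: the class name `CaptionPrefix` and the method `isFigureCaption` are hypothetical, and it assumes that all prefixes may be matched case-insensitively after trimming, whereas the original condition does that only for "fig".

```java
import java.util.List;
import java.util.Locale;

// Hypothetical helper; not part of Aspose.Words or of the original code.
final class CaptionPrefix {
    // Prefixes taken from the condition in the post ("Abbildung" is already covered by "abb").
    private static final List<String> PREFIXES = List.of("fig", "scheme", "plate", "abb");
    // Headings the post wants to exclude even though they start with "fig".
    private static final List<String> EXCLUDED = List.of("figure captions", "figures");

    static boolean isFigureCaption(String paragraphText) {
        if (paragraphText == null) {
            return false;
        }
        // Assumption: trimming and lower-casing are acceptable for every prefix.
        String text = paragraphText.trim().toLowerCase(Locale.ROOT);
        for (String excluded : EXCLUDED) {
            if (text.startsWith(excluded)) {
                return false;
            }
        }
        for (String prefix : PREFIXES) {
            if (text.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }
}
```

With a helper like this, the first group of the `if` could become a single call such as `CaptionPrefix.isFigureCaption(paragraph.toString(SaveFormat.TEXT))`, leaving the sibling and table checks unchanged.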

alexey.noskov · May 26, 2022, 4:06pm

@e503824 I already answered this question in another of your posts.
The code I provided in that post properly extracts images from your document and skips the formulas.
Also, there is no need to create a new post for the same problem each time; doing so makes it difficult to refer back to a solution that has already been provided.
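
The code from the other post is not reproduced in this thread, so the following is not that solution. It is only a minimal sketch of one possible filter, assuming the formulas in the document are embedded as OLE equation objects (ProgIDs such as `Equation.3` or `Equation.DSMT4`). The class `EquationFilter` and the method `isEquationProgId` are hypothetical names. In Aspose.Words the ProgID would presumably come from `shape.getOleFormat().getProgId()` when `getOleFormat()` is not null. Formulas stored as Office Math are separate nodes rather than shapes, so this check would not apply to them.

```java
import java.util.Locale;

// Hypothetical helper; not part of Aspose.Words.
final class EquationFilter {
    /** Returns true when an OLE ProgID looks like an equation editor object. */
    static boolean isEquationProgId(String progId) {
        if (progId == null) {
            return false;
        }
        // Assumption: equation objects use ProgIDs that start with "Equation.".
        return progId.trim().toLowerCase(Locale.ROOT).startsWith("equation.");
    }
}
```

In an extraction loop, a shape whose ProgID passes this check would be skipped before its image data is saved.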