Extraction issue 14

e503824 · August 23, 2022, 10:08am

Dear team,

We are extracting images from a document using Aspose.Words for Java, but in the case below some part images are not extracted. Please find the source code and the input and output files below.

Source code:

if ((paragraph.toString(SaveFormat.TEXT).toLowerCase().trim().startsWith("fig")
					|| paragraph.toString(SaveFormat.TEXT).startsWith("Scheme")
					|| paragraph.toString(SaveFormat.TEXT).startsWith("Plate")
					|| paragraph.toString(SaveFormat.TEXT).startsWith("Abb")
					|| paragraph.toString(SaveFormat.TEXT).startsWith("Abbildung"))
					&& !paragraph.toString(SaveFormat.TEXT).toLowerCase().startsWith("abbreviations")
					// for duplicate figure caption it-15
					&& (paragraph.getNextSibling() != null
							&& !paragraph.getNextSibling().toString(SaveFormat.TEXT).trim().matches(matches)
							|| (paragraph.getNextSibling() != null
									&& paragraph.getNextSibling().getNodeType() != NodeType.TABLE
									&& paragraph.getNextSibling().toString(SaveFormat.TEXT).trim().matches(matches)
									&& (((Paragraph) paragraph.getNextSibling()).getChildNodes(NodeType.SHAPE, true)
											.getCount() > 0
											|| (paragraph.getNextSibling().getNextSibling()) != null
													&& paragraph.getNextSibling().getNextSibling()
															.getNodeType() != NodeType.TABLE
													&& ((((Paragraph) paragraph.getNextSibling().getNextSibling())
															.getChildNodes(NodeType.SHAPE, true).getCount() == 0)
															
															// this condition was added by pavi on 14-12-2021 for duplicate captions
															||(((Paragraph) paragraph.getNextSibling().getNextSibling())
																	.getChildNodes(NodeType.SHAPE, true).getCount() > 0))))
							|| paragraph.getParentSection().getBody().getLastParagraph().getText().trim()
									.matches(matches))
					// for duplicate figure caption
					&& ((paragraph.getPreviousSibling() != null
							&& paragraph.getPreviousSibling().getNodeType() != NodeType.TABLE)
							|| paragraph.getParentSection().getBody().getFirstParagraph().getText().trim()
									.matches(matches))
					&& paragraph.getNodeType() != NodeType.TABLE
					&& paragraph.getParentNode().getNodeType() != NodeType.CELL
					&& !paragraph.toString(SaveFormat.TEXT).contains(AIE.docName)
					
					//condition added by pavi -14-12-2021
					&& (!(paragraph.toString(SaveFormat.TEXT).trim().startsWith("Figure Captions"))||
							!(paragraph.toString(SaveFormat.TEXT).trim().startsWith("Figures"))))
					
			        //|| ((paragraph.getNextSibling() == null) && (builder.getCurrentParagraph().isEndOfDocument()))
			        
					
			{

Input and Output : New folder.zip (182.7 KB)

alexey.noskov · August 23, 2022, 1:45pm

@e503824 I cannot reproduce the problem. The images are extracted properly using the ImageExtractor class provided in your other thread:
https://forum.aspose.com/t/need-to-extract-double-column-layout/250823/4

e503824 · August 25, 2022, 5:29am

Dear team,

We are extracting images from a document using Aspose.Words for Java, but in the case below some part images are not extracted. Please find the source code and the input and output files below.

Please note: we need to extract these images using the Caption Below conditions.

Source code:

if ((paragraph.toString(SaveFormat.TEXT).toLowerCase().trim().startsWith("fig")
					|| paragraph.toString(SaveFormat.TEXT).startsWith("Scheme")
					|| paragraph.toString(SaveFormat.TEXT).startsWith("Plate")
					|| paragraph.toString(SaveFormat.TEXT).startsWith("Abb")
					|| paragraph.toString(SaveFormat.TEXT).startsWith("Abbildung"))
					&& !paragraph.toString(SaveFormat.TEXT).toLowerCase().startsWith("abbreviations")
					// for duplicate figure caption it-15
					&& (paragraph.getNextSibling() != null
							&& !paragraph.getNextSibling().toString(SaveFormat.TEXT).trim().matches(matches)
							|| (paragraph.getNextSibling() != null
									&& paragraph.getNextSibling().getNodeType() != NodeType.TABLE
									&& paragraph.getNextSibling().toString(SaveFormat.TEXT).trim().matches(matches)
									&& (((Paragraph) paragraph.getNextSibling()).getChildNodes(NodeType.SHAPE, true)
											.getCount() > 0
											|| (paragraph.getNextSibling().getNextSibling()) != null
													&& paragraph.getNextSibling().getNextSibling()
															.getNodeType() != NodeType.TABLE
													&& ((((Paragraph) paragraph.getNextSibling().getNextSibling())
															.getChildNodes(NodeType.SHAPE, true).getCount() == 0)
															
															
															||(((Paragraph) paragraph.getNextSibling().getNextSibling())
																	.getChildNodes(NodeType.SHAPE, true).getCount() > 0))))
							|| paragraph.getParentSection().getBody().getLastParagraph().getText().trim()
									.matches(matches))
					// for duplicate figure caption
					&& ((paragraph.getPreviousSibling() != null
							&& paragraph.getPreviousSibling().getNodeType() != NodeType.TABLE)
							|| paragraph.getParentSection().getBody().getFirstParagraph().getText().trim()
									.matches(matches))
					&& paragraph.getNodeType() != NodeType.TABLE
					&& paragraph.getParentNode().getNodeType() != NodeType.CELL
					&& !paragraph.toString(SaveFormat.TEXT).contains(AIE.docName)
					
					&& (!(paragraph.toString(SaveFormat.TEXT).trim().startsWith("Figure Captions"))||
							!(paragraph.toString(SaveFormat.TEXT).trim().startsWith("Figures"))))
					
			        //|| ((paragraph.getNextSibling() == null) && (builder.getCurrentParagraph().isEndOfDocument()))
			        
					
			{

Input And Output : New folder.zip (182.7 KB)

e503824 · August 25, 2022, 5:37am

We also need to extract these labels. Please do the needful.

Please note: we need to extract these images using the Caption Below conditions.

alexey.noskov · August 25, 2022, 9:07am

@e503824 This question is already answered in your other thread:
https://forum.aspose.com/t/extraction-issue-14/250868

e503824 · August 25, 2022, 9:09am

Dear team,

We need to extract the images using the Caption Below conditions. Please help me out.

alexey.noskov · August 25, 2022, 6:39pm

@e503824 It is not quite clear what the problem is. The code I provided in your other thread properly extracts images and captions from the attached document. Could you please describe your problem in more detail?

e503824 · August 26, 2022, 3:18am

Dear team,

The given source code is not working for us.
The document has the figure caption below the figure.
Sometimes the given source code extracts unwanted images with the wrong file name.
The given source code does not extract part images.

Please find below how we are calling the classes:

CaptionBelow.captionBelow(interimdoc);

CaptionAbove.captionAbove(interimdoc);

TableImage.imagesInTable(interimdoc);

TextFrameImage.textFrameImageWithCaption(interimdoc);

String outGAFilePath = tempFolder + "\\PDF\\GA1.pdf";
File gaFile = new File(outGAFilePath);

if (fileName.toLowerCase().replaceAll("\\s", "").contains(GRAPHICALABSTRACT) || (!Kromatrix.figjArray.isEmpty() && (fileName.toLowerCase().startsWith("fig") || fileName.toLowerCase().startsWith("scheme") || fileName.toLowerCase().startsWith("plate"))))
{
	if (!gaFile.exists())
	{
		FixedGraphic.fixedImage(interimdoc);
	}
}

That’s why we are using these methods

alexey.noskov · August 26, 2022, 2:59pm

@e503824 Please note that the problem is not in Aspose.Words. The problem is in the logic you are using to analyze the document content, which is outside the scope of Aspose.Words. Implementing that logic is not the responsibility of Aspose.Words support.

Also, as I mentioned, the code provided here properly extracts the images from the attached document. Here is the output produced by this code on my side:
Fig. 7. Comparison of torsional stiffnes.pdf (50.9 KB)
Fig. 8. The relationship between kc and .pdf (40.0 KB)
As you can see, both the captions (see the file names) and the images are extracted properly. So it is not quite clear what does not work on your side.

e503824 · August 27, 2022, 1:56pm

Dear team,

Please note: we also need to extract the labels. Please refer to the screenshot.

Missing Items.png (2.0 KB)

alexey.noskov · August 27, 2022, 6:21pm

@e503824 Aspose.Words does not provide document content analysis features. Aspose.Words is a tool for working with documents. The logic required to analyze document content depends on your requirements and needs.
As far as I can see, you do not have any issues related to Aspose.Words itself, only with content analysis, which is outside the scope of Aspose.Words.
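
As an illustration of the kind of content-analysis logic discussed in this thread, here is a minimal sketch of the text-only part of the caption check, written in plain Java so it can be tested without a document. It is not the ImageExtractor code from the other thread, and the class and method names (CaptionFilter, isFigureCaption) are hypothetical. One detail it addresses: in the condition posted above, the last clause `!startsWith("Figure Captions") || !startsWith("Figures")` is true for every string, because no text can start with both prefixes, so it never filters anything. The sketch assumes the intent was to reject both headings. It also trims the text before every prefix check, which the posted code does only for "fig".

```java
import java.util.Locale;

// Hypothetical helper: decides from the paragraph text alone whether it
// looks like a figure caption. Sibling and shape checks are out of scope here.
final class CaptionFilter {
    // Case-sensitive prefixes taken from the condition posted in this thread.
    // "Abbildung" is already covered by "Abb".
    private static final String[] CAPTION_PREFIXES = {"Scheme", "Plate", "Abb"};

    // Section headings that must not be treated as captions (assumed intent).
    private static final String[] HEADING_PREFIXES = {"Figure Captions", "Figures"};

    private CaptionFilter() {
    }

    static boolean isFigureCaption(String paragraphText) {
        if (paragraphText == null) {
            return false;
        }
        String text = paragraphText.trim();
        String lower = text.toLowerCase(Locale.ROOT);

        // "fig" is matched case-insensitively, the other prefixes are not.
        boolean hasPrefix = lower.startsWith("fig");
        for (String prefix : CAPTION_PREFIXES) {
            hasPrefix = hasPrefix || text.startsWith(prefix);
        }
        if (!hasPrefix) {
            return false;
        }
        // "Abbreviations" starts with "Abb" but is not a caption.
        if (lower.startsWith("abbreviations")) {
            return false;
        }
        // Reject a heading if it matches ANY of the heading prefixes.
        for (String heading : HEADING_PREFIXES) {
            if (text.startsWith(heading)) {
                return false;
            }
        }
        return true;
    }
}
```

With this helper, the paragraph text obtained from `paragraph.toString(SaveFormat.TEXT)` could be passed to `CaptionFilter.isFigureCaption(...)` once, instead of repeating the conversion in every clause. Whether "Figures" should be rejected in all cases depends on your documents.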