How to get keyword

akshayapria · November 11, 2017, 5:43am

Hi Team,
The requirement is to extract the images and save each one into a new document. The extraction process uses the paragraph node, with the figure caption as the keyword. Each extracted document is saved under the figure caption of its image.

My issue is: how can I get the figure caption?

I am using the following to get the figure caption in all methods: String Imgcaption = paragraph.toString(SaveFormat.TEXT). It is not working for this type.

I have enclosed the code. Please help me get the figure caption with this code.
The source code is:
Document interimdoc12 = new Document(interim);
NodeCollection shapes_otherimg = interimdoc12.getChildNodes(NodeType.SHAPE, true);

		for (Shape shape : (Iterable<Shape>) shapes_otherimg) {
		    if (shape.hasImage() && shape.getParentParagraph().getNextSibling() != null
		            && shape.getParentParagraph().getNextSibling().getNodeType() == NodeType.PARAGRAPH) {
		    
		        
		        String figcaption = shape.getParentParagraph().getNextSibling().getText();

				
				ArrayList nodes1 = extractContent(shape.getParentParagraph(), shape.getParentParagraph(), true);
				
		        filename =folder_name +"_" +  "Fig_D" + i + "_" +  figcaption +".docx";
		        System.out.println(filename);
		        generateDocument(interimdoc12, nodes1).save(filename);

		        Paragraph fig1 = (Paragraph) shape.getParentParagraph();
		        /**
		         * REMOVAL OF NODE(START,END) FROM SOURCE WORD DOC START
		         **/
		        shape.getParentParagraph().insertBefore(new BookmarkStart(interimdoc12, "Image_" + i), shape);
		        fig1.appendChild(new BookmarkEnd(interimdoc12, "Image_" + i));
		        deletecaption( filename);
		        i++;
		      
		        for (Bookmark bookmark : interimdoc12.getRange().getBookmarks()) {
		            if (bookmark.getName().startsWith("Image_")) {
		                bookmark.setText("");
		            }
		        }
		        interimdoc12.save(interim);
		    }
		}

Thanks & regards,
pria.

tilal.ahmad · November 11, 2017, 8:51am

@akshayapria

Thanks for your inquiry. Please share your sample input document along with some more details of your requirements. We will look into it and will guide you accordingly.

akshayapria · November 11, 2017, 10:42am

Hi @tilal.ahmad
The extracted images are saved under the figure caption name (fig_1_fig_1). In my output, the first three figures come out without a figure name.

The code mentioned above does not read the exact figure caption.

The source code is Test.zip (41.0 KB)

The input document is test.zip (505.4 KB)

The actual output is actual output.zip (655.2 KB)

The expected output is expected_output.zip (655.2 KB)

Thanks and regards,
pria

tilal.ahmad · November 13, 2017, 2:58am

@akshayapria

Thanks for sharing the details. We are looking into the issue and will update you shortly.

tilal.ahmad · November 13, 2017, 6:00am

@akshayapria

You are not processing the figure caption in your “Section E” code, so the caption is not stored in the extracted image name. Please check the following sample code snippet, which captures the figure caption from a list paragraph. Furthermore, it seems you are processing images twice in the “Section E” code.

interimdoc.updateListLabels();
NodeCollection shapes1 = interimdoc.getChildNodes(NodeType.SHAPE, true);
for (Shape shape : (Iterable<Shape>) shapes1)
{
    if(shape.hasChart() || shape.hasImage())
    {
        Paragraph paragraph = shape.getParentParagraph();

        //Modify this condition according to your requirement
        if (paragraph.toString(com.aspose.words.SaveFormat.TEXT).contains("a)") ||
                paragraph.toString(com.aspose.words.SaveFormat.TEXT).contains("b)") ||
                paragraph.toString(com.aspose.words.SaveFormat.TEXT).contains("c)"))
        {
            com.aspose.words.Document dstDoc = new com.aspose.words.Document();

            NodeImporter importer = new NodeImporter(interimdoc, dstDoc, ImportFormatMode.KEEP_SOURCE_FORMATTING);
            Node newNode = importer.importNode(paragraph, true);
            dstDoc.getFirstSection().getBody().appendChild(newNode);
            String Imgcaption = (String) shape.getParentParagraph().getNextSibling().toString(com.aspose.words.SaveFormat.TEXT);

            filename = folder_name +"_"  +  "Fig_legend" + i + "_" + "FIG" + ".docx";
            dstDoc.save(filename);
            i++;
        }

    Node node = shape.getParentParagraph().getNextSibling();
    //Modify this condition according to your requirement
    if(node != null && node.getNodeType() == NodeType.PARAGRAPH
            && (
            ((Paragraph)node).isListItem() 
                    || node.toString(com.aspose.words.SaveFormat.TEXT).contains("(a)")
                    || node.toString(com.aspose.words.SaveFormat.TEXT).contains("(b)")
                    || node.toString(com.aspose.words.SaveFormat.TEXT).contains("(c)")
            ))
    {
    	
    	com.aspose.words.Document dstDoc = new com.aspose.words.Document();
       
        NodeImporter importer = new NodeImporter(interimdoc, dstDoc, ImportFormatMode.KEEP_SOURCE_FORMATTING);
        Node newNode = importer.importNode(shape, true);
        dstDoc.getFirstSection().getBody().getFirstParagraph().appendChild(newNode);
                   
		/** OUTPUT FILENAME START **/
		String Imgcaption = ((Paragraph)node).getListLabel().getLabelString();
		int k = 0;
		while (k < Imgcaption.length() && !Character.isDigit(Imgcaption.charAt(k)))
			k++;
		int j = k;
		while (j < Imgcaption.length() && Character.isDigit(Imgcaption.charAt(j)))
			j++;
		int l = Integer.parseInt(Imgcaption.substring(k, j));
		strI = Integer.toString(l);
		Pattern pattern = Pattern.compile(strI);
		Matcher matcher = pattern.matcher(Imgcaption);	
		while (matcher.find()) {
			name = Imgcaption.substring(0, matcher.end());
			name = name.replace(".", "_");
		}
		if (name.startsWith("Fig")) {
			name = "Fig" + "_" + l;
		}
		/** OUTPUT FILENAME END **/
        filename = folder_name +"_"  +  "Fig_legend" + i + "_" +  name + ".docx";
            dstDoc.save(filename);
           i++;
        }

    }
}
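
For readers adapting the "OUTPUT FILENAME" part of the snippet above, here is a minimal sketch in plain Java (no Aspose.Words calls) of the same idea: find the first number in the label string and build a name from it. The class name `FigureLabelName`, the method `nameFromLabel`, and the `fallback` parameter are hypothetical and not part of the original code. It differs from the snippet above in two ways that are assumptions on my part: it cuts the label at the first occurrence of the number rather than the last, and it returns the fallback instead of throwing when the label has no digits (with a label such as "a)", `Integer.valueOf` on the empty substring would throw a `NumberFormatException`).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FigureLabelName {
    // First run of digits in the label, e.g. "3" in "Fig. 3."
    private static final Pattern FIRST_NUMBER = Pattern.compile("\\d+");

    /**
     * Builds a file-name fragment from a list label string.
     * Returns the fallback when the label is null or contains no digits.
     */
    static String nameFromLabel(String label, String fallback) {
        if (label == null) {
            return fallback;
        }
        Matcher matcher = FIRST_NUMBER.matcher(label);
        if (!matcher.find()) {
            return fallback; // labels such as "a)" have no figure number
        }
        String number = matcher.group();
        // Keep everything up to the end of the number; dots become underscores
        String name = label.substring(0, matcher.end()).replace(".", "_");
        if (name.startsWith("Fig")) {
            name = "Fig_" + number; // normalise "Fig. 3" / "Figure 3" to "Fig_3"
        }
        return name;
    }

    public static void main(String[] args) {
        System.out.println(nameFromLabel("Fig. 3.", "unnamed"));
        System.out.println(nameFromLabel("a)", "unnamed"));
    }
}
```

In the loop above, the result could be used in place of `name`, for example `nameFromLabel(((Paragraph) node).getListLabel().getLabelString(), "unnamed")`.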

akshayapria · November 14, 2017, 6:17am

Hi @tilal.ahmad,

Thank you very much.

The issue is that some text is also extracted along with this output.

Please let me know how to remove the text.

The actual output is actual output.zip (469.6 KB)

Thanks & regards,
pria

tilal.ahmad · November 14, 2017, 1:28pm

@akshayapria

Thanks for your feedback. The file name text issue does not seem to be related to the Aspose.Words JAR. Please debug the file name code and refine it as per your requirements. To remove all whitespace, including non-visible whitespace characters such as tabs and line breaks, use the following code snippet.

name=name.replaceAll("\\s+", "");
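
If the caption text is used directly in the file name, a slightly broader clean-up may help. This is a minimal sketch in plain Java that builds on the `replaceAll("\\s+", "")` call above. The class `CaptionFileName` and the method `sanitize` are hypothetical names. Two steps are my own assumptions rather than something confirmed in this thread: stripping control characters (text read from a document node can carry characters such as the paragraph mark `\r`) and replacing characters that Windows does not allow in file names with an underscore.

```java
public class CaptionFileName {
    /**
     * Turns caption text into a string that is safer to use in a file name.
     * Returns an empty string for null input.
     */
    static String sanitize(String caption) {
        if (caption == null) {
            return "";
        }
        return caption
                // Drop control characters such as \r, \t and \f
                .replaceAll("\\p{Cntrl}", "")
                // Replace characters that are reserved in Windows file names
                .replaceAll("[\\\\/:*?\"<>|]", "_")
                // Remove remaining whitespace, as suggested above
                .replaceAll("\\s+", "");
    }

    public static void main(String[] args) {
        System.out.println(sanitize("Fig 1: (a)\r"));
    }
}
```

It could be applied just before the file name is built, for example `name = sanitize(name);`.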