Extract picture-framed images from a document using Aspose.Words for Java

priyanga · March 30, 2018, 4:45am

Hi Team,

My requirement is to extract the images from a document and save them into a new document.

Issue 1: Some of the images are inside a picture frame.

Please help me resolve the issue.

Input: test3.zip (821.7 KB)

Expected output: Expected output.zip (821.4 KB)

Thanks and regards,
priyanga G

tahir.manzoor · March 30, 2018, 11:34am

@priyanga,

Thanks for your inquiry. Your document contains the GroupShape node. Please use the following code example to get the desired output.

// Requires: import com.aspose.words.*;
Document doc = new Document(MyDir + "test3.docx");
int i = 1;

// Collect every paragraph in the document, including nested ones.
NodeCollection paragraphs = doc.getChildNodes(NodeType.PARAGRAPH, true);

for (Paragraph paragraph : (Iterable<Paragraph>) paragraphs)
{
    // Treat any paragraph whose text contains "Fig" as a figure caption.
    if (paragraph.toString(SaveFormat.TEXT).trim().contains("Fig"))
    {
        // The figure is expected in the paragraph immediately before the caption.
        Node node = paragraph.getPreviousSibling();

        if (node != null && node.getNodeType() == NodeType.PARAGRAPH
                && ((Paragraph) node).getChildNodes(NodeType.GROUP_SHAPE, true).getCount() > 0)
        {
            // Import the paragraph that holds the GroupShape into a new document and save it.
            Document dstDoc = new Document();
            NodeImporter importer = new NodeImporter(doc, dstDoc, ImportFormatMode.KEEP_SOURCE_FORMATTING);
            Node newNode = importer.importNode(node, true);
            dstDoc.getFirstSection().getBody().appendChild(newNode);
            dstDoc.save(MyDir + "output" + i + ".docx");
            i++;
        }
    }
}
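
The contains("Fig") check in the loop above also matches body paragraphs that merely mention a figure, for example "As shown in Fig. 2, ...". A stricter caption test might look like the minimal sketch below. It is plain Java with no Aspose.Words calls. The class name CaptionMatcher, the method isFigureCaption, and the caption rule itself (the text starts with "Fig", "Fig." or "Figure" followed by a number) are assumptions made for illustration and are not taken from the attached documents.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Aspose.Words.
final class CaptionMatcher {
    // Assumed caption rule: optional leading whitespace, then "Fig", "Fig." or
    // "Figure" (any letter case), optional whitespace, then at least one digit.
    private static final Pattern CAPTION =
            Pattern.compile("^\\s*Fig(?:ure|\\.)?\\s*\\d+", Pattern.CASE_INSENSITIVE);

    // Returns true only when the paragraph text begins like a figure caption.
    static boolean isFigureCaption(String paragraphText) {
        return paragraphText != null && CAPTION.matcher(paragraphText).find();
    }

    public static void main(String[] args) {
        System.out.println(isFigureCaption("Fig. 1 Block diagram"));          // prints "true"
        System.out.println(isFigureCaption("As shown in Fig. 2, the motor")); // prints "false"
    }
}
```

With such a helper, the condition in the loop could become if (CaptionMatcher.isFigureCaption(paragraph.toString(SaveFormat.TEXT))). The pattern would need adjusting if the captions in your documents follow a different convention.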

priyanga · April 2, 2018, 12:53pm

Hi @tahir.manzoor,

It's working fine for that particular file.

In some documents, most of the images are not extracted. Please help me solve the issue.

Input docs:
Input_1: input1.zip (758.6 KB)

Input_2: input2.zip (431.1 KB)

Expected output:

output_1: output_1.zip (762.3 KB)

output_2: output_2.zip (164.9 KB)

Thanks & regards,
priyanga G

tahir.manzoor · April 2, 2018, 5:16pm

@priyanga,

Thanks for your inquiry. In this case, we suggest that you bookmark the desired content and extract it using the code example shared at the following link:
Extract Content from a Bookmark

Please check the code examples shared in your other threads to bookmark the content and extract it. You can use the same approach to get the desired output for the documents shared in this thread.

priyanga · April 3, 2018, 4:35am

Hi @tahir.manzoor,

Thanks for your feedback.

As per your feedback, I bookmarked the content and tried to extract it, but the output documents are empty. Please help me solve the issue. I have attached the code here.

Document doc = new Document(MyDir + "FiguresAAA-test.docx");
DocumentBuilder builder = new DocumentBuilder(doc);
int bookmark = 1;
int i = 1;
NodeCollection paragraphs = doc.getChildNodes(NodeType.PARAGRAPH, true);
for (Paragraph paragraph : (Iterable<Paragraph>) paragraphs)
{
    if (paragraph.toString(SaveFormat.TEXT).trim().contains("Fig"))
    {
        Node node = paragraph.getPreviousSibling();
        if (node != null && node.getNodeType() == NodeType.PARAGRAPH
                && ((Paragraph) node).getChildNodes(NodeType.GROUP_SHAPE, true).getCount() > 0)
        {
            if (node == null)
            {
                builder.moveToDocumentStart();
                builder.startBookmark("Bookmark" + bookmark);
            }
            else
            {
                // System.out.println(PreviousPara.getText());
                builder.moveToParagraph(paragraphs.indexOf((Paragraph) node), -1);
                builder.startBookmark("Bookmark" + bookmark);
            }

            builder.moveToParagraph(paragraphs.indexOf(paragraph), 0);
            builder.endBookmark("Bookmark" + bookmark);
            bookmark++;
        }
    }

    for (Bookmark bm : doc.getRange().getBookmarks())
    {
        if (bm.getName().startsWith("Bookmark"))
        {
            Document dstDoc = new Document();
            NodeImporter importer = new NodeImporter(doc, dstDoc, ImportFormatMode.KEEP_SOURCE_FORMATTING);
            dstDoc.save(MyDir + "output" + i + ".docx");
            i++;
        }
    }
}

}
}

Thanks & Regards,
Priyanga G

tahir.manzoor · April 3, 2018, 12:12pm

@priyanga,

Thanks for your inquiry. In this case, you need to get the Fig caption from the text box. We shared a code example in the following forum thread:

How to get the content from text box in word document

After moving the Fig caption out of the text box, please bookmark the content and extract it.

priyanga · April 3, 2018, 1:17pm

Hi @tahir.manzoor,

Thanks for your feedback.

In this particular document, the figure captions are not in a text box. Please help me solve the issue.

Input: final_BEEE-D-16-00065__Paper revised.zip (431.1 KB)

Thanks & regards,
priyanga G

tahir.manzoor · April 3, 2018, 5:56pm

@priyanga,

Thanks for your inquiry. Please note that the code example shared in this forum thread will not work for all your cases. You first need to list all your use cases and then write the code accordingly.

We suggest that you check the Document Explorer code example in the Aspose.Words for Java examples repository on GitHub. You can use it to inspect the nodes of the imported document.

In the shared document, the shapes are inside a GroupShape node. You need to use NodeImporter to import the GroupShape into a new document.