image.zip (48.4 KB)
Hi team,
I need a workaround to copy/extract shapes from a docx into a new docx, based on a figure caption starting with “Fig”, using paragraph and curNode.
Extraction of group images is also expected.
Thanks for your inquiry. In your document, there are two shapes and many empty paragraphs. The last paragraph’s text starts with “Fig”. Both shapes are in separate paragraphs. Could you please share the parameters that you want to use to extract the contents?
You can simply remove the paragraph that starts with “Fig” text to get the desired output. Please check the following code example.
The document shared here is just one page of a journal document, so trimming the paragraph and saving the result as a new document does not work. I need consecutive images like these to be extracted with a filter such as: a shape followed by another shape, and so on, until text starting with “Fig” is found. The selected shapes should be saved together in the new document as a group image.
Based on previous searches: if the next sibling of a shape is also a shape, treat that shape as curNode and keep searching until a string starting with “Fig” is found.
I am in need of such a workaround, hope you can help me out.
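The filter described in this post can be sketched without the library, as a minimal model rather than Aspose.Words code. Everything here is an assumption made for illustration: the document body is modelled as a list of strings, the hypothetical prefix `SHAPE:` marks a paragraph that holds only a shape, and the class and method names are invented. Empty paragraphs are skipped so that they do not split a group, which is also an assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Simplified stand-in for a document body: each string is one paragraph.
// "SHAPE:<name>" (hypothetical marker) is a shape-only paragraph; anything else is text.
class ShapeRunScanner {

    static boolean isShape(String block) {
        return block.startsWith("SHAPE:");
    }

    // Returns one list of shapes for each caption that starts with "Fig"
    // and has at least one shape directly above it.
    static List<List<String>> collectGroups(List<String> blocks) {
        List<List<String>> groups = new ArrayList<>();
        List<String> run = new ArrayList<>();
        for (String block : blocks) {
            String text = block.trim();
            if (isShape(text)) {
                run.add(text);                          // a shape extends the current run
            } else if (text.startsWith("Fig")) {
                if (!run.isEmpty()) {
                    groups.add(new ArrayList<>(run));   // the caption closes the run
                }
                run.clear();
            } else if (!text.isEmpty()) {
                run.clear();                            // ordinary text breaks the run
            }
            // Empty paragraphs fall through and leave the run untouched.
        }
        return groups;
    }

    public static void main(String[] args) {
        List<String> blocks = Arrays.asList(
                "Intro text", "SHAPE:a", "SHAPE:b", "", "Fig 1. Two panels",
                "SHAPE:c", "More text", "Fig 2. No shapes directly above");
        System.out.println(collectGroups(blocks));
    }
}
```

Because the scan runs forward, the shapes in each group keep their source order, and a caption with no shapes above it produces no group.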
Thanks for sharing your requirement in detail. Please spare us some time for the analysis of your desired output. We will get back to you soon with code example according to your requirement.
When I run the attached input document test15.zip (363.8 KB) through the logic you shared, the output is created as expected, but many blank documents are created as well. Is there a way to create only the documents that contain images and avoid creating blank ones?
With that logic, all group images from the problem document come out mismatched: the images are created in reverse order. The expected output is attached. Kindly help.
I also request a way to delete/remove the extracted content from the source document after it is embedded in the new document, to avoid repeated images.
Thanks for your inquiry. You want to extract images from a Word document that appear before the text starting with “Fig” or “Figure”. You also want to remove the empty paragraphs from the output document. We already shared the solution to your queries in the following thread. Please use the same approach and modify the code according to your use cases.
But my problem is that the group images are created in reverse order relative to the source document, and many blank documents are created during execution. In addition, I request a way to delete/remove the extracted images from the source document after execution.
// Context not shown in the post: doc is the source Document, nodes is an ArrayList,
// i is an int counter and MyDir is the output folder.
// Get the paragraphs that start with "Fig".
for (Paragraph paragraph : (Iterable<Paragraph>) doc.getChildNodes(NodeType.PARAGRAPH, true))
{
    if (paragraph.toString(SaveFormat.TEXT).trim().startsWith("Fig"))
    {
        // Walk backwards over the text-free paragraphs that contain shapes.
        // The nodes are therefore collected last-to-first.
        Node previousPara = paragraph.getPreviousSibling();
        while (previousPara != null
                && previousPara.getNodeType() == NodeType.PARAGRAPH
                && previousPara.toString(SaveFormat.TEXT).trim().length() == 0
                && ((Paragraph) previousPara).getChildNodes(NodeType.SHAPE, true).getCount() > 0)
        {
            nodes.add(previousPara);
            previousPara = previousPara.getPreviousSibling();
        }
        // Extract the consecutive shapes and export them into a new document.
        // A document is saved for every caption, even when nodes is empty.
        Document dstDoc = new Document();
        NodeImporter importer = new NodeImporter(doc, dstDoc, ImportFormatMode.KEEP_SOURCE_FORMATTING);
        for (Paragraph para : (Iterable<Paragraph>) nodes)
        {
            Node newNode = importer.importNode(para, true);
            dstDoc.getFirstSection().getBody().appendChild(newNode);
        }
        dstDoc.save(MyDir + "output" + i + ".docx");
        i++;
        nodes.clear();
    }
}
I have the following difficulties:
Many blank/empty documents are created during execution.
Consecutive images (group images) appear in reverse order.
(For example: if the source document has 3 consecutive images, the first image appears last and the last image appears first in the output document.)
After images are extracted to the new document, I request a workaround to delete/remove each extracted image from the source document, so that the same image is not extracted again.
I am in need of such a workaround, hope you can help me out.
Thanking you for helping out.
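The three difficulties listed in this post can be illustrated on a library-free model. This is a sketch under stated assumptions, not Aspose.Words code: the source document is modelled as a mutable list of strings, the hypothetical prefix `SHAPE:` marks a shape-only paragraph, and the class and method names are invented. It keeps the backward walk from the caption, prepends each shape so that source order is preserved, skips captions that have no shapes, and removes extracted shapes from the source.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedList;
import java.util.List;

class FigureExtractor {

    // Walks back from each "Fig" caption, collects the shape-only paragraphs
    // directly above it in document order, and removes them from the source list.
    static List<List<String>> extractAndRemove(List<String> source) {
        List<List<String>> outputs = new ArrayList<>();
        for (int i = 0; i < source.size(); i++) {
            if (!source.get(i).trim().startsWith("Fig")) {
                continue;
            }
            LinkedList<String> group = new LinkedList<>();
            int j = i - 1;
            while (j >= 0 && source.get(j).startsWith("SHAPE:")) {
                // Prepending undoes the reversal caused by walking backwards;
                // removing the entry keeps it from being extracted a second time.
                group.addFirst(source.remove(j));
                j--;
            }
            i -= group.size();          // the caption moved up by the number of removed entries
            if (!group.isEmpty()) {
                outputs.add(group);     // captions without shapes produce no output
            }
        }
        return outputs;
    }

    public static void main(String[] args) {
        List<String> source = new ArrayList<>(Arrays.asList(
                "text", "SHAPE:1", "SHAPE:2", "SHAPE:3", "Fig 1",
                "Fig 2", "SHAPE:4", "Fig 3"));
        System.out.println(extractAndRemove(source));
        System.out.println(source);
    }
}
```

In the posted Aspose.Words code, the corresponding changes would presumably be to insert at the front of the list (`nodes.add(0, previousPara)`), to create and save `dstDoc` only when `nodes` is not empty, and to call `remove()` on each source paragraph once it has been imported. Removing nodes while iterating over the paragraph collection may need extra care, for example iterating over a snapshot of the collection instead.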
The solution you shared works fine for consecutive images, except that their order is reversed.
Thank you @tahir.manzoor, the code works excellently, as we expected. However, because inline images are extracted using the different mechanisms you mentioned earlier, we get duplicate images. To avoid that, we request a way to delete/remove the extracted images from the source document after extraction. I am very thankful for your continuous support and solutions. We are able to perform well thanks to the absolutely perfect replies from @tahir.manzoor.
Thanking You
Priya Dharshini J P
Thanks for your inquiry. We have not found the duplicate images issue in output documents. Could you please share the following resources here for testing?
Your input document.
Please share the page numbers of the input document whose content is duplicated.
Please share the output documents that show the undesired behavior.