Extract Label from image

Saranya_Sekar · September 25, 2018, 7:55am

Hi Team,

I want to extract the label from this document. I am using the following source code. The sample input is Sample1.zip (596.6 KB) and the expected output is sample_output.zip (610.6 KB).

private static void unNumberedImageExtrac(Document interimdoc) throws Exception
{
Document doc = new Document(filearg);

	DocumentBuilder builder = new DocumentBuilder(doc);
	int i = 1;
	NodeCollection paragraphs = doc.getChildNodes(NodeType.PARAGRAPH, true);
	for (Paragraph  paragraph : (Iterable<Paragraph>) paragraphs)
	{
	    if(paragraph.toString(SaveFormat.TEXT).trim().startsWith("Fig"))
	    {

	        Node PreviousPara = paragraph.getPreviousSibling();
	        
	        if (PreviousPara != null &&
	              (PreviousPara.toString(SaveFormat.TEXT).trim().contains("(a)") ||
	                         PreviousPara.toString(SaveFormat.TEXT).trim().contains("(b)")||
	                         PreviousPara.toString(SaveFormat.TEXT).trim().contains("(c)")||
	                         PreviousPara.toString(SaveFormat.TEXT).trim().contains("(d)"))

	                ) {
	        	PreviousPara = PreviousPara.getPreviousSibling();
	            try{
	            if (PreviousPara != null && ((Paragraph) PreviousPara).getChildNodes(NodeType.SHAPE, true).getCount() > 0) {

	                for (Shape shape : (Iterable<Shape>) ((Paragraph) PreviousPara).getChildNodes(NodeType.SHAPE, true))
	                {
	                	Document dstDoc = new Document();
	                    NodeImporter importer = new NodeImporter(doc, dstDoc, ImportFormatMode.KEEP_SOURCE_FORMATTING);
	                    Node newNode = importer.importNode(shape, true);
	                    dstDoc.getFirstSection().getBody().getFirstParagraph().appendChild(newNode);
	                    dstDoc.getPreviousSibling();
	                    dstDoc.save(folderName + "output_B" + i + ".docx");
	                    dstDoc.save(folderName + "output_B" + i + ".jpeg");
	                    dstDoc.save(folderName + "output_B" + i + ".pdf");
	                    i++;
	                }
	            }
	            }
	            catch(Exception e){
	                // Report the failure instead of silently swallowing it.
	                e.printStackTrace();
	            }
	        }
	    }
	}
	
}

tahir.manzoor · September 25, 2018, 9:57am

@Saranya_Sekar

Thanks for your inquiry. Please use the following code example to get the desired output.

Document doc = new Document(MyDir + "Sample1.docx");

DocumentBuilder builder = new DocumentBuilder(doc);
int i = 1;
NodeCollection paragraphs = doc.getChildNodes(NodeType.PARAGRAPH, true);
for (Paragraph  paragraph : (Iterable<Paragraph>) paragraphs)
{
    if(paragraph.toString(SaveFormat.TEXT).trim().startsWith("Fig"))
    {
        Node PreviousPara = paragraph.getPreviousSibling();

        if (PreviousPara != null &&
                (PreviousPara.toString(SaveFormat.TEXT).trim().contains("(a)") ||
                        PreviousPara.toString(SaveFormat.TEXT).trim().contains("(b)") )

                )
        {
            Node label = PreviousPara;
            if(label != null)
            {
                PreviousPara = label.getPreviousSibling();
                if (PreviousPara != null && PreviousPara.isComposite() && ((CompositeNode) PreviousPara).getChildNodes(NodeType.SHAPE, true).getCount() > 0) {

                    for (Shape shape : (Iterable<Shape>) ((CompositeNode) PreviousPara).getChildNodes(NodeType.SHAPE, true))
                    {
                        Document dstDoc = new Document();
                        NodeImporter importer = new NodeImporter(doc, dstDoc, ImportFormatMode.KEEP_SOURCE_FORMATTING);
                        Node newNode = importer.importNode(shape, true);
                        dstDoc.getFirstSection().getBody().getFirstParagraph().appendChild(newNode);


                        newNode = importer.importNode(label, true);
                        dstDoc.getFirstSection().getBody().appendChild(newNode);

                        if(i%2 == 0)
                            dstDoc.getFirstSection().getBody().getLastParagraph().getRange().replace("(a)", "", new FindReplaceOptions());
                        else
                            dstDoc.getFirstSection().getBody().getLastParagraph().getRange().replace("(b)", "", new FindReplaceOptions());

                        dstDoc.save(MyDir + "output" + i + ".docx");
                        i++;
                    }
                }

            }
        }
    }
}
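
The chain of contains("(a)") / contains("(b)") checks grows with every new sub-label. As a minimal sketch, assuming a sub-label is always one lowercase letter in parentheses, the checks could be collapsed into a single regular expression. The class SubLabels and its methods are hypothetical helpers written for illustration; they are not part of Aspose.Words and work on plain strings such as the result of paragraph.toString(SaveFormat.TEXT).trim().

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not part of Aspose.Words): works on plain paragraph text.
final class SubLabels {
    // Assumption: a sub-label is one lowercase ASCII letter wrapped in parentheses.
    private static final Pattern SUB_LABEL = Pattern.compile("\\([a-z]\\)");

    private SubLabels() {}

    // Returns true when the text contains at least one sub-label such as "(a)".
    static boolean containsSubLabel(String text) {
        return text != null && SUB_LABEL.matcher(text).find();
    }

    // Drops every sub-label except labelToKeep and trims the result.
    static String keepOnly(String text, String labelToKeep) {
        Matcher m = SUB_LABEL.matcher(text);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            // Copy the text between the previous label and this one.
            out.append(text, last, m.start());
            if (m.group().equals(labelToKeep)) {
                out.append(m.group());
            }
            last = m.end();
        }
        out.append(text, last, text.length());
        return out.toString().trim();
    }
}
```

With a helper like this, the condition on the label paragraph would become a single containsSubLabel call, and the label text to keep for each extracted image could be computed with keepOnly before it is written into the output document.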

MikeLak · September 25, 2018, 10:27am

@tahir.manzoor Thank you for sharing the code. It is working fine.

tahir.manzoor · September 25, 2018, 3:50pm

@Saranya_Sekar

Please feel free to ask if you have any questions about Aspose.Words; we will be happy to help you.

MikeLak · September 26, 2018, 4:36am

Thank you very much.

Saranya_Sekar · September 26, 2018, 9:00am

@tahir.manzoor I have to extract the whole labelled image, with the output file named after the image number. The sample input is a.zip (297.1 KB).
The expected output is 7.zip (296.9 KB).
A bookmark also has to be placed at the location of the extracted image in the interim document. A sample interim document is ManuscriptRevisedClean_Interim.zip (15.8 KB).

tahir.manzoor · September 26, 2018, 4:02pm

@Saranya_Sekar,

Thanks for your inquiry. Please use the following code example to get the desired output.

Document doc = new Document(MyDir + "a.docx");

DocumentBuilder builder = new DocumentBuilder(doc);
ArrayList tables = new ArrayList();
int bookmark = 1;
int i = 1;
NodeCollection paragraphs = doc.getChildNodes(NodeType.PARAGRAPH, true);
for (Paragraph  paragraph : (Iterable<Paragraph>) paragraphs)
{
    if(paragraph.toString(SaveFormat.TEXT).trim().startsWith("Fig"))
    {

        Node PreviousPara = paragraph.getPreviousSibling();

        // Skip empty paragraphs and sub-label paragraphs such as "(a)" or "(b)".
        // The null check guards every text test, so a missing sibling cannot cause a NullPointerException.
        while (PreviousPara != null)
        {
            String text = PreviousPara.toString(SaveFormat.TEXT).trim();
            if (!(text.length() == 0 || text.contains("(a)") || text.contains("(b)") ||
                    text.contains("(c)") || text.contains("(d)")))
                break;

            PreviousPara = PreviousPara.getPreviousSibling();
            // Stop as soon as a paragraph that holds a shape is reached.
            if (PreviousPara != null && PreviousPara.getNodeType() == NodeType.PARAGRAPH &&
                    ((Paragraph) PreviousPara).getChildNodes(NodeType.SHAPE, true).getCount() > 0)
                break;
        }
        
        if(PreviousPara == null)
        {
            builder.moveToDocumentStart();
            builder.insertParagraph();
            builder.startBookmark("Bookmark" + bookmark);
            builder.moveToParagraph(paragraphs.indexOf(paragraph), 0);
            builder.endBookmark("Bookmark" + bookmark);
            bookmark++;
        }
        else if(PreviousPara.getNodeType() == NodeType.PARAGRAPH)
        {
            Node node = ((Paragraph)PreviousPara).getParentNode().insertBefore(new Paragraph(doc), PreviousPara);
            builder.moveTo(node);
            builder.startBookmark("Bookmark" + bookmark);
            builder.moveTo(paragraph);
            builder.endBookmark("Bookmark" + bookmark);
            bookmark++;
        }
        else if(PreviousPara.getNodeType() == NodeType.TABLE)
        {
            if(((Table)PreviousPara).getChildNodes(NodeType.SHAPE, true).getCount() > 0)
                tables.add(((Table)PreviousPara));
        }

    }
}

for (Bookmark bm : doc.getRange().getBookmarks())
{
    if(bm.getName().startsWith("Bookmark"))
    {
        ArrayList nodes =  ExtractContents.extractContent(bm.getBookmarkStart(), bm.getBookmarkEnd(), true);
        Document dstDoc = ExtractContents.generateDocument(doc, nodes);

        PageSetup sourcePageSetup = ((Paragraph)bm.getBookmarkStart().getParentNode()).getParentSection().getPageSetup();
        dstDoc.getFirstSection().getPageSetup().setPaperSize(sourcePageSetup.getPaperSize());
        dstDoc.getFirstSection().getPageSetup().setLeftMargin(sourcePageSetup.getLeftMargin());
        dstDoc.getFirstSection().getPageSetup().setRightMargin(sourcePageSetup.getRightMargin());

        if(dstDoc.getLastSection().getBody().getLastParagraph().toString(SaveFormat.TEXT).trim().startsWith("Fig"))
            dstDoc.getLastSection().getBody().getLastParagraph().remove();

        if(dstDoc.getFirstSection().getBody().getFirstParagraph().getChildNodes(NodeType.SHAPE, true).getCount() == 0)
            dstDoc.getFirstSection().getBody().getFirstParagraph().remove();

        dstDoc.save(MyDir + "output"+i+".docx");
        i++;
    }
}

Saranya_Sekar · September 27, 2018, 3:30am

The previous code extracts paragraphs along with the images. The sample input is for nocturnal radiative cooling testing mode.zip (680.3 KB).
The code above also extracts paragraphs. The output I get is for nocturnal radiative cooling testing mode_output.zip (694.3 KB).
The output I expect is for nocturnal radiative cooling testing mode_Expected_output (2).zip (593.0 KB).
Can you please suggest code that produces an intermediate document with the images removed and a bookmark placed at each location? See Test_interim.zip (14.4 KB).
Also, the positions of the (a) and (b) labels in the images are not aligned properly.
Is the following code suitable for placing the bookmark? Kindly share the code.
((Paragraph) paragraph).getChildNodes(NodeType.SHAPE, true).clear();

Paragraph p = ((Paragraph) paragraph);

p.getChildNodes(NodeType.SHAPE, true).clear();

p.appendChild(new BookmarkStart(interimdoc, "MyBookmark"));

Run run = new Run(interimdoc, "<Fig>Numbered_Figure</Fig>");

run.getFont().setColor(Color.RED);

p.getRuns().add(run);

p.appendChild(new BookmarkEnd(interimdoc, "MyBookmark"));

tahir.manzoor · September 27, 2018, 4:16am

@Saranya_Sekar

Please get the code of extractContent and generateDocument methods from the following article.
Extract Selected Content Between Nodes

Saranya_Sekar · September 27, 2018, 6:28am

With the above code, paragraphs are also extracted. I mentioned this issue in my previous reply.

tahir.manzoor · September 27, 2018, 3:26pm

@Saranya_Sekar

Thanks for your inquiry.

We have modified code example according to your requirement. Please use the following code example.

Document doc = new Document(MyDir + "for nocturnal radiative cooling testing mode.docx");

DocumentBuilder builder = new DocumentBuilder(doc);
ArrayList tables = new ArrayList();
int bookmark = 1;
int i = 1;
NodeCollection paragraphs = doc.getChildNodes(NodeType.PARAGRAPH, true);
for (Paragraph  paragraph : (Iterable<Paragraph>) paragraphs)
{
    if(paragraph.toString(SaveFormat.TEXT).trim().startsWith("Fig"))
    {

        Node PreviousPara = paragraph.getPreviousSibling();

        // Skip empty paragraphs and sub-label paragraphs such as "(a)" or "(b)".
        // The null check guards every text test, so a missing sibling cannot cause a NullPointerException.
        while (PreviousPara != null)
        {
            String text = PreviousPara.toString(SaveFormat.TEXT).trim();
            if (!(text.length() == 0 || text.contains("(a)") || text.contains("(b)") ||
                    text.contains("(c)") || text.contains("(d)")))
                break;

            PreviousPara = PreviousPara.getPreviousSibling();
            // Stop as soon as a paragraph that holds a shape is reached.
            if (PreviousPara != null && PreviousPara.getNodeType() == NodeType.PARAGRAPH &&
                    ((Paragraph) PreviousPara).getChildNodes(NodeType.SHAPE, true).getCount() > 0)
                break;
        }

        if(PreviousPara == null)
        {
            builder.moveToDocumentStart();
            builder.insertParagraph();
            builder.startBookmark("Bookmark" + bookmark);
            builder.moveToParagraph(paragraphs.indexOf(paragraph), 0);
            builder.endBookmark("Bookmark" + bookmark);
            bookmark++;
        }
        else if(PreviousPara.getNodeType() == NodeType.PARAGRAPH)
        {
            while(PreviousPara != null && ((Paragraph)PreviousPara).getChildNodes(NodeType.SHAPE, true).getCount() == 0)
                PreviousPara = PreviousPara.getNextSibling();

            Node node = ((Paragraph)PreviousPara).getParentNode().insertBefore(new Paragraph(doc), PreviousPara);
            builder.moveTo(node);
            builder.startBookmark("Bookmark" + bookmark);
            builder.moveTo(paragraph);
            builder.endBookmark("Bookmark" + bookmark);
            bookmark++;
        }
        else if(PreviousPara.getNodeType() == NodeType.TABLE)
        {
            if(((Table)PreviousPara).getChildNodes(NodeType.SHAPE, true).getCount() > 0)
                tables.add(((Table)PreviousPara));
        }

    }
}

for (Bookmark bm : doc.getRange().getBookmarks())
{
    if(bm.getName().startsWith("Bookmark"))
    {
        ArrayList nodes =  ExtractContents.extractContent(bm.getBookmarkStart(), bm.getBookmarkEnd(), true);
        Document dstDoc = ExtractContents.generateDocument(doc, nodes);

        PageSetup sourcePageSetup = ((Paragraph)bm.getBookmarkStart().getParentNode()).getParentSection().getPageSetup();
        dstDoc.getFirstSection().getPageSetup().setPaperSize(sourcePageSetup.getPaperSize());
        dstDoc.getFirstSection().getPageSetup().setLeftMargin(sourcePageSetup.getLeftMargin());
        dstDoc.getFirstSection().getPageSetup().setRightMargin(sourcePageSetup.getRightMargin());

        dstDoc.updatePageLayout();
        if(dstDoc.getLastSection().getBody().getLastParagraph().toString(SaveFormat.TEXT).trim().startsWith("Fig"))
            dstDoc.getLastSection().getBody().getLastParagraph().remove();

        while(dstDoc.getFirstSection().getBody().getFirstParagraph().getChildNodes(NodeType.SHAPE, true).getCount() == 0)
            dstDoc.getFirstSection().getBody().getFirstParagraph().remove();

        dstDoc.save(MyDir + "output"+i+".docx");
        i++;
    }
}

We are writing the code example for this case. We will share it here when it is ready. Thanks for your patience.

Saranya_Sekar · September 28, 2018, 3:34am

@tahir.manzoor The above code is not working. It is not extracting the labelled images.

tahir.manzoor · September 28, 2018, 1:58pm

@Saranya_Sekar

Thanks for your inquiry. Could you please share the page numbers of images that are not extracted along with their screenshots? Please also share the expected output documents. We will then provide you more information about your query.

Saranya_Sekar · October 1, 2018, 3:28am

@tahir.manzoor Thanks for your support. The input document is Sample_Document.zip (2.7 MB)
and the expected output document is Expected_Output.zip (604.5 KB).
This is the actual file. Please help.

Can you please suggest code that produces an intermediate document with the images removed and a bookmark placed at each location?

tahir.manzoor · October 1, 2018, 2:42pm

@Saranya_Sekar

Thanks for sharing the document. Please use the following code example to get the expected output.

Document doc = new Document(MyDir + "Sample_Document.docx");
DocumentBuilder builder = new DocumentBuilder(doc);

int bookmark = 1;
int i = 1;
NodeCollection paragraphs = doc.getChildNodes(NodeType.PARAGRAPH, true);
for (Paragraph  paragraph : (Iterable<Paragraph>) paragraphs)
{
    if(paragraph.toString(SaveFormat.TEXT).trim().startsWith("Fig"))
    {

        Node PreviousPara = paragraph.getPreviousSibling();

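        if(PreviousPara != null &&
                (PreviousPara.toString(SaveFormat.TEXT).trim().contains("(a)") ||
                                PreviousPara.toString(SaveFormat.TEXT).trim().contains("(b)")))
        {
            PreviousPara = PreviousPara.getPreviousSibling();
            // Guard the cast: the previous sibling may be missing or may be a table.
            if(PreviousPara != null && PreviousPara.getNodeType() == NodeType.PARAGRAPH &&
                    ((Paragraph)PreviousPara).getChildNodes(NodeType.SHAPE, true).getCount() > 0)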
            {
                if(PreviousPara == null)
                {
                    builder.moveToDocumentStart();
                    builder.insertParagraph();
                    builder.startBookmark("Bookmark" + bookmark);
                    builder.moveToParagraph(paragraphs.indexOf(paragraph), 0);
                    builder.endBookmark("Bookmark" + bookmark);
                    bookmark++;
                }
                else if(PreviousPara.getNodeType() == NodeType.PARAGRAPH)
                {
                    while(PreviousPara != null && ((Paragraph)PreviousPara).getChildNodes(NodeType.SHAPE, true).getCount() == 0)
                        PreviousPara = PreviousPara.getNextSibling();

                    Node node = ((Paragraph)PreviousPara).getParentNode().insertBefore(new Paragraph(doc), PreviousPara);
                    builder.moveTo(node);
                    builder.startBookmark("Bookmark" + bookmark);
                    builder.moveTo(paragraph);
                    builder.endBookmark("Bookmark" + bookmark);
                    bookmark++;
                }
            }
        }
    }
}

for (Bookmark bm : doc.getRange().getBookmarks())
{
    if(bm.getName().startsWith("Bookmark"))
    {
        ArrayList nodes =  ExtractContents.extractContent(bm.getBookmarkStart(), bm.getBookmarkEnd(), true);
        Document dstDoc = ExtractContents.generateDocument(doc, nodes);

        PageSetup sourcePageSetup = ((Paragraph)bm.getBookmarkStart().getParentNode()).getParentSection().getPageSetup();
        dstDoc.getFirstSection().getPageSetup().setPaperSize(sourcePageSetup.getPaperSize());
        dstDoc.getFirstSection().getPageSetup().setLeftMargin(sourcePageSetup.getLeftMargin());
        dstDoc.getFirstSection().getPageSetup().setRightMargin(sourcePageSetup.getRightMargin());

        dstDoc.updatePageLayout();
        if(dstDoc.getLastSection().getBody().getLastParagraph().toString(SaveFormat.TEXT).trim().startsWith("Fig"))
            dstDoc.getLastSection().getBody().getLastParagraph().remove();

        while(dstDoc.getFirstSection().getBody().getFirstParagraph().getChildNodes(NodeType.SHAPE, true).getCount() == 0)
            dstDoc.getFirstSection().getBody().getFirstParagraph().remove();

        dstDoc.save(MyDir + "output"+i+".docx");
        i++;
    }
}

Saranya_Sekar · October 3, 2018, 4:13am

I place a bookmark once I remove the figure, but the (a), (b) captions are not removed. My code is below. I have attached the interim file that was saved, Sample_Document_Interim.zip (895.5 KB); the (a) and (b) labels are still in it. The sample input file is Sample_Document.zip (2.7 MB). The bookmark must be placed at the location where the image was removed.
Please help. Also, how can I save each extracted figure under its figure name? For example, Fig 7 and Fig 17 are the images extracted here, and I want the output files to have those names.

private static void labelledImagesExtraction(Document interimdoc) throws Exception
{
Document doc = interimdoc;
DocumentBuilder builder = new DocumentBuilder(doc);

	int bookmark = 1;
	int i = 1;
	NodeCollection paragraphs = doc.getChildNodes(NodeType.PARAGRAPH, true);
	for (Paragraph  paragraph : (Iterable<Paragraph>) paragraphs)
	{
	    if(paragraph.toString(SaveFormat.TEXT).trim().startsWith("Fig"))
	    {

	        Node PreviousPara = paragraph.getPreviousSibling();

	        if(PreviousPara.toString(SaveFormat.TEXT).trim().contains("(a)") ||
	                                PreviousPara.toString(SaveFormat.TEXT).trim().contains("(b)"))
	        {
	            PreviousPara = PreviousPara.getPreviousSibling();
	            if(((Paragraph)PreviousPara).getChildNodes(NodeType.SHAPE, true).getCount() > 0)
	            {
	                if(PreviousPara == null)
	                {
	                    builder.moveToDocumentStart();
	                    builder.insertParagraph();
	                    builder.startBookmark("Bookmark" + bookmark);
	                    builder.moveToParagraph(paragraphs.indexOf(paragraph), 0);
	                    builder.endBookmark("Bookmark" + bookmark);
	                    bookmark++;
	                }
	                else if(PreviousPara.getNodeType() == NodeType.PARAGRAPH)
	                {
	                    while(PreviousPara != null && ((Paragraph)PreviousPara).getChildNodes(NodeType.SHAPE, true).getCount() == 0)
	                        PreviousPara = PreviousPara.getNextSibling();

	                    Node node = ((Paragraph)PreviousPara).getParentNode().insertBefore(new Paragraph(doc), PreviousPara);
	                    builder.moveTo(node);
	                    builder.startBookmark("Bookmark" + bookmark);
	                    builder.moveTo(paragraph);
	                    builder.endBookmark("Bookmark" + bookmark);
	                    bookmark++;
	                }
	            }
	        }
	    }
	}

	for (Bookmark bm : doc.getRange().getBookmarks())
	{
	    if(bm.getName().startsWith("Bookmark"))
	    {
	        ArrayList nodes =  ExtractContents.extractContent(bm.getBookmarkStart(), bm.getBookmarkEnd(), true);
	        Document dstDoc = ExtractContents.generateDocument(doc, nodes);

	        PageSetup sourcePageSetup = ((Paragraph)bm.getBookmarkStart().getParentNode()).getParentSection().getPageSetup();
	        dstDoc.getFirstSection().getPageSetup().setPaperSize(sourcePageSetup.getPaperSize());
	        dstDoc.getFirstSection().getPageSetup().setLeftMargin(sourcePageSetup.getLeftMargin());
	        dstDoc.getFirstSection().getPageSetup().setRightMargin(sourcePageSetup.getRightMargin());

	        dstDoc.updatePageLayout();
	        if(dstDoc.getLastSection().getBody().getLastParagraph().toString(SaveFormat.TEXT).trim().startsWith("Fig"))
	            dstDoc.getLastSection().getBody().getLastParagraph().remove();

	        while(dstDoc.getFirstSection().getBody().getFirstParagraph().getChildNodes(NodeType.SHAPE, true).getCount() == 0)
	            dstDoc.getFirstSection().getBody().getFirstParagraph().remove();

	        dstDoc.save(folderName + "output"+i+".docx");
	        i++;
	    }
	}
	
	for (Paragraph  paragraph : (Iterable<Paragraph>) paragraphs)
	{
	    if(paragraph.toString(SaveFormat.TEXT).trim().startsWith("Fig"))
	    {
	        String strI="",name="",file_name="", folder_name="";
	        Node PreviousPara = paragraph.getPreviousSibling();
	        if(PreviousPara.toString(SaveFormat.TEXT).trim().contains("(a)") ||
	                PreviousPara.toString(SaveFormat.TEXT).trim().contains("(b)"))
	        {
	            PreviousPara = PreviousPara.getPreviousSibling();
	            PreviousPara.remove();
	            try {
	                String Imgcaption = paragraph.toString(SaveFormat.TEXT).trim();
	                int k = 0;
	                while (k < Imgcaption.length() && !Character.isDigit(Imgcaption.charAt(k)))
	                    k++;
	                int j = k;
	                while (j < Imgcaption.length() && Character.isDigit(Imgcaption.charAt(j)))
	                    j++;
	                int l = Integer.valueOf(Imgcaption.substring(k, j));
	                strI = Integer.toString(l);
	                Pattern pattern = Pattern.compile(strI);
	                Matcher matcher = pattern.matcher(Imgcaption);
	                while (matcher.find()) {
	                    name = Imgcaption.substring(0, matcher.end());
	                    name = name.replace(".", "_");
	                }
	                if (name.startsWith("Fig")) {
	                    name = "Fig" + "_" + l;
	                }
	                /** OUTPUT FILENAME END **/

	            } catch (Exception e) {
	            }

	            ((Paragraph) paragraph).getChildNodes(NodeType.SHAPE, true).clear();
	            Paragraph p = ((Paragraph) paragraph);
	            p.getChildNodes(NodeType.SHAPE, true).clear();
	            p.appendChild(new BookmarkStart(interimdoc,
	                    "MyBookmark"));
	            Run run = new Run(interimdoc, "[" + name + "]");
	            run.getFont().setSize(12);
	            run.getFont().setStrikeThrough(false);
	            run.getFont().setColor(Color.RED);
	            p.getRuns().add(run);
	            p.appendChild(new BookmarkEnd(interimdoc,
	                    "MyBookmark"));

	            interimdoc.save(interim);
	        }
	    }
	}
}

tahir.manzoor · October 3, 2018, 3:14pm

@Saranya_Sekar

Thanks for your inquiry. Please use the following modified code for this new case. We have attached the output documents with this post for your kind reference. Fig_output.zip (2.6 MB)

Document doc = new Document(MyDir + "Sample_Document.docx");
DocumentBuilder builder = new DocumentBuilder(doc);

int bookmark = 1;
int i = 1;
NodeCollection paragraphs = doc.getChildNodes(NodeType.PARAGRAPH, true);
for (Paragraph  paragraph : (Iterable<Paragraph>) paragraphs)
{
    if(paragraph.toString(SaveFormat.TEXT).trim().startsWith("Fig"))
    {

        Node PreviousPara = paragraph.getPreviousSibling();

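        if(PreviousPara != null &&
                (PreviousPara.toString(SaveFormat.TEXT).trim().contains("(a)") ||
                PreviousPara.toString(SaveFormat.TEXT).trim().contains("(b)")))
        {
            PreviousPara = PreviousPara.getPreviousSibling();
            // Guard the cast: the previous sibling may be missing or may be a table.
            if(PreviousPara != null && PreviousPara.getNodeType() == NodeType.PARAGRAPH &&
                    ((Paragraph)PreviousPara).getChildNodes(NodeType.SHAPE, true).getCount() > 0)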
            {
                if(PreviousPara == null)
                {
                    builder.moveToDocumentStart();
                    builder.insertParagraph();
                    builder.startBookmark("Bookmark" + bookmark);
                    builder.moveToParagraph(paragraphs.indexOf(paragraph), 0);
                    builder.endBookmark("Bookmark" + bookmark);
                    bookmark++;
                }
                else if(PreviousPara.getNodeType() == NodeType.PARAGRAPH)
                {
                    while(PreviousPara != null && ((Paragraph)PreviousPara).getChildNodes(NodeType.SHAPE, true).getCount() == 0)
                        PreviousPara = PreviousPara.getNextSibling();

                    Node node = ((Paragraph)PreviousPara).getParentNode().insertBefore(new Paragraph(doc), PreviousPara);
                    builder.moveTo(node);
                    builder.startBookmark("Bookmark" + bookmark);
                    builder.moveTo(paragraph);
                    builder.endBookmark("Bookmark" + bookmark);
                    bookmark++;
                }
            }
        }
    }
}

for (Bookmark bm : doc.getRange().getBookmarks())
{
    if(bm.getName().startsWith("Bookmark"))
    {
        ArrayList nodes =  ExtractContents.extractContent(bm.getBookmarkStart(), bm.getBookmarkEnd(), true);
        Document dstDoc = ExtractContents.generateDocument(doc, nodes);

        PageSetup sourcePageSetup = ((Paragraph)bm.getBookmarkStart().getParentNode()).getParentSection().getPageSetup();
        dstDoc.getFirstSection().getPageSetup().setPaperSize(sourcePageSetup.getPaperSize());
        dstDoc.getFirstSection().getPageSetup().setLeftMargin(sourcePageSetup.getLeftMargin());
        dstDoc.getFirstSection().getPageSetup().setRightMargin(sourcePageSetup.getRightMargin());

        dstDoc.updatePageLayout();
        if(dstDoc.getLastSection().getBody().getLastParagraph().toString(SaveFormat.TEXT).trim().startsWith("Fig"))
            dstDoc.getLastSection().getBody().getLastParagraph().remove();

        while(dstDoc.getFirstSection().getBody().getFirstParagraph().getChildNodes(NodeType.SHAPE, true).getCount() == 0)
            dstDoc.getFirstSection().getBody().getFirstParagraph().remove();

        dstDoc.updatePageLayout();
        String filename = bm.getBookmarkEnd().getParentNode().toString(SaveFormat.TEXT);
        dstDoc.save(MyDir + filename.substring(0, 7) + "_out.docx");
        i++;
    }
}

//Add fig text at the place of images
for (Bookmark bm : doc.getRange().getBookmarks()) {
    if (bm.getName().startsWith("Bookmark")) {
        bm.getBookmarkEnd().getParentNode().insertBefore(new BookmarkEnd(doc, bm.getName()), bm.getBookmarkEnd().getParentNode().getFirstChild());
    }
}
doc.updatePageLayout();
for (Bookmark bm : doc.getRange().getBookmarks()) {
    if (bm.getName().startsWith("Bookmark")) {
        bm.getBookmarkEnd().getParentNode().insertBefore(new BookmarkEnd(doc, bm.getName()), bm.getBookmarkEnd().getParentNode().getFirstChild());
        String figText = bm.getBookmarkEnd().getParentNode().toString(SaveFormat.TEXT);
        bm.setText("<Fig>"+figText.substring(0, 7)+"</Fig>" + ControlChar.PARAGRAPH_BREAK);
    }
}

doc.save(MyDir + "output.docx");
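
Note that substring(0, 7) assumes the figure number always ends at the seventh character of the caption, so shorter captions throw an exception and a caption such as "Fig. 7." keeps its trailing dot. A possible alternative, sketched here under the assumption that captions start with "Fig", "Fig." or "Figure" followed by a number, is to parse the number with a regular expression. FigureNames and fileStem are hypothetical names used only for this illustration; they are not part of Aspose.Words.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not part of Aspose.Words): derives a file name stem from caption text.
final class FigureNames {
    // Assumption: captions look like "Fig. 7 ...", "Fig 17: ..." or "Figure 3 ...".
    private static final Pattern CAPTION =
            Pattern.compile("^\\s*Fig(?:ure)?\\.?\\s*(\\d+)");

    private FigureNames() {}

    // Returns a stem such as "Fig_7", or an empty Optional when no figure number is found.
    static Optional<String> fileStem(String caption) {
        if (caption == null) {
            return Optional.empty();
        }
        Matcher m = CAPTION.matcher(caption);
        return m.find() ? Optional.of("Fig_" + m.group(1)) : Optional.empty();
    }
}
```

The caption text obtained from the bookmark end's parent paragraph could be passed to fileStem, and the save call could fall back to a counter-based name when the Optional is empty.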

Saranya_Sekar · October 4, 2018, 7:24am

@tahir.manzoor
I want code that does not use the ExtractContents class, since we have to optimize our code. Can you please provide another way that avoids the ExtractContents class? The above code gives the expected output. I have another scenario in which the images are not on the same line, and this code does not work for those cases. The sample input is Multiple_Label.zip (1.5 MB)

and the expected output is Multiple-Label-output.zip (1.5 MB)

Please suggest how to place the bookmark below the fig caption as well.

tahir.manzoor · October 4, 2018, 5:35pm

@Saranya_Sekar

Thanks for your inquiry. Please note that the code examples shared in this forum thread will not work for all your cases. First, you need to list all your use cases and then write the code accordingly. You need to use the same approach, that is, bookmark the content and then extract it. The approach to extract the images is almost the same. You need to change the conditions in the while loop and the if statements according to your requirement.
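
As a rough illustration of keeping the per-use-case condition in one place, the sketch below models each paragraph as plain data (its trimmed text and whether it contains a shape) and walks backward from the caption to find where the figure block starts. Para, isSkippable and findBlockStart are hypothetical names, and the model is an assumption made for this example; in real code the values would come from the Aspose.Words nodes, and the returned index would decide where the bookmark starts.

```java
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical model (not part of Aspose.Words) of the backward walk from a caption.
final class FigureBlockFinder {
    // Simplified stand-in for a paragraph: trimmed text and whether it holds an image.
    static final class Para {
        final String text;
        final boolean hasShape;
        Para(String text, boolean hasShape) {
            this.text = text;
            this.hasShape = hasShape;
        }
    }

    // Assumption: sub-labels are a single lowercase letter in parentheses.
    private static final Pattern SUB_LABEL = Pattern.compile("\\([a-z]\\)");

    private FigureBlockFinder() {}

    // The rule that changes per use case: paragraphs allowed between images and the caption.
    static boolean isSkippable(Para p) {
        return !p.hasShape && (p.text.isEmpty() || SUB_LABEL.matcher(p.text).find());
    }

    // Returns the index of the topmost image paragraph belonging to the caption,
    // or -1 when no image precedes it.
    static int findBlockStart(List<Para> paras, int captionIndex) {
        int start = -1;
        for (int i = captionIndex - 1; i >= 0; i--) {
            Para p = paras.get(i);
            if (p.hasShape) {
                start = i;          // remember the topmost image seen so far
            } else if (!isSkippable(p)) {
                break;              // ordinary body text ends the figure block
            }
        }
        return start;
    }
}
```

Because images and their labels may sit on separate lines, the walk keeps going past label paragraphs until it reaches ordinary body text. One caveat of this rule is that a body paragraph that happens to contain text like "(a)" would also be skipped, so the condition may need tightening for real documents.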

Saranya_Sekar · October 6, 2018, 4:01am

@tahir.manzoor
Is there any other way to do this without the ExtractContents class? We need to optimise the code.