Extract Label from image

Saranya_Sekar · September 28, 2018, 3:34am

@tahir.manzoor The above code is not working. It is not extracting the labelled images.

tahir.manzoor · September 28, 2018, 1:58pm

Thanks for your inquiry. Could you please share the page numbers of the images that are not extracted, along with their screenshots? Please also share the expected output documents. We will then provide more information about your query.

Saranya_Sekar · October 1, 2018, 3:28am

@tahir.manzoor Thanks for your support. The input document is Sample_Document.zip (2.7 MB)
and the output document is Expected_Output.zip (604.5 KB)
This is the actual file. Please support.

Can you please suggest code that also produces an intermediate document in which the images are removed and a bookmark is placed where each image was?

tahir.manzoor · October 1, 2018, 2:42pm

@Saranya_Sekar

Thanks for sharing the document. Please use the following code example to get the expected output.

Document doc = new Document(MyDir + "Sample_Document.docx");
DocumentBuilder builder = new DocumentBuilder(doc);

int bookmark = 1;
int i = 1;
NodeCollection paragraphs = doc.getChildNodes(NodeType.PARAGRAPH, true);
for (Paragraph  paragraph : (Iterable<Paragraph>) paragraphs)
{
    if(paragraph.toString(SaveFormat.TEXT).trim().startsWith("Fig"))
    {

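        Node PreviousPara = paragraph.getPreviousSibling();

        // Skip a caption that has no preceding sibling instead of throwing a NullPointerException.
        if(PreviousPara != null && (PreviousPara.toString(SaveFormat.TEXT).trim().contains("(a)") ||
                                PreviousPara.toString(SaveFormat.TEXT).trim().contains("(b)")))
        {
            PreviousPara = PreviousPara.getPreviousSibling();
            // Check for null and for the node type before casting to Paragraph.
            if(PreviousPara != null && PreviousPara.getNodeType() == NodeType.PARAGRAPH &&
                    ((Paragraph)PreviousPara).getChildNodes(NodeType.SHAPE, true).getCount() > 0)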
            {
                if(PreviousPara == null)
                {
                    builder.moveToDocumentStart();
                    builder.insertParagraph();
                    builder.startBookmark("Bookmark" + bookmark);
                    builder.moveToParagraph(paragraphs.indexOf(paragraph), 0);
                    builder.endBookmark("Bookmark" + bookmark);
                    bookmark++;
                }
                else if(PreviousPara.getNodeType() == NodeType.PARAGRAPH)
                {
                    while(PreviousPara != null && ((Paragraph)PreviousPara).getChildNodes(NodeType.SHAPE, true).getCount() == 0)
                        PreviousPara = PreviousPara.getNextSibling();

                    Node node = ((Paragraph)PreviousPara).getParentNode().insertBefore(new Paragraph(doc), PreviousPara);
                    builder.moveTo(node);
                    builder.startBookmark("Bookmark" + bookmark);
                    builder.moveTo(paragraph);
                    builder.endBookmark("Bookmark" + bookmark);
                    bookmark++;
                }
            }
        }
    }
}

for (Bookmark bm : doc.getRange().getBookmarks())
{
    if(bm.getName().startsWith("Bookmark"))
    {
        ArrayList nodes =  ExtractContents.extractContent(bm.getBookmarkStart(), bm.getBookmarkEnd(), true);
        Document dstDoc = ExtractContents.generateDocument(doc, nodes);

        PageSetup sourcePageSetup = ((Paragraph)bm.getBookmarkStart().getParentNode()).getParentSection().getPageSetup();
        dstDoc.getFirstSection().getPageSetup().setPaperSize(sourcePageSetup.getPaperSize());
        dstDoc.getFirstSection().getPageSetup().setLeftMargin(sourcePageSetup.getLeftMargin());
        dstDoc.getFirstSection().getPageSetup().setRightMargin(sourcePageSetup.getRightMargin());

        dstDoc.updatePageLayout();
        if(dstDoc.getLastSection().getBody().getLastParagraph().toString(SaveFormat.TEXT).trim().startsWith("Fig"))
            dstDoc.getLastSection().getBody().getLastParagraph().remove();

        while(dstDoc.getFirstSection().getBody().getFirstParagraph().getChildNodes(NodeType.SHAPE, true).getCount() == 0)
            dstDoc.getFirstSection().getBody().getFirstParagraph().remove();

        dstDoc.save(MyDir + "output"+i+".docx");
        i++;
    }
}

Saranya_Sekar · October 3, 2018, 4:13am

I am placing a bookmark once I remove the figure, but the captions (a) and (b) are not removed. My code is below. Sample_Document_Interim.zip (895.5 KB)
I have attached the interim file as saved; (a) and (b) are not removed in it. The sample input file is Sample_Document.zip (2.7 MB). The bookmark must be placed at the location where the image is removed.
Please help. Also, how can I save each extracted figure under the same name as its caption? For example, Fig 7 and Fig 17 are the extracted images here, and I want those names for the files.

private static void labelledImagesExtraction(Document interimdoc) throws Exception
{
	Document doc = interimdoc;
	DocumentBuilder builder = new DocumentBuilder(doc);

	int bookmark = 1;
	int i = 1;
	NodeCollection paragraphs = doc.getChildNodes(NodeType.PARAGRAPH, true);
	for (Paragraph  paragraph : (Iterable<Paragraph>) paragraphs)
	{
	    if(paragraph.toString(SaveFormat.TEXT).trim().startsWith("Fig"))
	    {

	        Node PreviousPara = paragraph.getPreviousSibling();

	        if(PreviousPara.toString(SaveFormat.TEXT).trim().contains("(a)") ||
	                                PreviousPara.toString(SaveFormat.TEXT).trim().contains("(b)"))
	        {
	            PreviousPara = PreviousPara.getPreviousSibling();
	            if(((Paragraph)PreviousPara).getChildNodes(NodeType.SHAPE, true).getCount() > 0)
	            {
	                if(PreviousPara == null)
	                {
	                    builder.moveToDocumentStart();
	                    builder.insertParagraph();
	                    builder.startBookmark("Bookmark" + bookmark);
	                    builder.moveToParagraph(paragraphs.indexOf(paragraph), 0);
	                    builder.endBookmark("Bookmark" + bookmark);
	                    bookmark++;
	                }
	                else if(PreviousPara.getNodeType() == NodeType.PARAGRAPH)
	                {
	                    while(PreviousPara != null && ((Paragraph)PreviousPara).getChildNodes(NodeType.SHAPE, true).getCount() == 0)
	                        PreviousPara = PreviousPara.getNextSibling();

	                    Node node = ((Paragraph)PreviousPara).getParentNode().insertBefore(new Paragraph(doc), PreviousPara);
	                    builder.moveTo(node);
	                    builder.startBookmark("Bookmark" + bookmark);
	                    builder.moveTo(paragraph);
	                    builder.endBookmark("Bookmark" + bookmark);
	                    bookmark++;
	                }
	            }
	        }
	    }
	}

	for (Bookmark bm : doc.getRange().getBookmarks())
	{
	    if(bm.getName().startsWith("Bookmark"))
	    {
	        ArrayList nodes =  ExtractContents.extractContent(bm.getBookmarkStart(), bm.getBookmarkEnd(), true);
	        Document dstDoc = ExtractContents.generateDocument(doc, nodes);

	        PageSetup sourcePageSetup = ((Paragraph)bm.getBookmarkStart().getParentNode()).getParentSection().getPageSetup();
	        dstDoc.getFirstSection().getPageSetup().setPaperSize(sourcePageSetup.getPaperSize());
	        dstDoc.getFirstSection().getPageSetup().setLeftMargin(sourcePageSetup.getLeftMargin());
	        dstDoc.getFirstSection().getPageSetup().setRightMargin(sourcePageSetup.getRightMargin());

	        dstDoc.updatePageLayout();
	        if(dstDoc.getLastSection().getBody().getLastParagraph().toString(SaveFormat.TEXT).trim().startsWith("Fig"))
	            dstDoc.getLastSection().getBody().getLastParagraph().remove();

	        while(dstDoc.getFirstSection().getBody().getFirstParagraph().getChildNodes(NodeType.SHAPE, true).getCount() == 0)
	            dstDoc.getFirstSection().getBody().getFirstParagraph().remove();

	        dstDoc.save(folderName + "output"+i+".docx");
	        i++;
	    }
	}
	
	for (Paragraph  paragraph : (Iterable<Paragraph>) paragraphs)
	{
	    if(paragraph.toString(SaveFormat.TEXT).trim().startsWith("Fig"))
	    {
	        String strI="",name="",file_name="", folder_name="";
	        Node PreviousPara = paragraph.getPreviousSibling();
	        if(PreviousPara.toString(SaveFormat.TEXT).trim().contains("(a)") ||
	                PreviousPara.toString(SaveFormat.TEXT).trim().contains("(b)"))
	        {
	            PreviousPara = PreviousPara.getPreviousSibling();
	            PreviousPara.remove();
	            try {
	                String Imgcaption = paragraph.toString(SaveFormat.TEXT).trim();
	                int k = 0;
	                while (k < Imgcaption.length() && !Character.isDigit(Imgcaption.charAt(k)))
	                    k++;
	                int j = k;
	                while (j < Imgcaption.length() && Character.isDigit(Imgcaption.charAt(j)))
	                    j++;
	                int l = Integer.valueOf(Imgcaption.substring(k, j));
	                strI = Integer.toString(l);
	                Pattern pattern = Pattern.compile(strI);
	                Matcher matcher = pattern.matcher(Imgcaption);
	                while (matcher.find()) {
	                    name = Imgcaption.substring(0, matcher.end());
	                    name = name.replace(".", "_");
	                }
	                if (name.startsWith("Fig")) {
	                    name = "Fig" + "_" + l;
	                }
	                /** OUTPUT FILENAME END **/

	            } catch (Exception e) {
	            }

	            ((Paragraph) paragraph).getChildNodes(NodeType.SHAPE, true).clear();
	            Paragraph p = ((Paragraph) paragraph);
	            p.getChildNodes(NodeType.SHAPE, true).clear();
	            p.appendChild(new BookmarkStart(interimdoc, "MyBookmark"));
	            Run run = new Run(interimdoc, "[" + name + "]");
	            run.getFont().setSize(12);
	            run.getFont().setStrikeThrough(false);
	            run.getFont().setColor(Color.RED);
	            p.getRuns().add(run);
	            p.appendChild(new BookmarkEnd(interimdoc, "MyBookmark"));

	            interimdoc.save(interim);
	        }
	    }
	}
}

tahir.manzoor · October 3, 2018, 3:14pm

@Saranya_Sekar

Thanks for your inquiry. Please use the following modified code for this new case. We have attached the output documents with this post for your kind reference. Fig_output.zip (2.6 MB)

Document doc = new Document(MyDir + "Sample_Document.docx");
DocumentBuilder builder = new DocumentBuilder(doc);

int bookmark = 1;
int i = 1;
NodeCollection paragraphs = doc.getChildNodes(NodeType.PARAGRAPH, true);
for (Paragraph  paragraph : (Iterable<Paragraph>) paragraphs)
{
    if(paragraph.toString(SaveFormat.TEXT).trim().startsWith("Fig"))
    {

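        Node PreviousPara = paragraph.getPreviousSibling();

        // Skip a caption that has no preceding sibling instead of throwing a NullPointerException.
        if(PreviousPara != null && (PreviousPara.toString(SaveFormat.TEXT).trim().contains("(a)") ||
                PreviousPara.toString(SaveFormat.TEXT).trim().contains("(b)")))
        {
            PreviousPara = PreviousPara.getPreviousSibling();
            // Check for null and for the node type before casting to Paragraph.
            if(PreviousPara != null && PreviousPara.getNodeType() == NodeType.PARAGRAPH &&
                    ((Paragraph)PreviousPara).getChildNodes(NodeType.SHAPE, true).getCount() > 0)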
            {
                if(PreviousPara == null)
                {
                    builder.moveToDocumentStart();
                    builder.insertParagraph();
                    builder.startBookmark("Bookmark" + bookmark);
                    builder.moveToParagraph(paragraphs.indexOf(paragraph), 0);
                    builder.endBookmark("Bookmark" + bookmark);
                    bookmark++;
                }
                else if(PreviousPara.getNodeType() == NodeType.PARAGRAPH)
                {
                    while(PreviousPara != null && ((Paragraph)PreviousPara).getChildNodes(NodeType.SHAPE, true).getCount() == 0)
                        PreviousPara = PreviousPara.getNextSibling();

                    Node node = ((Paragraph)PreviousPara).getParentNode().insertBefore(new Paragraph(doc), PreviousPara);
                    builder.moveTo(node);
                    builder.startBookmark("Bookmark" + bookmark);
                    builder.moveTo(paragraph);
                    builder.endBookmark("Bookmark" + bookmark);
                    bookmark++;
                }
            }
        }
    }
}

for (Bookmark bm : doc.getRange().getBookmarks())
{
    if(bm.getName().startsWith("Bookmark"))
    {
        ArrayList nodes =  ExtractContents.extractContent(bm.getBookmarkStart(), bm.getBookmarkEnd(), true);
        Document dstDoc = ExtractContents.generateDocument(doc, nodes);

        PageSetup sourcePageSetup = ((Paragraph)bm.getBookmarkStart().getParentNode()).getParentSection().getPageSetup();
        dstDoc.getFirstSection().getPageSetup().setPaperSize(sourcePageSetup.getPaperSize());
        dstDoc.getFirstSection().getPageSetup().setLeftMargin(sourcePageSetup.getLeftMargin());
        dstDoc.getFirstSection().getPageSetup().setRightMargin(sourcePageSetup.getRightMargin());

        dstDoc.updatePageLayout();
        if(dstDoc.getLastSection().getBody().getLastParagraph().toString(SaveFormat.TEXT).trim().startsWith("Fig"))
            dstDoc.getLastSection().getBody().getLastParagraph().remove();

        while(dstDoc.getFirstSection().getBody().getFirstParagraph().getChildNodes(NodeType.SHAPE, true).getCount() == 0)
            dstDoc.getFirstSection().getBody().getFirstParagraph().remove();

        dstDoc.updatePageLayout();
        String filename = bm.getBookmarkEnd().getParentNode().toString(SaveFormat.TEXT);
        dstDoc.save(MyDir + filename.substring(0, 7) + "_out.docx");
        i++;
    }
}

//Add fig text at the place of images
for (Bookmark bm : doc.getRange().getBookmarks()) {
    if (bm.getName().startsWith("Bookmark")) {
        bm.getBookmarkEnd().getParentNode().insertBefore(new BookmarkEnd(doc, bm.getName()), bm.getBookmarkEnd().getParentNode().getFirstChild());
    }
}
doc.updatePageLayout();
for (Bookmark bm : doc.getRange().getBookmarks()) {
    if (bm.getName().startsWith("Bookmark")) {
        bm.getBookmarkEnd().getParentNode().insertBefore(new BookmarkEnd(doc, bm.getName()), bm.getBookmarkEnd().getParentNode().getFirstChild());
        String figText = bm.getBookmarkEnd().getParentNode().toString(SaveFormat.TEXT);
        bm.setText("<Fig>"+figText.substring(0, 7)+"</Fig>" + ControlChar.PARAGRAPH_BREAK);
    }
}

doc.save(MyDir + "output.docx");
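
The `substring(0, 7)` call above assumes every caption is at least seven characters long and that the figure number fits inside that prefix. A minimal sketch of a more forgiving approach, using only the Java standard library, reads the number with a regular expression. The class name `FigureNames`, the method `fileStem`, and the `Fig_<number>` naming scheme are assumptions for illustration; they are not part of Aspose.Words or of the answer above.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper (not an Aspose.Words class): derives a file name stem from a figure caption. */
final class FigureNames {
    // "Figure" or "Fig" in any case, an optional dot, optional whitespace, then the figure number.
    private static final Pattern CAPTION =
            Pattern.compile("^\\s*(?:Figure|Fig)\\.?\\s*(\\d+)", Pattern.CASE_INSENSITIVE);

    /** Returns "Fig_" plus the digits found in the caption, or empty if the text is not a numbered caption. */
    static Optional<String> fileStem(String captionText) {
        if (captionText == null) {
            return Optional.empty();
        }
        Matcher m = CAPTION.matcher(captionText);
        // group(1) holds the digits exactly as written in the caption.
        return m.find() ? Optional.of("Fig_" + m.group(1)) : Optional.empty();
    }
}
```

With a helper like this, the save call could become something like `dstDoc.save(MyDir + FigureNames.fileStem(filename).orElse("output" + i) + "_out.docx")`, which falls back to a numbered name when no figure number is found.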

Saranya_Sekar · October 4, 2018, 7:24am

@tahir.manzoor
I want code that does not use the ExtractContents class, since we have to optimize our code. Can you please provide another way that avoids the ExtractContents class? The above code gives the expected output. I have another scenario in which the images are not on the same line, and this code does not work for those cases. The sample input is Multiple_Label.zip (1.5 MB)

and the expected output is Multiple-Label-output.zip (1.5 MB)

Please suggest how to place the bookmark below the fig caption as well.

tahir.manzoor · October 4, 2018, 5:35pm

@Saranya_Sekar

Thanks for your inquiry. Please note that the code examples shared in this forum thread will not work for all your cases. First, you need to list all your use cases and then write the code accordingly. The approach stays the same: bookmark the content and then extract it. Extracting the images works almost the same way in each case. You need to change the conditions in the while loop and the if statement according to your requirements.
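
As one example of adapting such a condition, the code in this thread checks only for the literal labels "(a)" and "(b)". A minimal sketch of a more general test, assuming sub-figure labels are always a single letter in parentheses, is shown below. The class `SubFigureLabels` and the method `containsLabel` are hypothetical names and use only the Java standard library.

```java
import java.util.regex.Pattern;

/** Hypothetical helper: recognises sub-figure labels such as "(a)" or "(b)" in paragraph text. */
final class SubFigureLabels {
    // One letter between literal parentheses. The backslashes keep the
    // parentheses from being read as a regex capturing group.
    private static final Pattern LABEL =
            Pattern.compile("\\([a-z]\\)", Pattern.CASE_INSENSITIVE);

    /** True if the text contains at least one label of the form "(x)". */
    static boolean containsLabel(String paragraphText) {
        return paragraphText != null && LABEL.matcher(paragraphText).find();
    }
}
```

The paragraph text passed in would be the same trimmed string the existing code obtains with `toString(SaveFormat.TEXT).trim()`.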

Saranya_Sekar · October 6, 2018, 4:01am

@tahir.manzoor
Is there any other way to do this without the ExtractContents class? We need to optimise the code.

tahir.manzoor · October 6, 2018, 10:55am

@Saranya_Sekar

Thanks for your inquiry. Yes, you can extract the nodes from the document without using the ExtractContents class. You need to get the nodes between the BookmarkStart node and the BookmarkEnd node. Please use the Node.getNextSibling() method (the NextSibling property in .NET) to move to the next sibling of a node, and the NodeImporter.importNode() method to import each node into the new document.

Saranya_Sekar · October 7, 2018, 10:22am

@tahir.manzoor
Can you please provide the sample code for that?

tahir.manzoor · October 8, 2018, 4:41am

@Saranya_Sekar

Thanks for your inquiry. Sure, we will share the code snippet shortly.

tahir.manzoor · October 8, 2018, 5:41am

@Saranya_Sekar

Please use the following method to extract the paragraphs from the source document. We also suggest reading about the document object model of Aspose.Words here:
Aspose.Words Document Object Model

Please also read Programming with Documents.

static ArrayList ExtractContentBetweenParagraphs(Paragraph para1, Paragraph para2) throws Exception
{
    ArrayList nodes = new ArrayList();
    nodes.add(para1);
    Node currentNode = para1;
    while(currentNode != null && !currentNode.equals(para2))
    {
        currentNode = currentNode.getNextSibling();
        // Do not add null if para2 is never reached among the siblings.
        if(currentNode != null)
            nodes.add(currentNode);
    }

    return nodes;
}

ArrayList nodes =  ExtractContentBetweenParagraphs((Paragraph)bm.getBookmarkStart().getParentNode(), (Paragraph) bm.getBookmarkEnd().getParentNode());
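
One detail worth watching in a sibling walk like this is the end of the list: if the end paragraph is never reached (for example, because it sits under a different parent), the next-sibling call eventually returns null. A minimal sketch of an inclusive walk that stops cleanly in that case is shown below. `SimpleNode` and `SiblingRange` are stand-in types invented for illustration so the logic can run without Aspose.Words; the `next` field plays the role of `getNextSibling()`.

```java
import java.util.ArrayList;
import java.util.List;

/** Stand-in for a document node; only the sibling link matters here. Not an Aspose.Words type. */
final class SimpleNode {
    final String name;
    SimpleNode next; // plays the role of getNextSibling()

    SimpleNode(String name) {
        this.name = name;
    }
}

final class SiblingRange {
    /** Collects start..end inclusive; if end is never reached, stops at the last sibling. */
    static List<SimpleNode> between(SimpleNode start, SimpleNode end) {
        List<SimpleNode> nodes = new ArrayList<>();
        for (SimpleNode current = start; current != null; current = current.next) {
            nodes.add(current);
            if (current == end) {
                break; // the end node is included, then the walk stops
            }
        }
        return nodes;
    }
}
```

Because the null check happens before each node is added, the returned list never contains a null entry that a later import step would trip over.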

Saranya_Sekar · October 8, 2018, 12:00pm

@tahir.manzoor
I have a sample document with various scenarios. Can you help me extract all the images? The sample input is Multiple_Label.zip (2.0 MB)
The expected output is Multiple-Label-output.zip (2.0 MB)
Kindly help.

tahir.manzoor · October 8, 2018, 5:51pm

@Saranya_Sekar

Thanks for your inquiry. Please give us some time. We will check all use cases of this document and share the code example with you.

Saranya_Sekar · October 9, 2018, 3:36am

Kindly include this as a sample input: Multiple_Label.zip (2.3 MB)
The expected output is Multiple-Label-output.zip (2.3 MB)
Thanks in advance.

Saranya_Sekar · October 9, 2018, 10:06am

I am using this code

private static void labelledImagesExtraction(Document interimdoc) throws Exception
{
	Document doc = interimdoc;
	DocumentBuilder builder = new DocumentBuilder(doc);

	int bookmark = 1;
	int i = 1;
	NodeCollection paragraphs = doc.getChildNodes(NodeType.PARAGRAPH, true);
	for (Paragraph  paragraph : (Iterable<Paragraph>) paragraphs)
	{
	    if(paragraph.toString(SaveFormat.TEXT).trim().startsWith("Fig"))
	    {

	        Node PreviousPara = paragraph.getPreviousSibling();
	        
	        String label = PreviousPara.toString(SaveFormat.TEXT).trim();
	        // The parentheses are escaped so the pattern matches literal labels such as "(a)" or "(b)".
	        // Unescaped, "(.*?)" is a capturing group that matches any text.
	        String pattern = "\\(.*?\\)";
	        Pattern regExp = Pattern.compile(pattern, Pattern.CASE_INSENSITIVE);
	        Matcher match = regExp.matcher(label);
	        if(match.find())
	        {
	            PreviousPara = PreviousPara.getPreviousSibling();
	            if(((Paragraph)PreviousPara).getChildNodes(NodeType.SHAPE, true).getCount() > 0)
	            {
	                if(PreviousPara == null)
	                {
	                    builder.moveToDocumentStart();
	                    builder.insertParagraph();
	                    builder.startBookmark("Bookmark" + bookmark);
	                    builder.moveToParagraph(paragraphs.indexOf(paragraph), 0);
	                    builder.endBookmark("Bookmark" + bookmark);
	                    bookmark++;
	                }
	                else if(PreviousPara.getNodeType() == NodeType.PARAGRAPH)
	                {
	                    while(PreviousPara != null && ((Paragraph)PreviousPara).getChildNodes(NodeType.SHAPE, true).getCount() == 0)
	                        PreviousPara = PreviousPara.getNextSibling();

	                    Node node = ((Paragraph)PreviousPara).getParentNode().insertBefore(new Paragraph(doc), PreviousPara);
	                    builder.moveTo(node);
	                    builder.startBookmark("Bookmark" + bookmark);
	                    builder.moveTo(paragraph);
	                    builder.endBookmark("Bookmark" + bookmark);
	                    bookmark++;
	                }
	                
	            }
	        }
	    }
	}

	for (Bookmark bm : doc.getRange().getBookmarks())
	{
	    if(bm.getName().startsWith("Bookmark"))
	    {
	    	ArrayList nodes =  ExtractContentBetweenParagraphs((Paragraph)bm.getBookmarkStart().getParentNode(), (Paragraph) bm.getBookmarkEnd().getParentNode());
	        Document dstDoc = generateDocument(doc, nodes);

	        PageSetup sourcePageSetup = ((Paragraph)bm.getBookmarkStart().getParentNode()).getParentSection().getPageSetup();
	        dstDoc.getFirstSection().getPageSetup().setPaperSize(sourcePageSetup.getPaperSize());
	        dstDoc.getFirstSection().getPageSetup().setLeftMargin(sourcePageSetup.getLeftMargin());
	        dstDoc.getFirstSection().getPageSetup().setRightMargin(sourcePageSetup.getRightMargin());

	        dstDoc.updatePageLayout();
	        if(dstDoc.getLastSection().getBody().getLastParagraph().toString(SaveFormat.TEXT).trim().startsWith("Fig"))
	            dstDoc.getLastSection().getBody().getLastParagraph().remove();

	        while(dstDoc.getFirstSection().getBody().getFirstParagraph().getChildNodes(NodeType.SHAPE, true).getCount() == 0)
	            dstDoc.getFirstSection().getBody().getFirstParagraph().remove();

	        dstDoc.updatePageLayout();
	        String filename = bm.getBookmarkEnd().getParentNode().toString(SaveFormat.TEXT);
	        dstDoc.save(folderName + filename.substring(0, 7) + "_out.docx");
	        dstDoc.save(folderName + filename.substring(0, 7) + "_out.pdf");
	        dstDoc.save(folderName + filename.substring(0, 7) + "_out.jpeg");
	        i++;
	    }
	}

	//Add fig text at the place of images
	for (Bookmark bm : doc.getRange().getBookmarks()) {
	    if (bm.getName().startsWith("Bookmark")) {
	        bm.getBookmarkEnd().getParentNode().insertBefore(new BookmarkEnd(doc, bm.getName()), bm.getBookmarkEnd().getParentNode().getFirstChild());
	    }
	}
	doc.updatePageLayout();
	for (Bookmark bm : doc.getRange().getBookmarks()) {
	    if (bm.getName().startsWith("Bookmark")) {
	        bm.getBookmarkEnd().getParentNode().insertBefore(new BookmarkEnd(doc, bm.getName()), bm.getBookmarkEnd().getParentNode().getFirstChild());
	        String figText = bm.getBookmarkEnd().getParentNode().toString(SaveFormat.TEXT);
	        bm.setText("<Fig>"+figText.substring(0, 7)+"</Fig>" + ControlChar.PARAGRAPH_BREAK );
	        interimdoc.save(interim);
	    }
	}
}

static ArrayList ExtractContentBetweenParagraphs(Paragraph para1, Paragraph para2) throws Exception
{
    ArrayList nodes = new ArrayList();
    nodes.add(para1);
    Node currentNode = para1;
    while(currentNode != null && !currentNode.equals(para2))
    {
        currentNode = currentNode.getNextSibling();
        // Do not add null if para2 is never reached among the siblings.
        if(currentNode != null)
            nodes.add(currentNode);
    }

    return nodes;
}


public static Document generateDocument(Document srcDoc, ArrayList nodes) throws Exception {

    // Create a blank document.
    Document dstDoc = new Document();
    // Remove the first paragraph from the empty document.
    dstDoc.getFirstSection().getBody().removeAllChildren();

    // Import each node from the list into the new document. Keep the original formatting of the node.
    NodeImporter importer = new NodeImporter(srcDoc, dstDoc, ImportFormatMode.KEEP_SOURCE_FORMATTING);

    for (Node node : (Iterable<Node>) nodes) {
        Node importNode = importer.importNode(node, true);
        dstDoc.getFirstSection().getBody().appendChild(importNode);
    }

    // Return the generated document.
    return dstDoc;
}

tahir.manzoor · October 9, 2018, 1:16pm

@Saranya_Sekar

Thanks for sharing the document and code. We will check the use cases for this document and write the code examples. We will share the examples as soon as possible.

tahir.manzoor · October 15, 2018, 6:12am

@Saranya_Sekar

Please use the following code example to get the desired output. Please check the attachment. CodeExamples.zip (1.2 KB)

Document doc = new Document(MyDir + "Multiple_Label.docx");
DocumentBuilder builder = new DocumentBuilder(doc);
UseCase1(doc, builder);
ExtractImages(doc, "uc1");

UseCase2(doc, builder);
ExtractImages(doc, "uc2");

Saranya_Sekar · October 15, 2018, 6:55am

@tahir.manzoor
Thank you very much for the code sample.