How to get the content from a text box in a Word document

priyanga · February 21, 2018, 11:01am

Hi Team,

My requirement is to get the content of a text box in a Word document, remove the text box, and place the text in the same position.

Please help me resolve this issue.

Input document: Test.zip (1.5 MB)

Expected output document: ExpectedOutput.zip (1.5 MB)

Thanks & regards,
Priyanga G

tahir.manzoor · February 21, 2018, 1:44pm

@priyanga,

Thanks for your inquiry. Please use the following code example to get the desired output.

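import com.aspose.words.*;

// MyDir is the folder that contains the input document.
Document doc = new Document(MyDir + "test.docx");

NodeCollection paragraphs = doc.getChildNodes(NodeType.PARAGRAPH, true);
for (Paragraph paragraph : (Iterable<Paragraph>) paragraphs)
{
    // Process only caption paragraphs that sit inside a text box (Shape).
    if (paragraph.toString(SaveFormat.TEXT).trim().startsWith("FIGURE")
            && paragraph.getAncestor(NodeType.SHAPE) != null)
    {
        Shape shape = (Shape) paragraph.getAncestor(NodeType.SHAPE);
        // Insert each clone after the previous one, so that a text box with
        // several paragraphs keeps its original paragraph order.
        Node insertAfterNode = shape.getParentParagraph();
        for (Paragraph shpPara : (Iterable<Paragraph>) shape.getChildNodes(NodeType.PARAGRAPH, true))
        {
            Paragraph clonePara = (Paragraph) shpPara.deepClone(true);
            shape.getParentParagraph().getParentNode().insertAfter(clonePara, insertAfterNode);
            insertAfterNode = clonePara;
        }
        // The text now lives outside the text box, so the shape can be removed.
        shape.remove();
    }
}
doc.save(MyDir + "18.2.docx");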

priyanga · February 22, 2018, 1:07pm

Hi @tahir.manzoor ,

Thank you very much.

I am using the code above, and it works perfectly.

Thanks & regards,
Priyanga G

priyanga · March 2, 2018, 8:39am

Hi @tahir.manzoor,

Regarding the same query, case 1: some figure captions are placed to the left of the image.

Source input: Input.zip (1.2 MB)

Expected output: Expected output.zip (1.2 MB)

Please help me get the content (figure caption) of the text box, place the caption in the same position, and then remove the text box.

Thanks & Regards,
Priyanga.G

tahir.manzoor · March 2, 2018, 11:40am

@priyanga,

Thanks for your inquiry. The code example shared in my previous post works fine for your document. You just need to change the caption prefix in the following if condition:

if(paragraph.toString(SaveFormat.TEXT).trim().startsWith("Fig.")
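
Since the prefix varies between documents ("FIGURE" in the first sample, "Fig." here), the literal does not have to be edited for every file. The following is only a sketch in plain Java and is not part of Aspose.Words: the class CaptionPrefix and the method isFigureCaption are hypothetical names, and it assumes that every caption starts with "Fig", "Fig." or "Figure" (in any letter case) followed by a number.

```java
import java.util.regex.Pattern;

final class CaptionPrefix {
    // Assumption: a caption starts with "Fig", "Fig." or "Figure" (any case),
    // optionally followed by whitespace, and then a digit.
    private static final Pattern FIGURE_PREFIX =
            Pattern.compile("^fig(?:ure|\\.)?\\s*\\d", Pattern.CASE_INSENSITIVE);

    // Returns true when the trimmed text looks like a figure caption.
    static boolean isFigureCaption(String paragraphText) {
        if (paragraphText == null) {
            return false;
        }
        return FIGURE_PREFIX.matcher(paragraphText.trim()).find();
    }

    public static void main(String[] args) {
        System.out.println(isFigureCaption("FIGURE 1 Overview"));
        System.out.println(isFigureCaption("Fig. 2: Details"));
        System.out.println(isFigureCaption("See Fig. 1 for details"));
    }
}
```

The if condition could then call CaptionPrefix.isFigureCaption(paragraph.toString(SaveFormat.TEXT)) in place of the startsWith check.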

priyanga · March 5, 2018, 4:43am

Hi @tahir.manzoor,

Thank you very much. It is working fine for this document.

Case 2: I have made the same changes, but it is not working for the following document. Please help me get all figure captions from the text boxes, place them in the same position, and remove the text boxes.

Input document: test.zip (758.5 KB)

expected output: expected output.zip (728.0 KB)

Input document 2: NIPIRA1 - WPC - revision.zip (904.5 KB)

Thanks & regards,
priyanga G

tahir.manzoor · March 5, 2018, 4:09pm

@priyanga,

Thanks for your inquiry. In this case, we suggest you extract the content of the text box, get the paragraph next to the shape, and insert the figure caption before it. Please check the following modified code example.

import com.aspose.words.*;

Document doc = new Document(MyDir + "test.docx");

NodeCollection paragraphs = doc.getChildNodes(NodeType.PARAGRAPH, true);
for (Paragraph paragraph : (Iterable<Paragraph>) paragraphs)
{
    if (paragraph.toString(SaveFormat.TEXT).trim().startsWith("Fig")
            && paragraph.getAncestor(NodeType.SHAPE) != null)
    {
        Shape shape = (Shape) paragraph.getAncestor(NodeType.SHAPE);
        CompositeNode parent = shape.getParentParagraph().getParentNode();

        // Find the first following sibling that is a non-empty paragraph without shapes.
        // The search stops at the last sibling instead of dereferencing null, and it runs
        // once per text box so that all clones are inserted in their original order.
        Node node = shape.getParentParagraph().getNextSibling();
        while (node != null
                && !(node.getNodeType() == NodeType.PARAGRAPH
                        && ((Paragraph) node).getChildNodes(NodeType.SHAPE, true).getCount() == 0
                        && node.toString(SaveFormat.TEXT).trim().length() > 0))
        {
            node = node.getNextSibling();
        }

        for (Paragraph shpPara : (Iterable<Paragraph>) shape.getChildNodes(NodeType.PARAGRAPH, true))
        {
            Paragraph clonePara = (Paragraph) shpPara.deepClone(true);
            if (node != null)
                parent.insertBefore(clonePara, node); // caption goes before the text paragraph
            else
                parent.appendChild(clonePara);        // no such paragraph: append to the parent
        }
        shape.remove();
    }
}
doc.save(MyDir + "18.2.docx");

priyanga · March 6, 2018, 7:29am

Hi @tahir.manzoor,

Thanks for your feedback.

I have attached the sample inputs together with the outputs produced by the code above.

The changes are made on the first page of the document. Please help me resolve the issues.

Input 1: TextboxTestfile 1.zip (533.1 KB)
Expected output 1: TextboxTestfile 1_output.zip (533.0 KB)

Actual output 1: Testfile2_output.zip (750.6 KB)

Input 2: TextboxTestfile2.zip (829.3 KB)
Expected output 2: TextboxTestfile2_output.zip (831.0 KB)

Actual output 2: 18.2.zip (501.0 KB)

Thanks & regards,
Priyanga G.

tahir.manzoor · March 6, 2018, 5:13pm

@priyanga,

Thanks for your inquiry. We are investigating this issue and will get back to you soon.

priyanga · March 8, 2018, 4:11am

Hi @tahir.manzoor,

Thank you.
Waiting eagerly for your reply.

tahir.manzoor · March 8, 2018, 5:46am

@priyanga,

Thanks for your patience. In your shared document, we have found the following cases for figure captions and images.

1. The figure caption and image are in the same Paragraph node. The figure caption is inside a Shape node (text box).
2. The figure caption and image (Shape or GroupShape) are in different Paragraph nodes.

In both cases, we suggest the following solution.

1. Get the figure caption paragraph.
2. Clone the figure caption paragraph.
3. Use the CompositeNode.InsertAfter method to insert the cloned caption paragraph immediately after the Shape (image) node.
4. Remove the figure caption paragraph from step 1.

Please check the following code example. Hope this helps you.

import java.util.ArrayList;
import com.aspose.words.*;

Document doc = new Document(MyDir + "TextboxTestfile 1.docx");

// Caption paragraphs are collected here and removed after the loop.
ArrayList<Node> removeNodes = new ArrayList<Node>();
NodeCollection paragraphs = doc.getChildNodes(NodeType.PARAGRAPH, true);
for (Paragraph paragraph : (Iterable<Paragraph>) paragraphs)
{
    if (paragraph.toString(SaveFormat.TEXT).trim().startsWith("Fig")
            && paragraph.getAncestor(NodeType.SHAPE) != null)
    {
        // Paragraph that hosts the text box.
        Paragraph para = (Paragraph) paragraph.getAncestor(NodeType.SHAPE).getParentNode();
        Node next = para.getNextSibling();

        // The image (Shape or GroupShape) is in the next paragraph:
        // insert the caption after that paragraph.
        if (next != null && next.getNodeType() == NodeType.PARAGRAPH
                && (((Paragraph) next).getChildNodes(NodeType.SHAPE, true).getCount() > 0
                || ((Paragraph) next).getChildNodes(NodeType.GROUP_SHAPE, true).getCount() > 0))
        {
            para.getParentSection().getBody().insertAfter(paragraph.deepClone(true), next);
        }

        // The hosting paragraph holds more than one shape (text box plus image):
        // insert the caption after the hosting paragraph.
        if (para.getChildNodes(NodeType.SHAPE, true).getCount() > 1)
        {
            para.getParentSection().getBody().insertAfter(paragraph.deepClone(true), para);
        }

        removeNodes.add(paragraph);
    }
}

// Remove the original caption paragraphs from their text boxes.
for (Node node : removeNodes)
{
    node.remove();
}
doc.save(MyDir + "TextboxTestfile 1 - 18.3.docx");
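
The example above records the caption paragraphs in removeNodes and removes them only after the loop has finished. The same collect-first, remove-later idea is shown here on an ordinary Java list, where removing elements inside a for-each loop typically fails with a ConcurrentModificationException. This is a library-independent sketch: DeferredRemoval and removeCaptions are hypothetical names, and the strings merely stand in for paragraphs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class DeferredRemoval {
    // Returns a copy of the list without the entries whose trimmed text starts with "Fig".
    static List<String> removeCaptions(List<String> paragraphs) {
        List<String> result = new ArrayList<>(paragraphs);
        List<String> toRemove = new ArrayList<>();
        // Pass 1: only collect, so the list is not modified while it is iterated.
        for (String text : result) {
            if (text.trim().startsWith("Fig")) {
                toRemove.add(text);
            }
        }
        // Pass 2: remove everything that was collected.
        result.removeAll(toRemove);
        return result;
    }

    public static void main(String[] args) {
        List<String> paragraphs = Arrays.asList("Introduction", "Fig. 1 Sample caption", "Body text");
        System.out.println(removeCaptions(paragraphs));
    }
}
```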

priyanga · March 8, 2018, 6:45am

Hi @tahir.manzoor,

Thank you very much. I am excited about your solution.

It is working fine for figure captions.

I have another issue with figure labels such as a), b)…

Please help me remove the text box of each figure label and place the label in the same position.

Input - TextboxTestfile2.zip (829.3 KB)

expected output - TextboxTestfile2_output.zip (831.0 KB)

Thanks and regards,
Priyanga G

tahir.manzoor · March 8, 2018, 8:32am

@priyanga,

Thanks for your inquiry. Please ZIP and attach your expected output Word document here for our reference. We will then provide you with a code example accordingly.

priyanga · March 8, 2018, 10:03am

Hi @tahir.manzoor,
I have already attached the input and expected output in my previous post. Please refer to that post.

Regards,
Priyanga

tahir.manzoor · March 8, 2018, 3:04pm

@priyanga,

Thanks for sharing the detail. In your document, the text boxes and images are in the same paragraph. Please check the attached DOM image. DOM.png (5.8 KB)

You can use the same approach to get the paragraph and insert it after a specific image. We suggest the following solution.

1. Get the paragraph that has text such as a) or b) and is a child node of a Shape node.
2. Clone this paragraph.
3. Use the CompositeNode.InsertAfter method to insert it after the Shape node in the same paragraph that has the image. You can use Shape.HasImage to check whether a shape has an image.
4. Remove the parent node (Shape) of the paragraph from step 1.

Hope this helps you.

priyanga · March 9, 2018, 4:33am

Hi @tahir.manzoor

I get the paragraph with the following statement:

paragraph.toString(SaveFormat.TEXT).trim().startsWith("a)")…

It still shows an error. Please help me resolve this issue.

Thanks & regards,
priyanga G

tahir.manzoor · March 9, 2018, 4:14pm

@priyanga,

Thanks for your inquiry. Your document contains eight shapes: four contain text and four are images. The following code example shows how to extract the image and text from the Paragraph node that contains the text “a)” and add them to a separate Paragraph node.

import com.aspose.words.*;

Document doc = new Document(MyDir + "TextboxTestfile2.docx");

NodeCollection paragraphs = doc.getChildNodes(NodeType.PARAGRAPH, true);
for (Paragraph paragraph : (Iterable<Paragraph>) paragraphs)
{
    if (paragraph.toString(SaveFormat.TEXT).trim().startsWith("a)")
            && paragraph.getAncestor(NodeType.SHAPE) != null)
    {
        // All shapes of the paragraph that hosts the text box. The code relies on
        // shapes[0..3] being the text boxes and shapes[4..7] being the images.
        CompositeNode host = paragraph.getAncestor(NodeType.SHAPE).getParentNode();
        Node[] shapes = host.getChildNodes(NodeType.SHAPE, true).toArray();
        if (shapes.length == 8)
        {
            for (int i = 0; i < shapes.length / 2; i++)
            {
                // New paragraph that holds one image followed by its label text.
                Paragraph paragraph1 = new Paragraph(doc);
                paragraph.getParentSection().getBody().insertAfter(paragraph1, ((Shape) shapes[i]).getParentParagraph());
                paragraph1.appendChild((Shape) shapes[i + 4]);
                paragraph1.appendChild(new Run(doc, ((Shape) shapes[i]).toString(SaveFormat.TEXT)));
            }
            // Remove the hosting paragraph only after its content has been moved.
            host.remove();
        }
    }
}

doc.save(MyDir + "18.3.docx");
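
The example above is tied to exactly eight shapes: it pairs shapes[i] (label text box) with shapes[i + 4] (image). Assuming the same layout holds for other counts, that is, all label text boxes come first and the images follow in matching order, the pairing can be expressed for any even number of items. This is a plain Java sketch with hypothetical names (HalfPairing, pairHalves); strings stand in for shapes, and the layout assumption has not been checked against the attached documents.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class HalfPairing {
    // Pairs item i of the first half (label) with item i of the second half (image).
    // Each pair lists the image first and the label second, matching the order in
    // which the example above appends them to the new paragraph.
    static <T> List<List<T>> pairHalves(List<T> items) {
        if (items.size() % 2 != 0) {
            throw new IllegalArgumentException("expected an even number of items");
        }
        int half = items.size() / 2;
        List<List<T>> pairs = new ArrayList<>();
        for (int i = 0; i < half; i++) {
            List<T> pair = new ArrayList<>();
            pair.add(items.get(i + half)); // image
            pair.add(items.get(i));        // label
            pairs.add(pair);
        }
        return pairs;
    }

    public static void main(String[] args) {
        List<String> shapes = Arrays.asList("a)", "b)", "c)", "image1", "image2", "image3");
        System.out.println(pairHalves(shapes));
    }
}
```

With such a helper, the fixed check shapes.length == 8 and the offset 4 could be replaced by an even-length check and an offset of shapes.length / 2.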

priyanga · March 12, 2018, 4:44am

Hi @tahir.manzoor,

Thanks for your feedback.

It works for up to four shapes only.

I want to do the same for labels a), b), c), … p) for all figures (Fig. 1, 2, 3, …).

The previous code works for the first four images only, that is a), b), c), d).

Please help me resolve the issue, and please treat this scenario as high priority.

Input: TextboxTestfile2.zip (829.3 KB)

Expected output: TextboxTestfile2_output.zip (831.0 KB)

Actual output: 18.3.zip (750.3 KB)

Regards,
Priyanga G

tahir.manzoor · March 12, 2018, 3:32pm

@priyanga,

Thanks for your inquiry. There are two cases in this document.

Case 1: The eight text boxes are child nodes of a Paragraph.
Case 2: The text boxes are under the charts.

For the first case, you need to use a), b), c) or 1), 2), 3) in the if condition as shown below.

We are working on the second case and will share the code soon.

import com.aspose.words.*;

Document doc = new Document(MyDir + "TextboxTestfile2.docx");

// Work on a snapshot, because the loop inserts and removes paragraphs.
Node[] paragraphs = doc.getChildNodes(NodeType.PARAGRAPH, true).toArray();
for (Node node : paragraphs)
{
    Paragraph paragraph = (Paragraph) node;
    String text = paragraph.toString(SaveFormat.TEXT).trim();
    // a), e), i) and m) each start a group of four labels.
    boolean startsGroup = text.startsWith("a)") || text.startsWith("e)")
            || text.startsWith("i)") || text.startsWith("m)");
    if (startsGroup && paragraph.getAncestor(NodeType.SHAPE) != null)
    {
        CompositeNode host = paragraph.getAncestor(NodeType.SHAPE).getParentNode();
        Node[] shapes = host.getChildNodes(NodeType.SHAPE, true).toArray();
        if (shapes.length == 8)
        {
            Paragraph parentParagraph = (Paragraph) host;
            String parentText = parentParagraph.toString(SaveFormat.TEXT).trim();
            // First case: the hosting paragraph contains all four labels of one group.
            boolean wholeGroup =
                    (parentText.contains("a)") && parentText.contains("b)") && parentText.contains("c)") && parentText.contains("d)"))
                    || (parentText.contains("e)") && parentText.contains("f)") && parentText.contains("g)") && parentText.contains("h)"))
                    || (parentText.contains("i)") && parentText.contains("j)") && parentText.contains("k)") && parentText.contains("l)"))
                    || (parentText.contains("m)") && parentText.contains("n)") && parentText.contains("o)") && parentText.contains("p)"));
            if (wholeGroup)
            {
                for (int i = 0; i < shapes.length / 2; i++)
                {
                    // New paragraph that holds one image followed by its label text.
                    Paragraph paragraph1 = new Paragraph(doc);
                    paragraph.getParentSection().getBody().insertAfter(paragraph1, ((Shape) shapes[i]).getParentParagraph());
                    paragraph1.appendChild((Shape) shapes[i + 4]);
                    paragraph1.appendChild(new Run(doc, ((Shape) shapes[i]).toString(SaveFormat.TEXT)));
                }
                parentParagraph.remove();
            }
            else
            {
                // Code for second case
            }
        }
    }
}

doc.save(MyDir + "18.3.docx");
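
The repeated startsWith and contains checks above could be folded into a few small helpers. The following is a plain Java sketch rather than part of the Aspose.Words API: FigureLabels and its methods are hypothetical names, and it assumes that a label is a single lowercase letter from a to p followed by a closing parenthesis.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class FigureLabels {
    // Assumption: a label is one lowercase letter from a to p followed by ')'.
    private static final Pattern LABEL = Pattern.compile("^([a-p])\\)");

    // Returns the label letter, or '\0' when the trimmed text does not start with a label.
    static char labelOf(String text) {
        if (text == null) {
            return '\0';
        }
        Matcher m = LABEL.matcher(text.trim());
        return m.find() ? m.group(1).charAt(0) : '\0';
    }

    // True for a), e), i) and m): the first label of each group of four.
    static boolean startsGroupOfFour(String text) {
        char label = labelOf(text);
        return label != '\0' && (label - 'a') % 4 == 0;
    }

    // True when the text contains all four labels of the group that begins with 'first'.
    static boolean containsWholeGroup(String text, char first) {
        for (int k = 0; k < 4; k++) {
            if (!text.contains((char) (first + k) + ")")) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(startsGroupOfFour("e) second group"));
        System.out.println(containsWholeGroup("a) b) c) d)", 'a'));
    }
}
```

The outer condition could then use startsGroupOfFour(text), and the inner one containsWholeGroup(parentText, labelOf(text)), in place of the literal chains.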

tahir.manzoor · March 14, 2018, 7:00am

@priyanga,

For the second case (the text boxes under the chart), please use the following code example. Hope this helps you.

// Pass 1: copy each label paragraph out of its text box and insert the copy
// after the node that contains the text box.
for (Paragraph para : (Iterable<Paragraph>) doc.getChildNodes(NodeType.PARAGRAPH, true))
{
    String text = para.toString(SaveFormat.TEXT).trim();
    if ((text.contains("a)") || text.contains("b)") || text.contains("c)"))
            && para.getParentNode().getNodeType() == NodeType.SHAPE)
    {
        para.getParentSection().getBody().insertAfter(para.deepClone(true), para.getParentNode().getParentNode());
    }
}

// Pass 2: remove the text boxes. toArray() gives a snapshot, because nodes
// are removed while looping.
for (Node para : doc.getChildNodes(NodeType.PARAGRAPH, true).toArray())
{
    String text = para.toString(SaveFormat.TEXT).trim();
    if ((text.contains("a)") || text.contains("b)") || text.contains("c)"))
            && para.getParentNode().getNodeType() == NodeType.SHAPE)
    {
        para.getParentNode().remove();
    }
}
doc.save(MyDir + "18.3.docx");