Pagebreaks

PatrickVB · April 24, 2018, 10:55pm

Dear Team,

In the attached document there is a paragraph which has some text, then a page break, and then further text (on the next page), followed by a carriage return (the last paragraph on page 3).
Sample.zip (28.1 KB)

I would like to replace that pagebreak with a carriage return followed by a pagebreak. In fact, I would like to replace all pagebreaks which are not at the end of a paragraph with a carriage return followed by the pagebreak.

So if I have text like this

Some text {PAGEBREAK}
More text on the next page {CARRIAGE RETURN}

I would like this to become
Some text {CARRIAGE RETURN}
{PAGEBREAK}
More text on the next page {CARRIAGE RETURN}

or

Some text {PAGEBREAK} {CARRIAGE RETURN}
More text on the next page {CARRIAGE RETURN}

Pagebreaks which are already at the end of a paragraph do not need to be replaced. Example of such text would be the last paragraph of page 1 or last paragraphs of page 2.
So text like this

Some text {CARRIAGE RETURN}{PAGEBREAK}

Does not need to be replaced.

Thank you very much for your support.

Regards

Patrick Vanbrabant
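
The request above can be modelled on plain text, which makes the rule easy to check before touching a real document. The sketch below is only an illustration, not Aspose code: it assumes a page break is the form feed character `\f` and a paragraph end is `\r`, and the names `PageBreakRule` and `terminatePageBreaks` are made up for this example.

```java
import java.util.regex.Pattern;

final class PageBreakRule {
    // Assumption: "\f" stands for a page break and "\r" for a paragraph end.
    // The lookahead matches a page break that is NOT directly followed by "\r".
    private static final Pattern UNTERMINATED_BREAK = Pattern.compile("\f(?!\r)");

    /** Inserts a paragraph end after every page break that is not already followed by one. */
    static String terminatePageBreaks(String text) {
        return UNTERMINATED_BREAK.matcher(text).replaceAll("\f\r");
    }

    public static void main(String[] args) {
        String before = "Some text\fMore text on the next page\r";
        String after = terminatePageBreaks(before);
        // Print the control characters as visible placeholders.
        System.out.println(after
                .replace("\f", "{PAGEBREAK}")
                .replace("\r", "{CARRIAGE RETURN}\n"));
    }
}
```

A page break that is already followed by a paragraph end is left untouched, which corresponds to the "does not need to be replaced" case.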

awais.hafeez · April 25, 2018, 1:24am

@PatrickVB,

Please try using the following code:

Document doc = new Document(MyDir + @"Sample\sample.docx");

NodeCollection col = doc.GetChildNodes(NodeType.Run, true);
ArrayList list = new ArrayList();

foreach (Run run in col)
{
    if (run.Text.Equals(ControlChar.PageBreak))
    {
        if (!string.IsNullOrEmpty(run.ParentParagraph.ToString(SaveFormat.Text).Trim()))
        {
            list.Add(run);
                        
        }
    }
}
           
for (int i=0; i< col.Count; i++)
{
    Run run = (Run)col[i];
    foreach(Run listRun in list)
    {
        if (run == listRun)
        {
            Paragraph para = run.ParentParagraph;
            Paragraph pageBreakPara = (Paragraph)para.Clone(false);
            para.ParentNode.InsertAfter(pageBreakPara, para);

            if (run == para.LastChild)
            {
                pageBreakPara.Runs.Add(run);
            }
            else
            {
                Paragraph remainingPara = (Paragraph)para.Clone(false);
                pageBreakPara.ParentNode.InsertAfter(remainingPara, pageBreakPara);

                Node node = run.NextSibling;
                while (node != null)
                {
                    remainingPara.AppendChild(node);
                    node = node.NextSibling;
                }

                pageBreakPara.Runs.Add(run);
            }
        }
    }
}

doc.Save(MyDir + @"Sample\18.4.docx");

PatrickVB · April 25, 2018, 7:14am

Hi Awais,

Thank you very much for the response. I should have indicated that I’m using Aspose for Java, but the code is fine. I can read VB as well.

I will try later and inform you about the outcome.

Many thanks.

Patrick

awais.hafeez · April 25, 2018, 8:36am

@PatrickVB,

Please check the following Aspose.Words for Java code:

Document doc = new Document("D:\\Temp\\sample\\sample.docx");

NodeCollection col = doc.getChildNodes(NodeType.RUN, true);
ArrayList list = new ArrayList();

for (Run run : (Iterable<Run>) col)
{
    if (run.getText().equals(ControlChar.PAGE_BREAK))
    {
        if (!run.getParentParagraph().toString(SaveFormat.TEXT).trim().equals(""))
        {
            list.add(run);

        }
    }
}

for (int i=0; i< col.getCount(); i++)
{
    Run run = (Run)col.get(i);
    for(Run listRun : (Iterable<Run>) list)
    {
        if (run == listRun)
        {
            Paragraph para = run.getParentParagraph();
            Paragraph pageBreakPara = (Paragraph)para.deepClone(false);
            para.getParentNode().insertAfter(pageBreakPara, para);

            if (run == para.getLastChild())
            {
                pageBreakPara.getRuns().add(run);
            }
            else
            {
                Paragraph remainingPara = (Paragraph)para.deepClone(false);
                pageBreakPara.getParentNode().insertAfter(remainingPara, pageBreakPara);

                Node node = run.getNextSibling();
                while (node != null)
                {
                    remainingPara.appendChild(node);
                    node = node.getNextSibling();
                }

                pageBreakPara.getRuns().add(run);
            }
        }
    }
}

doc.save("D:\\Temp\\sample\\awjava-18.4.docx");

PatrickVB · April 25, 2018, 4:01pm

Hi Awais,

I have the impression that this piece of code can only function if the pagebreak is always in a separate run. Is this guaranteed to always be the case?
Because if the pagebreak is not in a separate run, that paragraph will simply be ignored completely.

Regards

Patrick
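
On the question of page breaks that share a run with other text: one possible approach is to cut the run's text at each page break first, so that every break ends up in a piece of its own, and then apply the detection as before. The sketch below shows only that text-cutting step on plain strings. It assumes the page break is the form feed character `\f`; `RunTextSplitter` and `splitAtPageBreaks` are hypothetical names, not part of Aspose.Words, and turning the pieces back into runs with the original formatting is left out.

```java
import java.util.ArrayList;
import java.util.List;

final class RunTextSplitter {
    // Assumption: a page break is represented by the form feed character.
    static final char PAGE_BREAK = '\f';

    /** Splits run text so that each page break becomes a separate piece. */
    static List<String> splitAtPageBreaks(String runText) {
        List<String> pieces = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (char c : runText.toCharArray()) {
            if (c == PAGE_BREAK) {
                // Flush the text collected so far, then emit the break on its own.
                if (current.length() > 0) {
                    pieces.add(current.toString());
                    current.setLength(0);
                }
                pieces.add(String.valueOf(PAGE_BREAK));
            } else {
                current.append(c);
            }
        }
        if (current.length() > 0) {
            pieces.add(current.toString());
        }
        return pieces;
    }

    public static void main(String[] args) {
        // Three pieces: the text before, the break itself, the text after.
        System.out.println(splitAtPageBreaks("WO 08/052568 A1\fCLAIMS").size());
    }
}
```

After such a split, a check like `text.equals("\f")` would match the middle piece even though the original run contained more than the break.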

PatrickVB · April 25, 2018, 4:38pm

Dear Awais,

The result is not quite what I expected. The code adds one carriage return too many, both for the paragraph on page 2 and for the paragraph on page 3. The situation on page 1 is correct and should not be changed.

Page 2.
The text of that paragraph was
Results can be seen in figure 4.[PAGEBREAK][CARRIAGE RETURN] (on the same line)

And this has been changed into
Results can be seen in figure 4. [CARRIAGE RETURN]
[PAGEBREAK][CARRIAGE RETURN]

Expected result for this paragraph would be no change.
Results can be seen in figure 4.[PAGEBREAK][CARRIAGE RETURN]

And for page 3
The original text was

WO 08/052568 A1[PAGEBREAK]CLAIMS[CR] (but CLAIMS is on a new page because of the pagebreak)

This is converted into
WO 08/052568 A1 [CR]
[PAGEBREAK][CR]
CLAIMS[CR]

Expected result is
WO 08/052568 A1 [PAGEBREAK][CR]
CLAIMS[CR]

I have attached the converted result and my expected result so you can see the differences.
The attached zip contains the original, the version updated using your provided code, and the expected result. You will see the slight difference between the updated and the expected.
Sample.zip (80.3 KB)

Many thanks for your support

Patrick

awais.hafeez · April 26, 2018, 3:14am

@PatrickVB,

We are checking this scenario and will get back to you soon.

PatrickVB · April 26, 2018, 11:25am

Dear Awais,

Any update on this issue?
Many thanks

Patrick

awais.hafeez · April 26, 2018, 12:12pm

@PatrickVB,

Please try the following code. I hope it produces the expected output:

Document doc = new Document("D:\\Temp\\Sample\\Sample.docx");

NodeCollection col = doc.getChildNodes(NodeType.RUN, true);
ArrayList list = new ArrayList();

for (Run run : (Iterable<Run>) col)
{
    if (run.getText().equals(ControlChar.PAGE_BREAK))
    {
        if (!run.getParentParagraph().toString(SaveFormat.TEXT).trim().equals(""))
        {
            list.add(run);

        }
    }
}

for (int i=0; i< col.getCount(); i++) {
    Run run = (Run) col.get(i);
    for (Run listRun : (Iterable<Run>) list) {
        if (run == listRun) {
            Paragraph para = run.getParentParagraph();

            if (run == para.getLastChild()) {

            } else {
                Paragraph remainingPara = (Paragraph) para.deepClone(false);
                para.getParentNode().insertAfter(remainingPara, para);

                Node node = run.getNextSibling();
                while (node != null) {
                    remainingPara.appendChild(node);
                    node = node.getNextSibling();
                }
            }
        }
    }
}

doc.save("D:\\Temp\\Sample\\awjava-18.4.docx");

PatrickVB · April 26, 2018, 12:58pm

Hi Awais,

Based on the sample file the code is working fine.
Many thanks.

Regards

Patrick

awais.hafeez · April 27, 2018, 12:28am

@PatrickVB,

Thanks for your feedback. Please let us know anytime you have any further queries.

PatrickVB · April 30, 2018, 9:22am

Dear Awais,

I have run the solution on a bit wider document set and there is something not quite right.
In the attached zip file you can find the sample, the updatedSample (the result of the algorithm) and the expected output.

You can see that the text part “It is an” is moved to the end of the paragraph. The paragraph breaking after the pagebreak is not done properly.

Could you please have a look at this scenario. It looks like the issue is arising from the fact that the word “Object” has a different layout and as such is part of a different run.

Many thanks.

Patrick
Sample.zip (65.6 KB)
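
The symptom described above is consistent with a common tree-traversal pitfall: if appending a node to a new parent detaches it from the old one, then asking the moved node for its next sibling looks in the new parent, and the loop stops after a single node. The sketch below reproduces this with a tiny stand-in model. `Item` and `Box` are hypothetical classes written for this illustration, not Aspose types, and the re-parenting behaviour of `append` is an assumption of the model.

```java
import java.util.ArrayList;
import java.util.List;

final class SiblingMoveDemo {
    /** Hypothetical stand-in for an inline node such as a run. */
    static final class Item {
        final String text;
        Box parent;
        Item(String text) { this.text = text; }

        /** Next sibling within the CURRENT parent, or null if there is none. */
        Item nextSibling() {
            if (parent == null) return null;
            int i = parent.children.indexOf(this);
            return i + 1 < parent.children.size() ? parent.children.get(i + 1) : null;
        }
    }

    /** Hypothetical stand-in for a paragraph. */
    static final class Box {
        final List<Item> children = new ArrayList<>();

        /** Assumption of this model: appending detaches the item from its old parent. */
        void append(Item item) {
            if (item.parent != null) item.parent.children.remove(item);
            item.parent = this;
            children.add(item);
        }

        String text() {
            StringBuilder sb = new StringBuilder();
            for (Item item : children) sb.append(item.text);
            return sb.toString();
        }
    }

    static Box paragraph(String... texts) {
        Box box = new Box();
        for (String t : texts) box.append(new Item(t));
        return box;
    }

    /** Stops after one item: the moved item has no next sibling in its new parent. */
    static void moveTailNaive(Item after, Box target) {
        Item node = after.nextSibling();
        while (node != null) {
            target.append(node);
            node = node.nextSibling();
        }
    }

    /** Remembers the next sibling BEFORE moving, so the whole tail is moved. */
    static void moveTailSafe(Item after, Box target) {
        Item node = after.nextSibling();
        while (node != null) {
            Item next = node.nextSibling();
            target.append(node);
            node = next;
        }
    }

    public static void main(String[] args) {
        Box para = paragraph("A1", "\f", "It is an ", "Object", " of");
        Box rest = new Box();
        moveTailNaive(para.children.get(1), rest);
        System.out.println(rest.text()); // only the first item after the break was moved
    }
}
```

In this model the naive loop moves only the first item after the break into the new paragraph, so the remaining items stay behind in the original one. Capturing the next sibling before the move, or snapshotting the children into an array first, avoids that.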

PatrickVB · April 30, 2018, 9:39am

And you can find a more elaborate example of the same symptom in sample2.zip.

Please look at the pagebreaks on pages 3, 6 and 12. In the updated document, these pagebreaks are not terminated by a carriage return.

The expected result is that each pagebreak which is not terminated by a carriage return gets terminated by one, as you can see in the expected result.

Looking forward to hearing from you.

Regards

Patrick
Sample2.zip (101.1 KB)

PatrickVB · April 30, 2018, 11:29am

Dear Awais,

I have crafted a solution myself, which is working in all mentioned scenarios.
Could you please review…

    Document doc = new Document("D:/Temp/Sample/Sample.docx");

    NodeCollection col = doc.getChildNodes(NodeType.RUN, true);
    ArrayList list = new ArrayList();

    for (Run run : (Iterable<Run>) col)
    {
        if (run.getText().equals(ControlChar.PAGE_BREAK))
        {
            if (!run.getParentParagraph().toString(SaveFormat.TEXT).trim().equals(""))
            {
                list.add(run);
            }
        }
    }

    for (int i=0; i< col.getCount(); i++) {
        Run run = (Run) col.get(i);
        for (Run listRun : (Iterable<Run>) list) {
            if (run == listRun) {
                Paragraph para = run.getParentParagraph();

                if (run == para.getLastChild()) {

                } else {
                    RunCollection runs = para.getRuns();
                    Run[] runElements = runs.toArray();

                    Paragraph remainingPara = (Paragraph) para.deepClone(false);
                    para.getParentNode().insertAfter(remainingPara, para);

                    int index = runs.indexOf(run);
                    for (int cnt = index+1; cnt<runElements.length; cnt++) {
                        Run currentRun = runElements[cnt];
                        remainingPara.appendChild(currentRun);
                    }
                }
            }
        }
    }

    doc.save("D:/Temp/Sample/updatedSample.docx");

Many thanks

Patrick

awais.hafeez · April 30, 2018, 11:43am

@PatrickVB,

You are right; there was an issue with my code.

Your code is good, but it will move only the Run nodes to remainingPara. Shapes and other inline nodes will be left behind. Here is an improved version of your code. I hope this helps:

Document doc = new Document("D:\\Temp\\Sample\\Sample.docx");

NodeCollection col = doc.getChildNodes(NodeType.RUN, true);
ArrayList list = new ArrayList();

for (Run run : (Iterable<Run>) col)
{
    if (run.getText().equals(ControlChar.PAGE_BREAK))
    {
        if (!run.getParentParagraph().toString(SaveFormat.TEXT).trim().equals(""))
        {
            list.add(run);

        }
    }
}

for (int i=0; i< col.getCount(); i++) {
    Run run = (Run) col.get(i);
    for (Run listRun : (Iterable<Run>) list) {
        if (run == listRun) {
            Paragraph para = run.getParentParagraph();

            if (run == para.getLastChild()) {

            }else {
                NodeCollection nodes = para.getChildNodes();
                Node[] nodeElements = nodes.toArray();

                Paragraph remainingPara = (Paragraph) para.deepClone(false);
                para.getParentNode().insertAfter(remainingPara, para);

                int index = nodes.indexOf(run);
                for (int cnt = index+1; cnt< nodeElements.length; cnt++) {
                    Node currentNode = nodeElements[cnt];
                    remainingPara.appendChild(currentNode);
                }
            }
        }
    }
}

doc.save("D:\\Temp\\Sample\\awjava-18.4.docx");

PatrickVB · April 30, 2018, 11:54am

You are completely right. Thanks for the improvement.

PatrickVB · April 30, 2018, 12:09pm

Dear Awais,

There is one thing which is unclear to me. Why are you not simply iterating over the list? Why is there this additional iteration over the collection of Runs (col)?
The logic is only executed when the Run from col is the same as the Run from list, so why not simply iterate over the Runs from list in the first place?

Thanks for the clarification

Patrick

awais.hafeez · April 30, 2018, 12:27pm

@PatrickVB,

Sure, you can simplify the code like this:

Document doc = new Document("D:\\Temp\\Sample\\Sample.docx");

NodeCollection col = doc.getChildNodes(NodeType.RUN, true);
ArrayList list = new ArrayList();

for (Run run : (Iterable<Run>) col) {
    if (run.getText().equals(ControlChar.PAGE_BREAK)) {
        if (!run.getParentParagraph().toString(SaveFormat.TEXT).trim().equals("")) {
            list.add(run);
        }
    }
}

for (int i = 0; i < list.size(); i++) {
    Run run = (Run) list.get(i);
    Paragraph para = run.getParentParagraph();

    // A page break that is already the last node of its paragraph needs no change.
    if (run == para.getLastChild()) {
        continue;
    }

    // Snapshot the children before moving anything out of the paragraph.
    NodeCollection nodes = para.getChildNodes();
    Node[] nodeElements = nodes.toArray();

    // Insert an empty copy of the paragraph (same formatting) right after it.
    Paragraph remainingPara = (Paragraph) para.deepClone(false);
    para.getParentNode().insertAfter(remainingPara, para);

    // Move every node that follows the page break into the new paragraph.
    int index = nodes.indexOf(run);
    for (int cnt = index + 1; cnt < nodeElements.length; cnt++) {
        remainingPara.appendChild(nodeElements[cnt]);
    }
}

doc.save("D:\\Temp\\Sample\\awjava-18.4.docx");