Aspose Word - Unable to separate header text from parsing

omer.asalm · June 8, 2022, 2:41pm

Hi,
We are a licensed Aspose user. I am trying to parse a Word file, and we want to skip the text in the header. First of all, I am using the following code to get the nodes for the header and footer:

var headerFooterNodes = documnet.FirstSection.HeadersFooters[HeaderFooterType.HeaderPrimary].GetChildNodes(NodeType.Paragraph, true);

but I am not getting any nodes. Later I found that I can get the text with the following code:

var headerFooterNodes = documnet.FirstSection.Body.GetChildNodes(NodeType.Paragraph, true);

So I want to know why I am getting the text in the body and not in the header/footer. I have attached the file that I am using for your reference.

Secondly, I am working on a problem that identifies wrong paragraph breaks and merges the affected paragraphs into a single paragraph. For this I want to skip the text in the header and footer. Can you show me how to parse the file paragraph by paragraph while skipping the header and footer text? Looking forward to your response.

page-2.docx (20.0 KB)

alexey.noskov · June 8, 2022, 5:23pm

@omer.asalm The problem is that your document actually has neither a header nor a footer. The content at the top of your document is simply an absolutely positioned group shape with textboxes and images. So in your case you can simply skip the content of the group shape or of the textboxes.

Document documnet = new Document(@"C:\Temp\in.docx");

NodeCollection paragraphs = documnet.GetChildNodes(NodeType.Paragraph, true);
foreach (Paragraph p in paragraphs)
{
    // skip paragraphs in textboxes
    if (p.GetAncestor(NodeType.Shape) != null)
        continue;

    // process other paragraphs normally. 
}
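
The same idea can be extended to also exclude real headers and footers when a document has them. The following is only a minimal sketch, not a tested solution for the attached file: it assumes that the `HeadersFooters` indexer returns null when a section has no header of the requested type, and it reuses the hypothetical path C:\Temp\in.docx from the snippet above.

```csharp
using System;
using Aspose.Words;

Document doc = new Document(@"C:\Temp\in.docx");

// Report whether each section really has a primary header.
// Assumption: the indexer yields null when no such header exists.
foreach (Section section in doc.Sections)
{
    HeaderFooter header = section.HeadersFooters[HeaderFooterType.HeaderPrimary];
    Console.WriteLine(header == null ? "No primary header" : header.GetText());
}

foreach (Paragraph p in doc.GetChildNodes(NodeType.Paragraph, true))
{
    // Skip paragraphs that live in a real header or footer.
    if (p.GetAncestor(NodeType.HeaderFooter) != null)
        continue;

    // Skip paragraphs inside textboxes (shapes), including grouped ones.
    if (p.GetAncestor(NodeType.Shape) != null || p.GetAncestor(NodeType.GroupShape) != null)
        continue;

    // Only main-story paragraphs reach this point.
    Console.WriteLine(p.GetText().Trim());
}
```

Checking ancestors on each paragraph keeps a single loop over the whole document, so the processing code does not need to know in advance whether the top-of-page content is a header or a floating shape.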

omer.asalm · June 10, 2022, 8:07am

@alexey.noskov Thank you for your prompt reply and for identifying the problem. The check doesn’t skip the shape at paragraph level, but after making an adjustment it worked:

Document documnet = new Document(@"C:\Temp\in.docx");

NodeCollection paragraphs = documnet.GetChildNodes(NodeType.Paragraph, true);
foreach (Paragraph p in paragraphs)
{
    // skip paragraphs in textboxes
    var lastNode = p.GetChildNodes(NodeType.Run, true).Last() as Run;
    if (lastNode.GetAncestor(NodeType.Shape) != null)
        continue;

    // process other paragraphs normally. 
}

But another problem I found is that for some paragraphs the font size is not the same as what I see in the Word file. For example, in the second paragraph the font size is 9, but for the line “Infoboxen am Ende dieser Kapitel” the font size that I am getting is 12, which is not correct. Can you tell me what the problem is here?

alexey.noskov · June 10, 2022, 8:51am

@omer.asalm As far as I can see, the font size is returned properly. Here is the simple code I used for testing:

Document documnet = new Document(@"C:\Temp\in.docx");

NodeCollection paragraphs = documnet.GetChildNodes(NodeType.Paragraph, true);
foreach (Paragraph p in paragraphs)
{
    // skip paragraphs in textboxes
    if (p.GetAncestor(NodeType.Shape) != null)
        continue;

    // process other paragraphs normally. 
    Console.WriteLine(p.ParagraphBreakFont.Size);
    foreach (Run r in p.Runs)
    {
        Console.WriteLine("\t{0}", r.Font.Size);
        Console.WriteLine("\t{0}", r.Text);
    }
}

omer.asalm · June 10, 2022, 10:00am

Hi @alexey.noskov, thanks for your code. Surprisingly, the problem solved itself. Here is the final problem that I am facing. I am working on code that detects wrong paragraph breaks and merges those paragraphs into one. Here is the code that I am using:

var wordFile = Path.Join("AbbyFiles/page-2.docx");
var documnet = new Aspose.Words.Document(wordFile);
var paragraphList = documnet.GetChildNodes(NodeType.Paragraph, true)
    .Where(p => p.ParentNode.NodeType == NodeType.Body && p.ToString(SaveFormat.Text).Trim() != "");

var paragraphs = paragraphList as Node[] ?? paragraphList.ToArray();
if (paragraphs.Any())
{
    for (int i = 0; i < paragraphs.Length - 1; i++)
    {
        var firstPara = (Paragraph)paragraphs.ElementAt(i);
        var secondPara = (Paragraph)paragraphs.ElementAt(i + 1);
        var firstNode = firstPara.GetChildNodes(NodeType.Run, true).Last() as Run;
        var secondNode = secondPara.GetChildNodes(NodeType.Run, true).First() as Run;
        if (firstNode.GetAncestor(NodeType.Shape) != null || secondNode.GetAncestor(NodeType.Shape) != null)
            continue;

        // process other paragraphs normally.
        if (isWrongParagraphBreak(firstNode, secondNode))
        {
            firstPara.Runs[^1].Text = firstPara.Runs[^1].Text.Trim() + " ";
            foreach (Node node in secondPara)
            {
                firstPara.AppendChild(node.Clone(true));
            }
            secondPara.Remove();
        }
    }
    documnet.Save("AbbyFiles/Integrierter-out.docx");
}


bool isWrongParagraphBreak(Run node1, Run node2)
{
    bool hasSameFont = node1.Font.StyleName == node2.Font.StyleName;
    bool hasSameFontSize = Math.Abs(node1.Font.Size - node2.Font.Size) == 0 ||
                            Math.Abs(node1.Font.SizeBi - node2.Font.SizeBi) == 0;
    bool areBold = node1.Font.Bold == node2.Font.Bold;
    bool textNotEnded = !node1.Text.Trim().EndsWith('.');
    bool secondParaStartsWithSmallCase = char.IsLower(node2.Text.Trim()[0]);
    return hasSameFont && hasSameFontSize && textNotEnded; //&& secondParaStartsWithSmallCase;
}

The problem occurs when merging the second-to-last line of the second paragraph (please also see the attached screenshot for the lines), which is

begrenzter Sicherheit geprüft wurden. Weitere Informationen zur Prüfungssicherheit finden Sie in

and

begrenzter Sicherheit geprüft wurden. Weitere Informationen zur Prüfungssicherheit finden Sie in

When the code merges them into one paragraph and removes the second paragraph, the text from the second paragraph is somehow missing from the output file. Can you also investigate this? Thanks.

Screenshot 2022-06-10 at 11.55.59.png (20.9 KB)

alexey.noskov · June 10, 2022, 2:53pm

@omer.asalm I have modified your code. It now works correctly:

var documnet = new Aspose.Words.Document(@"C:\Temp\in.docx");
var paragraphList = documnet.GetChildNodes(NodeType.Paragraph, true)
    .Where(p => p.ParentNode.NodeType == NodeType.Body && p.ToString(SaveFormat.Text).Trim() != "");

var paragraphs = paragraphList as Node[] ?? paragraphList.ToArray();
if (paragraphs.Any())
{
    foreach (Paragraph p in paragraphs)
    {
        var firstPara = p;
        var secondPara = p.NextSibling as Paragraph;
        if (secondPara == null)
            continue;

        // Skip paragraphs without runs.
        if (firstPara.Runs.Count == 0 || secondPara.Runs.Count == 0)
            continue;

        var firstNode = firstPara.Runs[^1];
        var secondNode = secondPara.Runs[0];

        if (firstNode.GetAncestor(NodeType.Shape) != null || secondNode.GetAncestor(NodeType.Shape) != null)
            continue;

        // process other paragraphs normally.
        if (isWrongParagraphBreak(firstNode, secondNode))
        {
            firstNode.Text = firstNode.Text.TrimEnd() + " ";
            while(secondPara.HasChildNodes)
            {
                firstPara.AppendChild(secondPara.FirstChild);
            }
            secondPara.Remove();
        }
    }
    documnet.Save(@"C:\Temp\out.docx");
}

omer.asalm · June 10, 2022, 6:18pm

Hi @alexey.noskov, first of all, thank you for putting your time into this. Unfortunately, it is not fixed yet. Please refer to the screenshot: your code didn’t merge the paragraphs. If you enable the Formatting marks option in Word, you will see this.

Screenshot 2022-06-10 at 20.14.58.png (10.4 KB)

alexey.noskov · June 11, 2022, 4:53am

@omer.asalm Thank you for the additional information. The problem occurs because the original code processes only a pair of paragraphs at a time, but in this case there are 3 paragraphs to merge. To resolve this, you should use a while loop. I have modified the code:

var documnet = new Aspose.Words.Document(@"C:\Temp\in.docx");
var paragraphList = documnet.GetChildNodes(NodeType.Paragraph, true)
    .Where(p => p.ParentNode.NodeType == NodeType.Body && p.ToString(SaveFormat.Text).Trim() != "");

var paragraphs = paragraphList as Node[] ?? paragraphList.ToArray();
if (paragraphs.Any())
{
    foreach (Paragraph p in paragraphs)
    {
        var firstPara = p;
        var secondPara = p.NextSibling as Paragraph;
       
        // process other paragraphs normally.
        while (isWrongParagraphBreak(firstPara, secondPara))
        {
            firstPara.Runs[^1].Text = firstPara.Runs[^1].Text.TrimEnd() + " ";
            while(secondPara.HasChildNodes)
                firstPara.AppendChild(secondPara.FirstChild);

            Paragraph nextPara = secondPara.NextSibling as Paragraph;
            secondPara.Remove();
            secondPara = nextPara;
        }
    }
    documnet.Save(@"C:\Temp\out.docx");
}

bool isWrongParagraphBreak(Paragraph firstPara, Paragraph secondPara)
{
    if (secondPara == null)
        return false;

    // Skip paragraphs without runs.
    if (firstPara.Runs.Count == 0 || secondPara.Runs.Count == 0)
        return false;

    var firstNode = firstPara.Runs[^1];
    var secondNode = secondPara.Runs[0];

    if (firstNode.GetAncestor(NodeType.Shape) != null || secondNode.GetAncestor(NodeType.Shape) != null)
        return false;

    bool hasSameFont = firstNode.Font.StyleName == secondNode.Font.StyleName;
    bool hasSameFontSize = Math.Abs(firstNode.Font.Size - secondNode.Font.Size) == 0 ||
                            Math.Abs(firstNode.Font.SizeBi - secondNode.Font.SizeBi) == 0;
    bool areBold = firstNode.Font.Bold == secondNode.Font.Bold;
    bool textNotEnded = !firstNode.Text.Trim().EndsWith(".");
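    // Guard against a whitespace-only run, where indexing [0] would throw.
    string secondText = secondNode.Text.Trim();
    bool secondParaStartsWithSmallCase = secondText.Length > 0 && char.IsLower(secondText[0]);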
    return hasSameFont && hasSameFontSize && textNotEnded; //&& secondParaStartsWithSmallCase;
}
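
To see why the inner while loop matters when three or more fragments belong together, here is a small self-contained sketch that models paragraphs as plain strings, with no Aspose types involved. The class `ParagraphMergeSketch` and its `IsWrongBreak` rule (a break is wrong when the text before it does not end with a period) are hypothetical stand-ins for the real check above, chosen only to keep the example short.

```csharp
using System;
using System.Collections.Generic;

public static class ParagraphMergeSketch
{
    // Hypothetical stand-in for isWrongParagraphBreak:
    // the break is "wrong" if the text before it does not end with a period.
    public static bool IsWrongBreak(string first)
    {
        return !first.TrimEnd().EndsWith(".", StringComparison.Ordinal);
    }

    public static List<string> MergeChained(List<string> paragraphs)
    {
        // Work on a copy so the caller's list is left untouched.
        var result = new List<string>(paragraphs);
        for (int i = 0; i < result.Count; i++)
        {
            // Keep absorbing the next entry while the break after entry i looks wrong.
            // A single "if" here would stop after one merge and leave a third fragment behind.
            while (i + 1 < result.Count && IsWrongBreak(result[i]))
            {
                result[i] = result[i].TrimEnd() + " " + result[i + 1].TrimStart();
                result.RemoveAt(i + 1);
            }
        }
        return result;
    }
}
```

Calling `ParagraphMergeSketch.MergeChained` on the four entries "A b", "c d", "e f." and "G h." gives two entries, "A b c d e f." and "G h.", because the first entry keeps absorbing its successor until it ends with a period.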

omer.asalm · June 12, 2022, 11:15am

Hi @alexey.noskov, thanks a lot for your modification. It works for the current file. I am going to test this on a couple more files, but again, thanks for your prompt help!