Weird behaviour regarding Paragraphs

Can you suggest why a paragraph imported using the Document Library is sometimes split into multiple Runs instead of just one?

"Business Case and Benefits of Product" should be a single Run, but it is being split into 5 text Runs instead.

I am extracting that information using the following code:

    private static string GetHeader(string id, Paragraph paragraph, Dictionary<string,string> headerToId)
    {
        var headerText = "";
        var newId = id + "#P_0";

        // Console.WriteLine("Inspecting " + paragraph.GetText() + " ");
        
        if (paragraph.ParagraphFormat.Alignment != ParagraphAlignment.Justify &&
            paragraph.ParagraphFormat.Alignment != ParagraphAlignment.Left)
        {
            // Console.WriteLine("Rejected for Alignment");
            return "";
        }

        if (paragraph.Count == 0)
        {
            // Console.WriteLine("Rejected for Child Count");
            return "";
        }
        
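        // Fetch the direct children once instead of rebuilding the collection on every iteration.
        var children = paragraph.GetChildNodes(NodeType.Any, false);
        for (var i = 0; i < children.Count; i++)
        {
            var paragraphChild = children[i];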
            if (paragraphChild is BookmarkStart || paragraphChild is BookmarkEnd)
            {
                continue;
            }
            
            if (!(paragraphChild is Run))
            {
                // Console.WriteLine("Rejected for Child is Not RUN");
                // Console.WriteLine(paragraphChild.NodeType);
                return "";
            }
            
            var run = (Run)paragraphChild;
            if (!run.Font.Bold)
            {
                // Console.WriteLine("Rejected for Bold");
                return "";
            }

            if (run.Text.Trim().Length == 0 && paragraph.Count == 1)
            {
                // Prevents a whitespace-only paragraph from being detected as a header
                return "";
            }
            
            // Console.WriteLine(run.Text);
            if (run.Text.ToUpper().Trim() != run.Text.Trim())
            {
                // Console.WriteLine("Rejected for Trim");
                return "";
            }
            headerText += run.Text;
        }
        
        headerToId[headerText.Trim()] = newId;
        return headerText.Trim();
    }
    

@h1021010231023a

It is common for the text of a paragraph to be split into several Run objects instead of a single Run. This usually comes from how the text is formatted and structured inside the paragraph:

  1. Text Formatting: If the text within the paragraph has varying formatting (e.g., different font sizes, styles, or colors), Aspose.Words will create separate Run objects for each segment of text that has different formatting. This is a common occurrence when text is copied from different sources or when manual formatting changes are applied.

  2. Paragraph Structure: Inline nodes that sit between runs, such as the BookmarkStart and BookmarkEnd nodes your code already skips, split the surrounding text into separate Run objects. Word may also start a new run for reasons that do not show up as visible formatting, for example when the text was edited at different times.

  3. Document Conversion: If the document is being imported from another format (like DOCX or RTF), the conversion process might result in multiple runs due to the way the original document was structured.

To address this issue, you can consider the following approaches:

  • Join Runs: If you want to consolidate multiple runs into a single run, you can use the Paragraph.JoinRunsWithSameFormatting method. It merges adjacent runs that have the same formatting and returns the number of joins performed, which can help simplify the paragraph structure.

    Here’s a sample code snippet to demonstrate how to join runs:

    Document doc = new Document("path/to/your/document.docx");
    Paragraph paragraph = doc.FirstSection.Body.FirstParagraph;
    
    // Join runs with the same formatting
    int mergedRunsCount = paragraph.JoinRunsWithSameFormatting();
    Console.WriteLine($"Number of merged runs: {mergedRunsCount}");
    
  • Inspect Formatting: Before processing the paragraph, you can inspect the formatting of each run to understand why they are being split. This can help you identify any inconsistencies that may be causing the issue.

  • Clean Up Text: If the text contains unwanted characters or formatting, consider cleaning it up before processing. This can involve removing extra spaces, line breaks, or any other formatting that may lead to multiple runs.
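The "Inspect Formatting" suggestion above could be sketched roughly as follows. This is a minimal example, not a definitive diagnostic: the file name `input.docx` and the class name `RunInspector` are placeholders, and the printed attributes are only a handful of the properties that can differ between adjacent runs. Runs may also be split for reasons that none of these properties reveal.

```csharp
using System;
using Aspose.Words;

class RunInspector
{
    static void Main()
    {
        // "input.docx" is a placeholder path, not a file from this thread.
        Document doc = new Document("input.docx");

        foreach (Paragraph paragraph in doc.GetChildNodes(NodeType.Paragraph, true))
        {
            Console.WriteLine($"Paragraph with {paragraph.Runs.Count} run(s):");

            foreach (Run run in paragraph.Runs)
            {
                Font font = run.Font;

                // Print a few attributes that commonly differ between adjacent runs.
                Console.WriteLine(
                    $"  \"{run.Text}\" | font={font.Name} size={font.Size} " +
                    $"bold={font.Bold} italic={font.Italic} locale={font.LocaleId}");
            }
        }
    }
}
```

If two neighbouring runs print identical values, they are candidates for joining. If they differ, for example in locale, the split reflects a real formatting difference in the document.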

By applying these strategies, you should be able to manage the paragraph structure more effectively and reduce the number of runs created during the import process. If you continue to experience issues, please provide additional details about the source of the document and any specific formatting that may be present.

@h1021010231023a Could you please attach your input document here for testing? We will check it and provide you more information.

It is quite rare for all the text of a paragraph to be represented as a single Run. MS Word creates new runs when a document is edited. You can try using the Document.JoinRunsWithSameFormatting method to join runs with the same formatting.
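As a rough illustration of that suggestion, the sketch below joins runs across the whole document and then prints the run count of each paragraph. The path `input.docx` and the class name `JoinRunsExample` are placeholders. Runs whose formatting really differs will stay separate, so header detection should still tolerate paragraphs with more than one run.

```csharp
using System;
using Aspose.Words;

class JoinRunsExample
{
    static void Main()
    {
        // "input.docx" is a placeholder path.
        Document doc = new Document("input.docx");

        // Merge adjacent runs with identical formatting in every paragraph.
        int joins = doc.JoinRunsWithSameFormatting();
        Console.WriteLine($"Joins performed: {joins}");

        foreach (Paragraph paragraph in doc.GetChildNodes(NodeType.Paragraph, true))
        {
            // Show how many runs remain next to the paragraph's plain text.
            string text = paragraph.ToString(SaveFormat.Text).Trim();
            Console.WriteLine($"{paragraph.Runs.Count} run(s): {text}");
        }
    }
}
```

Calling the method once on the document, before any per-paragraph processing such as GetHeader, avoids repeating the join for every paragraph.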