Words: How to obtain data for each paragraph and line?

bbhxwl · November 10, 2025, 4:04pm

How can I obtain the text of each paragraph and each line? I want to check whether the first word and the last word of each line are the words I am looking for. If they are, I will set the background color.

Professionalize.Discourse · November 10, 2025, 4:09pm

@bbhxwl

Summary
The current evidence does not contain information about how to obtain data for each paragraph and line in Aspose.Words, nor does it describe the LayoutCollector/LayoutEnumerator approach. Because of this, the detailed code example in the draft answer cannot be confirmed as correct based on the supplied sources.

Why it happens

The retrieved snippets discuss unrelated topics (list numbering, mail‑merge placeholders, PDF conversion) and do not mention line‑level layout APIs.
Without supporting documentation, we cannot verify that LayoutCollector.GetLineNumber or LayoutEnumerator behave as described, nor that shading runs will give the appearance of line‑level highlighting.

What to do next

Check the official Aspose.Words documentation for the version you are using (e.g., 25.10.0) to see if LayoutCollector and LayoutEnumerator are available and how they map nodes to visual lines.
If those classes exist, you can follow the general approach:

1. Create a LayoutCollector for the document.
2. Use a LayoutEnumerator (or similar API) to retrieve the visual line number for each Run.
3. Group runs by line number, build the line text, split into words, and compare the first and last words.
4. Apply background shading to the runs that belong to matching lines.

If the APIs are not present in your version, you may need to render the document to a format that gives line information (e.g., convert to PDF and analyze the text positions) or manually calculate line breaks based on paragraph width and font metrics.
Test with a small document where you know the expected first/last words, and verify the highlighting visually.
Consider alternative approaches such as using regular expressions on the paragraph text if exact line layout is not critical.
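The word comparison in the approach above (build the line text, split it into words, compare the first and last words) does not depend on any layout API and can be sketched on plain strings. This is only a minimal sketch: the class name LineWordCheck and the method Matches are hypothetical, and it assumes that a "word" is a run of \w characters, which suits space-separated text but not text without spaces between words (for example Chinese), where a different word rule would be needed.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of Aspose.Words.
public static class LineWordCheck
{
    // Returns true when the first word of the line equals startWord
    // and the last word of the line equals endWord (case-insensitive).
    public static bool Matches(string lineText, string startWord, string endWord)
    {
        // \w+ collects runs of letters, digits and underscores;
        // whitespace and punctuation around the words are ignored.
        MatchCollection words = Regex.Matches(lineText ?? string.Empty, @"\w+");
        if (words.Count == 0)
            return false;

        string first = words[0].Value;
        string last = words[words.Count - 1].Value;

        return string.Equals(first, startWord, StringComparison.OrdinalIgnoreCase)
            && string.Equals(last, endWord, StringComparison.OrdinalIgnoreCase);
    }
}
```

A call such as LineWordCheck.Matches(lineText, "word1", "word2") would then decide whether the runs of that line should be highlighted.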

Note: Because the evidence does not confirm the existence or behavior of the layout‑related classes, the above steps are suggestions that should be validated against the official Aspose.Words API reference for your specific SDK version.

I could not fully verify this explanation against the retrieved sources. Please double-check the behaviour in your environment and confirm the exact SDK/version.

bbhxwl · November 10, 2025, 4:40pm

        if (m.CheckAlignRightSide || m.CheckLineStartCharacter)
        {
            #region Record each paragraph

            var bkMap = new Dictionary<Paragraph, (BookmarkStart bs, BookmarkEnd be)>(pars.Count);
            Random r = new Random();
            for (int i = 0; i < pars.Count; i++)
            {
                var p0 = pars[i];
                if (p0.Count > 0 && p0.First()?.NodeType == NodeType.BookmarkStart &&
                    p0.Last()?.NodeType == NodeType.BookmarkEnd)
                {
                    p0.First().Remove();
                    p0.Last().Remove();
                }

                string bkName = $"_seg_{r.Next(1, 999999)}_" + i;
                var bs = new BookmarkStart(doc, bkName);
                var be = new BookmarkEnd(doc, bkName);
                p0.PrependChild(bs);
                p0.AppendChild(be);
                bkMap[p0] = (bs, be);
            }

            #endregion
            doc.UpdatePageLayout();
            var lc = new LayoutCollector(doc);
            var le = new LayoutEnumerator(doc);
            var end = new LayoutEnumerator(doc);
            // int count = 0;
            // int discoveryCount = 0;
            // bool stopAll = false;

            var marks = new List<(Paragraph p, int visibleIndex, int length, Color color)>(capacity: 256);
            foreach (var paragraph in pars)
            {
                int visibleOffset = 0;
                string prevText = "";
                var (bs, be) = bkMap[paragraph];
                var startEnt = lc.GetEntity(bs);
                var endEnt = lc.GetEntity(be);
                if (startEnt == null || endEnt == null) continue; // May be unavailable inside headers/shapes; skip

                // Position the two enumerators on the first line and the last line
                le.Current = startEnt;
                le.MoveParent(LayoutEntityType.Line);

                end.Current = endEnt;
                end.MoveParent(LayoutEntityType.Line);
                // "Fingerprint" of the last line (page index + rectangle, allowing a small tolerance)
                int lastPage = end.PageIndex;
                var lastRect = end.Rectangle;
                for (;;)
                {
                    // (1) Make sure we are at the Line level (avoids stitching text together across lines from Span/other nodes)
                    if (le.Type != LayoutEntityType.Line)
                    {
                        if (!le.MoveParent(LayoutEntityType.Line)) break;
                    }

                    // Process only the current line
                    var text = WordApi.GetLineText(le, doc, out int visibleLen);
                    Console.WriteLine(text);

                    var tem = WordApi.ExtractFirstUnit(text);
                    int lineStart = visibleOffset;
                    visibleOffset += visibleLen;

                    var isSymbolFirst = Regex.IsMatch(text, @"^[\p{P}\p{S}]", RegexOptions.CultureInvariant);
                    if (isSymbolFirst)
                        marks.Add((paragraph, lineStart, 1, Color.Blue));

                    // Note: prevText.Length >= 0 is always true, so this term never filters anything
                    if (m.CheckAlignRightSide && prevText.Length >= 0 &&
                        (WordApi.CountWithNumberRunsAsOne(prevText) % 2 != 0 || tem.Length > 1))
                    {
                        if (text.Length >= 2)
                        {
                            string fuHao = text[1].ToString();
                            if (WordApi.ContainsSymbolOrPunctuation(fuHao) || tem.Length > 1)
                            {
                                if (tem.Length > 1)
                                    marks.Add((paragraph, lineStart, tem.Length, Color.Yellow));
                                else
                                    marks.Add((paragraph, lineStart + 1, 1, Color.Yellow));
                            }
                        }
                    }

                    prevText = text;

                    // (2) Stop once the last line of this paragraph is reached (prevents crossing into another paragraph/container)
                    if (le.PageIndex == lastPage && NearlyEqual(le.Rectangle, lastRect))
                        break;

                    // (3) Advance in logical order, then move back up to the next line
                    if (!le.MoveNextLogical()) break;
                    if (le.Type != LayoutEntityType.Line)
                    {
                        if (!le.MoveParent(LayoutEntityType.Line)) break;
                    }
                }

                Console.WriteLine();
            }

            // Apply all highlights at once (avoids invalidating the layout while scanning)
            foreach (var mm in marks)
            {
                WordApi.HighlightVisibleRangeInParagraph(mm.p, mm.visibleIndex, mm.length, mm.color);
            }

            // Remove the BookmarkStart and BookmarkEnd nodes that were added
            for (int i = 0; i < pars.Count; i++)
            {
                var p0 = pars[i];
                if (p0.Count > 0 && p0.First()?.NodeType == NodeType.BookmarkStart &&
                    p0.Last()?.NodeType == NodeType.BookmarkEnd)
                {
                    p0.First().Remove();
                    p0.Last().Remove();
                }
            }
        }



    public static string GetLineText(LayoutEnumerator lineLe, Document doc, out int visibleLen)
    {
        var cur = new LayoutEnumerator(doc) { Current = lineLe.Current };
        var sb = new StringBuilder();

        if (cur.Type != LayoutEntityType.Line) // Fallback
            cur.MoveParent(LayoutEntityType.Line);

        if (cur.MoveFirstChild())
        {
            do
            {
                if (cur.Type == LayoutEntityType.Span)
                {
                    var kind = cur.Kind; // "TEXT" / "SPACE" / "TAB" / "FIELD_CODE" / "NUMBERING" ...
                    if (kind == "TEXT" || kind == "SPACE" || kind == "TAB" || kind == "NUMBERING")
                    {
                        if (!string.IsNullOrEmpty(cur.Text))
                        {
                            foreach (var ch in cur.Text)
                                if (!char.IsControl(ch)) sb.Append(ch);
                        }
                    }
                }
            } while (cur.MoveNext()); // Moves only among the children of this line
        }

        var s = sb.ToString();
        visibleLen = s.Length;
        return s;
    }

bbhxwl · November 10, 2025, 4:44pm

My code above still produces some inaccurate results when extracting the line text.

vyacheslav.deryushev · November 10, 2025, 6:49pm

@bbhxwl Try the following code to get the lines and then compare them with the words:

Document doc = new Document("input.docx");

string startWord = "word1";
string endWord = "word2";

doc.UpdatePageLayout();

LayoutEnumerator enumerator = new LayoutEnumerator(doc);

// Dictionary to store lines: Y position -> line text
SortedDictionary<float, string> lines = new SortedDictionary<float, string>();
// Navigate through the entire layout tree
NavigateLayoutTree(enumerator, lines);

// Now match lines to runs and highlight
NodeCollection paragraphs = doc.GetChildNodes(NodeType.Paragraph, true);

foreach (Paragraph para in paragraphs)
{
    string fullParaText = para.GetText().ToString();

    // Skip paragraphs that don't contain both words
    if (!fullParaText.Contains(startWord) ||
        !fullParaText.Contains(endWord))
    {
        continue;
    }

    List<Run> paraRuns = new List<Run>();
    StringBuilder paraText = new StringBuilder();

    foreach (Run run in para.GetChildNodes(NodeType.Run, true))
    {
        paraRuns.Add(run);
        paraText.Append(run.Text);
    }

    // Check each line to see if it's in this paragraph
    foreach (var line in lines)
    {
        string lineText = line.Value.Trim();

        if (string.IsNullOrWhiteSpace(lineText))
            continue;

        // Skip lines that don't contain both words
        if (!lineText.Contains(startWord) ||
            !lineText.Contains(endWord))
        {
            continue;
        }

        // Check if this line is part of this paragraph
        if (!fullParaText.Contains(lineText))
            continue;

        // Check if line matches criteria (starts with startWord and ends with endWord)
        bool startsWithWord = Regex.IsMatch(lineText, @"^" + Regex.Escape(startWord) + @"\b", RegexOptions.IgnoreCase);
        bool endsWithWord = Regex.IsMatch(lineText, @"\b" + Regex.Escape(endWord) + @"$", RegexOptions.IgnoreCase);

        if (startsWithWord && endsWithWord)
        {
            // Find position in paragraph
            int lineStart = fullParaText.IndexOf(lineText, StringComparison.Ordinal);
            if (lineStart >= 0)
            {
                int lineEnd = lineStart + lineText.Length;
                int currentPos = 0;

                // Highlight runs
                foreach (Run run in paraRuns)
                {
                    int runStart = currentPos;
                    int runEnd = currentPos + run.Text.Length;

                    if (runEnd > lineStart && runStart < lineEnd)
                    {
                        run.Font.HighlightColor = Color.Yellow;
                    }

                    currentPos += run.Text.Length;
                }
            }
        }
    }
}

doc.Save("output_highlighted.docx");

private void NavigateLayoutTree(LayoutEnumerator enumerator, SortedDictionary<float, string> lines)
{
    do
    {
        // Process current node
        if (enumerator.Type == LayoutEntityType.Line)
        {
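            // Note: Rectangle is relative to the page, and the dictionary is keyed by Y only,
            // so a line on another page (or in another column) with the same Y is skipped below.
            float y = enumerator.Rectangle.Y;
            string text = ExtractLineText(enumerator);

            if (!string.IsNullOrWhiteSpace(text) && !lines.ContainsKey(y))
            {
                lines[y] = text;
            }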
        }

        // Recursively process children
        if (enumerator.MoveFirstChild())
        {
            NavigateLayoutTree(enumerator, lines);
            enumerator.MoveParent();
        }
    }
    while (enumerator.MoveNext());
}

private string ExtractLineText(LayoutEnumerator enumerator)
{
    var text = new StringBuilder();

    if (enumerator.MoveFirstChild())
    {
        do
        {
            if (enumerator.Type == LayoutEntityType.Span)
            {
                text.Append(enumerator.Text);
            }

            // Recursively extract text from children
            text.Append(ExtractLineText(enumerator));

        } while (enumerator.MoveNext());

        enumerator.MoveParent();
    }

    return text.ToString();
}

bbhxwl · November 11, 2025, 2:24am

There is currently a bug: the text obtained for each line is correct, but it includes a paragraph end mark. However, para.GetText().ToString() does not contain that paragraph end mark, so the match never succeeds.

@vyacheslav.deryushev
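One possible workaround for the mismatch described above is to normalize both strings before comparing them. The sketch below is an assumption, not a confirmed fix: it supposes that the paragraph end shows up either as a control character such as '\r' or as a visible pilcrow '¶' (U+00B6), so it is worth printing the raw line text first to see what your version actually returns. The class name LineTextNormalizer and the method StripParagraphMarks are hypothetical.

```csharp
using System;
using System.Linq;

// Hypothetical helper, not part of Aspose.Words.
public static class LineTextNormalizer
{
    // Removes control characters (including '\r', '\n' and tabs) and the
    // pilcrow sign, then trims surrounding whitespace.
    public static string StripParagraphMarks(string text)
    {
        if (string.IsNullOrEmpty(text))
            return string.Empty;

        char[] kept = text
            .Where(c => !char.IsControl(c) && c != '\u00B6')
            .ToArray();

        return new string(kept).Trim();
    }
}
```

The check would then compare normalized text on both sides, for example StripParagraphMarks(fullParaText).Contains(StripParagraphMarks(lineText)). Because tabs are control characters and are removed too, the same function has to be applied to both strings.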

bbhxwl · November 11, 2025, 2:27am

Your NavigateLayoutTree method extracts each row of data, which is more perfect than mine. If you could extract which paragraph this row of data is in here, it might be even more perfect
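One way to find out which paragraph each extracted line belongs to, without touching the layout API again, is to align the line texts against the paragraph texts in order. The following is only a rough sketch under stated assumptions: the lines and the paragraphs are both supplied in reading order, the lines of a paragraph concatenate to that paragraph's text once whitespace, control characters and a '¶' mark are ignored, and the names LineParagraphMapper and AssignLines are hypothetical. Text that exists only in the layout (list numbering, field results) would break the alignment for that line, which is then reported as -1.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, not part of Aspose.Words.
public static class LineParagraphMapper
{
    // Keeps only visible, non-whitespace characters so spacing differences do not matter.
    private static string Squash(string s) =>
        new string((s ?? string.Empty)
            .Where(c => !char.IsControl(c) && !char.IsWhiteSpace(c) && c != '\u00B6')
            .ToArray());

    // For each line, returns the index of the paragraph it was matched to, or -1.
    public static int[] AssignLines(IReadOnlyList<string> paragraphs, IReadOnlyList<string> lines)
    {
        string[] squashed = paragraphs.Select(Squash).ToArray();
        var result = new int[lines.Count];
        int current = 0;   // paragraph currently being consumed
        int consumed = 0;  // characters of that paragraph already matched

        for (int i = 0; i < lines.Count; i++)
        {
            result[i] = -1;
            string line = Squash(lines[i]);
            if (line.Length == 0)
                continue;

            // Try the rest of the current paragraph first, then later paragraphs from their start.
            int q = current, offset = consumed;
            while (q < squashed.Length)
            {
                string text = squashed[q];
                if (offset + line.Length <= text.Length &&
                    string.CompareOrdinal(text, offset, line, 0, line.Length) == 0)
                {
                    result[i] = q;
                    current = q;
                    consumed = offset + line.Length;
                    break;
                }
                q++;
                offset = 0;
            }
        }
        return result;
    }
}
```

The paragraph texts could come from para.GetText() and the line texts from NavigateLayoutTree, provided the lines are kept in reading order rather than sorted by Y alone. The bookmark-based code earlier in this thread is the layout-based alternative: it starts from each paragraph and walks its lines, so the paragraph is known by construction.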