How to read body text line by line?

ducaisoft · May 5, 2023, 12:47am

Hi, Support:
Is there any method to read the text of a body range line by line? It seems the text can only be read paragraph by paragraph or character by character.

Thanks for your help.

alexey.noskov · May 5, 2023, 4:37am

@ducaisoft As you know, there is no concept of a page or a line in MS Word documents because of their flow nature. Consumer applications build the document layout on the fly, and Aspose.Words does the same using its own layout engine. The LayoutCollector and LayoutEnumerator classes provide limited access to document layout information.
For example, the following code demonstrates the basic technique of splitting document content into lines:

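// Requires: using System; using System.Collections.Generic; using System.Linq;
// using Aspose.Words; using Aspose.Words.Layout;
Document doc = new Document(@"C:\Temp\in.docx");

// Split all Run nodes in the document so that each one contains at most one word.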
List<Run> runs = doc.GetChildNodes(NodeType.Run, true).Cast<Run>().ToList();
foreach (Run r in runs)
{
    Run current = r;
    while (current.Text.IndexOf(' ') >= 0)
        current = SplitRun(current, current.Text.IndexOf(' ') + 1);
}

// Wrap all runs in the document with bookmarks to make it possible to work with LayoutCollector and LayoutEnumerator.
runs = doc.GetChildNodes(NodeType.Run, true).Cast<Run>().ToList();

List<string> tmpBookmakrs = new List<string>();
int bkIndex = 0;
foreach (Run r in runs)
{
    // LayoutCollector and LayoutEnumerator do not work with nodes in headers/footers or in textboxes.
    if (r.GetAncestor(NodeType.HeaderFooter) != null || r.GetAncestor(NodeType.Shape) != null)
        continue;

    BookmarkStart start = new BookmarkStart(doc, string.Format("r{0}", bkIndex));
    BookmarkEnd end = new BookmarkEnd(doc, start.Name);

    r.ParentNode.InsertBefore(start, r);
    r.ParentNode.InsertAfter(end, r);

    tmpBookmakrs.Add(start.Name);
    bkIndex++;
}

// Now we can use the collector and the enumerator to get the runs on each line of the document.
LayoutCollector collector = new LayoutCollector(doc);
LayoutEnumerator enumerator = new LayoutEnumerator(doc);

object currentLine = null;
foreach (string bkName in tmpBookmakrs)
{
    Bookmark bk = doc.Range.Bookmarks[bkName];

    enumerator.Current = collector.GetEntity(bk.BookmarkStart);
    while (enumerator.Type != LayoutEntityType.Line)
        enumerator.MoveParent();

    if (currentLine != enumerator.Current)
    {
        currentLine = enumerator.Current;

        Console.WriteLine();
        Console.WriteLine("-------=========Start Of Line=========-------");
    }

    Run run = bk.BookmarkStart.NextSibling as Run;
    if (run != null)
        Console.Write(run.Text);
}

private static Run SplitRun(Run run, int position)
{
    Run afterRun = (Run)run.Clone(true);
    run.ParentNode.InsertAfter(afterRun, run);
    afterRun.Text = run.Text.Substring(position);
    run.Text = run.Text.Substring(0, position);
    return afterRun;
}

ducaisoft · May 5, 2023, 8:35am

It looks like this cannot split each paragraph line by line.

alexey.noskov · May 5, 2023, 8:44am

@ducaisoft You can process a particular paragraph using the provided code - the approach is the same.

ducaisoft · May 5, 2023, 8:56am

Yes! I tested it, but it only extracts the individual runs into the list; it does not extract each line into the list.
Maybe this code works correctly in C++ but fails in .NET. Could you translate it into a .NET demo for me to test?

alexey.noskov · May 5, 2023, 10:10am

@ducaisoft The provided code is in C#.

ducaisoft · May 5, 2023, 10:17am

Could you translate the demo code into VB.NET?

alexey.noskov · May 5, 2023, 10:36am

@ducaisoft Here is the same code in VB.NET:

Dim doc As Document = New Document("C:\Temp\in.docx")

' Split all Run nodes in the document so that each one contains at most one word.
' Take a snapshot with ToList() (requires Imports System.Linq) because SplitRun inserts nodes while we iterate.
Dim runs As List(Of Run) = doc.GetChildNodes(NodeType.Run, True).Cast(Of Run)().ToList()
For Each r As Run In runs
    Dim current As Run = r
    While (current.Text.IndexOf(" ") >= 0)
        current = SplitRun(current, current.Text.IndexOf(" ") + 1)
    End While
Next

' Wrap all runs in the document with bookmarks to make it possible to work with LayoutCollector and LayoutEnumerator.
runs = doc.GetChildNodes(NodeType.Run, True).Cast(Of Run)().ToList()
Dim tmpBookmakrs As List(Of String) = New List(Of String)
Dim bkIndex As Integer = 0
For Each r As Run In runs
    ' LayoutCollector and LayoutEnumerator do not work with nodes in headers/footers or in textboxes.
    If (r.GetAncestor(NodeType.HeaderFooter) IsNot Nothing OrElse r.GetAncestor(NodeType.Shape) IsNot Nothing) Then
        Continue For
    End If

    Dim bkStart As BookmarkStart = New BookmarkStart(doc, String.Format("r{0}", bkIndex))
    Dim bkEnd As BookmarkEnd = New BookmarkEnd(doc, bkStart.Name)

    r.ParentNode.InsertBefore(bkStart, r)
    r.ParentNode.InsertAfter(bkEnd, r)

    tmpBookmakrs.Add(bkStart.Name)
    bkIndex = bkIndex + 1
Next

' Now we can use the collector and the enumerator to get the runs on each line of the document.
Dim collector As LayoutCollector = New LayoutCollector(doc)
Dim enumerator As LayoutEnumerator = New LayoutEnumerator(doc)

Dim currentLine As Object = New Object()
For Each bkName As String In tmpBookmakrs
    Dim bk As Bookmark = doc.Range.Bookmarks(bkName)

    enumerator.Current = collector.GetEntity(bk.BookmarkStart)
    While (enumerator.Type <> LayoutEntityType.Line)
        enumerator.MoveParent()
    End While

    If (currentLine IsNot enumerator.Current) Then
        currentLine = enumerator.Current
        Console.WriteLine()
        Console.WriteLine("-------=========Start Of Line=========-------")
    End If

    Dim run As Run = TryCast(bk.BookmarkStart.NextSibling, Run)
    If (run IsNot Nothing) Then
        Console.Write(run.Text)
    End If
Next

Private Function SplitRun(run As Run, position As Integer) As Run

    Dim afterRun As Run = CType(run.Clone(True), Run)
    run.ParentNode.InsertAfter(afterRun, run)
    afterRun.Text = run.Text.Substring(position)
    run.Text = run.Text.Substring(0, position)
    Return afterRun

End Function

ducaisoft · May 5, 2023, 11:02am

Thank you very much!
I tested it and it works well!

Another question: how can I split a given paragraph instead of the whole document?

alexey.noskov · May 5, 2023, 4:40pm

@ducaisoft Use the same code, but take the Run nodes from a single Paragraph in both GetChildNodes calls:

Dim runs As List(Of Run) = someParagraph.GetChildNodes(NodeType.Run, True).Cast(Of Run)().ToList()
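
For C# readers who want the lines in a list rather than printed to the console, the steps above can be wrapped into one helper. This is only a minimal sketch built from the calls already used in this thread: the class name LineReader, the method name GetLines, and the "tmpLine" bookmark prefix are hypothetical, and it assumes the document does not already contain bookmarks with that prefix. Note that it leaves the runs split into single words, and it removes the temporary bookmarks before returning.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text;
using Aspose.Words;
using Aspose.Words.Layout;

// Hypothetical helper, not part of Aspose.Words.
public static class LineReader
{
    // Returns the text of each layout line covered by the runs under "scope".
    // "scope" may be the Document itself or a single Paragraph.
    public static List<string> GetLines(Document doc, CompositeNode scope)
    {
        // Step 1: split runs so that each one contains at most one word.
        foreach (Run r in scope.GetChildNodes(NodeType.Run, true).Cast<Run>().ToList())
        {
            Run current = r;
            while (current.Text.IndexOf(' ') >= 0)
                current = SplitRun(current, current.Text.IndexOf(' ') + 1);
        }

        // Step 2: wrap each run in a temporary bookmark and remember the bookmark nodes.
        var marks = new List<KeyValuePair<BookmarkStart, BookmarkEnd>>();
        foreach (Run r in scope.GetChildNodes(NodeType.Run, true).Cast<Run>().ToList())
        {
            // Nodes in headers/footers and textboxes are skipped, as in the code above.
            if (r.GetAncestor(NodeType.HeaderFooter) != null || r.GetAncestor(NodeType.Shape) != null)
                continue;

            BookmarkStart start = new BookmarkStart(doc, "tmpLine" + marks.Count);
            BookmarkEnd end = new BookmarkEnd(doc, start.Name);
            r.ParentNode.InsertBefore(start, r);
            r.ParentNode.InsertAfter(end, r);
            marks.Add(new KeyValuePair<BookmarkStart, BookmarkEnd>(start, end));
        }

        // Step 3: map each bookmark start to its layout line and group the run text by line.
        LayoutCollector collector = new LayoutCollector(doc);
        LayoutEnumerator enumerator = new LayoutEnumerator(doc);

        List<string> lines = new List<string>();
        StringBuilder line = null;
        object currentLine = null;
        foreach (var mark in marks)
        {
            enumerator.Current = collector.GetEntity(mark.Key);
            while (enumerator.Type != LayoutEntityType.Line)
                enumerator.MoveParent();

            // A different line entity means the previous line is complete.
            if (line == null || currentLine != enumerator.Current)
            {
                if (line != null)
                    lines.Add(line.ToString());
                line = new StringBuilder();
                currentLine = enumerator.Current;
            }

            Run run = mark.Key.NextSibling as Run;
            if (run != null)
                line.Append(run.Text);
        }
        if (line != null)
            lines.Add(line.ToString());

        // Step 4: remove the temporary bookmark nodes again.
        foreach (var mark in marks)
        {
            mark.Key.Remove();
            mark.Value.Remove();
        }

        return lines;
    }

    private static Run SplitRun(Run run, int position)
    {
        Run afterRun = (Run)run.Clone(true);
        run.ParentNode.InsertAfter(afterRun, run);
        afterRun.Text = run.Text.Substring(position);
        run.Text = run.Text.Substring(0, position);
        return afterRun;
    }
}
```

A call such as LineReader.GetLines(doc, someParagraph) would then return the lines of one paragraph, and LineReader.GetLines(doc, doc) the lines of the whole document. Where the line breaks fall depends on the Aspose.Words layout engine and the fonts available on the machine, so the result may differ slightly from what MS Word shows.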