Read content from a Word document from all the runs

Nizzam2024 · September 13, 2024, 9:09am

Hi,
I need to read the content of a Word document word by word and store it in objects of the form { text: word, separator: separator }, where the separator is the space or new line between two words. I have tried looping through the Run items, but new lines are skipped. If this is possible with Aspose, please let me know.

Example Input => This is the test, content.
Output Should Be =>

[
{text: "This", separator: " "},
{text: "is", separator: " "},
{text: "the", separator: " "},
{text: "test", separator: ""},
{text: ",", separator: " "},
{text: "content", separator: ""},
{text: ".", separator: "\n"}
]

Here separator defines the space or new line between two words.

Professionalize.Discourse · September 13, 2024, 9:09am

@Nizzam2024

To read the content from a Word document word by word and store it in the specified format, you can use the Aspose.Words library. The approach below reads the text of each paragraph and captures each word along with the separator (space or new line) that follows it.

Here’s a code example that demonstrates how to achieve this:

using System.Collections.Generic;
using System.Text.RegularExpressions;
using Aspose.Words;

public class WordContentExtractor
{
    public static List<WordSeparator> ExtractWordsWithSeparators(string filePath)
    {
        Document doc = new Document(filePath);
        List<WordSeparator> result = new List<WordSeparator>();

        foreach (Paragraph paragraph in doc.GetChildNodes(NodeType.Paragraph, true))
        {
            // GetText() includes the trailing paragraph break ('\r');
            // manual line breaks inside a paragraph are '\v'.
            string paragraphText = paragraph.GetText();

            // Each match is one whitespace-delimited word. Match.Index gives its exact
            // position, so repeated words are not confused with each other.
            MatchCollection words = Regex.Matches(paragraphText, @"\S+");

            for (int i = 0; i < words.Count; i++)
            {
                int gapStart = words[i].Index + words[i].Length;
                int gapEnd = i < words.Count - 1 ? words[i + 1].Index : paragraphText.Length;

                // Scan the whitespace after the word: a paragraph or line break
                // makes the separator a newline, otherwise it is a space.
                string separator = "";
                for (int j = gapStart; j < gapEnd; j++)
                {
                    char c = paragraphText[j];
                    if (c == '\r' || c == '\n' || c == '\v')
                    {
                        separator = "\n";
                        break;
                    }
                    separator = " ";
                }

                result.Add(new WordSeparator { Text = words[i].Value, Separator = separator });
            }
        }

        return result;
    }
}

public class WordSeparator
{
    public string Text { get; set; }
    public string Separator { get; set; }
}

Explanation:

Document Loading: The document is loaded using the Document class.
Iterating Paragraphs: Each paragraph in the document is accessed, and its text is retrieved.
Splitting Words: The paragraph text is split into words based on spaces and new lines.
Determining Separators: For each word, the code checks the character following it to determine if it is a space or a newline, which is then stored as the separator.
Storing Results: Each word and its corresponding separator are stored in a list of WordSeparator objects.

This implementation should give you the desired output format for words separated by whitespace. Note that punctuation directly following a word, such as the comma in "test,", stays attached to that word.

Feel free to modify the logic as needed to fit your specific requirements!

Nizzam2024 · September 13, 2024, 9:16am

What if there is no space between two words, as in ‘test, content’? Here ‘test’ and the comma are two different words, but there is no space between them.

alexey.noskov · September 13, 2024, 11:34am

@Nizzam2024 In MS Word documents a word is not necessarily represented by a single Run node. It is equally valid for a Run node to represent several words, or even a whole paragraph’s content. You can use the find/replace functionality to make each word be represented by its own Run node. For example, the following code demonstrates how to read a document’s content line by line:

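using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;
using Aspose.Words;
using Aspose.Words.Layout;
using Aspose.Words.Replacing;

Document doc = new Document(@"C:\Temp\in.docx");

// Make each word in the document be represented by a single Run.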
FindReplaceOptions opt = new FindReplaceOptions();
opt.UseSubstitutions = true;
doc.Range.Replace(new Regex("[a-zA-Z]+"), "$0", opt);

doc.Save(@"C:\Temp\out.docx");

// Wrap all runs in the document with bookmarks to make it possible to work with LayoutCollector and LayoutEnumerator.
List<Run> runs = doc.GetChildNodes(NodeType.Run, true).Cast<Run>().ToList();

List<string> tmpBookmarks = new List<string>();
int bkIndex = 0;
foreach (Run r in runs)
{
    // LayoutCollector and LayoutEnumerator do not work with nodes in headers/footers or in textboxes.
    if (r.GetAncestor(NodeType.HeaderFooter) != null || r.GetAncestor(NodeType.Shape) != null)
        continue;

    BookmarkStart start = new BookmarkStart(doc, string.Format("r{0}", bkIndex));
    BookmarkEnd end = new BookmarkEnd(doc, start.Name);

    r.ParentNode.InsertBefore(start, r);
    r.ParentNode.InsertAfter(end, r);

    tmpBookmarks.Add(start.Name);
    bkIndex++;
}

// Now we can use the collector and enumerator to get the runs per line in the MS Word document.
LayoutCollector collector = new LayoutCollector(doc);
LayoutEnumerator enumerator = new LayoutEnumerator(doc);

object currentLine = null;
foreach (string bkName in tmpBookmarks)
{
    Bookmark bk = doc.Range.Bookmarks[bkName];

    enumerator.Current = collector.GetEntity(bk.BookmarkStart);
    while (enumerator.Type != LayoutEntityType.Line)
        enumerator.MoveParent();

    if (currentLine != enumerator.Current)
    {
        currentLine = enumerator.Current;

        Console.WriteLine();
        Console.WriteLine("-------=========Start Of Line=========-------");
    }

    Run run = bk.BookmarkStart.NextSibling as Run;
    if (run != null)
        Console.Write(run.Text);
}