How can I search for a specific string with Aspose.Words for .NET and get the text from two lines before the matching line?

ysho · May 16, 2023, 3:48am

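I have written the code below, but I cannot get the text from two lines before the match (“１行目テキスト”).
Is it possible to get the value of a specific element from the paragraphs by specifying an index?
Is there any approach other than the code below?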

Word file (.docx)
１行目テキスト

３行目テキスト

Code

NodeCollection paragraphs = doc.GetChildNodes(NodeType.Paragraph, true);

foreach (Paragraph paragraph in paragraphs)
{
    var index = paragraphs.IndexOf(paragraph);

    // Skip matches in the first two paragraphs: nothing sits two paragraphs above them.
    if (index >= 2 && paragraph.Range.Text.Trim().Equals("３行目テキスト"))
    {
        // Insert a page break if a fund name in parentheses appears two lines above.
        Paragraph paraBefore2 = (Paragraph)paragraphs[index - 2];
        MessageBox.Show("Text two lines before: " + paraBefore2.Range.Text.Trim());
    }
}

alexey.noskov · May 16, 2023, 4:19am

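@ysho As you may know, MS Word documents are flow documents, so they have no concept of pages or lines. Consumer applications build the document layout on the fly using their own layout engine, and Aspose.Words does the same. The LayoutCollector and LayoutEnumerator classes provide limited access to the document layout information.
For example, the following code demonstrates a basic technique for splitting document content into lines.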

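using System;
using System.Collections.Generic;
using System.Linq;
using Aspose.Words;
using Aspose.Words.Layout;

Document doc = new Document(@"C:\Temp\in.docx");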

// Split all Run nodes in the document to make them not more than one word.
List<Run> runs = doc.GetChildNodes(NodeType.Run, true).Cast<Run>().ToList();
foreach (Run r in runs)
{
    Run current = r;
    while (current.Text.IndexOf(' ') >= 0)
        current = SplitRun(current, current.Text.IndexOf(' ') + 1);
}

// Wrap all runs in the document with bookmarks to make it possible to work with LayoutCollector and LayoutEnumerator.
runs = doc.GetChildNodes(NodeType.Run, true).Cast<Run>().ToList();

List<string> tmpBookmakrs = new List<string>();
int bkIndex = 0;
foreach (Run r in runs)
{
    // LayoutCollector and LayoutEnumerator do not work with nodes in headers/footers or in text boxes.
    if (r.GetAncestor(NodeType.HeaderFooter) != null || r.GetAncestor(NodeType.Shape) != null)
        continue;

    BookmarkStart start = new BookmarkStart(doc, string.Format("r{0}", bkIndex));
    BookmarkEnd end = new BookmarkEnd(doc, start.Name);

    r.ParentNode.InsertBefore(start, r);
    r.ParentNode.InsertAfter(end, r);

    tmpBookmakrs.Add(start.Name);
    bkIndex++;
}

// Now we can use collector and enumerator to get runs per line in MS Word document.
LayoutCollector collector = new LayoutCollector(doc);
LayoutEnumerator enumerator = new LayoutEnumerator(doc);

object currentLine = null;
foreach (string bkName in tmpBookmakrs)
{
    Bookmark bk = doc.Range.Bookmarks[bkName];

    enumerator.Current = collector.GetEntity(bk.BookmarkStart);
    while (enumerator.Type != LayoutEntityType.Line)
        enumerator.MoveParent();

    if (currentLine != enumerator.Current)
    {
        currentLine = enumerator.Current;

        Console.WriteLine();
        Console.WriteLine("-------=========Start Of Line=========-------");
    }

    Run run = bk.BookmarkStart.NextSibling as Run;
    if (run != null)
        Console.Write(run.Text);
}

private static Run SplitRun(Run run, int position)
{
    Run afterRun = (Run)run.Clone(true);
    run.ParentNode.InsertAfter(afterRun, run);
    afterRun.Text = run.Text.Substring(position);
    run.Text = run.Text.Substring(0, position);
    return afterRun;
}
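
To go from this per-line console output to "the text two lines above the match", the run texts can be accumulated into a list of line strings and then indexed. The following is only a minimal sketch of that grouping step, assuming the (line entity, run text) pairs have already been obtained as in the loop above. `GroupIntoLines`, `samplePairs`, and the sample strings are hypothetical and not part of Aspose.Words, and the snippet assumes C# 9 or later because it uses top-level statements.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-ins for enumerator.Current: one object per rendered line.
object line1 = new object();
object line2 = new object();
object line3 = new object();

// Hypothetical (line entity, run text) pairs in document order.
var samplePairs = new List<(object LineKey, string Text)>
{
    (line1, "Alpha "), (line1, "beta "),
    (line2, "gamma "),
    (line3, "delta"),
};

List<string> lines = GroupIntoLines(samplePairs);

// Find the matching line, then look two entries back if they exist.
int hit = lines.FindIndex(s => s.Trim() == "delta");
if (hit >= 2)
    Console.WriteLine(lines[hit - 2].Trim());

// Concatenates consecutive run texts that share the same line key.
static List<string> GroupIntoLines(IEnumerable<(object LineKey, string Text)> pairs)
{
    var result = new List<string>();
    object lastKey = new object(); // sentinel: never equal to a real line key

    foreach (var (key, text) in pairs)
    {
        if (!ReferenceEquals(key, lastKey))
        {
            result.Add(text); // first run of a new line
            lastKey = key;
        }
        else
        {
            result[result.Count - 1] += text; // same line: append
        }
    }
    return result;
}
```

Because the loop above only visits Run nodes, an empty paragraph would contribute no pair and therefore no entry in the list, so "two lines before" here counts non-empty lines only.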

ysho · May 18, 2023, 12:01pm

Thank you for the quick reply.
So there really is no concept of a line.
I was able to solve it with the code below.

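// Extract all text so the paragraphs come out in the order they appear.
string[] allText = doc.GetText().Split(new[] { "\r" }, StringSplitOptions.None);
for (int i = 0; i < allText.Length; i++)
{
    var text = allText[i].Trim();

    // i >= 2 avoids an IndexOutOfRangeException when the match is in the first two entries.
    if (i >= 2 && "３行目テキスト".Equals(text))
    {
        // Get the text two entries before the current one.
        var textBefore2 = allText[i - 2].Trim();

        MessageBox.Show("Text two lines before: " + textBefore2);
    }
}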
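
The same lookup can be wrapped in a small helper that returns null instead of failing when the match has fewer than two entries before it. This is a minimal sketch, assuming the input is a string in which paragraphs are separated by "\r", as in the split above. `FindTextBefore` and `sampleText` are hypothetical names, the sample string stands in for a real document, and the snippet assumes C# 9 or later because it uses top-level statements.

```csharp
using System;

// Stand-in for the string returned by doc.GetText(),
// assumed to separate paragraphs with "\r".
string sampleText = "１行目テキスト\r\r３行目テキスト\r";

Console.WriteLine(FindTextBefore(sampleText, "３行目テキスト", 2) ?? "(no match)");

// Returns the trimmed entry `offset` paragraphs before the first paragraph
// whose trimmed text equals `target`, or null if there is no such entry.
static string FindTextBefore(string documentText, string target, int offset)
{
    string[] entries = documentText.Split(new[] { "\r" }, StringSplitOptions.None);
    for (int i = 0; i < entries.Length; i++)
    {
        if (entries[i].Trim() == target)
            return i >= offset ? entries[i - offset].Trim() : null;
    }
    return null;
}
```

Note that the entries here are paragraphs, not rendered lines, so a paragraph that wraps across several lines on the page still counts as one entry.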

alexey.noskov · May 18, 2023, 12:45pm

@ysho It is great that you managed to resolve the problem. Please feel free to ask if you run into any issues. We are always glad to help.