How to read just the body content from the docx file. Don't want to read any special section like comment, shape, footnote, endnote etc

bhavikkabaria23 · October 22, 2024, 12:03pm

Issue :- How to Remove or extract text (content) of comment section from its associated Paragraph from a document

Scenario :- Consider a document which has multiple paragraphs (e.g. P1,P2,P3…Pn) and a comment section (e.g.C1) which is added with one of the paragraph (e.g. P3). In this scenario we need to extract (skip) comment section and its content to read from a document and it should read content only from paragraph sections (P1,P2,P3…Pn). So we have tried below code to do this to extract comment section (C1).

var _paragraphs = _userDocument.GetChildNodes(NodeType.Paragraph, true)
    .Cast<Paragraph>()
    .Where(p => p.ParentNode != null && p.ParentNode.NodeType != NodeType.Comment).ToList();

Problem :- By using above code, only a comment section, C1 is extracted from a document but the content of C1 is still showing with the content of associated paragraph, P3.

e.g. Text of P3 → “This is the third paragraph”
Text of C1 → “This is my comment section”

So when we get the text of P3 by “P3.GetText()”, it gives “This is the third paragraph. This is my comment section”.

when we debug this situation, we see that “ChildNodes” of P3 has still showing Comment nodes of C1 even after we have used above code. To resolve this we have tried to remove “ChildNodes” from each paragraph which has “NodeType.Comment” but if we are processing a large document or multiple documents in a same time, this solution may be hang up the server or getting slow performance. We also need to do the same thing for Footnote, Endnote also.

Expected Result:- “P3.GetText()” should give only text of P3 - “This is the third paragraph.”. Please provide us a recommended solution which works for both comment node and its content to extract from a document.

Programming Language and Software :- We are using Dot Net Core with c# with Visual Studio 2022. DotNet Framework if 8.0

Please refer the attached screen shots.
Example_Problem

Word_Document_Comment

alexey.noskov · October 22, 2024, 12:59pm

@bhavikkabaria23 Comments are children of paragraphs, so it is expected that comment text is included when you execute Paragraph.GetText() method. If you would like to get text without comment text you can use the following code:

Document doc = new Document(@"C:\Temp\in.docx");

Paragraph paragraphWithComment = doc.FirstSection.Body.FirstParagraph;

// Clone paragraph and remove comment from it.
Paragraph paragraphWithoutComment = (Paragraph)paragraphWithComment.Clone(true);
paragraphWithoutComment.GetChildNodes(NodeType.Comment, true).Clear();

Console.WriteLine(paragraphWithoutComment.ToString(SaveFormat.Text));

Please see our documentation to learn more about Aspose.Words Document Object Model:
https://docs.aspose.com/words/net/aspose-words-document-object-model/

bhavikkabaria23 · October 22, 2024, 2:07pm

Thank you @alexey.noskov for your response.

The example which you given is works for a paragraph. But in our case, there are multiple paragraphs in a document. Our document can be large and having 1000+ paragraphs.

So, in this case we need to loop all paragraphs to apply the logic you suggested, and it can be crash the server or use huge memories.

So, is there any way to achieve this without doing looping?

alexey.noskov · October 22, 2024, 5:30pm

@bhavikkabaria23 You can implement your own DocumentVisitor implantation to get text from the paragraphs skipping comments content:

Document doc = new Document(@"C:\Temp\in.docx");

Paragraph paragraphWithComment = doc.FirstSection.Body.FirstParagraph;

MyDocumentVisitor visitor = new MyDocumentVisitor();
paragraphWithComment.Accept(visitor);

Console.WriteLine(visitor.Text);

private class MyDocumentVisitor : DocumentVisitor
{
    public void Reset()
    {
        mBuilder = new StringBuilder();
    }

    public override VisitorAction VisitCommentStart(Comment comment)
    {
        // Start skip content
        mIsSkip = true;
        return VisitorAction.Continue;
    }

    public override VisitorAction VisitCommentEnd(Comment comment)
    {
        // End skip content.
        mIsSkip = false;
        return VisitorAction.Continue;
    }

    public override VisitorAction VisitRun(Run run)
    {
        if (!mIsSkip)
            mBuilder.Append(run.Text);

        return VisitorAction.Continue;
    }

    public string Text
    {
        get { return mBuilder.ToString(); }
    }

    private StringBuilder mBuilder = new StringBuilder();
    private bool mIsSkip = false;
}

If you need to skip all comments in your document, you can simply remove them:

doc.GetChildNodes(NodeType.Comment, true).Clear();