Parse Table of Contents TOC Field Entries in DOC DOCX Word Document using C# .NET | Extract Page Number of Phrase

TomCuffe · June 18, 2020, 1:17pm

Hi,

We are using .NET 4.5, C# development environment.

Does this product (or some other that you publish) have the ability to extract the Table of Contents (TOC) from a Word document in the format .doc AND .docx?

I am just looking for a plain text representation of the TOC - I do not need all the hidden tags etc.

For example, this is what I am after:

Contents Page
Employee Handbook Issues and Updates 2
Introduction 3
Joining Our Organisation 4

Thanks,

Tom

TomCuffe · June 18, 2020, 1:17pm

My ultimate aim in asking the question above was related to finding the page number of a phrase located in a Word document.

So - can any of your software search a .doc or .docx Word document for a phrase, and then return the page number of where that phrase is located?

Thanks again,

Tom

awais.hafeez · June 18, 2020, 2:29pm

@TomCuffe,

You can parse the Table of Contents (TOC) field in a Word DOC or DOCX document with the following code:

sample-input.zip (11.0 KB)

C# Code:

using System;
using Aspose.Words;
using Aspose.Words.Fields;

Document doc = new Document("E:\\SampleDocs\\sample-input.docx");

foreach (FieldStart field in doc.GetChildNodes(NodeType.FieldStart, true))
{
    if (field.FieldType.Equals(FieldType.FieldHyperlink))
    {
        FieldHyperlink hyperlink = (FieldHyperlink)field.GetField();
        if (hyperlink.SubAddress != null && hyperlink.SubAddress.StartsWith("_Toc"))
        {
            Paragraph tocItem = (Paragraph)field.GetAncestor(NodeType.Paragraph);
            if (tocItem != null)
            {
                // To get text representation of a TOC Entry
                Console.WriteLine(tocItem.ToString(SaveFormat.Text).Trim());

                //// To get page numbers only
                //foreach (Field nestedField in tocItem.Range.Fields)
                //{
                //    if (nestedField.Type.Equals(FieldType.FieldPageRef))
                //    {
                //        //nestedField.Unlink();
                //        Console.WriteLine(nestedField.DisplayResult);
                //    }
                //}
            }
        }
    }
}
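
If you need the title and the page number as separate values, one option is to split each printed entry at its last tab or space. The following is only a minimal sketch and is not part of Aspose.Words: TocEntryParser and TryParse are hypothetical names, and it assumes each entry's text ends with a tab or space followed by an Arabic page number (entries numbered with Roman numerals would be rejected).

```csharp
using System;

public static class TocEntryParser
{
    // Hypothetical helper: splits "Introduction\t3" into title "Introduction" and page 3.
    // Returns false when the line does not end with a whitespace-separated integer.
    public static bool TryParse(string line, out string title, out int page)
    {
        title = null;
        page = 0;

        if (string.IsNullOrWhiteSpace(line))
            return false;

        string trimmed = line.Trim();

        // The page number is assumed to follow the last tab or space in the entry.
        int split = trimmed.LastIndexOfAny(new[] { '\t', ' ' });
        if (split < 0)
            return false;

        if (!int.TryParse(trimmed.Substring(split + 1), out page))
            return false;

        title = trimmed.Substring(0, split).TrimEnd();
        return true;
    }
}
```

You could call TocEntryParser.TryParse on the string passed to Console.WriteLine above and skip any line for which it returns false, such as a "Contents Page" heading.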

awais.hafeez · June 18, 2020, 2:30pm

@TomCuffe,

Please take the sample document from my previous post and try running the following code. It prints the page number of each occurrence of the phrase "Heading 2":

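using System;
using System.Collections;
using Aspose.Words;
using Aspose.Words.Layout;
using Aspose.Words.Replacing;

// Place the statements below inside a method, and nest ReplaceEvaluator in the same class.
Document doc = new Document("E:\\SampleDocs\\sample-input.docx");

FindReplaceOptions opts = new FindReplaceOptions();
opts.ReplacingCallback = new ReplaceEvaluator();

// The callback returns ReplaceAction.Skip, so the matched text is not replaced.
doc.Range.Replace("Heading 2", "", opts);

private class ReplaceEvaluator : IReplacingCallback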
{
    ReplaceAction IReplacingCallback.Replacing(ReplacingArgs e)
    {
        // This is a Run node that contains either the beginning or the complete match.
        Node currentNode = e.MatchNode;

        // The first (and possibly the only) run can contain text before the match;
        // in that case the run must be split.
        if (e.MatchOffset > 0)
            currentNode = SplitRun((Run)currentNode, e.MatchOffset);

        // Collects every run that makes up the match.
        ArrayList runs = new ArrayList();

        // Find all runs that contain parts of the match string.
        int remainingLength = e.Match.Value.Length;
        while (
            (remainingLength > 0) &&
            (currentNode != null) &&
            (currentNode.GetText().Length <= remainingLength))
        {
            runs.Add(currentNode);
            remainingLength = remainingLength - currentNode.GetText().Length;

            // Select the next Run node.
            // Have to loop because there could be other nodes such as BookmarkStart etc.
            do
            {
                currentNode = currentNode.NextSibling;
            }
            while ((currentNode != null) && (currentNode.NodeType != NodeType.Run));
        }

        // Split the last run that contains the match if there is any text left.
        if ((currentNode != null) && (remainingLength > 0))
        {
            SplitRun((Run)currentNode, remainingLength);
            runs.Add(currentNode);
        }

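        // LayoutCollector maps document nodes to pages of the laid-out document.
        LayoutCollector collector = new LayoutCollector((Document)e.MatchNode.Document);
        // Page index (1-based) on which the first run of the match begins.
        int startPage = collector.GetStartPageIndex((Run)runs[0]);

        Console.WriteLine("Page number is {0}", startPage);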

        return ReplaceAction.Skip;
    }

    private static Run SplitRun(Run run, int position)
    {
        Run afterRun = (Run)run.Clone(true);
        afterRun.Text = run.Text.Substring(position);
        run.Text = run.Text.Substring(0, position);
        run.ParentNode.InsertAfter(afterRun, run);
        return afterRun;
    }
}

TomCuffe1 · June 19, 2020, 10:19am

Hi Awais,

Thanks for the prompt reply!

The code you supplied contains the following line:

Document doc = new Document("E:\\SampleDocs\\sample-input.docx");

The documents I will be working with are held in a MemoryStream. Is it possible to use a memory stream with this solution?

Also, the zip file you posted does not download for me, even though I am logged in.

Thanks,

Tom

awais.hafeez · June 19, 2020, 3:46pm

@TomCuffe1, @TomCuffe,

Yes, the same code will work: you can pass a MemoryStream object to the Document constructor. Please use either of the Document constructors that accept a Stream, namely Document(Stream) or Document(Stream, LoadOptions).
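
As a rough illustration of loading from memory, the sketch below reads the sample file into a byte array only to stand in for wherever your bytes actually come from. The class name LoadFromMemoryExample and the file path are assumptions for the example. If your MemoryStream has just been written to, it is probably safest to set its Position to 0 before passing it in.

```csharp
using System;
using System.IO;
using Aspose.Words;

class LoadFromMemoryExample
{
    static void Main()
    {
        // Stand-in for your own source of document bytes (assumption for this sketch).
        byte[] bytes = File.ReadAllBytes("E:\\SampleDocs\\sample-input.docx");

        using (MemoryStream stream = new MemoryStream(bytes))
        {
            // Document(Stream) detects whether the content is DOC or DOCX.
            Document doc = new Document(stream);

            // From here on, doc can be used exactly as in the earlier code samples.
            Console.WriteLine(doc.PageCount);
        }
    }
}
```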

As for the download problem, it may be because you are now using a different account from the one you used to create this forum thread. I am attaching the file again here for your reference:

sample-input.zip (11.0 KB)