Reading the content of the TOC from a Word document

naveenc · June 29, 2015, 4:55am

Hi,

I am using Aspose.Words and I want to read the content of the TOC along with its relationships.

That is, I want the parent-child relationships between the TOC entries.

Regards

Naveen

tahir.manzoor · June 29, 2015, 12:09pm

Hi Naveen,

Thanks for your inquiry. Please read the following documentation on extracting content from a document.
https://docs.aspose.com/words/net/how-to-extract-selected-content-between-nodes-in-a-document/

Please use the following code example to achieve your requirements. I hope this helps.

// ExtractContent and GenerateDocument are the helper methods from the documentation linked above.
Document doc = new Document(MyDir + "in.docx");
// Add a bookmark at the very end of the document so the last TOC section has an end marker.
DocumentBuilder builder = new DocumentBuilder(doc);
builder.MoveToDocumentEnd();
builder.StartBookmark("_TocEnd");
builder.EndBookmark("_TocEnd");
NodeCollection nodes = doc.GetChildNodes(NodeType.FieldStart, true);
// Collect the names of the bookmarks referenced by the PAGEREF fields in the TOC.
ArrayList tocitems = new ArrayList();
foreach (FieldStart fstart in nodes)
{
    if (fstart.FieldType == Aspose.Words.Fields.FieldType.FieldPageRef)
    {
        String fieldText = fstart.GetField().GetFieldCode();
        if (fieldText.Contains("_Toc"))
        {
            // Keep everything from "_Toc" onwards and drop the \h switch.
            fieldText = fieldText.Substring(fieldText.IndexOf("_Toc")).Replace("\\h", "").Trim();
            tocitems.Add(fieldText);
        }
    }
}
// Use the end-of-document bookmark as the end marker for the last section;
// without it the loop below never extracts the final TOC entry.
tocitems.Add("_TocEnd");
for (int i = 0; i < tocitems.Count - 1; i++)
{
    BookmarkStart bookmarkStart = doc.Range.Bookmarks[tocitems[i].ToString()].BookmarkStart;
    BookmarkStart bookmarkEnd = doc.Range.Bookmarks[tocitems[i + 1].ToString()].BookmarkStart;
    // Extract the content between the two bookmarks, excluding the bookmark nodes themselves.
    ArrayList extractedNodes = ExtractContent(bookmarkStart, bookmarkEnd, false);
    Document doc2 = GenerateDocument(doc, extractedNodes);
    doc2.Save(MyDir + tocitems[i] + "Out.docx");
}
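
The Substring/Replace step above only strips the \h switch, so any other switch in a PAGEREF field code would end up in the bookmark name. A regular expression is one way to pull out just the name. The following is a minimal sketch in plain C#, independent of Aspose.Words; the class and method names (PageRefParser, GetTocBookmarkName) are hypothetical, and it assumes field codes of the form `PAGEREF _Toc123456789 \h`.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of Aspose.Words.
public static class PageRefParser
{
    // Matches "PAGEREF" followed by a bookmark name that starts with "_Toc".
    private static readonly Regex BookmarkPattern =
        new Regex(@"\bPAGEREF\s+(_Toc\w+)", RegexOptions.IgnoreCase);

    // Returns the _Toc bookmark name from a PAGEREF field code,
    // or null if the field code does not reference one.
    public static string GetTocBookmarkName(string fieldCode)
    {
        if (fieldCode == null)
            return null;
        Match match = BookmarkPattern.Match(fieldCode);
        return match.Success ? match.Groups[1].Value : null;
    }
}
```

In the loop above, the result of GetTocBookmarkName(fstart.GetField().GetFieldCode()) could be added to tocitems whenever it is not null.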

wherold · June 29, 2015, 6:13pm

I had a similar problem. I solved it by inferring hierarchy using the OutlineLevel of each TOC bookmark node’s parent paragraph.
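
The idea of inferring hierarchy from levels can be sketched roughly as follows. This is not wherold's code (none was posted), only a minimal illustration in plain C#: it assumes each TOC entry has already been reduced to a title and a 1-based level (for example, derived from the paragraph's OutlineLevel), and the names TocNode and TocTree are hypothetical.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical container for one TOC entry; not an Aspose.Words type.
public sealed class TocNode
{
    public string Title { get; }
    public int Level { get; }
    public TocNode Parent { get; private set; }
    public List<TocNode> Children { get; } = new List<TocNode>();

    public TocNode(string title, int level)
    {
        Title = title;
        Level = level;
    }

    public void AddChild(TocNode child)
    {
        child.Parent = this;
        Children.Add(child);
    }
}

public static class TocTree
{
    // Builds a tree from entries given in document order.
    // Level 0 is reserved for the synthetic root; real entries use 1, 2, 3, ...
    public static TocNode Build(IEnumerable<(string Title, int Level)> entries)
    {
        var root = new TocNode("(root)", 0);
        var stack = new Stack<TocNode>();
        stack.Push(root);

        foreach (var (title, level) in entries)
        {
            if (level < 1)
                throw new ArgumentOutOfRangeException(nameof(entries), "Levels must be 1 or greater.");

            // Pop until the top of the stack is shallower than the new entry;
            // that node is the new entry's parent.
            while (stack.Peek().Level >= level)
                stack.Pop();

            var node = new TocNode(title, level);
            stack.Peek().AddChild(node);
            stack.Push(node);
        }
        return root;
    }
}
```

Each entry becomes a child of the nearest preceding entry with a smaller level, so a level-2 entry that follows a level-1 entry is attached to it, and the next level-1 entry starts a new branch under the root.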

naveenc · July 1, 2015, 4:23am

Hi wherold,

Could you share the code, please?

Regards

Naveen

tahir.manzoor · July 2, 2015, 1:31am

Hi Naveen,

You can achieve your requirements using the code shared in my previous post. Please let us know if you face any issue while using Aspose.Words.

https://docs.aspose.com/words/net/how-to-extract-selected-content-between-nodes-in-a-document/

naveenc · July 2, 2015, 6:00am

Hi,

The code does not work for all Word documents that contain a TOC.

I have attached a document for which it does not work. Please check it and resolve the issue.

In the TOC, one section is listed as "1. Description".

With the code you shared, I can retrieve only "Description".

But I also want the section number, i.e. "1. Description".

Please resolve the issue.

Regards
Naveen

tahir.manzoor · July 2, 2015, 6:26am

Hi Naveen,

Thanks for your inquiry. If you are using an older version of Aspose.Words, I suggest upgrading to the latest version (v15.5.0).

Please call the Document.UpdateFields method after loading the document and then run the shared code. I have attached the two output documents (for “DESCRIPTION” and “INDICATIONS”) to this post for your reference.

Please let us know if you have any more queries.

naveenc · July 3, 2015, 6:10am

Hi Tahir,

Thanks for your reply.

I upgraded to the latest version, 15.5.0, and called the Document.UpdateFields method after loading the document.

But it is not working for me.

I need the list of TOC entries with their numbers (if they have any).

I mean:

1.Description
2.Indications
3.Generics

The code you sent works, but I am unable to get the numbers in the TOC content.

Also, I did not understand why you attached the output documents for Description and Indications.

I hope you understand my requirement. If not, please run the code below and you will see what I mean.

Use the same document that I sent in my earlier post.

Below is my code:

// License for Aspose.Words
License lic = new License();
lic.SetLicense("Aspose.Words.lic");
Aspose.Words.Document doc = new Aspose.Words.Document("Document not working to read TOC.docx");
doc.UpdateFields();
DocumentBuilder builder = new DocumentBuilder(doc);
// The element type was lost in the post; string is assumed here.
List<string> listSectionname = new List<string>();
builder.MoveToDocumentEnd();
builder.StartBookmark("_TocEnd");
builder.EndBookmark("_TocEnd");
NodeCollection nodes = doc.GetChildNodes(NodeType.FieldStart, true);
// Get list of bookmarks listed in TOC
ArrayList tocitems = new ArrayList();
foreach (FieldStart fstart in nodes)
{
    if (fstart.FieldType == Aspose.Words.Fields.FieldType.FieldPageRef)
    {
        String fieldText = fstart.GetField().GetFieldCode();
        if (fieldText.Contains("_Toc"))
        {
            fieldText = fieldText.Substring(fieldText.IndexOf("_Toc"), fieldText.Length - fieldText.IndexOf("_Toc")).Replace("\\h", "").Trim();
            tocitems.Add(fieldText);
        }
    }
}
// Must match the bookmark name created above ("_TocEnd", not "TocEnd").
tocitems.Add("_TocEnd");
if (tocitems.Count > 1)
{
    try
    {
        string SectionContent = "";
        for (int i = 0; i < tocitems.Count - 1; i++)
        {
            BookmarkStart bookmarkStart = doc.Range.Bookmarks[tocitems[i].ToString()].BookmarkStart;
            BookmarkStart bookmarkEnd = doc.Range.Bookmarks[tocitems[i + 1].ToString()].BookmarkStart;

            ArrayList extractedNodesInclusive = ExtractContent(bookmarkStart, bookmarkEnd, false);
            foreach (Node node in extractedNodesInclusive)
            {
                var divsection = new HtmlGenericControl("div");
                divsection.Attributes.Add("class", "dynamicselect");
                var span = new HtmlGenericControl("span");
                // Keep only the first extracted node of each section (presumably the heading).
                SectionContent += node.ToString(SaveFormat.Text) + "";
                break;
            }
        }

        lblimport.Text = SectionContent;
        // (the post ends here; the rest of the code was not included)

tahir.manzoor · July 6, 2015, 3:17am

Hi Naveen,

Thanks for your inquiry. From the details you shared, your query has two parts.
1) Reading only the TOC entries, as shown below.

*naveenc:

But it is not working for me.

I need the list of TOC entries with their numbers (if they have any).

I mean

1.Description
2.Indications
3.Generics*

If this is the case, please use the following code example to read the TOC entries.

DataTable tocTable = TableOfContentsToDataTable(doc, 0);
foreach (DataRow row in tocTable.Rows)
{
    Console.WriteLine(string.Format("Entry name: {0}, Heading Level: {1}, Page number: {2}", row["EntryName"], ((Style)row["EntryStyle"]).StyleIdentifier, row["Page"]));
}

public static DataTable TableOfContentsToDataTable(Document doc, int tocIndex)
{
    DataTable table = new DataTable();
    table.TableName = "Toc " + tocIndex;
    // ******* Needed for Aspose's code 
    table.Columns.Add("EntryRef");
    // ****** end 
    table.Columns.Add("EntryName");
    table.Columns.Add("ResultStartNode", typeof(Node));
    table.Columns.Add("ResultRuns", typeof(List<Run>));
    table.Columns.Add("EntryStyle", typeof(Style));
    table.Columns.Add("PageRef");
    table.Columns.Add("Page");
    // Get the FieldStart of the specified TOC.
    Node currentNode = (Node)FindTocStartFromIndex(doc, tocIndex);
    // Skip forward to the first field separator (after the TOC field code).
    while (currentNode.NodeType != NodeType.FieldSeparator)
        currentNode = currentNode.NextPreOrder(doc);
    // First node of the paragraph
    currentNode = currentNode.NextPreOrder(doc);
    bool isCollecting = true;
    int countOfFieldItems = 0;
    bool isAfterFirstTocEntry = false;
    bool isHyperlinked = currentNode.NodeType == NodeType.FieldStart;
    while (isCollecting)
    {
        StringBuilder entryRefCode = new StringBuilder();
        StringBuilder entryText = new StringBuilder();
        StringBuilder pageRefCode = new StringBuilder();
        StringBuilder pageText = new StringBuilder();
        // Ensures that first entry is gotten from TOC
        if (!isAfterFirstTocEntry)
        {
            // Skip nodes until encounters a run
            while (currentNode.NodeType != NodeType.Run)
            {
                currentNode = currentNode.NextPreOrder(doc);
            }
            isAfterFirstTocEntry = true;
        }
        if (isHyperlinked)
        {
            // Collect all runs in the field code until we encounter the field separator
            while (currentNode.NodeType != NodeType.FieldSeparator)
            {
                entryRefCode.Append(currentNode.Range.Text.Trim());
                currentNode = currentNode.NextPreOrder(doc);
            }
            // Skip past field separator
            currentNode = currentNode.NextPreOrder(doc);
        }
        // Stop if the TOC is empty (Word's placeholder text is present)
        if (currentNode.Range.Text.Contains("No table of contents entries found."))
        {
            table.Columns.Clear();
            return table;
        }
        Node entryPositionNode = null;
        List<Run> fieldResultRuns = new List<Run>();
        Style entryStyle = null;
        while (currentNode.NodeType != NodeType.FieldStart)
        {
            countOfFieldItems++;
            if (currentNode.NodeType == NodeType.Run)
            {
                if (entryPositionNode == null)
                    entryPositionNode = currentNode.PreviousPreOrder(doc);
                fieldResultRuns.Add((Run)currentNode.Clone(false));
                entryStyle = ((Run)currentNode).ParentParagraph.ParagraphFormat.Style;
            }
            entryText.Append(currentNode.Range.Text.Trim());
            currentNode = currentNode.NextPreOrder(doc);
        }
        countOfFieldItems = 0;
        // Skip nodes until FieldStart (of PAGEREF)
        while (currentNode.NodeType != NodeType.FieldStart)
        {
            currentNode = currentNode.NextPreOrder(doc);
        }
        currentNode = currentNode.NextPreOrder(doc);
        pageRefCode.Append(currentNode.Range.Text);
        // Skip nodes until FieldSeparator (of PAGEREF)
        while (currentNode.NodeType != NodeType.FieldSeparator)
        {
            currentNode = currentNode.NextPreOrder(doc);
        }
        // Add the runs from the field which should be the page number
        currentNode = currentNode.NextPreOrder(doc);
        pageText.Append(currentNode.Range.Text);
        // Add to datatable
        table.Rows.Add(new object[] { entryRefCode.ToString(), entryText.ToString(), entryPositionNode, fieldResultRuns, entryStyle, pageRefCode.ToString(), pageText.ToString() });
        currentNode = currentNode.NextPreOrder(doc);
        // Skip to the first run of the next paragraph (should be the next entry). Check if a TOC field end is found at the same time
        bool isNextPara = false;
        bool isChecking = true;
        while (isChecking)
        {
            currentNode = currentNode.NextPreOrder(doc);
            // No node found, break.
            if (currentNode == null)
            {
                isCollecting = false;
                break;
            }
            // Passed a new paragraph
            if (currentNode.NodeType == NodeType.Paragraph)
                isNextPara = true;
            // Found first run of a new paragraph
            if (isNextPara && currentNode.NodeType == NodeType.Run)
                isChecking = false;
            // Once we encounter a FieldEnd node of type FieldTOC then we know we are at the end
            // of the current TOC and we can stop here.
            if (currentNode.NodeType == NodeType.FieldEnd)
            {
                Aspose.Words.Fields.FieldEnd fieldEnd = (Aspose.Words.Fields.FieldEnd)currentNode;
                if (fieldEnd.FieldType == Aspose.Words.Fields.FieldType.FieldTOC)
                {
                    isCollecting = false;
                    break;
                }
            }
        }
    }
    return table;
}

-------------------------------------------------------------

public static FieldStart FindTocStartFromIndex(Document doc, int tocIndex)
{
    // Collect the FieldStart nodes of all TOC fields in the document.
    ArrayList fieldStarts = new ArrayList();
    foreach (FieldStart start in doc.GetChildNodes(NodeType.FieldStart, true))
    {
        if (start.FieldType == FieldType.FieldTOC)
        {
            // Add all FieldStarts which are of type FieldTOC.
            fieldStarts.Add(start);
        }
    }
    // Ensure the TOC specified by the passed index exists.
    if (tocIndex < 0 || tocIndex > fieldStarts.Count - 1)
        throw new ArgumentOutOfRangeException("tocIndex", "TOC index is out of range");
    return (FieldStart)fieldStarts[tocIndex];
}
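
The EntryStyle column holds the style of each TOC paragraph, and the loop at the top of this example prints its StyleIdentifier. If the entries use the built-in TOC styles, whose identifiers print as Toc1 to Toc9, a numeric level can be read from that name, which is one possible input for working out parent and child entries. The following is only a sketch in plain C#; the TocLevel class is hypothetical and operates on the identifier's string form rather than on any Aspose.Words type.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of Aspose.Words.
public static class TocLevel
{
    // Accepts exactly "Toc1" .. "Toc9".
    private static readonly Regex TocName = new Regex(@"^Toc([1-9])$");

    // Returns 1..9 for names such as "Toc1" .. "Toc9", or 0 for anything else
    // (for example, an entry formatted with a custom style).
    public static int FromStyleIdentifierName(string name)
    {
        if (name == null)
            return 0;
        Match match = TocName.Match(name);
        return match.Success ? int.Parse(match.Groups[1].Value) : 0;
    }
}
```

For a row from the table, the call would look like TocLevel.FromStyleIdentifierName(((Style)row["EntryStyle"]).StyleIdentifier.ToString()); a result of 0 means the level has to be determined some other way.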

*naveenc:

The code you sent works, but I am unable to get the numbers in the TOC content.

Also, I did not understand why you attached the output documents for Description and Indications.*

2) Reading the contents of the TOC entries.

The code shared here extracts the contents of the TOC entries, and it works as expected. The Description and Indications documents are for your reference; they show that the code is working.

Regarding the list label issue (e.g. 2. INDICATIONS), please check the documents attached to my previous post. The list label for INDICATIONS is 1. Please note that Aspose.Words mimics the behavior of MS Word. If you extract the contents of the INDICATIONS section from your input document using MS Word, you will get the same output.

I hope this answers your query. If you still face problems, please create your expected Word document manually in Microsoft Word and attach it here for our reference. We will investigate how you want your final Word output to look and then provide more information along with code.