How to read TOC from a document

transcore · April 9, 2014, 2:24pm

We are using Aspose.Words. My application needs to read the table of contents (TOC) from a document and store each TOC line in a database. We use VB.NET, but we are comfortable with C# as well.

tahir.manzoor · April 10, 2014, 6:12am

Hi Ashley,

Thanks for your inquiry.

Please check the code example shared in the following forum thread; it should cover your requirements. Please let us know if you have any more queries.
https://forum.aspose.com/t/60963

awais.hafeez · April 10, 2014, 7:38am

Hi Ashley,

Thanks for your inquiry. I have copied the code from the thread suggested by Tahir here for your reference. Please let us know if we can be of any further assistance.

The following code will find and parse all of the paragraphs in the first TOC and print out the information of each entry.

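// Namespaces used: System, System.Collections.Generic, System.Data, System.Text, Aspose.Words.
// "input.docx" is a placeholder path for the document to read.
Document doc = new Document("input.docx");
DataTable tocTable = TableOfContentsToDataTable(doc, 0);
foreach (DataRow row in tocTable.Rows)
{
    Console.WriteLine("Entry name: {0}, Heading Level: {1}, Page number: {2}",
        row["EntryName"], ((Style)row["EntryStyle"]).StyleIdentifier, row["Page"]);
}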

public static DataTable TableOfContentsToDataTable(Document doc, int tocIndex)
{
    DataTable table = new DataTable();
    table.TableName = "Toc " + tocIndex;
    // Field code of the entry's HYPERLINK field (empty when the TOC is not hyperlinked).
    table.Columns.Add("EntryRef");
    table.Columns.Add("EntryName");
    table.Columns.Add("ResultStartNode", typeof(Node));
    table.Columns.Add("ResultRuns", typeof(List<Run>));
    table.Columns.Add("EntryStyle", typeof(Style));
    table.Columns.Add("PageRef");
    table.Columns.Add("Page");
    // Get the FieldStart of the specified TOC.
    Node currentNode = (Node)FindTocStartFromIndex(doc, tocIndex);
    // Skip forward to the first field separator (after the TOC field code).
    while (currentNode.NodeType != NodeType.FieldSeparator)
        currentNode = currentNode.NextPreOrder(doc);
    // First node of the paragraph
    currentNode = currentNode.NextPreOrder(doc);
    bool isCollecting = true;
    int countOfFieldItems = 0;
    bool isAfterFirstTocEntry = false;
    bool isHyperlinked = currentNode.NodeType == NodeType.FieldStart;
    while (isCollecting)
    {
        StringBuilder entryRefCode = new StringBuilder();
        StringBuilder entryText = new StringBuilder();
        StringBuilder pageRefCode = new StringBuilder();
        StringBuilder pageText = new StringBuilder();
        // On the first pass, make sure we start at the first TOC entry.
        if (!isAfterFirstTocEntry)
        {
            // Skip nodes until a run is encountered.
            while (currentNode.NodeType != NodeType.Run)
            {
                currentNode = currentNode.NextPreOrder(doc);
            }
            isAfterFirstTocEntry = true;
        }
        if (isHyperlinked)
        {
            // Collect all runs in the field code until we encounter the field separator
            while (currentNode.NodeType != NodeType.FieldSeparator)
            {
                entryRefCode.Append(currentNode.Range.Text.Trim());
                currentNode = currentNode.NextPreOrder(doc);
            }
            // Skip past field separator
            currentNode = currentNode.NextPreOrder(doc);
        }
        // Return an empty table (no columns) if the TOC reports that it has no entries.
        if (currentNode.Range.Text.Contains("No table of contents entries found."))
        {
            table.Columns.Clear();
            return table;
        }
        Node entryPositionNode = null;
        List<Run> fieldResultRuns = new List<Run>();
        Style entryStyle = null;
        while (currentNode.NodeType != NodeType.FieldStart)
        {
            countOfFieldItems++;
            if (currentNode.NodeType == NodeType.Run)
            {
                if (entryPositionNode == null)
                    entryPositionNode = currentNode.PreviousPreOrder(doc);
                fieldResultRuns.Add((Run)currentNode.Clone(false));
                entryStyle = ((Run)currentNode).ParentParagraph.ParagraphFormat.Style;
            }
            entryText.Append(currentNode.Range.Text.Trim());
            currentNode = currentNode.NextPreOrder(doc);
        }
        countOfFieldItems = 0;
        // Skip nodes until FieldStart (of PAGEREF)
        while (currentNode.NodeType != NodeType.FieldStart)
        {
            currentNode = currentNode.NextPreOrder(doc);
        }
        currentNode = currentNode.NextPreOrder(doc);
        pageRefCode.Append(currentNode.Range.Text);
        // Skip nodes until FieldSeparator (of PAGEREF)
        while (currentNode.NodeType != NodeType.FieldSeparator)
        {
            currentNode = currentNode.NextPreOrder(doc);
        }
        // Add the runs from the field which should be the page number
        currentNode = currentNode.NextPreOrder(doc);
        pageText.Append(currentNode.Range.Text);
        // Add to datatable
        table.Rows.Add(new object[] { entryRefCode.ToString(), entryText.ToString(), entryPositionNode, fieldResultRuns, entryStyle, pageRefCode.ToString(), pageText.ToString() });
        currentNode = currentNode.NextPreOrder(doc);
        // Skip to the first run of the next paragraph (should be the next entry). Check whether the TOC field end is found at the same time.
        bool isNextPara = false;
        bool isChecking = true;
        while (isChecking)
        {
            currentNode = currentNode.NextPreOrder(doc);
            // No node found, break.
            if (currentNode == null)
            {
                isCollecting = false;
                break;
            }
            // Passed a new paragraph
            if (currentNode.NodeType == NodeType.Paragraph)
                isNextPara = true;
            // Found first run of a new paragraph
            if (isNextPara && currentNode.NodeType == NodeType.Run)
                isChecking = false;
            // Once we encounter a FieldEnd node of type FieldTOC then we know we are at the end
            // of the current TOC and we can stop here.
            if (currentNode.NodeType == NodeType.FieldEnd)
            {
                Aspose.Words.Fields.FieldEnd fieldEnd = (Aspose.Words.Fields.FieldEnd)currentNode;
                if (fieldEnd.FieldType == Aspose.Words.Fields.FieldType.FieldTOC)
                {
                    isCollecting = false;
                    break;
                }
            }
        }
    }
    return table;
}
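
The listing above calls FindTocStartFromIndex, which is not included in this post. The following is only a minimal sketch of what such a helper might look like, assuming it should return the FieldStart node of the TOC field at a zero-based index. The class name TocHelpers and the choice to throw when no TOC is found are assumptions, not taken from the original thread. It needs a reference to the Aspose.Words assembly.

```csharp
using System;
using Aspose.Words;
using Aspose.Words.Fields;

public static class TocHelpers
{
    // Assumed behaviour: return the FieldStart of the TOC field at the given
    // zero-based index, or throw if the document has fewer TOC fields.
    public static FieldStart FindTocStartFromIndex(Document doc, int tocIndex)
    {
        int found = 0;

        // Walk every FieldStart node in the document, in document order.
        foreach (FieldStart start in doc.GetChildNodes(NodeType.FieldStart, true))
        {
            // Ignore fields that are not TOC fields (HYPERLINK, PAGEREF, ...).
            if (start.FieldType != FieldType.FieldTOC)
                continue;

            if (found == tocIndex)
                return start;

            found++;
        }

        throw new ArgumentOutOfRangeException(
            "tocIndex", "The document has no TOC field at index " + tocIndex + ".");
    }
}
```

Throwing here is a design choice for the sketch: the caller in the listing dereferences the result immediately, so a null return would only surface later as a NullReferenceException.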

Best regards,

transcore · April 10, 2014, 8:55pm

Thanks Awais.

My application aims to read the text of each line of the TOC in a document.

E.g.:

TOC contents of a test document:

Contents
1.1. Input Data 1
1.2. Output Data 1
1.3. Mode 2
1.3.1. Test Data 3
1.4. Next topic 4

The application should be able to loop through the TOC and return the text of each line, such as “1.1. Input Data 1”.

I tried using part of the code you posted, but it returns the header style rather than the text of each line.

How do I get the text?

transcore · April 11, 2014, 3:32pm

No response?

tahir.manzoor · April 13, 2014, 11:46am

Hi Ashley,

Please accept my apologies for the late response.

Thanks for your inquiry. Please note that every entry of a TOC field is represented by a HYPERLINK field. When you click a TOC item, the cursor jumps to the location in your document that a hidden bookmark points to.

Please check the EntryName and Page columns of the tocTable DataTable. See the attached image for details.

If you still face any issues, please share your input document here for testing. We will then provide more information about your query, along with code.
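
As a rough illustration of the suggestion above, the sketch below joins the EntryName and Page columns into one string per TOC line. It assumes a table shaped like the one returned by TableOfContentsToDataTable earlier in this thread. The helper name FormatTocLines and the sample rows are hypothetical and only stand in for a real document.

```csharp
using System;
using System.Collections.Generic;
using System.Data;

public static class TocLineDemo
{
    // Hypothetical helper: builds one "<entry text> <page>" string per row.
    // Assumes the table has "EntryName" and "Page" columns.
    public static List<string> FormatTocLines(DataTable tocTable)
    {
        List<string> lines = new List<string>();
        foreach (DataRow row in tocTable.Rows)
        {
            string entry = Convert.ToString(row["EntryName"]).Trim();
            string page = Convert.ToString(row["Page"]).Trim();
            lines.Add(entry + " " + page);
        }
        return lines;
    }

    public static void Main()
    {
        // Stand-in data; in the real application the table would come from
        // TableOfContentsToDataTable(doc, 0).
        DataTable table = new DataTable("Toc 0");
        table.Columns.Add("EntryName");
        table.Columns.Add("Page");
        table.Rows.Add("1.1. Input Data", "1");
        table.Rows.Add("1.3.1. Test Data", "3");

        foreach (string line in FormatTocLines(table))
            Console.WriteLine(line);
    }
}
```

Each returned string can then be written to the database as one TOC line. How the entry number and heading text are separated inside EntryName depends on how the listing concatenates the runs, so the exact spacing may need adjusting for a real document.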

rjain · April 23, 2015, 3:02pm

I do not see the DataTable class. Has it been removed? I am looking at aspose-pdf-10.2.0.jar.

tahir.manzoor · April 24, 2015, 2:21am

Hi Rishabh,

Thanks for your inquiry. This forum thread is about reading TOC items from a Word document, and the DataTable class used here is System.Data.DataTable from .NET. If you have any queries related to Aspose.Pdf, please post them here:
https://forum.aspose.com/c/pdf/10