How to get TOC?

qiqilie · August 3, 2011, 3:10am

Hello, I want to read the table of contents (TOC) from an existing Word document and output it. Could you provide the relevant code? Thank you!

AndreyN · August 3, 2011, 3:30am

Hi
Thanks for your inquiry. A TOC is actually a field. Its result contains paragraphs with special style names such as “TOC 1”, “TOC 2”, etc. You can loop through all paragraphs and pick those whose style name contains “TOC”.
Here is a code example:

Document doc = new Document("in.doc");
// Get Paragraph Collection
NodeCollection paragraphColl = doc.GetChildNodes(NodeType.Paragraph, true);
// Loop through all paragraphs
foreach(Paragraph par in paragraphColl)
{
    if (par.ParagraphFormat.Style.Name.Contains("TOC"))
        Console.WriteLine(par.ToTxt());
}

To handle level indentation for the TOC, you need to determine which Heading style (Heading 1, Heading 2, and so on) the corresponding paragraphs use.
The following code finds all paragraphs with a HeadingX style (in this example X is 1-3) in the document:

// Get all paragraphs from the document
NodeCollection paragraphs = doc.GetChildNodes(NodeType.Paragraph, true);
foreach(Paragraph paragraph in paragraphs)
{
    switch (paragraph.ParagraphFormat.StyleIdentifier)
    {
        case StyleIdentifier.Heading1:
        case StyleIdentifier.Heading2:
        case StyleIdentifier.Heading3:
            // This para style is HeadingX
            break;
    }
}

Hope this helps.
Best regards,
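
As a rough illustration of the indentation idea, here is a minimal sketch that reads the level out of a TOC entry style name and indents the text accordingly. It does not use the Aspose.Words API: `TocStyleHelper`, `GetTocLevel`, and `Indent` are hypothetical helper names, and the sketch assumes the style name has already been read from `par.ParagraphFormat.Style.Name` as in the first example.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of Aspose.Words.
public static class TocStyleHelper
{
    // Matches "TOC 1" .. "TOC 9" and also tolerates a missing space ("TOC1").
    private static readonly Regex TocStyle =
        new Regex(@"^TOC\s*([1-9])$", RegexOptions.IgnoreCase);

    // Returns the TOC level (1-9) encoded in a style name,
    // or 0 if the name is not a TOC entry style (for example "TOC Heading").
    public static int GetTocLevel(string styleName)
    {
        if (styleName == null)
            return 0;
        Match m = TocStyle.Match(styleName.Trim());
        return m.Success ? int.Parse(m.Groups[1].Value) : 0;
    }

    // Indents an entry by two spaces for every level below the first.
    public static string Indent(string entryText, int level)
    {
        int depth = Math.Max(level - 1, 0);
        return new string(' ', depth * 2) + entryText;
    }
}
```

Inside the paragraph loop you could then write something like `Console.WriteLine(TocStyleHelper.Indent(text, TocStyleHelper.GetTocLevel(styleName)))`, skipping paragraphs for which the level is 0.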

qiqilie · August 3, 2011, 4:12am

Thank you very much for the answer, but there is one more problem: how do I get the page number of each paragraph and chapter?

AndreyN · August 3, 2011, 10:05am

Hello
Thanks for your inquiry. In this case please try using the following code:

Document doc = new Document("C:\\Temp\\in.doc");
Node currentNode = null;
// Get collection of FieldStart nodes
Node[] fieldStarts = doc.GetChildNodes(NodeType.FieldStart, true).ToArray();
// Loop through all FieldStart nodes
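foreach(FieldStart start in fieldStarts)
{
    if (start.FieldType == FieldType.FieldTOC)
    {
        // Use the first TOC field in the document.
        currentNode = start;
        break;
    }
}
// Without this check the loops below would fail on a document that has no TOC.
if (currentNode == null)
    throw new InvalidOperationException("The document contains no TOC field.");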
// Skip forward to the first field separator (after the TOC field code).
while (currentNode.NodeType != NodeType.FieldSeparator)
    currentNode = currentNode.NextPreOrder(doc);
// First node of the paragraph
currentNode = currentNode.NextPreOrder(doc);
bool isCollecting = true;
int countOfFieldItems = 0;
while (isCollecting)
{
    StringBuilder entryText = new StringBuilder();
    StringBuilder pageText = new StringBuilder();
    while (currentNode.NodeType != NodeType.FieldStart)
    {
        countOfFieldItems++;
        entryText.Append(currentNode.GetText().Trim());
        currentNode = currentNode.NextPreOrder(doc);
    }
    countOfFieldItems = 0;
    currentNode = currentNode.NextPreOrder(doc);
    // Skip nodes until FieldSeparator (of PAGEREF)
    while (currentNode.NodeType != NodeType.FieldSeparator)
    {
        currentNode = currentNode.NextPreOrder(doc);
    }
    // Add the runs from the field which should be the page number
    currentNode = currentNode.NextPreOrder(doc);
    pageText.Append(currentNode.GetText());
    // Show
    Console.Write(entryText + "---" + pageText + "\n");
    currentNode = currentNode.NextPreOrder(doc);
    // Skip to the first run of the next paragraph (should be the next entry). Check whether the TOC field end is found at the same time
    bool isNextPara = false;
    bool isChecking = true;
    while (isChecking)
    {
        currentNode = currentNode.NextPreOrder(doc);
        // No node found, break.
        if (currentNode == null)
        {
            isCollecting = false;
            break;
        }
        // Passed a new paragraph
        if (currentNode.NodeType == NodeType.Paragraph)
            isNextPara = true;
        // Found first run of a new paragraph
        if (isNextPara && currentNode.NodeType == NodeType.Run)
            isChecking = false;
        // Once we encounter a FieldEnd node of type FieldTOC then we know we are at the end
        // of the current TOC and we can stop here.
        if (currentNode.NodeType == NodeType.FieldEnd)
        {
            FieldEnd fieldEnd = (FieldEnd) currentNode;
            if (fieldEnd.FieldType == FieldType.FieldTOC)
            {
                isCollecting = false;
                break;
            }
        }
    }
}

Best regards,

qiqilie · August 3, 2011, 7:12pm

Thank you for your reply. It is a bit difficult to understand. The page numbers are shown, but the chapter text comes out like this: HYPERLINK \l "_Toc273402009". Can that part be removed? Also, the first entry is not displayed. I want a result like this: every paragraph has an auto-numbered ID (1, 2, 3, 4, ...), and for each paragraph I can get the parent ID and the current page number, for example:

NodeCollection paragraphs = doc.GetChildNodes(NodeType.Paragraph, true);
foreach(Paragraph paragraph in paragraphs)
{
    switch (paragraph.ParagraphFormat.StyleIdentifier)
    {
        case StyleIdentifier.Heading1:
            Insert(id, parentid, page, chapter);
            break;
        case StyleIdentifier.Heading2:
            Insert(id, parentid, page, chapter);
            break;
        case StyleIdentifier.Heading3:
            Insert(id, parentid, page, chapter);
            break;
    }
}

adam.skelton · August 3, 2011, 9:23pm

Hi there,
Thanks for this additional information. Could you please attach your template document here as well?
Thanks,

qiqilie · August 3, 2011, 9:53pm

Of course, I have uploaded it.

adam.skelton · August 4, 2011, 1:32am

Hi there,
Thanks for attaching your document here for testing.
I think you can use the code below to achieve what you want. This will find and parse all of the paragraphs in the first TOC and print out the information of each entry.

DataTable tocTable = TableOfContentsToDataTable(doc, 0);
foreach (DataRow row in tocTable.Rows)
{
    Console.WriteLine(string.Format("Entry name: {0}, Heading Level: {1}, Page number: {2}", row["EntryName"], ((Style)row["EntryStyle"]).StyleIdentifier, row["Page"]));
}

public static DataTable TableOfContentsToDataTable(Document doc, int tocIndex)
{
    DataTable table = new DataTable();
    table.TableName = "Toc " + tocIndex;
    // * Needed for Aspose's code
    table.Columns.Add("EntryRef");
    //  end
    table.Columns.Add("EntryName");
    table.Columns.Add("ResultStartNode", typeof(Node));
    table.Columns.Add("ResultRuns", typeof(List<Run>));
    table.Columns.Add("EntryStyle", typeof(Style));
    table.Columns.Add("PageRef");
    table.Columns.Add("Page");
    // Get the FieldStart of the specified TOC.
    Node currentNode = (Node)FindTocStartFromIndex(doc, tocIndex);
    // Skip forward to the first field separator (after the TOC field code).
    while (currentNode.NodeType != NodeType.FieldSeparator)
        currentNode = currentNode.NextPreOrder(doc);
    // First node of the paragraph
    currentNode = currentNode.NextPreOrder(doc);
    bool isCollecting = true;
    int countOfFieldItems = 0;
    bool isAfterFirstTocEntry = false;
    bool isHyperlinked = currentNode.NodeType == NodeType.FieldStart;
    while (isCollecting)
    {
        StringBuilder entryRefCode = new StringBuilder();
        StringBuilder entryText = new StringBuilder();
        StringBuilder pageRefCode = new StringBuilder();
        StringBuilder pageText = new StringBuilder();
        // Ensures that first entry is gotten from TOC
        if (!isAfterFirstTocEntry)
        {
            // Skip nodes until encounters a run
            while (currentNode.NodeType != NodeType.Run)
            {
                currentNode = currentNode.NextPreOrder(doc);
            }
            isAfterFirstTocEntry = true;
        }
        if (isHyperlinked)
        {
            // Collect all runs in the field code until we encounter the field separator
            while (currentNode.NodeType != NodeType.FieldSeparator)
            {
                entryRefCode.Append(currentNode.Range.Text.Trim());
                currentNode = currentNode.NextPreOrder(doc);
            }
            // Skip past field separator
            currentNode = currentNode.NextPreOrder(doc);
        }
        // Stop if the TOC is empty; Word puts this placeholder text in the field result in that case
        if (currentNode.Range.Text.Contains("No table of contents entries found."))
        {
            table.Columns.Clear();
            return table;
        }
        Node entryPositionNode = null;
        List<Run> fieldResultRuns = new List<Run>();
        Style entryStyle = null;
        while (currentNode.NodeType != NodeType.FieldStart)
        {
            countOfFieldItems++;
            if (currentNode.NodeType == NodeType.Run)
            {
                if (entryPositionNode == null)
                    entryPositionNode = currentNode.PreviousPreOrder(doc);
                fieldResultRuns.Add((Run)currentNode.Clone(false));
                entryStyle = ((Run)currentNode).ParentParagraph.ParagraphFormat.Style;
            }
            entryText.Append(currentNode.Range.Text.Trim());
            currentNode = currentNode.NextPreOrder(doc);
        }
        countOfFieldItems = 0;
        // Skip nodes until FieldStart (of PAGEREF)
        while (currentNode.NodeType != NodeType.FieldStart)
        {
            currentNode = currentNode.NextPreOrder(doc);
        }
        currentNode = currentNode.NextPreOrder(doc);
        pageRefCode.Append(currentNode.Range.Text);
        // Skip nodes until FieldSeparator (of PAGEREF)
        while (currentNode.NodeType != NodeType.FieldSeparator)
        {
            currentNode = currentNode.NextPreOrder(doc);
        }
        // Add the runs from the field which should be the page number
        currentNode = currentNode.NextPreOrder(doc);
        pageText.Append(currentNode.Range.Text);
        // Add to datatable
        table.Rows.Add(new object[] { entryRefCode.ToString(), entryText.ToString(), entryPositionNode, fieldResultRuns, entryStyle, pageRefCode.ToString(), pageText.ToString() });
        currentNode = currentNode.NextPreOrder(doc);
        // Skip to the first run of the next paragraph (should be the next entry). Check whether the TOC field end is found at the same time
        bool isNextPara = false;
        bool isChecking = true;
        while (isChecking)
        {
            currentNode = currentNode.NextPreOrder(doc);
            // No node found, break.
            if (currentNode == null)
            {
                isCollecting = false;
                break;
            }
            // Passed a new paragraph
            if (currentNode.NodeType == NodeType.Paragraph)
                isNextPara = true;
            // Found first run of a new paragraph
            if (isNextPara && currentNode.NodeType == NodeType.Run)
                isChecking = false;
            // Once we encounter a FieldEnd node of type FieldTOC then we know we are at the end
            // of the current TOC and we can stop here.
            if (currentNode.NodeType == NodeType.FieldEnd)
            {
                Aspose.Words.Fields.FieldEnd fieldEnd = (Aspose.Words.Fields.FieldEnd)currentNode;
                if (fieldEnd.FieldType == Aspose.Words.Fields.FieldType.FieldTOC)
                {
                    isCollecting = false;
                    break;
                }
            }
        }
    }
    return table;
}

If you have any further queries, please feel free to ask.
Thanks,
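
The EntryRef and PageRef columns hold raw field code text, such as the HYPERLINK code quoted earlier in this thread. If you only need the bookmark name the entry points to, a minimal sketch like the one below may help. It is plain string handling rather than an Aspose.Words call: `TocFieldCodeHelper` and `GetBookmarkName` are hypothetical names, and the sketch assumes the bookmarks follow Word's usual "_Toc" plus digits naming for generated TOC targets.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper, not part of Aspose.Words.
public static class TocFieldCodeHelper
{
    // Assumption: TOC target bookmarks are named "_Toc" followed by digits.
    private static readonly Regex TocBookmark = new Regex(@"_Toc\d+");

    // Returns the first "_Toc..." bookmark name found in a field code,
    // or null if the text contains none.
    public static string GetBookmarkName(string fieldCode)
    {
        if (string.IsNullOrEmpty(fieldCode))
            return null;
        Match m = TocBookmark.Match(fieldCode);
        return m.Success ? m.Value : null;
    }
}
```

For example, passing `row["EntryRef"].ToString()` or `row["PageRef"].ToString()` should give a name such as `_Toc273402009`, which leaves the EntryName column as the only text you need to print.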

qiqilie · August 4, 2011, 2:37am

Hello, I cannot find this method. Do I need to add something?
FindTocStartFromIndex(doc, tocIndex) cannot be found.

adam.skelton · August 4, 2011, 3:06am

Hi there,
Sorry about that, please find the implementation of the missing method below.

public static FieldStart FindTocStartFromIndex(Document doc, int tocIndex)
{
    // Collect the FieldStart nodes of all TOC fields in the document.
    ArrayList fieldStarts = new ArrayList();
    foreach(FieldStart start in doc.GetChildNodes(NodeType.FieldStart, true))
    {
        if (start.FieldType == FieldType.FieldTOC)
        {
            // Add all FieldStarts which are of type FieldTOC.
            fieldStarts.Add(start);
        }
    }
    // Ensure the TOC specified by the passed index exists.
    if (tocIndex < 0 || tocIndex >= fieldStarts.Count)
        throw new ArgumentOutOfRangeException("tocIndex", "TOC index is out of range");
    return (FieldStart) fieldStarts[tocIndex];
}

Thanks,

qiqilie · August 4, 2011, 3:19am

Looks good, but I am still a little disappointed. I expected something like this:
Entry name: 1XXX, ID: 1, ParentID: 0, Page number: 1
Entry name: 2XXX, ID: 2, ParentID: 0, Page number: 1
Entry name: 2.1XXX, ID: 3, ParentID: 2, Page number: 1
Entry name: 2.2XXX, ID: 4, ParentID: 2, Page number: 2
Entry name: 2.3XXX, ID: 5, ParentID: 2, Page number: 2
Entry name: 2.4XXX, ID: 6, ParentID: 2, Page number: 4
Entry name: 2.4.1XXX, ID: 7, ParentID: 6, Page number: 4
Entry name: 2.4.2XXX, ID: 8, ParentID: 6, Page number: 4
Entry name: 2.4.3XXX, ID: 9, ParentID: 6, Page number: 5

alexey.noskov · August 4, 2011, 5:36am

Hi
Thanks for your request. There are no IDs for TOC items in MS Word documents. However, I think you can easily calculate them in your code. You can use the heading level (HeadingLevel) of each entry to decide when to move to the next level.
Best regards,
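
One possible way to calculate the IDs is sketched below. It is only an illustration and does not depend on Aspose.Words: `TocEntry`, `TocNumbering`, and `AssignIds` are hypothetical names, and the sketch assumes each entry's level (1 for Heading 1, 2 for Heading 2, and so on) has already been determined, for example from the entry style. Entries are numbered in document order, and a stack of open ancestors gives the parent ID.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical container for one TOC entry.
public class TocEntry
{
    public string Name;
    public string Page;
    public int Level;     // 1 = top level, 2 = one level below, ...
    public int Id;        // assigned by AssignIds, starting at 1
    public int ParentId;  // assigned by AssignIds, 0 means "no parent"
}

public static class TocNumbering
{
    // Assigns sequential IDs in document order and derives each parent ID
    // from the nearest preceding entry with a smaller level.
    public static void AssignIds(IList<TocEntry> entries)
    {
        // Ancestors of the current entry, deepest on top.
        Stack<TocEntry> ancestors = new Stack<TocEntry>();
        int nextId = 1;
        foreach (TocEntry entry in entries)
        {
            // Entries at the same or a deeper level cannot be the parent.
            while (ancestors.Count > 0 && ancestors.Peek().Level >= entry.Level)
                ancestors.Pop();

            entry.Id = nextId++;
            entry.ParentId = ancestors.Count > 0 ? ancestors.Peek().Id : 0;
            ancestors.Push(entry);
        }
    }
}
```

After filling a `List<TocEntry>` from the rows of the TOC table and calling `TocNumbering.AssignIds(list)`, each entry can be printed with its name, ID, parent ID, and page number.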