Extract / pull TOC after Updating it?

robertal · January 21, 2011, 11:01am

Is it possible to access a table of contents (TOC) populated using the update function? I need to pull the generated TOC and manipulate its contents. Then I need to put this manipulated data back into the document. Ideally, I would like to be able to pull the TOC out as a datatable. I am using Aspose.Words for .NET 9.5.0.

Thanks!

adam.skelton · January 21, 2011, 6:36pm

Hi Rob,
Thanks for your inquiry.
Sure, you can find a sample implementation of how to do this below. It reuses code from another method that removes a TOC at a specific index from the document. Please note that the code is only an example; it was written quickly and is not very robust.

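// Requires: using System; using System.Collections; using System.Data; using System.Text;
// using Aspose.Words; using Aspose.Words.Fields;

// Extract text from the first TOC in the document.
DataTable dataTable = TableOfContentsToDataTable(doc, 0);

public static DataTable TableOfContentsToDataTable(Document doc, int tocIndex)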
{
    DataTable table = new DataTable();
    table.TableName = "Toc " + tocIndex;
    table.Columns.Add("EntryName");
    table.Columns.Add("Page");
    // Store the FieldStart nodes of TOC fields in the document for quick access.
    ArrayList fieldStarts = new ArrayList();
    foreach (FieldStart start in doc.GetChildNodes(NodeType.FieldStart, true))
    {
        if (start.FieldType == FieldType.FieldTOC)
        {
            // Add all FieldStarts which are of type FieldTOC.
            fieldStarts.Add(start);
        }
    }
    // Ensure the TOC specified by the passed index exists.
    if (tocIndex < 0 || tocIndex > fieldStarts.Count - 1)
        throw new ArgumentOutOfRangeException("tocIndex", "TOC index is out of range");
    // Get the FieldStart of the specified TOC.
    Node currentNode = (Node) fieldStarts[tocIndex];
    // Skip forward to the first field separator (after the TOC field code).
    while (currentNode.NodeType != NodeType.FieldSeparator)
        currentNode = currentNode.NextPreOrder(doc);
    // First node of the paragraph
    currentNode = currentNode.NextPreOrder(doc);
    bool isCollecting = true;
    while (isCollecting)
    {
        StringBuilder entryText = new StringBuilder();
        StringBuilder pageText = new StringBuilder();
        // Collect runs until start of FieldStart which make up the entry name of the TOC
        while (currentNode.NodeType != NodeType.FieldStart)
        {
            entryText.Append(currentNode.ToTxt().Trim());
            currentNode = currentNode.NextPreOrder(doc);
        }
        // Skip nodes until FieldSeparator (of PAGEREF)
        while (currentNode.NodeType != NodeType.FieldSeparator)
        {
            currentNode = currentNode.NextPreOrder(doc);
        }
        // Add the runs from the field which should be the page number
        currentNode = currentNode.NextPreOrder(doc);
        pageText.Append(currentNode.ToTxt());
        // Add to datatable
        table.Rows.Add(new string[]
        {
            entryText.ToString(), pageText.ToString()
        });
        currentNode = currentNode.NextPreOrder(doc);
        // Skip to the first run of the next paragraph (should be the next entry).
        // Check whether the TOC field end is found at the same time.
        bool isNextPara = false;
        bool isChecking = true;
        while (isChecking)
        {
            currentNode = currentNode.NextPreOrder(doc);
            // No node found, break.
            if (currentNode == null)
            {
                isCollecting = false;
                break;
            }
            // Passed a new paragraph
            if (currentNode.NodeType == NodeType.Paragraph)
                isNextPara = true;
            // Found first run of a new paragraph
            if (isNextPara && currentNode.NodeType == NodeType.Run)
                isChecking = false;
            // Once we encounter a FieldEnd node of type FieldTOC then we know we are at the end
            // of the current TOC and we can stop here.
            if (currentNode.NodeType == NodeType.FieldEnd)
            {
                FieldEnd fieldEnd = (FieldEnd) currentNode;
                if (fieldEnd.FieldType == FieldType.FieldTOC)
                {
                    isCollecting = false;
                    break;
                }
            }
        }
    }
    return table;
}

Thanks,

robertal · January 26, 2011, 3:32pm

It works like a champ, but it messes up the first entry. The entryText is empty and the Page column contains the actual entry name. Interestingly, the HYPERLINK \l “_Toc256000079” is missing, too. The rest of the entries (rows) are correct. For example:

Entry Name | Page
(empty) | entryName for page 1
HYPERLINK \l “_Toc256000079” entryName for page 2 | 2
HYPERLINK \l “_Toc256000080” entryName for page 3 | 3

I’m going to try and play with it to make it work. However if you have any ideas, please let me know.

Thanks.

adam.skelton · January 27, 2011, 4:15am

Hi Rob,
Thanks for your inquiry.
I think the issue may be occurring because the TOC FieldStart can sometimes appear in a separate paragraph from the first entry.
If you cannot crack it please feel free to attach your template here and I will assist.
Thanks,

robertal · January 27, 2011, 2:17pm

Awesome, thanks. I walked through the code and figured out where I had to fix it. For those who may be interested, see below; my additions are the isAfterFirstTocEntry check and the countOfFieldItems counter.

// Extract text from the first TOC in the document.
DataTable dataTable = TableOfContentsToDataTable(doc, 0);

public static DataTable TableOfContentsToDataTable(Document doc, int tocIndex)
{
    DataTable table = new DataTable();
    table.TableName = "Toc " + tocIndex;
    table.Columns.Add("EntryName");
    table.Columns.Add("Page");
    // Store the FieldStart nodes of TOC fields in the document for quick access.
    ArrayList fieldStarts = new ArrayList();

    bool isAfterFirstTocEntry = false;
    foreach (FieldStart start in doc.GetChildNodes(NodeType.FieldStart, true))
    {
        if (start.FieldType == FieldType.FieldTOC)
        {
            // Add all FieldStarts which are of type FieldTOC.
            fieldStarts.Add(start);
        }
    }
    // Ensure the TOC specified by the passed index exists.
    if (tocIndex < 0 || tocIndex > fieldStarts.Count - 1)
        throw new ArgumentOutOfRangeException("tocIndex", "TOC index is out of range");
    // Get the FieldStart of the specified TOC.
    Node currentNode = (Node) fieldStarts[tocIndex];
    // Skip forward to the first field separator (after the TOC field code).
    while (currentNode.NodeType != NodeType.FieldSeparator)
        currentNode = currentNode.NextPreOrder(doc);
    // First node of the paragraph
    currentNode = currentNode.NextPreOrder(doc);
    bool isCollecting = true;
    int countOfFieldItems = 0;
    while (isCollecting)
    {
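        StringBuilder entryText = new StringBuilder();
        StringBuilder pageText = new StringBuilder();
        // Receives the text of the third node of each entry below. It was not declared
        // in the posted snippet; it is declared here so that the code compiles.
        StringBuilder dataItemNameText = new StringBuilder();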

        // Ensures that first entry is gotten from TOC
        if (!isAfterFirstTocEntry)
        {
            // Skip nodes until encounters a run
            while (currentNode.NodeType != NodeType.Run)
            {
                currentNode = currentNode.NextPreOrder(doc);
            }
            isAfterFirstTocEntry = true;
        }
        // Collect runs until start of FieldStart which make up the entry name of the TOC
        while (currentNode.NodeType != NodeType.FieldStart)
        {
            countOfFieldItems++;
            entryText.Append(currentNode.ToTxt().Trim());

            if (countOfFieldItems == 3)
            {
                dataItemNameText.Append(currentNode.ToTxt().Trim());
            }

            currentNode = currentNode.NextPreOrder(doc);
        }

        countOfFieldItems = 0;
        // Skip nodes until FieldSeparator (of PAGEREF)
        while (currentNode.NodeType != NodeType.FieldSeparator)
        {
            currentNode = currentNode.NextPreOrder(doc);
        }
        // Add the runs from the field which should be the page number
        currentNode = currentNode.NextPreOrder(doc);
        pageText.Append(currentNode.ToTxt());
        // Add to datatable
        table.Rows.Add(new string[]
        {
            entryText.ToString(), pageText.ToString()
        });
        currentNode = currentNode.NextPreOrder(doc);
        // Skip to the first run of the next paragraph (should be the next entry).
        // Check whether the TOC field end is found at the same time.
        bool isNextPara = false;
        bool isChecking = true;
        while (isChecking)
        {
            currentNode = currentNode.NextPreOrder(doc);
            // No node found, break.
            if (currentNode == null)
            {
                isCollecting = false;
                break;
            }
            // Passed a new paragraph
            if (currentNode.NodeType == NodeType.Paragraph)
                isNextPara = true;
            // Found first run of a new paragraph
            if (isNextPara && currentNode.NodeType == NodeType.Run)
                isChecking = false;
            // Once we encounter a FieldEnd node of type FieldTOC then we know we are at the end
            // of the current TOC and we can stop here.
            if (currentNode.NodeType == NodeType.FieldEnd)
            {
                FieldEnd fieldEnd = (FieldEnd) currentNode;
                if (fieldEnd.FieldType == FieldType.FieldTOC)
                {
                    isCollecting = false;
                    break;
                }
            }
        }
    }
    return table;
}

robertal · January 27, 2011, 2:22pm

Once I have my info extracted and in a DataView, can I put it back in after editing it and have, say, the hyperlinks still work? I have tried inserting the info both as a table and as a mail merge item (using ExecuteWithRegions(), taken from the example at https://forum.aspose.com/t/97128).

Thanks!

adam.skelton · January 27, 2011, 9:28pm

Hi Rob,
Thanks for your inquiry.
From your other threads it sounds like you are trying to refactor the content of the TOC into an index. This is possible, and repopulating the TOC is actually quite easy.
Please see the code below which will “reinsert” the sorted data back into the original TOC field. I have moved the code to find the TOC at the index to a separate method and added a few lines in the original method.

public static void RepopulateTocWithValues(Document doc, int tocIndex, DataView view)
{
    // Get the FieldStart of the specified TOC.
    Node currentNode = (Node) FindTocStartFromIndex(doc, tocIndex);
    // Skip forward to the first field separator (after the TOC field code).
    while (currentNode.NodeType != NodeType.FieldSeparator)
        currentNode = currentNode.NextPreOrder(doc);
    // First node of the paragraph
    currentNode = currentNode.NextPreOrder(doc);
    // The Paragraph of the current TOC entry.
    Paragraph currentParagraph = (Paragraph) currentNode.ParentNode;
    // The original datatable (before sorting).
    DataTable origTable = view.Table;
    // Iterate through all recorded TOC entries.
    for (int index = 0; index < view.Count; index++)
    {
        // The current row in the sorted table in the view
        DataRow sortedRow = view[index].Row;
        // The row at the same position in the original (unsorted) table.
        DataRow origRow = origTable.Rows[index];
        // Replace each part of the original entry with each part of the sorted entry.
        currentParagraph.Range.Replace((string) origRow["EntryName"], (string) sortedRow["EntryName"], false, false);
        currentParagraph.Range.Replace((string) origRow["PageRef"], (string) sortedRow["PageRef"], false, false);
        currentParagraph.Range.Replace((string) origRow["Page"], (string) sortedRow["Page"], false, false);
        // You can add code here to edit the appearance of the new paragraph entry.
        // Goto the next paragraph which should be the next entry.
        currentParagraph = (Paragraph) currentParagraph.NextSibling;
    }
}

public static DataTable TableOfContentsToDataTable(Document doc, int tocIndex)
{
    DataTable table = new DataTable();
    table.TableName = "Toc " + tocIndex;
    table.Columns.Add("EntryName");
    table.Columns.Add("PageRef");
    table.Columns.Add("Page");
    // Get the FieldStart of the specified TOC.
    Node currentNode = (Node) FindTocStartFromIndex(doc, tocIndex);
    // Skip forward to the first field separator (after the TOC field code).
    while (currentNode.NodeType != NodeType.FieldSeparator)
        currentNode = currentNode.NextPreOrder(doc);
    // First node of the paragraph
    currentNode = currentNode.NextPreOrder(doc);
    bool isCollecting = true;
    int countOfFieldItems = 0;
    bool isAfterFirstTocEntry = false;
    while (isCollecting)
    {
        StringBuilder entryText = new StringBuilder();
        StringBuilder pageRefCode = new StringBuilder();
        StringBuilder pageText = new StringBuilder();
        // Ensures that first entry is gotten from TOC
        if (!isAfterFirstTocEntry)
        {
            // Skip nodes until encounters a run
            while (currentNode.NodeType != NodeType.Run)
            {
                currentNode = currentNode.NextPreOrder(doc);
            }
            isAfterFirstTocEntry = true;
        }
        // Collect runs until start of FieldStart which make up the entry name of the TOC
        while (currentNode.NodeType != NodeType.FieldStart)
        {
            countOfFieldItems++;
            entryText.Append(currentNode.ToTxt().Trim());
            if (countOfFieldItems == 3)
            {
                entryText.Append(currentNode.ToTxt().Trim());
            }
            currentNode = currentNode.NextPreOrder(doc);
        }
        countOfFieldItems = 0;
        // Skip nodes until FieldStart (of PAGEREF)
        while (currentNode.NodeType != NodeType.FieldStart)
        {
            currentNode = currentNode.NextPreOrder(doc);
        }
        currentNode = currentNode.NextPreOrder(doc);
        pageRefCode.Append(currentNode.ToTxt());
        // Skip nodes until FieldSeparator (of PAGEREF)
        while (currentNode.NodeType != NodeType.FieldSeparator)
        {
            currentNode = currentNode.NextPreOrder(doc);
        }
        // Add the runs from the field which should be the page number
        currentNode = currentNode.NextPreOrder(doc);
        pageText.Append(currentNode.ToTxt());
        // Add to datatable
        table.Rows.Add(new string[]
        {
            entryText.ToString(), pageRefCode.ToString(), pageText.ToString()
        });
        currentNode = currentNode.NextPreOrder(doc);
        // Skip to the first run of the next paragraph (should be the next entry).
        // Check whether the TOC field end is found at the same time.
        bool isNextPara = false;
        bool isChecking = true;
        while (isChecking)
        {
            currentNode = currentNode.NextPreOrder(doc);
            // No node found, break.
            if (currentNode == null)
            {
                isCollecting = false;
                break;
            }
            // Passed a new paragraph
            if (currentNode.NodeType == NodeType.Paragraph)
                isNextPara = true;
            // Found first run of a new paragraph
            if (isNextPara && currentNode.NodeType == NodeType.Run)
                isChecking = false;
            // Once we encounter a FieldEnd node of type FieldTOC then we know we are at the end
            // of the current TOC and we can stop here.
            if (currentNode.NodeType == NodeType.FieldEnd)
            {
                FieldEnd fieldEnd = (FieldEnd) currentNode;
                if (fieldEnd.FieldType == FieldType.FieldTOC)
                {
                    isCollecting = false;
                    break;
                }
            }
        }
    }
    return table;
}

public static FieldStart FindTocStartFromIndex(Document doc, int tocIndex)
{
    // Store the FieldStart nodes of TOC fields in the document for quick access.
    ArrayList fieldStarts = new ArrayList();
    foreach (FieldStart start in doc.GetChildNodes(NodeType.FieldStart, true))
    {
        if (start.FieldType == FieldType.FieldTOC)
        {
            // Add all FieldStarts which are of type FieldTOC.
            fieldStarts.Add(start);
        }
    }
    // Ensure the TOC specified by the passed index exists.
    if (tocIndex < 0 || tocIndex > fieldStarts.Count - 1)
        throw new ArgumentOutOfRangeException("tocIndex", "TOC index is out of range");
    return (FieldStart) fieldStarts[tocIndex];
}
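
For reference, the DataView passed to RepopulateTocWithValues can be built from the extracted table with plain System.Data calls. The sketch below is only an illustration under assumptions: the sample rows are made up, the names TocSortSketch, BuildSampleTable, BuildSortedView and DescribeMoves are hypothetical, and no Aspose.Words call is involved. It shows how view[index].Row (sorted order) lines up against view.Table.Rows[index] (original order), which is the pairing the repopulate loop relies on.

```csharp
using System;
using System.Collections.Generic;
using System.Data;

public static class TocSortSketch
{
    // Builds a table with the same columns as TableOfContentsToDataTable (sample data only).
    public static DataTable BuildSampleTable()
    {
        DataTable table = new DataTable("Toc 0");
        table.Columns.Add("EntryName");
        table.Columns.Add("PageRef");
        table.Columns.Add("Page");
        table.Rows.Add(new object[] { "Item 3", " PAGEREF _Toc3 \\h ", "1" });
        table.Rows.Add(new object[] { "Item 1", " PAGEREF _Toc1 \\h ", "2" });
        table.Rows.Add(new object[] { "Item 2", " PAGEREF _Toc2 \\h ", "3" });
        return table;
    }

    // Wraps the table in a view sorted by entry name; the table keeps its original row order.
    public static DataView BuildSortedView(DataTable table)
    {
        DataView view = new DataView(table);
        view.Sort = "EntryName ASC";
        return view;
    }

    // Lists, position by position, which original entry would be replaced by which sorted entry.
    public static List<string> DescribeMoves(DataView view)
    {
        List<string> moves = new List<string>();
        for (int index = 0; index < view.Count; index++)
        {
            DataRow origRow = view.Table.Rows[index];
            DataRow sortedRow = view[index].Row;
            moves.Add((string) origRow["EntryName"] + " -> " + (string) sortedRow["EntryName"]);
        }
        return moves;
    }
}
```

With Aspose.Words in place, a view built this way from the real extracted table would be the third argument of RepopulateTocWithValues(doc, 0, view).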

Thanks,

robertal · February 25, 2011, 5:09pm

Okay, just noticed a problem in RepopulateTocWithValues() at this line of code:

currentParagraph.Range.Replace((string) origRow["Page"], (string) sortedRow["Page"], false, false);

It works fine except that it will replace any instance of the page number in the current paragraph. For example, if my origRow page number is “9” and my EntryName contains 9, then the 9 in the EntryName is replaced with whatever value is in the sortedRow page number.

I have tried using a regex replacement scheme, but that fails to get the number. I used:

currentParagraph.Range.Replace(new System.Text.RegularExpressions.Regex(pageRegex), (string) sortedRow["Page"]);

where pageRegex = "^" + (string) origRow["Page"] + "$"

I am thinking that there must be other hidden text before and/or after the page number, which I haven’t been able to account for in my regex.

Any help would be greatly appreciated.

Thank you.

adam.skelton · February 25, 2011, 7:02pm

Hi Rob,
Thanks for your inquiry.
You’re correct that there are other hidden characters causing your regex not to work. Namely, these are the FieldSeparator and FieldEnd characters. Please see the regex below, which matches the exact field result containing the page number.

Regex reg = new Regex(string.Format("(?<={0}){1}(?={2})", ControlChar.FieldSeparatorChar, Regex.Escape((string) origRow["Page"]), ControlChar.FieldEndChar));
currentParagraph.Range.Replace(reg, (string) sortedRow["Page"]);

The FieldSeparator and FieldEnd characters have to be part of the pattern, but matching them directly will lead to an exception as the replace method does not support special characters in the replacement string. Therefore I used a lookbehind and a lookahead so these characters are checked but not included in the match. This should work as expected now.
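
To see the idea outside of Aspose.Words, here is a minimal sketch that uses only System.Text.RegularExpressions on a plain string. It is an illustration under assumptions, not the library's behavior: the class and method names are hypothetical, the sample text only imitates what a TOC paragraph might look like, and the two constants are meant to mirror ControlChar.FieldSeparatorChar and ControlChar.FieldEndChar (to my knowledge, characters 20 and 21). The pattern uses a lookbehind for the separator and a lookahead for the field end, so only the page number itself is matched.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PageNumberRegexSketch
{
    // Assumed to equal ControlChar.FieldSeparatorChar and ControlChar.FieldEndChar.
    private const char FieldSeparatorChar = '\u0014';
    private const char FieldEndChar = '\u0015';

    // Replaces oldPage only where it is the whole field result, that is, directly
    // between a field separator and a field end character.
    public static string ReplacePageNumber(string paragraphText, string oldPage, string newPage)
    {
        string pattern = string.Format("(?<={0}){1}(?={2})",
            FieldSeparatorChar, Regex.Escape(oldPage), FieldEndChar);
        // A MatchEvaluator is used so that newPage is inserted literally.
        return Regex.Replace(paragraphText, pattern, m => newPage);
    }
}
```

For example, given the text "Section 9 overview", a tab, and a PAGEREF field whose result is 9, calling ReplacePageNumber(text, "9", "12") changes only the field result; the 9 in the entry name and in the bookmark name are left alone because they are not enclosed by the two field characters.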
Thanks,

robertal · February 28, 2011, 11:21am

Works like a champ. Thanks.

robertal · March 11, 2011, 8:33am

Alright, time for a new twist. Some of my headings need to have colored text or strikethrough, and this additional formatting gets included in the TOC. The problem arises when I extract the items, sort them, and put them back in. The additional text formatting is not carried with the items, but stays with the initial row in the TOC. (This is the reason why I asked for help in message 290194.) For example (using underline in place of strikethrough):

TOC - Unsorted
Item 3
Item 1
Item 2

becomes

TOC Sorted
Item 1
Item 2
Item 3

Here is an example of where the data is pulled from:

Item 3
jsfkljlsjafkljslkjflksd
slfkjsldjflkdsjlfkjsdlkjflk

Item 1
djfslkjklsjklfjlskdfjlksd
lksjdflsjdlkfjdsjflklksdflk

Item 2
kldfjklsdjfljsdlkflkdsj
klsdjflkdsjlkfjlksdjl

Any suggestions on how to pass the additional text formatting in the sorting? I looked at the TOC fields and could not tell any difference between entries that would indicate they contained text formatting information.

Thanks!

adam.skelton · March 12, 2011, 4:35am

Hi Rob,
Thanks for your inquiry.
Could you please attach a sample template here for testing purposes and I will take a closer look for you. I prepared a quick template on my side, but when using a hyperlinked TOC neither MS Word nor Aspose.Words applies any direct formatting to the TOC entries, which would make the above unnecessary.
Thanks,

robertal · March 15, 2011, 4:21pm

Attached is my sample code. Please note which headers are in red and which are in black.

Thanks!

adam.skelton · March 15, 2011, 6:50pm

Hi Rob,
Thanks for attaching your code here. I’m afraid this produces the same sort of result as in my test - with a hyperlinked TOC no direct formatting shows through on the TOC. Therefore there is no way to retain this formatting during processing as it does not appear on the TOC in the first place.
Thanks,

robertal · March 16, 2011, 10:49am

Hi,

I ran the code without the hyperlink tag and got the same results. Can you do the same on your end to verify? (I removed the \h from the third TOC.) So if it does the same thing on your end, then formatting is not retained with the entry for which it is initially entered. The formatting just stays at the physical location of where the initial entry was. So I guess there is no “fix” for this then?

Thanks.

adam.skelton · March 16, 2011, 10:01pm

Hi Rob,
Thanks for your inquiry.
Actually, in my testing the direct formatting was applied to the entries in the TOC if the TOC was not hyperlinked (no \h switch). I’m not sure if this helps, as I think you require your TOC to be hyperlinked. It seems strange that the TOC will not accept direct formatting when hyperlinked; my guess is this is a limitation of MS Word.
Thanks,

robertal · March 17, 2011, 10:14am

To make sure we are on the same page: by direct formatting, do you mean the formatting applied to a heading in addition to the doc.Styles[StyleIdentifier.x] style, or the doc.Styles[StyleIdentifier.x] style itself? If we mean the former, I wonder why we are not getting the same results, because when I run the code on my side without the \h I get the same results as with the \h. I will try running it again, though; could you do the same?

Just to make sure we are looking at the same thing: we are looking at the third TOC, the one that is converted to an index using the workaround code we came up with, right?

Thanks.

-Rob

adam.skelton · March 17, 2011, 4:42pm

Hi Rob,
Yes by direct formatting I mean formatting included in addition to the original style which is applied directly onto the paragraph.
Please see the attachments which demonstrate my previous post. You will see that with the hyperlink switch direct formatting is not included, whereas without the switch direct formatting is applied.
Thanks,

robertal · March 17, 2011, 5:27pm

Ah, I see the problem. TOC Without Hyperlink Switch.doc is not sorted. Looking at the code I sent you, the necessary functions were not included. They are now attached. (Curious, though: how did you get my sample code to work without PopulateTableOfContents() and its sub-functions?)

Please include these functions and try again. I think then you will see my problem.

Thanks!

-Rob

adam.skelton · March 18, 2011, 11:17pm

Hi Rob,
Thanks for this additional information.
Good news: it seems that when I first tested the output of direct formatting on a hyperlinked TOC using Aspose.Words I got an incorrect result. It turns out that Aspose.Words does still copy direct formatting even when MS Word does not. This means it is possible to achieve what you are looking for.
I was still able to run your tests before because I simply excluded the methods which were not available. The test was only to check whether direct formatting would be included in a hyperlinked TOC.
Please make the changes to your code below. This copies the formatting of the runs over for the entry name of each TOC entry, so the correct formatting is applied when entries are moved around. I think you can also remove the EntryName column now, as I don’t think it’s used. Note that I have already removed the Range.Replace call for the entry name just in case.
In the TableOfContentsToDataTable method:

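// EntryRef is read in RepopulateTocWithValues and is the first value passed to Rows.Add below.
table.Columns.Add("EntryRef");
table.Columns.Add("EntryName");
table.Columns.Add("ResultStartNode", typeof(Node));
table.Columns.Add("ResultRuns", typeof(List<Run>));
table.Columns.Add("PageRef");
table.Columns.Add("Page");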
// Break if no data products in IDMP
if (currentNode.Range.Text.Contains("No table of contents entries found."))
{
    table.Columns.Clear();
    return table;
}
Node entryPositionNode = null;
List<Run> fieldResultRuns = new List<Run>();
while (currentNode.NodeType != NodeType.FieldStart)
{
    countOfFieldItems++;
    if (currentNode.NodeType == NodeType.Run)
    {
        if (entryPositionNode == null)
            entryPositionNode = currentNode.PreviousPreOrder(doc);
        fieldResultRuns.Add((Run) currentNode.Clone(false));
    }
    entryText.Append(currentNode.Range.Text.Trim());
    currentNode = currentNode.NextPreOrder(doc);
}
countOfFieldItems = 0;
table.Rows.Add(new object[] { entryRefCode.ToString(), entryText.ToString(), entryPositionNode, fieldResultRuns, pageRefCode.ToString(), pageText.ToString() });

In the RepopulateTocWithValues method:

// The row at the same position in the original (unsorted) table.
DataRow origRow = origTable.Rows[index];
// Replace each part of the original entry with each part of the sorted entry.
if (!string.IsNullOrEmpty((string) origRow["EntryRef"]))
{
    currentParagraph.Range.Replace((string) origRow["EntryRef"], (string) sortedRow["EntryRef"], false, false);
}
Node previousNode = (Node) origRow["ResultStartNode"];
RemoveRunsInFieldResult(previousNode);
List<Run> runList = (List<Run>) sortedRow["ResultRuns"];
Node targetNode = previousNode;
foreach (Run run in runList)
{
    // In a TOC without the hyperlink switch it is possible that the first run of the
    // entry text is at the start of the paragraph, this will add a target of type paragraph.
    // Prepend the runs to start of this paragraph
    if (targetNode.NodeType == NodeType.Paragraph)
        ((Paragraph) targetNode).PrependChild(run);
    else
        targetNode.ParentNode.InsertAfter(run, targetNode); // Add the results after the first node.
    targetNode = run;
}
currentParagraph.Range.Replace((string) origRow["PageRef"], (string) sortedRow["PageRef"], false, false);

You also need this new method:

public static void RemoveRunsInFieldResult(Node startNode)
{
    Node currentNode = startNode.NextPreOrder(startNode.Document);
    // Stop at the next FieldStart, or when the end of the document is reached.
    while (currentNode != null && currentNode.NodeType != NodeType.FieldStart)
    {
        Node nextNode = currentNode.NextPreOrder(currentNode.Document);
        currentNode.Remove();
        currentNode = nextNode;
    }
}

Thanks,