Converting a Document to HTML with the TOC as a left panel

monir.aittahar · January 30, 2019, 11:26am

I’m evaluating a replacement for another converter, which produces an HTML document with frames. The left frame contains the table of contents (TOC).

Aspose converts a document very well, but the TOC is at the top of the contents. Is there an out-of-the-box way to get the TOC as a left frame?

Otherwise, I saw that there is a way to retrieve the SectionCollection nodes. In the HTML conversion, each TOC entry points to a section named _TocXXXXXXXXX. If I can retrieve these names, I could generate a TOC manually. Does Aspose allow that?

Regards,
M

awais.hafeez · January 30, 2019, 4:06pm

@monir.aittahar,

Please check whether the following code helps:

Document doc = new Document("D:\\Temp\\toc.docx");

foreach (Field field in doc.Range.Fields)
{
    if (field.Type.Equals(Aspose.Words.Fields.FieldType.FieldHyperlink))
    {
        FieldHyperlink hyperlink = (FieldHyperlink)field;
        if (hyperlink.SubAddress != null && hyperlink.SubAddress.StartsWith("_Toc"))
        {
            Paragraph tocItem = (Paragraph)field.Start.GetAncestor(NodeType.Paragraph);
            // Check for null before using the paragraph
            if (tocItem != null)
            {
                Console.WriteLine(tocItem.ToString(SaveFormat.Text).Trim());
                Console.WriteLine("------------------");
                // The indexer returns null when no bookmark has this name
                Bookmark bm = doc.Range.Bookmarks[hyperlink.SubAddress];
                if (bm != null)
                {
                    // Get the location this TOC item is pointing to
                    Paragraph pointer = (Paragraph)bm.BookmarkStart.GetAncestor(NodeType.Paragraph);
                    Console.WriteLine(pointer.ToString(SaveFormat.Text));
                }
            }

            Console.WriteLine("|||||||||||||||||||||||||||||");
        }
    }
}

Or you can ZIP and attach the following resources here for testing:

Your simplified input document
Aspose.Words 19.1 generated output document showing the undesired behavior
Your expected document showing the correct output. You can create expected document by using MS Word.
Please also create a standalone simple console application (source code without compilation errors) that helps us to reproduce your current problem on our end and attach it here for testing. Please do not include Aspose.Words DLL files in it to reduce the file size.
Any additional steps that you think might be required to reproduce this issue on our end.

As soon as you get these pieces of information ready, we will start further investigation into your scenario and provide you more information. Thanks for your cooperation.

monir.aittahar · January 31, 2019, 5:40pm

Hi @awais.hafeez,

Thank you very much for this very useful sample. I’m now able to manually generate an HTML TOC with hyperlinks.
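For readers who want to do the same, here is a minimal sketch of how such a left-panel TOC might be assembled once the bookmark names and entry titles have been collected. It does not use Aspose.Words at all: `TocEntry`, `TocPanel`, `BuildTocHtml`, the content file name and the frame name are hypothetical names chosen for illustration only.

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using System.Text;

// Hypothetical holder for one TOC entry; not an Aspose.Words type.
public sealed class TocEntry
{
    public string Anchor { get; }   // bookmark name, e.g. "_Toc475005740"
    public string Title { get; }    // entry text without the page number

    public TocEntry(string anchor, string title)
    {
        Anchor = anchor;
        Title = title;
    }
}

public static class TocPanel
{
    // Builds an HTML list whose links open the anchors of contentFile
    // in the frame (or iframe) named targetFrame.
    public static string BuildTocHtml(IEnumerable<TocEntry> entries, string contentFile, string targetFrame)
    {
        var sb = new StringBuilder();
        sb.AppendLine("<ul>");
        foreach (TocEntry entry in entries)
        {
            sb.Append("  <li><a href=\"")
              .Append(WebUtility.HtmlEncode(contentFile + "#" + entry.Anchor))
              .Append("\" target=\"")
              .Append(WebUtility.HtmlEncode(targetFrame))
              .Append("\">")
              .Append(WebUtility.HtmlEncode(entry.Title))
              .AppendLine("</a></li>");
        }
        sb.AppendLine("</ul>");
        return sb.ToString();
    }
}
```

The returned markup can be written to the page loaded in the left frame, with the converted document loaded in the frame named by `targetFrame`. Titles are HTML-encoded so that characters such as `&` in a heading do not break the markup.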

(Disclaimer: due to a licensing issue, I cannot upgrade to the latest version yet, so I ran the following tests with version 18.10 of Aspose.Words.)

However, I noticed something about this line:

Console.WriteLine(pointer.ToString(SaveFormat.Text));

Of course it retrieves the whole text of a TOC entry, which includes the page number. I tried to get rid of the page number. So I “crawled” through the tocItem entry.

I could not retrieve anything more precise than the Result member of the FieldHyperlink object embedded in the tocItem instance:

1 Expression du besoin  PAGEREF _Toc475005740 \h 1

Is there a way to retrieve the number and the text of the TOC entry without the page number?

Regards.

awais.hafeez · February 1, 2019, 4:51am

@monir.aittahar,

Please ZIP and upload your sample input Word document here for testing. We will investigate the scenario on our end and provide you more information.

monir.aittahar · February 1, 2019, 10:15am

Hi @awais.hafeez, The file lorem_ipsum.zip (13.1 KB) is uploaded.

Console log lines produced by the sample you provided look like:

1 Lorem Ipsum 1
1.1 Lorem Ipsum 1
1.2 Lorem Ipsum 1
2 Lorem Ipsum 1
2.1 Lorem Ipsum 1
2.1.1 Lorem Ipsum 1
2.1.2 Lorem Ipsum 1

The last number is the page number. I would like to get rid of it without editing the string manually.

Thank you for your help.
Best regards.

awais.hafeez · February 1, 2019, 1:08pm

@monir.aittahar,

It is represented by a PageRef field. You can remove this field from inside every TOC item before getting the string representation:

Document doc = new Document("E:\\temp\\lorem_ipsum\\lorem_ipsum.docx");

foreach (Field field in doc.Range.Fields)
{
    if (field.Type.Equals(Aspose.Words.Fields.FieldType.FieldHyperlink))
    {
        FieldHyperlink hyperlink = (FieldHyperlink)field;
        if (hyperlink.SubAddress != null && hyperlink.SubAddress.StartsWith("_Toc"))
        {
            Paragraph tocItem = (Paragraph)field.Start.GetAncestor(NodeType.Paragraph);

            foreach (Field nestedField in tocItem.Range.Fields)
            {
                if (nestedField.Type.Equals(FieldType.FieldPageRef))
                {
                    nestedField.Remove();
                }
            }

            Console.WriteLine(tocItem.ToString(SaveFormat.Text).Trim());
            Console.WriteLine("------------------");
        }
    }
}

monir.aittahar · February 2, 2019, 12:09am

Hi @awais.hafeez,

Thank you very much.

Regards,
M

monir.aittahar · March 26, 2019, 6:45pm

Dear all,

The documents I’m dealing with have a TOC built in another way (a single { TOC \o "1-3" } field instead of a bunch of { HYPERLINK \l "_TocXXXXX" } fields).

The entries are still clickable when the document is opened in Word, but the links are not preserved in the HTML conversion, although the titles are still preceded by named anchors (<a name="_Toc36886258">).

Inside the Aspose.Words.Document object representing the Word document, the TOC entries are not reachable as FieldHyperlink objects, but as a FieldToc. I am struggling to get the names of the anchors (the “_Toc” part) related to the TOC entries. Is there a way to do that?

You’ll find a link to the sample as an attached ZIP file below.

lorem_ipsum-toc_wo_hyperlinks.zip (25.4 KB)

awais.hafeez · March 27, 2019, 5:20am

@monir.aittahar,

For this case, please try using the following code:

Document doc = new Document("E:\\lorem_ipsum-toc_wo_hyperlinks\\lorem_ipsum-toc_wo_hyperlinks.docx");

foreach (Field field in doc.Range.Fields)
{
    if (field.Type.Equals(Aspose.Words.Fields.FieldType.FieldPageRef))
    {
        FieldPageRef pageRef = (FieldPageRef)field;
        if (pageRef.BookmarkName != null && pageRef.BookmarkName.StartsWith("_Toc"))
        {
            Paragraph tocItem = (Paragraph)field.Start.GetAncestor(NodeType.Paragraph);

            //foreach (Field nestedField in tocItem.Range.Fields)
            //{
            //    if (nestedField.Type.Equals(FieldType.FieldPageRef))
            //    {
            //        nestedField.Remove();
            //    }
            //}

            Console.WriteLine(tocItem.ToString(SaveFormat.Text).Trim());
            Console.WriteLine("------------------");
        }
    }
}

Hope this helps.

monir.aittahar · March 27, 2019, 10:12am

Hi @awais.hafeez,

Thank you very much. I just had to add a line right after retrieving tocItem to get rid of the page number:

Paragraph tocItem = (Paragraph)field.Start.GetAncestor(NodeType.Paragraph);
field.Remove(); // Get rid of the page number

If I understand correctly, in Microsoft Word a TOC can be built either as a set of Hyperlink fields or as a set of PageRef fields?

awais.hafeez · March 27, 2019, 11:23am

@monir.aittahar,

Yes, both are valid types of TOC fields. Please let us know if you have any trouble and we will be glad to look into this further for you.
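Since both field types can occur, the two loops from this thread could perhaps be combined into one pass. The following is only a sketch under that assumption, using the same Aspose.Words members shown in the earlier samples (`SubAddress`, `BookmarkName`, `GetAncestor`). The class name `TocAnchors`, the helper `GetTocBookmark` and the file path are hypothetical, and the code has not been run against the sample documents from this thread.

```csharp
using System;
using System.Collections.Generic;
using Aspose.Words;
using Aspose.Words.Fields;

public static class TocAnchors
{
    // Hypothetical helper: returns the "_Toc..." bookmark name that a
    // HYPERLINK or PAGEREF field points to, or null for any other field.
    private static string GetTocBookmark(Field field)
    {
        string name = null;
        if (field.Type == FieldType.FieldHyperlink)
            name = ((FieldHyperlink)field).SubAddress;
        else if (field.Type == FieldType.FieldPageRef)
            name = ((FieldPageRef)field).BookmarkName;

        return name != null && name.StartsWith("_Toc") ? name : null;
    }

    public static void Main()
    {
        Document doc = new Document("toc.docx"); // hypothetical path

        // An entry may carry both a HYPERLINK and a nested PAGEREF that
        // point to the same bookmark, so each bookmark is reported once.
        var seen = new HashSet<string>();

        foreach (Field field in doc.Range.Fields)
        {
            string bookmark = GetTocBookmark(field);
            if (bookmark == null || !seen.Add(bookmark))
                continue;

            Paragraph tocItem = (Paragraph)field.Start.GetAncestor(NodeType.Paragraph);
            if (tocItem == null)
                continue;

            // The text still ends with the page number here; removing the
            // PAGEREF field as shown earlier in the thread would drop it.
            Console.WriteLine(bookmark + " -> " + tocItem.ToString(SaveFormat.Text).Trim());
        }
    }
}
```

The bookmark name printed on each line is the value of the named anchor (`<a name="_Toc...">`) that appears before the corresponding heading in the HTML output, so it can be used as the link target when building a TOC by hand.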