MHTML to Doc - retain TOC functionality

We have historically used Aspose.Words for Java with RTF-generated documents, converting them to DOC and PDF file formats. We are in the process of overhauling our system and are looking to use HTML or MHTML instead of RTF, while still using Aspose to convert to DOC and PDF. I have looked at the round trip documentation and have tried having our system convert from RTF to DOC to MHTML (with roundtrip set to true) and back to DOC, and the Table of Contents doesn’t work. The hyperlinks don’t take you to the correct place in the document, and we lose the page numbering in the TOC itself. I have read that TOC is not currently supported, but the documentation note for that is pretty old.

I’m wondering if you have any suggestions or a workaround that would let us use the MHTML format and still get a populated Table of Contents in the final Word document. I’ve tried looking for bookmarks in the MHTML file data and using the jump-to and create-TOC functionality, but it doesn’t seem to work with MHTML data.

Any assistance you can offer would be appreciated.

@AurinBlackstaff

To ensure a timely and accurate response, please attach the following resources here for testing:

  • Your input Word/RTF document.
  • The output document that shows the undesired behavior.
  • The expected output document that shows the desired behavior.
  • A simple Java application (source code without compilation errors) that helps us reproduce your problem on our end.

As soon as you have these pieces of information ready, we will start investigating your issue and provide you with more information. Thanks for your cooperation.

PS: To attach these resources, please zip and upload them.

SampleFiles.zip (115.1 KB)

I have attached the requested files.

Orginal_and_Expected_Results.doc is the file we currently get from Aspose.Words for Java after it has converted our existing RTF-formatted documents. This is the result we are looking to duplicate by using the MHTML format instead.

You will see that the file converted to MHTML with setExportRoundtripInformation(true) as a save option loses the document borders, the footer formatting is broken, and the links in the TOC no longer point to the expected places in the document.

We want the final result to mirror the Orginal_and_Expected_Results.doc file. Is this even possible when coming from an MHTML file? If so, we need insight on how to go about it and what we need in the MHTML file so that Aspose can convert it properly.

The sample code receives a string containing the file data and returns the file data as a stream, to and from Perl system code, in case you need more insight on the code side of things.

Again, anything you can do to assist with this matter would be helpful. We need to determine whether we can migrate to an MHTML- or HTML-based format or whether we will need to look into other options.

Thank you

@AurinBlackstaff

In your case, we suggest using the HtmlSaveOptions.ExportTocPageNumbers property to export page numbers to the table of contents in the output HTML. Hope this helps you.

HtmlSaveOptions options = new HtmlSaveOptions(SaveFormat.MHTML);
options.setExportTocPageNumbers(true);
options.setExportRoundtripInformation(true);

Thank you for the insight on the page numbering in the TOC. Unfortunately, I think there is a bit of miscommunication about what I am looking for. We are hoping to draft a document in MHTML format and submit it to Aspose to save as a DOC file. We don’t want to use the round trip; we are only using it as a baseline to understand how the MHTML file needs to be built for Aspose to format the final DOC file properly.

I need to know whether a Word table of contents can be created from an MHTML file. We would not know page numbers and the like until after the final document has been assembled. With the RTF format we currently use, tags in the RTF data let the Aspose updateFields() command build the Word TOC and populate all the page numbers and links. This is what we get in the Orginal_and_Expected_Results.doc file I submitted. We need the same functionality when coming directly from an MHTML source. Is this possible, and if so, how? We don’t want hyperlinks; we want the actual Word TOC field, which updateFields() can populate from tags used throughout the document, as we can with the RTF source.

@AurinBlackstaff

The table of contents is a TOC field in an MS Word document. A field in a Word document is a complex structure consisting of multiple nodes: field start, field code, field separator, field result and field end.

In this case, we suggest the following solution.

  1. Import the MHTML document into Aspose.Words’ DOM. This MHTML does not contain the table of contents.
  2. Move the cursor to the location where you want to insert the table of contents.
  3. Insert the TOC field using the DocumentBuilder.InsertTableOfContents method.
  4. Call the Document.UpdateFields method before saving the document.

Hope this helps you.

Thank you for the assistance. That worked for generating a table of contents as expected, as long as I use proper heading tags in the MHTML source. The issue I am running into now is moving around inside the document. I have looked at the Move the cursor link and played around with it. Are there certain HTML tags I should be using for the builder to recognize them as nodes, paragraphs, bookmarks, etc.?

I have attached the MHTML source file I am working with and an updated sample code file. I’m able to get to where I want to insert the TOC by moving through sections and paragraphs, but the documents are built dynamically and the TOC may end up in different sections, so I need to put something specific in the source file that the builder can look for directly, such as a bookmark named TOC. I know there is a moveToBookmark function, but what would my HTML tag need to look like for the builder to find it and insert the TOC at that position?

In case you need it, here is the line in the attached MHTML sample file after which I want to insert the TOC.

MHTMLSourceFile.zip (44.2 KB)

TABLE OF CONTENTS

@AurinBlackstaff

We suggest reading the following article about Aspose.Words’ document object model.
Aspose.Words Document Object Model

You can use the following anchor tag in your MHTML. It is imported as a Bookmark into Aspose.Words’ DOM.
<a name="mybookmark"></a>

You can move the cursor to the bookmark and insert the table of contents.

Alternatively, you can find the text, e.g. ‘TABLE OF CONTENTS’, in your document (MHTML) and insert the table of contents after it. In this case, you need to use the find and replace feature of Aspose.Words: implement the IReplacingCallback interface and, in the IReplacingCallback.Replacing method, move the cursor to the node after the matched node and insert the table of contents. Please read the following article.
Find and Replace

Hope this helps you.

I was able to get the TOC to work properly by jumping to the bookmark tag. The only thing is that I had to use <a name=3D"TOCBookMark"> for Aspose to recognize it; your example didn’t import into the DOM as a recognized bookmark. Regardless, that issue has been resolved, so I appreciate the help on that.
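
A possible explanation for the =3D, offered as an assumption because the thread does not confirm how the MHTML source is encoded: MHTML parts are commonly stored with the quoted-printable transfer encoding (RFC 2045), in which a literal = has to be written as =3D. If that is the case here, any anchor tag written into the file by hand needs the same escaping. The sketch below shows that escaping with plain Java and no Aspose calls. The class and method names are hypothetical, and it deliberately skips the soft line breaks and trailing-whitespace rules of the full encoding.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Aspose.Words.
class QuotedPrintableAnchor {

    // Simplified quoted-printable encoder: '=' and any byte outside
    // printable ASCII become "=XX" (uppercase hex); everything else is
    // copied through. Soft line breaks (76-character limit) and the
    // trailing-whitespace rule are intentionally omitted.
    static String encodeQuotedPrintable(String text) {
        StringBuilder out = new StringBuilder();
        for (byte b : text.getBytes(StandardCharsets.UTF_8)) {
            int c = b & 0xFF;
            boolean printable = c >= 33 && c <= 126 && c != '=';
            boolean whitespace = c == ' ' || c == '\t' || c == '\r' || c == '\n';
            if (printable || whitespace) {
                out.append((char) c);
            } else {
                out.append(String.format("=%02X", c));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The bookmark name is the one used earlier in this thread.
        String anchor = "<a name=\"TOCBookMark\"></a>";
        // prints: <a name=3D"TOCBookMark"></a>
        System.out.println(encodeQuotedPrintable(anchor));
    }
}
```

If the MHTML part is instead declared as 8bit or base64, this escaping would not apply, so it is worth checking the Content-Transfer-Encoding header of the HTML part first.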

I noted in one of the prior posts that the page borders weren’t working when going from the MHTML format to DOC. I dug in, tried a few things, and got that to work using the PageSetup and border settings in code like below. The issue I now have is that saving as PDF does everything correctly except for the borders. Is there a difference I need to account for to get the borders to show when saving the file to PDF format?
int sections = rtfDoc.getSections().getCount();
for (int i = 0; i < sections; i++) {
    PageSetup ps = rtfDoc.getSections().get(i).getPageSetup();
    ps.setBorderSurroundsFooter(false);
    ps.setBorderSurroundsHeader(false);
    ps.setBorderAlwaysInFront(false);
    ps.setBorderDistanceFrom(PageBorderDistanceFrom.TEXT);
    ps.setBorderAppliesTo(PageBorderAppliesTo.ALL_PAGES);
    ps.getBorders().setLineStyle(LineStyle.SINGLE);
    ps.getBorders().setLineWidth(2);
    ps.getBorders().setColor(Color.BLACK);
    ps.getBorders().setDistanceFromText(10);
}

I have attached the source MHTML, the output to DOC format, the output to PDF format and my code for review. PDFResults.zip (224.2 KB)

@AurinBlackstaff

Please call the Document.UpdatePageLayout method before saving the document to PDF. Hope this helps you.

If you still face the problem, please ZIP and attach your input RTF here for testing. We will investigate the issue and provide you with more information on it.