When combining two HTML documents using Aspose.Words, data from one of them is trimmed

Hi,
We are using Aspose.Words to combine a couple of HTML documents and then export them as a single PDF file.

The problem is that the second HTML document has a table with a lot of columns, and some of those columns are missing in the combined output PDF.

When I export only the second HTML document to PDF, all the columns are visible, so it seems that some scaling happens when the two are combined.

Could someone help with how to make sure the combined PDF displays all the data?

Note: I found that when I change the order of inserting the documents, i.e. inserting the second HTML file (the bigger one) first, followed by the first HTML file, the combined PDF also displays everything. So maybe the first inserted document’s width is somehow used? We obviously cannot use this as a solution, as we do not know which of the input HTML files will be bigger or wider.

Rough Java code of how it executes currently:

DocumentBuilder documentBuilder = documentBuilderProvider.get();

Document firstHtmlDocument = new Document(firstHtmlInputStream, new LoadOptions(LoadFormat.HTML, null, null));
Document secondHtmlDocument = new Document(secondHtmlInputStream, new LoadOptions(LoadFormat.HTML, null, null));

documentBuilder.insertDocument(firstHtmlDocument, ImportFormatMode.KEEP_SOURCE_FORMATTING);
documentBuilder.insertDocument(secondHtmlDocument, ImportFormatMode.KEEP_SOURCE_FORMATTING);

documentBuilder.getDocument().save(outputStream, new PdfSaveOptions());

The attached zip file contains the two HTML files and the output PDF ‘combinedOutput.pdf’ that shows the problem. It also contains another PDF file that shows how the formatting issue disappears when the HTML file with lots of table columns is inserted first, as mentioned in the Note above.
TestFiles.zip (141.8 KB)

@hunair I think in your case it is better to use the Document.AppendDocument method to merge the documents instead of DocumentBuilder.InsertDocument. The following code produces the correct output on my side:

Document doc1 = new Document(@"C:\Temp\HtmlFile1.html");
Document doc2 = new Document(@"C:\Temp\HtmlFile2.html");

// Uncomment this line if you do not need a separate page for the second document.
// doc2.FirstSection.PageSetup.SectionStart = SectionStart.Continuous;

doc1.AppendDocument(doc2, ImportFormatMode.KeepSourceFormatting);

doc1.Save(@"C:\Temp\out.pdf");

Thanks Alexey. When I tried it with an empty document and appended the above documents to it, the trimming issue still persisted. However, if I just create the first document and append the second one to it, as shown in your example, it works. I wonder why it didn’t work in the earlier case.
Also, as you would have seen, the columns wrap a lot, even the column headers. Is there a way, such as a property, to keep the columns wide and widen the page instead?
Thanks.

@hunair The problem with an empty document occurs because a document created from scratch is optimized for MS Word 2003, and its compatibility options cause the layout problems in your case. You can resolve this by optimizing the newly created document for a newer version of MS Word. For example, see the following code:

Document doc = new Document();
doc.CompatibilityOptions.OptimizeFor(MsWordVersion.Word2019);
doc.RemoveAllChildren();

Document doc1 = new Document(@"C:\Temp\HtmlFile1.html");
Document doc2 = new Document(@"C:\Temp\HtmlFile2.html");

// Uncomment this line if you do not need a separate page for the second document.
// doc2.FirstSection.PageSetup.SectionStart = SectionStart.Continuous;

doc.AppendDocument(doc1, ImportFormatMode.KeepSourceFormatting);
doc.AppendDocument(doc2, ImportFormatMode.KeepSourceFormatting);

doc.Save(@"C:\Temp\out.pdf");

Answering your second question: HTML documents do not have a page width, but MS Word documents do. You can try adjusting the page width or changing the page orientation to get a more accurate result when converting from HTML:

Document doc = new Document();
doc.CompatibilityOptions.OptimizeFor(MsWordVersion.Word2019);
doc.RemoveAllChildren();

Document doc1 = new Document(@"C:\Temp\HtmlFile1.html");
EnlargePage(doc1);
Document doc2 = new Document(@"C:\Temp\HtmlFile2.html");
EnlargePage(doc2);

// Uncomment this line if you do not need a separate page for the second document.
// doc2.FirstSection.PageSetup.SectionStart = SectionStart.Continuous;

doc.AppendDocument(doc1, ImportFormatMode.KeepSourceFormatting);
doc.AppendDocument(doc2, ImportFormatMode.KeepSourceFormatting);

doc.Save(@"C:\Temp\out.pdf");

private static void EnlargePage(Document doc)
{
    foreach (Section s in doc.Sections)
    {
        s.PageSetup.PaperSize = Aspose.Words.PaperSize.A3;
        s.PageSetup.Orientation = Orientation.Landscape;
    }
}

Thank you Alexey for all the help. The changes mentioned above have solved the issue with appending to an empty document.
The formatting of the HTML is still bothering us. Increasing the page size could be a solution if there were a way to dynamically check whether the HTML document being inserted needs to be widened, as we do not want the page to be too wide when the content isn’t.
Is there a way to dynamically increase the width of the document if the HTML is too wide? Also, I noticed that in a wide HTML document, content in some lines is still wrapped even where there is more than enough width. Is there a way to correct this via a property? And is there a way to set a minimum width at which words display without wrapping?
Also, is there a property to enable horizontal scrolling for wide documents?

Thanks

@hunair Unfortunately, there is no way to dynamically determine the required page width for an HTML page using Aspose.Words.
Also, since the HTML and MS Word formats are quite different, it is difficult and sometimes impossible to guarantee 100% fidelity upon conversion from one to the other. In most cases Aspose.Words tries to mimic MS Word behavior when importing and exporting HTML. You can learn more about the features supported by Aspose.Words upon loading and saving in HTML format from our documentation:
https://docs.aspose.com/words/net/load-in-the-html-html-xhtml-mhtml-format/
https://docs.aspose.com/words/net/save-in-html-xhtml-mhtml-formats/
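
Outside Aspose.Words, one rough workaround is to pre-scan the HTML source before loading it and widen the page only when a table has many columns. The following is a minimal sketch in plain Java, not an Aspose.Words feature: the class HtmlWidthHeuristic, its methods, and the column threshold are hypothetical, and the regex-based scan ignores nested tables, rows without a closing </tr> tag, and the actual width of the cell content.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Aspose.Words: a crude pre-scan of raw HTML.
final class HtmlWidthHeuristic {
    // One table row, from <tr ...> to the nearest </tr>.
    private static final Pattern ROW = Pattern.compile(
            "<tr\\b[^>]*>(.*?)</tr\\s*>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
    // Opening <td ...> or <th ...> tag; group 1 holds the attribute text.
    private static final Pattern CELL = Pattern.compile(
            "<t[dh]\\b([^>]*)>", Pattern.CASE_INSENSITIVE);
    // colspan attribute with an optionally quoted integer value.
    private static final Pattern COLSPAN = Pattern.compile(
            "colspan\\s*=\\s*[\"']?(\\d+)", Pattern.CASE_INSENSITIVE);

    /** Returns the largest column count found in any single table row. */
    static int maxColumns(String html) {
        int max = 0;
        Matcher row = ROW.matcher(html);
        while (row.find()) {
            int columns = 0;
            Matcher cell = CELL.matcher(row.group(1));
            while (cell.find()) {
                Matcher span = COLSPAN.matcher(cell.group(1));
                // A cell counts once unless colspan says it covers more columns.
                columns += span.find() ? Math.max(1, Integer.parseInt(span.group(1))) : 1;
            }
            max = Math.max(max, columns);
        }
        return max;
    }

    /** True when some row has more columns than the caller's (assumed) threshold. */
    static boolean needsWidePage(String html, int columnThreshold) {
        return maxColumns(html) > columnThreshold;
    }
}
```

For example, the result of needsWidePage(html, 8) could be used to decide whether a document is switched to a larger paper size or landscape orientation before it is appended. The threshold of 8 is an arbitrary value that would need tuning on real input.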

You can avoid text wrapping in table cells by auto-fitting the tables to fixed column widths:

NodeCollection tables = doc.GetChildNodes(NodeType.Table, true);
foreach (Table t in tables)
    t.AutoFit(AutoFitBehavior.FixedColumnWidths);

But in this case, since the table is too wide, it goes outside the page bounds.
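
To decide up front whether a fixed-width table will stay inside the page, you can compare the sum of its column widths with the page width minus the left and right margins. A minimal sketch in plain Java follows; the class PageFit and its methods are hypothetical helpers rather than Aspose.Words API, and the column widths and margins (all in points) are assumed to be known by the caller.

```java
// Hypothetical helper, not part of Aspose.Words: plain arithmetic on point values.
final class PageFit {
    private static final double POINTS_PER_MM = 72.0 / 25.4;

    /** Converts millimetres to points (1 pt = 1/72 inch). */
    static double mmToPoints(double mm) {
        return mm * POINTS_PER_MM;
    }

    /** Width left for content once the left and right margins are subtracted. */
    static double usableWidth(double pageWidth, double leftMargin, double rightMargin) {
        return pageWidth - leftMargin - rightMargin;
    }

    /** Sum of all column widths. */
    static double totalWidth(double[] columnWidths) {
        double sum = 0;
        for (double w : columnWidths) {
            sum += w;
        }
        return sum;
    }

    /** True when the columns fit between the margins of a page of the given width. */
    static boolean fits(double[] columnWidths, double pageWidth,
                        double leftMargin, double rightMargin) {
        return totalWidth(columnWidths) <= usableWidth(pageWidth, leftMargin, rightMargin);
    }
}
```

With landscape A3 (420 mm wide) and assumed 72-point margins, twelve 80-point columns (960 points) fit, while the same table does not fit on landscape A4 (297 mm wide).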

As for a minimum width at which words display without wrapping: no, there is no such property.

As for horizontal scrolling: no, there is no such property. Horizontal scrolling is enabled automatically by the PDF consumer application (Acrobat Reader, for example) if the page is too wide. In any case, the content must fit the page to be visible.