Hi there,
We’re using Aspose.Words to save an HTML document as a DOCX file and import its contents into another DOCX file using:
var workTable = (Table)workDoc.GetChild(NodeType.Table, 0, true);
foreach (var row in workTable.Rows.OfType<Row>())
    targetTable.Rows.Add(targetTableDoc.ImportNode(row, true, ImportFormatMode.KeepSourceFormatting));
/* tried that one as well which didn't make any difference either */
var nodeImporter = new NodeImporter(workDoc, doc, ImportFormatMode.UseDestinationStyles, new ImportFormatOptions { KeepSourceNumbering = true });
originalTable.Rows.Add(nodeImporter.ImportNode(r, isImportChildren: true));
Since bumping Aspose.Words to anything later than 18.1.0 (e.g. 22.9, which we are targeting right now), a lot of line breaks or empty paragraphs that contain just a “/a” character get lost.
I am attaching a minimal sample here to reproduce this issue.
AsposeHtmlImportReproSet.zip (2.8 KB)
@SimonSchwendele
It seems that you are experiencing issues with line breaks and empty paragraphs being lost when importing HTML into a DOCX file using Aspose.Words, particularly after upgrading from version 18.1.0 to a later version like 22.9.
To address this issue, you might want to consider using the HtmlInsertOptions enumeration, which provides various options for how HTML content is inserted into a document. Specifically, the PreserveBlocks option can help maintain the properties of block-level elements, which may assist in preserving line breaks and empty paragraphs during the import process.
Here’s an example of how you can use HtmlInsertOptions when inserting HTML:
// Create a DocumentBuilder instance
DocumentBuilder builder = new DocumentBuilder();
// Define your HTML content
string htmlContent = "<p>First paragraph</p><p></p><p>Second paragraph</p>";
// Use HtmlInsertOptions to preserve block-level elements
HtmlInsertOptions insertOptions = HtmlInsertOptions.PreserveBlocks;
// Insert the HTML content with the specified options
builder.InsertHtml(htmlContent, insertOptions);
// Save the document
builder.Document.Save("output.docx");
This approach may help in retaining the formatting and structure of your HTML content, including line breaks and empty paragraphs. If the problem persists, consider checking the specific HTML content you are importing, as certain HTML structures may behave differently based on the version of Aspose.Words you are using.
If you continue to face issues, please provide a minimal reproducible example, as it can help in diagnosing the problem more effectively.
Okay it appears this does the trick.
Please close this
EDIT:
Actually it doesn’t.
But for some weird reason it works when I call anything after this (like document.Range.Replace("WeirdPaddingThing", string.Empty);).
Okay, no idea what’s going on here.
HtmlInsertOptions.PreserveBlocks doesn’t help with the formatting.
It still looks incorrectly formatted in the resulting docx.
The weird part, though, is that after
foreach (var row in workTable.Rows.OfType<Row>())
    targetTable.Rows.Add(targetTableDoc.ImportNode(row, true, ImportFormatMode.UseDestinationStyles));
the content now looks correct in the final docx file.
@SimonSchwendele Please note that Aspose.Words is designed to work with MS Word documents. The HTML and MS Word document object models are quite different, and it is not always possible to achieve 100% fidelity when converting from one model to the other. In most cases Aspose.Words mimics MS Word behavior when working with HTML.
I tried converting your HTML to DOCX using MS Word, and the output is similar to the output produced by Aspose.Words. Here is the DOCX document produced by MS Word: ms.docx (14.7 KB)
As you can see it also trims soft line breaks at the end of paragraphs. If you need to preserve empty paragraphs, please try using HTML like this <p> </p>
Thanks for your reply.
My point is that it didn’t use to do this.
This behavior only started upon updating to 18.2.0.
We’ve used this implementation for years without any issues, just using Aspose.Words 15.6.
Only upon updating to 22.9 did things suddenly start to look different.
@SimonSchwendele It looks like the behavior was changed to mimic MS Word more closely. As a workaround, if you need to preserve line breaks at the end of a paragraph, you can put a special span at the end of the paragraph, like this:
<p style="margin-top:0pt; margin-bottom:8pt">
<span>Test</span><br /><br /><span style="-aw-import:ignore"> </span>
</p>
In this case both soft line breaks will be preserved after loading the HTML into the Aspose.Words DOM.
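If the HTML is generated elsewhere and cannot easily be edited by hand, the marker span could be added by preprocessing the HTML string before it is loaded. The following is only a minimal sketch under that assumption: the class and method names (TrailingBreakPreserver, AppendMarker) are hypothetical and not part of Aspose.Words, the span text is copied from the snippet above, and a regular expression is a rough stand-in for real HTML parsing, so it will not handle every markup variation.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of Aspose.Words.
public static class TrailingBreakPreserver
{
    // Marker span copied from the workaround above.
    private const string Marker = "<span style=\"-aw-import:ignore\"> </span>";

    // One or more <br>, <br/> or <br /> tags sitting directly before </p>.
    private static readonly Regex TrailingBreaks = new Regex(
        @"((?:<br\s*/?>\s*)+)(</p>)",
        RegexOptions.IgnoreCase);

    // Inserts the marker span between the trailing breaks and the closing </p>.
    // Paragraphs that do not end in a break are left unchanged.
    public static string AppendMarker(string html)
    {
        return TrailingBreaks.Replace(
            html,
            m => m.Groups[1].Value + Marker + m.Groups[2].Value);
    }
}
```

The returned string would then be passed to whatever call currently loads or inserts the HTML. Running the helper twice does not add a second marker, because after the first pass the breaks are followed by the span rather than by </p>.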
Okay, no idea why it worked for a few tests, but now it doesn’t work again.
The msword.docx you uploaded is actually correct and is what I want.
The biggest problem in my case is that the paragraphs start looking weird when integrated using importHtml.
This is not the case with the msword.docx you provided.
And this is the result when I convert it using Aspose.Words.
@SimonSchwendele
We have opened the following new ticket(s) in our internal issue tracking system and will deliver their fixes according to the terms mentioned in Free Support Policies.
Issue ID(s): WORDSNET-27752
You can obtain Paid Support Services if you need support on a priority basis, along with the direct access to our Paid Support management team.