Issue after converting HTML to DOCX

We have an Aspose.Total license and are using Aspose to convert an HTML document to a Word document, but after conversion we have some issues with the Word document.

  1. Every line of the Word document ends up inside a text box item, so the document
    is hard to edit after conversion. If we press the Enter key, the text does not move
    to the next line; it is hidden inside the text box item.
  2. Tab functionality does not work correctly in the converted Word document.
  3. Because every line of text is in a text box item, Word shortcuts do not
    work properly.

@Sudrashya Could you please specify which Aspose product you use for document conversion? Also, please attach your input and output documents here for testing. We will check the issue and provide you with more information.

We are using the Aspose.Total product, and below is the code for conversion:

using System;
using System.IO;
using Aspose.Html.Converters;

try
{
    string path = Path.GetFullPath(Path.Combine(Directory.GetCurrentDirectory(), @"..\..\..\"));
    string final = File.ReadAllText(Path.Combine(path, "Input", "Final.html"));

    // Initialize DocSaveOptions
    var options = new Aspose.Html.Saving.DocSaveOptions();
    // Convert the HTML content to DOCX
    Converter.ConvertHTML(final, path, options, Path.Combine(path, "Output", "output.docx"));
}
catch (Exception ex)
{
    // Report the failure instead of silently swallowing it
    Console.Error.WriteLine(ex);
}

Test.zip (5.2 KB)

You can also see in the earlier image that I selected the text box shape on every line.

@Sudrashya You are using Aspose.HTML to convert HTML to a Word document. My colleagues from the Aspose.HTML team will answer you shortly.
Meanwhile, you can try using Aspose.Words for the HTML to Word conversion:

using Aspose.Words;
using Aspose.Words.Tables;

Aspose.Words.Document doc = new Aspose.Words.Document(@"C:\Temp\in.html");

// Tables in the source HTML are too wide to fit the page.
// Fix this by auto-fitting each table to the window.
foreach (Aspose.Words.Tables.Table t in doc.GetChildNodes(NodeType.Table, true))
    t.AutoFit(AutoFitBehavior.AutoFitToWindow);

doc.Save(@"C:\Temp\out.docx");

This code produces the following output from your HTML: out.docx (8.7 KB)
It is more or less the same as the DOCX produced by MS Word from your HTML: ms.docx (14.2 KB)

ms.docx is cut off on the right side, and “Page 1 of 1” appears on the left, while in the HTML it is on the right.

@Sudrashya ms.docx was produced by MS Word. In most cases, Aspose.Words mimics MS Word when converting HTML to Word.

@alexey.noskov: I tried your solution, but it does not work for HTML elements placed on the right side. It moves all the “float:right” elements to the left side of the document.
I have one image and “Page 1 of 1” on the right side, but after conversion those items moved to the left side. Please find attached the input file with the image. Please try it and let me know.

Test.zip (4.7 KB)

@alexey.noskov: Please also refer to out.docx. The last text, “Page 1 of 1”, appears on the left side, whereas in the HTML it is on the right side.

@Sudrashya This is the expected behavior of Aspose.Words. Note that Aspose.Words is designed to work with MS Word documents. There is no analog of DIV elements in MS Word documents, so the DIVs are converted to paragraphs in the Aspose.Words DOM. In this case, Aspose.Words behaves the same way MS Word does.
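
Since the DIVs become ordinary paragraphs, one possible workaround is to restore the alignment after loading. The following is only a minimal sketch, not an official recommendation. It assumes the right-floated content can be recognized by its text (“Page 1 of 1”) or by containing an image, which is a hypothetical heuristic for this particular HTML:

```csharp
using Aspose.Words;

Document doc = new Document(@"C:\Temp\in.html");

foreach (Paragraph p in doc.GetChildNodes(NodeType.Paragraph, true))
{
    // Assumption: these two checks identify the content that was float:right in the HTML.
    bool isPageLabel = p.GetText().Contains("Page 1 of 1");
    bool hasShape = p.GetChildNodes(NodeType.Shape, true).Count > 0;

    if (isPageLabel || hasShape)
        p.ParagraphFormat.Alignment = ParagraphAlignment.Right;
}

doc.Save(@"C:\Temp\out.docx");
```

This only changes paragraph alignment; it does not reproduce CSS float behavior such as text wrapping around the image.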

@alexey.noskov: So we can’t convert the HTML document to Word with the same styling?
Is there any other product that can achieve this? We have a license for Aspose.Total.

@Sudrashya Originally, you used Aspose.HTML for the HTML to Word conversion. My colleagues from the Aspose.HTML team will answer you shortly. I am from the Aspose.Words team.

@alexey.noskov: I can use Aspose.Words as well if you provide me with a solution, since I have an Aspose.Total license. So please just confirm that with Aspose.Words it is not possible to convert an HTML document to Word with the same styling “as we see in HTML”.
I am attaching the HTML input file again for your reference.

Test.zip (4.7 KB)

@Sudrashya As I mentioned, Aspose.Words is designed to work with MS Word documents, and when an HTML document is loaded, it is converted to the Aspose.Words DOM. The significant differences between the HTML and MS Word document models do not allow 100% fidelity after conversion. So, I am afraid, it is not possible with Aspose.Words to get an MS Word document that looks exactly the same as the HTML opened in a browser.

@Sudrashya

We used the code snippet below with Aspose.PDF for .NET to convert your HTML into DOCX. Please check the attached output file and let us know if you still find issues in it:

Document doc = new Document(dataDir + "input.html", new HtmlLoadOptions());

DocSaveOptions saveOptions = new DocSaveOptions()
{
    Format = DocSaveOptions.DocFormat.DocX,
    Mode = DocSaveOptions.RecognitionMode.EnhancedFlow
};

doc.Save(dataDir + "output.docx", saveOptions);

output.docx (63.8 KB)

@asad.ali: No, it is not as expected. As you can see, the text alignment does not match the HTML document. Please try with the attached HTML. You will also see bad alignment if the document contains an image.
Test.zip (4.7 KB)

@Sudrashya

We have now tested the case with Aspose.HTML for .NET 23.1 and noticed an issue with image alignment similar to the one you mentioned. We have opened the following new ticket(s) in our internal issue tracking system and will deliver their fixes according to the terms mentioned in Free Support Policies.

Issue ID(s): **HTMLNET-4246**

We will let you know once the ticket is resolved. Please be patient and spare us some time.

We are sorry for the inconvenience.

@alexey.noskov: We are now using Aspose.Words to convert HTML to a Word document.
When Aspose.Words converts HTML to a Word document, it leaves some space at the top of the document for the header section.
Is it possible to start the content from the top of the page, or is there any way to remove the header section so that the content starts at the top of the document?
We are using the code below:

Aspose.Words.Document doc = new Aspose.Words.Document(@"C:\Temp\in.html");

// Tables in the source HTML are too wide to fit the page.
// Fix this by auto-fitting each table to the window.
foreach (Aspose.Words.Tables.Table t in doc.GetChildNodes(NodeType.Table, true))
    t.AutoFit(AutoFitBehavior.AutoFitToWindow);

doc.Save(@"C:\Temp\out.docx");

@Sudrashya You can reset page margins to zero:

Aspose.Words.Document doc = new Aspose.Words.Document(@"C:\Temp\in.html");

// Reset section margins.
foreach (Aspose.Words.Section s in doc.Sections)
{
    s.PageSetup.TopMargin = 0;
    s.PageSetup.BottomMargin = 0;
    s.PageSetup.LeftMargin = 0;
    s.PageSetup.RightMargin = 0;
}

// Tables in the source HTML are too wide to fit the page.
// Fix this by auto-fitting each table to the window.
foreach (Aspose.Words.Tables.Table t in doc.GetChildNodes(NodeType.Table, true))
    t.AutoFit(AutoFitBehavior.AutoFitToWindow);

doc.Save(@"C:\Temp\out.docx");

@alexey.noskov Thanks for the response. I have one issue with table conversion from HTML to Word using Aspose.Words: the table header is not aligned consistently with the data in the table. We can’t change AutoFitBehavior.FixedColumnWidths. We are using the code below for conversion:

Aspose.Words.Document document = new Aspose.Words.Document(new MemoryStream(Encoding.UTF8.GetBytes(html)));
foreach (Aspose.Words.Section s in document.Sections)
{
    s.PageSetup.TopMargin = 10;
    s.PageSetup.BottomMargin = 10;
    s.PageSetup.LeftMargin = 50;
    s.PageSetup.RightMargin = 0;
}
// Tables in the source HTML are too wide to fit the page.
// Keep fixed column widths when applying auto-fit to each table.
foreach (Aspose.Words.Tables.Table t in document.GetChildNodes(NodeType.Table, true))
{
    t.AutoFit(AutoFitBehavior.FixedColumnWidths);
}
document.Save(outputFilePath);

Please let me know what to do. The HTML is attached.

template.zip (597 Bytes)