Word to HTML file conversion issue

Hi Team,

I’m using the Aspose.Words DLL to convert a file from DOCX to HTML, but the resulting HTML file is not rendered properly. I’m attaching the input file (a Word DOCX file), the HTML output file, and the source code that we’re using. mlju2021_02128.docx (104.5 KB)

@sandy.wood Unfortunately, I do not see any attached document. Could you please attach the problematic documents here for testing? We will check the issue and provide you with more information.

mlju2021_02128.docx (104.5 KB)

I’ve attached the DOCX file; please have a look.

@sandy.wood Thank you for the additional information. I have checked your document and noticed a problem with the table on the 12th page. It is actually not a table, but a shape with text on top of it. By the way, MS Word also cannot properly convert it to HTML.
It looks like your document is the result of recognition from an image or some other fixed-page format, and the recognition software was not able to handle the mentioned table properly. Could you please let us know what the original document format was?
If your goal in converting to HTML is to display (not edit) the document in a browser, you can consider using the HtmlFixed format instead. In this case, the layout of your document will be preserved:

Document doc = new Document(@"C:\Temp\in.docx");
doc.Save(@"C:\temp\out.html", SaveFormat.HtmlFixed);

Thanks, @alexey.noskov. I’ve used your code and it fixes some issues in the HTML file, but it adds space between the text characters (see the attached screenshot). I’ve attached the DOCX file as well. WordToHTML.PNG (6.4 KB)
mlju2021_02126.docx (84.3 KB)

@sandy.wood As far as I can see, the whitespace is already present in the source DOCX document. Please see the attached screenshot:

It looks like the recognition software mistakenly put whitespace there.

Yes, I’ve seen the document, but is there any way to remove the whitespace while converting the file from DOCX to HTML?

@sandy.wood You can try using the Range.Replace method to find and remove such redundant whitespace. However, I am afraid it will not be a trivial task to determine which whitespace is redundant.
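As a rough illustration of the Range.Replace suggestion, here is a minimal sketch. It assumes a recent Aspose.Words for .NET version with the Range.Replace(Regex, string) overload, and it assumes that any run of two or more consecutive spaces is redundant. That assumption will not hold for every document, and the sketch only helps where the gaps are real space characters. The class name CollapseSpaces is hypothetical.

```csharp
using System.Text.RegularExpressions;
using Aspose.Words;

class CollapseSpaces
{
    static void Main()
    {
        // Paths reuse the ones from the earlier snippet in this thread.
        Document doc = new Document(@"C:\Temp\in.docx");

        // Assumption: two or more consecutive spaces (regular or non-breaking) are redundant.
        Regex redundantSpaces = new Regex(@"[ \u00A0]{2,}");

        // Replace each such sequence with a single space.
        // Range.Replace returns the number of replacements made.
        int replaced = doc.Range.Replace(redundantSpaces, " ");
        System.Console.WriteLine("Replacements: " + replaced);

        doc.Save(@"C:\Temp\out.html", SaveFormat.HtmlFixed);
    }
}
```

Checking the replacement count on a copy of the document first should show whether the pattern is too aggressive.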

@alexey.noskov, I’ve looked at the HTML, and it uses the letter-spacing style, which I need to remove:
style="font-size:14.5pt; letter-spacing:2.37pt; left:122.46pt; top:32.64pt;"

Is there any way to remove it from the style while converting the file from DOCX to HTML?

@sandy.wood The letter-spacing is coming from the document, so you can remove this attribute by resetting the corresponding property (Font.Spacing) on the runs. For example, see the following code:

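// Requires: using System.Collections.Generic; using System.Linq; using Aspose.Words;
Document doc = new Document(@"C:\Temp\in.docx");

// Collect all runs whose character spacing is non-zero.
List<Node> runs = doc.GetChildNodes(NodeType.Run, true).Where(r => ((Run)r).Font.Spacing != 0).ToList();
// Reset the spacing so that no letter-spacing comes from these runs.
foreach (Run r in runs)
    r.Font.Spacing = 0;

doc.Save(@"C:\Temp\out.html", SaveFormat.HtmlFixed);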
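If changing the document itself is not an option, another possibility is to strip the letter-spacing declarations from the generated HTML afterwards. The following is only a sketch using the standard .NET Regex class, not an Aspose.Words feature. The class and method names (HtmlStyleCleaner, StripLetterSpacing) are hypothetical, and it assumes the declarations appear as plain letter-spacing:<value> text like the style attribute quoted above. It would also remove matching text from the page content, and its effect on the fixed layout has not been checked.

```csharp
using System.Text.RegularExpressions;

public static class HtmlStyleCleaner
{
    // Matches one "letter-spacing: <value>" declaration, plus its trailing
    // semicolon and any whitespace after it. The value stops at ';' or a quote.
    private static readonly Regex LetterSpacing = new Regex(
        @"letter-spacing\s*:\s*[^;""']+;?\s*",
        RegexOptions.IgnoreCase);

    // Returns the HTML with every letter-spacing declaration removed.
    public static string StripLetterSpacing(string html)
    {
        return LetterSpacing.Replace(html, string.Empty);
    }
}
```

It could be applied to the saved file with something like File.WriteAllText(path, HtmlStyleCleaner.StripLetterSpacing(File.ReadAllText(path))), where path points to the generated HTML file (for example C:\Temp\out.html).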