Free Support Forum - aspose.com

Conversion from pdf to word and then to html issue

I initially convert a document from PDF to Word (any version) and then, in a later, separate process, convert the Word document to HTML. The Word document displays exactly as expected; however, the resulting HTML loses much of the structure/formatting of the document. In particular, all text lines start at the left margin.
If the PDF document is converted directly to HTML, there is no issue.
I am using Aspose.Pdf 16.10 and Aspose.Words 16.8.

What can I do?

Hi Richard,

Thanks for contacting support.

I have performed the PDF to DOCX conversion using Aspose.Pdf for .NET 17.1.0 and, as you stated above, the output is generated properly. We are now looking further into the DOCX to HTML conversion and will keep you updated with our findings. We are sorry for this inconvenience.

Hi Richard,

Thanks for your inquiry. In this case, Aspose.Words mimics the behavior of MS Word 2016. I converted the DOCX document generated by Aspose.Pdf (975_out.docx) to HTML using MS Word 2016 and have attached it here for your reference. Aspose.Words 17.1.0 produces HTML output similar to that of MS Word 2016, so this appears to be expected behavior. If we can help you with anything else, please feel free to ask.

Best regards,

The two documents that you have produced which you claim to be similar are completely different. There is absolutely no formatting in the html document. Please don’t bother replying unless you are going to help!!

Hi Richard,

Please accept my apologies for the inconvenience.

In your case, we suggest setting DocSaveOptions.Mode to RecognitionMode.Flow to get the desired output. Please see the following C# code example.

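using System.IO;

// MyDir is assumed to hold the working folder path; it is not defined in this snippet.
Aspose.Pdf.Document pdf = new Aspose.Pdf.Document(MyDir + "975.pdf");

// Use flow recognition mode for the PDF to Word conversion.
Aspose.Pdf.DocSaveOptions options = new Aspose.Pdf.DocSaveOptions();
options.Mode = Aspose.Pdf.DocSaveOptions.RecognitionMode.Flow;

// Save the intermediate Word document to disk.
pdf.Save(MyDir + "Word.doc", options);

// Save the same Word document to a stream and rewind it for Aspose.Words.
MemoryStream stream = new MemoryStream();
pdf.Save(stream, options);
stream.Position = 0;

// Load the Word document from the stream and save it as HTML.
// SaveFormat is fully qualified because Aspose.Pdf also defines a SaveFormat type.
Aspose.Words.Document doc = new Aspose.Words.Document(stream);
doc.Save(MyDir + "Output.html", Aspose.Words.SaveFormat.Html);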

There are two issues in the Word output generated by Aspose.Pdf. We have logged both in our issue tracking system; the details are as follows.

  • The position of the cell text in the output Word document differs from the input PDF. Please see the attached image (position of text.png) for details. This problem is logged as PDFNET-42186.
  • The font name of the text in the output Word document differs from the input PDF. This issue is logged as PDFNET-42187.

In the final HTML output generated by Aspose.Words, the font name is also incorrect. We have logged this issue as WORDSNET-14794 in our issue tracking system.

You will be notified via this forum thread once these issues are resolved. We are very sorry for the inconvenience.