Doc/pdf to html -- quality and size issue

We evaluated the Aspose.Words and Aspose.Pdf JARs for converting DOC/DOCX/PDF files to HTML.

We found the quality of the HTML produced by the conversion unsatisfactory.

We then tried converting DOC/DOCX to PDF and then to HTML. This improves the quality a little, but the output size increases to around 8-10 times the original.

Please find attached the documents we used for this POC.





We are embedding all resources, such as images and CSS, in the HTML.

If possible, can we separate out common resources such as CSS, which will be the same for all converted documents?

Also, is there any control over font units? In some converted CVs the font size is in px and in others it is in pt. We would like all documents to be converted to HTML using a single font-size unit.





Please let us know whether the results will be better with the paid version, as we are planning to buy it. Before that, we need to verify the quality and size of the converted CVs.

Please find below the code we are using for the conversion:

Document doc = new Document(inputFile);
// Instantiate HTML Save options object
HtmlSaveOptions newOptions = new HtmlSaveOptions();

// Enable option to embed all resources inside the HTML
newOptions.PartsEmbeddingMode = HtmlSaveOptions.PartsEmbeddingModes.EmbedAllIntoHtml;

// This is just optimization for IE and can be omitted
newOptions.LettersPositioningMethod = LettersPositioningMethods.UseEmUnitsAndCompensationOfRoundingErrorsInCss;
newOptions.RasterImagesSavingMode = HtmlSaveOptions.RasterImagesSavingModes.AsEmbeddedPartsOfPngPageBackground;
newOptions.FontSavingMode = HtmlSaveOptions.FontSavingModes.SaveInAllFormats;
// Output file path
doc.save(outputFile, newOptions);

Hi Madhur,

We are testing your documents and will update you soon.

Best Regards,

Hi Madhur,

I was not able to reproduce this issue at my end using the latest version of Aspose.Words. The HTML output looks fine.

Can you please share which version of Aspose.Words you are using? Please also tell us whether you are using the .NET version or the Java version, because you mentioned JAR files in your post but your code is .NET code.

Aspose.Words allows you to set the CSS file name, so you can use the same CSS file for all HTML files. You can also use the CssSavingCallback and FontSavingCallback events to update CSS or font settings as per your requirement.
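
For the question about mixed px and pt font sizes, one option that does not depend on any Aspose API is to post-process the generated HTML. The following is only a minimal sketch in plain Java, assuming the font sizes appear as `font-size: Npx` declarations in the output. The class name `FontUnitNormalizer` and the method `pxToPt` are hypothetical names chosen for illustration. It relies on the CSS definition 1in = 96px = 72pt, so 1px = 0.75pt.

```java
import java.math.BigDecimal;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of any Aspose API.
final class FontUnitNormalizer {
    // Matches declarations such as "font-size: 16px" or "font-size:13.5px".
    private static final Pattern FONT_SIZE_PX = Pattern.compile(
            "(font-size\\s*:\\s*)(\\d+(?:\\.\\d+)?)px", Pattern.CASE_INSENSITIVE);

    // CSS defines 1in = 96px = 72pt, so 1px = 0.75pt.
    private static final BigDecimal PT_PER_PX = new BigDecimal("0.75");

    /** Rewrites every px font size in the given HTML or CSS text to the equivalent pt value. */
    static String pxToPt(String text) {
        Matcher m = FONT_SIZE_PX.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String pt = new BigDecimal(m.group(2)).multiply(PT_PER_PX)
                    .stripTrailingZeros().toPlainString();
            m.appendReplacement(out, Matcher.quoteReplacement(m.group(1) + pt + "pt"));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // prints: p { font-size: 12pt; margin: 4px }
        System.out.println(pxToPt("p { font-size: 16px; margin: 4px }"));
    }
}
```

Only `font-size` declarations are touched; other px values such as margins are left alone, since converting layout units could change how the page renders.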

Best Regards,

Thanks for the reply.

We are using the following JARs: aspose-pdf-9.7.0-jdk14.jar and aspose-words-14.11.0-jdk16.jar.

DOC to HTML code:
Document doc = new Document(inputFile);
HtmlSaveOptions newOptions = new HtmlSaveOptions();
newOptions.setExportImagesAsBase64(true);
doc.save(outputFile, newOptions);

PDF to HTML code:
Document doc = new Document(inputFile);
HtmlSaveOptions newOptions = new HtmlSaveOptions();
newOptions.PartsEmbeddingMode = HtmlSaveOptions.PartsEmbeddingModes.EmbedAllIntoHtml;
newOptions.LettersPositioningMethod = LettersPositioningMethods.UseEmUnitsAndCompensationOfRoundingErrorsInCss;
newOptions.RasterImagesSavingMode = HtmlSaveOptions.RasterImagesSavingModes.AsEmbeddedPartsOfPngPageBackground;
newOptions.FontSavingMode = HtmlSaveOptions.FontSavingModes.SaveInAllFormats;
doc.save(outputFile, newOptions);



I have attached a sample document.
The attachment contains the following folders:
doc: the original documents (168 KB)
html_from_doc: HTML created from the doc (184 KB)
pdf_from_doc: PDF created from the doc (268 KB)
html_from_pdf: HTML created from the PDF above (1.1 MB)



When we convert DOC to HTML, the size is fine but the quality is not good.
When we convert DOC to PDF and then PDF to HTML, the quality improves but the size becomes far too large. These are just one or two sample documents; we face similar issues with many documents.

Our requirement is to convert DOC/DOCX to HTML without compromising on quality and with a reasonable size.

We have also observed some common font types and CSS in the converted HTML. Is there a way to move such common parts, like the CSS, into a separate CSS file and then include it in all converted HTML files?

Example: if I have 4 documents A, B, C and D, and they share some common CSS, I could save it as sample.css and use it in all 4 HTML files created from them. This would reduce the size of the converted HTML files and give us some control over the look and feel.


You mentioned the following:
"Aspose.Words allows you to set the CSS file name, so you can use the same CSS file for all HTML files. You can also use the CssSavingCallback and FontSavingCallback events to update CSS or font settings as per your requirement."

Could you please provide an example of this?

Thanks,
Madhur




Hi Madhur,

In your case, we recommend using the HTML Fixed format, as shown in the following code. This will give you better quality and size compared to Aspose.Pdf.

Document doc = new Document("sample.doc");

// Save in fixed-layout HTML format
HtmlFixedSaveOptions newOptions = new HtmlFixedSaveOptions();
newOptions.setSaveFormat(SaveFormat.HTML_FIXED);

// Embed images, fonts, CSS and SVG in the HTML file itself
newOptions.setExportEmbeddedImages(true);
newOptions.setExportEmbeddedFonts(true);
newOptions.setExportEmbeddedCss(true);
newOptions.setExportEmbeddedSvg(true);

doc.save("Sample.html", newOptions);

As far as using the same CSS for multiple files is concerned, Aspose.Words does not compare CSS files to determine whether the same CSS can be used for all files. It is up to you to set the CSS file name and to check whether a particular CSS file can be used for multiple documents.
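
Since that comparison is left to the caller, it could be approximated outside the library. The sketch below is plain Java and makes several assumptions: the stylesheet text of each converted document has already been extracted as a string, the rules are formatted identically across documents, and there are no nested blocks such as `@media`. The class `SharedCss` and its methods are hypothetical names, not part of Aspose.Words.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper for illustration; not part of any Aspose API.
final class SharedCss {
    /** Splits a stylesheet into whitespace-normalized rules. Naive: does not handle nested blocks. */
    static Set<String> rules(String css) {
        Set<String> result = new LinkedHashSet<>();
        for (String chunk : css.split("}")) {
            String rule = chunk.trim().replaceAll("\\s+", " ");
            if (!rule.isEmpty()) {
                result.add(rule + " }");
            }
        }
        return result;
    }

    /** Returns the rules present in every stylesheet, in the order of the first one. */
    static Set<String> commonRules(List<String> stylesheets) {
        if (stylesheets.isEmpty()) {
            return new LinkedHashSet<>();
        }
        Set<String> common = rules(stylesheets.get(0));
        for (String css : stylesheets.subList(1, stylesheets.size())) {
            common.retainAll(rules(css));
        }
        return common;
    }

    public static void main(String[] args) {
        List<String> sheets = Arrays.asList(
                "p { margin: 0 } .a { color: red }",
                "p { margin: 0 }\n.b { color: blue }");
        // Print each rule that appears in both stylesheets.
        commonRules(sheets).forEach(System.out::println);
    }
}
```

The rules returned by `commonRules` are candidates for a shared file such as sample.css. Whether removing them from the individual documents preserves rendering depends on rule order and specificity, so the result should be checked visually.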

Best Regards,

Thanks for the reply.

I have used the code suggested above. It improves the quality compared with my earlier DOC-to-HTML conversion, but the result is still not as good as when I convert from DOC to PDF and then to HTML.

Size is also still a big issue: for many of my documents the size increases 7-10 times.

I have attached 2 such sample CVs.

Thanks,
Madhur

Hi Madhur,

Thanks for sharing the details and sorry for the delayed response.

I tested converting the DOCX file to PDF: the 20.6 KB source file (ThejaGuthi%5b0_0%5d49.docx) is converted to an 80.2 KB PDF document. I then converted the PDF file to HTML using Aspose.Pdf for .NET. The resulting HTML, together with the folder containing the resource files (WOFF, SVG, CSS etc.), is 282 KB, and this is because of the WOFF font files used in the source PDF file. I also observed that the contents of the HTML generated from the PDF are garbled. To have this corrected, I have logged the problem as PDFNEWNET-38064 in our issue tracking system. We will look further into the details of this problem and keep you updated on the status of the correction. Please be patient and allow us a little time. We are sorry for the inconvenience.

[C#]

// Convert the DOCX file to PDF with Aspose.Words
Aspose.Words.Document docx = new Aspose.Words.Document("c:/pdftest/ThejaGuthi%5b0_0%5d49.docx");

// Alternative: save to a stream instead of a file
MemoryStream ms = new MemoryStream();
// docx.Save(ms, Aspose.Words.SaveFormat.Pdf);

docx.Save("c:/pdftest/ThejaGuthi%5b0_0%5d49.pdf", Aspose.Words.SaveFormat.Pdf);

// Load the PDF with Aspose.Pdf and convert it to HTML
Aspose.Pdf.Document pdf = new Aspose.Pdf.Document("c:/pdftest/ThejaGuthi%5b0_0%5d49.pdf");
Console.WriteLine(pdf.Pages.Count);
pdf.Save("c:/pdftest/ThejaGuthi%5b0_0%5d49.html", Aspose.Pdf.SaveFormat.Html);
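
To check which resource type accounts for the size of an output folder like the one described above, the bytes can be totalled per file extension. This is only a rough sketch in plain Java (the language of the JARs used earlier in the thread). The class `OutputSizeReport` and the method `bytesByExtension` are hypothetical names, and the directory to scan is passed as the first command-line argument.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Stream;

// Hypothetical helper for illustration; not part of any Aspose API.
final class OutputSizeReport {
    /** Sums file sizes under dir (recursively), grouped by lower-cased file extension. */
    static Map<String, Long> bytesByExtension(Path dir) throws IOException {
        Map<String, Long> totals = new TreeMap<>();
        try (Stream<Path> files = Files.walk(dir)) {
            files.filter(Files::isRegularFile).forEach(file -> {
                String name = file.getFileName().toString();
                int dot = name.lastIndexOf('.');
                String ext = dot < 0 ? "(none)" : name.substring(dot + 1).toLowerCase(Locale.ROOT);
                try {
                    totals.merge(ext, Files.size(file), Long::sum);
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            });
        }
        return totals;
    }

    public static void main(String[] args) throws IOException {
        // Scan the directory given as the first argument, or the current directory.
        Path dir = Paths.get(args.length > 0 ? args[0] : ".");
        bytesByExtension(dir).forEach((ext, bytes) ->
                System.out.println(ext + ": " + bytes + " bytes"));
    }
}
```

If the `woff` total dominates, that would be consistent with the observation that the embedded font files account for most of the size.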