I’m going from PDF to DOCX to HTML.
The DOCX files come from another service that converts them from PDF. I am using Aspose.Words to go from DOCX to HTML.
I need HTML because the output will be displayed in a WYSIWYG editor that supports HTML. I also need the styling to be as close to the original PDF as possible, including bullets, italics, bold, tables, etc.
I need to go from PDF to DOCX and then to HTML because I need to process the output according to the HTML heading tags, viz. H1, H2, H3, H4, H5, and H6.
When I save the file as HTML from MS Office, the heading tags are present as-is in the exported HTML file.
But when I generate HTML from the same DOCX with Aspose.Words, it does not output the heading tags. Instead, it creates a paragraph and applies CSS to match the style of the DOCX document.
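To make the goal concrete, this is the kind of mapping I would otherwise have to do by hand on the exported HTML. It is only a sketch: the class names (Heading1 to Heading6) are an assumption about what the exported paragraphs look like, and promote_headings is a hypothetical helper of my own, not part of Aspose.Words.

```python
import re

# Assumption: heading paragraphs are exported as <p class="Heading1">...</p>
# (class names Heading1..Heading6, optionally with a space before the digit).
_HEADING_P = re.compile(
    r'<p\b([^>]*\bclass="Heading\s?([1-6])"[^>]*)>(.*?)</p>',
    re.IGNORECASE | re.DOTALL,
)


def promote_headings(html: str) -> str:
    """Rewrite <p class="HeadingN"> elements as <hN>, keeping their attributes."""
    return _HEADING_P.sub(r"<h\g<2>\g<1>>\g<3></h\g<2>>", html)


if __name__ == "__main__":
    sample = '<p class="Heading1">Intro</p><p>Body text</p>'
    print(promote_headings(sample))
```

A regular expression like this is fragile on real HTML, which is why I would prefer the heading tags to come straight out of the export.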
I went through the forums and the Aspose.Words documentation, and from what I can tell this has something to do with the built-in styles of the DOCX document.
What I am doing is removing the heading styles from the current document, loading the styles from another document, and saving the document.
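Python code:
# Check whether "Heading 1" is a built-in style
doc.styles.get_by_name("Heading 1").built_in
# Remove the "Heading 1" style from the document
doc.styles.get_by_name("Heading 1").remove()
# Copy styles between the two documents, then save
temp.copy_styles_from_template(doc)
doc.automatically_update_styles = True
doc.save(outputFile)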
What am I doing wrong that causes the spaces between words to disappear when the document is viewed in MS Office?
How do I set the current style of the document as the built-in style using Aspose.Words?
sample-document-converted-from-word-using-word-save-as-pdf.docx (11.0 KB)
I am adding styles back to the document based on the bookmarks in the document, which is not reliable when a subheading has no bookmark or the bookmark is not formatted correctly.
Is there a way to set the current style of the document as the built-in style, so that I do not need to remove the styles before exporting the document?
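What I have in mind is something like the sketch below. It assumes the converted paragraphs keep a style whose name looks like "Heading 1" to "Heading 6", and that assigning paragraph_format.style_identifier is enough to attach the built-in style. I have not confirmed either assumption against the attached document. heading_level and apply_builtin_headings are hypothetical helper names of my own.

```python
import re

_HEADING_NAME = re.compile(r"^\s*heading\s*([1-6])\s*$", re.IGNORECASE)


def heading_level(style_name):
    """Return 1..6 if the style name looks like 'Heading N', else None."""
    match = _HEADING_NAME.match(style_name or "")
    return int(match.group(1)) if match else None


def apply_builtin_headings(input_path, output_path):
    """Point heading-like paragraphs at the built-in Heading 1..6 styles.

    Returns the number of paragraphs that were changed.
    """
    # Imported here so heading_level can be used without Aspose.Words installed.
    import aspose.words as aw

    doc = aw.Document(input_path)
    changed = 0
    for node in doc.get_child_nodes(aw.NodeType.PARAGRAPH, True):
        fmt = node.as_paragraph().paragraph_format
        level = heading_level(fmt.style_name)
        if level is None:
            continue
        # Assumption: assigning the identifier attaches the built-in style.
        fmt.style_identifier = getattr(aw.StyleIdentifier, f"HEADING{level}")
        changed += 1
    doc.save(output_path)
    return changed
```

The idea is to change which style each paragraph points to instead of removing and re-importing styles, but I do not know whether this is the intended way to do it.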
Also, I have noticed that after removing the styles from the DOCX document, the spaces between words vanish. But after exporting the DOCX to HTML, the spaces between words are there.
Also, are the bookmarks just a flat array, or do they have a parent-child relationship with each other?
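To clarify what I mean by a parent-child relationship: if the bookmarks are just a flat list, I would have to derive the nesting myself from their positions, roughly as below. This is a sketch with made-up data. The (name, start, end) tuples and nest_bookmarks are hypothetical and do not come from the Aspose.Words API.

```python
def nest_bookmarks(spans):
    """Map each bookmark name to the name of its closest enclosing bookmark.

    spans: iterable of (name, start, end) tuples with start <= end.
    Bookmarks with no enclosing bookmark map to None.
    """
    # Outer spans first: earlier start, and for equal starts the longer span.
    ordered = sorted(spans, key=lambda s: (s[1], -s[2]))
    parents = {}
    stack = []  # candidate enclosing spans, outermost first
    for name, start, end in ordered:
        # Drop candidates that do not fully contain the current span.
        while stack and not (stack[-1][1] <= start and end <= stack[-1][2]):
            stack.pop()
        parents[name] = stack[-1][0] if stack else None
        stack.append((name, start, end))
    return parents


if __name__ == "__main__":
    sample = [("chapter", 0, 100), ("section", 10, 40), ("para", 12, 20)]
    print(nest_bookmarks(sample))
```

If the library already exposes such a hierarchy, I would rather use that than reconstruct it from positions.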
I have attached the sample document, in which there are no spaces between the words. When it is exported from DOCX to HTML, however, the spaces between the words are present in the exported HTML.