No spaces between words after removing builtin Styles?

I’m going from PDF to DOCX to HTML.

I have received DOCX files that were converted by another service. I’m using Aspose.Words to go from DOCX to HTML.

I need HTML because the output will be displayed in a WYSIWYG editor that supports HTML. I also need the styling to be as close to the original PDF as possible, including bullets, italics, bold, tables, etc.

I need to go from PDF to DOCX and then to HTML because I need to process the output according to the HTML Heading tags viz. H1, H2, H3, H4, H5, and H6.

When I save the file as HTML from MS Office, the heading tags are present in the exported HTML file.

But when I generate HTML from the same DOCX with Aspose.Words, it does not output the heading tags. Instead, it creates a paragraph and applies CSS to match the style of the DOCX document.

I went through the forums and the Aspose.Words documentation, and what I figured out is that it has something to do with the built-in styles of the DOCX document.

What I am doing is removing the heading styles from the current document, loading the styles from another document, and saving the document.

Python Code

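import aspose.words as aw
# inputFile, templateFile, outputFile: placeholder paths, not shown in the post
doc = aw.Document(inputFile)
temp = aw.Document(templateFile)
heading = doc.styles.get_by_name('Heading 1')
print(heading.built_in)  # read-only flag: True for built-in styles
heading.remove()  # remove the heading style from the current document
temp.copy_styles_from_template(doc)  # copies styles from doc into temp
doc.automatically_update_styles = True
doc.save(outputFile)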

What is being done wrong that causes there to be no spaces between words when we view the document in MS Office?

How do I set the current style of the document as the built-in style using Aspose.Words?
sample-document-converted-from-word-using-word-save-as-pdf.docx (11.0 KB)
I am adding styles back to the document based on bookmarks in the document, which is not reliable when a subheading does not have a bookmark or the bookmark is not formatted correctly.

Is there a way to set the current style of the document as built in style so that while exporting the document I do not need to remove the styles?

Also, I have noticed that after removing the styles in the DOCX document, the spaces between words vanish. But after exporting the DOCX to HTML, the spaces are there between the words.

Also, are the bookmarks just an array, or do they have a parent-child relationship with each other?
I have attached the sample document where there are no spaces between the words. But when exporting from DOCX to HTML, there are spaces between the words in the exported HTML.

@harshvardhan.scindia

There is no way to mark a style as built-in. The Style.built_in property is read-only and indicates whether the style is built-in or not.

I checked the attached document and cannot reproduce the problem on my side.

Bookmarks in MS Word documents are represented by two nodes, BookmarkStart and BookmarkEnd. Please see our documentation to learn more about bookmarks.

Please correct me if I have misunderstood your scenario. You have a PDF document as input and use Aspose.Words to load the PDF document and convert it to DOCX. After conversion, headings from the PDF document are imported as simple text with formatting. That is expected, since text in PDF does not have styles. You can try postprocessing the document and applying the required styles. It is not required to remove built-in styles for this. You should simply identify the paragraphs where heading styles must be applied and set the appropriate style.
Also, could you please attach your source PDF document here for testing?

Hi, this is the sample PDF document.

edited-sample-document-converted-from-docx-to-pdf-adobe.pdf (121.5 KB)

Also, below is an image from the same document.
screenshot-image-of-no-space-document.PNG (102.7 KB)

And how can I get all paragraphs with the Heading 1 style? I am using Python.
Thank you for your quick response.

@harshvardhan.scindia Thank you for the additional information. I have checked your source PDF document and noticed it is tagged, i.e. heading paragraphs in the PDF are marked with the appropriate tags. We can probably use this information while loading the PDF document into the Aspose.Words DOM. I will consult with the developers responsible for PDF import and provide you with more information.
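
Regarding getting all paragraphs with the Heading 1 style in Python, here is a minimal sketch. It assumes the document already contains paragraphs that use that style. The helper names `filter_by_style`, `all_paragraphs` and `print_heading1` are made up for illustration and are not part of Aspose.Words, and the path passed in is a placeholder.

```python
def filter_by_style(paragraphs, style_name="Heading 1"):
    """Keep only the paragraphs whose paragraph style has the given name."""
    return [p for p in paragraphs if p.paragraph_format.style_name == style_name]


def all_paragraphs(doc):
    """Collect every paragraph of an aspose.words Document, including nested ones."""
    import aspose.words as aw  # imported lazily; requires the aspose-words package

    nodes = doc.get_child_nodes(aw.NodeType.PARAGRAPH, True)
    # Nodes returned by get_child_nodes are cast to Paragraph before use.
    return [node.as_paragraph() for node in nodes]


def print_heading1(path):
    """Print the text of each 'Heading 1' paragraph in the document at `path`."""
    import aspose.words as aw

    doc = aw.Document(path)  # `path` is a placeholder for your own file
    for para in filter_by_style(all_paragraphs(doc)):
        print(para.get_text().strip())
```

Note that this only finds paragraphs that really use the style. Paragraphs that merely look like headings through direct formatting will not match.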

@harshvardhan.scindia As I already mentioned, your sample PDF document is tagged and we can read the tag information into the Aspose.Words DOM. There is the ParagraphFormat.outline_level property, which is currently used to export document structure into PDF. We can fill this property with the appropriate level upon reading the PDF document. Would that be enough for you to identify heading paragraphs?

Hi, thank you for the reply. How do I read the tag information into the Aspose.Words DOM? Please guide me to the documentation. I will explore the ParagraphFormat.outline_level property.

@harshvardhan.scindia In the current version there is no way to achieve this. I have created a feature request, WORDSNET-23414. We will add this functionality in one of the future versions and let you know once it is available.
With the current version, the only way I can propose is checking content formatting to detect heading paragraphs. However, this is not an accurate method, because heading paragraphs can have different formatting in different documents.
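
A rough sketch of such a formatting check is shown below. It rests on a loud assumption that will not hold for every document: the most common font size is body text, and each distinct larger size is one heading level, largest first. The helper names `guess_heading_levels` and `apply_guessed_headings` are hypothetical and not part of Aspose.Words.

```python
from collections import Counter


def guess_heading_levels(sizes):
    """Map each paragraph's font size to a heading level (1-6), or None for body text.

    `sizes` holds one font size per paragraph; None marks a paragraph without runs.
    Assumption: the most common size is body text, larger sizes are headings.
    """
    known = [s for s in sizes if s is not None]
    if not known:
        return [None] * len(sizes)
    body = Counter(known).most_common(1)[0][0]
    # Distinct sizes above body text, largest first, capped at six levels.
    larger = sorted({s for s in known if s > body}, reverse=True)[:6]
    level_of = {size: rank + 1 for rank, size in enumerate(larger)}
    return [level_of.get(s) for s in sizes]


def apply_guessed_headings(doc):
    """Set Heading 1-6 styles on paragraphs of an aspose.words Document by font size."""
    import aspose.words as aw  # imported lazily; requires the aspose-words package

    paras = [n.as_paragraph() for n in doc.get_child_nodes(aw.NodeType.PARAGRAPH, True)]
    # Use the largest run font size as the size of the paragraph.
    sizes = [max((r.as_run().font.size for r in p.runs), default=None) for p in paras]
    for para, level in zip(paras, guess_heading_levels(sizes)):
        if level is not None:
            style_id = getattr(aw.StyleIdentifier, f"HEADING{level}")
            para.paragraph_format.style_identifier = style_id
```

The size-based rule is kept in a separate pure function so it can be tuned, or replaced with a check on bold text or paragraph length, without touching the document code.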

Hi, thank you for your quick reply. I have one more question regarding both Aspose.Words and Aspose.PDF. When converting a tagged PDF to HTML or Word, can we use the information in the tagged PDF to export the heading tags as well? That is, the Word file would have the relevant headings (h1, h2, h3, h4, h5, h6), and the same for the HTML, as heading tags improve accessibility.

@harshvardhan.scindia Once the feature WORDSNET-23414 is implemented, heading (h1, h2, h3, h4, h5, h6) information will be read into the Aspose.Words DOM as the relevant paragraph outline level. You will be able to use this information to assign the appropriate styles to the paragraphs and then export the document to HTML and MS Word formats.
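
Assigning styles from the outline level could then look roughly like the sketch below. It assumes the import fills ParagraphFormat.outline_level as described, and that the `aw.OutlineLevel.LEVEL1` to `LEVEL6` and `aw.StyleIdentifier.HEADING1` to `HEADING6` enum members are used for the mapping. The helper names `outline_to_heading_map` and `apply_heading_styles` are hypothetical.

```python
def outline_to_heading_map(outline_levels, style_ids, max_level=6):
    """Build a lookup from outline level N to the matching 'Heading N' style id."""
    return {
        getattr(outline_levels, f"LEVEL{n}"): getattr(style_ids, f"HEADING{n}")
        for n in range(1, max_level + 1)
    }


def apply_heading_styles(doc):
    """Give each paragraph with outline level 1-6 the corresponding heading style."""
    import aspose.words as aw  # imported lazily; requires the aspose-words package

    mapping = outline_to_heading_map(aw.OutlineLevel, aw.StyleIdentifier)
    for node in doc.get_child_nodes(aw.NodeType.PARAGRAPH, True):
        para = node.as_paragraph()
        style_id = mapping.get(para.paragraph_format.outline_level)
        if style_id is not None:  # body text and deeper levels are left untouched
            para.paragraph_format.style_identifier = style_id
```

After calling `apply_heading_styles(doc)`, the document can be saved with `doc.save(...)` to the target format.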

Okay, thank you for your quick reply. Is there a timeline for when this feature will be implemented?

@harshvardhan.scindia It is scheduled for the March release of Aspose.Words for .NET, which will be released at the beginning of March 2022. The Python version will be released a bit later. We will be sure to notify you once it is available.

Hi, thank you for the quick reply. Will it be available in Java as well? For which language does Aspose.Words have better support, Java or .NET?

@harshvardhan.scindia Sure, the feature will also be available in the Java version.

The .NET version is the main version of the product and is released first. Then the code is ported to Java and C++, and these products are released next. The Python version is a wrapper and uses the .NET version as its core. All versions of Aspose.Words have the same set of functionality and level of support.

Thank you so much.

The issues you found earlier (filed as WORDSNET-23414) have been fixed in the Aspose.Words for .NET 22.3 update, which is also available on NuGet.