Issue while Converting DocX to HTML and then back to DocX

parthu · February 27, 2018, 12:45pm

Hi there,

My use case is as follows:

1. Convert part of a DOCX file to HTML (extract that part into a separate Document and save that document as HTML).
2. Modify the extracted HTML content using CKEditor.
3. Insert the updated HTML content back into the original DOCX file (using find and replace with the insertHtml() method).

The issue is that whenever there is an empty line in the original DOCX (within the content I want to extract), step 1 adds a style attribute containing -aw-import:ignore; to it. For example:

 

Whenever I replace the content of this empty line with some text, the document generated in step 3 does not contain that text, because its tag's style still has the -aw-import:ignore; attribute.

In other words, Aspose.Words ignores tags whose style attribute contains -aw-import:ignore; so their content is not reflected in the generated DOCX.

Is there any solution?

tahir.manzoor · February 27, 2018, 1:54pm

@parthu,

Thanks for your inquiry. To ensure a timely and accurate response, please attach the following resources here for testing:

Your input Word/HTML document.
Please attach the output Word file that shows the undesired behavior.
Please create a standalone console application (source code without compilation errors) that helps us to reproduce your problem on our end and attach it here for testing.

As soon as you get these pieces of information ready, we’ll start investigation into your issue and provide you more information. Thanks for your cooperation.

PS: To attach these resources, please zip and upload them.

callidus · March 19, 2018, 3:01pm

@tahir.manzoor

We have the same issue when converting from DOCX to HTML and exporting back to DOCX.

Aspose adds the “-aw-import:ignore” attribute to empty elements (&nbsp;), and when exporting back, these elements are omitted from the DOCX completely.

We need to understand why Aspose adds this attribute to empty elements, and whether it is safe to remove it in our application before exporting, as a quick fix.

Regards,
Callidus Team

tahir.manzoor · March 19, 2018, 5:01pm

@callidus,

Thanks for your inquiry. Please ZIP and attach your input Word document here for testing. We will investigate the issue on our side and provide you more information.

callidus · March 20, 2018, 1:35pm

Hi @tahir.manzoor,

Please find attached an example with the DOCX input file and the HTML result. You can see that Aspose adds the “ignore” attribute to all empty (&nbsp;) elements.

We use Aspose.Words 17.10 for this conversion.

    public static String generateHtmlFromByteArray(byte[] template) {
        if (template == null) {
            return "";
        }
        InputStream is = new ByteArrayInputStream(template);
        String convertedHtml = null;
        ByteArrayOutputStream outputStream = new ByteArrayOutputStream();
        try {
            // Load the DOCX bytes into an Aspose.Words document.
            Document doc = new Document(is);
            HtmlSaveOptions options = new HtmlSaveOptions(SaveFormat.HTML);

            // Embed images in the HTML as Base64 data instead of separate files.
            options.setExportImagesAsBase64(true);
            // Merge adjacent runs that share the same formatting.
            doc.joinRunsWithSameFormatting();
            doc.save(outputStream, options);
            convertedHtml = outputStream.toString("UTF-8");

        } catch (Exception e) {
            log.error("Error while converting byte array to HTML.", e);
        } finally {
            IOUtils.closeQuietly(is);
            IOUtils.closeQuietly(outputStream);
        }
        // Note: convertedHtml is still null here if the conversion failed.
        convertedHtml = transformListTags(convertedHtml);
        return convertedHtml;
    }

CallidusExample.zip (11.9 KB)

Regards,
Milorad

tahir.manzoor · March 20, 2018, 4:34pm

@callidus,

Thanks for sharing the detail. We have logged this problem in our issue tracking system as WORDSNET-16599. We will inform you via this forum thread once there is any update available on this issue.

We apologize for your inconvenience.

tahir.manzoor · March 21, 2018, 6:55am

@callidus,

Thanks for your patience. We have completed the analysis of WORDSNET-16599 and closed it.

Please note that empty paragraphs cannot be exported to HTML as empty elements, because such elements will be collapsed (will have zero height) in browsers. In order to prevent collapsing, Aspose.Words writes invisible content - a non-breaking space character - into each empty paragraph. Since this content is not a part of the original document, it should be ignored when the HTML document is loaded by Aspose.Words. Otherwise, extra content would appear in the document after DOCX-HTML-DOCX round-trip.

If you write custom content into paragraphs that are empty in the source DOCX document, you can remove both the ‘-aw-import:ignore’ declaration and the non-breaking space character from those paragraphs.

Custom text.
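
The advice above could be applied as a preprocessing step on the edited HTML before it is passed back to Aspose.Words. The following is only a minimal sketch using plain `java.util.regex`, not an Aspose.Words API: the class name `AwImportIgnoreCleaner` and the method `unmarkEditedPlaceholders` are hypothetical. It assumes the marker sits in a double-quoted `style` attribute of an element with simple content (no nested element of the same tag name), which may not hold for every export or after every editor round-trip. Placeholders that still contain only a non-breaking space are left untouched, so unedited empty paragraphs keep being ignored on import.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class AwImportIgnoreCleaner {

    // A non-breaking space written as an entity or as the raw character.
    private static final String NBSP = "(?:&nbsp;|&#xa0;|&#160;|\\u00A0)";

    // An element whose style attribute contains -aw-import:ignore.
    // Assumption: simple content, no nested element with the same tag name.
    private static final Pattern MARKED = Pattern.compile(
            "<(\\w+)\\b([^>]*?)\\bstyle=\"([^\"]*?-aw-import\\s*:\\s*ignore[^\"]*)\"([^>]*)>(.*?)</\\1>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    private static final Pattern MARKER = Pattern.compile(
            "-aw-import\\s*:\\s*ignore\\s*;?\\s*", Pattern.CASE_INSENSITIVE);
    private static final Pattern ANY_NBSP =
            Pattern.compile(NBSP, Pattern.CASE_INSENSITIVE);
    private static final Pattern EDGE_NBSP = Pattern.compile(
            "^" + NBSP + "+|" + NBSP + "+$", Pattern.CASE_INSENSITIVE);

    /** Removes the marker (and the placeholder nbsp) only from edited elements. */
    static String unmarkEditedPlaceholders(String html) {
        Matcher m = MARKED.matcher(html);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String content = m.group(5);
            // Untouched placeholder: nothing but non-breaking spaces and whitespace.
            boolean untouched =
                    ANY_NBSP.matcher(content).replaceAll("").trim().isEmpty();
            String replacement;
            if (untouched) {
                replacement = m.group(); // keep the marker as exported
            } else {
                // Drop the marker declaration; omit the attribute if nothing is left.
                String style = MARKER.matcher(m.group(3)).replaceAll("").trim();
                String styleAttr = style.isEmpty() ? "" : "style=\"" + style + "\"";
                String attrs = (m.group(2) + styleAttr + m.group(4)).replaceAll("\\s+$", "");
                // Drop the placeholder nbsp at the start or end of the edited text.
                String text = EDGE_NBSP.matcher(content).replaceAll("");
                replacement = "<" + m.group(1) + attrs + ">" + text + "</" + m.group(1) + ">";
            }
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

In the workflow described earlier in this thread, the edited HTML from CKEditor would be passed through `unmarkEditedPlaceholders` before being inserted back into the document. A real HTML parser would likely be more robust than regular expressions if the editor restructures the markup.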