Image overlapping in the HTML text

Hi team,

We are using the code below to extract the HTML text from contentControl. However, in some cases, images overlap with the text. We have attached the extracted HTML code, which includes images from the document. In the HTML displayed in our HTML editor window, the image positions are set to absolute. As a result, when we display the content, the image positions remain unchanged. We have also attached a zip file containing a video along with the HTML text and images extracted from contentControl.

private HtmlMixedText getHtmlMixedText(StructuredDocumentTag std, boolean acceptRevision) {
    HtmlMixedText htmlMixedText = new HtmlMixedText();
    try {
        HtmlSaveOptions opts = new HtmlSaveOptions(SaveFormat.HTML);
        opts.setHtmlVersion(HtmlVersion.HTML_5);
        opts.setExportImagesAsBase64(true);
        opts.setExportListLabels(ExportListLabels.AS_INLINE_TEXT);
        opts.setExportPageMargins(true);
        htmlMixedText.setHtmlWithRevisions(std.toString(opts));
        htmlMixedText.setText(std.getText());
    } catch (Exception ex) {
        LOGGER.error("Not able to extract html text from content control , id : {}, due to : {}", std.getTag(), ex);
        htmlMixedText.setHtmlText(std.getText());
    }
    return htmlMixedText;
}

data.zip (2.6 MB)

@hariomgupta73 Could you please attach your input document here for testing? We will check conversion on our side and provide you more information.

In addition, please note that Aspose.Words is designed to work with MS Word documents. The HTML and MS Word document object models are quite different, so it is not always possible to achieve 100% fidelity when converting from one model to the other. In most cases, Aspose.Words mimics MS Word behavior when working with HTML.

Attaching the document for testing.

test document.docx (247.1 KB)

@hariomgupta73 Thank you for the additional information. The problem occurs because the shapes in your document are floating. I am afraid there is no way to properly preserve the position of floating shapes when converting a document to flow HTML.

@alexey.noskov Is there any way to change the shapes in the document? Can we change an image in the document so that it is no longer floating?

@hariomgupta73 You can change the wrap type of shapes in code, but this might significantly affect the document layout:

Document doc = new Document("C:\\Temp\\in.docx");
for(Shape s : (Iterable<Shape>)doc.getChildNodes(NodeType.SHAPE, true))
{
    if(s.isTopLevel())
        s.setWrapType(WrapType.INLINE);
}
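
If modifying the document is not an option, a display-only workaround is to post-process the extracted HTML string instead. The following is a minimal sketch in plain Java, not an Aspose.Words feature. It assumes the floating images carry an inline `position:absolute` declaration inside a double-quoted `style` attribute, as described in the first post, and simply drops that declaration so the images fall back into normal flow. The class and method names (`HtmlPositionCleaner`, `stripAbsolutePositioning`) are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Aspose.Words.
final class HtmlPositionCleaner {

    // Assumption: styles are written as double-quoted inline style attributes.
    private static final Pattern STYLE_ATTR =
            Pattern.compile("style=\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);

    // A "position:absolute" declaration plus its optional trailing semicolon and whitespace.
    private static final Pattern ABSOLUTE =
            Pattern.compile("position\\s*:\\s*absolute\\s*;?\\s*", Pattern.CASE_INSENSITIVE);

    // Returns a copy of the HTML with "position:absolute" removed from inline styles.
    // Text outside style attributes is left untouched.
    static String stripAbsolutePositioning(String html) {
        Matcher m = STYLE_ATTR.matcher(html);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String cleaned = ABSOLUTE.matcher(m.group(1)).replaceAll("");
            m.appendReplacement(out, Matcher.quoteReplacement("style=\"" + cleaned + "\""));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

The result is meant for viewing only; the unmodified HTML should still be used when inserting content back into the document. A regular expression keeps the sketch short, but an HTML parser would be more robust against unusual markup.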

We are facing another issue while extracting HTML from a Word document using the Aspose library. In the extracted HTML, text content is hidden beneath watermark images, so the text is not clearly visible in the rendered HTML output.

Issue Details:

  • Problem: Text in the Word document that appears behind a watermark is not properly extracted or is overlapped by the watermark in the HTML output.
  • Expected Behavior: The extracted HTML should retain proper text visibility without being hidden under the watermark.
  • Actual Behavior: The extracted HTML places the text below the watermark, making it unreadable in certain scenarios.

Attaching the document to reproduce the issue.

_Master-Supply-Agreement-MSA-effective-9-Oct-2023_For Dem0.docx (47.1 KB)

@hariomgupta73 As far as I can see, the watermark in the output HTML is behind the text, just as expected.

If the output HTML is for viewing purposes only, i.e. it is not supposed to be edited or processed, you can consider using the HtmlFixed format. In this case the output should look exactly as it does in MS Word:

Document doc = new Document("C:\\temp\\in.docx");
HtmlFixedSaveOptions opt = new HtmlFixedSaveOptions();
opt.setExportEmbeddedCss(true);
opt.setExportEmbeddedFonts(true);
opt.setExportEmbeddedImages(true);
opt.setExportEmbeddedSvg(true);
doc.save("C:\\Temp\\out.html", opt);

The HtmlFixed format is designed to preserve the original document layout for viewing purposes. So if your goal is to display the HTML on a page, this format can be considered as an alternative. Unfortunately, it does not support round-tripping to DOCX at all.

@alexey.noskov The above solution generates HTML text that matches the original text of the document. Is there a way to remove such images from the HTML while rendering it for viewing, while still preserving the formatting?

However, during editing, we provide the original HTML retrieved from our method so that when inserting the HTML back into the document, the formatting remains intact. We also use a method to fetch HTML based on the content control instead of retrieving it directly from the document.

@hariomgupta73

Do you mean the image behind text? You can remove all shapes using the following code:

Document doc = new Document("C:\\Temp\\in.docx");
doc.getChildNodes(NodeType.SHAPE, true).clear();
doc.save("C:\\Temp\\out.docx");

In your case there is a group shape. If you need to remove a group shape that is behind text, you can use the following code:

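Document doc = new Document("C:\\Temp\\in.docx");

// Copy the live collection to an array first, so that removing nodes
// does not interfere with the iteration.
for(Node node : doc.getChildNodes(NodeType.GROUP_SHAPE, true).toArray())
{
    GroupShape gs = (GroupShape)node;
    if(gs.getBehindText())
        gs.remove();
}

doc.save("C:\\Temp\\out.docx");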
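
Since the HTML shown for viewing and the HTML used for editing are kept separate, another option is to leave the document untouched and filter only the viewing copy. A rough sketch in plain Java follows. It assumes that behind-text images show up in the extracted HTML as `<img>` tags with a negative `z-index` in their inline style; this is an assumption about the generated markup and should be verified against the actual output. `BehindTextImageFilter` and `removeBehindTextImages` are hypothetical names, not part of Aspose.Words.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Aspose.Words.
final class BehindTextImageFilter {

    // Any <img ...> tag (base64 data in src does not contain '>').
    private static final Pattern IMG_TAG =
            Pattern.compile("<img\\b[^>]*>", Pattern.CASE_INSENSITIVE);

    // Assumption: images placed behind text are exported with a negative z-index.
    private static final Pattern NEGATIVE_Z =
            Pattern.compile("z-index\\s*:\\s*-\\d+", Pattern.CASE_INSENSITIVE);

    // Returns a copy of the HTML without <img> tags that have a negative z-index.
    // All other markup, including other images, is kept as-is.
    static String removeBehindTextImages(String html) {
        Matcher m = IMG_TAG.matcher(html);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String tag = m.group();
            String replacement = NEGATIVE_Z.matcher(tag).find() ? "" : tag;
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

Only the copy rendered for viewing would be passed through `removeBehindTextImages`; the original HTML from the content control stays as the source for editing and re-insertion.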