Word file conversion to HTML, HTML structure issues

Word files are converted to HTML, but the structure of the converted HTML is incorrect. The watermark image and the header image are emitted in the same structure, so it is impossible to tell which images are watermarks and which are header images.
test.zip (121.2 KB)

@yanke1 Could you please attach your input document here for testing?
Please note that Aspose.Words is designed to work with MS Word documents. The object models of HTML documents and MS Word documents are quite different, so it is not always possible to provide 100% fidelity and preserve all features when converting from one format to the other.

We need to identify the images contained in the converted HTML, and distinguish which ones are watermark images and which ones are header images. It is difficult to distinguish them in the current HTML.
test.zip (847.0 KB)

@yanke1 You can wrap watermark shapes into bookmarks to make them identifiable after conversion to HTML:

Document doc = new Document("C:\\Temp\\in.docx");

// Wrap watermark shapes into bookmarks to make it possible to identify them after conversion to HTML.
int bk_index = 0;
for (Shape s : (Iterable<Shape>)doc.getChildNodes(NodeType.SHAPE, true))
{
    if (s.getName().contains("PowerPlusWaterMarkObject") || s.getName().contains("WordPictureWatermark"))
    {
        String bkName = "watermark_" + bk_index;
        s.getParentNode().insertBefore(new BookmarkStart(doc, bkName), s);
        s.getParentNode().insertAfter(new BookmarkEnd(doc, bkName), s);
        bk_index++;
    }
}
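doc.save("C:\\Temp\\out.html");

In the output HTML, the watermark image is then wrapped in an anchor that carries the bookmark name, for example:
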
<a name="watermark_0">
    <span style="height:0pt; text-align:left; display:block; position:absolute; z-index:-65537">
        <img src="out.001.png" width="704" height="543" alt="" style="margin-top:191.51pt; margin-left:-56.28pt; -aw-left-pos:0pt; -aw-rel-hpos:margin; -aw-rel-vpos:margin; -aw-top-pos:0pt; -aw-wrap-type:none; position:absolute" />
    </span>
</a>
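
Once the anchors are in place, the watermark images can be picked out of the saved HTML on the consumer side. The following is only a minimal sketch, assuming the output has the shape shown above (an `<a name="watermark_N">` element wrapping the `<img>`). The class `WatermarkImages` and its method `find` are hypothetical names used for illustration and are not part of Aspose.Words. It uses regular expressions from the Java standard library to keep the example self-contained; a real HTML parser would be more robust.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not an Aspose.Words API): collects the src of every
// <img> that sits inside an <a name="watermark_N"> anchor.
final class WatermarkImages {
    // Assumes the anchor format shown in the output above; captures the
    // content up to the first closing </a>.
    private static final Pattern ANCHOR = Pattern.compile(
            "<a\\s+name=\"watermark_\\d+\"[^>]*>(.*?)</a>",
            Pattern.DOTALL | Pattern.CASE_INSENSITIVE);

    private static final Pattern IMG_SRC = Pattern.compile(
            "<img\\b[^>]*?\\bsrc=\"([^\"]*)\"",
            Pattern.CASE_INSENSITIVE);

    static List<String> find(String html) {
        List<String> sources = new ArrayList<>();
        Matcher anchor = ANCHOR.matcher(html);
        while (anchor.find()) {
            Matcher img = IMG_SRC.matcher(anchor.group(1));
            while (img.find()) {
                sources.add(img.group(1));
            }
        }
        return sources;
    }
}
```

Any image whose `src` is not returned by `WatermarkImages.find(html)` can then be treated as a non-watermark image, such as a header image.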

Thank you for your reply. Watermark images are now recognized in most files, but not in the following file. Is there any other way?
watermark.zip (144 KB)

@alexey.noskov

@yanke1 Watermarks in MS Word documents are simple shapes placed in the header behind the main content. To distinguish them from other shapes, MS Word and Aspose.Words use special shape names that start with either "PowerPlusWaterMarkObject" or "WordPictureWatermark", as shown in the code above. In your document, however, there are no shapes with such names, so the code does not identify any shapes as watermarks. It looks like the “watermark” shape in your document was inserted by a custom tool, or manually by inserting a simple shape into the header. I am afraid there is no way to identify watermarks among other shapes other than by the special shape name.
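
The naming rule described above can be isolated in a small predicate, which makes it easy to see why a manually inserted shape is not matched. This is a sketch only: `WatermarkNames` and `isWatermarkName` are hypothetical names, and the check uses `contains` to mirror the code earlier in this thread rather than a strict prefix test. The shape name would come from `Shape.getName()`.

```java
// Hypothetical helper (not an Aspose.Words API): decides from the shape
// name alone whether a shape looks like a watermark inserted by MS Word.
final class WatermarkNames {
    // Name fragments taken from the code earlier in this thread.
    private static final String[] MARKERS = {
        "PowerPlusWaterMarkObject",
        "WordPictureWatermark"
    };

    static boolean isWatermarkName(String shapeName) {
        if (shapeName == null) {
            return false;
        }
        for (String marker : MARKERS) {
            if (shapeName.contains(marker)) {
                return true;
            }
        }
        return false;
    }
}
```

A shape with any other name, or with no name, returns `false`, which is the case for the document attached above.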

Thank you for your reply
