Positioning of any type of shape not preserved when converting from Word to HTML and back to Word

The positioning of shapes of any kind is not preserved in some cases when I convert DOCX -> HTML -> DOCX.

In the case below, the issue happens because the page break is inserted in a different location than in the original document, and it gets worse: the shape ends up hidden. If this is a limitation of your product where 100% fidelity cannot be assured, then please provide a workaround: what should I look for in the HTML and fix there, so that the conversion back to Word comes out fine?

image.png (449.8 KB)

I have also uploaded the input and output files:
In.docx (167.3 KB)
out-rountrip.docx (104.5 KB)

@cyrusdaru1 HTML is a flow format that does not presume precise pagination. Unlike in MS Word, there is no way to set the exact location of a PageBreak character in the text. A roundtrip with an intermediate stage of saving to HTML always carries the risk of partially losing formatting or layout; the only question is how complex the initial markup is. Unfortunately, we cannot guarantee that the layout will be preserved with this roundtrip method.

Hi Vadim, but I need to provide my customer with a solution to this problem. Once the document is generated from HTML, if I remove the shapes from the target document (the output DOCX) and then copy the shapes directly from the source DOCX, will the layout be preserved, since there is then no intermediate HTML?

I can always add a unique ID to each shape using the Alternative Text property in the source document to identify them.

Please, we need some solution to this kind of problem. I have no problem coding the logic.
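
The tagging idea described above could be sketched roughly as follows. This is only an illustration under several assumptions: the file names and the `shape-id-` prefix are hypothetical, the sketch assumes the alternative text survives the HTML round trip, it overwrites any existing alternative text, and nothing in this thread confirms that swapping the shapes back restores the original layout.

```csharp
using System.Collections.Generic;
using Aspose.Words;
using Aspose.Words.Drawing;

class ShapeTagging
{
    // Hypothetical prefix used to recognise the tags written by this sketch.
    const string IdPrefix = "shape-id-";

    // Step 1: give every shape in the source a unique Alternative Text before the HTML export.
    // Note: this overwrites any alternative text the shapes already had.
    static void TagShapes(string sourcePath, string taggedPath)
    {
        Document source = new Document(sourcePath);
        int nextId = 0;
        foreach (Shape shape in source.GetChildNodes(NodeType.Shape, true))
            shape.AlternativeText = IdPrefix + nextId++;
        source.Save(taggedPath);
    }

    // Step 2: after the round trip, replace each tagged shape in the target
    // with the shape carrying the same tag in the tagged source document.
    static void RestoreShapes(string taggedPath, string roundTripPath, string restoredPath)
    {
        Document source = new Document(taggedPath);
        var sourceById = new Dictionary<string, Shape>();
        foreach (Shape shape in source.GetChildNodes(NodeType.Shape, true))
        {
            if (!string.IsNullOrEmpty(shape.AlternativeText))
                sourceById[shape.AlternativeText] = shape;
        }

        Document target = new Document(roundTripPath);
        // Copy the live collection to an array because nodes are replaced inside the loop.
        Node[] targetShapes = target.GetChildNodes(NodeType.Shape, true).ToArray();
        foreach (Shape targetShape in targetShapes)
        {
            string tag = targetShape.AlternativeText;
            if (string.IsNullOrEmpty(tag) || !sourceById.TryGetValue(tag, out Shape original))
                continue;

            // A node must be imported into the target document before it can be inserted there.
            Node imported = target.ImportNode(original, true);
            targetShape.ParentNode.InsertAfter(imported, targetShape);
            targetShape.Remove();
        }
        target.Save(restoredPath);
    }

    static void Main()
    {
        TagShapes("in.docx", "in-tagged.docx");
        // The conversion in-tagged.docx -> HTML -> out-roundtrip.docx happens outside this sketch.
        RestoreShapes("in-tagged.docx", "out-roundtrip.docx", "out-restored.docx");
    }
}
```

Only nodes of type Shape are handled here; group shapes would need the same treatment with NodeType.GroupShape.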

@cyrusdaru1 Please send the intermediate HTML file obtained in this situation.

Please find the attached html

main.zip (70.8 KB)

@cyrusdaru1 The main cause of the issue in this case is an incorrect ContentControl roundtrip during export to and import from HTML. Initially, the problematic images you highlighted were inside SdtContent and were aligned relative to that content. After the roundtrip this ContentControl disappeared, its content was rendered directly, and the relative horizontal alignment became incorrect (see the red highlighting in my screenshot). The same is true for the second image, originally provided in the source document as the “Company” ContentControl. In addition, one of the SdtContent elements was exported as a paragraph during the roundtrip, which produced an additional paragraph that was not present in the original document (see the green highlighting), which in turn pushed the 2030 logo image out of the visible area. If you manually delete this paragraph, the logo returns to its place.

The SdtContent structural element in MS Word has a very wide range of uses and, at the same time, has no well-defined analogue in HTML. It may turn into one or more div or p elements during export. During back conversion it can be converted either back to SdtContent or to multiple paragraphs. If SdtContent also has complex multi-level alignment within itself, then the probability of recovering it correctly after a Docx -> Html -> Docx conversion approaches zero. You can try preparing the document for the roundtrip beforehand by extracting the contents of SdtContent into the body of the document; otherwise, in general, I’m afraid this task is unsolvable.

Can you please help me with the following:
a. How can I identify whether a shape is embedded in SdtContent using your API?
b. How can I remove these SdtContent elements and place their content in the body?

@cyrusdaru1, please consider the following code:

// Get the desired shape (here the third shape in the body; the index is zero-based).
Shape shape = doc.FirstSection.Body.GetChild(NodeType.Shape, 2, true) as Shape;
// Check whether the shape is inside a ContentControl.
if (shape != null && shape.GetAncestor(NodeType.StructuredDocumentTag) != null)
    Console.WriteLine("Shape inside ContentControl");
// Get the desired structured document tag (here the first one in the body).
StructuredDocumentTag sdt = doc.FirstSection.Body.GetChild(NodeType.StructuredDocumentTag, 0, true) as StructuredDocumentTag;
// Remove the content control, leaving its content untouched.
if (sdt != null)
    sdt.RemoveSelfOnly();
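
The snippet above handles one shape and one content control. A minimal sketch of applying the same calls to the whole document before the HTML export might look like this; the file names are placeholders, and whether unwrapping every content control is acceptable depends on the document.

```csharp
using System;
using Aspose.Words;
using Aspose.Words.Drawing;
using Aspose.Words.Markup;

class UnwrapContentControls
{
    static void Main()
    {
        // "in.docx" and "in-flattened.docx" are placeholder file names.
        Document doc = new Document("in.docx");

        // Report every shape that sits inside a content control.
        foreach (Shape shape in doc.GetChildNodes(NodeType.Shape, true))
        {
            if (shape.GetAncestor(NodeType.StructuredDocumentTag) != null)
                Console.WriteLine("Shape inside ContentControl: " + shape.Name);
        }

        // Copy the live collection to an array first, because nodes are removed inside the loop.
        Node[] sdts = doc.GetChildNodes(NodeType.StructuredDocumentTag, true).ToArray();
        foreach (StructuredDocumentTag sdt in sdts)
        {
            // Removes the content control itself and keeps its content in place.
            sdt.RemoveSelfOnly();
        }

        doc.Save("in-flattened.docx");
    }
}
```

The flattened copy would then be the input for the Docx -> Html -> Docx conversion, leaving the original file untouched.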

Hi Vadim, I removed the StructuredDocumentTags from the document before converting to HTML, and after the round trip I still see an extra paragraph; see the screenshot below.

image.png (68.7 KB)

Below is another sample where extra paragraphs are causing issues.
This happens after I have removed all StructuredDocumentTags from the source document, converted it to HTML, and then converted it back to DOCX.

image.png (671.6 KB)

Also, since I need to preserve shapes in the target document after the round trip, I copy the shapes from the source and replace them in the target. By the way, the extra space issue above remains whether I copy the shapes or leave them as images after converting to HTML.

in.docx (1.8 MB)
out-rountrip.docx (1.7 MB)
I have also attached the input and output Word files.

I have also noticed that at times the paragraph line spacing is not properly saved when converting to HTML and then back to Word. This also results in the issue above. So, all in all, I have seen 2 issues:

  1. Extra Paragraphs added while conversion round trip
  2. Line spacing not preserved

image.png (645.4 KB)

@cyrusdaru1
Both problems may be related to the issue mentioned in the following thread WORDSNET-24463. The second problem, “Line spacing not preserved”, is related to that thread directly. The first one, “Extra Paragraphs added while conversion round trip”, is related to it indirectly. Since the problematic shape is text-aligned (Shape.RelativeVerticalPosition == RelativeVerticalPosition.TextFrameDefault), Aspose.Words export tries to align the shape position in a similar way in HTML, but due to a roundtrip problem the shape position changes during back conversion.
You can try to work around this first problem by setting the alignment relative to the page (Shape.RelativeVerticalPosition = RelativeVerticalPosition.Page).
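
A minimal sketch of that workaround, applied to every floating, text-aligned shape before the HTML export, could look like the following. The file names are placeholders, and the sketch assumes the change is acceptable for all such shapes in the document. It does not recompute Shape.Top, so a shape's vertical offset will be interpreted relative to the page after the change and may need adjusting.

```csharp
using Aspose.Words;
using Aspose.Words.Drawing;

class AnchorShapesToPage
{
    static void Main()
    {
        // "in.docx" and "in-page-anchored.docx" are placeholder file names.
        Document doc = new Document("in.docx");

        foreach (Shape shape in doc.GetChildNodes(NodeType.Shape, true))
        {
            // Relative positioning only applies to top-level floating shapes.
            if (shape.IsInline || !shape.IsTopLevel)
                continue;

            // Switch text-aligned shapes to page-relative vertical positioning.
            if (shape.RelativeVerticalPosition == RelativeVerticalPosition.TextFrameDefault)
                shape.RelativeVerticalPosition = RelativeVerticalPosition.Page;
        }

        doc.Save("in-page-anchored.docx");
    }
}
```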