Aspose Compare - removing classnames and changing valid xHTML (Aspose.Words Version 21.4.0)

When comparing xHTML with Aspose.Words, the compared output shows that Aspose.Words changes the xHTML and removes class names. (It also adds hundreds of unnecessary style attributes to all HTML elements; I have stripped these generically, as they add nothing to the styling of the elements.)

Examples of these are as follows:

1 - <em> and/or <strong> tags are changed to be <span style="font-weight:bold">some strong</span> and/or <span style="font-style:italic">and italic as em</span>

While this does not necessarily change how the page is presented in a browser, it has a large impact on screen readers, which would then present a completely different experience that is not reader-friendly.

QUESTION: Is there a flag/setting/configuration that tells Aspose to NOT change html elements?

2 - IDs and current style attributes are removed.

QUESTION: Is there a flag/setting/configuration that tells Aspose to NOT change html elements?

3 - LI items are split out of a UL if they are different (see image)

Other questions in the images…

OLD VERSION:

    	<div>
    		<div class="p" id="thisWillBeRemoved">
    			[heading 3]
    			<ul id="UL_TT1_DDR_PKB">
    				<li id="SL18311645-100489">[First list item]</li>
    				<li id="SL18311647-100489">[Second list item]</li>
    				<li id="SL18311649-100489">[Third list item] <fn>example: <em>emphasis</em></fn></li>
    				<li id="SL18311647-100489">[Fourth list item]</li>
    				<li id="SL18311650-100489">[5th list item] <a href="url" data-scope="internal">Link</a> bottom of the page</li>
    				<li id="SL18311650-100489">[Last list item]</li>
    			</ul>
    		</div>
    		<div class="p">link to symbol <a href="url" data-scope="internal">@#$%^&amp;*()</a>            </div>
    		<div class="p">[link] <fn><a href="javascript:;" target="_blank" data-scope="external">Target Space</a>, <em id="GUID-67535493-B6CC-4952-95F3-4FB9807480C9">emphasis</em>.</fn></div>
    		<div class="p">Only <a href="url" data-scope="internal">Link to SPACE</a></div>
    		<div class="p">Lorem ipsum dolor sit amet, <strong>some strong</strong> aecenas aliquam justo et neque eleifend, id vulputate ligula dictum. Maecenas eget lacinia est.</div>
    		<div class="p">Fusce iaculis pharetra ex, <em>and italic as em</em> et vestibulum metus fringilla et. Sed condimentum risus vitae dapibus congue.</div>
    		<div class="p">This is ONLY in the OLD version. Duis molestie velit eu ligula venenatis, ac tincidunt massa semper. Ut ultrices risus orci, facilisis sollicitudin tellus pretium et. Donec a velit eleifend,</div>
    		<div class="p">This is changed in each version, Vestibulum at congue. Quisque non massa id nibh ornare vel eget quam.</div>
    		<ul>
    			<li>Same. Fusce malesuada ligula eu nisl finibus, ut semper metus rhoncus.</li>
    			<li>ONLY in OLD. Lorem ipsum dolor sit amet, consectetur adipiscing elit.</li>
    			<li>Changed in both Vestibulum ex suscipit ante convallis.</li>
    			<li>Same. Fusce malesuada ligula eu nisl finibus, ut semper metus rhoncus.</li>
    		</ul>
    		<div id="imgblock">
    			<div class="figure-block-body">
    				<figcaption>Image Caption</figcaption>
    				<img src="2dcee8b-af1b-4cdd-9430-0195c52297e2_1_en-us.jpg" id="image1" alt="Alternate text for image" />
    			</div>
    		</div>
    		<div class="p">In closing, we are done.</div>
    	</div>

NEW VERSION:

    	<div>
    		<div class="p" id="thisWillBeRemoved">
    			[heading 3]
    			<ul id="UL_TT1_DDR_PKB">
    				<li id="SL18311645-100489">[First list item]</li>
    				<li id="SL18311647-100489">[Second list item]</li>
    				<li id="SL18311649-100489">[Third list item] <fn>example: <em>emphasis</em></fn></li>
    				<li id="SL18311647-100489">[Fourth list item]</li>
    				<li id="SL18311650-100489">[Last list item]</li>
    			</ul>
    		</div>
    		<div class="p">[link] <fn><a href="javascript:;" target="_blank" data-scope="external">Target Space</a>, <em id="GUID-67535493-B6CC-4952-95F3-4FB9807480C9">emphasis</em>.</fn></div>
    		<div class="p">Lorem ipsum dolor sit amet, <strong>some strong</strong> aecenas aliquam justo et neque eleifend, id vulputate ligula dictum. Maecenas eget lacinia est.</div>
    		<div class="p">Fusce iaculis pharetra ex, <em>and italic as em</em> et vestibulum metus fringilla et. Sed condimentum risus vitae dapibus congue.</div>
    		<div class="p">This is ONLY in the NEW version. Suspendisse viverra, elit nec porttitor porta, arcu sem suscipit turpis, non viverra turpis neque ac nisi.</div>
    		<div class="p">This is changed in each version, Vestibulum at ligula. Quisque massa id nibh ornare pellentesque vel eget quam.</div>
    		<ul>
    			<li>Same. Fusce malesuada ligula eu nisl finibus, ut semper metus rhoncus.</li>
    			<li>ONLY in NEW. Donec finibus arcu ac feugiat iaculis.</li>
    			<li>Changed in both Vestibulum eleifend ex ante condimentum.</li>
    			<li>Same. Fusce malesuada ligula eu nisl finibus, ut semper metus rhoncus.</li>
    		</ul>
    		<div id="imgblock">
    			<div class="figure-block-body">
    				<figcaption>Image Caption</figcaption>
    				<img src="1122a87f-5953-48a3-93f8-e9d1cfc85e0f_1_en-us.jpg" id="image2" alt="Alternate text for image" />
    			</div>
    		</div>
    		<div class="p">In closing, we are done.</div>
    	</div>

OUTPUT:

    	<div>
    		<p>
    			<span>[heading 3] </span>
    		</p>
    		<ul>
    			<li>
    				<span>[First list item]</span>
    			</li>
    			<li>
    				<span>[Second list item]</span>
    			</li>
    			<li>
    				<span>[Third list item] example: </span><span>emphasis</span>
    			</li>
    			<li>
    				<span>[Fourth list item]</span>
    			</li>
    		</ul>
    		<p>
    			<del><span style="font-family:Symbol"></span></del><span>&amp;#xa0;&amp;#xa0; </span><span>[</span><del><span>5th list item] </span></del>
    			<del><span style="text-decoration:underline">Link</span></del><del><span style="-aw-import:spaces">&amp;#xa0;</span><span>bottom of the page</span></del>
    		</p>
    		<ul>
    			<li>
    				<del><span>[</span></del><span>Last list item]</span>
    			</li>
    		</ul>
    		<p>
    			<del><span>link to symbol </span></del><del><span style="text-decoration:underline">@#$%^&amp;*()</span></del>
    		</p>
    		<p>
    			<span>[link] </span><a href="javascript:;" target="_blank" style="text-decoration:none"><span style="text-decoration:underline">Target Space</span></a><span>, </span>    <span style="font-style:italic">emphasis</span><span>.</span>
    		</p>
    		<p>
    			<del><span>Only </span></del><del><span style="text-decoration:underline">Link to SPACE</span></del>
    		</p>
    		<p>
    			<span>Lorem ipsum dolor sit amet, </span><span style="font-weight:bold">some strong</span><span> aecenas aliquam justo et neque eleifend, id vulputate ligula dictum. Maecenas eget lacinia est.</span>
    		</p>
    		<p>
    			<span>Fusce iaculis pharetra ex, </span><span style="font-style:italic">and italic as em</span><span> et vestibulum metus fringilla et. Sed condimentum risus vitae dapibus congue.</span>
    		</p>
    		<p>
    			<span>This is ONLY in the </span><del><span>OLD</span></del><ins><span>NEW</span></ins><span> version. </span><del><span>Duis molesti</span></del><ins><span>Suspendiss</span></ins><span>e v</span><del><span>elit eu ligula venenatis, ac tincidunt massa semper. Ut ultrices ri</span></del><ins><span>iverra, elit nec porttitor porta, arcu sem </span></ins><span>sus</span><del><span style="-aw-import:spaces">&amp;#xa0;</span><span>orci, facilisis sollicitudin tellus pretium et. Donec a velit eleifend,</span></del><ins><span>cipit turpis, non viverra turpis neque ac nisi.</span></ins>
    		</p>
    		<p>
    			<span>This is changed in each version, Vestibulum at </span><del><span>congue</span></del><ins><span>ligula</span></ins><span>. Quisque </span><del><span>non </span></del><span>massa id nibh ornare </span><ins><span>pellentesque </span></ins><span>vel eget quam.</span>
    		</p>
    		<ul>
    			<li>
    				<span>Same. Fusce malesuada ligula eu nisl finibus, ut semper metus rhoncus.</span>
    			</li>
    			<li>
    				<span>ONLY in </span><del><span>OLD. Lorem ipsum dolor sit amet, consectetur adipiscing elit</span></del><ins><span>NEW. Donec finibus arcu ac feugiat iaculis</span></ins><span>.</span>
    			</li>
    			<li>
    				<span>Changed in both Vestibulum e</span><del><span>x suscipit </span></del><ins><span>leifend ex </span></ins><span>ante con</span><del><span>vallis</span></del><ins><span>dimentum</span></ins><span>.</span>
    			</li>
    			<li>
    				<span>Same. Fusce malesuada ligula eu nisl finibus, ut semper metus rhoncus.</span>
    			</li>
    		</ul>
    		<p>
    			<span>Image Caption</span>
    		</p>
    		<p>
    			<ins><img src="/images/cache/diff/Aspose.Words.34251782-4ac9-421b-a55a-60d03c25e90d.001.jpeg" width="624" height="110" alt="Alternate text for image" style="-aw-left-pos:0pt; -aw-rel-hpos:column; -aw-rel-vpos:paragraph; -aw-top-pos:0pt; -aw-wrap-type:inline" /></ins><del><img src="/images/cache/diff/Aspose.Words.34251782-4ac9-421b-a55a-60d03c25e90d.002.jpeg" width="800" height="92" alt="Alternate text for image" style="-aw-left-pos:0pt; -aw-rel-hpos:column; -aw-rel-vpos:paragraph; -aw-top-pos:0pt; -aw-wrap-type:inline" /></del>
    		</p>
    		<p>
    			<span>In closing, we are done.</span>
    		</p>
    	</div>

CODE: (split into separate functions in our project, but shown together here to show the options)

    using System;
    using System.IO;
    using Aspose.Words;
    using Aspose.Words.Saving;

    // Load: serialize the parsed HTML into a stream and open it as an Aspose.Words Document.
    Document asposeDocument;
    using (var stream = new MemoryStream())
    {
        htmlDocument.Save(stream);
        stream.Position = 0; // rewind so the Document constructor reads from the start
        asposeDocument = new Document(stream);
        asposeDocument.AutomaticallyUpdateStyles = false;
    }
    return asposeDocument;

    // Compare: character-level granularity; the revisions are recorded in docOld.
    CompareOptions compareOptions = new CompareOptions();
    compareOptions.Granularity = Granularity.CharLevel;

    docOld.Compare(docNew, "Compare", DateTime.Now, compareOptions);

    // Save: write the compared document back out as XHTML.
    HtmlDocument htmlDocument = new HtmlDocument();
    htmlDocument.OptionOutputAsXml = true;
    HtmlSaveOptions options = new HtmlSaveOptions();
    options.HtmlVersion = HtmlVersion.Xhtml;
    options.CssStyleSheetType = CssStyleSheetType.Inline;
    options.ExportHeadersFootersMode = ExportHeadersFootersMode.None;
    options.ExportImagesAsBase64 = false;
    options.ExportOriginalUrlForLinkedImages = true;
    options.ExportPageMargins = false;
    options.ExportXhtmlTransitional = true;
    options.ImagesFolderAlias = AsposeImagesPath;
    options.ImagesFolder = $"{BaseFilePath}{AsposeImagesPath}";
    options.PrettyFormat = true; // can disable this later
    options.SaveFormat = SaveFormat.Html;
    options.ScaleImageToShapeSize = false;

    docOld.Save(streamCompare, options);

COMPARISON IMAGES WITH ISSUES/QUESTIONS:
aspose-issues.jpg (914.1 KB)

aspose-img.jpg (605.8 KB)

@brpennington While loading HTML, Aspose.Words converts it into the Aspose.Words DOM (Document Object Model), which is designed to work with MS Word document formats. The HTML format differs from the MS Word formats, and not all of its features can be retained when an HTML document is opened and saved through Aspose.Words or MS Word.

No, there is no way to keep HTML elements unchanged while processing an HTML document with Aspose.Words; as I mentioned earlier, Aspose.Words is designed first of all to work with MS Word documents.
If you compare HTML documents using MS Word, it also does not retain the original HTML elements, for the same reasons.
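
Since the elements cannot be preserved on the Aspose.Words side, one possible workaround is to post-process the saved output. The following is only a minimal sketch, not an Aspose.Words feature: `SemanticTagRestorer` and `Restore` are hypothetical names, it uses only `System.Xml.Linq`, and it assumes the output is well-formed XHTML in which bold and italic runs appear as `<span>` elements whose `style` is exactly `font-weight:bold` or `font-style:italic`, as in the OUTPUT sample above.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

// Hypothetical post-processing helper; it is not part of Aspose.Words.
public static class SemanticTagRestorer
{
    public static string Restore(string xhtml)
    {
        // PreserveWhitespace keeps the spaces between inline elements intact.
        XElement root = XElement.Parse(xhtml, LoadOptions.PreserveWhitespace);

        // ToList() snapshots the spans before any of them is renamed.
        foreach (XElement span in root.DescendantsAndSelf()
                                      .Where(e => e.Name.LocalName == "span")
                                      .ToList())
        {
            XAttribute style = span.Attribute("style");
            if (style == null) continue;

            // Normalise "font-weight: bold;" to "font-weight:bold".
            string value = style.Value.Replace(" ", "").TrimEnd(';');

            string tag = value == "font-weight:bold" ? "strong"
                       : value == "font-style:italic" ? "em"
                       : null;
            if (tag == null) continue; // every other span is left untouched

            span.Name = span.Name.Namespace + tag; // keep the XHTML namespace, if any
            style.Remove();
        }

        return root.ToString(SaveOptions.DisableFormatting);
    }
}
```

Note that this cannot tell an original `<strong>` from an original `<b>` or a bold `<span>`, so every bold run becomes `<strong>`. Removed IDs and class names cannot be recovered this way, because they are no longer present in the output.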

@alexey.noskov … Thank you for your reply and for confirming my initial thoughts. I assume, then, that the actual comparison is performed by the MS Word “engine” and not by an Aspose-controlled product?

@brpennington No, Aspose.Words does not use MS Word. Aspose.Words uses our own document comparison engine.

@alexey.noskov Excellent, thank you for confirming this. Regarding the LI comparison, where it converts the LI to something completely different (a P tag with the special symbol character), thereby separating the list into two parts… is that something the compare engine could improve on?

The workaround for this (for us) is to try to indent that special symbol the same way in the output, so that it does not look like two separate lists, but that's not really a great one. I get that the bullet is what we want shown as “different”, and that a “del” around the LI would be “impossible”, since it would mean writing

<ul>
    ...
    <del><li>...</li></del>
    ...
</ul>

as that is not valid HTML… so I get why it's done this way, and I think it is pretty impressive that it splits the UL and maintains the validity of the xHTML document!

@brpennington Thank you for the additional information. As I already mentioned, Aspose.Words does not compare HTML documents directly; it compares the Document Object Models of the two documents. The differences are marked as revisions. If you save the comparison result as an MS Word document (DOCX, for example) and open it, you will be able to accept or reject these revisions.
Since HTML does not provide all the features available in MS Word documents, Aspose.Words exports revisions so that the resulting HTML document looks as close as possible to what MS Word shows.
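
To inspect those revisions directly rather than through the exported HTML, something along the following lines should work. This is a sketch only: `RevisionReport`, `CompareAndSave` and `docxPath` are hypothetical names introduced for illustration, and it assumes neither document contains revisions before `Compare` is called.

```csharp
using System;
using Aspose.Words;

// Hypothetical helper built around the Compare call from the original post.
public static class RevisionReport
{
    public static void CompareAndSave(Document docOld, Document docNew, string docxPath)
    {
        // After Compare, docOld holds the differences as tracked revisions.
        docOld.Compare(docNew, "Compare", DateTime.Now);

        foreach (Revision revision in docOld.Revisions)
        {
            // Style-definition revisions have no parent node, so skip them.
            if (revision.RevisionType == RevisionType.StyleDefinitionChange) continue;

            Console.WriteLine($"{revision.RevisionType}: {revision.ParentNode.GetText().Trim()}");
        }

        // Saved as DOCX, the revisions can be accepted or rejected in MS Word.
        docOld.Save(docxPath);
    }
}
```

Calling `docOld.Revisions.AcceptAll()` before saving would instead produce a document that matches the new version with no revision marks.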