Document compare wrongly matches Normal text against Heading text, leading to confusing comparison results

We are using the following API to compare two Word documents with Aspose.Words for Java version 15.7 (aspose-words-15.7.0-jdk16.jar).

==> Document.compare(Document document, java.lang.String author, java.util.Date dateTime)

What we noticed is that text with the normal font is being compared with text in Heading1, Heading2, etc. We know this is tough to implement, but without it the comparison results are confusing.

HOW TO FIX THIS: At least keep certain standard fonts/styles, such as Heading1 and Heading2, separate from other fonts/styles when comparing text… Otherwise normal paragraph text is compared to text in Heading1, Heading2, etc., leading to confusing results. It is like comparing text in different, unrelated parts of the document.

To explain this, take the below four word documents attached to this issue:
* “SourceDoc.docx” – source doc
* “TargetDoc.docx” – source doc modified to this target doc
* “Compare_by_Aspose.docx” – result of comparing source and target documents by Aspose APIs
* “Compare_by_MS_Word_CompareDocuments.docx” – result of comparing source and target documents by MS Word application by going to MS Word -> Review tab -> Select Compare -> Select two files -> Select to compare the two files.


If you see the attached “Compare_by_Aspose.docx”, you will notice that the character sequence “ac” in the word “oracle” in the Source document is being compared to the character sequence “ac” in the Heading1 “Background” in the Target document… which means a normal text in Source doc is being compared to Heading1 text in Target doc…

However, MS Word's compare does not match text like this… Refer to the attached “Compare_by_MS_Word_CompareDocuments.docx”.
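
To make the reported pairing concrete, here is a minimal, hypothetical sketch. It is not Aspose's or MS Word's actual algorithm, which this thread does not describe. It only shows that a plain character-level matcher with no knowledge of paragraph styles would pick “ac” as the longest run shared by “oracle” and “Background”. The class and method names are invented for illustration.

```java
// Hypothetical illustration only: NOT the Aspose.Words or MS Word algorithm.
final class CommonRunDemo {

    /** Returns the longest run of characters that occurs contiguously in both inputs. */
    static String longestCommonSubstring(String a, String b) {
        int bestLen = 0;
        int bestEnd = 0; // exclusive end index of the best run inside a
        // len[i][j] = length of the common run ending at a[i-1] and b[j-1]
        int[][] len = new int[a.length() + 1][b.length() + 1];
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                if (a.charAt(i - 1) == b.charAt(j - 1)) {
                    len[i][j] = len[i - 1][j - 1] + 1;
                    if (len[i][j] > bestLen) {
                        bestLen = len[i][j];
                        bestEnd = i;
                    }
                }
            }
        }
        return a.substring(bestEnd - bestLen, bestEnd);
    }

    public static void main(String[] args) {
        // Body-text word from the source doc vs. Heading1 text from the target doc.
        System.out.println(longestCommonSubstring("oracle", "Background")); // prints "ac"
    }
}
```

A matcher like this has no reason to prefer heading-to-heading matches, which is one possible explanation for the output in “Compare_by_Aspose.docx”.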

Can you please fix this as explained above…

Thanks,
-Satya

I think it would be useful if you could check the MS Word Compare utility to understand the use cases for generating a Word diff. What I noticed is that the MS Word diff tool does a clean job of finding the differences. This behavior could be provided either as a regular feature or as options in the Aspose document compare API. That would be nice.

Go to MS Word -> Review tab -> Select Compare -> Select two files -> Select to compare the two files.

Hi Satya,

Thanks for your inquiry.
I have tested the scenario and have managed to reproduce the same issue at my side. For the sake of correction, I have logged this problem in our issue tracking system as WORDSNET-12339. I have linked this forum thread to the same issue and you will be notified via this forum thread once this issue is resolved.

We apologize for your inconvenience.

Hi Satya,

Thanks for your patience. It is to inform you that our product team has completed the analysis of this issue (WORDSNET-12339) and has come to a conclusion that this issue and the undesired behavior you’re observing is actually not a bug in Aspose.Words.

Please see the three attached documents named “Normal to Heading (A)”, (B) and (R). The document named (R) is the result of comparing (A) and (B). As you can see, MS Word compares the text even though it has the Normal style in document (A) and Heading 1 in document (B). Word just applies a formatting change to the paragraph.

It would be great if you could share the use cases of your scenario along with some more detail here for our reference. We will then provide you with more information on this.

Hi Tahir,
Thanks for looking into this.

The test case you tried is different from what I explained. You tried the simple case of changing the text font from Heading1 to the Normal font. That means you changed only the “font style” of the text from Heading1 to Normal.

But I am not talking about that case… In my test document, I have text in the Heading1 font and also paragraph text in the Normal font. I changed both the Heading1 text content and the paragraph text content. PLEASE NOTE that I didn’t change the “font style” of the text; I changed only the actual text content in those areas… That is, I changed the text content in the heading and in the paragraph, but did NOT change their font style… Our requirement is that the Aspose API comparison result be made roughly equivalent to the MS Word comparison result, so that it looks neater.

Basically, the MS Word tool seems to compare heading text to heading text and normal text to normal text… It does a sort of apples-to-apples comparison, so the comparison result looks neater. That is our requirement.

We are asking you not to treat all text the same… For example, numbered headings like 1., 1.1., 1.1.1. (Heading1, Heading2 fonts) and normal text (Normal font) should be treated as different buckets when comparing… Otherwise, if all text is treated the same, the comparison result looks ugly.

Take the attached documents to explain this with examples:
* SourceDoc1.docx is compared to TargetDoc1.docx.
Here Compare_by_MS_Word_CompareDocuments1.docx generated by MS Word looks neat.
Whereas the Compare_by_Aspose1.docx generated by Aspose Compare API has this issue.

* To explain in simple terms, in this example, SourceDoc1.docx contains “HeadingText1, Line1, Line2.”… whereas TargetDoc1.docx contains “HeadingText2, Line3, Line2, Line1.”.

* In this case the Aspose API compares the HeadingText1 text content to the Line3 text content in the new document, whereas MS Word doesn’t… MS Word compares only the HeadingText1 text content to the HeadingText2 text content… If heading text moves up or down among the paragraph text, the MS Word compare tool simply reports a delete and an add of the heading text; it does not treat the heading as having changed into Normal paragraph text.
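
The “separate buckets” idea requested here can be sketched as follows. This is only an illustration under stated assumptions: the Para class, the style names and the diff method are hypothetical, the code does not use the Aspose.Words API, and it is not a description of how MS Word compares documents. It groups paragraphs by style and reports deletions and insertions per style, so heading text is never paired with normal text. Paragraph moves and in-paragraph edits are deliberately ignored.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: NOT the Aspose.Words API and NOT MS Word's algorithm.
final class StyleBucketSketch {

    /** Assumed paragraph model: a style name plus the paragraph text. */
    static final class Para {
        final String style;
        final String text;
        Para(String style, String text) { this.style = style; this.text = text; }
    }

    /** Groups paragraph texts by style name, keeping document order inside each bucket. */
    static Map<String, List<String>> bucketByStyle(List<Para> doc) {
        Map<String, List<String>> buckets = new LinkedHashMap<>();
        for (Para p : doc) {
            buckets.computeIfAbsent(p.style, k -> new ArrayList<>()).add(p.text);
        }
        return buckets;
    }

    /** Reports per-style deletions ("-") and insertions ("+"); never pairs text across styles. */
    static List<String> diff(List<Para> source, List<Para> target) {
        Map<String, List<String>> src = bucketByStyle(source);
        Map<String, List<String>> tgt = bucketByStyle(target);
        Set<String> styles = new LinkedHashSet<>(src.keySet());
        styles.addAll(tgt.keySet());
        List<String> out = new ArrayList<>();
        for (String style : styles) {
            List<String> before = src.getOrDefault(style, Collections.emptyList());
            List<String> after = tgt.getOrDefault(style, Collections.emptyList());
            for (String t : before) {
                if (!after.contains(t)) out.add("[" + style + "] - " + t);
            }
            for (String t : after) {
                if (!before.contains(t)) out.add("[" + style + "] + " + t);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Outline of SourceDoc1.docx and TargetDoc1.docx as described in this post.
        List<Para> source = Arrays.asList(
                new Para("Heading 1", "HeadingText1"),
                new Para("Normal", "Line1"),
                new Para("Normal", "Line2"));
        List<Para> target = Arrays.asList(
                new Para("Heading 1", "HeadingText2"),
                new Para("Normal", "Line3"),
                new Para("Normal", "Line2"),
                new Para("Normal", "Line1"));
        // Prints:
        // [Heading 1] - HeadingText1
        // [Heading 1] + HeadingText2
        // [Normal] + Line3
        diff(source, target).forEach(System.out::println);
    }
}
```

With this bucketing, HeadingText1 can only ever be reported against HeadingText2, never against Line3.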

Please refer to the attached documents with above usecases.

Please try different simple, complex and mixed cases of heading text and normal paragraph text, such as (a) changing only the text content, (b) changing only the font style, (c) changing both the text content and the font style, (d) moving heading text up and down across the normal paragraph text, etc… and then cross-compare the Aspose API behavior with the MS Word behavior to better understand our requirements.

This is a very important feature for us, because we need to present document differences to our customers in a very neat way; otherwise they won’t be interested in our product. So please treat this fix as high priority. Ultimately, the MS Word difference tool is the benchmark for this requirement.

I hope you understand our concern, and importance and urgency for this feature.

Let me know if any more clarifications needed.

Thanks,
-Satya






Hi Tahir,
Please look at the documents attached above, which I provided to simplify the case… Let me know if you need any clarification.
-Satya

Hi Satya,

Thanks for sharing the detail. I have logged this detail in our issue tracking system. Our product team will check this use case and we will update you via this forum thread once there is any update available on this scenario.

Please let us know if you have any more queries.

Thanks Tahir.

Hi Satya,

Thanks for your patience. Firstly, I like to share with you that document comparison is very hard issue with huge count of possible use cases.

Secondly, as far as we can see, Word doesn’t compare text by style. See the attached screenshot of your shared test case. It can clearly be seen that Word doesn’t take text style into consideration when making the comparison. It inserted the “Overview” text, then inserted the “The Oracle Corporation is …” text, and then deleted the “Background” text. If your idea about Word comparison algorithm was right result will be different. “Overview” text will be inserted and “Background” text will be deleted right after.

So, the undesired behavior you’re observing is actually not a bug in Aspose.Words. We are closing WORDSNET-12339 as ‘Not a bug’.

Please let us know if you have any more queries.

Hi Tahir,
Thanks for verifying the issue.

For this usecase, you showed me the MS Word compare output.. Fine... Then, what is Aspose API output?

Do you agree that there is a difference in behavior between Aspose and MS Word in their compare output for this use case? If yes, then why does that difference arise in the Aspose API? It is because there is a difference in functionality between MS Word and Aspose.

You said >>"Firstly, I like to share with you that document comparison is very hard issue with huge count of possible use cases."<< --- Absolutely Agree with you.

You said >>"If your idea about Word comparison algorithm was right result will be different. "Overview" text will be inserted and "Background" text will be deleted right after."<< --- Yes, that may be correct... Maybe I didn't explain it well above.. But the problem does exist on the Aspose side...

===> The PROBLEM with Aspose is that it tries to match a *portion* of normal text to a *portion* of Heading1 text.. MS Word doesn't do that.. Do you agree? Please kindly see the Aspose API result for this use case.

Whatever algorithm Aspose or MS Word uses, for this use case the Word output looks better, whereas the Aspose output does not. Do you agree?

Our intention is to show better comparison results to our customers... If the Aspose API compare output is made equal to the MS Word compare output (which is the benchmark as of now because it does a nice job).. then the Aspose product will be more usable and customers will like to use it, in my opinion..

YES.. It is not a BUG in Aspose... Can you please address this as Enhancement?

Can you please raise an enhancement request on your side..

I hope you understood..

-Satya

Hi Satya,

Thanks for sharing the detail. I am in communication with our product team about this enhancement request. We will update you via this forum thread asap.

Thanks much. Keep us updated on this.

The issues you have found earlier (filed as WORDSNET-12339) have been fixed in the Aspose.Words for .NET 16.10.0 update and the Aspose.Words for Java 16.10.0 update.

