How to verify whether two documents are the same or similar?

Hi Support,

Is there a way to check whether two documents are the same or similar?
Thanks for your help!

@ducaisoft,

Please refer to your other thread:

Here are a few more details as to how you can use the Document.Compare Method:

Document doc1 = new Document();
DocumentBuilder builder = new DocumentBuilder(doc1);
builder.Writeln("This is the original document.");

// The target document doc2.
Document doc2 = new Document();
builder = new DocumentBuilder(doc2);
builder.Writeln("This is the edited document.");

// If either document has a revision, an exception will be thrown.
if (doc1.Revisions.Count == 0 && doc2.Revisions.Count == 0)
    doc1.Compare(doc2, "authorName", DateTime.Now);

// If doc1 and doc2 are different, doc1 now has some revisions after the comparison, which can now be viewed and processed.
Assert.AreEqual(2, doc1.Revisions.Count);

foreach (Revision r in doc1.Revisions)
{
    Console.WriteLine($"Revision type: {r.RevisionType}, on a node of type \"{r.ParentNode.NodeType}\"");
    Console.WriteLine($"\tChanged text: \"{r.ParentNode.GetText()}\"");
}

// All the revisions in doc1 are differences between doc1 and doc2, so accepting them on doc1 transforms doc1 into doc2.
doc1.Revisions.AcceptAll();

// doc1, when saved, now resembles doc2.
doc1.Save("C:\\Temp\\Document.Compare.docx");
doc1 = new Document("C:\\Temp\\Document.Compare.docx");
Assert.AreEqual(0, doc1.Revisions.Count);
Assert.AreEqual(doc2.GetText().Trim(), doc1.GetText().Trim());

Thanks for your suggestion.

I tried it and found that there is no Assert.AreEqual method in Aspose.Words.dll v20.9, so the sample may be of no use with the Doc.Compare method in the API.
Another bug is that compOp.IgnoreHeadersAndFooters = True doesn’t work.
One more bug is that the API does not return the page count correctly. For example, a doc has 9 pages, but the API reports a page count of only 2, so I have to convert the doc to PDF and then read the correct page count from the PDF.

Is there another way to solve my issue?

@ducaisoft,

The Assert.AreEqual method is part of the NUnit framework, not Aspose.Words. You can install the NUnit package from NuGet.
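If you would rather not add NUnit just to try the sample, the assertions can be swapped for a plain check. The following is only a minimal sketch: `Check.AreEqual` is a hypothetical helper written for illustration (it is not part of Aspose.Words or NUnit), and the revision count is a placeholder for the value you would read from doc1.Revisions.Count after calling Compare.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for NUnit's Assert.AreEqual; not part of Aspose.Words or NUnit.
public static class Check
{
    // Throws when the two values differ, similar to how a failed assertion stops a test.
    public static void AreEqual<T>(T expected, T actual)
    {
        if (!EqualityComparer<T>.Default.Equals(expected, actual))
            throw new InvalidOperationException($"Expected <{expected}> but got <{actual}>.");
    }
}

public static class NUnitFreeDemo
{
    public static void Main()
    {
        // Placeholder: in the sample above this would be doc1.Revisions.Count after Compare.
        int revisionCount = 2;

        Check.AreEqual(2, revisionCount);
        Console.WriteLine("Check passed.");
    }
}
```

The assertions in the sample only verify the expected outcome; Document.Compare itself does not depend on them.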

Please ZIP and upload your input Word documents and Aspose.Words generated DOCX file(s) showing the undesired behavior here for testing. We will then investigate the issues (related to compOp.IgnoreHeadersAndFooters and Page Count) on our end and provide you more information.

Thanks for your message.
But I still do not understand your method:


Dim sDoc As New Document("Doc1.doc")
Dim cDoc As New Document("Doc2.doc")
Dim Author As String = Environ("user")
sDoc.Compare(cDoc, Author, DateTime.Now)
Dim Rn As Integer = sDoc.Revisions.Count
If Rn = 0 Then
    Console.WriteLine("The two docs are the same")
ElseIf Rn > 100 Then
    Console.WriteLine("The two docs are different")
    ' How can I justify that they are different when Rn > 0?
Else
    Console.WriteLine("The two docs are likely similar")
    ' How can I justify that they are similar when Rn > 0?
End If

Another demo:

Dim sDoc As New Document()
Dim builder = New DocumentBuilder(sDoc)
builder.Write("This the source doc")
Dim cDoc As New Document()
builder = New DocumentBuilder(cDoc)
builder.Write("This the doc for compare")
Dim Author As String = Environ("user")
sDoc.Compare(cDoc, Author, DateTime.Now)
Dim Rn As Integer = sDoc.Revisions.Count
' Here the result is 0

However, when I compare the same two docs in MS Word, it reports that they are different.
Why do the API and MS Word produce different results?

@ducaisoft,

For more details, please refer to:

Please also check these two sets of documents and try running the following code:

C# Code:

Document docOriginal = new Document("C:\\Temp\\Original document.docx");
Document docRevised = new Document("C:\\Temp\\Revised document.docx");

// Compare writes the differences into docOriginal as revisions; docRevised is left unchanged.
docOriginal.Compare(docRevised, "author", DateTime.Now);
int revCount = docOriginal.Revisions.Count;

if (revCount == 0)
    Console.WriteLine("Documents are equal");
else
    Console.WriteLine("Documents are not equal");

docOriginal.Save("C:\\Temp\\output with differences.docx");

If you have any sample Word documents for which the Aspose.Words comparison engine produces results different from those of MS Word, please ZIP and upload those specific documents here for testing. We will then investigate the issue on our end and provide you more information.
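On the question of telling "similar" from "different": the revision count alone says how many changes were found, not how large the documents are, so a fixed cut-off such as 100 is arbitrary. One option is to relate the count to the size of the document. The following is only a rough sketch under stated assumptions: the `Classify` helper, the choice of paragraph count as the size measure, and the 0.2 threshold are all made up for illustration and are not part of the Aspose.Words API or a recommended setting.

```csharp
using System;

// Hypothetical helper for illustration; not part of Aspose.Words.
public static class SimilarityHeuristic
{
    // revisionCount: e.g. docOriginal.Revisions.Count after Compare.
    // paragraphCount: any size measure you choose for the original document (assumption).
    // similarThreshold: arbitrary cut-off on revisions per paragraph (assumption).
    public static string Classify(int revisionCount, int paragraphCount, double similarThreshold = 0.2)
    {
        if (revisionCount < 0 || paragraphCount < 0)
            throw new ArgumentOutOfRangeException("Counts must not be negative.");

        if (revisionCount == 0)
            return "same";

        // Guard against division by zero for an empty document.
        double ratio = (double)revisionCount / Math.Max(paragraphCount, 1);
        return ratio <= similarThreshold ? "similar" : "different";
    }
}

public static class SimilarityDemo
{
    public static void Main()
    {
        Console.WriteLine(SimilarityHeuristic.Classify(0, 40));  // same
        Console.WriteLine(SimilarityHeuristic.Classify(3, 40));  // similar
        Console.WriteLine(SimilarityHeuristic.Classify(30, 40)); // different
    }
}
```

The threshold should be tuned against your own documents; what counts as "similar" depends on your use case.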

The API's Compare method only works for an original doc. If the doc has been edited, the revision count is always 0, whereas MS Word reports the revision count correctly.

@ducaisoft,

Please ZIP and attach the following resources here for testing:

  • Your simplified source Word document(s)
  • Aspose.Words for .NET 21.1 generated output DOCX file showing the undesired behavior (where Aspose.Words reports 0 revisions, etc.)
  • Your expected DOCX file showing the desired output. You can create this document manually by using MS Word.
  • Please also create a standalone simple Console application (source code without compilation errors) that helps us to reproduce your current problem on our end and attach it here for testing. Please do not include Aspose.Words DLL files in it to reduce the file size.

As soon as you get these pieces of information ready, we will start further investigation into your scenario and provide you more information.

Please see the attached files for your investigation.
GetWordsPages.zip (491.9 KB)
PS: If the file fails to unzip, please change the file extension from “zip” to “rar” and try again.

@ducaisoft,

The Document.PageCount property returns the correct number of pages for all ten documents you shared. However, the Document.BuiltInDocumentProperties property returns an incorrect number of pages for the A2.docx, B1.doc, B2.doc, C2.doc, D1.doc and E2.doc documents. Calling the Document.UpdatePageLayout method before reading Document.BuiltInDocumentProperties fixes this issue. Please try running the following code:

string[] fileNames = Directory.GetFiles("C:\\Temp\\GetWordsPages\\", "*.doc?", SearchOption.TopDirectoryOnly);
foreach (string fileName in fileNames)
{
    Document doc = new Document(fileName);

    doc.UpdatePageLayout();

    Console.WriteLine(doc.OriginalFileName);
    Console.WriteLine("doc.BuiltInDocumentProperties.Pages = " + doc.BuiltInDocumentProperties.Pages);
    Console.WriteLine("doc.PageCount = " + doc.PageCount);
    Console.WriteLine("doc.Revisions.Count = " + doc.Revisions.Count);
    Console.WriteLine("-----------------------------");
}

Moreover, both MS Word 2019 and the code above confirm that there are no revisions in any of these Word documents.

To check this yourself, open the Word document in MS Word, then go to Review tab > Reviewing Pane > Reviewing Pane Vertical or Horizontal.

Please let me know if I can be of any further assistance.