How to know whether a Word doc is corrupted

It looks like when I open a corrupted Word doc using Aspose.Words, the corruption is repaired automatically. But is there any way to know, upon loading the document, that the doc is corrupted?

Thanks!

The DOC format is very complex and poorly documented. Aspose.Words tries to load as much of a document as it can, and it corrects many minor "corruptions" along the way. But if it reaches a point where it cannot load the document any further, it throws an exception. There is no single specific exception that we throw when Aspose.Words cannot load a document. The following exceptions can be thrown:

  • PleaseReportException, when Aspose.Words detects something it believes is worth reporting to the Aspose.Words developers, since having such a document in our test suite will cover more undocumented and unknown features of the DOC format.
  • File and stream related exceptions. Many data fields inside a DOC file hold offsets that point to locations in the stream. If some of these offsets or lengths are invalid, Aspose.Words may end up trying to read beyond the end of the file, and .NET will throw.
  • Index out of range and similar exceptions. As above, a DOC file contains various indexes that link structures together. Aspose.Words reads these into arrays and lists, and accessing them with an invalid index might throw some sort of index out of range or argument exception.
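
Since no single exception type signals a failed load, one option is to catch broadly around the load call and keep the exception for logging. The following is only a minimal sketch: DocumentLoadCheck, TryLoad and the load delegate are hypothetical names used for illustration, not part of the Aspose.Words API.

```csharp
using System;

public static class DocumentLoadCheck
{
    // Runs the supplied loader and reports whether it completed without throwing.
    // 'load' is a hypothetical hook so the sketch compiles without Aspose.Words.
    public static bool TryLoad(Func<string, object> load, string path,
                               out object document, out Exception error)
    {
        try
        {
            document = load(path);
            error = null;
            return true;
        }
        catch (Exception ex)
        {
            // Catch broadly: the failure may surface as a stream, index,
            // argument, or library-specific exception.
            document = null;
            error = ex;
            return false;
        }
    }
}
```

With Aspose.Words referenced, the loader could be passed as path => new Aspose.Words.Document(path). A result of true only means the load did not throw; it does not show that the original file was free of the minor problems Aspose.Words corrects silently.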

Thanks!

I also wonder if there is any way I could know that my original document was in fact corrupted. Although Aspose.Words can repair some corruptions, I still need to know that the document WAS corrupted. Is there a way to know programmatically?

No, there is no way to know that.

If the document is corrupted, it will not load. Only minor deviations that do not result in loss of data or formatting are corrected automatically.

The problem I am trying to tackle at the moment is opening Word docs using Aspose.Words. Is it safe to assume that if Aspose.Words cannot open a doc (throws an exception), then I cannot use Microsoft Word automation to open it either?

Thanks.

Another related question: if I were to use Aspose.Words to open a document from a version earlier than Word 97, what would happen? Would an exception be thrown? And is there any way to know which version of Word the document uses?

I'm not sure what the problem is. If you explain why you need to worry so much about invalid documents, we might be able to come up with a better solution.

The DOC format is very complex. Aspose.Words reads the DOC file itself; it does not use Microsoft Word. The code in Aspose.Words and the code in Microsoft Word are presumably very different. Whatever errors Aspose.Words can detect and fix, or throw on, have no direct relationship to the errors Microsoft Word can detect. Different versions of Microsoft Word will also error out or correct documents differently. Sometimes you can open a document in Aspose.Words that Microsoft Word cannot open. Sometimes you can open a document in Microsoft Word whereas Aspose.Words will throw.

If the file is in a pre-Word 97 format, Aspose.Words will throw:

throw new NotSupportedException("Sorry, this document is in pre-Word97 format and it is not currently supported.");
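
A caller that wants to separate this case from other load failures could test the type of the caught exception. This is a rough sketch only: LoadFailureKind and LoadFailure.Classify are hypothetical names, and it assumes that a NotSupportedException raised during loading means an unsupported format, although .NET can throw the same type for unrelated reasons (a stream that cannot seek, for example).

```csharp
using System;

public enum LoadFailureKind
{
    UnsupportedFormat, // e.g. the pre-Word 97 case shown above
    Other              // anything else: treat the file as unreadable
}

public static class LoadFailure
{
    // Maps an exception caught while loading to a coarse failure kind.
    public static LoadFailureKind Classify(Exception error)
    {
        if (error == null) throw new ArgumentNullException(nameof(error));

        // Assumption: NotSupportedException here indicates an unsupported format.
        return error is NotSupportedException
            ? LoadFailureKind.UnsupportedFormat
            : LoadFailureKind.Other;
    }
}
```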

There is no way to programmatically detect the Microsoft Word version of a document through the Aspose.Words API. Don't forget that DOC files can also be created by other programs. In general, the Word 97 - Word 2007 binary format is backward and forward compatible, although newer versions add more features, of course.