Detecting file format from stream using C#

Hi,

We are trying to work out a strategy for detecting which Aspose component can handle a given stream (starting from "Detect file types from stream").

The code we use to detect whether Aspose.Words 20.3.0 can handle a stream is:

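    var canHandle = false;

    try
    {
        // Rewind so detection starts at the beginning of the stream.
        input.Position = 0L;
        var fileFormatInfo = Aspose.Words.FileFormatUtil.DetectFileFormat(input);

        // Accept anything Aspose.Words recognizes, except PDF.
        if (fileFormatInfo.LoadFormat != Aspose.Words.LoadFormat.Unknown
            && fileFormatInfo.LoadFormat != Aspose.Words.LoadFormat.Pdf)
        {
            canHandle = true;
        }
    }
    catch (Exception)
    {
        // Any detection failure means Aspose.Words cannot handle the stream.
    }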

We already started to see problems with this generic approach:

  1. We had to exclude PDF, because Aspose.Words detects PDF and we want Aspose.PDF to be the component that handles it.
  2. The real problem is that many more file formats (including diagrams and some images (!)) are detected with LoadFormat set to Aspose.Words.LoadFormat.Text.

We just want to make sure that we have a piece of code that handles only formats that Aspose.Words really knows how to handle:

  • Microsoft Word: DOC, DOCX, RTF, DOT, DOTX, DOTM, DOCM, FlatOPC, FlatOpcMacroEnabled, FlatOpcTemplate, FlatOpcTemplateMacroEnabled
  • OpenOffice: ODT, OTT
  • WordprocessingML: WordML
  • Web: HTML, MHTML
  • Text: TXT
  • MOBI

In particular, we want to make sure that Aspose.Words is the component that handles the MS Word formats and TXT, while other text-like file formats are disregarded.

Here are the files that are detected as Aspose.Words.LoadFormat.Text: Aspose.Words.DetectedAsText.zip (88.2 KB)

Is there a better approach we could pursue here?

Best regards,
Alin

@gwert

Please note that the Aspose.Words API only detects the file formats that it can load into its DOM. Please check the load formats supported by Aspose.Words here:
Supported Document Formats
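
One way to limit handling to the formats listed in the question is an allowlist instead of excluding Unknown and Pdf one by one. The following is only a sketch: WordsFormatFilter and IsAllowed are hypothetical names, and the strings are assumed to match the LoadFormat member names in your Aspose.Words version, so please verify them against the enum before relying on this.

```csharp
using System;
using System.Collections.Generic;

public static class WordsFormatFilter
{
    // Hypothetical allowlist: names of the LoadFormat members that
    // Aspose.Words should own. The names are an assumption; check them
    // against the LoadFormat enum of the version you use.
    private static readonly HashSet<string> Allowed =
        new HashSet<string>(StringComparer.Ordinal)
        {
            "Doc", "Dot", "Docx", "Docm", "Dotx", "Dotm", "Rtf",
            "FlatOpc", "FlatOpcMacroEnabled",
            "FlatOpcTemplate", "FlatOpcTemplateMacroEnabled",
            "Odt", "Ott", "WordML", "Html", "Mhtml", "Text", "Mobi"
        };

    // Returns true only for names on the allowlist. "Unknown", "Pdf" and
    // any format added in a later version are rejected by default.
    public static bool IsAllowed(string loadFormatName)
    {
        return loadFormatName != null && Allowed.Contains(loadFormatName);
    }
}
```

With the code from the question, this would be called as canHandle = WordsFormatFilter.IsAllowed(fileFormatInfo.LoadFormat.ToString()). The allowlist does not solve the LoadFormat.Text over-detection on its own, because Text stays on the list.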

The FileFormatUtil.DetectFileFormat method detects the document format. However, it does not guarantee that the specified document is valid. This method only detects the document format by reading enough data for detection. To fully verify that a document is valid, you need to load it into a Document object.
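
Because detection reads only part of the stream, a LoadFormat.Text result can be double-checked before it is accepted. The sketch below is a rough heuristic and is not part of Aspose.Words: TextSniffer, LooksLikePlainText and the 4096-byte sample size are hypothetical. It assumes a seekable stream and treats control bytes other than tab, line feed, form feed and carriage return as a sign of binary content.

```csharp
using System;
using System.IO;

public static class TextSniffer
{
    // Hypothetical helper, not an Aspose.Words API: returns false when the
    // start of a seekable stream contains bytes that plain text rarely has.
    public static bool LooksLikePlainText(Stream input, int sampleSize = 4096)
    {
        long original = input.Position;
        try
        {
            input.Position = 0;
            byte[] buffer = new byte[sampleSize];
            int read = 0, n;
            // Read may return fewer bytes than requested, so loop.
            while (read < buffer.Length
                   && (n = input.Read(buffer, read, buffer.Length - read)) > 0)
            {
                read += n;
            }

            // UTF-16 text legitimately contains NUL bytes; accept it by BOM.
            if (read >= 2
                && ((buffer[0] == 0xFF && buffer[1] == 0xFE)
                    || (buffer[0] == 0xFE && buffer[1] == 0xFF)))
            {
                return true;
            }

            for (int i = 0; i < read; i++)
            {
                byte b = buffer[i];
                bool allowedControl =
                    b == 0x09 || b == 0x0A || b == 0x0C || b == 0x0D;
                if (b < 0x20 && !allowedControl)
                {
                    return false; // NUL or another control byte: likely binary
                }
            }
            return true;
        }
        finally
        {
            // Leave the stream where the caller had it.
            input.Position = original;
        }
    }
}
```

A check like this can reject images and other binary files reported as Text, but it will not reject text-based formats such as XML diagrams; those would need an extra rule (for example the file extension, if one is available). It also does not prove that a document is valid, which still requires loading it into a Document object as described above.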