Detect file format: incorrect detections

Hey,
I’ve attached 2 files:

  1. help.com, which is an application (executable) and is incorrectly detected as CSV by Aspose.Cells!
  2. XML.ttml, which is XML and is incorrectly detected as HTML by Aspose.Cells!

Thanks for your consideration.
AsposeCells.zip (1.9 KB)

@australian.dev.nerds

With our latest version, 26.1, help.com is detected as Unknown, which is the expected result because Aspose.Cells is mainly designed for manipulating spreadsheet-related file formats. There are too many file formats for it to detect them all accurately.

For XML.ttml, we found the issue: it is detected as Html. We have opened the following new ticket(s) in our internal issue tracking system and will deliver their fixes according to the terms mentioned in Free Support Policies.

Issue ID(s): CELLSNET-59754

You can obtain Paid Support Services if you need support on a priority basis, along with the direct access to our Paid Support management team.

Hi, thanks. About the .com file: please pass it as a stream. I’ve seen this bug before in Cells, where passing a stream gives a surprising result while passing a file path works:

Private Function Detector(ByVal InputStream As Stream) As Cells.FileFormatType
    ' A missing or empty stream is reported as Unknown.
    If InputStream Is Nothing OrElse InputStream.Length = 0 Then Return Cells.FileFormatType.Unknown
    ' Rewind so detection starts at the first byte (requires a seekable stream).
    InputStream.Position = 0
    Return Cells.FileFormatUtil.DetectFileFormat(InputStream).FileFormatType
End Function

Usage:

Dim CellType As Cells.FileFormatType = Detector(InputStream)
If CellType <> Cells.FileFormatType.Unknown Then
    Select Case CellType
        Case Cells.FileFormatType.Csv
            MsgBox("csv")
    End Select
End If

@australian.dev.nerds
Thank you for the additional details. We can now see the issue: the .com file is detected as CSV when it is passed as a stream.

We have opened the following new ticket(s) in our internal issue tracking system and will deliver their fixes according to the terms mentioned in Free Support Policies.

Issue ID(s): CELLSNET-59755


@australian.dev.nerds,

This is to inform you that both issues (“CELLSNET-59754” and “CELLSNET-59755”) have been resolved and the fixes will be included in Aspose.Cells v26.2, scheduled for release in the first half of February 2026. We will notify you once the new version is available.


Hello and thanks,
Please also consider these PEM certificates, which are all detected as CSV.
CSV has the most false positives!
Thanks.
asposecells.zip (20.2 KB)

@australian.dev.nerds
For unknown files, we process their original text and then check whether there is a ‘,’ in the text content to determine whether it is a CSV file. I am considering adding an option that controls whether to guess if a text file is CSV.


Hello and thanks. Are you sure that is all you do internally? Just a single comma inside the file? Even if the file has non-printable chars?! Even if there’s only one comma? Even if the separator is not a comma? It might be Tab, Pipe, Comma, Semicolon etc…

Could you please confirm whether that is the CSV detection logic, and whether TXT detection is as unreliable as that? :smiley:

By the way, can you please also fix this wrong detection?

This xml file is detected as Html.

XML validation is easier than HTML validation in .NET:

XmlReader.Create(inStream, New XmlReaderSettings With {.DtdProcessing = DtdProcessing.Ignore, .IgnoreWhitespace = True, .CloseInput = x})

Cells.zip (372 Bytes)

@australian.dev.nerds,

Thanks for the XML file.

Sure, we will consider testing your attached file with the (internal) fix, as we have already resolved the similar issue (“CELLSNET-59754”).

Thanks for sharing your concerns.

We shared an outline for the detection of CSV file formats. Generally, we conduct a series of internal evaluations and other validations to ensure the accuracy and reliability of our detection methods. We are refining our processes and will provide you with more detailed information soon.

@australian.dev.nerds
Cells.FileFormatUtil.DetectFileFormat() is mainly used to detect Excel file formats (XLSX, XLS, XLSB, XLSM, ODS). We cannot support detecting all file formats.

Your HTML file is an XML file. To distinguish HTML from XML, I check whether the name of the first node is an HTML tag. Both “tt” in the TTML file and “Article” in cells.xml are HTML tags, so the files are detected as HTML. We will try to handle only mainstream HTML tags when detecting an HTML file. For performance reasons, we cannot check and compare all nodes. If a file is detected as HTML, you can run a further check of your own to decide whether it is HTML or XML.

@australian.dev.nerds

  1. In the next version, if the file contains non-printable chars, the file will not be detected as CSV or TXT.
  2. Cells.FileFormatUtil.DetectFileFormat() is mainly used to detect Excel file formats (xlsx, xls, xlsb, xlsm, ods). We cannot support detecting all file formats.

There is no good solution for detecting CSV. We just compare the first two lines to check whether they contain the same number of “,”. If you need reliable CSV detection, you will have to use some other tool.
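
For readers following along, the heuristic described above (skip content with non-printable characters, then compare the delimiter count of the first two lines) could be sketched roughly like this. It is an illustration only, not Aspose.Cells' actual implementation: the `LooksLikeCsv` helper and its optional `delimiter` parameter are hypothetical names, and the count ignores quoted fields.

```vb
Imports System
Imports System.Linq
Imports Microsoft.VisualBasic

Module CsvGuessSketch
    ' Hypothetical helper, not part of Aspose.Cells.
    ' Guesses whether a text looks like delimiter-separated data.
    Function LooksLikeCsv(text As String, Optional delimiter As Char = ","c) As Boolean
        If String.IsNullOrEmpty(text) Then Return False

        ' Reject content with control characters other than tab, CR and LF.
        For Each ch As Char In text
            If Char.IsControl(ch) AndAlso ch <> ControlChars.Tab AndAlso
               ch <> ControlChars.Cr AndAlso ch <> ControlChars.Lf Then Return False
        Next

        ' Split into non-empty lines; at least two are needed for a comparison.
        Dim lines = text.Split({ControlChars.Cr, ControlChars.Lf},
                               StringSplitOptions.RemoveEmptyEntries)
        If lines.Length < 2 Then Return False

        ' Naive count: delimiters inside quoted fields are counted too.
        Dim first = lines(0).Count(Function(c) c = delimiter)
        Dim second = lines(1).Count(Function(c) c = delimiter)
        Return first > 0 AndAlso first = second
    End Function

    Sub Main()
        Console.WriteLine(LooksLikeCsv("a,b,c" & vbCrLf & "1,2,3"))   ' True
        Console.WriteLine(LooksLikeCsv("a;b" & vbLf & "1;2", ";"c))   ' True
        Console.WriteLine(LooksLikeCsv("MZ" & ChrW(0) & ",x" & vbLf & "y,z")) ' False
    End Sub
End Module
```

Even with the control-character check, a two-line comparison like this will still misclassify ordinary text whose first two lines happen to contain the same number of commas, which is why it can only ever be a guess.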

Hello and thanks for the reply. A few logical paradoxes here:

No one can! But what about the file types in your enum? I.e., if you cannot detect HTML, why is it there? :slight_smile:

This needs a new RFC! But until then, how do you claim an HTML file is an XML file? :smiley:
Indeed, it’s TTML, not HTML, and in the end it is XML when it starts with <?xml …

Once you encounter <?xml, your default should be XML. You then read the first node to check whether it is one of your supported open-spec types; if it is, you return that detected type, otherwise XML should be returned. Please ask some devs to evaluate this algorithm, or use Gemini to judge! :slight_smile:
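
The ordering suggested above could be sketched as follows, using only `System.Xml`. This is a minimal sketch under stated assumptions, not Aspose.Cells code: the `GuessedType` enum and `GuessMarkupType` function are hypothetical, a successful XML parse stands in for the `<?xml` check (the declaration is optional in XML 1.0), and only a root element named `html` is treated as HTML.

```vb
Imports System
Imports System.IO
Imports System.Text
Imports System.Xml

Module XmlFirstSketch
    ' Hypothetical result type for illustration; not an Aspose.Cells enum.
    Enum GuessedType
        Unknown
        Xml
        Html
    End Enum

    ' Hypothetical helper: well-formed XML defaults to Xml unless the root is a known type.
    Function GuessMarkupType(input As Stream) As GuessedType
        Dim settings As New XmlReaderSettings With {
            .DtdProcessing = DtdProcessing.Ignore,
            .IgnoreWhitespace = True,
            .CloseInput = False
        }
        Try
            Using reader As XmlReader = XmlReader.Create(input, settings)
                ' MoveToContent skips the XML declaration, comments and whitespace.
                If reader.MoveToContent() <> XmlNodeType.Element Then Return GuessedType.Unknown
                ' Assumption: only a root element named "html" counts as (X)HTML.
                If String.Equals(reader.LocalName, "html", StringComparison.OrdinalIgnoreCase) Then
                    Return GuessedType.Html
                End If
                ' Any other root (for example "tt" or "Article") stays plain XML.
                Return GuessedType.Xml
            End Using
        Catch ex As XmlException
            ' Not well-formed XML; leave the decision to other checks.
            Return GuessedType.Unknown
        End Try
    End Function

    Sub Main()
        Dim ttml = "<?xml version=""1.0"" encoding=""utf-8""?>" &
                   "<tt xmlns=""http://www.w3.org/ns/ttml""><body/></tt>"
        Using ms As New MemoryStream(Encoding.UTF8.GetBytes(ttml))
            Console.WriteLine(GuessMarkupType(ms).ToString()) ' Xml
        End Using
    End Sub
End Module
```

Only the root element is read, so the check stays cheap even for large files. Real-world HTML that is not well-formed XML falls through to `Unknown` here and would need a separate HTML check.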

@australian.dev.nerds
Thanks for your info.
We will consider your advice.

The issues you have found earlier (filed as CELLSNET-59754, CELLSNET-59755) have been fixed in this update. This message was posted using Bugs notification tool by leoluo

That’s a really solid point about the <?xml declaration. It seems like prioritizing the header check before diving into node-name matching would clear up a lot of these false positives for HTML. Also, the check for non-printable characters is a great addition for the next version - it should definitely help filter out those random system files being flagged as CSV. Looking forward to seeing how the detection evolves in v26.2!

@SynthiaCasper
We will continue to optimize file type detection.