How does Aspose.Cells select the encoding when reading a text (CSV) file?


#1

Could you describe how the C# version of Aspose.Cells detects which encoding to use when reading the content of a CSV file?
For example, the following code:

            var options = new TxtLoadOptions()
            {
                Separator = ','
            };
            var workbook = new Workbook("myfile.csv", options);

correctly reads both Unicode (UTF-8 and UTF-16) text files and Windows-1251 (ANSI) text files on a Russian OS, but can’t read a Shift-JIS file on a Japanese OS.

The debugger shows that Aspose.Cells uses the System.Text.DBCSCodePageEncoding class as the encoding:
DBCSEncoding.png (18.9 KB)
So theoretically everything should work.

Aspose.Cells C# version: 19.6
CSV file in Shift-JIS encoding: Book1.zip (148 Bytes)


#2

I’ve tried version 19.8 and set TxtLoadOptions.IsMultiEncoded = true, but it doesn’t help.


#3

@23W,
We are analysing your requirement and need more sample files for our testing. Please share all the files mentioned here. We will observe the scenario and provide assistance accordingly.


#4

Source files:
Encodings.zip (1.2 KB)
I’ve found that this code:

            var options = new TxtLoadOptions()
            {
                Separator = ','
            };
            var workbook = new Workbook("Book1.csv", options);

works only with Unicode-encoded files, and it doesn’t matter whether the file contains a BOM or not (Book1 UTF-8 without BOM.csv is processed correctly). This code is not able to read a text file in any ANSI code page, even if it’s the current system ANSI code page.

The following code:

            var options = new TxtLoadOptions()
            {
                Encoding = Encoding.Default,
                Separator = ','
            };
            var workbook = new Workbook("Book1.csv", options);

reads text files in the current system ANSI code page perfectly, so it correctly reads Book1 ShiftJIS.csv on a Japanese OS. It also correctly reads Unicode-encoded text files, but ONLY if they contain a BOM (Book1 UTF-8 without BOM.csv is not processed correctly).

It would be nice to have a TxtLoadOptions configuration that can read any Unicode text file (with or without a BOM) as well as text files in the current system ANSI code page.
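
Until something like that exists, the combination described above can be approximated outside Aspose.Cells. The following is only a sketch under stated assumptions: `CsvEncodingSniffer` and `Detect` are hypothetical names (not part of Aspose.Cells), it relies only on `System.Text`, and it does not recognise UTF-16 files that lack a BOM. It checks for a BOM first, then tries a strict UTF-8 decode, and falls back to the supplied ANSI encoding if that fails.

```csharp
using System;
using System.Text;

// Hypothetical helper, not part of Aspose.Cells.
public static class CsvEncodingSniffer
{
    public static Encoding Detect(byte[] bytes, Encoding ansiFallback)
    {
        // 1. Look for a byte order mark.
        if (bytes.Length >= 3 && bytes[0] == 0xEF && bytes[1] == 0xBB && bytes[2] == 0xBF)
            return Encoding.UTF8;               // UTF-8 with BOM
        if (bytes.Length >= 2 && bytes[0] == 0xFF && bytes[1] == 0xFE)
            return Encoding.Unicode;            // UTF-16 little endian
        if (bytes.Length >= 2 && bytes[0] == 0xFE && bytes[1] == 0xFF)
            return Encoding.BigEndianUnicode;   // UTF-16 big endian

        // 2. No BOM: decode strictly as UTF-8. The second constructor
        //    argument makes invalid byte sequences throw instead of
        //    being silently replaced.
        try
        {
            new UTF8Encoding(false, true).GetString(bytes);
            return new UTF8Encoding(false);
        }
        catch (DecoderFallbackException)
        {
            // 3. Not valid UTF-8: assume the caller-supplied ANSI code page.
            return ansiFallback;
        }
    }
}
```

The result could then be passed as `Encoding = CsvEncodingSniffer.Detect(File.ReadAllBytes("Book1.csv"), Encoding.Default)` in the TxtLoadOptions initializer shown above. Note that Encoding.Default is the system ANSI code page on .NET Framework, but is always UTF-8 on .NET Core and later, so a specific code page would have to be supplied there. A short ANSI file whose bytes happen to form valid UTF-8 would be misdetected, so this is a heuristic and not a guarantee.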


#5

@23W,
We understand that you need information about the criteria used to select the encoding. We have logged the issue in our database for investigation. Once we have some news for you, we will update you in this topic.

This issue has been logged as

CELLSNET-46888 – Criteria of encoding selection while reading CSV file


#6

@23W,

We evaluated your issue further. A CSV file is just a plain text file, and it can be created in any way and with any encoding. It is impossible for us to provide a solution that handles all kinds of template files. You may specify the encoding for your files via TxtLoadOptions.Encoding. Otherwise, the encoding used depends entirely on the system, just as when you create a StreamReader from a Stream without specifying the encoding.
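
For reference, the StreamReader comparison can be tried in isolation. The sketch below uses only the .NET base class library; `StreamReaderDefaultDemo` and `ReadWithDefaults` are hypothetical names chosen for illustration. In .NET, a StreamReader created from a Stream with no encoding argument decodes as UTF-8 and switches encoding if it finds a byte order mark, so bytes that are not valid UTF-8 (for example Shift-JIS text) come back as replacement characters.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical demo class, for illustration only.
public static class StreamReaderDefaultDemo
{
    // Reads the bytes the way a StreamReader does when no encoding
    // is specified: UTF-8 by default, with BOM detection enabled.
    public static string ReadWithDefaults(byte[] bytes)
    {
        using (var reader = new StreamReader(new MemoryStream(bytes)))
        {
            return reader.ReadToEnd();
        }
    }
}
```

Calling `ReadWithDefaults` on BOM-less UTF-8 bytes returns the original text, while calling it on Shift-JIS bytes such as `0x82 0xA0` yields U+FFFD replacement characters.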

Thanks for your understanding!


#7

Thank you for the answer.


#8

@23W,

You are welcome.