Encoding issue while reading excel/csv file

santhosh666 · March 3, 2015, 11:51pm

Hello we are using aspose cells for java version 8.1.2

For special characters as in the following string: linéaire the encoding is not correctly done and we

get lin�aire

I am reading worksheet from Document which will be uploaded from UI.

This issue we face only in Linux.

my code looks like below.

void generateWorkbookWithStream(List workbooks, Document document, String delimiter) throws FileNotFoundException,

IOException {

Workbook workbook = new Workbook();

Worksheet sheet0 = workbook.getWorksheets().get(0);

try {

InputStream is = new ByteArrayInputStream(document.getDocumentBinary());

BufferedReader br = new BufferedReader(new InputStreamReader(is));

String line = "";

int noOfRecords = 0;

while ((line = br.readLine()) != null) {

String[] data = line.toString().split(delimiter);

sheet0.getCells().importArray(data, noOfRecords, 0, false);

data = null;

noOfRecords++;

}

} catch (IOException e) {

LOG.warn(e);

}

workbooks.add(sheet0);

}

Note : delimiter will be containing semicolon.

Please advice.

Thanks in advance.

shakeel.faiz · March 4, 2015, 4:07am

Hi,

Thanks for your posting and using Aspose.Cells.

Such issues occur because proper fonts are not installed on your Linux machine. We request you to installed the Microsoft’s core fonts using the ttf-mscorefonts-installer. You can find details on how you can install this package in your Linux environment from this article. Once you have installed these fonts, please set the font directory using the CellsHelper.setFontDir method (at the start of your application & before creating an object of Workbook), and point to the location that contains the TTF files for Arial font.

Please also read this post for more understanding about Fonts and CellsHelper.setFontDir method.

( Some special symbols are not displaying correctly in converted PDF from excel using ASPOSE )

santhosh666 · September 18, 2015, 2:47am

Hi,

We have installed the fonts in our Linux machine and tried as you suggested but still we are facing the same issue(French charectors are saving as boxes). This is only in case of CSV file import.

We are using Aspose 8.5.2 version. This is long pending issue.

Could you please suggest us if there is any work around for this?

-Regards,

Santhosh.

babar.raza · September 18, 2015, 3:31am

Hi Santhosh,

Could you please share your sample CSV or at least a portion of it to replicate and investigate the matter on our side? Please also provide the source code covering your complete scenario. It would be more appropriate if you can share a standalone sample console application for quick investigation.

santhosh666 · September 18, 2015, 3:54am

hi,

I have attached sample file(TEST.csv).

Our code goes like below

CellsHelper.setFontDir("/usr/share/fonts/");

Workbook wb = getWorkbookFromDocuments(testDocument);

Worksheet workSheet = wb.getWorksheets().get("sheetName");

Cells cells = worksheet.getCells();

cells.get(1, 22).getStringValue();

This value we are saving to database.

NOTE : when I print this line I am getting the value "Amortissement avec ???"

special chars are replaced with "?"

-Regards,

Santhosh.

babar.raza · September 18, 2015, 4:08am

Hi Santhosh,

Thank you for sharing the sample. We are currently performing test on our side and we will shortly get back to you with updates in this regard. In the meanwhile, could you please confirm that if you print the aforesaid value to terminal/console, you still see the question mark (?) instead of the original character or the problem only occurs when you insert the value in database?

santhosh666 · September 18, 2015, 4:16am

Hi,

When print the aforesaid value to console I see the question mark (?) instead of the original character.

EX: The text "Amortissement avec éêë" is printed as "Amortissement avec ???"

-Regards,

Santhosh.

babar.raza · September 18, 2015, 9:34am

Hi Santhosh,

Thank you for the information.

The characters éêë correspond to the ISO-8859-1 encoding therefore you can avoid the said problem by setting the encoding before loading the CSV in an instance of Workbook as demonstrated below.

Please note, I have verified this scenario on Linux (CentOS 6.6) against latest version of Aspose.Cells for Java (Latest Version) with expected results, that is; Amortissement avec éêë.

Java

TxtLoadOptions options = new TxtLoadOptions(LoadFormat.CSV);
options.setSeparator(‘;’);
options.setEncoding(Encoding.getEncoding(“ISO-8859-1”));

Workbook workbook = new Workbook(input, options);
Worksheet worksheet = workbook.getWorksheets().get(0);
System.out.println(worksheet.getCells().get(“W1”).getStringValue());

santhosh666 · September 21, 2015, 12:45am

Hi,

Thanks a lot for the advice and this works well for me.

I need few more clarifications

Is there any way to set this TxtLoadOptions after creating Workbook ?

Because in our code we are using List of worksheets and our code goes like below.

generateWorksheetWithStream(List worksheets, Document document, String delimiter) throws FileNotFoundException,

IOException {

Workbook workbook = new Workbook();

Worksheet worksheet = workbook.getWorksheets().get(0);

Cells cells = null;

try {

InputStream is = new ByteArrayInputStream(document.getDocumentBinary());

BufferedReader br = new BufferedReader(new InputStreamReader(is));

String line = "";

int noOfRecords = 0;

while ((line = br.readLine()) != null) {

String[] data = line.toString().split(delimiter);

cells = worksheet.getCells();

cells.setMemorySetting(MemorySetting.MEMORY_PREFERENCE);

cells.importArray(data, noOfRecords, 0, false);

data = null;

noOfRecords++;

}

} catch (IOException e) {

LOG.warn(e);

}

worksheets.add(worksheet);

}

I have replaced this code with the below code..

generateWorksheetWithStream(List worksheets, Document document) {

TxtLoadOptions options = new TxtLoadOptions(LoadFormat.CSV);

options.setSeparator(';');

options.setEncoding(Encoding.getEncoding("ISO-8859-1"));

Worksheet worksheet = null;

try {

InputStream is = new ByteArrayInputStream(document.getDocumentBinary());

Workbook workbook = new Workbook(is, options);

worksheet = workbook.getWorksheets().get(0);

} catch (Exception e) {

LOG.warn(e);

}

worksheets.add(worksheet);

}

Will it work for CSV file with multiple subsheets?

Please advice.

-Regards,

Santhosh.

babar.raza · September 21, 2015, 2:05am

Hi Santhosh,

Thank you for the confirmation on previously reported problem. Regarding the recent concerns, I am afraid the encoding has to be set while loading the spreadsheet (while using TxtLoadOptions) or saving it (while using TxtSaveOptions) so you have to tackle your application requirements accordingly.

By the way, are you talking about a CSV having multiple sub-sheets? If so, could you please share a sample with us?

santhosh666 · September 21, 2015, 2:25am

Hi Babar Raza,

Currently we dont have multiple subsheets in CSV, but we do have in Excel.

We had issue only with CSV file, so we will take care of setting the encoding options while loading the spreadsheet.

Thanks again for the clarifications.

-Regards,

Santhosh.

babar.raza · September 21, 2015, 2:44am

Hi Santhosh,

Thank you for the clarification,

The scenario presented in this thread was related to the encoding of the CSV, you have to handle it while loading it using the approach suggested before. In case of spreadsheets, you may get different characters due to the locale/region/language, you may set these options after loading the spreadsheet while using the WorkbookSettings class.

Please feel free to get in touch if you need our further assistance with Aspose APIs.

santhosh666 · September 23, 2015, 2:17am

Hi Babar Raza,

This encoding issue works fine for the smaller CSV files, we have handled it while loading the file as you mentioned.

We have another scenario where the file size is 150MB in that case we are splitting the file into multiple chunks(20000 lines) and adding each chunk to a worksheet. we will be having List of worksheets then we are processing the worksheets.

If we load full file to a workbook we will get out of memory exception.

We are getting the worksheet like below.

workbook = new Workbook();

worksheet = workbook.getWorksheets().get(0);

I also tried

InputStream x = new ByteArrayInputStream(new String("").getBytes());
TxtLoadOptions options = new TxtLoadOptions(LoadFormat.CSV);

options.setSeparator(';');

options.setEncoding(Encoding.getEncoding("ISO-8859-1"));
workbook = new Workbook(x , options);
workbook.getWorksheets().clear();

workbook.getWorksheets().add("sheet1");

worksheet = workbook.getWorksheets().get(0);

But still facing same issue of saving squares instead of special charectors with large files(4 lakh rows).

Could you please advice if there is anyway to resolve this?

-Regards,

Santhosh.

babar.raza · September 23, 2015, 5:06am

Hi Santhosh,

Thank you for writing back.

Please note, the presented problem is not related to the size of the CSV file but the characters in the CSV, I suspect that your CSV has characters that correspond to two different encoding standards therefore while setting the encoding fixes the problem for one charterer but not for the others. If this is the case then I am afraid, there is no solution because you can set only one encoding for the TxtLoadOptions.

I would request you to please share the subset of the CSV containing the problematic characters, we will review it to find an alternative approach.

babar.raza · September 23, 2015, 5:10am

santhosh666:

If we load full file to a workbook we will get out of memory exception.

You may try setting the Memory Preferences offered by Aspose.Cells APIs to reduce the memory cost for the loading operation to some extent. Moreover, the available memory for Java application also depends on the Java heap size so you can increase it gradually to see what value fits your requirements.

Java

TxtLoadOptions options = new TxtLoadOptions();
options.setMemorySetting(MemorySetting.MEMORY_PREFERENCE);

santhosh666 · September 25, 2015, 1:37am

Hi Babar Raja,

Thank you for the reply,

The problem now I am facing is,

In case of big CSV file, I am reading the file from the physical location(We are not uploading).

Below is our code to create workbook

TxtLoadOptions options = new TxtLoadOptions(LoadFormat.CSV);

options.setSeparator(',');

options.setEncoding(Encoding.getEncoding("ISO-8859-1"));

options.setMemorySetting(MemorySetting.MEMORY_PREFERENCE);

workbook = new Workbook(fileLocation, options);

worksheet = workbook.getWorksheets().get(0);

I this case we are still facing the special charectors issue only in Linux server.

Note : delimitor we are using is ',' in this case.

Could you please check and write back to us?

Regards,

Santhosh.

babar.raza · September 25, 2015, 6:39am

Hi Santhosh,

Please note, setting the Memory Preferences for the Workbook does not have any thing to do with the encoding issue. I have recommended you to use these settings to avoid the OutOfMemoryError while loading huge files to the Workbook object.

In reference to the encoding issue, please check my response here. As discussed earlier, if your CSV file has characters that correspond to 2 or more encoding standards then I am afraid, this cannot be resolved because the TxtLoadOptions can set only one encoding.

santhosh666 · September 25, 2015, 6:51am

Hi Babar Raja,

Thank you for the reply,

We dont have two or more encoding standards and please understand my question.

We are reading a file from a location and I am setting filePath and passing the TxtLoadOptions.

I have posted the code in previous comment.

In this case same charectors are not encoding.

-Regards,

Santhosh.

babar.raza · September 25, 2015, 9:01am

Hi Santhosh,

Please share the sample CSV for further review. You do not have to share the complete file, only the subset that holds the row with problematic characters would suffice.

santhosh666 · September 28, 2015, 12:23am

hi Babar Raza,

Thank you for the reply,

Please find the attached sample file TEST.zip, it contains two csv files(TEST1,TEST2).
In the code we are extracting the zip file and reading the files from location.

The problematic charectors are in TEST1.csv.

they are : "indexé révisable" and "Amortissement progressif avéêëc"

Sample code we tried

-----

if (document.isPhysicalLoc()) {
if (CollectionUtils.isNotEmpty(filePaths)) {
for (String fileLocation : filePaths) {

TxtLoadOptions options = new TxtLoadOptions(LoadFormat.CSV);

options.setSeparator(',');

options.setEncoding(Encoding.getEncoding("ISO-8859-1"));

options.setMemorySetting(MemorySetting.MEMORY_PREFERENCE);

workbook = new Workbook(fileLocation, options);

worksheet = workbook.getWorksheets().get(0);
}
}

}

-Regards,

Santhosh.