Converting HTML into CSV in Java

@nvn16,
We will fix it in one week.

Is this an issue on the Aspose side?

There is nothing wrong with the HTML, right? Is Aspose simply unable to convert this HTML to CSV?

Could you please provide a detailed update?

@nvn16,

We found that the source file is not a standard HTML file. The main cause of this issue is that many start tags are missing and only the end tags exist in the source file; see the attached screenshot “tag_screenshot.png”.
tag_screenshot.png (48.9 KB)

Anyway, we will evaluate the issue, and hopefully it will be fixed soon.
We will keep you posted on any updates in this regard.

@nvn16

Your issue is fixed in v21.8.5.
aspose-cells-21.8.5-java.zip (7.4 MB)

Hi Team,

Thanks for the update. I am able to convert the provided HTML into CSV. However, the Chinese characters are not rendered properly in the CSV file. Do we need to set an encoding in the code? If yes, could you please share sample code?

Thanks,
Akshay

@nvn16

There is no need to set the encoding while reading this HTML file.
Code:

// Load the HTML; no explicit input encoding is set here.
HtmlLoadOptions options = new HtmlLoadOptions();
options.setAutoFitColsAndRows(true);
options.setCheckDataValid(false);
// Optional: force the input encoding.
//options.setEncoding(Encoding.getEncoding("gb2312"));
//options.setDeleteRedundantSpaces(true);
Workbook workbook = new Workbook("sample.html", options);

// Save as CSV; uncomment setEncoding to force UTF-8 output.
TxtSaveOptions opts = new TxtSaveOptions(SaveFormat.CSV);
//opts.setEncoding(Encoding.getUTF8());
workbook.save("output.csv", opts);

As for the encoding of the output CSV file: by default, it is the same as the encoding specified in the source HTML (for your HTML file, that is gb2312). You can also specify the output CSV encoding with TxtSaveOptions.setEncoding, e.g. opts.setEncoding(Encoding.getUTF8());

I have also attached the output CSV files generated on my end.
output.zip (104.7 KB)

The issues you have found earlier (filed as CELLSJAVA-43724) have been fixed in this update. This message was posted using Bugs notification tool by johnson.shi

https://repository.aspose.com/repo/com/aspose/aspose-cells/

Which JAR from the above link will include this fix? We download the JARs using Maven.

@nvn16,

You may download/get the latest JARs (Aspose.Cells for Java 21.9) here.

Let us know if you still find any issue.

Hi Team ,

final License license = new License();
license.setLicense(licensePath);
HtmlLoadOptions options = new HtmlLoadOptions();
options.setAutoFitColsAndRows(true);
options.setCheckDataValid(false);

//options.setDeleteRedundantSpaces(true);
Workbook workbook = new Workbook(inputFileName, options);
TxtSaveOptions opts = new TxtSaveOptions(SaveFormat.CSV);
opts.setEncoding(Encoding.getUTF8());

Files.deleteIfExists(getPath(outputFilePath));
workbook.save(outputFilePath, opts);

This is the code I am using for the conversion from HTML to CSV.
However, when I use a file with Chinese characters as input, the output is not correct. Can you advise on the encoding for the output? Which encoding should I use to handle files both with and without Chinese characters?

The current output looks like this:
����,��Ʒ���,�ֵ�,�ֵ�����,��Ʒ����,����,���,��λ,��������,���۽��,����˰�ɱ�,����,�����,ԭӡ������
2021/7/13,013639 ,001 ,��Ʒ���ֿ� ,������������ԭζ��Ƭ, ,104g ,Ͱ ,48,327.84,888,6924743915763

@nvn16

UTF-8 encoding should work fine for files with or without Chinese characters.
Please share the output CSV file generated on your side.

Sure. Please find the attached output file: testfilesabcd.zip (52.1 KB)

@nvn16
It seems that you converted a different HTML file (not the one you shared earlier) to CSV, and its text was not read correctly. You can check this with the following code after the workbook is initialized.

System.out.println(workbook.getWorksheets().get(0).getCells().get("A1").getValue());
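
As a complementary check that does not depend on Aspose.Cells, the sketch below tests whether a file's bytes are well-formed UTF-8 using only the JDK. The class name Utf8Check and the method isValidUtf8 are illustrative names, not part of any Aspose API. If the check fails for an HTML file that carries no charset declaration, the file is probably in a legacy encoding and the load encoding likely needs to be set explicitly.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper class; the names are illustrative only.
class Utf8Check {

    // Returns true only if the bytes decode as UTF-8 with no malformed sequence.
    static boolean isValidUtf8(byte[] data) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        if (args.length == 0) {
            System.out.println("usage: java Utf8Check <file>");
            return;
        }
        byte[] data = Files.readAllBytes(Paths.get(args[0]));
        System.out.println(isValidUtf8(data)
                ? "File is well-formed UTF-8."
                : "File is NOT well-formed UTF-8; it may use a legacy encoding.");
    }
}
```

Note that a pure ASCII file also passes this check, so a positive result only rules out malformed UTF-8; it does not prove the author intended UTF-8.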

Please share the source HTML file.

sample.7z (102.7 KB)
This is the source HTML that I am using.

The code is the one I shared in my comment above. Please advise.

@nvn16

Your newly shared source HTML file is encoded in gb2312; however, it contains no metadata indicating the encoding. In this case, UTF-8 is used by default to read the source HTML file, so the Chinese text is decoded incorrectly.
You can set the encoding explicitly when reading the source HTML file.

Code:

HtmlLoadOptions options = new HtmlLoadOptions();
options.setAutoFitColsAndRows(true);
options.setCheckDataValid(false);
// Set the encoding used to read the source HTML.
options.setEncoding(Encoding.getEncoding("gb2312"));
//options.setDeleteRedundantSpaces(true);

Workbook workbook = new Workbook(inputFileName, options);
TxtSaveOptions opts = new TxtSaveOptions(SaveFormat.CSV);
opts.setEncoding(Encoding.getUTF8());

Files.deleteIfExists(getPath(outputFilePath));
workbook.save(outputFilePath,opts);
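
To illustrate why the load encoding matters, here is a minimal JDK-only sketch (no Aspose.Cells calls) that decodes the same gb2312 bytes twice: once as UTF-8, which yields the replacement character U+FFFD seen in the garbled output, and once as gb2312, which recovers the original text. The class name EncodingDemo and the helper decode are hypothetical, and the sketch assumes the JVM ships the GB2312 charset, as standard JDK builds do.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; the names are illustrative only.
class EncodingDemo {

    // Decodes raw bytes with the given charset; malformed input becomes U+FFFD.
    static String decode(byte[] data, Charset charset) {
        return new String(data, charset);
    }

    public static void main(String[] args) {
        // Assumption: the GB2312 charset is available in this JVM.
        Charset gb2312 = Charset.forName("GB2312");

        // Two Chinese characters, written as escapes to avoid source-encoding issues.
        String original = "\u65e5\u671f";
        byte[] raw = original.getBytes(gb2312);

        // Wrong charset: the bytes are not valid UTF-8, so U+FFFD appears.
        String wrong = decode(raw, StandardCharsets.UTF_8);
        // Right charset: the original text is recovered.
        String right = decode(raw, gb2312);

        System.out.println("Decoded as UTF-8 contains U+FFFD: " + wrong.contains("\uFFFD"));
        System.out.println("Decoded as GB2312 equals original: " + right.equals(original));
    }
}
```

Once the text has been decoded correctly on load, writing it out as UTF-8 is lossless, which is why the combination of a gb2312 load encoding and a UTF-8 save encoding works for this file.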