Single Quote is Converted to â€ | HTML to TXT Conversion using Java

Hello,

We are using aspose-words-19.5-jdk17.jar with the following snippet, which saves an HTML file as a txt file. The issue is that the single quotes ’ in the HTML get converted into â€.
For instance, **Write ‘as advised’ unless** gets saved in the txt file as Write â€˜as advisedâ€™ unless.

public static void main(String[] args) throws Exception {
    String sourcePath = "E:\\BOLA Refunds (1).html";
    Document doc = new Document(sourcePath);
    String textFilePath = "E:\\BOLA Refunds (1).html…txt";
    TxtSaveOptions txtSaveOptions = new TxtSaveOptions();
    txtSaveOptions.setEncoding(Charset.forName("UTF-8"));
    doc.save(textFilePath, txtSaveOptions);
}

BOLA Refunds (1).html…zip (10.2 KB)

For reference, the input HTML and output txt files are attached. Can you please analyze this issue?

Looking forward to your response.

Thanks,
Jaspreet

@Jaspreet16

We have tested the scenario and reproduced the issue on our side. We have logged it in our issue tracking system as WORDSNET-18929. You will be notified via this forum thread once it is resolved.

We apologize for the inconvenience.

@Jaspreet16

The issue you are facing is not actually a bug in Aspose.Words, so we have closed WORDSNET-18929 as ‘Not a Bug’.

The source document is not a valid HTML document, so it is loaded as plain text. However, it is loaded using the wrong encoding, because FileFormatDetector detects the source file as “Windows-1252” when it is actually “UTF-8 without BOM”.
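To illustrate why the wrong load encoding produces the â€ sequences, here is a minimal sketch that uses only the standard Java library, not Aspose.Words. The class and method names (`MojibakeDemo`, `misdecode`) are made up for this illustration. It encodes a string as UTF-8 and then decodes the same bytes as Windows-1252, which mimics what happens when a BOM-less UTF-8 file is read with the wrong charset. The right single quote ’ is stored in UTF-8 as the three bytes E2 80 99, and each of those bytes becomes a separate Windows-1252 character.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class MojibakeDemo {
    // Encodes the text as UTF-8, then decodes those bytes as Windows-1252.
    // This reproduces the effect of reading UTF-8 data with the wrong charset.
    static String misdecode(String text) {
        byte[] utf8Bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8Bytes, Charset.forName("windows-1252"));
    }

    public static void main(String[] args) {
        // \u2018 and \u2019 are the left and right single quotation marks.
        String original = "Write \u2018as advised\u2019 unless";
        // What is displayed depends on the console encoding.
        System.out.println(misdecode(original));
    }
}
```

Every character outside ASCII is affected the same way, which is why forcing the load encoding to UTF-8 addresses the problem at its source rather than patching the output.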

Please use the following code example to get the desired output.

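// Load the file as UTF-8 explicitly instead of relying on automatic encoding detection.
// MyDir is assumed to hold the path of the folder containing the input file.
LoadOptions loadOptions = new LoadOptions();
loadOptions.setEncoding(StandardCharsets.UTF_8);
Document doc = new Document(MyDir + "BOLA Refunds (1).html", loadOptions);

// Save the text as UTF-8.
TxtSaveOptions txtSaveOptions = new TxtSaveOptions();
txtSaveOptions.setEncoding(StandardCharsets.UTF_8);
doc.save(MyDir + "21.1.txt", txtSaveOptions);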
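Forcing UTF-8 on load only helps when the input really is UTF-8. If the files come from mixed sources, one possible precaution is to check that the bytes decode strictly as UTF-8 before choosing that load encoding. The sketch below uses only the standard Java library; the class and method names (`Utf8Check`, `isValidUtf8`) are hypothetical and not part of Aspose.Words.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class Utf8Check {
    // Returns true when the bytes decode as UTF-8 without any malformed sequence.
    static boolean isValidUtf8(byte[] bytes) {
        // REPORT makes the decoder throw instead of silently substituting U+FFFD.
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Sample data: a curly-quoted phrase encoded as UTF-8.
        byte[] sample = "\u2018as advised\u2019".getBytes(StandardCharsets.UTF_8);
        System.out.println(isValidUtf8(sample));
    }
}
```

In practice the byte array would come from the input file, for example via `Files.readAllBytes`. Note that pure ASCII content also passes this check, which is harmless because ASCII reads the same in UTF-8 and in Windows-1252.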

The issue you reported earlier (filed as WORDSNET-18929) has been fixed in the Aspose.Words for .NET 21.5 and Aspose.Words for Java 21.5 updates.