Saving as text format includes a UTF-8 BOM

ben.summers · December 8, 2014, 6:46am

Hello,

If you save with com.aspose.words.SaveFormat.TEXT, then the output file is in UTF-8, but includes a Unicode BOM.

It shouldn't, as UTF-8 only has one byte order.

I've had to write code to remove this, but it would be great if this could be fixed, or made optional, in a future version.

Thanks,

Ben
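
The workaround code mentioned above is not included in the thread. A minimal sketch of what such a post-processing step might look like, assuming the saved output is available as a byte array. The class `Utf8Bom` and its methods are hypothetical names used for illustration and are not part of Aspose.Words:

```java
import java.util.Arrays;

// Hypothetical helper, not part of Aspose.Words.
final class Utf8Bom {
    // The UTF-8 encoding of U+FEFF (the byte order mark).
    private static final byte[] BOM = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF};

    private Utf8Bom() {}

    /** Returns true if the data begins with the three BOM bytes. */
    static boolean hasBom(byte[] data) {
        return data.length >= BOM.length
                && data[0] == BOM[0]
                && data[1] == BOM[1]
                && data[2] == BOM[2];
    }

    /** Returns a copy without the leading BOM, or the input itself if no BOM is present. */
    static byte[] strip(byte[] data) {
        return hasBom(data) ? Arrays.copyOfRange(data, BOM.length, data.length) : data;
    }
}
```

One way to use this would be to save the document into a `java.io.ByteArrayOutputStream` rather than directly to a file, pass the resulting bytes through `Utf8Bom.strip`, and write the result out with `java.nio.file.Files.write`.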

awais.hafeez · December 9, 2014, 12:01am

Hi Ben,

Thanks for your inquiry. Could you please attach 1) your input Word document, 2) the output text file and 3) the source code you're using to generate this text file here for testing? We will investigate the issue on our end and provide you with more information.

Best regards,

ben.summers · December 9, 2014, 9:16am

Please find document attached. My code is:

Document doc = new Document(this.inputPathname);
TxtSaveOptions options = new TxtSaveOptions();
options.setSaveFormat(com.aspose.words.SaveFormat.TEXT);
options.setEncoding(java.nio.charset.Charset.forName("UTF-8"));
options.setExportHeadersFooters(false);
options.setParagraphBreak("\n\n");
options.setPreserveTableLayout(false);
options.setPrettyFormat(true);
doc.save(output, options);

Thanks for looking into this.

Ben

awais.hafeez · December 10, 2014, 1:21am

Hi Ben,

Thanks for your inquiry. After an initial test with Aspose.Words for Java 14.11.0, I was unable to reproduce this issue on my side (please see the attached out-awjava-14.11.0.txt). I would suggest upgrading to the latest version of Aspose.Words. You can download it from the following link. I hope this helps.

http://www.aspose.com/community/files/72/java-components/aspose.words-for-java/default.aspx

Best regards,

ben.summers · December 10, 2014, 3:02am

Your exported file demonstrates the problem!

Here's a hex dump of the first 16 bytes of out-awjava-14.11.0.txt

0000: EF BB BF 58 30 59 20 58 31 59 20 58 32 59 0A 0A ...X0Y X1Y X2Y..

The file starts with 0xEF, 0xBB, 0xBF, which is the UTF-8 encoding of the Unicode BOM.

http://en.wikipedia.org/wiki/Byte_order_mark#UTF-8

UTF-8 files shouldn't include BOMs: UTF-8 has only one byte order, so the mark serves no purpose and just confuses consuming software.

Ben
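
For anyone wanting to check their own output files the same way, here is a rough sketch of how a dump line in the format shown above could be produced. The class `HexDump` and its method `firstLine` are hypothetical names for illustration, not taken from the thread or from any library:

```java
// Hypothetical helper for inspecting the first bytes of an exported file.
final class HexDump {
    private HexDump() {}

    /** Formats up to the first 16 bytes as "0000: XX XX ... ascii". */
    static String firstLine(byte[] data) {
        int n = Math.min(16, data.length);
        StringBuilder hex = new StringBuilder("0000:");
        StringBuilder text = new StringBuilder();
        for (int i = 0; i < n; i++) {
            int b = data[i] & 0xFF; // treat the byte as unsigned
            hex.append(String.format(" %02X", b));
            // Printable ASCII is shown as-is; everything else becomes '.'
            text.append(b >= 0x20 && b < 0x7F ? (char) b : '.');
        }
        return hex + " " + text;
    }
}
```

The bytes could be read with `java.nio.file.Files.readAllBytes` and passed to `HexDump.firstLine`. A line beginning `0000: EF BB BF` indicates that the file starts with a UTF-8 BOM.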

awais.hafeez · December 10, 2014, 11:41pm

Hi Ben,

Thanks for the additional information. I have logged this problem in our issue tracking system as WORDSNET-11155. We will look further into the details of this problem and keep you updated on the status of the fix. We apologize for the inconvenience.

Best regards,

AndyNorris · June 4, 2015, 11:43am

I think this answers my problem in the thread "Problems with Word Docx ContentType".

awais.hafeez · June 5, 2015, 6:20am

Hi Andy,

Thanks for your inquiry. It is great that you were able to find what you were looking for. Please let us know if you have any further queries.

Best regards,