Default UTF8 Encoding Writes Byte Order Mark (EF BB BF) BOM at Start of Text File - Java - Convert to TEXT file using Java

Hi guys,

A customer of ours noticed that when he uses our product to generate an XML file from a TXT template, the resulting XML has some unexpected characters at the beginning.

We were able to reproduce this by simply loading a TXT file and saving it as XML.

final Document document = new Document("file.txt");
document.save("file.xml", SaveFormat.TEXT);

These added characters are noticeable when we open the XML file in https://hexed.it/
We are currently using Aspose Words 18.10 for Java but I tested with 20.1 and the problem persists.

file.txt opened in hexed.it - image.png (25.4 KB)
file.xml opened in hexed.it - image.png (20.5 KB)
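
For anyone who wants to confirm the same thing without a hex editor, here is a minimal sketch in plain Java that checks whether a byte array starts with the UTF-8 byte order mark (EF BB BF) and prints the first few bytes as hex. The class and method names (BomCheck, hasUtf8Bom, hexPrefix) are made up for illustration and are not part of Aspose.Words.

```java
class BomCheck {
    // UTF-8 byte order mark: EF BB BF
    static final byte[] UTF8_BOM = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF};

    // Returns true if the data begins with the three BOM bytes.
    static boolean hasUtf8Bom(byte[] data) {
        return data.length >= 3
                && data[0] == UTF8_BOM[0]
                && data[1] == UTF8_BOM[1]
                && data[2] == UTF8_BOM[2];
    }

    // Formats up to 'count' leading bytes as space-separated uppercase hex.
    static String hexPrefix(byte[] data, int count) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < Math.min(count, data.length); i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", data[i] & 0xFF));
        }
        return sb.toString();
    }
}
```

The bytes could come from java.nio.file.Files.readAllBytes on the generated file, or from ByteArrayOutputStream.toByteArray() when saving to a stream.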

Do you have any clue why this happens?
Can we do something to fix it or is it a bug you have to fix?

Best regards,
Hugo Freixo

@Hugo_Freixo,

Since you are saving to XML format, can you please try removing the SaveFormat.TEXT parameter from the ‘save’ method and see how it goes?

In case the problem still remains, please ZIP and upload your input Text file and Aspose.Words generated XML file showing the undesired behavior here for testing. We will then investigate the issue on our end and provide you more information.

Hi @awais.hafeez,

When I remove SaveFormat.TEXT, the characters disappear.
The problem is that in our app we save the document to a ByteArrayOutputStream and there is no save function that only receives a ByteArrayOutputStream. I need to pass a SaveFormat or a SaveOptions in the second parameter.

How can I fix the problem in this situation?

Best regards,
Hugo Freixo

@Hugo_Freixo,

As you are saving to XML, please use SaveFormat.WORD_ML instead of SaveFormat.TEXT. Hope this helps.

Hi @awais.hafeez,

I just noticed something. In my first example, when I remove SaveFormat.TEXT, the final result is altered.
The same problem occurs when I save my document to an OutputStream with SaveFormat.WORD_ML.

These are the files used. I simply opened the TXT file and saved it as XML.
Files.zip (2.0 KB)

Best regards,
Hugo Freixo

@Hugo_Freixo,

We have logged your problem in our issue tracking system. Your ticket number is WORDSNET-19914 . We will further look into the details of this problem and will keep you updated on the status of the linked issue. We apologize for your inconvenience.

Hi @awais.hafeez,

What is WORDSNET-19914 referring to?
Will it fix the random characters that appear when we save using SaveFormat.TEXT, or is it related to the changes in the XML file when saving with SaveFormat.WORD_ML?

Best regards,
Hugo Freixo

@Hugo_Freixo,

WordML (.xml) is a different file format, and such files will always differ from plain .txt files. However, we will fix the issue that you are seeing while using SaveFormat.TEXT. We will keep you posted here on any further updates and let you know when this issue is resolved.

@Hugo_Freixo,

Regarding WORDSNET-19914, the issue appears because you select SaveFormat.TEXT as the save format. It uses UTF-8 as the default encoding, so the byte order mark (EF BB BF) is written at the beginning of the text file. However, you can work around this problem by using the following code:

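import com.aspose.words.Document;
import com.aspose.words.SaveFormat;
import java.io.ByteArrayOutputStream;
import java.io.RandomAccessFile;
Document doc = new Document(dataDir + "file.txt");
ByteArrayOutputStream baos = new ByteArrayOutputStream();
doc.save(baos, SaveFormat.TEXT);
byte[] buf = baos.toByteArray();
// Skip the 3-byte UTF-8 BOM (EF BB BF), write the rest, and close the file when done.
try (RandomAccessFile file = new RandomAccessFile(dataDir + "file_out.xml", "rw")) {
    file.write(buf, 3, buf.length - 3);
}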

Hope this helps.
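
A slightly more defensive variant of this workaround, as a rough sketch: the code above always skips the first three bytes, which would cut off real content if the output ever did not start with a BOM (for example with a different encoding or a different library version). The helper below strips the BOM only when it is actually present. BomStripper and stripUtf8Bom are hypothetical names used for illustration, not an Aspose.Words API.

```java
import java.util.Arrays;

class BomStripper {
    // Returns the data without a leading UTF-8 BOM (EF BB BF).
    // If no BOM is present, the input array is returned unchanged.
    static byte[] stripUtf8Bom(byte[] data) {
        boolean hasBom = data.length >= 3
                && (data[0] & 0xFF) == 0xEF
                && (data[1] & 0xFF) == 0xBB
                && (data[2] & 0xFF) == 0xBF;
        return hasBom ? Arrays.copyOfRange(data, 3, data.length) : data;
    }
}
```

In the scenario from this thread, it would be applied to the result of baos.toByteArray() before the bytes are written out or passed on.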

The issues you have found earlier (filed as WORDSJAVA-2308) have been fixed in this Aspose.Words for .NET 20.7 update and this Aspose.Words for Java 20.7 update.