Free Support Forum - aspose.com

Converting Word to text while avoiding symbols

Hi,


I’m trying to convert a Word document to text, and I’m getting strange symbols that I would not get if I were converting with Word automation.

Here is the relevant part of my code:

TxtSaveOptions txtSaveOptions = new TxtSaveOptions();
txtSaveOptions.Encoding = Encoding.Default;
doc.Save(firstStream, txtSaveOptions);

I’ve attached example files showing the symbols I get with Aspose versus Word. You can see the symbols in a hex editor or any compare tool; I use Beyond Compare.

Here are some more strange symbols Aspose would give me (in red):
Surname: WHITE.br\First Name: SNOW.br.br\D.O.B: 13/12/1970.br\Sex: Female.br\ .br\.br\.br\.br\.br\Medical Record – Radiology Department .br\Radiology Report.br\.br\.br\ .br\Referring Consultant: Darth Vader.br\Referring Service: Community.br\Referrer’s Address: .br\

I have to avoid getting these symbols somehow, and since I don’t know how many or which of them I might get, I don’t want to solve this issue by replacing them.

I would very much appreciate your help.

Thanks, Dim.

Hello

Thanks for your inquiry. Could you please attach your input document here for testing? I will check the problem on my side and provide you with more information.

Best regards,

Hi,


Thanks for your quick response.

I’ve sent the file by mail.

Thanks again,
Dim.


Hello

Thank you for the additional information. Please try the following code to keep the LineBreak symbol out of the output TXT:

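// Load the source document.
Document doc = new Document("C:\\Temp\\totxt1.doc");

// Replace manual line breaks (ControlChar.LineBreak) with a newline before saving.
doc.Range.Replace(ControlChar.LineBreak, "\n", false, false);

// The .txt extension selects the plain-text save format.
doc.Save("C:\\Temp\\out.txt");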

Best regards,

Hi,


Thanks! I’ve checked, and it solved the problem with the findings line.
I actually had to use: doc.Range.Replace(ControlChar.LineBreak, "", false, false);
and I got output identical to what Word automation gives me.

I guess the other problem I have, with the BOM and other special characters, is a different issue.
Do you have any idea how to avoid these symbols? I still get them after the fix.

Thanks,
Dim.
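
On the BOM specifically, one thing that may be worth trying is saving with an encoding that reports an empty preamble. The sketch below is untested against the attached document. It assumes Aspose.Words writes whatever preamble the chosen Encoding reports, and `TxtExport.SaveWithoutBom` is a hypothetical helper name, not part of the library.

```csharp
using System.IO;
using System.Text;
using Aspose.Words;
using Aspose.Words.Saving;

public static class TxtExport
{
    // Hypothetical helper: save as UTF-8 text using an encoding that reports no BOM.
    public static void SaveWithoutBom(Document doc, Stream output)
    {
        TxtSaveOptions options = new TxtSaveOptions();

        // UTF8Encoding(false) returns an empty preamble, unlike Encoding.UTF8,
        // whose preamble is the three bytes EF BB BF.
        options.Encoding = new UTF8Encoding(false);

        doc.Save(output, options);
    }
}
```

Note that this switches the output from Encoding.Default to UTF-8, so non-ASCII characters such as "–" and "’" will be written as different bytes than before. If the Word automation output is in the ANSI code page, a byte-for-byte comparison will still show differences on those characters.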


Hi

Thanks for your request. In your case, it may be simplest to write your own TXT converter, as described here:

http://www.aspose.com/documentation/.net-components/aspose.words-for-.net/howto-extract-content-using-documentvisitor.html

This approach will allow you to control how the document is converted to TXT.
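
As a rough illustration of that idea, here is a minimal sketch of a visitor that keeps only the text of runs and drops control characters. `PlainTextVisitor` and `GetText` are hypothetical names chosen for this example, and the rule for which characters to drop is an assumption you would adjust to match the Word automation output. It does not attempt to handle fields, tables or headers specially.

```csharp
using System.Text;
using Aspose.Words;

// Hypothetical visitor: collects run text and paragraph ends only.
public class PlainTextVisitor : DocumentVisitor
{
    private readonly StringBuilder _text = new StringBuilder();

    // Returns the text collected so far.
    public string GetText()
    {
        return _text.ToString();
    }

    // Called for each run of text in the document.
    public override VisitorAction VisitRun(Run run)
    {
        foreach (char c in run.Text)
        {
            // Assumption: drop control characters (for example the manual
            // line break, U+000B) but keep tabs.
            if (!char.IsControl(c) || c == '\t')
                _text.Append(c);
        }
        return VisitorAction.Continue;
    }

    // Called when a paragraph ends; emit a line terminator.
    public override VisitorAction VisitParagraphEnd(Paragraph paragraph)
    {
        _text.Append("\r\n");
        return VisitorAction.Continue;
    }
}
```

To use it, create the visitor, call doc.Accept(visitor), and write visitor.GetText() to a file yourself (for example with File.WriteAllText and the encoding you need), which also leaves the choice of BOM and encoding in your hands.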

Best regards,