I was converting an MS Word document to plain text and ran into an issue. MS Word uses some special characters for formatting purposes, such as:
0x1F - Unit Separator / Optional Hyphen
0x0C - Form Feed / Page Break
0x0E - Shift Out / Column Break
0x1E - Record Separator / Non-Breaking Hyphen
0xA0 - Non-breaking Space
and maybe some more.
As far as I know, these characters have no meaning in plain text. However, they cause trouble when the text is placed inside an XML file, because not all of them are legal in XML.
I understand that the format of the file was converted. However, I think these characters should also be handled, or at least there should be a parameter that lets the method caller request that they be handled.
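To illustrate the XML point above, here is a minimal sketch that checks characters against the XML 1.0 character range using XmlConvert.IsXmlChar from System.Xml. The class and method names (XmlCharCheck, FindIllegalChars) are made up for this illustration and are not part of any library.

```csharp
using System.Collections.Generic;
using System.Xml;

public static class XmlCharCheck
{
    // Returns the distinct characters in 'text' that XML 1.0 does not allow.
    // Surrogate halves are checked one by one here, so this sketch is only
    // meant for the single-unit control characters discussed in this thread.
    public static List<char> FindIllegalChars(string text)
    {
        var illegal = new List<char>();
        foreach (char c in text)
        {
            if (!XmlConvert.IsXmlChar(c) && !illegal.Contains(c))
            {
                illegal.Add(c);
            }
        }
        return illegal;
    }
}
```

Run against the five characters listed above, this reports 0x1F, 0x0C, 0x0E and 0x1E. The non-breaking space (0xA0) is a legal XML character, so it is the only one of the five that does not break the XML.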
The following code example shows how to replace page breaks, column breaks, and other characters with specific text. Hope this helps you.
// MyDir is assumed to be the folder that holds the input document.
Document doc = new Document(MyDir + "in.docx");
NodeCollection runs = doc.GetChildNodes(NodeType.Run, true);
foreach (Run run in runs)
{
    // Remove page breaks and column breaks.
    if (run.Text.Contains(ControlChar.PageBreak) ||
        run.Text.Contains(ControlChar.ColumnBreak))
    {
        run.Text = run.Text.Replace(ControlChar.PageBreak, String.Empty);
        run.Text = run.Text.Replace(ControlChar.ColumnBreak, String.Empty);
    }
    // Replace non-breaking hyphens with regular hyphens.
    if (run.Text.IndexOf(ControlChar.NonBreakingHyphenChar) >= 0)
    {
        run.Text = run.Text.Replace(ControlChar.NonBreakingHyphenChar, '-');
    }
    // Replace non-breaking spaces with regular spaces.
    if (run.Text.Contains(ControlChar.NonBreakingSpace))
    {
        run.Text = run.Text.Replace(ControlChar.NonBreakingSpace, " ");
    }
}
doc.Save(MyDir + "Out.txt");
If you still face problems, please share the following details for investigation purposes. I will investigate how you are expecting your final output document to look.
Please attach your input Word document.
Please attach the output txt file that shows the undesired behavior.
Please attach your target txt document showing the desired behavior.
Though it’s good to be able to fix this issue using Aspose.Words, I currently do this on the output text. My problem with this approach is that it requires scanning the text several times, each time for a different control character.
Moreover, I think this should be done in the conversion process, using a property in TxtSaveOptions, similar to the ParagraphBreak property. I don’t see any reason why someone would want an “optional hyphen” in their text document.
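As a rough sketch of how the repeated scans mentioned above could be avoided, the output text can be cleaned in a single pass with a StringBuilder. PlainTextCleaner and Clean are hypothetical names, not part of Aspose.Words, and the choice of replacement for each character (dropping breaks and optional hyphens, mapping the others to their plain equivalents) is an assumption based on what is described in this thread.

```csharp
using System.Text;

public static class PlainTextCleaner
{
    // Hypothetical helper: walks the text once and decides per character
    // whether to drop it, replace it, or copy it unchanged.
    public static string Clean(string text)
    {
        var sb = new StringBuilder(text.Length);
        foreach (char c in text)
        {
            switch (c)
            {
                case '\u001F': // optional hyphen: drop
                case '\u000C': // page break: drop
                case '\u000E': // column break: drop
                    break;
                case '\u001E': // non-breaking hyphen: regular hyphen
                    sb.Append('-');
                    break;
                case '\u00A0': // non-breaking space: regular space
                    sb.Append(' ');
                    break;
                default:
                    sb.Append(c);
                    break;
            }
        }
        return sb.ToString();
    }
}
```

The same function could also be applied to each run.Text inside the loop from the earlier reply, so that every run is scanned only once regardless of how many control characters are handled.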
Thanks for your inquiry. There are many control characters, so it is not a good approach to add a property to the TxtSaveOptions class for each one. However, you can achieve your requirements using the Aspose.Words APIs. You need to iterate through the Run nodes only once.
Please let us know if you face any issues while using Aspose.Words.
As it works now, the conversion from MS Word to text keeps MS Word control characters in the text, which is not a complete conversion. There’s no real need for a property for every control character, since it’s reasonable to expect that a non-breaking space is converted to a regular space, an optional hyphen is removed, and so on.
It’s true that we can use the suggested workaround, but I think this problem should be solved inside the product and not by the customer using the API.
Thanks for your inquiry. Please note that the solution shared here is not a workaround for an issue; it is simply how the Aspose.Words APIs are used. If you face any issues while using Aspose.Words, please share the following details for investigation purposes. I will investigate how you are expecting your final output document to look.
Please attach your input Word document.
Please attach the output txt file that shows the undesired behavior.
Please attach your target txt document showing the desired behavior.