Free Support Forum - aspose.com

Conversion from RTF to Text is adding special characters

Hi team,
We are converting RTF to text as shown below. When the output is decoded as ASCII, a special character ‘?’ is added, and when it is decoded as UTF7, the special character ‘’ is added. When we decode as UTF8, no special character is visible, but our compression logic detects one and appends ‘?’ later in our code. Could you let us know how to resolve this issue? We need to use UTF8 only.

public static string ConvertRTFtoText(string documentContent)
{
    Aspose.Words.Document doc;
    String strRTFText = string.Empty;
    String test1 = string.Empty;
    String test2 = string.Empty;
    documentContent = "History";

    using (Stream s = GenerateStreamFromString(documentContent))
    {
        doc = new Aspose.Words.Document(s);
    }

    Aspose.Words.Saving.TxtSaveOptions saveOptions = new Aspose.Words.Saving.TxtSaveOptions();
    saveOptions.SaveFormat = Aspose.Words.SaveFormat.Text;

    //Replace Image with text
    foreach (Shape shape in doc.GetChildNodes(NodeType.Shape, true))
        shape.ParentParagraph.InsertBefore(
            new Run(doc, "[Image removed]"),
            shape);

    // Remove hidden text runs (used for custom placeholder tags/texts)
    foreach (Run run in doc.GetChildNodes(NodeType.Run, true))
        if (run.Font.Hidden)
            run.Remove();

    // Save the document to a stream in plain-text format.
    using (MemoryStream rtfStream = new MemoryStream())
    {
        doc.Save(rtfStream, saveOptions);

        // Decode the saved bytes from the stream as text.
        strRTFText = Encoding.ASCII.GetString(rtfStream.ToArray()); // this adds '?' to the text
        test1 = Encoding.UTF7.GetString(rtfStream.ToArray()); // this adds '' to the text
        test2 = Encoding.UTF8.GetString(rtfStream.ToArray()); // this adds some special
        // invisible characters, which our compression logic detects and for which it adds a '?'
    }
    return strRTFText;
}

@balajisan21 Could you please attach your input RTF here for testing? Does the problem occur with all RTF documents on your side or only with specific ones? We will check the issue and provide you with more information.

The issue is not limited to RTF. We encounter it even when we pass a plain string value.
Test.zip (952 Bytes)

@balajisan21 What you are describing is the UTF-8 byte order mark (BOM). It is added at the beginning of the text file so that the consuming application can identify the file's encoding. You can simply remove the first 3 bytes from the byte array.
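A minimal sketch of that suggestion, using only standard .NET calls: it checks whether the byte array starts with the UTF-8 BOM (EF BB BF) and, if so, skips those bytes before decoding. The class and method names (`BomHelper`, `DecodeUtf8WithoutBom`) are hypothetical and not part of Aspose.Words.

```csharp
using System;
using System.Text;

public static class BomHelper
{
    // Hypothetical helper: decodes UTF-8 bytes to a string,
    // skipping a leading BOM (EF BB BF) when one is present.
    public static string DecodeUtf8WithoutBom(byte[] bytes)
    {
        byte[] bom = Encoding.UTF8.GetPreamble(); // EF BB BF
        int offset = 0;

        if (bytes.Length >= bom.Length &&
            bytes[0] == bom[0] &&
            bytes[1] == bom[1] &&
            bytes[2] == bom[2])
        {
            offset = bom.Length;
        }

        return Encoding.UTF8.GetString(bytes, offset, bytes.Length - offset);
    }
}
```

In the method from the original post, this would be called as `BomHelper.DecodeUtf8WithoutBom(rtfStream.ToArray())` in place of `Encoding.UTF8.GetString(rtfStream.ToArray())`. Decoding the same bytes with `Encoding.ASCII` turns each of the three BOM bytes into '?', which would explain the question marks seen with ASCII.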

Hi Alexey, we observed those three characters being added in between the text as well. Is there a way to completely ignore or remove them?

@balajisan21 Could you please provide a sample document that will allow us to see the problem with these characters in between the text? I will check it and provide you with more information. A BOM can occur only at the beginning of the file.

Unwantedcharacters.zip (2.3 KB)
This adds a special character at the beginning and also after the text “Testing Unstructured Note” within the RTF.

@balajisan21 This character is inside your document. Most likely it was inserted accidentally when documents were merged improperly. Aspose.Words reads everything that is in the document. Could you please let us know how this document was generated?
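If the stray character inside the document is U+FEFF (the same code point as the BOM; this is an assumption, since the thread does not confirm it), one possible workaround is to strip it from the decoded string wherever it occurs. The sketch below uses only standard .NET calls, and the names `TextCleanup`, `FindBomCharacters` and `RemoveBomCharacters` are hypothetical.

```csharp
using System;
using System.Collections.Generic;

public static class TextCleanup
{
    // Hypothetical helper: returns the index of every U+FEFF character,
    // which helps confirm where the unwanted characters actually sit.
    public static List<int> FindBomCharacters(string text)
    {
        var positions = new List<int>();
        for (int i = 0; i < text.Length; i++)
        {
            if (text[i] == '\uFEFF')
            {
                positions.Add(i);
            }
        }
        return positions;
    }

    // Hypothetical helper: removes every U+FEFF character,
    // at the start of the string or anywhere inside it.
    public static string RemoveBomCharacters(string text)
    {
        // String.Replace(string, string) is an ordinal replacement.
        return text.Replace("\uFEFF", string.Empty);
    }
}
```

Applied to the decoded text, for example `TextCleanup.RemoveBomCharacters(test2)`, this removes both the leading mark and any copies embedded in the text. It only hides the symptom, so finding out how the character got into the source document is still worthwhile.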