Free Support Forum - aspose.com

RTF to HTML Conversion error - Aspose.Words 4.4.00 (extra symbols present)

I am attempting to convert from RTF to HTML format. The resulting HTML is easily viewable and closely matches the input RTF. The problem, however, is that I am seeing “?” symbols where none exist in the input RTF code. As an example, the symbol ??? appears before the first “” tag in the resulting HTML document.

I have a line in the resulting HTML which reads “Number:?? 17730”. The source RTF contains two spaces between the colon and the “1” character. It does not contain any “?” symbols. Where I actually do have a “?” symbol in my RTF, the output contains three: the source RTF contains “lately?” but the resulting HTML contains “lately???”.

Any ideas why I am seeing these symbols? While testing during my demo of this product I noticed the initial “???” characters before the initial “” tag. I had assumed that this was due to the product being a demo. Now that I have the “real” thing, the extra characters are annoying at best.

I have attached a text document that contains a sample RTF and its resulting HTML. You will note that the source RTF contains one embedded image, which is handled properly by the .dll (after adjusting the URL of the image). This document has been emailed successfully; however, the extra “?” symbols are still present.


Hi

Thanks for your request. I have tried to convert your RTF document to HTML and everything seems to work fine on my side. I used the following code for testing.

Document doc = new Document(@"235_96864_Lensman\in.rtf");

doc.Save(@"235_96864_Lensman\out.html", SaveFormat.Html);

Also note that the evaluation version of Aspose.Words (without a license specified) provides full product functionality, but it injects an evaluation watermark at the top of the document on open and save, and limits the maximum document size to several hundred paragraphs.

Best regards.

Well, that does not explain my problems. I am using the following logic to perform the conversion. An immediate difference between our code seems to be that you are using the filesystem while I am using streams to hold my source and output data. I have added below the code I am using to perform the conversion. The variable "msIn" is our input stream and "msOut" is our output stream. The variable "FolderGUID" points to the destination folder where the images should be written.

License = new Aspose.Words.License();

License.SetLicense("Aspose.Words.lic");

doc = new Aspose.Words.Document(msIn);

doc.SaveOptions.ExportImagesFolder = FolderGUID;

doc.SaveOptions.ExportPrettyFormat = true;

doc.SaveOptions.HtmlExportImagesFolderAlias = "CID:";

msOut = new System.IO.MemoryStream();

doc.Save(msOut, Aspose.Words.SaveFormat.Html);

Hi

Thanks for the additional information. Try using the following code to get the HTML string.

Document doc = new Document(@"235_96864_Lensman\in.rtf");

MemoryStream msOut = new System.IO.MemoryStream();

doc.Save(msOut, Aspose.Words.SaveFormat.Html);

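// Extract the HTML string from the stream.
// ToArray() returns only the bytes written; GetBuffer() would also return unused buffer space.

Encoding enc = Encoding.UTF8;

string html = enc.GetString(msOut.ToArray());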

Best regards.
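When pulling text out of a MemoryStream, it may help to keep in mind that GetBuffer() returns the stream's whole internal array, including capacity that was never written, while ToArray() returns only the written bytes. The sketch below uses only standard .NET classes; the class name BufferDemo and the sample markup are made up for illustration and are not taken from the attached project.

```csharp
using System;
using System.IO;
using System.Text;

static class BufferDemo
{
    // Decodes only the bytes that were actually written to the stream.
    public static string DecodeWritten(MemoryStream ms)
    {
        return Encoding.UTF8.GetString(ms.ToArray());
    }

    // Decodes the entire internal buffer, including unused capacity,
    // which shows up as trailing NUL characters in the string.
    public static string DecodeWholeBuffer(MemoryStream ms)
    {
        return Encoding.UTF8.GetString(ms.GetBuffer());
    }

    static void Main()
    {
        MemoryStream ms = new MemoryStream();
        byte[] data = Encoding.UTF8.GetBytes("<html></html>"); // illustrative content
        ms.Write(data, 0, data.Length);

        Console.WriteLine(DecodeWritten(ms).Length);     // length of the written text
        Console.WriteLine(DecodeWholeBuffer(ms).Length); // at least as long, padded with NULs
    }
}
```

This only concerns bytes after the end of the document. It would not by itself account for characters appearing before the first tag.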

Well, I hate to say this, but you appear to have an unintended feature in your code. When the output goes to the filesystem, the resulting HTML is flawless. When using streams, it appears that invalid characters are inserted into the stream. I wrote a piece of code (attached to this document) in C# (2.0 framework) which demonstrates the problem. Even when using UTF8 (and streams), a "special" character is still inserted into the resulting HTML. Using ASCII encoding produces the most inaccurate results. Using UTF8 still has problems.

It is my understanding that RTF text is ASCII encoded and not UTF8. HTML technically should be ASCII (or quoted-printable). I have often seen it referred to as ISO 8859-1 and occasionally as UTF8.

Due to licensing restrictions, I have not included my license or .dll.
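One possible explanation for the symptoms, not confirmed anywhere in this thread, is that the stream holds UTF-8 bytes that are then decoded as ASCII. The ASCII decoder replaces every byte above 0x7F with "?", so a three-byte UTF-8 byte order mark becomes "???" and a two-byte non-breaking space becomes "??". The sketch below shows just that mechanism with standard .NET calls. That the output stream starts with a byte order mark and contains non-breaking spaces is an assumption, and the class name and sample strings are illustrative.

```csharp
using System;
using System.Text;

static class AsciiDecodeDemo
{
    // Encodes text as UTF-8 with a leading byte order mark (EF BB BF),
    // then decodes the same bytes as ASCII.
    public static string Utf8WithBomReadAsAscii(string text)
    {
        byte[] bom = Encoding.UTF8.GetPreamble();
        byte[] body = Encoding.UTF8.GetBytes(text);
        byte[] all = new byte[bom.Length + body.Length];
        Buffer.BlockCopy(bom, 0, all, 0, bom.Length);
        Buffer.BlockCopy(body, 0, all, bom.Length, body.Length);

        // Each byte above 0x7F is replaced with '?'.
        return Encoding.ASCII.GetString(all);
    }

    static void Main()
    {
        // Assumed input: a tag preceded by a byte order mark.
        Console.WriteLine(Utf8WithBomReadAsAscii("<html>"));              // ???<html>

        // Assumed input: a non-breaking space (U+00A0) followed by a normal space.
        Console.WriteLine(Utf8WithBomReadAsAscii("Number:\u00A0 17730")); // ???Number:?? 17730
    }
}
```

If this is what is happening, decoding the same bytes as UTF-8 would remove the "?" runs inside the text, which is consistent with UTF8 giving the cleanest output.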

Hi

Thanks for the additional information. When I use UTF8 encoding, the output HTML looks fine; I see no special symbols in it. Also note that the doc.Save(@"out.html", SaveFormat.Html) method, which saves the document in HTML format, uses UTF8 encoding.

Best regards.

Sorry, I guess I missed that notation. While a default is a good idea, so is following the selection of the end user. You are correct that UTF8 gives the cleanest output, but it is not completely clean. When using streams, characters are inserted before the first "" tag. My demo will show them to you should you compile and run it.

The UTF8 is usable.

Hi

Yes, I have compiled and launched your application, but I can’t see any special characters before the tag when I use the UTF8 encoding.

Best regards.

You must run it twice to see the difference. The radio buttons allow you to select the current encoding.

On my screen the second rich text box (from the top) contains a special character before the tag. When run in ASCII mode (the form's default state), this is displayed as ??? before the tag. I have attached a screen shot of the critter. It looks like a miniature cursor character in front of the tag in the second window.

I re-attached a larger copy of the photo as the previous attempt was not very viewable.
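A single stray character before the first tag in UTF8 mode is what a UTF-8 byte order mark would look like if it survived decoding: Encoding.UTF8.GetString() keeps it as the character U+FEFF, which some controls draw as a small glyph and others do not draw at all. Whether that is the character in the screen shot is an assumption. The sketch below, using only standard .NET classes and an illustrative class name and sample tag, reads the stream through a StreamReader, which drops a leading byte order mark.

```csharp
using System;
using System.IO;
using System.Text;

static class BomDemo
{
    // Reads the stream as UTF-8 text. StreamReader recognises a leading
    // byte order mark and does not return it as part of the text.
    public static string ReadText(MemoryStream ms)
    {
        ms.Position = 0;
        using (StreamReader reader = new StreamReader(ms, Encoding.UTF8, true))
        {
            return reader.ReadToEnd();
        }
    }

    static void Main()
    {
        // Assumed stream content: byte order mark followed by a tag.
        MemoryStream ms = new MemoryStream();
        byte[] bom = Encoding.UTF8.GetPreamble();
        byte[] body = Encoding.UTF8.GetBytes("<html>");
        ms.Write(bom, 0, bom.Length);
        ms.Write(body, 0, body.Length);

        string raw = Encoding.UTF8.GetString(ms.ToArray());
        Console.WriteLine(raw[0] == '\uFEFF'); // True: GetString keeps the mark
        Console.WriteLine(ReadText(ms));       // <html>
    }
}
```

Because the mark is invisible in many fonts and controls, the same output can look clean on one machine and show an extra character on another.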

It is very strange. I still can’t see this character on my side.

Best regards.