We are seeing issues with loading RTF generated by Aspose.Words.Net into a .NET RichTextBox. The RichTextBox is part of a WPF application, and the intent is to render HTML content, as well as the ability to subsequently modify the content via the RichTextBox.
The document flow we are using is as follows:
HTML -> RTF -> RichTextBox
We use the following roundabout way to convert HTML to RTF, then load the RTF into the FlowDocument associated with the RichTextBox:
// Load the source HTML document from disk.
using (Stream fs = File.Open(path2Html, FileMode.Open))
{
    LoadOptions options = new LoadOptions();
    options.LoadFormat = LoadFormat.Html;
    Aspose.Words.Document doc = new Aspose.Words.Document(fs, options);
    // Save as RTF document.
    doc.Save(path2Rtf, Aspose.Words.SaveFormat.Rtf);
}
// Load the RTF stream into FlowDocument.
using (FileStream stream = new FileStream(path2Rtf, FileMode.Open))
{
    TextRange textRange = new TextRange(this.richTextBox.Document.ContentStart, this.richTextBox.Document.ContentEnd);
    textRange.Load(stream, DataFormats.Rtf);
}
We are seeing varying degrees of success and fidelity, but the main concern is the following error, where the RTF fails to load into the FlowDocument:
Unrecognized structure in data format ‘Rich Text Format’.
Presumably some of the RTF constructs used are not supported by the RichTextBox/FlowDocument? Is there a way to determine what the unrecognized structure in question is?
Are we taking the right approach? Is there an alternative way of importing HTML into a RichTextBox using the Aspose.Words.Net component that we are unaware of?
I have attached a sample HTML source (taken from an HTML email) with which we are experiencing the difficulty described above.
Any information, suggestion or assistance would be very much appreciated.
Hi
Thanks for your request. I found the problem in your HTML. There are hyperlinks like the following:
The highlighted tag causes the problem. Such a hyperlink is converted to a HYPERLINK field in the RTF document, but the start and the end of the field are in different paragraphs. Such a construction is valid, but RichTextBox does not like it for some reason.
So, as a workaround, you can remove the line-break tags, or remove the whole hyperlink that has only a line break as its displayed text (such a hyperlink is invisible anyway). Alternatively, you can move the FieldEnd node to the previous paragraph if it is the first child of its paragraph. Here is sample code:
string path2Html = @"C:\Temp\in.html";
string path2Rtf = @"C:\Temp\out.rtf";
// Load the source HTML document from disk.
using (Stream fs = File.Open(path2Html, FileMode.Open))
{
    LoadOptions options = new LoadOptions();
    options.LoadFormat = LoadFormat.Html;
    Document doc = new Document(fs, options);
    // WORKAROUND: Move the FieldEnd node to the previous paragraph if it is the first child of its paragraph.
    // Get all FieldEnd nodes.
    Node[] fieldEnds = doc.GetChildNodes(NodeType.FieldEnd, true).ToArray();
    foreach (FieldEnd fieldEnd in fieldEnds)
    {
        // Get parent paragraph.
        Paragraph parent = fieldEnd.ParentParagraph;
        if (parent.FirstChild.Equals(fieldEnd))
        {
            // Walk backwards through the document until the previous paragraph is found.
            Node currentNode = parent.PreviousPreOrder(doc);
            while (currentNode != null && currentNode.NodeType != NodeType.Paragraph)
                currentNode = currentNode.PreviousPreOrder(doc);
            // Skip the move if there is no previous paragraph.
            if (currentNode != null)
            {
                Paragraph previousParagraph = (Paragraph) currentNode;
                previousParagraph.AppendChild(fieldEnd);
            }
        }
    }
    // Save as RTF document.
    doc.Save(path2Rtf, SaveFormat.Rtf);
}
// Load the RTF stream into FlowDocument.
using (FileStream stream = new FileStream(path2Rtf, FileMode.Open, FileAccess.Read))
{
    TextRange textRange = new TextRange(this.richTextBox.Document.ContentStart, this.richTextBox.Document.ContentEnd);
    textRange.Load(stream, DataFormats.Rtf);
}
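Before applying the workaround, it may help to confirm that a document actually contains the construction described above. The following is only a sketch: it is read-only and relies on the same Aspose.Words calls used in the sample code, and the names FieldDiagnostics and CountLeadingFieldEnds are hypothetical, not part of Aspose.Words.

```csharp
using System;
using Aspose.Words;
using Aspose.Words.Fields;

static class FieldDiagnostics
{
    // Counts FieldEnd nodes that are the first child of their paragraph,
    // i.e. fields whose end sits at the start of a later paragraph.
    // (Hypothetical helper; it does not modify the document.)
    public static int CountLeadingFieldEnds(Document doc)
    {
        int count = 0;
        foreach (FieldEnd fieldEnd in doc.GetChildNodes(NodeType.FieldEnd, true))
        {
            Paragraph parent = fieldEnd.ParentParagraph;
            if (parent != null && parent.FirstChild == fieldEnd)
                count++;
        }
        return count;
    }
}
```

A result greater than zero after loading the HTML suggests the document has fields that the workaround would move.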
Hope this helps.
Best regards,
Hello Alexey,
Thank you for the quick response and analysis. That is an interesting observation, and the workaround seems to work well: the generated RTF can now be loaded into the RichTextBox without any error, and the rendering is more accurate without the extra line break.
I have attached another HTML file and the corresponding RTF generated by Aspose.Words, which proved to be problematic as well. The HTML is from a very simple Outlook email and has the usual Microsoft Office markup and styles. We are seeing the same unrecognized structure error when loading the RTF into a RichTextBox, but I don’t believe it is related to the line break observed above.
Would you be able to have a quick look at this as well? Thanks.
Regards,
Jonathan
Hi Jonathan,
Thank you for the additional information. Unfortunately, I cannot reproduce the problem with the attached HTML document. I can load the output RTF into RichTextBox without any issues.
Best regards,
Hi Alexey,
I have attached a very simple test program with which I am able to reproduce the error. Would you be able to take a look and see what is happening?
Thanks.
Regards,
Jonathan
Hi
Thank you for additional information. I managed to reproduce the problem on my side. This time the problem is on our side. Your request has been linked to the appropriate issue. You will be notified as soon as it is resolved.
However, there is a simple workaround. Please see the following code:
/// <summary>
/// Convert HTML document to RTF
/// </summary>
/// <param name="path2Html">path to HTML source document</param>
/// <param name="path2Rtf">path to RTF output document</param>
/// <returns>true if the conversion succeeded; otherwise false</returns>
private bool ConvertHtml2Rtf(string path2Html, string path2Rtf)
{
    bool success = false;
    try
    {
        if (File.Exists(path2Html))
        {
            // WORKAROUND: Problem occurs because style contains HTML encoded quotes (&quot;).
            // Read all text from HTML document.
            string html = File.ReadAllText(path2Html);
            // Replace HTML encoded quotes with single quotes.
            html = html.Replace("&quot;", "'");
            using (MemoryStream fs = new MemoryStream(Encoding.UTF8.GetBytes(html)))
            {
                // Load source HTML document.
                LoadOptions options = new LoadOptions();
                options.LoadFormat = LoadFormat.Html;
                Document doc = new Document(fs, options);
                // WORKAROUND:
                // Move each FieldEnd node to the previous paragraph if it is the first child of its paragraph.
                Node[] fieldEnds = doc.GetChildNodes(NodeType.FieldEnd, true).ToArray();
                foreach (Aspose.Words.Fields.FieldEnd fieldEnd in fieldEnds)
                {
                    // Get parent paragraph.
                    Aspose.Words.Paragraph parent = fieldEnd.ParentParagraph;
                    if (parent.FirstChild.Equals(fieldEnd))
                    {
                        Node currentNode = parent.PreviousPreOrder(doc);
                        while (currentNode != null && currentNode.NodeType != NodeType.Paragraph)
                            currentNode = currentNode.PreviousPreOrder(doc);
                        // Skip the move if there is no previous paragraph.
                        if (currentNode != null)
                        {
                            Aspose.Words.Paragraph previousParagraph = (Aspose.Words.Paragraph) currentNode;
                            previousParagraph.AppendChild(fieldEnd);
                        }
                    }
                }
                // Save as RTF document.
                doc.Save(path2Rtf, SaveFormat.Rtf);
                // Done.
                success = true;
            }
        }
    }
    catch (Exception ex)
    {
        Trace.WriteLine(ex.Message);
    }
    return success;
}
Best regards,
Hi Alexey,
Thank you for looking into that, and for confirming the issue with the HTML encoded quote (&quot;). We are now able to load the HTML into RichTextBox (via RTF).
To summarise our evaluation of the Aspose.Words component, here is where things stand:
- We need to work around issues related to line breaks that are present in link/href elements, i.e. the first workaround documented above.
- The line break workaround appears to introduce additional line breaks, i.e. double line breaks outside the context of a link/href element, but we cannot confirm whether this is the case. Is there a way to restrict the handling of line breaks to link/href elements only, so that we can determine whether this is the cause of the double line breaks? This matters for the appearance of the converted output, so that paragraphs are not spaced too far apart.
- The workaround for HTML encoded quotes (&quot;), the second workaround documented above, solved the issue with HTML encoded quotes within a style element by replacing &quot; with a single quote ('). However, when &quot; is used outside the context of a style element, especially in the HTML body, it should be translated to a double quote ("), not a single quote.
Will the above be addressed in a future release of Aspose.Words, i.e. without the need for the workarounds documented above?
Now, coming back to the original exercise of converting HTML -> RTF -> RichTextBox: is there any plan to add support for conversion between HTML and FlowDocument? Or, to put it another way, will FlowDocument be added as a target conversion type in a future release of Aspose.Words?
Any information you can provide with regards to the above would be very much appreciated, and will help us with the outcome of our evaluation. Thanks.
Regards,
Jonathan
Hi Jonathan,
Thanks for your inquiry.
- The workaround is needed because RichTextBox does not handle a field whose start is in one paragraph and whose end is in another. This is not a bug in Aspose.Words but in RichTextBox. The RTF generated by Aspose.Words is valid and can be opened successfully in MS Word, OpenOffice or WordPad.
- You can remove the empty paragraph left behind after moving the FieldEnd to the previous paragraph. See the following code:
// WORKAROUND:
// Move each FieldEnd node to the previous paragraph if it is the first child of its paragraph.
Node[] fieldEnds = doc.GetChildNodes(NodeType.FieldEnd, true).ToArray();
foreach (Aspose.Words.Fields.FieldEnd fieldEnd in fieldEnds)
{
    // Get parent paragraph.
    Aspose.Words.Paragraph parent = fieldEnd.ParentParagraph;
    if (parent.FirstChild.Equals(fieldEnd))
    {
        Node currentNode = parent.PreviousPreOrder(doc);
        while (currentNode != null && currentNode.NodeType != NodeType.Paragraph)
            currentNode = currentNode.PreviousPreOrder(doc);
        // Skip the move if there is no previous paragraph.
        if (currentNode != null)
        {
            Aspose.Words.Paragraph previousParagraph = (Aspose.Words.Paragraph) currentNode;
            previousParagraph.AppendChild(fieldEnd);
        }
    }
    // Remove the paragraph if moving the FieldEnd left it empty.
    if (!parent.HasChildNodes)
        parent.Remove();
}
- You can use regular expressions to replace &quot; only in styles.
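One possible shape for that regular-expression approach is sketched below. It assumes that styles appear either in double-quoted style attributes or inside style elements; single-quoted attributes and other edge cases are not handled. The names QuoteFixer and ReplaceQuotesInStyles are hypothetical and not part of Aspose.Words.

```csharp
using System;
using System.Text.RegularExpressions;

static class QuoteFixer
{
    // Matches a double-quoted style attribute, e.g. style="font-family:&quot;Arial&quot;".
    private static readonly Regex StyleAttribute = new Regex(
        "\\bstyle\\s*=\\s*\"[^\"]*\"", RegexOptions.IgnoreCase);

    // Matches a whole <style> ... </style> element, including line breaks.
    private static readonly Regex StyleElement = new Regex(
        "<style\\b[^>]*>.*?</style>", RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Replaces &quot; with a single quote inside styles only;
    // &quot; elsewhere (for example in body text) is left untouched.
    public static string ReplaceQuotesInStyles(string html)
    {
        MatchEvaluator fix = m => m.Value.Replace("&quot;", "'");
        html = StyleAttribute.Replace(html, fix);
        html = StyleElement.Replace(html, fix);
        return html;
    }
}
```

In the ConvertHtml2Rtf method above, a call such as html = QuoteFixer.ReplaceQuotesInStyles(html); could stand in for the global html.Replace call, so that &quot; in the body still reaches the HTML loader unchanged.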
Best regards,
The issues you have found earlier (filed as 20707) have been fixed in this update.
This message was posted using Notification2Forum from Downloads module by aspose.notifier.