Transform HTML to docx problem with <style>

Enviromatic · September 9, 2011, 8:14am

Hi,
I use aspose.words to get the text part of emails that do not have a textbody defined. I open HTMLBody with aspose.words and then save the document as text file. This works well in general but I experience some problems.
Find attached an email. Take its HtmlBody, open it in Aspose.Words as if it were an HTML file, and then save the document as text (I obtain the attached text file). As you can see, the <style> tag appears in the output, but it should not.
Can you do something?
Another question, please: look at how I encode (normalize) the HtmlBody to create the MemoryStream that I open with Aspose.Words. Is this the right way? Afterwards, when I read the text back from the MemoryStream generated by Aspose.Words, I use UTF-8 encoding. Is that the right encoding?
Best regards,
Here is the code I use

MemoryStream ms = new MemoryStream();
UnicodeEncoding uniEncoding = new UnicodeEncoding();
byte[] html = uniEncoding.GetBytes(email.Message.HtmlBody.Normalize());
ms.Write(html, 0, html.Length);
LoadOptions lo = new LoadOptions();
lo.LoadFormat = LoadFormat.Html;
Document doc = new Document(ms, lo);
MemoryStream msOut = new MemoryStream();
doc.Save("d:\\test.docx", Aspose.Words.SaveFormat.Docx);
doc.Save("d:\\test.txt", Aspose.Words.SaveFormat.Text);
doc.Save(msOut, Aspose.Words.SaveFormat.Text);
byte[] txt = msOut.ToArray();
email.EmailTextBody = Encoding.UTF8.GetString(txt);

alexey.noskov · September 10, 2011, 12:42pm

Hi
Thanks for your request. Please try using the following code:

// Load the EML file using Aspose.Network for .NET
MailMessage msg = MailMessage.Load(@"Test001\Test.eml", MessageFormat.Eml);
// Convert the message to MHTML and save it to a stream
MemoryStream msgStream = new MemoryStream();
msg.Save(msgStream, MailMessageSaveType.MHtmlFromat);
msgStream.Position = 0;
// Load the MHTML stream using Aspose.Words for .NET
Document msgDocument = new Document(msgStream);
msgDocument.Save(@"Test001\out.docx");

This code produces the correct output on my side.
Best regards,

Enviromatic · September 12, 2011, 2:57am

Hi,
Thank you for your answer but when I do what you say, I get a lot of header lines I don’t want: from, to, subject, etc. I just need the text inside the body of the email.
I need to show about 25 emails in a gridview so I need a very synthetic view with no images and no header lines. Only the text in the body.
Regards,

alexey.noskov · September 12, 2011, 9:24am

Hi
Thank you for the additional information. In this case, you should use code like the following:

// Load the EML file using Aspose.Network for .NET
MailMessage msg = MailMessage.Load(@"Test001\Test.eml", MessageFormat.Eml);
// Take the HTML body of the message and encode it into a stream
string bodyHtml = msg.HtmlBody;
byte[] bodyHtmlBytes = Encoding.UTF8.GetBytes(bodyHtml);
using(MemoryStream bodyHtmlStream = new MemoryStream(bodyHtmlBytes))
{
    // Open HTML document using Aspose.Words.
    Document doc = new Document(bodyHtmlStream);
    // Save document.
    doc.Save(@"Test001\out.docx");
}

Hope this helps.
Best regards,

Enviromatic · September 13, 2011, 3:40am

Hi,
Thank you for your answer, it is working this way.
The only thing I had to modify is the line

byte[] bodyHtmlBytes = Encoding.UTF8.GetBytes(bodyHtml);

Indeed, with the UTF8 encoding I had some strange characters. I had to use this line instead.

byte[] bodyHtmlBytes = Encoding.Default.GetBytes(bodyHtml);

I don’t really understand why. Do you have an explanation?
Thank you

alexey.noskov · September 13, 2011, 5:24am

Hi
Thank you for the additional information. It is great that you managed to achieve what you need.
Unfortunately, I also do not have an explanation for why you have to change the encoding. Maybe the content of your message is not in UTF-8 encoding.
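One possible explanation, offered as an assumption rather than a confirmed diagnosis: the HTML body may declare its own charset (for example in a meta tag), and the bytes handed to the loader may then be decoded with that charset instead of the one used to produce them. On .NET Framework, Encoding.Default is the system ANSI code page, which would happen to match such a declaration on many Western European machines. The sketch below uses only System.Text; the class and method names are hypothetical, and it simply shows what happens when text is written with one encoding and read with another.

```csharp
using System.Text;

public static class CharsetMismatchDemo
{
    // Encode text with one encoding and decode the resulting bytes with another,
    // which is what happens when a stream's real encoding and its declared
    // charset disagree.
    public static string DecodeWith(string text, Encoding writtenAs, Encoding readAs)
    {
        byte[] bytes = writtenAs.GetBytes(text);
        return readAs.GetString(bytes);
    }
}
```

For instance, CharsetMismatchDemo.DecodeWith("é", Encoding.UTF8, Encoding.GetEncoding("iso-8859-1")) yields the two characters "Ã©", the kind of strange characters described above.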
Best regards,

Enviromatic · September 13, 2011, 9:21am

Hi,
One last question: I don’t understand why the docx file contains carriage returns but the text file does not. In the text file, the carriage returns appear as strange characters (see both files attached).
Finally, I use aspose.MailMessage.PreferredTextEncoding to create the stream from the HtmlBody of the email. Is that right?
Regards,

alexey.noskov · September 13, 2011, 2:10pm

Hi
Thanks for your request. The problem occurs because manual line breaks are used in your document (in HTML this is the <br> tag; in a Word document one can be inserted by pressing Shift+Enter). To resolve the problem, you should replace the line break characters in your txt document with carriage return characters. For instance, see the following code:

string txt = File.ReadAllText(@"Test001\test.txt");
// Replace line break with paragraph carriage return.
txt = txt.Replace("\v", "\r\n");
using(FileStream fs = new FileStream(@"Test001\out.txt", FileMode.Create))
{
    using(StreamWriter writer = new StreamWriter(fs))
    {
        writer.Write(txt);
    }
}
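Since the earlier code in this thread saves the text to a MemoryStream rather than a file, the same replacement can be done in memory. This is only a sketch: the class and method names are hypothetical, and it assumes the exported bytes are UTF-8, possibly with a leading byte order mark. Reading through a StreamReader skips that mark, whereas Encoding.UTF8.GetString would keep it as an invisible first character.

```csharp
using System.IO;
using System.Text;

public static class PlainTextCleanup
{
    // Decode the bytes of a text export (assumed UTF-8) and turn manual
    // line breaks (\v) into CRLF. StreamReader drops a leading BOM if present.
    public static string DecodeAndNormalize(byte[] exportedText)
    {
        using (var stream = new MemoryStream(exportedText))
        using (var reader = new StreamReader(stream, Encoding.UTF8, true))
        {
            string text = reader.ReadToEnd();
            return text.Replace("\v", "\r\n");
        }
    }
}
```

With the variable names used earlier, this would be called as email.EmailTextBody = PlainTextCleanup.DecodeAndNormalize(msOut.ToArray());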

Hope this helps.
Best regards,

Enviromatic · September 14, 2011, 5:02am

Hi,
Thank you for your answer and the detailed information.
It is working perfectly now.
Thank you very much for your help.

Enviromatic · September 14, 2011, 5:59am

Hi again,
A problem remains, but I’m not sure that Aspose.Words is responsible for it. I think Aspose.Network may be responsible.
Look at the attached email. If I want to open the HtmlBody of the email with Aspose.Words, I first need to decode the HtmlBody. The Aspose.Network email object has information about the body encoding: the property is Message.BodyEncoding. If this property is null, there is another property, which is Message.PreferredTextEncoding.

So, to decode the body, here is what I do to find the correct encoding to use:

Encoding en = Encoding.Default;
if (email.Message.PreferredTextEncoding != null)
{
    en = email.Message.PreferredTextEncoding;
}
if (email.Message.BodyEncoding != null)
{
    en = email.Message.BodyEncoding;
}
byte[] html = en.GetBytes(email.Message.HtmlBody);
MemoryStream ms = new MemoryStream(html);
LoadOptions lo = new LoadOptions();
lo.LoadFormat = LoadFormat.Html;
Document doc = new Document(ms, lo);

For the attached email, BodyEncoding is null and PreferredTextEncoding is the one shown in the attached picture. But if I use PreferredTextEncoding, the docx file created with Aspose.Words contains strange characters.
So here is my question: how do I know which encoding to use to get the correct text in Aspose.Words?
Regards,

alexey.noskov · September 14, 2011, 2:13pm

Hi
Thanks for your request. I suppose it would be better to ask this question in the Aspose.Network forum. My colleagues from the Aspose.Network team will answer you shortly.
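In the meantime, one thing that may be worth trying, under the unverified assumption that the HTML loader trusts the charset declared inside the HTML: since HtmlBody is already a decoded .NET string, you could rewrite any charset declaration to utf-8 and always encode with UTF-8, so that the declaration and the bytes agree regardless of the original message encoding. The class and method names below are hypothetical, and the regular expression is deliberately simple, so it would also rewrite a literal "charset=..." appearing in the visible text.

```csharp
using System.Text;
using System.Text.RegularExpressions;

public static class HtmlCharsetAligner
{
    // Matches charset=NAME, with or without an opening quote, case-insensitively.
    private static readonly Regex CharsetDeclaration = new Regex(
        @"(charset\s*=\s*[""']?)[A-Za-z0-9_.:\-]+",
        RegexOptions.IgnoreCase);

    // Rewrite every declared charset to utf-8, then encode as UTF-8 without a BOM.
    public static byte[] ToUtf8Bytes(string html)
    {
        string aligned = CharsetDeclaration.Replace(html, "${1}utf-8");
        return new UTF8Encoding(false).GetBytes(aligned);
    }
}
```

The resulting bytes would replace the en.GetBytes(email.Message.HtmlBody) call above, for example new MemoryStream(HtmlCharsetAligner.ToUtf8Bytes(email.Message.HtmlBody)).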
Best regards,

Enviromatic · September 15, 2011, 2:47am

Ok, thank you!
