Convert Internet Explorer 8 generated MHT (MHTML) Files with multipart/alternative to Word using C# .NET

Klepus · August 26, 2009, 8:44am

Hello Alexander!

Thank you for providing thoughtful feedback.

I read the same articles and RFCs on the Internet, thank you :). We’ve also been testing our MHTML import implementation with Jacob Palme’s samples ever since we started working on it. I have now made all combinations of multipart structures from that set pass. You’ll be able to verify this with our next version. Of course we have many other tests, ranging from real cases to very marginal ones.

multipart/alternative and multipart/related are typically combined in exactly two ways: one of the two is the outer level and the other is the inner. Please read the referenced mail message carefully. multipart/mixed can occur in multipart/alternative, but this case is uncommon. Usually the same is done by placing multipart/alternative inside multipart/related. If a mailer doesn’t recognize text/html, it should take text/plain and treat the subsidiary parts of the outer multipart/related as if they were in multipart/mixed. This gives the same result without the need to repeat every subsidiary part in both alternatives.

In any case, the lack of support for multipart/mixed in multipart/alternative won’t affect you, since we always take the multipart/related alternative to find text/html inside. Why should Aspose.Words read other alternatives? Maybe only when the preferred alternative is absent or damaged. Do you agree? If yes, then this is a very minor case. I have created the corresponding issue and linked it to this thread. We’ll notify you when it’s fixed. But I must repeat that its business priority is very low.

Regarding extremely big images and tables, I’ve already answered your question: you can edit the document programmatically after it has been imported. Can I help you further with this question?

Best regards,

Alex_74 · August 26, 2009, 11:17am

Hello Viktor,

at this time it is important for us that we can read and parse the real mails we have already received from customers. Some examples of them were attached to my previous message.

In parallel we can try to build a workaround that stores mails for manual processing whenever a mail is not processed automatically, and analyse these rare cases later (maybe it will always be exactly the same unusual case).

I’m waiting for the next version. If I’ve understood you correctly, the next version will parse more cases, right?

Are there any samples showing how to programmatically traverse the document and enlarge all pages so that all truncated elements on those pages become visible?

Many thanks in advance!

Klepus · August 26, 2009, 7:32pm

Yes, the current main codebase already supports some new features. They’ll be available with the next release. If you have any other questionable samples, please share them here in the forum.

No, there is no existing sample for extending the page size to fit the content. Maybe someone has written such a sample, but I don’t know of it. If it’s difficult, then I can try it myself. It would also be good to have a sample source document to be absolutely sure that the code covers these cases.

Regards,

Alex_74 · September 14, 2009, 3:11am

Hello Viktor,

has any work already been done in this direction, or will a new update be available soon?

Thanks!

Klepus · September 14, 2009, 4:52am

Hello Alexander!

Thank you for your patience. We are going to release in a few days.

Have you had any success with adjusting the page size to fit the contents?

Regards,

Alex_74 · September 15, 2009, 11:12am

Hello Viktor,

thanks for your support.

At the moment I don’t have a working page-sizing algorithm.

But we have received a new real HTML mail with a very uncommon internal format: it can be viewed in Internet Explorer and Outlook, but can’t be opened with MS Word.

Moreover, there is an error in the boundary name at the bottom of the mail. It seems to be truncated, which doesn’t prevent IE and Outlook from rendering the file correctly. Even after I corrected the bottom line, the mail still can’t be opened by MS Word.

I’ve attached this mail to my post. Maybe you can take a look before publishing the new release.

Thanks.

Klepus · September 15, 2009, 8:03pm

Hello Alexander!

I’ll take a look. This file seems to be corrupted. Maybe I’ll be able to recover it while reading.

The new release is already being built, so it’s impossible to incorporate any more fixes into it.

Regards,

Alex_74 · September 16, 2009, 7:37am

Hello Viktor,

where can I download the new trial version?

Thanks!

Klepus · September 16, 2009, 8:20am

Hello!

The new version will be available from the download page as usual. It’s not a trial, it’s the next regular version. It will be published today or tomorrow.

Regards,

Klepus · September 16, 2009, 5:12pm

Hi again.

I have looked at your last attached file more closely and found three issues:

The multipart boundary is not recognized if it contains spaces. This is not a problem to support. It is most probably a deviation from the standard, since Microsoft Word doesn’t support such boundaries. As a workaround you can replace the spaces in boundary strings with underscore characters or anything else.
The ending multipart/related boundary is damaged. When I fixed these two manually I managed to open the document with Microsoft Word. A new issue has been created, but I’m not sure we’ll fix this in the foreseeable future. Repairing damaged files is a complex and “endless” task since we cannot predict any “affordable level of corruption”.

between two tables doesn’t take effect. It’s an issue with the HTML importer, which is also logged.
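
Until the fix is released, the boundary workaround from the first issue above could be sketched roughly as follows. This is only an illustration under assumptions, not code from Aspose.Words: the class name BoundaryFixer and the method ReplaceSpacesInBoundaries are hypothetical, and the sketch assumes that boundaries containing spaces are declared in quoted form (boundary="...") and that every delimiter sits on a line of its own.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of Aspose.Words.
public static class BoundaryFixer
{
    // Matches a quoted boundary parameter such as: boundary="----=_Next Part_01"
    private static readonly Regex BoundaryParam =
        new Regex("(boundary\\s*=\\s*\")([^\"\\r\\n]+)(\")", RegexOptions.IgnoreCase);

    public static string ReplaceSpacesInBoundaries(string mime)
    {
        // Collect every declared boundary that contains a space.
        var renames = new Dictionary<string, string>();
        foreach (Match m in BoundaryParam.Matches(mime))
        {
            string boundary = m.Groups[2].Value;
            if (boundary.Contains(" "))
                renames[boundary] = boundary.Replace(' ', '_');
        }

        // Rewrite the declarations in the Content-Type headers.
        string result = BoundaryParam.Replace(mime, m =>
            m.Groups[1].Value + m.Groups[2].Value.Replace(' ', '_') + m.Groups[3].Value);

        // Rewrite only whole delimiter lines ("--boundary" and "--boundary--"),
        // so body text that happens to contain the same words stays untouched.
        foreach (KeyValuePair<string, string> pair in renames)
        {
            var delimiter = new Regex(
                "^--" + Regex.Escape(pair.Key) + "(--)?([ \\t]*\\r?)$",
                RegexOptions.Multiline);
            string replacement = pair.Value;
            result = delimiter.Replace(result, m =>
                "--" + replacement + m.Groups[1].Value + m.Groups[2].Value);
        }
        return result;
    }
}
```

To use it, read the file into a string, pass it through ReplaceSpacesInBoundaries, and write the result to a new file before opening that file with Aspose.Words. Reading and writing with a single-byte encoding such as ISO-8859-1 should keep the encoded bodies intact. The sketch does not check whether a renamed boundary collides with another boundary already present in the file, and it does not repair truncated closing delimiters.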

Best regards,

aspose.notifier · September 16, 2009, 11:26pm

The issues you have found earlier (filed as 10008;9692) have been fixed in this update.

This message was posted using Notification2Forum from Downloads module by aspose.notifier.

Alex_74 · September 17, 2009, 4:48am

Hello Viktor!

Great work! Now it opens EML files with multipart/related and we don’t need to use MS Word anymore.

Many thanks also for the analysis of the whitespace inside multipart boundaries.

Unfortunately, I’ve found a big issue in the new version :((( The content of the second table is not rendered on the page at all. The content is

Indos	1509707931

Only the top line of this table is rendered, at the bottom of the first page.

If I first open the (boundary-corrected) file with MS Word, save it as DOC and then open this DOC file, all content seems to be rendered correctly.

Is it possible to make a hotfix for this issue?

Thanks in advance.

alexey.noskov · September 17, 2009, 6:04am

Hi Alexander,

Thanks for your inquiry. Could you please attach a sample document that will allow us to reproduce the problem? We will check it and provide you with more information.

Best regards.

Alex_74 · September 17, 2009, 6:51am

Hello Alexey,

this is the same document I attached to my previous post in this thread (the one with the very long file name).

Here is the attachment once again (with the last boundary footer corrected).

But I think I have found what causes the problem: the trial string on the first page seems to push the content on the first page down, so that the last line no longer fits within the page boundary :)

What is a little curious is that this line is placed at the bottom of the page, not directly below the preceding line.

alexey.noskov · September 17, 2009, 9:29am

Hi Alexander,

Thank you for the additional information. The problem might occur because there are two tables directly one after the other in the document. I have linked your request to the appropriate issue; you will be notified as soon as it is resolved.

As a workaround, you can just add an empty paragraph between tables. For example, see the following code:

// Requires a reference to Aspose.Words. Depending on the Aspose.Words version,
// the Table class may live in Aspose.Words.Tables and need an extra using directive.
using Aspose.Words;

// Open the document.
Document doc = new Document(@"Test001\in.mhtml");
// Get all tables in the document, including nested ones.
NodeCollection tables = doc.GetChildNodes(NodeType.Table, true);
// Loop through all tables.
foreach (Table table in tables)
{
    // Check whether the node right after the table is another table.
    // If so, insert an empty paragraph between the two tables.
    if (table.NextSibling != null && table.NextSibling.NodeType == NodeType.Table)
        table.ParentNode.InsertAfter(new Paragraph(doc), table);
}
// Save the output document as PDF.
doc.SaveToPdf(@"Test001\out.pdf");

Hope this helps.

Best regards.

Alex_74 · September 21, 2009, 10:00am

Hello Alexey,

many thanks for this workaround!

Previously I found and discussed a problem with spaces inside boundary names (example attached), which causes a FileCorrupted exception when such files are opened with Aspose.Words.

Can you please supply a fix for this problem?

It’s a little bit difficult to analyse the file and replace spaces in boundaries with underscores in a robust way.

Thanks in advance!

alexey.noskov · September 21, 2009, 10:46am

Hi Alex,

Thanks for your inquiry. The problem with spaces in multipart boundaries is already resolved in the current codebase. The fix will be included in the next hotfix, which will be released in 3-4 weeks. You will be notified as soon as it is published.

Best regards.

aspose.notifier · November 12, 2009, 6:44am

The issues you have found earlier (filed as 10549) have been fixed in this update.

AndreyN · September 10, 2010, 9:54am

The issues you have found earlier (filed as 10009) have been fixed in this update.

aspose.notifier · February 6, 2011, 7:32am

The issues you have found earlier (filed as 9902) have been fixed in this update.