Convert Internet Explorer 8 generated MHT (MHTML) Files with multipart / alternative to Word using C# .NET

Alex_74 · July 23, 2009, 7:50am

Hello,

I tried to open a simple MHT file, but got an exception.

Can you please help me?

P.S. The file and a screenshot of the exception are in the attachment.

Thanks in advance!

Klepus · July 23, 2009, 4:45pm

Hello Alexander!

Thank you for reporting this.

I have reproduced the exception. This is known issue #7473 in Aspose.Words. It has been fixed in the current codebase, and the correction will be available with the next public build (3-4 weeks). The technical reason is that “multipart/alternative” has not been supported so far. We’ll notify you when the new release comes out.

Regards,
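
For readers unfamiliar with “multipart/alternative”: it wraps several renderings of the same content (typically text/plain and text/html), and a reader has to pick one of them. The sketch below uses Python’s standard email package purely as a neutral illustration. It is not Aspose.Words code and says nothing about how issue #7473 was actually fixed; the sample message and the preference order (HTML first) are assumptions.

```python
from email import message_from_string, policy

# Hypothetical sample: one message offered both as plain text and as HTML.
RAW = """\
MIME-Version: 1.0
Content-Type: multipart/alternative; boundary="alt"

--alt
Content-Type: text/plain; charset="utf-8"

Hello
--alt
Content-Type: text/html; charset="utf-8"

<p>Hello</p>
--alt--
"""


def pick_body(raw: str):
    """Parse a message and return the preferred alternative (HTML, then plain text)."""
    msg = message_from_string(raw, policy=policy.default)
    return msg.get_body(preferencelist=("html", "plain"))


body = pick_body(RAW)
print(body.get_content_type())
print(body.get_content().strip())
```

An importer that does not know the subtype has no rule for choosing between the two parts, which is one plausible way such a file ends in an exception.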

Alex_74 · July 27, 2009, 5:04am

Hello Klepus!

Thanks for your response!


I would like to know how to get and apply a patch for this issue, because I need to build and evaluate a test-before-buy application right away.

Thanks in advance!

Klepus · July 27, 2009, 5:38am

Hello!

Thank you for your interest in Aspose.Words.

We are going to release a new version very soon, hopefully next Monday. It’s difficult to provide a patch right now, because building and testing one takes a lot of work. Making a release is quite a complex procedure, even though it’s highly automated. So let us fix some more issues and improve the product a bit more. Thank you for your understanding.

Regards,

Alex_74 · August 4, 2009, 3:49am

Hello!


Is the new version available yet?

Thank you!

Klepus · August 4, 2009, 4:05am

Hello Alexander!

Thank you for your patience. No, it isn’t yet. We’re working on it and will release today or tomorrow. You’ll be notified via the forum.

Regards,

aspose.notifier · August 5, 2009, 8:40am

The issues you have found earlier (filed as 7473) have been fixed in this update.

This message was posted using Notification2Forum from Downloads module by aspose.notifier.

Alex_74 · August 7, 2009, 7:16am

Thanks for the update!

Unfortunately, another similar MHTML problem is still present in the newest update: I get the exception “Cannot read from closed file” or “Cannot read from closed stream”.

File is attached.

With our C# MIME parser I can parse this file without any problems, and Outlook can also open and display it correctly.

Would you like us to share our code? If so, I need your email address.

Do you have a fix for this problem?

Thanks in advance.

Alex_74 · August 7, 2009, 8:43am

I’ve converted the file so that it no longer contains multipart/related and has only multipart/mixed and multipart/alternative (which should be supported in the 6.6.0 release), but this file is not loaded properly either.

File attached.

Thanks!

alexey.noskov · August 7, 2009, 9:41am

Hi

Thank you for reporting these problems to us. I managed to reproduce both issues. You will be notified as soon as they are resolved.

Best regards.

Klepus · August 7, 2009, 11:10am

Hello Alexander!
Thank you for experimenting with our new release.
I can explain what happens with these documents. You may then be able to find workarounds.
a5.eml has a very long primary header. When Aspose.Words reads a document, it first briefly detects the file format and then instantiates the appropriate heavyweight importer objects. Detection is a very lightweight algorithm that analyzes the first 512 bytes of the file. During format detection, signatures, MIME types and so on may be checked. HTML and MHTML don’t have any signatures, so we have to accept anything that looks like a MIME header and search it for some mandatory records. For instance, MHTML detection requires that “Content-Type” occurs within the first 512 bytes. This doesn’t happen in your document, so format detection fails. How long the detection buffer should be is a philosophical question: RFC 2557 and RFC 822 don’t restrict header length, but we don’t want to read every candidate file potentially up to the end. I’ll try to improve detection to cover your cases, but right now you can force MHTML import by specifying the format explicitly:

Document doc = new Document("a5.eml", LoadFormat.Mhtml, string.Empty);

When I did this, I found another issue: the main HTML document is not read. Format detection is involved here as well; in the general case it is needed to ensure that the source has the proper encoding. Since the encoding is already specified in a subsidiary header, we can relax the conditions. As a workaround, you can enclose the whole document in … tags.
a6.eml is not read because the multipart/mixed content type is not supported. Other multipart messages (alternative and related) are imported correctly.
I have linked your requests to the appropriate issues. You’ll be notified when they are fixed.
You can share any materials with us by sending them via the Aspose website:
https://forum.aspose.com/t/aspose-words-faq/2711
Regards,
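
The detection limit described in this post can be illustrated with a small sketch. It is written in Python only to stay self-contained, and it is not the Aspose.Words algorithm: the function name, the matching rule and the sample headers are assumptions. The only details taken from the post are the 512-byte window and the requirement that “Content-Type” appears inside it.

```python
DETECTION_WINDOW = 512  # bytes inspected, per the description in the post


def looks_like_mhtml(data: bytes, window: int = DETECTION_WINDOW) -> bool:
    """Return True if a Content-Type header field starts within the first `window` bytes."""
    head = data[:window].lower()
    return head.startswith(b"content-type:") or b"\ncontent-type:" in head


# A short header: Content-Type is near the top of the file.
short_header = (
    b"MIME-Version: 1.0\r\n"
    b'Content-Type: multipart/related; boundary="b"\r\n\r\n'
)

# Hypothetical long header: many Received: lines push Content-Type
# far beyond the detection window, similar to what is described for a5.eml.
long_header = (
    b"".join(b"Received: from relay%d.example.com\r\n" % i for i in range(40))
    + short_header
)

print(looks_like_mhtml(short_header))
print(looks_like_mhtml(long_header))
print(looks_like_mhtml(long_header, window=len(long_header)))
```

With the default window the long header is rejected; only scanning the whole header finds the field. That is the trade-off the post calls a philosophical question, and it is why passing the format explicitly bypasses the problem.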

Alex_74 · August 7, 2009, 11:46am

Hello!

Thanks for quick response!

I’ve tried these methods, but they do not help.

Document doc = new Document("a5.eml", LoadFormat.Mhtml, string.Empty);

does not load the document.

Enclosing the document in … and loading it with LoadFormat.Html does produce a result, but the whole file comes out as totally unreadable plain ASCII text.

I think it’s not so easy to apply any fixes on the customer side.

Thanks.

Alex_74 · August 7, 2009, 11:58am

Hello,

Klepus:

…

You can share any materials with us by sending them via Aspose website:

https://forum.aspose.com/t/aspose-words-faq/2711

Regards,

I’m sending you the core MIMER classes by email.

Thanks.

Klepus · August 7, 2009, 1:06pm

Thank you for your materials. I’ll take a look. What I meant is that, as a workaround, you should add … and force the MHTML format (not HTML) when reading the document. There are some minor issues, but at least the document will be read without any fixes to the library.

Regards,

Klepus · August 14, 2009, 4:45am

Hello Alexander!

I’d like to ask for your advice regarding the multipart/mixed MIME content type. Recently we got a file with this content type from you (a6.eml). Multipart/mixed is intended to pack several independent documents into one archive. In general there can be several items of any type: documents, images, attachments etc. When viewed in compatible browsers and mailers, multipart/mixed parts are shown one after another. But it’s difficult to guess how they should be read into the Aspose.Words document model. Import is designed to load one document file and produce one document model. Should we choose among the multipart/mixed parts, or just concatenate everything found into one document? What should we do if multipart/mixed occurs on nested levels? Maybe it would be good to have several documents as output; this depends on the application. Of course, parsing a MIME archive with multipart/mixed is not a problem, since the multipart structure is the same as for multipart/related and multipart/alternative.

This MIME content type is not frequently used, and I don’t see realistic scenarios for how Aspose.Words could process it. Maybe you tried multipart/mixed only as an experiment. Please let me know how you see this. There are two more multipart subtypes: multipart/parallel and multipart/digest. They are also unsupported, for the same reason.

Thank you in advance.

Regards,

Alex_74 · August 14, 2009, 5:26am


Hello Klepus!

First of all, many thanks for your really deep investigation of the problem!

Here are my answers:

Since the multipart structure is the same as for multipart/related and multipart/alternative, it should be parsed the same way :)))
We very often receive mails with this MIME type; it was not an experiment. However, I have only seen mails containing a single part of this type (with many items inside: in most cases an HTML mail with embedded image resources, but sometimes a text mail with attachments, in which case some mail clients show the text part followed by the images).
What I think and what we really need: we import an email as ONE document. For this constellation it is good to CONCATENATE the multipart/mixed parts one after another, in the same order as in the mail, if there is more than one. If they contain nested MIME parts, this should be processed as RECURSIVE CONCATENATION, meaning: CONCATENATE the MIME parts on the same level INSIDE the parent part. I also think this may be required by design: if an email contains an MHTML part whose embedded image is another MIME part inside it, the image part has to be included recursively inside the parent MHTML part.
Generally, the internal structure that is created is not important for us at this time, because we only need page images, but it would be very nice if the structure followed the order described above; we may need to visualize this structure later.
We have not found any multipart/digest or multipart/parallel mails.
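
The recursive concatenation proposed above can be sketched as a depth-first walk over the MIME tree. The example uses Python’s standard email package and is only an illustration of the ordering being requested, not of anything Aspose.Words does; the function name, the sample message and the placeholder for non-text parts are assumptions.

```python
from email import message_from_string
from email.message import Message
from typing import List

# Hypothetical sample: multipart/mixed nested inside multipart/mixed.
RAW = """\
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="outer"

--outer
Content-Type: text/plain

First part
--outer
Content-Type: multipart/mixed; boundary="inner"

--inner
Content-Type: text/plain

Nested part
--inner
Content-Type: image/png
Content-Transfer-Encoding: base64

iVBORw0KGgo=
--inner--
--outer
Content-Type: text/plain

Last part
--outer--
"""


def concatenate_parts(part: Message) -> List[str]:
    """Depth-first walk: collect leaf parts in the order they appear in the mail."""
    if part.is_multipart():
        collected: List[str] = []
        for child in part.get_payload():
            # Children of a nested multipart are spliced in at the parent's position.
            collected.extend(concatenate_parts(child))
        return collected
    if part.get_content_maintype() == "text":
        return [part.get_payload().strip()]
    # Non-text leaf (image, attachment): keep a placeholder so the order is visible.
    return ["[" + part.get_content_type() + "]"]


result = concatenate_parts(message_from_string(RAW))
print(result)
```

The nested parts land between “First part” and “Last part”, which is the “concatenate on the same level inside the parent part” behaviour described in the post.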

I also have another question: if the mail contains a really big image that does not fit on an A4 page, will the page become bigger, or will the image be cut off at the page edge?

I hope we can get the newest update as soon as you include multipart/mixed processing.

Thanks in advance!

Klepus · August 14, 2009, 7:23am

Thank you Alexander!

Recursion looks natural here. But I think that multipart/mixed cannot always be processed. For instance, if it occurs inside multipart/related or multipart/alternative, it makes no sense. In your sample one multipart/mixed occurs inside another multipart/mixed. That’s better. We can consider concatenating this way, but I can’t promise any release dates.

Please note if you need one root document with several subsidiary parts then you should put them all (document plus parts) into multipart/related. This is the most frequently used multipart subtype. As an example you can inspect how Aspose.Words exports MHTML format.
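
As an illustration of the “root document plus subsidiary parts” layout, here is a minimal sketch that builds a multipart/related message and looks up a part the way a `cid:` URL in the root HTML would. It uses Python’s standard email package; it does not reproduce what Aspose.Words writes on MHTML export, and the helper names, the Content-ID scheme and the stub image bytes are assumptions.

```python
from email.mime.image import MIMEImage
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText

# Stub bytes standing in for image data (only the PNG signature, not a real image).
PNG_STUB = b"\x89PNG\r\n\x1a\n"


def build_related(html: str, images: dict) -> MIMEMultipart:
    """Pack a root HTML document and its images into one multipart/related message."""
    root = MIMEMultipart("related", type="text/html")
    root.attach(MIMEText(html, "html", "utf-8"))  # the root document comes first
    for cid, data in images.items():
        img = MIMEImage(data, _subtype="png")
        img.add_header("Content-ID", "<" + cid + ">")
        root.attach(img)
    return root


def find_part_by_cid(root, cid: str):
    """Resolve a cid: reference from the root HTML to the matching subsidiary part."""
    for part in root.walk():
        if part.get("Content-ID") == "<" + cid + ">":
            return part
    return None


msg = build_related('<html><body><img src="cid:logo"></body></html>', {"logo": PNG_STUB})
print(msg.get_content_type())
print(find_part_by_cid(msg, "logo").get_content_type())
```

Every subsidiary part is a single resource that the root document can address, which is what distinguishes this layout from a multipart/mixed collection.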

If multipart/mixed contains images, they should normally be imported into the same output document. That follows from our discussion. The output document has a default page setup, and I don’t think it’s a good idea to change it; at the very least, it should not be done silently. After import completes you can do anything you like with the images and the whole output document: change either the page setup or the image scale. In the future we are considering providing something like LoadOptions, by analogy with SaveOptions. That would be the natural way to support miscellaneous options for document import.

Please feel free to share any other considerations and suggestions. Your feedback is very important to us and is much appreciated.

Have a nice day!

Alex_74 · August 14, 2009, 8:09am

Klepus:

Thank you Alexander!

Recursion looks natural here.

I think so

Klepus:

But I think that multipart/mixed cannot always be processed. For instance, if it occurs inside multipart/related or multipart/alternative, it makes no sense. In your sample one multipart/mixed occurs inside another multipart/mixed. That’s better.

I’m not very familiar with everything the MIME types allow. I don’t understand why multipart/mixed inside multipart/related or multipart/alternative makes no sense. If it makes no sense, it could also be prohibited by the RFC. But if these parts do contain multipart/mixed, it would be good to process it in some “natural way”, just as if it were at the root level: do the same recursion and concatenation, and attach the result to the parent MIME multipart.

Klepus:

We can consider concatenating this way, but I can’t promise any release dates.

Can you please send a development release or a fix? It sounds as if supporting multipart/mixed is not too complicated for you ;)

Klepus:

Please note if you need one root document with several subsidiary parts then you should put them all (document plus parts) into multipart/related. This is the most frequently used multipart subtype. As an example you can inspect how Aspose.Words exports MHTML format.

We cannot manipulate the emails, because they are sent by customers and received by the mail server.

Klepus:

If multipart/mixed contains images, they should normally be imported into the same output document. That follows from our discussion.

I agree.

Klepus:

The output document has a default page setup, and I don’t think it’s a good idea to change it; at the very least, it should not be done silently. After import completes you can do anything you like with the images and the whole output document: change either the page setup or the image scale. In the future we are considering providing something like LoadOptions, by analogy with SaveOptions. That would be the natural way to support miscellaneous options for document import.

We don’t need to change anything in the default page setup. What may be required is that not every page has the same default size. I can describe the problem using MS Word: you can open an MHTML file with MS Word; if the file contains a big image, the result is a document with the default page size on every page, and the content of some pages is cut off at the page edge. This is not really acceptable, because information (parts of images or parts of text in big tables) is lost from the output document.

P.S. Is it possible to quote only one part of a reply?

Thank you very much!

Klepus · August 14, 2009, 9:46am

Which RFC do you mean? RFC2557 defines how a multipart message is organized in general. It doesn’t provide an analysis of all cases. The question of which cases are meaningless is out of its scope. Suppose you have a MIME archive with multipart/related at the top level. Every item inside it, except for the root document, should correspond to a subsidiary part related to that root document, and the HTML code in the root document refers to them by URLs. How can it refer to a multipart/mixed, since that is not a resource but a resource collection? Multipart/mixed is something like a file system: a folder with multiple files that can recursively contain other multipart instances. Of course I’ll experiment with such cases, but I’m afraid that multipart/mixed inside other multipart subtypes should either be ignored or cause an exception. To convince me otherwise, please show a realistic sample and explain how it should work (how it is referenced from the root document and imported).

Images in MHTML become shapes when they are imported into the MS Word or Aspose.Words document model. A shape is treated as a solid (atomic) object, and no parts of it are removed. It is true that if a shape exceeds the page margins, some parts might become invisible. In this case you are free to choose: either increase the page size/margins or decrease the shape scale. We cannot do this in the general import and insertion methods, since we don’t know your intentions.

I think we can support multipart/mixed, but I’m not sure it will be as easy as we’d like.
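
The “decrease the shape scale” option comes down to simple arithmetic, sketched below in Python. This is not an Aspose.Words API: the function name, the default margin and the point values used for A4 are assumptions. A caller would apply the resulting factor to the shape after import, or enlarge the page instead.

```python
A4_WIDTH_PT = 595.3   # approximate A4 width in points (assumed value)
A4_HEIGHT_PT = 841.9  # approximate A4 height in points (assumed value)


def fit_scale(shape_w: float, shape_h: float,
              page_w: float = A4_WIDTH_PT, page_h: float = A4_HEIGHT_PT,
              margin: float = 72.0) -> float:
    """Return the factor (at most 1.0) that makes a shape fit inside the page margins."""
    avail_w = page_w - 2 * margin
    avail_h = page_h - 2 * margin
    if avail_w <= 0 or avail_h <= 0:
        raise ValueError("margins leave no printable area")
    # Never enlarge; shrink by the tighter of the two constraints.
    return min(1.0, avail_w / shape_w, avail_h / shape_h)


# A shape twice as wide as the printable area of a 600x800 page with 50pt margins.
print(fit_scale(1000, 700, page_w=600, page_h=800, margin=50))
# A small shape on A4 needs no scaling.
print(fit_scale(200, 100))
```

Scaling by one factor for both dimensions keeps the aspect ratio, so nothing is cropped, at the cost of a smaller image.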

P.S. You are already quoting parts of my replies. Do you mean anything else?

Regards,

Alex_74 · August 26, 2009, 7:34am

Hello!
I’ve just seen that the long message I wrote last time is not here :(((((

Klepus:
Which RFC do you mean? RFC2557 defines how a multipart message is organized in general.

Yes

Klepus:
It doesn’t provide an analysis of all cases. The question of which cases are meaningless is out of its scope.

It’s difficult to say, because we don’t know which mail clients and servers our customers use.

Klepus:
Suppose you have a MIME archive with multipart/related at the top level. Every item inside it, except for the root document, should correspond to a subsidiary part related to that root document, and the HTML code in the root document refers to them by URLs. How can it refer to a multipart/mixed, since that is not a resource but a resource collection?

I agree, multipart/related inside multipart/mixed is unusual. Can we say that every part (child) of a multipart/related corresponds to the same root part of that multipart/related?
But multipart/mixed inside multipart/alternative should not be a problem.
Suggestions on how to handle this can be found here:
http://segate.sunet.se/cgi-bin/wa?A2=ind9903&L=mhtml&F=P&P=11909

Klepus:
Multipart/mixed is something like a file system: a folder with multiple files that can recursively contain other multipart instances. Of course I’ll experiment with such cases, but I’m afraid that multipart/mixed inside other multipart subtypes should either be ignored or cause an exception. To convince me otherwise, please show a realistic sample and explain how it should work (how it is referenced from the root document and imported).

Here is an interesting web resource with typical test mails.
http://people.dsv.su.se/~jpalme/ietf/mhtml-test/mhtml.html
I think some customers may use this resource to test our software.

Klepus:
Images in MHTML become shapes when they are imported into the MS Word or Aspose.Words document model. A shape is treated as a solid (atomic) object, and no parts of it are removed. It is true that if a shape exceeds the page margins, some parts might become invisible. In this case you are free to choose: either increase the page size/margins or decrease the shape scale. We cannot do this in the general import and insertion methods, since we don’t know your intentions.

I can give an example: a very big image or table contains some important information. It should never be truncated. It would be possible to implement user-specific parameters that say whether it is better to resize the page or to change the DPI of the image (but what to do with tables?). That would be really very nice.
MS Word has the same problem: some images in MHTML get truncated at the page edge, and this is a problem because the image contains very important information.

Klepus:
I think we can support multipart/mixed, but I’m not sure it will be as easy as we’d like.

What is the current status of your effort?

Klepus:
P.S. You are already quoting parts of my replies. Do you mean anything else?

The first quote in a forum post looks different from the later quotes.
Thank you very much!