Email to Text -- First line in header

Hello,


When we convert an email to text, we get a single line above the email header that shows the email Author.

Examples:
----------------------------

Zoom.Quiet

From: Zoom.Quiet

Sent: Mon, 14 Aug 2006 10:25:01 AM GMT

To: ubuntu-zh@lists.ubuntu.com

Subject: [Ubuntu-zh] [Wiki]样式错乱!

Attachments: 2006-08-14-181636_418x105_scrot.png

----------------------------

Meyers

From: Meyers

Sent: Wed, 2 Jan 2002 4:27:27 AM GMT

To: Solberg;Geir;Williams III;Bill

Cc: DL-Portland Volume Mgmt

Subject: DA ahead Cali schedule taken to the HA market 01/02/02

Importance: Low

----------------------------

Is this configurable in any way? I know I could always strip the first line in the text output, but wanted to see if there was a way to have Aspose handle it on the way out.

Thanks!
Eric
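
For reference, the fallback mentioned above (stripping the first line from the text output) might look roughly like the sketch below. It uses only the JDK and does not touch any Aspose API; `FirstLineStripper` and `stripLeadingAuthorLine` are hypothetical names chosen for illustration. It assumes the extra line matches the sender's display name exactly, apart from surrounding whitespace and an optional leading BOM.

```java
public class FirstLineStripper {

    /**
     * Returns the text without its first line when that line equals the
     * sender's display name; otherwise returns the text unchanged.
     * A leading BOM (U+FEFF), if present, is kept in the result.
     */
    public static String stripLeadingAuthorLine(String text, String displayName) {
        if (text == null || displayName == null) {
            return text;
        }
        // Ignore a leading BOM when locating and comparing the first line.
        int start = (!text.isEmpty() && text.charAt(0) == '\uFEFF') ? 1 : 0;
        int eol = text.indexOf('\n', start);
        if (eol < 0) {
            return text; // single-line text: nothing to strip safely
        }
        // trim() also removes the '\r' of a Windows line ending.
        String firstLine = text.substring(start, eol).trim();
        if (!firstLine.equals(displayName.trim())) {
            return text;
        }
        // Keep the BOM (if any) and drop the first line plus its newline.
        return text.substring(0, start) + text.substring(eol + 1);
    }

    public static void main(String[] args) {
        String sample = "Meyers\r\nFrom: Meyers\r\nSent: Wed, 2 Jan 2002 4:27:27 AM GMT\r\n";
        System.out.print(stripLeadingAuthorLine(sample, "Meyers"));
    }
}
```

Comparing against the display name first, instead of dropping the first line unconditionally, keeps the output intact if the extra line is ever absent.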



Hi Eric,


Thank you for contacting the Aspose support team.

We need to reproduce this issue on our side, so please share the source emails along with the code used to convert them to text. This will help us observe the problem and assist accordingly. If possible, please provide a simple console application that we can compile and run here.

Thanks Kashif,


I've attached an MSG and an EML, as well as a class with 2 unit tests outlining the behavior I'm seeing. Let me know if you need anything else. Thanks!

-Eric

If anyone else wants to see the code without downloading the zip and stuff, here it is:

import com.aspose.email.*;
import com.aspose.words.Document;
import com.aspose.words.SaveFormat;
import com.aspose.words.TxtSaveOptions;
import com.google.common.base.Charsets;
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.junit.Before;
import org.junit.Test;

import java.io.*;
import java.nio.charset.Charset;
import java.util.TimeZone;

import static org.junit.Assert.assertEquals;


public class TestEmailExtractor {

   private static final Log log = LogFactory.getLog(TestEmailExtractor.class);

   public static final char UTF8_BOM = '\ufeff';

   @Before
   public void before() throws Exception {

      String licensePath = TestEmailExtractor.class.getResource("/META-INF/Aspose.Total.Java.lic").getPath();

      // Set license. Provide full path and license file name
      com.aspose.email.License licEmail = new com.aspose.email.License();
      licEmail.setLicense(licensePath);

      com.aspose.words.License licWords = new com.aspose.words.License();
      licWords.setLicense(licensePath);
   }


   private File getText(MailMessage mailMessage, TimeZone timezone) throws Exception {

      try (ByteArrayOutputStream outputStream = new ByteArrayOutputStream()) {

         MhtSaveOptions saveOptions = SaveOptions.getDefaultMhtml();
         MhtMessageFormatter mailFormatter = new MhtMessageFormatter();

         // Properly format date/times
         saveOptions.setMhtFormatOptions(MhtFormatOptions.None);
         if (mailMessage.getDate() != null) {
            mailMessage.setTimeZoneOffset(timezone.getOffset(mailMessage.getDate().getTime()));
         }
         mailFormatter.setDateTimeFormat("ddd, d MMM yyyy h:mm:ss a '" + timezone.getID() + "'");
         mailFormatter.format(mailMessage);
         mailMessage.save(outputStream, saveOptions);
 
         File textFile = File.createTempFile("text-file-", ".txt");

         try (final FileOutputStream txtOutputStream = new FileOutputStream(textFile);
             final InputStream inputStream = new ByteArrayInputStream(outputStream.toByteArray())) {

            final Document document = new Document(inputStream);

            TxtSaveOptions txtOptions = new TxtSaveOptions();
            txtOptions.setSaveFormat(SaveFormat.TEXT);
            txtOptions.setEncoding(Charsets.UTF_8);
            txtOptions.setExportHeadersFooters(true);
            txtOptions.setPrettyFormat(true);
            txtOptions.setPreserveTableLayout(true);

            document.save(txtOutputStream, txtOptions);

            return textFile;
         }
      }
   }


   private String getFirstLine(File textFile) throws IOException {
      try (BufferedReader reader = new BufferedReader(new InputStreamReader(new FileInputStream(textFile), "UTF-8"))) {

         // Strip this BOM (why do we even have this?)
         reader.mark(4);
         if (UTF8_BOM != reader.read()) reader.reset();

         return reader.readLine();
      }
   }

   @Test
   public void TestExtractTextFromMsg() throws Exception {

      File file = new File("98.msg");
      TimeZone timezone = TimeZone.getTimeZone("GMT");

      MapiMessage message = MapiMessage.fromFile(file.toString());
      MailMessageInterpretor mi = MailMessageInterpretorFactory.getInstance().getIntepretor(message.getMessageClass());
      MailMessage mailMessage = mi.interpret(message);

      String displayName = mailMessage.getFrom().getDisplayName();
      log.debug("DisplayName: " + displayName);

      File textFile = getText(mailMessage, timezone);
      log.debug("TextFile: " + textFile);

      String firstLine = getFirstLine(textFile);
      log.debug("FirstLine: " + firstLine);

      assertEquals("First line of text output is equal to message FROM", displayName, firstLine);
   }

   @Test
   public void TestExtractTextFromEml() throws Exception {

      File file = new File("319.eml");
      TimeZone timezone = TimeZone.getTimeZone("GMT");

      EmlLoadOptions options = new EmlLoadOptions();
      options.setPrefferedTextEncoding(Charset.forName("UTF-32"));
      options.setPreserveTnefAttachments(true);

      MailMessage mailMessage = MailMessage.load(file.toString(), options);

      String displayName = mailMessage.getFrom().getDisplayName();
      log.debug("DisplayName: " + displayName);

      File textFile = getText(mailMessage, timezone);
      log.debug("TextFile: " + textFile);

      String firstLine = getFirstLine(textFile);
      log.debug("FirstLine: " + firstLine);

      assertEquals("First line of text output is equal to message FROM", displayName, firstLine);
   }
}
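
On the BOM question in the comment inside getFirstLine: Java's UTF-8 decoder does not drop a leading byte order mark, so if the writer emits one (presumably what happens here when the text is saved as UTF-8), it shows up as the character U+FEFF at the start of the first line. Below is a self-contained sketch of the same mark/reset approach using only the JDK; `BomAwareFirstLine` and `firstLine` are hypothetical names used for illustration.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

public class BomAwareFirstLine {

    private static final int UTF8_BOM = 0xFEFF;

    /** Returns the first line of UTF-8 encoded bytes, skipping a leading BOM if present. */
    public static String firstLine(byte[] utf8Bytes) {
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(new ByteArrayInputStream(utf8Bytes), StandardCharsets.UTF_8))) {
            // Peek at the first character; rewind unless it is the BOM.
            reader.mark(1);
            if (reader.read() != UTF8_BOM) {
                reader.reset();
            }
            return reader.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] withBom = "\uFEFFZoom.Quiet\nFrom: Zoom.Quiet\n".getBytes(StandardCharsets.UTF_8);
        System.out.println(firstLine(withBom));
    }
}
```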

Hi Eric,


While saving the email to MHTML, please use MhtFormatOptions.HideExtraPrintHeader option to avoid writing the extra information to output as shown in the code sample below. You can also refer to our documentation section, Saving to MHTML with Optional Settings, for further information in this regard.

Sample Code

MailMessage eml = MailMessage.load(dataDir + "test.eml");
// Save as Mht with header
MhtSaveOptions mhtSaveOptions = new MhtSaveOptions();
int iSaveOptions = MhtFormatOptions.WriteHeader | MhtFormatOptions.HideExtraPrintHeader;
mhtSaveOptions.setMhtFormatOptions(iSaveOptions);
eml.save(dataDir + "ConvertingToMHTMLWithOptionalSettings_out.mht", mhtSaveOptions);

Thanks Kashif,

I adjusted my code to use those mht format options, and now my text file has 2 headers. It looks like the top-most header is correct. I tried the setting:

txtOptions.setExportHeadersFooters(false);

with similar results.
I must be missing another setting somewhere.
Example text output:
From: Zoom.Quiet
Sent: Mon, 14 Aug 2006 10:25:01 +0000
To: ubuntu-zh@lists.ubuntu.com
Subject: [Ubuntu-zh] [Wiki]样式错乱!
Attachments: 2006-08-14-181636_418x105_scrot.png

Zoom.Quiet
From: Zoom.Quiet
Sent: Mon, 14 Aug 2006 10:25:01 AM GMT
To: ubuntu-zh@lists.ubuntu.com
Subject: [Ubuntu-zh] [Wiki]样式错乱!
Attachments: 2006-08-14-181636_418x105_scrot.png

如图!
用户登录 Wiki 后,一些快捷链接在主导航之后了!不能点击!

实际上,登录链接就已经不能点击了!


"""Time is unimportant, only life important!
blogging : http://blog.zoomquiet.org/pyblosxom/
wiki enter: http://wiki.woodpecker.org.cn/moin/ZoomQuiet
in douban: http://www.douban.com/people/zoomq/

Hi Eric,


If we run the following simplified code with your sample input files, the intermediate output MHTML and finalized text files generated do not exhibit the issue of dual headers as you have specified. Could you please make sure that you are using the latest versions of the Aspose.Email and Aspose.Words APIs at your end?

Sample Code

MailMessage eml = MailMessage.load("845521\\98.msg", new MsgLoadOptions());
// Save as Mht with header
MhtSaveOptions mhtSaveOptions = new MhtSaveOptions();
int iSaveOptions = MhtFormatOptions.WriteHeader | MhtFormatOptions.HideExtraPrintHeader;
mhtSaveOptions.setMhtFormatOptions(iSaveOptions);

eml.save("845521\\98_out.mhtml", mhtSaveOptions);
final Document document = new Document("845521\\98_out.mhtml");

TxtSaveOptions txtOptions = new TxtSaveOptions();
txtOptions.setSaveFormat(SaveFormat.TEXT);
txtOptions.setEncoding(Charsets.UTF_8);
txtOptions.setExportHeadersFooters(true);
txtOptions.setPrettyFormat(true);
txtOptions.setPreserveTableLayout(true);

document.save("845521\\98_out.txt", txtOptions);

OK, it would appear that formatting the email through the (now deprecated) MhtMessageFormatter was at fault for the dual headers. I'm not sure what that was about, so I figured I'd try to format my date/times the "non-deprecated way". I couldn't find documentation on how to do that, but it turns out there is a "DateTime" key you can add to the format templates to do what MhtMessageFormatter was doing.


Is there a page in the documentation that outlines all of the format templates and how they’re used?
Here is the “simplified” working code from above, along with the date/time formatting I required:

Thanks for the help Kashif.

@Test
public void TestExtractTextFromMsgAsposeCode() throws Exception {
   MailMessage eml = MailMessage.load("98.msg", new MsgLoadOptions());
   TimeZone timezone = TimeZone.getTimeZone("GMT");
   String dateTimeFormat = "ddd, d MMM yyyy h:mm:ss a '" + timezone.getID() + "'";

   // Save as Mht with header
   MhtSaveOptions mhtSaveOptions = MhtSaveOptions.getDefaultMhtml();
   mhtSaveOptions.setMhtFormatOptions(MhtFormatOptions.WriteHeader | MhtFormatOptions.HideExtraPrintHeader);

   if (mhtSaveOptions.getFormatTemplates().containsKey("DateTime")) {
      mhtSaveOptions.getFormatTemplates().set_Item("DateTime", dateTimeFormat);
   } else {
      mhtSaveOptions.getFormatTemplates().add("DateTime", dateTimeFormat);
   }

   eml.setTimeZoneOffset(timezone.getOffset(eml.getDate().getTime()));
   eml.save("98_out.mhtml", mhtSaveOptions);

   final Document document = new Document("98_out.mhtml");

   TxtSaveOptions txtOptions = new TxtSaveOptions();
   txtOptions.setSaveFormat(SaveFormat.TEXT);
   txtOptions.setEncoding(Charsets.UTF_8);
   txtOptions.setExportHeadersFooters(true);
   txtOptions.setPrettyFormat(true);
   txtOptions.setPreserveTableLayout(true);

   document.save("98_out.txt", txtOptions);
}
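
To see what the "DateTime" pattern above is meant to produce without running the Aspose pipeline, here is a rough JDK-only sketch. It assumes the pattern's "ddd" token means the abbreviated day name (as in .NET-style patterns), for which the `java.time` equivalent is "EEE"; that mapping is an assumption, not something confirmed in this thread. `HeaderDateFormatDemo`, `formatSent`, and `offsetMillis` are hypothetical names.

```java
import java.time.ZoneOffset;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;
import java.util.Locale;
import java.util.TimeZone;

public class HeaderDateFormatDemo {

    /** Formats a timestamp like the "Sent:" header line, with the zone ID appended as a literal. */
    public static String formatSent(ZonedDateTime sent, TimeZone timezone) {
        // "EEE" stands in for the "ddd" token of the pattern used in the thread (assumed equivalent).
        DateTimeFormatter formatter = DateTimeFormatter.ofPattern(
                "EEE, d MMM yyyy h:mm:ss a '" + timezone.getID() + "'", Locale.ENGLISH);
        return sent.withZoneSameInstant(timezone.toZoneId()).format(formatter);
    }

    /** Offset from UTC in milliseconds at the given instant, the kind of value passed to setTimeZoneOffset above. */
    public static int offsetMillis(ZonedDateTime sent, TimeZone timezone) {
        return timezone.getOffset(sent.toInstant().toEpochMilli());
    }

    public static void main(String[] args) {
        TimeZone gmt = TimeZone.getTimeZone("GMT");
        ZonedDateTime sent = ZonedDateTime.of(2006, 8, 14, 10, 25, 1, 0, ZoneOffset.UTC);
        System.out.println(formatSent(sent, gmt));   // Mon, 14 Aug 2006 10:25:01 AM GMT
        System.out.println(offsetMillis(sent, gmt)); // 0
    }
}
```

The formatted value matches the "Sent:" line shown in the example output earlier in the thread.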

Hi Eric,


You can find information about the format templates in one of our documentation examples, Rendering events during conversion to MHTML. We are glad that the suggested code sample helped. Please feel free to write to us if you have any further questions.