Extraction of body text is extremely slow for some MSG files

Samples.zip (1.8 MB)

Aspose Team,

We use the Aspose.Email Java package to extract the body text of MSG files and found that extraction is extremely slow for some MSG files containing images that link to external URLs. Some client files between 5 and 9 MB in size take from 45 minutes to 2.5 hours to finish extraction.

Below is the sample code that reproduces the problem, and attached are two non-client sample files (Sample1.msg and Sample2.msg). Sample2.msg was created by merging multiple copies of Sample1.msg, so its contents are repeated. Body text extraction from Sample2.msg takes about 4 seconds, which is much shorter than with the client files. I hope the sample files help you analyze the problem.

Because the sample files have an HTML body, we use MailMessage:getHtmlBodyText() to extract the body text. It is this method that takes a long time. Can you tell us what happens inside the method and why it takes so long?

We tried the MailMessage:getBody() method and it is very fast (less than one second). However, the extracted text is slightly different from that returned by MailMessage:getHtmlBodyText(). Which method is better for MSG files with an HTML body?

We also tried the MapiMessage:getBody() method, and it is fast and it has the same result as MailMessage:getBody().
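To pin down where the two extracted texts diverge, a small comparison helper can be useful. The following is only a sketch using the standard library; the class name BodyTextDiff, its methods, and the placeholder strings are illustrative assumptions, not part of Aspose.Email and not real output from the sample files.

```java
public class BodyTextDiff {

    /** Collapses every run of whitespace to a single space and trims the ends. */
    public static String normalize(String text) {
        return text.replaceAll("\\s+", " ").trim();
    }

    /** Returns the index of the first differing character, or -1 if the strings are equal. */
    public static int firstDifference(String a, String b) {
        int common = Math.min(a.length(), b.length());
        for (int i = 0; i < common; i++) {
            if (a.charAt(i) != b.charAt(i)) {
                return i;
            }
        }
        // One string is a prefix of the other, or they are identical.
        return a.length() == b.length() ? -1 : common;
    }

    public static void main(String[] args) {
        // Placeholder strings (assumption); in practice these would be the
        // results of getBody() and getHtmlBodyText() for the same message.
        String fromGetBody = "Hello,\r\n\r\nplease see the report.";
        String fromGetHtmlBodyText = "Hello,\n please see  the report.";

        int raw = firstDifference(fromGetBody, fromGetHtmlBodyText);
        int normalized = firstDifference(normalize(fromGetBody), normalize(fromGetHtmlBodyText));

        System.out.println("First raw difference at index: " + raw);
        System.out.println("First difference after whitespace normalization: " + normalized);
    }
}
```

If the index after normalization is -1, the two methods differ only in whitespace and line breaks, which may or may not matter for a given use case.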

We noticed that MailMessage:getHtmlBodyText() is deprecated. What is the recommended method to replace it?

The operating system is Ubuntu 18.04. Java version is 1.8. Aspose PDF java package is 21.3.

import com.aspose.email.*;
import java.time.LocalDateTime;

public class GetMsgBodyText {

    public static void main(String[] args) {
        System.out.println(LocalDateTime.now() + " --- Start");

        try {
            String filepath = "/home/ubuntu/testdirs/testdir_msg_with_links/Sample1.msg";

            System.out.println(LocalDateTime.now() + " --- load mapiMessage");
            MapiMessage mapiMessage = MapiMessage.load(filepath, new MsgLoadOptions());

            System.out.println(LocalDateTime.now() + " --- load mailMessage");
            MailMessage mailMessage = MailMessage.load(filepath, new MsgLoadOptions());

            String bodyText;
            if (mailMessage.isBodyHtml()) {
                System.out.println(LocalDateTime.now() + " --- begin getHtmlBodyText");
                bodyText = mailMessage.getHtmlBodyText();
                System.out.println(LocalDateTime.now() + " --- finished getHtmlBodyText");
            } else {
                System.out.println(LocalDateTime.now() + " --- begin getBody");
                bodyText = mailMessage.getBody();
                System.out.println(LocalDateTime.now() + " --- finished getBody");
            }
        } catch (Exception ex) {
            ex.printStackTrace();
        }

        System.out.println(LocalDateTime.now() + " --- Done");
    }

}
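The sample above reads durations off printed timestamps. A possible alternative is to measure each call directly with System.nanoTime(). This is a minimal sketch, assuming a hypothetical helper class TimedCall that is not part of Aspose.Email; the sleeping lambda only stands in for the real extraction call.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.TimeUnit;

public class TimedCall {

    /** Holds the value returned by a call together with how long the call took. */
    public static final class Result<T> {
        public final T value;
        public final long elapsedMillis;

        Result(T value, long elapsedMillis) {
            this.value = value;
            this.elapsedMillis = elapsedMillis;
        }
    }

    /** Runs the task once and measures its wall-clock time with System.nanoTime(). */
    public static <T> Result<T> time(Callable<T> task) throws Exception {
        long start = System.nanoTime();
        T value = task.call();
        long elapsedNanos = System.nanoTime() - start;
        return new Result<>(value, TimeUnit.NANOSECONDS.toMillis(elapsedNanos));
    }

    public static void main(String[] args) throws Exception {
        // Stand-in workload (assumption). In the scenario above this would be
        // something like () -> mailMessage.getHtmlBodyText() or () -> mailMessage.getBody().
        Result<String> result = time(() -> {
            Thread.sleep(50);
            return "body text";
        });
        System.out.println(result.value.length() + " chars extracted in " + result.elapsedMillis + " ms");
    }
}
```

Wrapping both extraction methods in the same helper makes their durations directly comparable for a given file.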

@xyang,
Thank you for the issue description. I added a ticket with ID EMAILJAVA-34892 to our tracking system. Our development team will investigate this case. I will answer your questions later.

@xyang,
Our development team investigated the issue and made some optimizations. Could you check the issue on your side with the client files? The sample files you provided do not reproduce the issue well, so we need a more suitable message file. We have prepared a special build for you: Aspose.Email 21.5.2. Please check the problem with this build.

Please use the getHtmlBodyText(true) method instead.
API Reference: getHtmlBodyText method

Thank you very much. I will try and let you know.

@Andrey_Potapov
I tested three client sample files (5.3, 6.4 and 8.5 MB respectively) with the recommended method MailMessage::getHtmlBodyText(true) using the special build Aspose.Email 21.5.2. Extraction of body text took 7, 9 and 28 minutes respectively. Previously it took 75, 44 and 173 minutes respectively.

Note that the extracted texts are the same as before.
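For reference, the timings above work out to speed-ups of roughly 11x, 5x and 6x. The short sketch below just redoes that arithmetic; the class name SpeedupSummary is a hypothetical name chosen for illustration, and the numbers are the ones reported in this post.

```java
import java.util.Locale;

public class SpeedupSummary {

    /** Ratio of the old duration to the new one; values above 1 mean the new build is faster. */
    public static double speedup(double beforeMinutes, double afterMinutes) {
        return beforeMinutes / afterMinutes;
    }

    public static void main(String[] args) {
        // File sizes and durations as reported above, in the same order.
        double[] sizesMb = {5.3, 6.4, 8.5};
        double[] beforeMinutes = {75, 44, 173};
        double[] afterMinutes = {7, 9, 28};

        for (int i = 0; i < sizesMb.length; i++) {
            System.out.println(String.format(Locale.ROOT,
                    "%.1f MB: %.0f min -> %.0f min (%.1fx faster)",
                    sizesMb[i], beforeMinutes[i], afterMinutes[i],
                    speedup(beforeMinutes[i], afterMinutes[i])));
        }
    }
}
```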

@xyang,
I am glad to know that the optimization works. It would be great if you could share other MSG examples to investigate this issue further.

@Andrey_Potapov,
I am sorry that we do not have the client permission to share the sample files now. If things change later I will let you know.

Can you tell me when the fix will be officially released? Are you considering releasing it as a patch?

@xyang,
This hotfix will be included in Aspose.Email 21.6. This version will be released towards the end of June or early July.

@Andrey_Potapov,
Thank you very much.

The issues you have found earlier (filed as EMAILJAVA-34892) have been fixed in this update.