Missing space from HTML Body text

We are seeing an issue in HTML bodies where a space is missing in the extracted text after underlined words. For instance, “For example this” becomes “For examplethis” if “example” was underlined in the original email.

So far we have observed this in MSG, EML, and MHT formats, but we have not seen it occur for other HTML formatting tags such as bold or italics.

The attached archive has a picture of the output and the files used to create it.

Sample code for text fetching

if(mailMessage.isBodyHtml()) {
    return mailMessage.getHtmlBodyText(true);
}

space_issue.zip (144.9 KB)

We are seeing this in Java v24.8.

@cbruner

It seems you are encountering an issue where a space is missing in the extracted text after underlined words when processing HTML bodies in email formats like MSG, EML, and MHT. This can be particularly problematic when the underlined text is followed by other text, resulting in concatenated words.

While the specific issue you are facing is not directly addressed in the available documentation, it is important to note that when converting HTML content to plain text, certain HTML elements may not be handled consistently. For instance, while underlined text may not preserve spaces correctly, other formatting such as bold or italics might not exhibit the same problem.

To troubleshoot this issue, you might consider the following approaches:

  1. Post-Processing the Extracted Text: After extracting the plain text, you could implement a post-processing step to check for instances where a word is immediately followed by another word without a space. You can insert a space in such cases based on your specific requirements.

  2. Using Aspose.Words for Enhanced Handling: If you are not already using it, consider leveraging Aspose.Words in conjunction with Aspose.Email. Aspose.Words is designed to handle complex HTML structures and may provide better results when converting HTML to plain text.

Here’s a simple example of how you might implement a post-processing step in your code:

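String htmlBodyText = mailMessage.getHtmlBodyText(true);
// Caution: (\w)(?=\w) matches every pair of adjacent word characters,
// so this puts a space between all letters, not only between joined words
String processedText = htmlBodyText.replaceAll("(\\w)(?=\\w)", "$1 ");
return processedText;

This code uses a regular expression to find each word character that is immediately followed by another word character and inserts a space after it. Because the pattern also matches inside ordinary words, it cannot tell two joined words apart from a single word, so it is not usable as written; a plain-text pass would need some other signal for where the missing space belongs.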
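
A quick way to check what the pattern (\w)(?=\w) does is to run it on the phrase from the original post. The sketch below is plain Java with no Aspose dependency; the class and method names are hypothetical and exist only for this illustration.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; not part of Aspose.Email.
class NaiveSpacingDemo {
    // The pattern from the reply: a word character followed by another word character.
    private static final Pattern ADJACENT_WORD_CHARS = Pattern.compile("(\\w)(?=\\w)");

    // Inserts a space after every word character that is followed by another one.
    static String naiveInsertSpaces(String text) {
        return ADJACENT_WORD_CHARS.matcher(text).replaceAll("$1 ");
    }

    public static void main(String[] args) {
        // "examplethis" is the joined form reported in the original post.
        System.out.println(naiveInsertSpaces("For examplethis"));
        // prints "F o r e x a m p l e t h i s"
    }
}
```

The output shows that every letter is separated, which is why a text-only regular expression of this kind cannot restore the single missing space.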

If the problem persists, consider reaching out to the Aspose support community or checking for updates in the library that might address this specific issue.


Hello @cbruner,

The issue could be related to the behavior of the getHtmlBodyText method. We will investigate the issue and will respond to you soon.

@cbruner
We have opened the following new ticket(s) in our internal issue tracking system and will deliver their fixes according to the terms mentioned in Free Support Policies.

Issue ID(s): EMAILJAVA-35314


After the 24.8.1 hotfix we are still seeing related issues with MHT files (see the attached screenshot). The same sample data from the original post can be used to reproduce them. On the left is the original native text, and on the right is the extracted text. “document with” became “doc=umentwith”.

mht_space.jpg (44.6 KB)

Hello @cbruner ,

We have re-run the tests and were unable to reproduce the error.
Please provide a code sample that demonstrates the issue.

// Load the sample message
String fileName = "space_issue.msg";

MailMessage mailMessage = MailMessage.load(fileName);
if (mailMessage.isBodyHtml()) {
    // The space after the underlined word should survive extraction
    Assert.assertTrue(mailMessage.getHtmlBodyText(true).contains("underline contained"));
}
System.out.println(mailMessage.getHtmlBodyText(true));
// Re-save as MHTML and EML to check the other formats
mailMessage.save("space_issue_out.mhtml", SaveOptions.getDefaultMhtml());
mailMessage.save("space_issue_out.eml");

output text:

Here is a document with an underline contained within
Here is a document with italics text
Here is a document with bold text
Here is a document with a strikethrough text

space_issue_out.eml.png (3.4 KB)
space_issue_out.mhtml.png (2.7 KB)

It looks like there are some file issues around “document with” that may not be avoidable. However, “strikethroughtext” now occurs in both MHT and EML formats, similar to the original bug with underlines. This occurs in the aspose-email-24.8.0.1-jdk16.jar hotfix version we received; prior versions only had the underline bug.

The sample code below was tested with the .mht file; debugger screenshots are attached.

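HtmlLoadOptions options = new HtmlLoadOptions();
// Load the same file twice: once with default options, once with HtmlLoadOptions
try (MailMessage m1 = MailMessage.load(filePath);
     MailMessage m2 = MailMessage.load(filePath, options)) {
    // Both bodies are inspected in the debugger
    var m1Body = m1.getHtmlBodyText(true);
    var m2Body = m2.getHtmlBodyText(true);
}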

mthBodies.png (45.6 KB)
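
A comparison like this can also be automated instead of read off a debugger. The following is a minimal sketch in plain Java with no Aspose dependency: it flags expected phrases that are missing from the extracted text as written but present once whitespace is ignored, which is the signature of a dropped space. The class name, method names, and sample strings are hypothetical and only mirror the phrases reported in this thread.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; not part of Aspose.Email.
class DroppedSpaceCheck {
    // Removes all whitespace so "strikethrough text" and "strikethroughtext" compare equal.
    static String stripWhitespace(String s) {
        return s.replaceAll("\\s+", "");
    }

    // Returns the expected phrases that appear in the extracted text only with their spaces removed.
    static List<String> findJoinedPhrases(String extracted, List<String> expectedPhrases) {
        String compact = stripWhitespace(extracted);
        List<String> joined = new ArrayList<>();
        for (String phrase : expectedPhrases) {
            boolean presentAsWritten = extracted.contains(phrase);
            boolean presentWithoutSpaces = compact.contains(stripWhitespace(phrase));
            if (!presentAsWritten && presentWithoutSpaces) {
                joined.add(phrase);
            }
        }
        return joined;
    }

    public static void main(String[] args) {
        // Assumed sample of extracted text showing the reported strikethrough symptom.
        String extracted = "Here is a document with an underline contained within\n"
                + "Here is a document with a strikethroughtext";
        List<String> phrases = List.of("underline contained", "strikethrough text");
        System.out.println(findJoinedPhrases(extracted, phrases));
        // prints [strikethrough text]
    }
}
```

In practice the extracted string would come from getHtmlBodyText(true), and the phrase list from the known content of the sample files.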

@cbruner ,

We have reproduced the issue with the code you provided. We are currently working on a solution.