Extracting Text From MSG and EML

alpcos · March 6, 2013, 4:27am

Hi,

I need to extract plain text (not just the body, but all possible text data) from MSG and EML files in the most efficient way possible. The extracted text will be used for indexing, so no formatting is required, just plain text.

Can you please advise on the best way (in terms of reliability and performance) to do this?

Thanks.

alpcos · March 6, 2013, 5:41am

OK, I have visited all attributes and created a merged text. That will do the job for me, but for some emails the bodyencoding field is set to a value such as UTF8, and if I read bodytext directly, some characters are lost or incorrect. I know the correct encoding from the body encoding field, but how do I use it? How can I get the body text with the correct encoding, so that the text is fully correct?

Please advise.

Thanks.

kashif.iqbal · March 6, 2013, 11:05am

Hi Alp,

Sorry for the delayed response.

Generally, you can decode an array of bytes with any encoding in the following way (the Encoding class is in the System.Text namespace):

Encoding utf8 = Encoding.UTF8;
string strText = utf8.GetString(BytesData);

where BytesData is the byte representation of the text/string.
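If the encoding is only known at run time as a name (for example, the value read from the body encoding field), a minimal sketch could look like the following. It assumes you can get at the raw body bytes; `BodyDecoder`, `DecodeBody`, `rawBody` and `encodingName` are hypothetical names used for illustration, not part of any library API, and the UTF-8 fallback is an assumption you may want to change.

```csharp
using System;
using System.Text;

public static class BodyDecoder
{
    // Hypothetical helper: decodes raw body bytes using an encoding name
    // such as "utf-8", as found in the message's body encoding field.
    public static string DecodeBody(byte[] rawBody, string encodingName)
    {
        Encoding encoding;
        try
        {
            // Look up the encoding by its name.
            encoding = Encoding.GetEncoding(encodingName);
        }
        catch (ArgumentException)
        {
            // Unknown or unsupported name: fall back to UTF-8 (an assumption).
            encoding = Encoding.UTF8;
        }
        return encoding.GetString(rawBody);
    }
}
```

For example, BodyDecoder.DecodeBody(BytesData, "utf-8") would decode the same bytes as the snippet above. Note that this only helps if the bytes themselves are intact; if the text was already decoded with the wrong encoding earlier, the lost characters cannot always be recovered.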

Can you please provide us with a sample message file in which the characters are lost or incorrect? We’ll also look into it on our side.