DocumentBuilder.InsertHtml problem with multi-byte characters

Hello,

I’m attempting to insert Thai characters into a merge field. (The Thai characters were entered using an HTML editor and contain some HTML content.)

The problem is that when the data is merged, it appears as squares in the output document. I’m using Times New Roman as the Word template font, and when I paste the Thai characters directly into the template, they look fine.

I think my problem is with DocumentBuilder.InsertHtml … any suggestions on how I might be able to solve this?

Here’s the pertinent code from my application:

if (e.FieldValue != null)
{
    DocumentBuilder builder = new DocumentBuilder(e.Document);
    builder.MoveToMergeField(e.DocumentFieldName);
    builder.InsertHtml(e.FieldValue.ToString());
}
else
{
    // The field isn’t mapped to a value in the current data table,
    // so leave the field intact so that it will still be available for other
    // data tables (i.e., from an ExecuteMergeWithRegions call).
}

Here’s the body of the letter (in Thai):

ขอบคุณที่ติดต่อ พร็อกเตอร์ & แกมเบิ้ล เนื่องจากการติดต่อไปในครั้งที่แล้ว เรื่องขอรับผลิตภัณฑ์ Olay ที่ท่านได้ประสบปัญหา

และทางจากเรายังไม่ได้รับการตอบสนองจากท่าน

อย่างไรก็ตาม ถ้าเรายังไม่ได้รับการตอบสนองจากท่านภายในสามสิบ (30) วัน
ทางเราจะประเมินว่าท่านไม่ต้องการที่จะติดตามผล
และเรื่องของท่านจะถูกปิดลง

ถ้าท่านต้องการจะติดต่อเราโดยตรง กรุณาโทรฟรี ที่หมายเลข

Finally, I’ve attached the output document for your review.

Thanks for your thoughts.

Hi

Thanks for your inquiry. I think you should specify the LocaleId before inserting the HTML snippet. Please see the following code:

Document doc = new Document();
DocumentBuilder builder = new DocumentBuilder(doc);
builder.Font.LocaleId = 1054; //Specify locale ID
builder.InsertHtml("ขอบคุณที่ติดต่อ พร็อกเตอร์ & แกมเบิ้ล เนื่องจากการติดต่อไปในครั้งที่แล้ว เรื่องขอรับผลิตภัณฑ์ Olay ที่ท่านได้ประสบปัญหา และทางจากเรายังไม่ได้รับการตอบสนองจากท่าน");
doc.Save("out.doc");

You can find a list of LocaleIDs here:
[MS-OE376]: Part 4 Section 7.6.2.39, LCID (Locale ID) | Microsoft Learn

Hope this helps.

Best regards.

Thank you, Alexey.

You were exactly right on this, and for Thai, it definitely did the trick.

Now I’m faced with a bigger issue: in my application, the HTML content might be in English, Chinese, Thai, Japanese, or possibly some other language.

I suppose I could make the user select which language the content is in, and thus associate a LocaleId, but it would be better if I didn’t have to do that.

Even though this isn’t an Aspose problem at this point, if you have any further suggestions on how your solution could be globalized, I’d be grateful.

Mike

Hello!

Thank you for your clarification. It really is bothersome to set the locale ID before every insertion or whenever the language changes. It would be better to perform the detection inside InsertHtml. For further investigation I have created a new issue in our defect database:

#5285 – Consider locale detection or locale-free insertion with InsertHtml

I cannot promise to fix it by any particular date, but at least we’ll address it. In the general case, inserted text can contain any mix of characters, which is what makes this difficult. What should a detection algorithm do with such a mix? It’s probably best to create several text runs with different locale IDs. There may be other pitfalls, so it’s better to have an option to switch detection off and set the locale manually.

As a workaround, you can try the idea described above in custom code. The character ranges that correspond to English, Chinese, Thai and Japanese are known. You can analyze the fragments and choose a suitable locale for each of them, or split them into smaller parts in complex cases.
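
To make the workaround more concrete, here is a minimal sketch of the range-based analysis. It is only an illustration, not part of Aspose.Words: the class LocaleGuesser and both of its methods are hypothetical names, and the range-to-locale mapping is an assumption (in particular, CJK ideographs are mapped to 2052, Chinese, although Japanese text uses them too). It works on plain text; real HTML would need its tags handled separately, because a split can land in the middle of a tag.

```csharp
using System.Collections.Generic;
using System.Text;

// Hypothetical helper, not an Aspose.Words API.
public static class LocaleGuesser
{
    // Guess a Word locale ID (LCID) from a single UTF-16 code unit.
    // The mapping below is an assumption made for this sketch.
    public static int GuessLocaleId(char c)
    {
        if (c >= '\u0E00' && c <= '\u0E7F') return 1054; // Thai block -> th-TH
        if (c >= '\u3040' && c <= '\u30FF') return 1041; // Hiragana and Katakana -> ja-JP
        if (c >= '\u4E00' && c <= '\u9FFF') return 2052; // CJK ideographs -> zh-CN (also used in Japanese)
        return 1033;                                     // anything else -> en-US
    }

    // Split text into consecutive fragments that share one guessed locale.
    // Non-letters (spaces, digits, punctuation, combining marks) stay in
    // the fragment that is currently open.
    public static List<KeyValuePair<int, string>> SplitByLocale(string text)
    {
        var runs = new List<KeyValuePair<int, string>>();
        var current = new StringBuilder();
        int currentLocale = 1033;

        foreach (char c in text)
        {
            bool neutral = !char.IsLetter(c);
            int locale = (neutral && current.Length > 0) ? currentLocale : GuessLocaleId(c);

            // Close the open fragment when the guessed locale changes.
            if (current.Length > 0 && locale != currentLocale)
            {
                runs.Add(new KeyValuePair<int, string>(currentLocale, current.ToString()));
                current.Clear();
            }

            currentLocale = locale;
            current.Append(c);
        }

        if (current.Length > 0)
            runs.Add(new KeyValuePair<int, string>(currentLocale, current.ToString()));

        return runs;
    }
}
```

Each returned pair holds a locale ID and a fragment, so you could set builder.Font.LocaleId to the key before inserting the value, in the same way as in the Thai example earlier in this thread.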

Regards,

For further clarification, it turns out I only need to set the LocaleId for Thai (at least so far). For Chinese and Japanese, the inserted HTML works with the default LocaleId (which I assume to be 1033 for English-US).

Thank you for this detail. We’ll look into why this happens. Maybe the behavior of the default locale depends on the environment.

Now it appears I’m having difficulty merging Thai data using the MailMerge.Execute method, too:
doc.MailMerge.Execute(ds)

In the attached document, the little squares appear for FirstName and LastName (which come from a very simple data table with those fields).

Is there a way to specify the LocaleId when not using DocumentBuilder.InsertHtml? In other words, is it possible to set the LocaleId for the entire merge process?

Interestingly, some of the Thai characters in the FirstName and LastName appear, while other characters do not. Very puzzling!

Also, I’m already setting the CurrentCulture for the Thread, so that didn’t fix it.
System.Threading.Thread.CurrentThread.CurrentCulture;

Thanks!

I found the solution to the problem…

Changing LCID of WORD documents

Hello!

It’s good that you found a solution. Basically it should help. But there could be situations with a mix of languages, so you’ll need several locales in one document. If you set them on styles, you might need several styles that differ only in locale. Please let us know if you have further questions or difficulties with this.
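
As an illustration of the style-based approach, here is a minimal sketch that sets the locale on the Normal style before running the merge. It is not necessarily what the linked solution does. The file names and the ThaiMergeSketch class are hypothetical, and it assumes the merged text inherits its locale from the Normal style; runs with direct formatting or another style would not be affected.

```csharp
using System.Data;
using Aspose.Words;

// Hypothetical wrapper class for this sketch.
public static class ThaiMergeSketch
{
    public static void Run(DataTable table)
    {
        // "template.doc" is a placeholder for your own mail merge template.
        Document doc = new Document("template.doc");

        // Assumption: the merge fields are formatted with (or based on) Normal.
        Style normal = doc.Styles[StyleIdentifier.Normal];
        normal.Font.LocaleId = 1054; // Thai

        doc.MailMerge.Execute(table);
        doc.Save("out.doc");
    }
}
```

If the document mixes languages, the same idea would need one style per locale, as described above.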

Regards,

Hi Viktor,

Thanks for your thoughts on this. One question: would the fonts installed on the server make a difference?

I ask because I just noticed that Thai is not installed on our server, although the “Eastern Asian Languages” are installed. (We’re only experiencing this problem with Thai.)

Perhaps by installing Thai we can avoid having to set the LocaleId at all.

Michael

Hi again.

This could be the reason if you experience this issue only with Thai. I’m not absolutely certain. But installing a new language won’t break anything. You can try without any risk.

Regards,