Inserted Hebrew text comes out Backwards with Aspose.Words

ashmid · October 31, 2012, 10:45pm

We’ve recently purchased Aspose.Words for our Hebrew text project, but to our dismay we are finding that whenever we insert a string of multiple Hebrew words into a Word document, the order of the words is reversed (note that it is the order of the words that is reversed, rather than the order of the letters).

As an example, here is code to add a three-word Hebrew phrase into a comment in the document.

First I’ll show the Word interop code, which works correctly:

doc.Comments.Add(myrange, "הלכה למשה מסיני");

And here’s the Aspose code, which reverses the order:

string fullcomment = "הלכה למשה מסיני";
Aspose.Words.Comment newcomment = new Aspose.Words.Comment(asposedoc);
newcomment.Paragraphs.Add(new Aspose.Words.Paragraph(asposedoc));
newcomment.FirstParagraph.Runs.Add(new Aspose.Words.Run(asposedoc, fullcomment));
myParagraph.AppendChild(newcomment);

Note that Aspose reverses the words even when the text is added as regular text into a regular paragraph:

Aspose.Words.Node newpp = new Aspose.Words.Run(asposedoc, "הלכה למשה מסיני");
nHebpp.AppendChild(newpp);

By the way, it should be noted that in all these cases, I am inserting Hebrew strings into Left-to-Right paragraphs (a fairly standard procedure). That is, the paragraphs are not right-to-left paragraphs, but rather the strings are right-to-left strings in the middle of left-to-right paragraphs.

This is a critical problem for our usage of Aspose.Words.

Sincerely,
Dr. Avi Shmidman
Israel

awais.hafeez · November 1, 2012, 1:05pm

Hi Avi,

Thanks for your inquiry. Please try specifying the Font.Bidi and ParagraphFormat.Bidi properties as follows:

DocumentBuilder builder = new DocumentBuilder();
// Signal to Microsoft Word that this run of text contains right-to-left text.
builder.Font.Bidi = true;
builder.CurrentParagraph.ParagraphFormat.Bidi = true;

// Specify the locale so Microsoft Word recognizes this text as Hebrew - Israel.
// For the list of locale identifiers please see http://www.microsoft.com/globaldev/reference/lcid-all.mspx
builder.Font.LocaleIdBi = 1037;

// Insert some Hebrew text.
builder.Writeln("הלכה למשה מסיני");

builder.Document.Save(@"C:\Temp\out.docx");

I hope this helps.

Best Regards,

ashmid · November 1, 2012, 2:56pm

This solution is not satisfactory, because of the following:

1] I’m not actually hard coding the strings (the hard-coded string in the code was just to demonstrate the problem). The string comes from an external user database. The string that I receive could be Hebrew or English. It can also be a mixture of both. E.g., it can contain a series of English words, followed by a series of Hebrew, followed by a series of English, followed by a series of Hebrew, etc.

2] To be sure, I can write a routine to analyze each string and to break it up into separate runs at every point that the language changes; I can then set the Font.Bidi property of each run as needed. This is complicated, though, by the fact that punctuation can be either bidi or not, depending on where it occurs (punctuation between two Hebrew words is bidi; after Hebrew but before English: not bidi; between the last Hebrew character and the end of the string: not bidi; etc.). And, for now, I have indeed implemented this algorithm as a workaround.
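For readers who need the same workaround, here is a minimal sketch of the kind of run-splitting routine described above. It is not the poster's actual implementation and not part of Aspose.Words: the class name BidiRunSplitter and its Split method are hypothetical, only Hebrew is treated as right-to-left, digits are grouped with left-to-right letters, and spaces and punctuation are resolved with the three context rules listed above rather than with the full Unicode Bidi algorithm.

```csharp
using System.Collections.Generic;

// Hypothetical helper (not an Aspose.Words API): splits a string into
// direction-homogeneous pieces, assuming a left-to-right paragraph.
public static class BidiRunSplitter
{
    private enum Dir { Ltr, Rtl, Neutral }

    private static Dir Classify(char c)
    {
        // Assumption: only the Hebrew block and Hebrew presentation forms count as RTL.
        if ((c >= '\u0590' && c <= '\u05FF') || (c >= '\uFB1D' && c <= '\uFB4F'))
            return Dir.Rtl;
        // Simplification: digits are treated like LTR letters.
        if (char.IsLetterOrDigit(c))
            return Dir.Ltr;
        // Spaces, punctuation and everything else are resolved from context.
        return Dir.Neutral;
    }

    public static List<(string Text, bool IsRtl)> Split(string text)
    {
        var dirs = new Dir[text.Length];
        for (int i = 0; i < text.Length; i++)
            dirs[i] = Classify(text[i]);

        // A span of neutral characters becomes RTL only when it sits between
        // two Hebrew characters; otherwise it follows the LTR paragraph.
        int pos = 0;
        while (pos < dirs.Length)
        {
            if (dirs[pos] != Dir.Neutral) { pos++; continue; }
            int end = pos;
            while (end < dirs.Length && dirs[end] == Dir.Neutral) end++;
            bool rtlBefore = pos > 0 && dirs[pos - 1] == Dir.Rtl;
            bool rtlAfter = end < dirs.Length && dirs[end] == Dir.Rtl;
            Dir resolved = (rtlBefore && rtlAfter) ? Dir.Rtl : Dir.Ltr;
            for (int i = pos; i < end; i++) dirs[i] = resolved;
            pos = end;
        }

        // Group consecutive characters that share a direction into one piece.
        var runs = new List<(string Text, bool IsRtl)>();
        int start = 0;
        for (int i = 1; i <= text.Length; i++)
        {
            if (i == text.Length || dirs[i] != dirs[start])
            {
                runs.Add((text.Substring(start, i - start), dirs[start] == Dir.Rtl));
                start = i;
            }
        }
        return runs;
    }
}
```

Each returned piece could then be added as its own Run, with Font.Bidi set from IsRtl. A string with English before and after a Hebrew phrase yields three pieces under these rules: the leading English with its trailing space, the Hebrew phrase with its internal spaces, and the following punctuation together with the trailing English.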

However, compare this to Word’s interop module, where none of this is necessary whatsoever. If one passes an all-Hebrew string to Word, then Word properly encodes it as bidi. If one passes a string containing mixed Hebrew and English to Word, then Word properly deals with each set of Hebrew or English words as needed, and it also deals properly with the punctuation in between, given the context of each punctuation character, based on the Unicode Bidi algorithm for processing characters of weak directionality.

Essentially, Aspose is putting a burden upon each user of implementing the Unicode Bidi algorithm him-or-herself, and of using it to break up any given string into multiple runs. This is both wasteful and error-prone (because it will need to be done separately by every developer who uses Aspose for bidi text). It is a major disadvantage of Aspose as compared with Interop, since Interop already does this fully and properly itself.

I do understand that the difference between Aspose and Interop on this matter likely stems from the underlying architecture. With Interop, MS Word works behind the scenes to break up any given string into multiple runs. However, with Aspose, the user is specifying the actual detail of the encoding, down to the level of the run. So, if the user specifies a run that contains both RTL and LTR characters, then there is an inherent contradiction, because such a run cannot exist as a single unit.

Nevertheless, for such situations, I believe it behooves Aspose to provide a helper function which fully implements the Unicode Bidi algorithm (including attention to the special Unicode bidi control characters: RTL Mark, LTR Mark, RTL override, etc.), and which will take any run and break it up into a series of runs as required for proper bidi operation.

awais.hafeez · November 3, 2012, 3:19am

Hi Avi,

Thanks for the additional information. I have logged a new feature request in our bug tracking system. The issue ID is WORDSNET-7209. Our development team will further look into the details of this problem and we will keep you updated on the status of the availability of this routine. Your request has also been linked to this issue and you will be notified as soon as it is resolved. Sorry for the inconvenience.

Best Regards,

bbrother · November 15, 2012, 4:32pm

I agree with Ashmid and hereby register a vote for this fix.

awais.hafeez · November 19, 2012, 12:44am

Hi Bill,

Thanks for your request. After an initial analysis, this does look like a shortcoming of DocumentBuilder, which should probably modify the Bidi context automatically when it receives RTL text, and break mixed text into several properly attributed runs. However, in this case no extra public Unicode Bidi routine is required; this functionality will be handled transparently inside the model. We will inform you as soon as this feature is implemented.

Best Regards,

bbrother · November 18, 2015, 1:06pm

Hello Awais,

Has there been any movement on this?

Thanks

awais.hafeez · November 20, 2015, 6:07am

Hi Bill,

Thanks for your inquiry.

This problem (WORDSNET-7209) actually requires us to implement a new feature in Aspose.Words, and we regret to share with you that implementation of this feature has been postponed for now. However, a fix for this problem may come onto the product roadmap in the future. Unfortunately, we cannot currently promise a resolution date. We apologize for the inconvenience.

Best regards,

awais.hafeez · April 2, 2019, 1:37pm

A post was split to a new topic: Implement Unicode Bidi algorithm in Aspose.Words