Inconsistency of number order between PDF, DOC and TXT with RTL switching


#1

Hello,

Our customers edit Word documents, which we later save as PDF and plain text. They need to type number ranges in a specific representation, for example (0.5 - 2). Most of their text is right-to-left Hebrew. To enter a number range they type the parentheses in Hebrew, and then type the contents of the parentheses in one of two ways:

  1. Stay in Hebrew, starting the typing with the right-hand side of the range
  2. Switch to English, starting the typing with the left-hand side of the range

When displayed in Word, both ranges show the numbers in the same order. This is also the case when the document is saved as a PDF. However, this ordering is not maintained when saving the document as plain text.

I created a small test project to reproduce the problem. It also includes the test DOC file. Just click “Select Doc” and “OK” (it should default to the path of the file). The plain text will appear on the form but it will also be saved as a PDF and plain text in the folder of the executable.
TestAsposeRTLNumberRange.zip (3.7 MB)

I also tried a few encodings (UTF8, Unicode, Windows-1255, ISO-8859-8 and ISO-8859-8-i), thinking it might be a visual versus logical text-ordering issue, but there was no difference in the outcome.
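One possible reason the encoding made no difference: a character encoding normally only maps the stored (logical-order) characters to bytes and does not reorder them. The sketch below is an illustration only, not part of the attached test project. The sample string and the `EncodingRoundTrip` class are hypothetical, and it covers only UTF-8 and UTF-16, not the visual-order ISO code pages.

```csharp
using System;
using System.Text;

public static class EncodingRoundTrip
{
    // Encodes the text with the given encoding and decodes it again.
    public static string RoundTrip(string text, Encoding encoding)
    {
        byte[] bytes = encoding.GetBytes(text);
        return encoding.GetString(bytes);
    }

    public static void Main()
    {
        // Hypothetical sample: Hebrew letters followed by a number range,
        // stored in logical (typing) order.
        string sample = "\u05D8\u05D5\u05D5\u05D7 (0.5 - 2)";

        foreach (Encoding encoding in new[] { Encoding.UTF8, Encoding.Unicode })
        {
            // The decoded text is character-for-character identical to the input,
            // so the choice of encoding does not change the order of the digits.
            string decoded = RoundTrip(sample, encoding);
            Console.WriteLine(encoding.WebName + ": " + (decoded == sample));
        }
    }
}
```

If that holds for the other encodings as well, the display order is decided by the viewer's bidirectional layout and by any directional marks in the text, not by the code page.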

Our customers have a version of our product that’s distributed with Aspose Words 13.2. We currently have a license we use with the latest version of our product that comes with Aspose Words 13.9. I also tested this with the latest Aspose Words 18.1. The issue appeared on all versions.

Thanks,
Or


#2

@ors

Thanks for your inquiry. Please note that Aspose.Words mimics the behavior of MS Word.

We have converted the shared document to PDF and TXT using the latest version of Aspose.Words for .NET 18.11 and have not found the issue with output PDF. Please use Aspose.Words for .NET 18.11. We have attached the output PDF with this post for your reference. 18.11.pdf (33.6 KB)

Regarding the TXT output, could you please share your expected output here for our reference? We will then provide you more information about your query.


#3

Hello Tahir,

As I mentioned, I tested with a few versions of Aspose.Words including 18.11 (I wrote 18.1 by mistake). There is no problem with the PDF. What I meant was that the text appears in a certain way in MS Word, which we consider okay:
MS Word.png (2.0 KB)
Notice the order of the numbers. In the document I also mention exactly the order I typed the characters in.

Using Aspose Words (versions 13.2, 13.9 or 18.11), the resulting PDF maintains this order:
PDF.png (7.2 KB)

However, the plain text I generate using Aspose Words, as shown in the demo application I sent, does not keep that order:
Plain Text.png (6.7 KB)

This is inconsistent with the way it is displayed in MS Word and in the PDF. What we want is for the order of characters after the conversion to plain text to be the same as in the MS Word document and in the PDF generated from it.

We noticed the same issue with other cases. I have another example (39.1 KB) attached (with an “X” to show multiplication).

I also want to mention that I don’t think this is necessarily a Hebrew issue but more of a right-to-left language issue, so I guess it’s the same for other RTL languages.

Thanks,
Or


#4

@ors

Thanks for sharing the detail. Please note that Aspose.Words mimics the behavior of MS Word. If you convert your document to TXT file format using MS Word, you will get the same output.


#5

Though MS Word might have the same output, it is still a bug. The actual result is not the expected result, especially when it is displayed as expected in MS Word and in the resulting PDF. Is there any solution or workaround to this issue?


#6

@ors

Thanks for your inquiry. Please note that the document formats DOCX, PDF and TXT are quite different. You cannot insert tables, images, etc. in a TXT file, so the output may not be the same in these formats. Aspose.Words and MS Word save the document to TXT file format according to the text encoding and text information, e.g. the language setting. So, you are getting the expected output. Please check the attached screenshot.


#7

Hi,

I understand that the formats are different, and there are inherent limitations that prevent the use of images in plain text files. However, this is not the case for the direction and order of characters, which can be maintained. If the order of characters is not maintained, the text is formatted incorrectly. It can be done in MS Word from that very same File Conversion window: if you set it to “Add bi-directional marks” it works well:
Add Bi-Directional Marks.png (36.4 KB)

Also, I’m not sure why, but for the multiplication “X” case I mentioned we also need the “Insert line breaks” option:
Insert line breaks.png (26.6 KB)

How do we get Aspose Words to do that?

Or


#8

The TxtSaveOptions.AddBidiMarks property serves this purpose. The default value of this property is true.
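A minimal sketch of setting this property explicitly, so the result does not depend on the default value. It assumes an Aspose.Words for .NET version that exposes `TxtSaveOptions.AddBidiMarks`. The `BidiMarksExample` class and the output file name are hypothetical, and this thread does not establish that the option alone fixes the number order.

```csharp
using Aspose.Words;
using Aspose.Words.Saving;

public static class BidiMarksExample
{
    public static void Main()
    {
        // Input file name taken from this thread; adjust the path as needed.
        Document doc = new Document("HebrewDigitsOrder - Original.doc");

        // Ask the TXT exporter to emit bi-directional marks.
        TxtSaveOptions options = new TxtSaveOptions();
        options.AddBidiMarks = true;

        // Hypothetical output file name.
        doc.Save("HebrewDigitsOrder - BidiMarks.txt", options);
    }
}
```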

Could you please save your document to TXT file format using MS Word and share it here for our reference? Please share the MS Word version that you are using. Thanks for your cooperation.

Unfortunately, Aspose.Words does not support the requested feature at the moment. However, we already logged this feature request as WORDSNET-14135 in our issue tracking system. You will be notified via this forum thread once this feature is available.

We apologize for your inconvenience.


#9

Unfortunately, TxtSaveOptions.AddBidiMarks was added only in Aspose.Words 18.7 and we’re using 13.9. However, I tested this with the evaluation version of Aspose.Words 18.11 and it doesn’t work either.

I am using Microsoft Office Standard 2013.

Attached are the relevant files: HebrewDigitsOrder.zip (7.3 KB)

Thanks,
Or


#10

@ors

Thanks for sharing the detail. In your case, MS Word also does not generate the correct output. Please check the attached image of “HebrewDigitsOrder - Saved in Word 2013 with Add Bi-Di Marks.txt”.


#11

On my computer it appears as expected: Conversion with MS Word result screenshot.png (73.4 KB)

The result from Aspose.Words appears like this: Conversion with Aspose.Words result screenshot.png (41.4 KB)

Could a difference in system locale be causing a different presentation on your computer and on mine? I tried a bunch of different fonts (Tahoma, Times New Roman, David, Arial, Segoe UI,…) in Notepad and couldn’t reproduce the way your computer displays it. The closest I’ve come was using the Terminal font with the OEM/DOS script:
MS Word conversion with Terminal font screenshot.png (28.5 KB)


#12

@ors

We have logged this problem in our issue tracking system as WORDSNET-17853. You will be notified via this forum thread once this issue is resolved.

We apologize for your inconvenience.


#13

Here are some of my settings:

Let me know if you need me to provide additional information.


#14

@ors

Thanks for sharing the detail. We have logged it in our issue tracking system.


#15

@ors

Thanks for your patience. Please note that the issue you are facing is not actually a bug in Aspose.Words. Please use the following code example to get the desired output.

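using System.Text;
using Aspose.Words;
using Aspose.Words.Saving;

// MyDir is assumed to be a string holding the folder of the test document.
Document doc = new Document(MyDir + "HebrewDigitsOrder - Original.doc");

// Use these additional lines to write TXT in Hebrew encoding (Windows-1255).
// On .NET Core / .NET 5+, code page 1255 needs CodePagesEncodingProvider registered first.
TxtSaveOptions so = new TxtSaveOptions();
so.Encoding = Encoding.GetEncoding(1255);

doc.Save(MyDir + "18.12.txt", so);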

#16

Actually, I already tried that with 13.2, 13.9 and 18.11 but it didn’t work. I tried again now, just to be sure, using the evaluation of 18.11 both with and without setting AddBidiMarks = true (as you said it’s true by default) and I tried it with a bunch of different encodings (Windows-1255, Hebrew ISO-Visual: 28598, Hebrew ISO-Logical: 38598, UTF7, UTF8, UTF32, Unicode, UnicodeBigEndian) and it still didn’t work.

Even if it did work by setting encoding 1255, it would not be exactly as in MS Word, since TxtSaveOptions uses codepage 65001 (UTF8) by default.

Let me know if you need me to attach anything.


#17

@ors

Thanks for your inquiry. We have tested the scenario using the latest version of Aspose.Words for .NET 18.12 and have not found the shared issue. Please check the attached image and output document. 18.12.zip (454 Bytes)


#18

Hello,

In the image you shared there are extra marks displayed near the numbers. I guess those might be the bi-directional marks. However, those are not present on my computer when I open the output from MS Word. I also opened the text file you attached: on my computer it doesn’t display those marks as in your image, and the ordering is still incorrect:
image.png (35.3 KB)
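To check whether a text file really contains bi-directional marks, regardless of how a particular editor or font renders them, one option is to scan the text for the Unicode directional formatting characters. This is a sketch only: the `BidiMarkScanner` class and the sample string are hypothetical, and in practice the text would be read from the generated TXT file with the matching encoding.

```csharp
using System;
using System.Collections.Generic;

public static class BidiMarkScanner
{
    // Unicode directional formatting characters:
    // LRM/RLM (U+200E, U+200F), embeddings and overrides (U+202A..U+202E),
    // isolates (U+2066..U+2069) and the Arabic letter mark (U+061C).
    static bool IsBidiControl(char c)
    {
        return c == '\u200E' || c == '\u200F'
            || (c >= '\u202A' && c <= '\u202E')
            || (c >= '\u2066' && c <= '\u2069')
            || c == '\u061C';
    }

    // Returns one entry per directional formatting character found in the text.
    public static List<string> Find(string text)
    {
        var hits = new List<string>();
        for (int i = 0; i < text.Length; i++)
        {
            if (IsBidiControl(text[i]))
            {
                hits.Add($"U+{(int)text[i]:X4} at index {i}");
            }
        }
        return hits;
    }

    public static void Main()
    {
        // Hypothetical sample: a number range wrapped in LRM and RLM marks.
        string text = "(\u200E0.5 - 2\u200F)";
        foreach (string hit in Find(text))
        {
            Console.WriteLine(hit);
        }
        // Prints:
        // U+200E at index 1
        // U+200F at index 9
    }
}
```

Running the same scan over the MS Word output and the Aspose.Words output would show whether the two files differ in the marks they contain or only in how they are displayed.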


#19

@ors

Thanks for your inquiry. Please use the latest version of Aspose.Words for .NET 18.12 and generate the TXT file using the following code example at your end. Please share it here for further investigation. Thanks for your cooperation.

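using System.Text;
using Aspose.Words;
using Aspose.Words.Saving;

// MyDir is assumed to be a string holding the folder of the test document.
Document doc = new Document(MyDir + "HebrewDigitsOrder - Original.doc");

// Use these additional lines to write TXT in Hebrew encoding (Windows-1255).
// On .NET Core / .NET 5+, code page 1255 needs CodePagesEncodingProvider registered first.
TxtSaveOptions so = new TxtSaveOptions();
so.Encoding = Encoding.GetEncoding(1255);

doc.Save(MyDir + "18.12.txt", so);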

#20

Here’s the result:
18.12.zip (576 Bytes)