When we convert a PDF to DOCX using Aspose, Arabic text gets misrepresented; more precisely, the order of the words in the converted version is incorrect in several places. Here is the original PDF: Arabic_Sample.pdf (114.5 KB)
and the DOCX file converted from that PDF with the latest version of Aspose: Arabic_SampleDocx.zip (50.2 KB)
Our team has done some research on this topic, and here is a summary of our findings.
Let's look at one specific row, the highlighted one in the original attached document (Arabic_Sample.pdf):
2Pdf.png (8.5 KB)
This is what the row looks like after an Aspose conversion (PDF to DOCX):
3AsposeSave.png (7.8 KB)
I made two DOCX documents from the highlighted row (one LTR and the other RTL) and added some numbers to the row to make the order easier to see.
image-20201118-081330.png (52.2 KB)
There are small differences when comparing these documents with KDiff. One of them is interesting: in document.xml, the RTL-formatted document has one more empty tag than the LTR-formatted document. See:
784b4564-1a56-434e-85e8-9725688045b1.png (60.2 KB)
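For anyone who wants to reproduce the comparison without KDiff, here is a minimal Python sketch of the same idea. It assumes the two test files are named ltr.docx and rtl.docx (hypothetical names, not the attachments above) and relies only on the fact that a DOCX file is a ZIP archive containing word/document.xml.

```python
import difflib
import zipfile
from xml.dom import minidom


def read_document_xml(docx_path):
    """Return word/document.xml from a DOCX file as pretty-printed lines."""
    with zipfile.ZipFile(docx_path) as archive:
        raw = archive.read("word/document.xml")
    # document.xml is usually one long line; pretty-print it so a
    # line-based diff shows individual tags.
    return minidom.parseString(raw).toprettyxml(indent="  ").splitlines()


def diff_document_xml(path_a, path_b):
    """Return a unified diff of the document.xml parts of two DOCX files."""
    return list(
        difflib.unified_diff(
            read_document_xml(path_a),
            read_document_xml(path_b),
            fromfile=path_a,
            tofile=path_b,
            lineterm="",
        )
    )
```

Calling `print("\n".join(diff_document_xml("ltr.docx", "rtl.docx")))` should then list the tags that exist in only one of the two files.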
By adding this empty tag to the LTR document's document.xml, we get an RTL-formatted document. See:
image-20201118-082135.png (25.1 KB)
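The tag itself is only visible in the screenshot, so the following is a sketch under an assumption: that the empty tag is `<w:bidi/>` inside the paragraph properties (`w:pPr`), which is the WordprocessingML element that marks a paragraph as right-to-left. With that assumption, a few lines of Python can report which paragraphs of a DOCX file carry the flag; the function names are mine, for illustration only.

```python
import zipfile
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W_NS}


def paragraph_is_rtl(paragraph):
    """True if the paragraph has an enabled w:bidi in its w:pPr.

    Assumption: w:bidi is the empty tag seen in the diff.
    """
    bidi = paragraph.find("w:pPr/w:bidi", NS)
    if bidi is None:
        return False
    # An empty <w:bidi/> means "on"; an explicit w:val can switch it off.
    val = bidi.get("{%s}val" % W_NS, "true")
    return val.lower() not in ("0", "false", "off")


def rtl_flags(docx_path):
    """Return one boolean per paragraph in word/document.xml."""
    with zipfile.ZipFile(docx_path) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    return [paragraph_is_rtl(p) for p in root.iter("{%s}p" % W_NS)]
```

Running `rtl_flags` on the Aspose output and on the hand-made RTL test document would show whether the converted paragraphs lost the flag, assuming the tag really is `w:bidi`.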
Another interesting finding is how Aspose handles the LTR/RTL formatting. If we look at the word order in the test documents (LTR and RTL), both have the same order; there is no difference. See:
image-20201118-082943.png (49.1 KB)
I saved these two files as PDF with Microsoft Word and converted them with Aspose. As far as I can see, both resulting documents are LTR, and the text order is changed to follow the word order in the PDF (the text also gets split into more words/parts). See:
Could you please check this problem and give us some details about the background of the issue, just so we can tell our customers something?
Note for me: