PDF conversion to DOCX using Aspose.PDF - Arabic text does not work

sandor.kolumban · December 10, 2020, 1:47pm

When we convert a PDF to DOCX using Aspose, Arabic text gets misrepresented; more precisely, the order of the words in the converted version is incorrect in several places. Here is the original PDF, Arabic_Sample.pdf (114.5 KB), and the DOCX file converted from it with the latest version of Aspose, Arabic_SampleDocx.zip (50.2 KB).
Our team has done some research on this topic, and here is a summary of our findings.

Consider one specific row, the highlighted one, from the original attached document (Arabic_Sample.pdf):

2Pdf.png (8.5 KB)

This is how the row looks after an Aspose conversion (PDF to DOCX):

3AsposeSave.png (7.8 KB)

I made two DOCX documents from the highlighted row (one LTR and the other RTL) and added some numbers to the row to make the order easier to see.

image-20201118-081330.png (52.2 KB)

There are small differences when comparing these documents with KDiff. One of them is interesting: in document.xml, the RTL formatted document has one more empty tag than the LTR formatted document. See:

784b4564-1a56-434e-85e8-9725688045b1.png (60.2 KB)
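Since the screenshot is not reproduced here, the following is only a sketch of how the same check could be scripted instead of diffing by hand. It assumes the empty tag in question is `<w:bidi/>`, the WordprocessingML paragraph property that marks a paragraph as right-to-left; that is an assumption, not something the attachment confirms. The function and helper names are illustrative.

```python
import zipfile
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W_NS}


def read_document_xml(docx_path):
    """Return the raw bytes of word/document.xml from a .docx (a ZIP archive)."""
    with zipfile.ZipFile(docx_path) as archive:
        return archive.read("word/document.xml")


def paragraph_bidi_flags(document_xml):
    """For each w:p, report whether its w:pPr switches w:bidi on.

    Assumption: w:bidi is the tag that differs between the two documents.
    A w:bidi element whose w:val is 0/false/off counts as "not RTL".
    """
    root = ET.fromstring(document_xml)
    flags = []
    for para in root.iter(f"{{{W_NS}}}p"):
        bidi = para.find("w:pPr/w:bidi", NS)
        if bidi is None:
            flags.append(False)
        else:
            value = bidi.get(f"{{{W_NS}}}val", "true")
            flags.append(value not in ("0", "false", "off"))
    return flags
```

Running `paragraph_bidi_flags(read_document_xml(path))` on the LTR and RTL test files (paths are up to the reader) should show at a glance which paragraphs carry the flag.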

By adding this empty tag to the LTR document's document.xml, we get an RTL formatted document. See: image-20201118-082135.png (25.1 KB)
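The manual edit described above could be sketched roughly as follows. This assumes the tag is `<w:bidi/>` inside each paragraph's `w:pPr`, which the thread does not state explicitly, and `mark_paragraphs_rtl` is a hypothetical helper name. It is meant for experimenting on small XML fragments: a real document.xml declares many namespaces, and `xml.etree.ElementTree` may rename their prefixes on output, so a dedicated DOCX library is probably safer for real files.

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
ET.register_namespace("w", W_NS)  # keep the conventional "w:" prefix on output


def w(tag):
    """Qualify a local tag name with the WordprocessingML namespace."""
    return f"{{{W_NS}}}{tag}"


# Children of w:pPr that the schema orders after w:bidi; the new element
# is inserted before the first of these so the sequence stays valid.
AFTER_BIDI = {w(name) for name in (
    "adjustRightInd", "snapToGrid", "spacing", "ind", "contextualSpacing",
    "mirrorIndents", "suppressOverlap", "jc", "textDirection",
    "textAlignment", "textboxTightWrap", "outlineLvl", "divId",
    "cnfStyle", "rPr", "sectPr", "pPrChange",
)}


def mark_paragraphs_rtl(document_xml):
    """Return document XML in which every paragraph has a w:bidi element."""
    root = ET.fromstring(document_xml)
    for para in list(root.iter(w("p"))):
        ppr = para.find(w("pPr"))
        if ppr is None:
            ppr = ET.Element(w("pPr"))
            para.insert(0, ppr)  # w:pPr has to be the first child of w:p
        if ppr.find(w("bidi")) is not None:
            continue  # an existing w:bidi (even one with w:val="0") is left alone
        index = next(
            (i for i, child in enumerate(ppr) if child.tag in AFTER_BIDI),
            len(ppr),
        )
        ppr.insert(index, ET.Element(w("bidi")))
    return ET.tostring(root, encoding="unicode")
```

Note that this only changes the paragraph direction flag; it does not reorder any text, so it would not by itself repair the word order problem described in this post.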

Another interesting finding is how Aspose handles the LTR/RTL formatting. If we look at the word order in the test documents (LTR and RTL), both have the same order; there is no difference. See:
image-20201118-082943.png (49.1 KB)

I saved these two files as PDF with Microsoft Word and converted them with Aspose. As far as I can see, both converted documents are LTR, and the text order is changed to follow the word order in the PDF (the text also gets split into more words/parts). See:

image-20201118-084240.png (58.1 KB)
image-20201118-084554.png (53.5 KB)
image-20201118-084912.png (36.2 KB)
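One way to spot converted paragraphs that probably should be RTL is to look at the first strongly directional character of their text, which is the default rule the Unicode Bidirectional Algorithm uses to pick a paragraph's base direction. The sketch below is a simplified version of that rule (it ignores directional isolates and embeddings) and says nothing about how Aspose decides direction internally; `base_direction` is an illustrative name.

```python
import unicodedata


def base_direction(text):
    """Guess the base direction of a paragraph from its first strong character.

    Arabic letters have bidi class "AL" and Hebrew letters "R"; Latin
    letters have class "L". Digits, spaces and punctuation are not strong
    and are skipped.
    """
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class in ("R", "AL"):
            return "rtl"
        if bidi_class == "L":
            return "ltr"
    return "neutral"  # no strong character found, e.g. digits only
```

A paragraph whose text yields `"rtl"` here but whose formatting is LTR in the converted DOCX would be a candidate for the behaviour described above, including rows like the test row that mix Arabic words with numbers.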

Could you please check this problem and give us some detail about the background of the issue, so that we have something to tell our customers?

Thank you,
Kolumbán Sándor

Note for me:
BUG-4193

asad.ali · December 10, 2020, 8:30pm

@sandor.kolumban

Could you please also share on which platform you are using the API, e.g. .NET or Java? We need to investigate this whole scenario in more detail. We will log an investigation ticket for it and share it with you.

sandor.kolumban · December 10, 2020, 8:45pm

@asad.ali I tried it with the latest .NET API. Thanks for looking into it.

asad.ali · December 11, 2020, 10:49pm

@sandor.kolumban

We have logged an investigation ticket as PDFNET-49171 in our issue tracking system. We will investigate your scenario in detail and keep you posted on the status of the ticket's resolution. Please be patient and give us some time.

We are sorry for the inconvenience.

memoq · January 10, 2023, 10:06am

Hi, any news regarding this?

asad.ali · January 10, 2023, 6:21pm

@memoq

The issue is currently under investigation. As soon as the investigation is complete, we will share news about its resolution or a fix ETA with you. Please give us a little time.

We are sorry for the inconvenience.