Hebrew text conversion in pdf->docx behaves strangely

memoq · September 10, 2017, 4:44pm

Dear Aspose team,

We have a single page pdf in Hebrew that we convert to docx. In the resulting docx file the order of words is not as it should be. The reason is that the words get the ltr flag in the converted docx file, but the separating spaces between them don’t. This makes their appearance quite erratic, especially if periods are also involved.

I uploaded a pack of sample files here: pack-heb.zip (219.0 KB)

The file page1.pdf is the pdf that we convert. The page1.docx is the file that comes out after the conversion. The page1-unitedmanuel.docx is the one that we think should come out.

I only focused on the sentences on the bottom of the pdf, with the text: “שירות לקוחות. מידע טכני”.

Please let us know if we can add anything more to help you.

Best regards,

Gergely Vándor

imran.rafique · September 11, 2017, 3:41am

@gergelyv,

We have compared both of your Word documents and could not find the difference. Kindly review and highlight the problematic area with the help of a screenshot. We will investigate and share our findings with you.

memoq · September 23, 2017, 3:06pm

@imran.rafique

Thanks for looking into this and sorry for not being more specific. I am attaching an image where I put all three documents on one screen, took a screenshot and marked the interesting part with a red frame. markedlocations.png (145.9 KB)

The only visible difference on the images is that the ‘.’ is not in the right location. This is because in the docx file created by Aspose (page1.docx) every word is a separate run with ltr turned on, but the separating spaces are in separate runs with no ltr flag (this is visible if you look into the document.xml inside the file).
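For anyone who wants to check this without opening document.xml by hand, here is a minimal Python sketch that lists the runs and reports the whitespace-only ones that lack the direction flag. It is only an illustration under assumptions: the flag is taken to be the `w:rtl` run property (the name is a parameter, in case the flag in your file is a different element), and `SAMPLE` is a made-up fragment, not the real content of page1.docx. The helper names are hypothetical.

```python
import zipfile
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}

# Made-up fragment mimicking the structure described above (not taken from page1.docx).
SAMPLE = (
    f'<w:document xmlns:w="{W}"><w:body><w:p>'
    '<w:r><w:rPr><w:rtl/></w:rPr><w:t>שירות</w:t></w:r>'
    '<w:r><w:t xml:space="preserve"> </w:t></w:r>'
    '<w:r><w:rPr><w:rtl/></w:rPr><w:t>לקוחות.</w:t></w:r>'
    '</w:p></w:body></w:document>'
)

def read_document_xml(docx_path):
    """Return the raw bytes of word/document.xml from a .docx file."""
    with zipfile.ZipFile(docx_path) as archive:
        return archive.read("word/document.xml")

def list_runs(document_xml, flag="rtl"):
    """Return (text, has_flag) for every run, in document order."""
    root = ET.fromstring(document_xml)
    result = []
    for run in root.iter(f"{{{W}}}r"):
        text = "".join(t.text or "" for t in run.findall("w:t", NS))
        # Assumption: the direction flag is a child of the run's w:rPr element.
        has_flag = run.find(f"w:rPr/w:{flag}", NS) is not None
        result.append((text, has_flag))
    return result

def unflagged_separators(document_xml, flag="rtl"):
    """Return the indexes of whitespace-only runs that do not carry the flag."""
    return [
        index
        for index, (text, has_flag) in enumerate(list_runs(document_xml, flag))
        if text and text.isspace() and not has_flag
    ]

print(unflagged_separators(SAMPLE))  # [1]
```

For a real file, `unflagged_separators(read_document_xml("page1.docx"))` would give the same kind of report.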

The other file (page1-unitedmanuel.docx) is a file that I created by manually modifying page1.docx. In that file the full text, spaces and words together, is in a single text run with ltr turned on. This allows Word to properly calculate the location of the neutral characters in the ltr run.

Can you proceed with this additional information?

imran.rafique · September 23, 2017, 4:57pm

@gergelyv,
We have tested your source PDF with the latest version 17.8 of Aspose.Pdf for Java API and managed to replicate the said problem in our environment. It has been logged under the ticket ID PDFJAVA-37106 in our bug tracking system. We have linked your post to this ticket and will keep you informed regarding any available updates. However, it is not reproducible with the latest version 17.9 of Aspose.Pdf for .NET API.

codewarior · October 4, 2017, 5:03pm

@gergelyv,

Thanks for your patience.

We are pleased to share that the issue PDFJAVA-37106 reported earlier is resolved in the latest release of Aspose.Pdf for Java 17.9.

Please try using the latest release version and in case you face any issue, please feel free to contact us.

memoq · October 14, 2017, 11:21am

@codewarior Thanks, we will try it.

Just to make sure, is the fix available in the .Net version as well? Because we are not using Java.

Best regards,
Gergely Vándor

codewarior · October 15, 2017, 11:44am

@gergelyv,

Thanks for contacting support.

I have again tested the scenario with Aspose.Pdf for .NET 17.10 and have managed to reproduce the same issue. For the sake of correction, I have logged it as PDFNET-43509 in our issue tracking system. We will further look into the details of this problem and will keep you updated on the status of correction. We are sorry for this inconvenience.

memoq · October 16, 2017, 10:47am

@codewarior
No problem. Thanks for looking into it. At least this way both the Java and the .Net versions will be fixed.

codewarior · October 17, 2017, 2:54pm

@gergelyv,

Since the development team has managed to figure out the reasons behind this problem in the Java version, we hope they will be able to fix it soon. As soon as we have some definite updates, we will let you know.