We have a single-page PDF in Hebrew that we convert to DOCX. In the resulting DOCX file the words are not in the correct order. The reason is that the words get the ltr flag in the converted DOCX file, but the separating spaces between them don’t. This makes their appearance quite erratic, especially if periods are also involved.
I uploaded a pack of sample files here: pack-heb.zip (219.0 KB)
The file page1.pdf is the pdf that we convert. The page1.docx is the file that comes out after the conversion. The page1-unitedmanuel.docx is the one that we think should come out.
I only focused on the sentences at the bottom of the PDF, with the text: “שירות לקוחות. מידע טכני”.
Please let us know if we can add anything more to help you.
We have compared both of your Word documents and could not find the difference. Kindly review and highlight the problematic area with the help of a screenshot. We will investigate and share our findings with you.
Thanks for looking into this and sorry for not being more specific. I am attaching an image where I put all three documents on one screen, took a screenshot and marked the interesting part with a red frame. markedlocations.png (145.9 KB)
The only visible difference in the images is that the ‘.’ is not in the right location. This is because in the docx file created by Aspose (page1.docx) every word is a separate run with ltr turned on, but the separating spaces are in separate runs with no ltr flag (this is visible if you look into the document.xml inside the file).
The other file (page1-unitedmanuel.docx) is a file that I created by manually modifying page1.docx. In that file the full text, spaces and words together, is in a single text run with ltr turned on. This allows Word to properly calculate the location of the neutral characters in the ltr run.
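For anyone who wants to check the run layout described above without opening the XML by hand, here is a minimal Python sketch. It assumes the direction flag the post refers to is the WordprocessingML run property `<w:rtl/>` (the element that marks a right-to-left run), and the function name `find_unflagged_space_runs` and the `SAMPLE` fragment are made up for illustration; they are not taken from the attached files.

```python
import xml.etree.ElementTree as ET

# WordprocessingML main namespace used in word/document.xml.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


def has_rtl_flag(run):
    """True if the run carries <w:rtl/> and the flag is not switched off."""
    rtl = run.find("w:rPr/w:rtl", NS)
    return rtl is not None and rtl.get(f"{{{W}}}val", "true") not in ("0", "false", "off")


def find_unflagged_space_runs(document_xml):
    """Return (run_index, text) for whitespace-only runs without the flag.

    Accepts the content of word/document.xml as str or bytes.
    """
    root = ET.fromstring(document_xml)
    hits = []
    for index, run in enumerate(root.iter(f"{{{W}}}r")):
        text = "".join(t.text or "" for t in run.findall("w:t", NS))
        if text and text.isspace() and not has_rtl_flag(run):
            hits.append((index, text))
    return hits


# Hypothetical fragment shaped like the layout described in the post:
# flagged word runs separated by an unflagged space run.
SAMPLE = (
    f'<w:document xmlns:w="{W}"><w:body><w:p>'
    '<w:r><w:rPr><w:rtl/></w:rPr><w:t>שירות</w:t></w:r>'
    '<w:r><w:t xml:space="preserve"> </w:t></w:r>'
    '<w:r><w:rPr><w:rtl/></w:rPr><w:t>לקוחות.</w:t></w:r>'
    '</w:p></w:body></w:document>'
)

if __name__ == "__main__":
    print(find_unflagged_space_runs(SAMPLE))
```

To run it against a real file, the XML can be read with `zipfile.ZipFile("page1.docx").read("word/document.xml")` and passed to the same function.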
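The manual edit described above can be approximated in code. The following is only a rough Python sketch of that idea, not what Aspose does or will do: it assumes the flag is the `<w:rtl/>` run property, that the merged runs share the same formatting, and that each run contains nothing but `w:rPr` and `w:t`. The names `merge_rtl_runs` and `SAMPLE` are hypothetical.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}
P, R, RPR, T = (f"{{{W}}}{name}" for name in ("p", "r", "rPr", "t"))
XML_SPACE = "{http://www.w3.org/XML/1998/namespace}space"


def run_text(run):
    """Concatenate the text of all <w:t> children of a run."""
    return "".join(t.text or "" for t in run.findall("w:t", NS))


def is_rtl(run):
    """True if the run carries <w:rtl/> and the flag is not switched off."""
    rtl = run.find("w:rPr/w:rtl", NS)
    return rtl is not None and rtl.get(f"{{{W}}}val", "true") not in ("0", "false", "off")


def is_plain(run):
    """True if the run holds only properties and text (no breaks, tabs, ...)."""
    return all(child.tag in (RPR, T) for child in run)


def merge_rtl_runs(document_xml):
    """Fold flagged runs and the whitespace-only runs after them into one run.

    Assumption: the merged runs have identical formatting, so keeping only
    the first run's <w:rPr> loses nothing. Returns the parsed root element.
    """
    root = ET.fromstring(document_xml)
    for para in root.iter(P):
        anchor_t = None  # last <w:t> of the run currently absorbing followers
        for child in list(para):
            if child.tag != R or not is_plain(child):
                anchor_t = None
                continue
            text = run_text(child)
            if anchor_t is not None and (is_rtl(child) or (text and text.isspace())):
                anchor_t.text = (anchor_t.text or "") + text
                anchor_t.set(XML_SPACE, "preserve")  # keep the merged spaces
                para.remove(child)
            elif is_rtl(child) and child.find("w:t", NS) is not None:
                anchor_t = child.findall("w:t", NS)[-1]
            else:
                anchor_t = None
    return root


# Hypothetical fragment: flagged word runs separated by an unflagged space run.
SAMPLE = (
    f'<w:document xmlns:w="{W}"><w:body><w:p>'
    '<w:r><w:rPr><w:rtl/></w:rPr><w:t>שירות</w:t></w:r>'
    '<w:r><w:t xml:space="preserve"> </w:t></w:r>'
    '<w:r><w:rPr><w:rtl/></w:rPr><w:t>לקוחות.</w:t></w:r>'
    '</w:p></w:body></w:document>'
)

if __name__ == "__main__":
    merged = merge_rtl_runs(SAMPLE)
    print([run_text(r) for r in merged.iter(R)])
```

One caveat: writing the result back with `ElementTree` rewrites namespace prefixes, which can upset a real document.xml that uses many namespaces, so a namespace-preserving XML library would be a safer choice for actual files.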
We have tested your source PDF with the latest version 17.8 of Aspose.Pdf for Java API and managed to replicate the said problem in our environment. It has been logged under the ticket ID PDFJAVA-37106 in our bug tracking system. We have linked your post to this ticket and will keep you informed regarding any available updates. However, it is not reproducible with the latest version 17.9 of Aspose.Pdf for .NET API.
I have again tested the scenario with Aspose.Pdf for .NET 17.10 and have managed to reproduce the same issue. For the sake of correction, I have logged it as PDFNET-43509 in our issue tracking system. We will further look into the details of this problem and will keep you updated on the status of correction. We are sorry for this inconvenience.
As the development team has managed to figure out the reasons behind this problem in the Java version, we hope they will be able to fix it soon. As soon as we have some definite updates, we will let you know.