How can I get an RTL text string into the correct order after it replaces an LTR text string?
You’ll find “こんにちは、世界。” in helloworld_ja.pdf (51.3 KB)
I expect the result in helloworld_jaar.pdf (54.1 KB) to look like “مرحبا بالعالم”
But it actually looks like “م ر ح ب ا ب ا ل ع ا ل م”
Please advise how I can get the result I want, with the RTL text reading from right to left. Is there a configuration option for this?
I tested several tips found in this forum, but none of them seems to work.
Please check the following code to reproduce the issue.
Environment: .NET Core 2.1 / Aspose.PDF 19.6.0
Document doc = new Document("helloworld_ja.pdf");
TextFragmentAbsorber textFragmentAbsorber = new TextFragmentAbsorber(".+");
textFragmentAbsorber.TextSearchOptions = new TextSearchOptions(true);
textFragmentAbsorber.TextReplaceOptions.ReplaceAdjustmentAction = TextReplaceOptions.ReplaceAdjustment.WholeWordsHyphenation;
doc.Pages.Accept(textFragmentAbsorber);
TextFragmentCollection textFragmentCollection = textFragmentAbsorber.TextFragments;
foreach (TextFragment textFragment in textFragmentCollection)
{
    foreach (TextSegment textSegment in textFragment.Segments)
    {
        prevSegment.Text = "مرحبا بالعالم";
        prevSegment.TextEditOptions = new TextEditOptions(TextEditOptions.LanguageTransformation.ExactlyAsISee);
        //prevSegment.TextEditOptions.AllowLanguageTransformation = false;
        //prevSegment.TextEditOptions.LanguageTransformationBehavior = TextEditOptions.LanguageTransformation.ExactlyAsISee;
    } // Loop for TextSegment
} // Loop for TextFragment
doc.Save("helloworld_jaar.pdf");
We tried to execute your code snippet and found that the definition of prevSegment is missing from it. Would you please share where this object comes from, or a complete executable code snippet, so that we can test the scenario accordingly?
We have tested the scenario in our environment using Aspose.PDF for .NET 19.8 and obtained a different output PDF document from the one you shared with us. The result in the attached PDF file looks better, as you can check: helloworld_jaar.pdf (77.6 KB)
Please try the latest version of the API, and if you still face any issue, feel free to let us know.
Question: Why does hello_1.pdf (Arabic string for “hello”) get a reversed string, in contrast to hello_2.pdf (Arabic string for “hello.”) and hello_3.pdf (Arabic string for “hello world”)?
Is there any way to prevent it from being reversed?
string[,] testset = new string[3, 2] {
    { "مرحبا", "hello_1.pdf" },
    { "مرحبا.", "hello_2.pdf" },
    { "مرحبا بالعالم", "hello_3.pdf" }
};
for (int i = 0; i < testset.GetLength(0); i++)
{
    Document doc = new Document("hello.pdf");
    TextFragmentAbsorber textFragmentAbsorber = new TextFragmentAbsorber(".+");
    textFragmentAbsorber.TextSearchOptions = new TextSearchOptions(true);
    textFragmentAbsorber.TextReplaceOptions.ReplaceAdjustmentAction = TextReplaceOptions.ReplaceAdjustment.WholeWordsHyphenation;
    doc.Pages.Accept(textFragmentAbsorber);
    TextFragmentCollection textFragmentCollection = textFragmentAbsorber.TextFragments;
    foreach (TextFragment textFragment in textFragmentCollection)
    {
        foreach (TextSegment textSegment in textFragment.Segments)
        {
            textSegment.Text = testset[i, 0];
            textSegment.TextEditOptions = new TextEditOptions(TextEditOptions.LanguageTransformation.ExactlyAsISee);
            break;
        }
    }
    doc.Save(testset[i, 1]);
    doc.Dispose();
}
We were able to reproduce the issue you mentioned and have logged it as PDFNET-46912 in our issue tracking system for further investigation. We will look into the details of this scenario and keep you posted on the status of the ticket. Please be patient and allow us some time.
If you are facing a similar issue with other RTL languages, please share a code snippet or sample files with us. We will also log tickets for the specific language characters so that every aspect of the RTL issues can be investigated and addressed.
It seems I have the same problem as the one mentioned in this post.
I can’t access PDFNET-46912, so I don’t know what its latest status is or whether the problem will be solved in a future version. @asad.ali can you please update this forum post with the current status?
Attached is a sample Word file with Arabic text and the result of HTML conversion: doc_arabic.docx (12.9 KB) doc_arabic.html.pdf (356.3 KB)
The HTML is full of DIV elements with dir=ltr.
The extracted text is in right-to-left order, so searching (or highlighting) misses the text in the HTML.
For example, in Word you can find “صخري”, but not in the HTML, where it is “ﺻﺨﺮﻱ”.
It may look the same, but the direction is not. Paste it into a text editor and you’ll see.
Also, open the HTML in something other than a browser, because a browser is smart enough to handle the direction.
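To make the difference visible without relying on an editor, here is a minimal sketch that only uses the .NET base class library (nothing from Aspose). The class and method names are my own and purely illustrative. It dumps the stored characters of a string and folds Arabic presentation forms back to the base letters with Unicode NFKC normalization, which may help when comparing the Word text with the HTML text.

```csharp
using System;
using System.Linq;
using System.Text;

// Hypothetical helper, not part of any Aspose API.
public static class ArabicTextDiagnostics
{
    // Lists every UTF-16 code unit as U+XXXX in the order it is stored,
    // so the real character order can be read regardless of how a viewer renders it.
    public static string DumpCodePoints(string text)
    {
        return string.Join(" ", text.Select(c => "U+" + ((int)c).ToString("X4")));
    }

    // NFKC normalization replaces Arabic presentation forms (U+FB50..U+FDFF, U+FE70..U+FEFF)
    // with their base letters. It does not change the order of the characters.
    public static string FoldPresentationForms(string text)
    {
        return text.Normalize(NormalizationForm.FormKC);
    }
}
```

Calling ArabicTextDiagnostics.DumpCodePoints on both strings shows whether they differ in the characters used, in their order, or both. If the only difference is the presentation forms, comparing FoldPresentationForms(htmlText) with the Word text should match. If the HTML text is also stored in visual (reversed) order, normalization alone will not fix that.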
You can check the issue status at the bottom of this forum thread. image.png (3.3 KB)
Thank you for sharing the Word and HTML files. Can you please also share the sample code snippet that you are using for the conversion? Regarding the logged ticket, we are afraid that it has not been resolved yet. We will post an update in this forum thread as soon as it is fixed.
We port every change and fix to the equivalent version of Aspose.PDF for Java once it is released. You can try using the latest version and let us know if you still notice any issues.