Replacing LTR Text with RTL Text Results in RTL Text in LTR Order

KDSDEV · August 26, 2019, 6:40am

How can I put an RTL text string in the right order after it replaces an LTR text string?

You’ll find “こんにちは、世界。” in helloworld_ja.pdf (51.3 KB)
I expect the result in helloworld_jaar.pdf (54.1 KB) should be like “مرحبا بالعالم”
But it actually is like “م ر ح ب ا ب ا ل ع ا ل م”

Please kindly advise me how to get the result I want, where the RTL text reads from right to left. Is there a configuration option for this?
I tested several tips found in this forum, but none of them seems to work.

Please check the following code to reproduce the issue.

.NET Core 2.1 / Aspose.PDF 19.6.0

Document doc = new Document("helloworld_ja.pdf");
TextFragmentAbsorber textFragmentAbsorber = new TextFragmentAbsorber(".+");
textFragmentAbsorber.TextSearchOptions = new TextSearchOptions(true);
textFragmentAbsorber.TextReplaceOptions.ReplaceAdjustmentAction = TextReplaceOptions.ReplaceAdjustment.WholeWordsHyphenation;
doc.Pages.Accept(textFragmentAbsorber);
TextFragmentCollection textFragmentCollection = textFragmentAbsorber.TextFragments;
foreach (TextFragment textFragment in textFragmentCollection)
{
    foreach (TextSegment textSegment in textFragment.Segments)
    {
        prevSegment.Text = "مرحبا بالعالم";
        prevSegment.TextEditOptions = new TextEditOptions(TextEditOptions.LanguageTransformation.ExactlyAsISee);
        //prevSegment.TextEditOptions.AllowLanguageTransformation = false;
        //prevSegment.TextEditOptions.LanguageTransformationBehavior = TextEditOptions.LanguageTransformation.ExactlyAsISee;
    }// Loop for TextSegment
    
}// Loop for TextFragment
doc.Save("helloworld_jaar.pdf");

asad.ali · August 26, 2019, 7:15pm

@KDSSHO

Thanks for contacting support.

We tried to execute your code snippet and found that the definition of prevSegment is missing from it. Would you please share the origin of this object, or a complete executable code snippet, so that we can test the scenario accordingly?

KDSDEV · August 27, 2019, 4:16am

Sorry, here you are.

Document doc = new Document("helloworld_ja.pdf");
TextFragmentAbsorber textFragmentAbsorber = new TextFragmentAbsorber(".+");
textFragmentAbsorber.TextSearchOptions = new TextSearchOptions(true);
textFragmentAbsorber.TextReplaceOptions.ReplaceAdjustmentAction = TextReplaceOptions.ReplaceAdjustment.WholeWordsHyphenation;
doc.Pages.Accept(textFragmentAbsorber);
TextFragmentCollection textFragmentCollection = textFragmentAbsorber.TextFragments;
foreach (TextFragment textFragment in textFragmentCollection)
{
    foreach (TextSegment textSegment in textFragment.Segments)
    {
        textSegment.Text = "مرحبا بالعالم";
        textSegment.TextEditOptions = new TextEditOptions(TextEditOptions.LanguageTransformation.ExactlyAsISee);
        //textSegment.TextEditOptions.AllowLanguageTransformation = false;
        //textSegment.TextEditOptions.LanguageTransformationBehavior = TextEditOptions.LanguageTransformation.ExactlyAsISee;
    }// Loop for TextSegment
    
}// Loop for TextFragment
doc.Save("helloworld_jaar.pdf");

asad.ali · August 27, 2019, 3:54pm

@KDSSHO

Thanks for sharing the complete code snippet.

We tested the scenario in our environment using Aspose.PDF for .NET 19.8 and received a different output PDF than the one you shared with us. The result in the attached PDF file seems better, as you can check.
helloworld_jaar.pdf (77.6 KB)

Please try the latest version of the API, and if you still face any issue, feel free to let us know.

KDSDEV · August 28, 2019, 6:37am

Thank you for your confirmation.
The files I sent did not exactly describe the point I wanted to ask about. My apologies.

Please find attached one input file and three output files.
hello.pdf (48.8 KB)
hello_1.pdf (75.0 KB)
hello_2.pdf (74.3 KB)
hello_3.pdf (75.0 KB)

Question: Why does hello_1.pdf (Arabic string for “hello”) come out reversed, in contrast to
hello_2.pdf (Arabic string for “hello.”) and hello_3.pdf (Arabic string for “hello world”)?
Is there any way to prevent it from being reversed?

string[,] testset = new string[3, 2] {
	{ "مرحبا", "hello_1.pdf" },
	{ "مرحبا.", "hello_2.pdf" },
	{ "مرحبا بالعالم", "hello_3.pdf" }
};

for (int i = 0; i < testset.GetLength(0); i++)
{
	Document doc = new Document("hello.pdf");
	TextFragmentAbsorber textFragmentAbsorber = new TextFragmentAbsorber(".+");
	textFragmentAbsorber.TextSearchOptions = new TextSearchOptions(true);
	textFragmentAbsorber.TextReplaceOptions.ReplaceAdjustmentAction = TextReplaceOptions.ReplaceAdjustment.WholeWordsHyphenation;
	doc.Pages.Accept(textFragmentAbsorber);
	TextFragmentCollection textFragmentCollection = textFragmentAbsorber.TextFragments;

	foreach (TextFragment textFragment in textFragmentCollection)
	{
		foreach (TextSegment textSegment in textFragment.Segments)
		{
			textSegment.Text = testset[i, 0];
			textSegment.TextEditOptions = new TextEditOptions(TextEditOptions.LanguageTransformation.ExactlyAsISee);
			break;
		}
	}

	doc.Save(testset[i,1]);
	doc.Dispose();
}

asad.ali · August 28, 2019, 5:55pm

@KDSSHO

We were able to reproduce the issue you mentioned and have logged it as PDFNET-46912 in our issue tracking system for further investigation. We will look into the details of this scenario and keep you posted on the status of the ticket. Please be patient and spare us a little time.

We are sorry for the inconvenience.

KDSDEV · August 29, 2019, 6:44am

Can I expect that other RTL languages such as Persian/Farsi, Hebrew, and Urdu, and not only Arabic, will be covered when ticket PDFNET-46912 is resolved?

asad.ali · August 29, 2019, 10:40am

@KDSSHO

If you are facing a similar issue with these languages, please share a code snippet or sample files with us. We will also log tickets for the specific language characters so that every aspect of the RTL language issues can be investigated and addressed.

jmau2002 · December 2, 2022, 7:54am

Hi,

It seems I have the same problem as the one mentioned in this post.

I can’t access PDFNET-46912, so I don’t know what its latest status is or whether the problem will be solved in a future version.
@asad.ali can you please update this forum post with the current status?

Attached are a sample Word file with Arabic text and the result of the HTML conversion:
doc_arabic.docx (12.9 KB)
doc_arabic.html.pdf (356.3 KB)

The HTML is full of DIV elements with dir=ltr.
The extracted text is in right-to-left order, so searching (or highlighting) misses the text in the HTML.

For example, in Word you can find “صخري”, but not in the HTML. In the HTML it is “ﺻﺨﺮﻱ”.
It may look the same, but the direction is not. Put it in a text editor and you’ll see.
Also, open the HTML in something other than a browser, because a browser is smart enough to handle the direction.
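
One way to see why the two strings above look alike but do not match in a search is to compare their code points. The sketch below is only an illustration and is not part of the Aspose API: it assumes the HTML string is made of Arabic presentation forms (U+FEBB, U+FEA8, U+FEAE, U+FEF1) while the Word string uses the base letters (U+0635, U+062E, U+0631, U+064A), and it folds the former into the latter with .NET's built-in NFKC normalization. The class and method names (ArabicFold, ToBaseLetters) are hypothetical. Normalization does not reorder characters, so it would not help if the extracted text is also stored in visual order, and NFKC also folds other compatibility characters (for example full-width Latin letters).

```csharp
using System;
using System.Text;

// Hypothetical helper for illustration only; not part of Aspose.PDF.
public static class ArabicFold
{
    // NFKC replaces Arabic presentation forms (contextual glyph variants)
    // with their base letters. It does not change character order.
    public static string ToBaseLetters(string text)
    {
        return text.Normalize(NormalizationForm.FormKC);
    }

    public static void Main()
    {
        // Assumed code points of the string found in the HTML (presentation forms).
        string fromHtml = "\uFEBB\uFEA8\uFEAE\uFEF1";
        // Code points of the same word typed as base letters, as in Word.
        string fromWord = "\u0635\u062E\u0631\u064A";

        // The raw strings differ, so a plain search misses the HTML text.
        Console.WriteLine(fromHtml == fromWord);                // False
        // After folding, the two strings compare equal.
        Console.WriteLine(ToBaseLetters(fromHtml) == fromWord); // True
    }
}
```

If this assumption about the code points holds, normalizing both the search term and the extracted text before comparing them may be a usable workaround on the search side until the conversion output itself is fixed.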

asad.ali · December 2, 2022, 6:02pm

@jmau2002

You can check the issue status at the bottom of this forum thread. image.png (3.3 KB)

Since you have shared the Word and HTML files, can you please also share the sample code snippet that you are using for the conversion? Regarding the logged ticket, we are afraid that it has not yet been resolved. We will post an update in this forum thread as soon as it is fixed.

aspose.notifier · December 19, 2022, 9:36pm

The issues you have found earlier (filed as PDFNET-46912) have been fixed in Aspose.PDF for .NET 22.12.

jmau2002 · June 8, 2023, 7:07pm

Will it also be fixed in other versions, like Aspose.PDF for Java?

asad.ali · June 8, 2023, 11:59pm

@jmau2002

We port every change and fix to the equivalent version of Aspose.PDF for Java once it is released. You can try the latest version and let us know if you still notice any issues.