Mixed Arabic and English text extraction

karrimrabi · February 15, 2016, 5:56am

Hello

We are using Aspose.Pdf 11.3 and are attempting to extract text from PDF files that sometimes mix languages in the same sentence. In our test example, a single line of text contains both English and Arabic. The problem is that the Arabic is mirrored in the extraction and ends up altered. Are there any settings in the TextAbsorber that would give us the raw text without any processing?

Here is our example code and I have attached a sample pdf to this post.

// open document
var pdfDoc = new Aspose.Pdf.Document(@"D:\temp\Arabic\mixedarabictest.pdf");

// create TextAbsorber object to extract text
var absorber = new Aspose.Pdf.Text.TextAbsorber();
absorber.TextSearchOptions.LimitToPageBounds = true;

// accept the absorber for first page
pdfDoc.Pages[1].Accept(absorber);

// get the extracted text
string extractedText = absorber.Text;

Thank you in advance

Karrim

tilal.ahmad · February 16, 2016, 1:12am

Hi Karrim,

Thanks for your inquiry. I have tested the scenario using Aspose.Pdf for .NET 11.3.0 and was unable to notice any mirrored or altered Arabic text. Please use the latest version of Aspose.Pdf for .NET; it will resolve the issue. However, if the issue persists, please share some more details about it, i.e. the output, so we can guide you accordingly.

We are sorry for the inconvenience caused.

Best Regards,

karrimrabi · February 16, 2016, 3:55am

Thank you for trying, but we are already using the latest Aspose.Pdf 11.3. The text in the PDF is:

قيمة English word الخاصة الفوائد

Extracted text:

الخاصة الفوائد English word قيمة

Here you can see that the Arabic words that were on the right of the English words are now on the left, and vice versa.

Original text inside the PDF, as UTF-8 encoded bytes:
D9-82-D9-8A-D9-85-D8-A9-20-45-6E-67-6C-69-73-68-20-77-6F-72-64-20-D8-A7-D9-84-D8-AE-D8-A7-D8-B5-D8-A9-20-D8-A7-D9-84-D9-81-D9-88-D8-A7-D8-A6-D8-AF

Extracted text, as UTF-8 encoded bytes:
D8-A7-D9-84-D8-AE-D8-A7-D8-B5-D8-A9-20-D8-A7-D9-84-D9-81-D9-88-D8-A7-D8-A6-D8-AF-20-45-6E-67-6C-69-73-68-20-77-6F-72-64-20-D9-82-D9-8A-D9-85-D8-A9

Here you can clearly see that the word “D9-82-D9-8A-D9-85-D8-A9” is at the beginning of the text inside the PDF but at the end of the extracted bytes.
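
For anyone who wants to reproduce the comparison, a byte dump in the format above can be produced with a minimal sketch like the one below. It uses only standard .NET calls (Encoding.UTF8.GetBytes and BitConverter.ToString); the Utf8Dump class and ToHex helper are illustrative names and are not part of Aspose.Pdf. In practice you would pass absorber.Text to the helper.

```csharp
using System;
using System.Text;

public static class Utf8Dump
{
    // Returns the UTF-8 bytes of a string as hyphen-separated uppercase hex.
    // (Hypothetical helper, not an Aspose.Pdf API.)
    public static string ToHex(string text)
    {
        return BitConverter.ToString(Encoding.UTF8.GetBytes(text));
    }

    public static void Main()
    {
        // The first Arabic word of the sample line, written with \u escapes
        // so the source-file encoding cannot change the characters.
        string firstWord = "\u0642\u064A\u0645\u0629";

        Console.WriteLine(ToHex(firstWord));       // D9-82-D9-8A-D9-85-D8-A9
        Console.WriteLine(ToHex("English word"));  // 45-6E-67-6C-69-73-68-20-77-6F-72-64
    }
}
```

Comparing the dump of the extracted string with the dump of the expected string shows where each word ends up in the byte sequence, independent of how a viewer or editor chooses to display right-to-left text.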

tilal.ahmad · February 17, 2016, 2:13am

Hi Karrim,

Thanks for sharing the additional information. I have noticed the reported issue and logged a ticket PDFNEWNET-40273 in our issue tracking system for further investigation and rectification. We will keep you updated about the issue resolution progress.

We are sorry for the inconvenience caused.

Best Regards,

aspose.notifier · February 7, 2019, 6:00pm

The issues you have found earlier (filed as ) have been fixed in this update. This message was posted using BugNotificationTool from Downloads module by MuzammilKhan