We are using Aspose.Pdf 11.3 and are attempting to extract text from PDF files that sometimes have a mixture of languages in the same sentence. In our test example we are using a line of text that has both English and Arabic. The problem is that the Arabic is being mirrored in the extraction and is getting altered. Is there any setting in the TextAbsorber that gives us the raw text without any processing?
Here is our example code, and I have attached a sample PDF to this post.
// open document
var pdfDoc = new Aspose.Pdf.Document(@"D:\temp\Arabic\mixedarabictest.pdf");
// create TextAbsorber object to extract text
var absorber = new Aspose.Pdf.Text.TextAbsorber();
absorber.TextSearchOptions.LimitToPageBounds = true;
// accept the absorber for first page
pdfDoc.Pages[1].Accept(absorber);
// get the extracted text
string extractedText = absorber.Text;
Thank you in advance
Thank you for trying, but using the latest Aspose.Pdf 11.3 the problem remains. The first line below is the text in the PDF; the second is the extracted text:
قيمة English word الخاصة الفوائد
الخاصة الفوائد English word قيمة
Here you can see that the Arabic words that were on the right of the English words are now on the left, and vice versa.
original text inside the PDF UTF8 encoded bytes
extracted text UTF8 encoded bytes
Here you can clearly see that the word “D9-82-D9-8A-D9-85-D8-A9” (قيمة) starts at the beginning of the text inside the PDF but is at the end of the extracted bytes.
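For anyone who wants to reproduce the byte comparison, a hyphen-separated hex dump like the one quoted above can be produced with a small helper along these lines. This is only a minimal sketch using the standard .NET `BitConverter` and `Encoding.UTF8` classes; the class name `Utf8Dump` and the method `ToHex` are hypothetical names chosen for illustration and are not part of Aspose.Pdf.

```csharp
using System;
using System.Text;

// Hypothetical helper (not part of Aspose.Pdf): dumps a string's UTF-8 bytes as hex.
public static class Utf8Dump
{
    // Encodes the text as UTF-8 and returns the bytes as
    // uppercase, hyphen-separated hex pairs (for example "41-42" for "AB").
    public static string ToHex(string text)
    {
        byte[] bytes = Encoding.UTF8.GetBytes(text);
        return BitConverter.ToString(bytes);
    }
}
```

Calling `Utf8Dump.ToHex(extractedText)` on the absorber output and on the expected string lets the two byte sequences be compared in logical (storage) order, independent of how a viewer or console renders right-to-left text. For the Arabic word in question, `Utf8Dump.ToHex("\u0642\u064A\u0645\u0629")` returns `D9-82-D9-8A-D9-85-D8-A9`.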