Can I extract Hebrew correctly using ExtractText?

thefor · June 18, 2011, 6:47pm

Hi,

I’m using the demo version to test the different features. However, I also need the extraction to work for Hebrew, and the extracted text comes out reversed…

Can you point me to a solution (I saw IsBidi, but it’s readonly)?

Thanks,

Uri.

shahzadlatif · June 20, 2011, 3:45am

Hi,

Thank you very much for considering Aspose.

Please share the input PDF file with us, so we could investigate the issue at our end. You’ll be updated with the results accordingly. Also, please confirm that you’re using the .NET version for evaluation purposes.

We’re sorry for the inconvenience and look forward to helping you out.
Regards,

urisim · June 20, 2011, 1:51pm

Hi,

We are using the .NET version for evaluation (currently the PDF library and after that we’ll check the WORD library as well).

The code snippet is:

//Instantiate PdfExtractor object
PdfExtractor extractor = new PdfExtractor();

//Set Password for input PDF file
extractor.Password = "";
extractor.ExtractTextMode = 1;

//Bind the input PDF document to extractor
extractor.BindPdf(inputFile1);

//Extract text from the input PDF document
extractor.ExtractText();

//Save the extracted text to a text file
extractor.GetText(inputFile1 + ".txt");

A sample file is attached.

I have two issues with the result:

1. The Hebrew is inverted (see original PDF and result TXT).

2. The numbers (years in the CV) are wrong after conversion.

Looking forward to your reply.

Thanks,

Uri.
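For readers who hit the same inverted-Hebrew symptom, here is a minimal post-processing sketch in plain .NET. It is not part of the Aspose API and it is not the fix that was later shipped. It assumes the extracted text came out in visual (display) order, one line at a time, which is only a guess about the cause here. The names VisualOrderFix and ReverseLineKeepingNumbers are hypothetical.

```csharp
using System;

public static class VisualOrderFix
{
    // Assumption: 'line' holds right-to-left text stored in visual order,
    // i.e. every character is in the left-to-right order it has on the page.
    public static string ReverseLineKeepingNumbers(string line)
    {
        // Step 1: reverse the whole line to get the Hebrew letters
        // back into reading (logical) order.
        char[] chars = line.ToCharArray();
        Array.Reverse(chars);

        // Step 2: digit runs (for example years) were reversed as well,
        // so flip each run of consecutive digits back.
        int i = 0;
        while (i < chars.Length)
        {
            if (char.IsDigit(chars[i]))
            {
                int start = i;
                while (i < chars.Length && char.IsDigit(chars[i]))
                {
                    i++;
                }
                Array.Reverse(chars, start, i - start);
            }
            else
            {
                i++;
            }
        }
        return new string(chars);
    }
}
```

Calling VisualOrderFix.ReverseLineKeepingNumbers on each line of the output file would be the intended use. The sketch is deliberately crude: it does not handle Latin words, date ranges such as 2005-2008, or Hebrew vowel points, and it will not help if the wrong years come from the font encoding rather than from character order.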

shahzadlatif · June 21, 2011, 12:03pm

Hi Uri,

Thank you very much for sharing the sample PDF and the code snippet. We’ll investigate this issue at our end and you’ll be updated shortly.

We’re sorry for the inconvenience.
Regards,

shahzadlatif · June 22, 2011, 8:38am

Hi Uri,

I have also noticed this problem at my end and logged it as PDFKITNET-28621 in our issue tracking system. Our team will look into this issue and you’ll be updated via this forum thread once it is resolved.

We’re sorry for the inconvenience.
Regards,

urisim · June 27, 2011, 10:43am

Any news?

I need to prepare an estimate for an upcoming project in which I must support text extraction, so I want to be certain that your package provides this functionality.

I’d appreciate a quick response.

Thanks.

shahzadlatif · June 28, 2011, 7:19am

Hi Uri,

I’m sorry to inform you that this issue is not yet resolved. However, I have asked our team to share the ETA of this issue. Please spare us some time for the investigation. You’ll be updated via this forum thread as soon as the response is received.

We’re sorry for the inconvenience.
Regards,

urisim · June 28, 2011, 7:55am

Hi,

Yesterday I had the same problem using pdflib TET.

I found that it happened when the dates were BOLD, while REGULAR text was OK.

Is there a way to remove styles from the document before extracting the text? (I need plain text)

Thanks,

Uri.

shahzadlatif · June 29, 2011, 5:35am

Hi Uri,

I’m afraid it is not feasible to remove the styles from the PDF. However, as you know, we have already logged this issue, and if it is caused by the styles, it will be handled and resolved accordingly.

If you have any further questions, please do let us know.
Regards,

aspose.notifier · November 5, 2011, 8:34am

The issues you have found earlier (filed as 28621) have been fixed in this update.


drormu · October 24, 2018, 5:28am

Hi,
Was any solution found?
RTL text is really problematic in PDF-to-text libraries.

asad.ali · October 24, 2018, 12:00pm

@drormu

Thanks for your inquiry.

The issue related to inverted Hebrew text was resolved. Please try the latest version, i.e. Aspose.PDF for .NET 18.10, and if you still face any issue, please share your sample PDF document along with a sample code snippet. We will test the scenario in our environment and address it accordingly.
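
When re-testing with a newer version, a quick automated check on the output can help decide whether the Hebrew is still inverted. The following is a rough heuristic sketch in plain .NET, not an Aspose feature. It relies on the fact that the final-form letters (ך ם ן ף ץ) normally appear only at the end of a word, so seeing them mostly at the start of words suggests reversed text. The names HebrewOrderCheck and LooksReversed are hypothetical.

```csharp
using System;

public static class HebrewOrderCheck
{
    // Final-form Hebrew letters: final kaf, mem, nun, pe and tsadi.
    private const string FinalForms = "\u05DA\u05DD\u05DF\u05E3\u05E5";

    // Hebrew letters alef..tav in the Unicode Hebrew block.
    private static bool IsHebrewLetter(char c)
    {
        return c >= '\u05D0' && c <= '\u05EA';
    }

    // Heuristic: returns true when more final-form letters open a word
    // than close one, which hints that the text is stored reversed.
    public static bool LooksReversed(string text)
    {
        int atStart = 0;
        int atEnd = 0;
        for (int i = 0; i < text.Length; i++)
        {
            if (FinalForms.IndexOf(text[i]) < 0)
            {
                continue;
            }
            bool letterBefore = i > 0 && IsHebrewLetter(text[i - 1]);
            bool letterAfter = i + 1 < text.Length && IsHebrewLetter(text[i + 1]);
            if (!letterBefore && letterAfter) atStart++;
            if (letterBefore && !letterAfter) atEnd++;
        }
        return atStart > atEnd;
    }
}
```

Text without any final-form letters always yields false, so this check can only flag a likely inversion. It cannot confirm that the extraction is correct.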