Can I extract Hebrew correctly using ExtractText?

Hi,


I’m using the demo version to test the different features. However, I also need to extract Hebrew text, and the extracted text comes out reversed.
Can you point me to a solution? (I saw IsBidi, but it’s read-only.)

Thanks,
Uri.

Hi,

Thank you very much for considering Aspose.

Please share the input PDF file with us so that we can investigate the issue at our end. We’ll update you with the results. Also, please confirm that you’re using the .NET version for evaluation purposes.

We’re sorry for the inconvenience and look forward to helping you.
Regards,

Hi,


We are using the .NET version for evaluation (currently the PDF library; after that we’ll check the Word library as well).

The code snippet is:
//Instantiate PdfExtractor object
PdfExtractor extractor = new PdfExtractor();

//Set Password for input PDF file
extractor.Password = "";

extractor.ExtractTextMode = 1;

//Bind the input PDF document to extractor
extractor.BindPdf(inputFile1);

//Extract text from the input PDF document
extractor.ExtractText();

//Save the extracted text to a text file
extractor.GetText(inputFile1 + ".txt");

A sample file is attached.

I have two issues with the result:
1. The Hebrew is inverted (see original PDF and result TXT).
2. The numbers (years in the CV) are wrong after conversion.

Looking forward to your reply.

Thanks,
Uri.


Hi Uri,

Thank you very much for sharing the sample PDF and the code snippet. We’ll investigate this issue at our end and you’ll be updated shortly.

We’re sorry for the inconvenience.
Regards,

Hi Uri,

I have also noticed this problem at my end and logged it as PDFKITNET-28621 in our issue tracking system. Our team will look into this issue and you’ll be updated via this forum thread once it is resolved.

We’re sorry for the inconvenience.
Regards,

Any news?


I need to prepare an estimate for an upcoming project that requires text extraction, so I want to be certain that your package supports this functionality.

I’d appreciate a quick response.

Thanks.

Hi Uri,

I’m sorry to inform you that this issue is not yet resolved. However, I have asked our team to share the ETA of this issue. Please spare us some time for the investigation. You’ll be updated via this forum thread as soon as the response is received.

We’re sorry for the inconvenience.
Regards,

Hi,


Yesterday I had the same problem using PDFlib TET.
I found that it happened when the dates were in bold, while regular text was OK.
Is there a way to remove styles from the document before extracting the text? (I need plain text.)

Thanks,
Uri.

Hi Uri,

I’m afraid it is not feasible to remove the styles from the PDF. However, as you know, we have already logged this issue; if it is caused by the styles, it will be handled and resolved accordingly.

If you have any further questions, please let us know.
Regards,

The issues you have found earlier (filed as 28621) have been fixed in this update.


This message was posted using Notification2Forum from Downloads module by aspose.notifier.

Hi
Has any solution been found?
RTL text is really problematic in PDF-to-text libraries.

@drormu

Thanks for your inquiry.

The issue related to inverted Hebrew text was resolved. Please try the latest version, i.e. Aspose.PDF for .NET 18.10. If you still face any issue, please share your sample PDF document along with a sample code snippet. We will test the scenario in our environment and address it accordingly.
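
For anyone re-testing on a recent release, a minimal sketch of text extraction follows. It assumes the Document and TextAbsorber classes of Aspose.PDF for .NET rather than the PdfExtractor facade used earlier in this thread. The thread does not say which API the fix was verified against, and the file paths are placeholders.

```csharp
using System.IO;
using System.Text;
using Aspose.Pdf;
using Aspose.Pdf.Text;

class ExtractHebrewText
{
    static void Main()
    {
        // Placeholder paths (assumption); replace with your own files.
        string inputFile = "input.pdf";
        string outputFile = inputFile + ".txt";

        // Load the PDF and collect the text of all pages.
        Document pdfDocument = new Document(inputFile);
        TextAbsorber absorber = new TextAbsorber();
        pdfDocument.Pages.Accept(absorber);

        // Write as UTF-8 so the Hebrew characters are preserved in the output file.
        File.WriteAllText(outputFile, absorber.Text, Encoding.UTF8);
    }
}
```

Comparing the resulting text file against the original PDF, for both the Hebrew word order and the year values, is a quick way to check whether the two issues reported above still reproduce.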