Text order is reversed when extracting text from a PDF

lucy.hq · August 17, 2021, 10:03am

Hi,

I am trying to extract text from a PDF, but when the PDF contains Arabic, the extracted text comes out in reversed order. I'm using version 21.7 and JDK 14.
Here is my code:

    // Open the source PDF
    BufferedInputStream bis = new BufferedInputStream(new FileInputStream("C:\\Users\\xxx\\Desktop\\lucy-test-1.pdf"));
    Document document = new Document(bis);
    PdfExtractor ext = new PdfExtractor();
    ext.setExtractTextMode(1);
    ext.bindPdf(document);
    // Extract the text as UTF-8 and write it to a file
    ext.extractText(StandardCharsets.UTF_8);
    ext.getText(new FileOutputStream("C:\\Users\\xxx\\Desktop\\lucy-test-1.txt"));

Attached are my PDF and the resulting text file:
extract text from pdf.zip (158.3 KB)

asad.ali · August 17, 2021, 5:53pm

@lucy.hq

We have logged this issue as PDFJAVA-40785 in our issue tracking system. We will look into the details and keep you posted on the status of the fix. Please bear with us in the meantime.

We are sorry for the inconvenience.

lucy.hq · August 19, 2021, 8:49am

Is there any update on this ticket, or a plan for fixing it?

asad.ali · August 19, 2021, 5:52pm

@lucy.hq

The ticket was logged recently and is pending review. Please note that issues are resolved on a first-come, first-served basis under the normal/free support model. We will investigate and fix the issue, and we will let you know as soon as the ticket is resolved. Please give us some time.

We are sorry for the inconvenience.
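
Until the ticket is resolved, one possible stopgap is to post-process the extracted text. The sketch below is not a fix for PDFJAVA-40785 and uses no PDF library calls. The class `ReversedRtlFix` and its methods are hypothetical names made up for this illustration. It assumes that every line containing Arabic comes out reversed as a whole, so it reverses such lines back. Numbers or Latin words embedded in an Arabic line would be reversed along with it, so check the result against your own files before relying on it.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for post-processing extracted text; not part of any PDF library.
final class ReversedRtlFix {

    // Returns true if the line contains at least one right-to-left character
    // (for example Arabic or Hebrew letters).
    static boolean containsRtl(String line) {
        return line.codePoints().anyMatch(cp -> {
            byte d = Character.getDirectionality(cp);
            return d == Character.DIRECTIONALITY_RIGHT_TO_LEFT
                || d == Character.DIRECTIONALITY_RIGHT_TO_LEFT_ARABIC;
        });
    }

    // Reverses a line only when it contains right-to-left characters.
    // Assumption: the whole line was emitted in reversed order.
    static String fixLine(String line) {
        return containsRtl(line) ? new StringBuilder(line).reverse().toString() : line;
    }

    // Applies fixLine to every line and returns a new list.
    static List<String> fixLines(List<String> lines) {
        List<String> out = new ArrayList<>(lines.size());
        for (String l : lines) {
            out.add(fixLine(l));
        }
        return out;
    }

    public static void main(String[] args) {
        String logical = "\u0645\u0631\u062D\u0628\u0627"; // an Arabic word in logical order
        // Reverse it to mimic the reported extraction output.
        String reversed = new StringBuilder(logical).reverse().toString();
        System.out.println(fixLine(reversed).equals(logical)); // prints true
        System.out.println(fixLine("plain text"));             // prints plain text
    }
}
```

To use it, read the extracted text file line by line as UTF-8, pass the lines through `fixLines`, and write the result back out.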