Incorrect word order when receiving text from MarkupParagraph

carambis · August 20, 2021, 3:19pm

I use the Aspose.PDF library for Java, version 21.3.

I have a problem with incorrect word order when I try to get the text from a MarkupParagraph (see the attached screenshot).
image.png (17.5 KB)

When I open the file in a viewer (Acrobat Reader), everything looks OK. Please see the attached file.
page_745_issue.pdf (57.2 KB)
image.png (87.2 KB)

Example code:

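// Classes used here come from the com.aspose.pdf package
// Load the PDF from memory and collect its paragraph markup
var doc = new Document(docBytes);
var paragraphAbsorber = new ParagraphAbsorber();
paragraphAbsorber.visit(doc);

// Walk pages -> sections -> paragraphs and print each paragraph's text
for (PageMarkup markup : paragraphAbsorber.getPageMarkups()) {
    for (MarkupSection section : markup.getSections()) {
        for (MarkupParagraph paragraph : section.getParagraphs()) {
            // The words in this text come back in the wrong order
            String text = paragraph.getText();
            System.out.println(text);
        }
    }
}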

mudassir.fayyaz · August 20, 2021, 8:22pm

@carambis

Please try the code below, taken from the "Extract Text from PDF using Java" article, and let us know whether it works for you.

// open document
Document pdfDocument = new Document("page_745_issue.pdf");
// text file in which extracted text will be saved
java.io.OutputStream text_stream = new java.io.FileOutputStream("ExtractedText.txt", false);

// iterate through all the pages of PDF file
for (Page page : (Iterable<Page>) pdfDocument.getPages()) {
    // create text device
    TextDevice textDevice = new TextDevice();
    // set text extraction options: the text extraction mode is Raw or Pure
    TextExtractionOptions textExtOptions = new TextExtractionOptions(TextExtractionOptions.TextFormattingMode.Pure);
    textDevice.setExtractionOptions(textExtOptions);
    // get the text from pages of PDF and save it to OutputStream object
    textDevice.process(page, text_stream);
}
// close stream object
text_stream.close();

carambis · August 23, 2021, 8:04am

@mudassir.fayyaz Thanks for your response.
Your code works for me and the word order is correct.
Also, if I use TextAbsorber instead of ParagraphAbsorber, it works correctly.
But I want to use ParagraphAbsorber.
ParagraphAbsorber is not working correctly at the moment. Should I consider this a bug in the Aspose.PDF library?

mudassir.fayyaz · August 23, 2021, 1:00pm

@carambis

It’s good to know that the suggested option works on your end. For ParagraphAbsorber, a ticket with ID PDFJAVA-40807 has been created in our issue tracking system so that we can investigate the issue further. This thread has been linked to the ticket so that you will be notified once the issue is fixed.